Robots.txt: how to exclude sub-directories correctly?

fablau · Dec 13, 2013, 3:42 PM

Hello here,

I am trying to figure out the correct way to tell SEs to crawl this:

http://www.mysite.com/directory/

But not this:

http://www.mysite.com/directory/sub-directory/

or this:

http://www.mysite.com/directory/sub-directory2/sub-directory/...

But since I have thousands of sub-directories in almost infinite combinations, I can't list definitions like the following in a manageable way:

disallow: /directory/sub-directory/

disallow: /directory/sub-directory2/

disallow: /directory/sub-directory/sub-directory/

disallow: /directory/sub-directory2/subdirectory/

etc...

I would end up having thousands of definitions to disallow all the possible sub-directory combinations.

So, is the following a correct and shorter way to define what I want above?

allow: /directory/$

disallow: /directory/*

Would the above work?

Any thoughts are very welcome! Thank you in advance.

Best,

Fab.

MickEdwards · Nov 10, 2017, 5:46 AM

I mentioned both. You add a meta robots to noindex and remove from the sitemap.

sjunaidali · Nov 10, 2017, 5:13 AM

But Google is still free to index a link/page even if it is not included in the XML sitemap.

MickEdwards · Nov 9, 2017, 12:34 PM

Install the Yoast WordPress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.

sjunaidali · Nov 9, 2017, 11:54 AM

I am using WordPress with the Enfold theme (ThemeForest).

I want some files to be accessible to Google, but they should not be indexed.

Here is an example: http://prntscr.com/h8918o

I have currently blocked some JS directories/files using robots.txt (check screenshot)

But due to this I am not able to pass Mobile Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)

Is it possible to allow access but use a tag like noindex in the robots.txt file? Or is there any other way?

fablau · Apr 11, 2019, 3:24 PM

Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:

allow: /directory/$

disallow: /directory/*

Which allows this URL:

http://www.mysite.com/directory/

But doesn't allow the following one:

http://www.mysite.com/directory/sub-directory2/...

This page also gives an update similar to mine:

https://support.google.com/webmasters/answer/156449?hl=en

I think I am good! Thanks
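
For anyone who wants to check this outside Webmaster Tools, here is a rough Python sketch of how I understand Google's documented matching: `*` matches any run of characters, a trailing `$` anchors the end of the URL path, the longest matching pattern wins, and on a tie the less restrictive rule (allow) wins. The names `to_regex` and `is_allowed` are made up for this example, and other crawlers may resolve conflicts differently, so treat it as an illustration rather than a reference implementation.

```python
import re

def to_regex(pattern):
    """Translate a robots.txt path pattern into a regex (hypothetical helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text, then let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(rules, path):
    """rules is a list of (directive, pattern) pairs for one user-agent group.

    Assumed precedence: longest matching pattern wins; allow wins a tie.
    A path that matches no rule is allowed.
    """
    best = None
    for directive, pattern in rules:
        if to_regex(pattern).match(path):
            key = (len(pattern), directive == "allow")
            if best is None or key > best:
                best = key
    return True if best is None else best[1]

rules = [
    ("allow", "/directory/$"),
    ("disallow", "/directory/*"),
]

for path in ["/directory/", "/directory/sub-directory2/", "/other/"]:
    print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

Both patterns are 12 characters long, so for `/directory/` the two rules tie and the allow rule decides; for anything deeper only the disallow rule matches.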

fablau · Dec 16, 2013, 3:46 PM

Thank you Michael, it is my understanding then that my idea of doing this:

allow: /directory/$

disallow: /directory/*

Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.

In the meantime, if anyone else has more ideas about all this and can confirm, that would be great!

Thank you again.

MickEdwards · Dec 16, 2013, 7:26 PM

I've always stuck to Disallow and followed -

"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"

http://www.robotstxt.org/robotstxt.html

From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory

| /* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |

I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt

fablau · Dec 13, 2013, 7:05 PM

Thank you Michael,

Google and other SEs actually recognize the "allow:" command:

https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

The fact is: if I don't specify that, how can I be sure that the following single command:

disallow: /directory/*

Doesn't also stop SEs from spidering the /directory/ index itself, which I do want crawled?
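
To make the concern concrete, here is a small Python sketch of wildcard matching as Google's documentation describes it (patterns match from the start of the path, `*` matches zero or more characters, a trailing `$` anchors the end). The helper name `matches` is invented for illustration, and crawlers that follow only the original robotstxt.org rules may not treat `*` or `$` specially at all.

```python
import re

def matches(pattern, path):
    """Return True if a robots.txt path pattern matches the URL path (hypothetical helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text, then let each '*' match zero or more characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # re.match anchors at the start, so an unanchored pattern behaves as a prefix.
    return re.match(body + (r"\Z" if anchored else ""), path) is not None

for path in ["/directory/", "/directory/sub-directory/", "/directory/page.html"]:
    print(
        path,
        matches("/directory/*", path),   # wildcard form
        matches("/directory/", path),    # plain prefix form
        matches("/directory/$", path),   # end-anchored form
    )
```

Under these assumptions `/directory/*` behaves exactly like `/directory/`, and both match the `/directory/` index itself, so the single disallow line on its own would block the index as well. Only the `$` form singles out the index URL.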

MickEdwards · Dec 13, 2013, 4:59 PM

As long as you don't have directories somewhere in /* that you want indexed, I think that will work. There is no allow, so you don't need the first line, just:

disallow: /directory/*

You can test it here: https://support.google.com/webmasters/answer/156449?rd=1
