Allow or Disallow First in Robots.txt

irvingw

If I want to override a Disallow directive in robots.txt with an Allow command, do I put the Allow command before or after the Disallow command?

example:

Allow: /models/ford///page*

Disallow: /models////page

Net66SEO

Just caught this a bit late, and it's probably too late to add anything, but my two pence: test it in Webmaster Tools via Crawl -> robots.txt Tester. If you've not used this before, simply add the URL you want to test and Google highlights the directive that allows or disallows it.

fablau

Thank you Cyrus. Yes, I have tried the robots.txt checker you suggested and, although it validates the file, it shows me a couple of warnings about the "unusual" use of wildcards. My understanding is that I would probably need to discuss all this with the Google folks directly.

Thank you for your answer... and yes, Keri, I know this is an old thread, but it's still useful today!

Thanks

Cyrus-Shepard

Can't say with 100% confidence, but it sounds like it might work. You could always upload it to a server and use a robots.txt checker to validate it, although validator tools sometimes handle edge cases like this slightly differently, which can make their results moot.

KeriMorgret

Just a quick note, this question is actually from spring of 2012.

fablau

What about something like:

allow: /directory/$

disallow: /directory/*

Where I want this to be indexed:

http://www.mysite.com/directory/

But not this:

http://www.mysite.com/directory/sub-directory/

Ideas?
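For anyone who wants to see which of the two patterns above would match which URL, here is a minimal Python sketch. It assumes Google-style pattern semantics ('*' matches any run of characters, a trailing '$' anchors the end of the path); the helper names robots_pattern_to_regex and matches are made up for this illustration and are not part of any crawler's API.

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: Google-style semantics, where '*' matches any run of
    characters, a trailing '$' anchors the end of the path, and
    everything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """Return True if the robots.txt pattern matches the URL path."""
    return robots_pattern_to_regex(pattern).match(path) is not None

for path in ("/directory/", "/directory/sub-directory/"):
    print(path,
          "| allow rule matches:", matches("/directory/$", path),
          "| disallow rule matches:", matches("/directory/*", path))
```

Under these assumptions the Allow pattern matches only /directory/ itself, while the Disallow pattern matches both URLs. So /directory/ is matched by both rules, and whether it ends up crawlable depends on how a given crawler resolves that conflict, which is worth confirming in a tester.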

irvingw

I really appreciate all the effort you put in to ensure your method was correct. Many thanks.

Cyrus-Shepard

Interesting question - I've had this discussion a couple of times with different SEOs. Here's my best understanding: There are actually 2 different answers - one if you are talking about Google, and one for every other search engine.

For most search engines, the "Allow" should come first. This is because the first matching pattern always wins, for the reasons Geoff stated.

But Google is different. They state:

"At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."

Robots.txt Specifications - Webmasters — Google Developers

So for Google, order is not important, only the specificity of the rule based on the length of the entry. But the order of precedence for rules with wildcards is undefined.

This last part is important, because your directives contain wildcards. If I'm reading this right, your particular directives:

Allow: /models/ford///page*

Disallow: /models////page

So if it's "undefined", which directive will Google follow if order isn't important? Fortunately, there's a simple way to find out.

Google Webmaster Tools allows you to test any robots.txt file. I created a dummy file based on your rules. In this case, your directives worked perfectly no matter what order I put them in.

| http://cyrusshepard.com/models/ford/test/test/pages | Allowed by line 2: Allow: /models/ford///page* | Allowed by line 2: Allow: /models/ford///page* |
| http://cyrusshepard.com/models/chevy/test/test/pages | Blocked by line 3: Disallow: /models////page | Blocked by line 3: Disallow: /models////page |

So, to summarize:

1. Always put Allow directives first, as most search engines follow the "first rule counts" rule.
2. Google doesn't care about order, but rather the specificity based on the length of the entry.
3. The order of precedence for rules with wildcards is undefined.
4. When in doubt, check your robots.txt file in Google Webmaster Tools.

Hope this helps.

(Sorry for the very long answer, which basically says you were right all along.)
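To make the difference between the two behaviours concrete, here is a rough Python sketch of "first matching rule wins" next to "longest matching path wins". It is only an illustration under stated assumptions: the rules are hypothetical wildcard-free prefixes (not the wildcard rules from the question), the function names are invented, and breaking a tie in favour of Allow is my own assumption rather than something the quoted text specifies.

```python
from typing import List, Tuple

# (directive, path prefix); hypothetical wildcard-free rules for illustration.
Rule = Tuple[str, str]

def first_match_allows(rules: List[Rule], path: str) -> bool:
    """'First matching rule wins': scan rules in file order."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched, so crawling is allowed by default

def longest_match_allows(rules: List[Rule], path: str) -> bool:
    """'Longest matching path wins': file order is ignored."""
    candidates = [(len(prefix), directive == "allow")
                  for directive, prefix in rules
                  if path.startswith(prefix)]
    if not candidates:
        return True  # no rule matched, so crawling is allowed by default
    # Longest prefix wins; on equal length, allow wins (an assumption here).
    return max(candidates)[1]

disallow_first: List[Rule] = [("disallow", "/models/"), ("allow", "/models/ford/")]
allow_first: List[Rule] = list(reversed(disallow_first))

for label, rules in (("disallow first", disallow_first), ("allow first", allow_first)):
    print(label,
          "| first-match:", first_match_allows(rules, "/models/ford/page1"),
          "| longest-match:", longest_match_allows(rules, "/models/ford/page1"))
```

With the Disallow listed first, the first-match function blocks /models/ford/page1 while the longest-match function allows it. With the Allow listed first, both agree, which is the practical argument for putting Allow directives first.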

NakulGoyal

I understand your concern. I am basing my answer on the fact that if you don't have a robots.txt at all, Google will still crawl you, which means it's allow by default. So all that matters, in my opinion, is the Disallow. But because you need an Allow exception to the wildcard Disallow, you could put the Allow first and the Disallow next.

Honestly, I don't think it matters. Think about the way a bot works: it doesn't read one line of robots.txt, go crawling, then come back and read the next line, and so on. Does that make sense? It reads all the lines in robots.txt and then follows the directives. But to be sure, you can try both orderings and see for yourself. I am sure the results would be the same either way.

zigojacko

The allow directives need to come before the disallow directives for the same directory/file paths. (I have never personally tested this although it makes logical sense to instruct a robot to access one particular path within a directory structure before it sees that it is blocked from crawling that directory).

For example:-

Allow: /profiles

Disallow: /s2/profiles/me

Allow: /s2/profiles

Allow: /s2/photos

Allow: /s2/static

Disallow: /s2

As per how Google have formatted their robots.txt.
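One way to check the ordering claim against a real first-match parser is Python's standard-library urllib.robotparser, which applies rules in file order and does plain prefix matching. The sketch below is only that, a sketch: the is_allowed helper, the "ExampleBot" agent name, and the example.com URL are placeholders I made up, and the rules are two lines borrowed from the example above.

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_lines, url, agent="ExampleBot"):
    """Parse robots.txt lines and ask whether `agent` may fetch `url`.

    'ExampleBot' is a placeholder user-agent name for this sketch.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

# The same two rules in both possible orders.
allow_first = ["User-agent: *", "Allow: /s2/profiles", "Disallow: /s2"]
disallow_first = ["User-agent: *", "Disallow: /s2", "Allow: /s2/profiles"]

url = "http://www.example.com/s2/profiles/123"  # placeholder URL
print("allow first:   ", is_allowed(allow_first, url))
print("disallow first:", is_allowed(disallow_first, url))
```

With this parser the URL is fetchable when the Allow line comes first and blocked when the Disallow line comes first. That says nothing about how any particular search engine behaves, but it does show that at least one widely used parser is order-sensitive.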

irvingw

Thanks. I want to make sure I get this right in a syntax universally understood by all engines. I have seen webmasters all over the place on this one, with some saying that crawlers use the first matching rule and others saying that crawlers use the last matching rule. I am almost thinking of having the Allow command twice, before and after, to cover all bases.

NakulGoyal

I don't think it matters, but I think I would disallow first, because by default everything is an Allow.
