Robots.txt wildcards - the devs had a disagreement - which is correct?

McTaggart

Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”

The second developer suggested that this wildcard would only block URLs featuring a ? that comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 BUT argued that this robots.txt directive would not block URLs featuring a ? in subdirectories - e.g. /shirts/blue?mprice=100&maxp=20

So which of the developers is correct?

Beyond that, I assumed that the ? should feature a * on each side of it - for example /*?* - to work as intended above. Am I correct in assuming that?

McTaggart

Thanks Logan - much appreciated, as ever - that really helps. If I were to add another * to Allow: /*?resultspage= so that it becomes Allow: /*?*resultspage= what would happen then?

LoganRay

Ok, gotcha. Add the following directives:

Disallow: /shirts/*?

This prevents crawling of the following:

/shirts/golden/?minprice=10&maxprice=20
/shirts/?minprice=10&maxprice=20

Allow: /*?resultspage=

Allows crawling of the following:

/shirts/navy/?resultspage=02
/shirts/?resultspage=01
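
Why the Allow line wins for the results pages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Disallow rule is written with a wildcard before the ? (as /shirts/*?), the crawler follows the longest-match convention that Google documents (the longest matching rule wins, and Allow wins a tie), and the names rule_matches and is_allowed are made up for this example.

```python
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    # Assumption: '*' matches any run of characters, everything else is
    # literal, and the rule is matched from the start of the URL path.
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.match(regex, url_path) is not None

def is_allowed(rules, url_path: str) -> bool:
    """Apply (directive, pattern) rules using longest-match precedence.

    Assumption: the longest matching pattern wins, Allow wins a tie,
    and a URL that matches no rule is crawlable.
    """
    best_key, best_directive = None, "allow"
    for directive, pattern in rules:
        if rule_matches(pattern, url_path):
            key = (len(pattern), directive == "allow")
            if best_key is None or key > best_key:
                best_key, best_directive = key, directive
    return best_directive == "allow"

# The two directives from this post (Disallow pattern as assumed above).
rules = [
    ("disallow", "/shirts/*?"),
    ("allow", "/*?resultspage="),
]

for url in (
    "/shirts/golden/?minprice=10&maxprice=20",
    "/shirts/?minprice=10&maxprice=20",
    "/shirts/navy/?resultspage=02",
    "/shirts/?resultspage=01",
):
    print(url, "->", "crawlable" if is_allowed(rules, url) else "blocked")
```

Both rules match the resultspage URLs, but the Allow pattern is the longer of the two, so it takes precedence there. Crawlers that do not follow the longest-match convention may behave differently.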

McTaggart

Thanks Logan - much appreciated - the aim would be to prevent bots crawling any parameter'd URL but only in the products section, and not all of them - see below.

I noticed the shirt URLs can produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - and the resulting URLs also feature a ?

So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02, or /shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good to index them somehow. So I wonder how I can override the robots.txt disallow rule for parameters, but only for specific paths and even individual pages?

LoganRay

Disallow: /shirts/?* will only block URLs in which the ? comes immediately after /shirts/, i.e. where the parameter string starts right there. If you want to block /shirts/golden/?minprice=10&maxprice=20 you'll have to add an asterisk before and after the ?

What's the end goal here? Preventing bots from crawling any parameter'd URL?

McTaggart

I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLs further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20 ?

McTaggart

Thanks Logan - the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”

If I amended the URL to
/shirts/?minprice=10&maxprice=20 would robots.txt work as intended right there?

and would that robots.txt work as intended further down the directory structure of the URLs? E.g.
/shirts/golden/?minprice=10&maxprice=20

LoganRay

Hi Luke,

The second developer is correct... well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this directive, since there's no slash after shirts.

For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (they don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.

You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20
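
For a quick local sanity check, the matching described above can be approximated with a minimal Python sketch. It assumes the usual wildcard convention (* matches any run of characters, and a rule matches from the start of the URL path). The helper name rule_matches is hypothetical, and /shirts/*?* is an assumed spelling of the "asterisk on either side" rule.

```python
import re

def rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumption: '*' matches any sequence of characters (including none),
    every other character is literal, and matching starts at the
    beginning of the path, so the rule behaves like a prefix pattern.
    """
    regex = ".*".join(re.escape(part) for part in rule_path.split("*"))
    return re.match(regex, url_path) is not None

# URLs taken from this thread.
urls = [
    "/shirts/?minprice=10&maxprice=20",
    "/shirts?minprice=10&maxprice=20",
    "/shirts/blue?mprice=100&maxp=20",
]

# Compare the original rule with the assumed "asterisk on either side" rule.
for rule in ("/shirts/?*", "/shirts/*?*"):
    for url in urls:
        print(f"{rule:12} {url:35} blocked={rule_matches(rule, url)}")
```

Under these assumptions, /shirts/?* only matches the first URL, while /shirts/*?* matches the first and third. Neither matches /shirts?minprice=10&maxprice=20, because both rules require the slash after shirts.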
