Robots.txt wildcards - the devs had a disagreement - which is correct?
-
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
The second developer suggested that this wildcard would only block URLs featuring a ? that comes immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 BUT argued that this robots.txt directive would not block URLs featuring a ? in subdirectories - e.g. /shirts/blue?mprice=100&maxp=20
So which of the developers is correct?
Beyond that, I assumed that the ? should feature a * on each side of it – for example - /*?* - to work as intended above? Am I correct in assuming that?
-
Thanks Logan - much appreciated, as ever - that really helps
- if I was to add another * to Allow: /*?resultspage= > so Allow: /*?*resultspage= - what would happen then?
-
Ok, gotcha. Add the following directives:
Disallow: /shirts/*?*
This prevents crawling of the following:
- /shirts/golden/?minprice=10&maxprice=20
- /shirts/?minprice=10&maxprice=20
Allow: /*?resultspage=
Allows crawling of the following:
- /shirts/navy/?resultspage=02
- /shirts/?resultspage=01
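For anyone wondering why the Allow wins over the Disallow on the results pages, here is a rough Python sketch of the precedence rule as Google documents it: the longest matching pattern wins, and on a tie the Allow wins. It assumes the Disallow is written with an asterisk on each side of the ?, and the helper names (rule_matches, is_allowed) are made up for illustration. Other crawlers may resolve conflicts differently, so treat this as a sketch rather than a guarantee.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: does a robots.txt path pattern match a URL path?

    '*' stands for any run of characters (including none), a trailing '$'
    anchors the end of the URL, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


def is_allowed(rules, path: str) -> bool:
    """Longest matching pattern wins; on equal length, Allow beats Disallow."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means the URL may be crawled.
    return True if best is None else best[1]


# Assumed rule set: asterisk on each side of the ? in the Disallow.
RULES = [
    ("Disallow", "/shirts/*?*"),
    ("Allow", "/*?resultspage="),
]

for url in [
    "/shirts/golden/?minprice=10&maxprice=20",
    "/shirts/?minprice=10&maxprice=20",
    "/shirts/navy/?resultspage=02",
    "/shirts/?resultspage=01",
]:
    print(url, "->", "allowed" if is_allowed(RULES, url) else "blocked")
```

The Allow pattern is the longer of the two, which is why it takes priority on the resultspage URLs even though the Disallow also matches them.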
-
Thanks Logan - much appreciated - the aim would be to prevent bots crawling any parameter'd URL but only in the products section, and not all of them - see below.
I noticed the shirt URLs can produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - and the resulting URLs also feature a ?
So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02 or /shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good to index them somehow. So I wonder how I can override the robots.txt instruction that disallows parameters, but only for specific paths and even individual pages?
-
Disallow: /shirts/?* will only block URLs that end with /shirts/ before beginning a parameter string. If you want to block /shirts/golden/?minprice=10&maxprice=20 you'll have to add the asterisk before and after the ?
What's the end goal here? Preventing bots from crawling any parameter'd URL?
-
I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLs further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20 ?
-
Thanks Logan - the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
If I amended the URL to
/shirts/?minprice=10&maxprice=20 would the robots.txt directive work as intended right there? And would it work as intended further down the directory structure of the URLs? E.g.
/shirts/golden/?minprice=10&maxprice=20
-
Hi Luke,
The second developer is correct... well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this directive, since there's no slash after shirts.
For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (they don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.
You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20
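If you'd rather sanity-check the wildcard behaviour offline, here is a minimal Python sketch of how the matching works, assuming the usual rules: a pattern is matched from the start of the URL path, * stands for any run of characters (including none), and a trailing $ anchors the end. The function name rule_matches is just something made up for this example, not part of any library.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Hypothetical helper: does a robots.txt path pattern match a URL path?"""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces (so '?' is a plain question mark) and let
    # each '*' stand for "any characters, possibly none".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match only matches at the start of the string, i.e. prefix matching.
    return re.match(regex, path) is not None


PATTERNS = ["/shirts/?*", "/shirts/*?*"]
URLS = [
    "/shirts?minprice=10&maxprice=20",
    "/shirts/?minprice=10&maxprice=20",
    "/shirts/blue?mprice=100&maxp=20",
    "/shirts/mens/navyblue/?minprice=10&maxprice=20",
]

for pattern in PATTERNS:
    for url in URLS:
        verdict = "blocked" if rule_matches(pattern, url) else "not blocked"
        print(f"Disallow: {pattern}  {url}  ->  {verdict}")
```

With /shirts/?* only the URL where the ? comes directly after /shirts/ is matched. With /shirts/*?* every URL under /shirts/ that contains a ? is matched, and /shirts?minprice=10&maxprice=20 is still not matched by either pattern because of the missing slash.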