Robots.txt wildcards - the devs had a disagreement - which is correct?
-
Hi – the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
The second developer suggested that this wildcard would only block URLs featuring a ? that come immediately after /shirts/ - for example: /shirts?minprice=10&maxprice=20 BUT argued that this robots.txt directive would not block URLS featuring a ? in sub directories - e.g. /shirts/blue?mprice=100&maxp=20
So which of the developers is correct?
Beyond that, I assumed that the ? should feature a * on each side of it – for example - /? - to work as intended above? Am I correct in assuming that?
-
Thanks Logan - much appreciated, as ever - that really helps - if I was to add another * to **Allow: /?resultspage= > so **Allow: /?*resultspage= - what would happen then? ****
-
Ok, gotcha. Add the following directives:
Disallow: /shirts/?
This prevents crawling of the following:
- /shirts**/golden/**?minprice=10&maxprice=20
- /shirts/?minprice=10&maxprice=20
Allow: /*?resultspage=
Allows crawling of the following:
- /shirts/navy/?resultspage=02
- /shirts/?resultspage=01
-
Thanks Logan - much appreciated - the aim would be to prevent bots crawling any parameter'd URL but only in the products section, and not all of them - see below.
I noticed the shirt URLs can be produce many pages of results - e.g. if you look for a type of shirt you can get up to 20 pages of results - the resulting URLs also feature a ?
So you end up with - for example - /shirts/?resultspage=01 and then /shirts/?resultspage=02 or shirts/navy/?resultspage=01 and /shirts/navy/?resultspage=02 - and so on - and it would be good to index them somehow. So I wonder how I can override disallow parameters robots.txt instruction only for specific paths and even individual pages?
-
Disallow: /shirts/?* will only block URLs that end with /shirts/ before beginning a parameter string. If you want to block /shirts**/golden/**?minprice=10&maxprice=20 you'll have to add the asterisk before and after the ?
What the end goal here? Preventing bots from crawling any parameter'd URL?
-
I suppose the nub of the disagreement is this: would Disallow: /shirts/?* block /shirts/?minprice=10&maxprice=20 and also block URLS further down the URL directory structure - e.g. /shirts/mens/navyblue/?minprice=10&maxprice=20 ?
-
Thanks Logan - the lead website developer was assuming that this wildcard: Disallow: /shirts/?* would block URLs including a ? within this directory, and all the subdirectories of this directory that included a “?”
If I amended the URL to
/shirts/?minprice=10&maxprice=20 would robots.txt work as intended right there?and would that robots.txt work as intended further down the directory structure of the URLs? E.g.
/shirts**/golden/**?minprice=10&maxprice=20 -
Hi Luke,
The second developer is correct....well, more correct than the first. Your example of /shirts?minprice=10&maxprice=20 would not be blocked by this direction, since there's no slack after shirts.
For future reference, you can test how directives function in Google Search Console. Under the 'Crawl' menu, there's a robots.txt tester in which you can manually edit the robots.txt directives (they don't apply to the live file) and enter test URLs to see which directive, if any, would prevent crawling.
You are correct in your assumption that a * on either side of the ? would prevent crawling of both /shirts/blue?mprice=100&maxp=20 and /shirts/?minprice=10&maxprice=20
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
No index detected in robots meta tag GSC issue_Help Please
Hi Everyone, We just did a site migration ( URL structure change, site redesign, CMS change). During migration, dev team messed up badly on a few things including SEO. The old site had pages canonicalized and self canonicalized <> New site doesn't have anything (CMS dev error) so we are working retroactively to add canonicalization mechanism The legacy site had URL’s ending with a trailing slash “/” <> new site got redirected to Set of url’s without “/” New site action : All robots are allowed: A new sitemap is submitted to google search console So here is my problem (it been a long 24hr night for me 🙂 ) 1. Now when I look at GSC homepage URL it says that old page is self canonicalized and currently in index (old page with a trailing slash at the end of URL). 2. When I try to perform a live URL test, I get the message "No: 'noindex' detected in 'robots' meta tag" , so indexation cant be done. I have no idea where noindex is coming from. 3. Robots.txt in search console still showing old file ( no noindex there ) I tried to submit new file but old one still coming up. When I click on "See live robots.txt" I get current robots. 4. I see that old page is still canonicalized and attempting to index redirected old page might be confusing google Hope someone can help to get the new page indexed! I really need it 🙂 Please ping me if you need more clarification. Thank you ! Thank you
Intermediate & Advanced SEO | | bgvsiteadmin1 -
What do you add to your robots.txt on your ecommerce sites?
We're looking at expanding our robots.txt, we currently don't have the ability to noindex/nofollow. We're thinking about adding the following: Checkout Basket Then possibly: Price Theme Sortby other misc filters. What do you include?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Robots txt is case senstive? Pls suggest
Hi i have seen few urls in the html improvements duplicate titles Can i disable one of the below url in the robots.txt? /store/Solar-Home-UPS-1KV-System/75652
Intermediate & Advanced SEO | | Rahim119
/store/solar-home-ups-1kv-system/75652 if i disable this Disallow: /store/Solar-Home-UPS-1KV-System/75652 will the Search engines scan this /store/solar-home-ups-1kv-system/75652 im little confused with case senstive.. Pls suggest go ahead or not in the robots.txt0 -
Should I use meta noindex and robots.txt disallow?
Hi, we have an alternate "list view" version of every one of our search results pages The list view has its own URL, indicated by a URL parameter I'm concerned about wasting our crawl budget on all these list view pages, which effectively doubles the amount of pages that need crawling When they were first launched, I had the noindex meta tag be placed on all list view pages, but I'm concerned that they are still being crawled Should I therefore go ahead and also apply a robots.txt disallow on that parameter to ensure that no crawling occurs? Or, will Googlebot/Bingbot also stop crawling that page over time? I assume that noindex still means "crawl"... Thanks 🙂
Intermediate & Advanced SEO | | ntcma0 -
Google Keyword Planner tool is not correct
Hi All, I know you are all know about Google keyword planner tool. As i know its shows most keywords searched totally wrong. One of the keyword searches less than (<10) but i got 20 exact keyword hits in only one business day and one of the keyword shows more then 10 K searches give us only 3-4 hits in one day.
Intermediate & Advanced SEO | | dotlineseo0 -
Robots.txt help
Hi Moz Community, Google is indexing some developer pages from a previous website where I currently work: ddcblog.dev.examplewebsite.com/categories/sub-categories Was wondering how I include these in a robots.txt file so they no longer appear on Google. Can I do it under our homepage GWT account or do I have to have a separate account set up for these URL types? As always, your expertise is greatly appreciated, -Reed
Intermediate & Advanced SEO | | IceIcebaby0 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560 -
Should I robots block site directories with primarily duplicate content?
Our site, CareerBliss.com, primarily offers unique content in the form of company reviews and exclusive salary information. As a means of driving revenue, we also have a lot of job listings in ouir /jobs/ directory, as well as educational resources (/career-tools/education/) in our. The bulk of this information are feeds, which exist on other websites (duplicate). Does it make sense to go ahead and robots block these portions of our site? My thinking is in doing so, it will help reallocate our site authority helping the /salary/ and /company-reviews/ pages rank higher, and this is where most of the people are finding our site via search anyways. ie. http://www.careerbliss.com/jobs/cisco-systems-jobs-812156/ http://www.careerbliss.com/jobs/jobs-near-you/?l=irvine%2c+ca&landing=true http://www.careerbliss.com/career-tools/education/education-teaching-category-5/
Intermediate & Advanced SEO | | CareerBliss0