Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
What does Disallow: /french-wines/?* actually do - robots.txt
-
Hello Mozzers - Just wondering what this robots.txt instruction means: Disallow: /french-wines/?*
Does it stop Googlebot crawling and indexing URLs in that "French Wines" folder - specifically the URLs that include a question mark?
Would it stop the crawling of deeper folders - e.g. /french-wines/rhone-region/ that include a question mark in their URL?
I think this has been done to block URLs containing query strings.
Thanks, Luke
-
Glad to help, Luke!
-
Thanks Logan for your help with this - much appreciated. Really helpful!
-
Disallow: /?* is the same thing as Disallow:/?, since the asterisk is a wildcard, both of those disallows prevent any URL that begins with /? from being crawled.
And yes, it is incredibly easy to disallow the wrong thing! The robots.txt tester in Search Console (under the Crawl menu) is very helpful for figuring out what a disallow will catch and what it will let by. I highly recommend testing any new disallows there before releasing them into the wild.
-
Thanks again Logan.
What would Disallow: /?* do because that is what the site I am looking at has implemented. Perhaps it works both ways around?
I imagine it's easy to disallow the wrong thing or possibly not disallow the right thing. Ugh.
-
Disallow: /*?
This disallow literally says to crawlers 'if a URL starts with a slash (all URLs) and has a parameter, don't crawl it'. The * is a wildcard that says anything between / and ? is applicable to the disallow.
It's very easy to disallow the wrong this especially in regards to parameters, for this reason I always do these 2 things rather than using robots.txt:
- Set the purpose of each parameter in Search Console - Go to Crawl > URL Parameters to configure for your site
- Self-referring canonicals - most people disallow URLs with parameters in robots.txt to prevent indexing, but this only prevents crawling. A self-referring canonical pointing to the root level of that URL will prevent indexing or URLs with parameters.
Hope that's helpful!
-
Thanks Logan - I was just reading: Disallow: /*? # block any URL that includes a ? (and thus a query string) - do you know why the ? comes before the * in this case?
-
Hi Luke,
You are correct that this was done to block URLs with parameters. However, since there's no wildcard (the asterisk) before the folder name, the URL would have to start with /french-wines/. This disallow is really only preventing crawling on the single URL www.yoursite.com/french-wines/ with any parameters appended.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Disallow: /jobs/? is this stopping the SERPs from indexing job posts
Hi,
Intermediate & Advanced SEO | | JamesHancocks1
I was wondering what this would be used for as it's in the Robots.exe of a recruitment agency website that posts jobs. Should it be removed? Disallow: /jobs/?
Disallow: /jobs/page/*/ Thanks in advance.
James0 -
What can we do to optimize / be mobile-friendly for PDFs?
I'm getting a "Your page is not mobile-friendly." notice in the SERPs for all of our PDFs. I check the pdf on the phone and it appears just fine. rFtLq
Intermediate & Advanced SEO | | johnnybgunn0 -
How to handle a blog subdomain on the main sitemap and robots file?
Hi, I have some confusion about how our blog subdomain is handled in our sitemap. We have our main website, example.com, and our blog, blog.example.com. Should we list the blog subdomain URL in our main sitemap? In other words, is listing a subdomain allowed in the root sitemap? What does the final structure look like in terms of the sitemap and robots file? Specifically: **example.com/sitemap.xml ** would I include a link to our blog subdomain (blog.example.com)? example.com/robots.xml would I include a link to BOTH our main sitemap and blog sitemap? blog.example.com/sitemap.xml would I include a link to our main website URL (even though it's not a subdomain)? blog.example.com/robots.xml does a subdomain need its own robots file? I'm a technical SEO and understand the mechanics of much of on-page SEO.... but for some reason I never found an answer to this specific question and I am wondering how the pros do it. I appreciate your help with this.
Intermediate & Advanced SEO | | seo.owl0 -
Block in robots.txt instead of using canonical?
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed? In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
Intermediate & Advanced SEO | | YairSpolter0 -
Recovering from robots.txt error
Hello, A client of mine is going through a bit of a crisis. A developer (at their end) added Disallow: / to the robots.txt file. Luckily the SEOMoz crawl ran a couple of days after this happened and alerted me to the error. The robots.txt file was quickly updated but the client has found the vast majority of their rankings have gone. It took a further 5 days for GWMT to file that the robots.txt file had been updated and since then we have "Fetched as Google" and "Submitted URL and linked pages" in GWMT. In GWMT it is still showing that that vast majority of pages are blocked in the "Blocked URLs" section, although the robots.txt file below it is now ok. I guess what I want to ask is: What else is there that we can do to recover these rankings quickly? What time scales can we expect for recovery? More importantly has anyone had any experience with this sort of situation and is full recovery normal? Thanks in advance!
Intermediate & Advanced SEO | | RikkiD220 -
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: Can I put a wildcard in the middle of the string, ex. /en/*/images/, or do I need to list out every single country for every language in the robots file. Anyone know of any workarounds?
Intermediate & Advanced SEO | | IHSwebsite0 -
Could you use a robots.txt file to disalow a duplicate content page from being crawled?
A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution?
Intermediate & Advanced SEO | | gregelwell0 -
Do search engines understand special/foreign characters?
We carry a few brands that have special foreign characters, e.g., Kühl, Lolë, but do search engines recognize special unicode characters? Obviously we would want to spend more energy optimizing keywords that potential customers can type with a keyboard, but is it worthwhile to throw in some encoded keywords and anchor text for people that copy-paste these words into a search? Do search engines typically equate special characters to their closest English equivalent, or are "Kuhl", "Kühl" and "Kühl" three entirely different terms?
Intermediate & Advanced SEO | | TahoeMountain400