Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Disallow wildcard match in Robots.txt
-
This is in my robots.txt file, does anyone know what this is supposed to accomplish, it doesn't appear to be blocking URLs with question marks
Disallow: /?crawler=1
Disallow: /?mobile=1Thank you
-
This is a good reply.
Everyone gets really confused because Robots.txt has very minor, partial wildcard support and that makes people think that Robots.txt files use Regex, which they do not. Instead of having some weird half and half implementation, it would be much better IMO if the Robots.txt initiative / directive were updated to say "yes, you can use full regular expressions with regards to URL string matching".
Many people are left in a kind of silly guessing game because Google doesn't 'properly' elaborate or invest in expanding the definitions to their currently (publicly) assumed end-game.
People assume that if "*" will match any string of characters, "?" will match any individual character when used in a robots.txt file. This would make sense, but it's not the case. AFAIK there are only one or two supported wildcard characters in Robots.txt and that's why people get confused, looking for escape characters and the suchlike.
-
Hi Amanda,
Those lines tell GoogleBot not to crawl urls that have that text fragments.
For example, wont crawl: domain.com/category/product**?mobile=1**BUT, that doesnt mean that will not crawl every URL with question marks. For that, the line should be like this:
Disallow: /*?I do highly recommend you to read this guides:
About /robots.txt - Official site - Robotstxt.org
Robots.txt - Moz
Robots.txt: the ultimate guide - YOAST
The Complete Guide to Robots.txt - PORTENTHope it helps.
Best luck.
GR
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
What is the importance of exact match keywords for local SEO in service industry businesses?
I am working with a local service contractor. Several of his competitors have domain names with exact match keywords. Audits of competitor sites and use of other research tools reveals that their sites are behind in content and technical SEO. The competitor sites consistently rank higher in organic search results. I am new to SEO and I understand that some of my lack of clarity here is a result of not understanding the value of key word use in local SEO vs. wider efforts.
Technical SEO | | Andrew Woffenden0 -
Good to use disallow or noindex for these?
Hello everyone, I am reaching out to seek your expert advice on a few technical SEO aspects related to my website. I highly value your expertise in this field and would greatly appreciate your insights.
Technical SEO | | williamhuynh
Below are the specific areas I would like to discuss: a. Double and Triple filter pages: I have identified certain URLs on my website that have a canonical tag pointing to the main /quick-ship page. These URLs are as follows: https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black
https://www.interiorsecrets.com.au/collections/lounge-chairs/quick-ship+black+fabric Considering the need to optimize my crawl budget, I would like to seek your advice on whether it would be advisable to disallow or noindex these pages. My understanding is that by disallowing or noindexing these URLs, search engines can avoid wasting resources on crawling and indexing duplicate or filtered content. I would greatly appreciate your guidance on this matter. b. Page URLs with parameters: I have noticed that some of my page URLs include parameters such as ?variant and ?limit. Although these URLs already have canonical tags in place, I would like to understand whether it is still recommended to disallow or noindex them to further conserve crawl budget. My understanding is that by doing so, search engines can prevent the unnecessary expenditure of resources on indexing redundant variations of the same content. I would be grateful for your expert opinion on this matter. Additionally, I would be delighted if you could provide any suggestions regarding internal linking strategies tailored to my website's structure and content. Any insights or recommendations you can offer would be highly valuable to me. Thank you in advance for your time and expertise in addressing these concerns. I genuinely appreciate your assistance. If you require any further information or clarification, please let me know. I look forward to hearing from you. Cheers!0 -
Role of Robots.txt and Search Console parameters settings
Hi, wondering if anyone can point me to resources or explain the difference between these two. If a site has url parameters disallowed in Robots.txt is it redundant to edit settings in Search Console parameters to anything other than "Let Googlebot Decide"?
Technical SEO | | LivDetrick0 -
Robot.txt : How to block a specific file type in several subdirectories ?
Hello everyone ! I need help setting up a robot.txt. I'm trying to block all pdf files in particular directories so I'm using this command. In the example below the line is blocking all .gif in the entire site. Block files of a specific file type (for example, .gif) | Disallow: /*.gif$ 2 questions : Can I use this command to specify one particular directory in which I want to block pdf files ? Will this line be recognized by googlebots ? Disallow: /fileadmin/xxxxxxx/xxx/xxxxxxx/*.pdf$ Then I realized that I would have to write as many lines as many directories there are in which I want to block pdf files. Let's say I want to block pdf files in all these 3 directories /fileadmin/directory1 /fileadmin/directory1/sub1 /fileadmin/directory1/sub1/pdf Is there a pattern-matching rule I could use to blocks access to pdf files in all subdirectories instead of writing 3x the above line for each subdirectory ? For exemple : Disallow: /fileadmin/directory1*/ Many thanks in advance for any insight you may have.
Technical SEO | | LabeliumUSA0 -
Robots.txt Syntax for Dynamic URLs
I want to Disallow certain dynamic pages in robots.txt and am unsure of the proper syntax. The pages I want to disallow all include the string ?Page= Which is the proper syntax?
Technical SEO | | btreloar
Disallow: ?Page=
Disallow: ?Page=*
Disallow: ?Page=
Or something else?0 -
Googlebot does not obey robots.txt disallow
Hi Mozzers! We are trying to get Googlebot to steer away from our internal search results pages by adding a parameter "nocrawl=1" to facet/filter links and then robots.txt disallow all URLs containing that parameter. We implemented this late august and since that, the GWMT message "Googlebot found an extremely high number of URLs on your site", stopped coming. But today we received yet another. The weird thing is that Google gives many of our nowadays robots.txt disallowed URLs as examples of URLs that may cause us problems. What could be the reason? Best regards, Martin
Technical SEO | | TalkInThePark0 -
Robots.txt and canonical tag
In the SEOmoz post - http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts, it's being said - If you have a robots.txt disallow in place for a page, the canonical tag will never be seen. Does it so happen that if a page is disallowed by robots.txt, spiders DO NOT read the html code ?
Technical SEO | | seoug_20050 -
Robots.txt File Redirects to Home Page
I've been doing some site analysis for a new SEO client and it has been brought to my attention that their robots.txt file redirects to their homepage. I was wondering: Is there a benfit to setup your robots.txt file to do this? Will this effect how their site will get indexed? Thanks for your response! Kyle Site URL: http://www.radisphere.net/
Technical SEO | | kchandler0