Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Are robots.txt wildcards still valid? If so, what is the proper syntax for setting this up?
-
I've got several URL's that I need to disallow in my robots.txt file. For example, I've got several documents that I don't want indexed and filters that are getting flagged as duplicate content. Rather than typing in thousands of URL's I was hoping that wildcards were still valid.
-
Great job. I just wanted to add this from Google Webmasters
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
and this from Google Developers
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
-
Yup wildcard syntax is indeed still valid. However I can only confirm that the big 3 (Google, Yahoo and Bing) actively observe it. Other secondary search engines may not.
In your case you are probably looking for a syntax along the lines of:
User-agent: *
Disallow: /*.pdf$ This would set that any user agent should be blocked from any file name that ends in .pdf (a $ ties it to the end so pdf.txt would not be blocked in this case)Keep an eye on how you block them. Missing a trailing slash could block a directory rather than a file, or not appending a strict symbol ($) could mean that phrases throughout a directory could be blocked rather than just a filename.
Also keep in mind if you are using URL re-writing this may play into how you need to block things; and you may also want to remember that disallowing access in a robot.txt does NOT prevent search engines from indexing the data, it is up to them if they honor the request. So if it is very important to block the file access from search engines then robots.txt may not be the way to do it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
I have two robots.txt pages for www and non-www version. Will that be a problem?
There are two robots.txt pages. One for www version and another for non-www version though I have moved to the non-www version.
Technical SEO | | ramb0 -
Robots.txt in subfolders and hreflang issues
A client recently rolled out their UK business to the US. They decided to deploy with 2 WordPress installations: UK site - https://www.clientname.com/uk/ - robots.txt location: UK site - https://www.clientname.com/uk/robots.txt
Technical SEO | | lauralou82
US site - https://www.clientname.com/us/ - robots.txt location: UK site - https://www.clientname.com/us/robots.txt We've had various issues with /us/ pages being indexed in Google UK, and /uk/ pages being indexed in Google US. They have the following hreflang tags across all pages: We changed the x-default page to .com 2 weeks ago (we've tried both /uk/ and /us/ previously). Search Console says there are no hreflang tags at all. Additionally, we have a robots.txt file on each site which has a link to the corresponding sitemap files, but when viewing the robots.txt tester on Search Console, each property shows the robots.txt file for https://www.clientname.com only, even though when you actually navigate to this URL (https://www.clientname.com/robots.txt) you’ll get redirected to either https://www.clientname.com/uk/robots.txt or https://www.clientname.com/us/robots.txt depending on your location. Any suggestions how we can remove UK listings from Google US and vice versa?0 -
One robots.txt file for multiple sites?
I have 2 sites hosted with Blue Host and was told to put the robots.txt in the root folder and just use the one robots.txt for both sites. Is this right? It seems wrong. I want to block certain things on one site. Thanks for the help, Rena
Technical SEO | | renalynd270 -
2 sitemaps on my robots.txt?
Hi, I thought that I just could link one sitemap from my site's robots.txt but... I may be wrong. So, I need to confirm if this kind of implementation is right or wrong: robots.txt for Magento Community and Enterprise ...
Technical SEO | | Webicultors
Sitemap: http://www.mysite.es/media/sitemap/es.xml
Sitemap: http://www.mysite.pt/media/sitemap/pt.xml Thanks in advance,0 -
My old URL's are still indexing when I have redirected all of them, why is this happening?
I have built a new website and have redirected all my old URL's to their new ones but for some reason Google is still indexing the old URL's. Also, the page authority for all of my pages has dropped to 1 (apart from the homepage) but before they were between 12 to 15. Can anyone help me with this?
Technical SEO | | One2OneDigital0 -
Block Domain in robots.txt
Hi. We had some URLs that were indexed in Google from a www1-subdomain. We have now disabled the URLs (returning a 404 - for other reasons we cannot do a redirect from www1 to www) and blocked via robots.txt. But the amount of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot install Webmaster Tools for this subdomain to tell Google to back off... Any ideas why this could be and whether it's normal? I can send you more domain infos by personal message if you want to have a look at it.
Technical SEO | | zeepartner0 -
Removing robots.txt on WordPress site problem
Hi..am a little confused since I ticked the box in WordPress to allow search engines to now crawl my site (previously asked for them not to) but Google webmaster tools is telling me I still have robots.txt blocking them so am unable to submit the sitemap. Checked source code and the robots instruction has gone so a little lost. Any ideas please?
Technical SEO | | Wallander0 -
Should we use Google's crawl delay setting?
We’ve been noticing a huge uptick in Google’s spidering lately, and along with it a notable worsening of render times. Yesterday, for example, Google spidered our site at a rate of 30:1 (google spider vs. organic traffic.) So in other words, for every organic page request, Google hits the site 30 times. Our render times have lengthened to an avg. of 2 seconds (and up to 2.5 seconds). Before this renewed interest Google has taken in us we were seeing closer to one second average render times, and often half of that. A year ago, the ratio of Spider to Organic was between 6:1 and 10:1. Is requesting a crawl-delay from Googlebot a viable option? Our goal would be only to reduce Googlebot traffic, and hopefully improve render times and organic traffic. Thanks, Trisha
Technical SEO | | lzhao0