Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Are robots.txt wildcards still valid? If so, what is the proper syntax for setting this up?
-
I've got several URL's that I need to disallow in my robots.txt file. For example, I've got several documents that I don't want indexed and filters that are getting flagged as duplicate content. Rather than typing in thousands of URL's I was hoping that wildcards were still valid.
-
Great job. I just wanted to add this from Google Webmasters
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
and this from Google Developers
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
-
Yup wildcard syntax is indeed still valid. However I can only confirm that the big 3 (Google, Yahoo and Bing) actively observe it. Other secondary search engines may not.
In your case you are probably looking for a syntax along the lines of:
User-agent: *
Disallow: /*.pdf$ This would set that any user agent should be blocked from any file name that ends in .pdf (a $ ties it to the end so pdf.txt would not be blocked in this case)Keep an eye on how you block them. Missing a trailing slash could block a directory rather than a file, or not appending a strict symbol ($) could mean that phrases throughout a directory could be blocked rather than just a filename.
Also keep in mind if you are using URL re-writing this may play into how you need to block things; and you may also want to remember that disallowing access in a robot.txt does NOT prevent search engines from indexing the data, it is up to them if they honor the request. So if it is very important to block the file access from search engines then robots.txt may not be the way to do it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Unsolved Google Search Console Still Reporting Errors After Fixes
Hello, I'm working on a website that was too bloated with content. We deleted many pages and set up redirects to newer pages. We also resolved an unreasonable amount of 400 errors on the site. I also removed several ancient sitemaps that listed content deleted years ago that Google was crawling. According to Moz and Screaming Frog, these errors have been resolved. We've submitted the fixes for validation in GSC, but the validation repeatedly fails. What could be going on here? How can we resolve these error in GSC.
Technical SEO | | tif-swedensky0 -
One robots.txt file for multiple sites?
I have 2 sites hosted with Blue Host and was told to put the robots.txt in the root folder and just use the one robots.txt for both sites. Is this right? It seems wrong. I want to block certain things on one site. Thanks for the help, Rena
Technical SEO | | renalynd270 -
Is there a limit to how many URLs you can put in a robots.txt file?
We have a site that has way too many urls caused by our crawlable faceted navigation. We are trying to purge 90% of our urls from the indexes. We put no index tags on the url combinations that we do no want indexed anymore, but it is taking google way too long to find the no index tags. Meanwhile we are getting hit with excessive url warnings and have been it by Panda. Would it help speed the process of purging urls if we added the urls to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the urls, but not purge them from the index? The list could be in excess of 100MM urls.
Technical SEO | | kcb81780 -
Is there any value in having a blank robots.txt file?
I've read an audit where the writer recommended creating and uploading a blank robots.txt file, there was no current file in place. Is there any merit in having a blank robots.txt file? What is the minimum you would include in a basic robots.txt file?
Technical SEO | | NicDale0 -
Genesis WP Theme H1 Tag not properly Used?
I am in the process of redesigning my website, and I have been working on the Genesis framework a lot lately, so I used the Genesis framework to make my new site. The URL is http://protechig.com As I look at the H1 on the page (homepage only, every other page has solid h1s from an SEO perspective.) The first thing that I see is that the home page H1 is a links (to protech's home page). The second thing that I see is the the title text is replaced with an image (my logo) and there is a text-indent:-99999; and overflow:hiden; I just want to know from an SEO perspective if this is okay, and, if it isn't, what I could/should to to rectify it. Thanks Zach
Technical SEO | | Zachary_Russell0 -
Robots.txt file getting a 500 error - is this a problem?
Hello all! While doing some routine health checks on a few of our client sites, I spotted that a new client of ours - who's website was not designed built by us - is returning a 500 internal server error when I try to look at the robots.txt file. As we don't host / maintain their site, I would have to go through their head office to get this changed, which isn't a problem but I just wanted to check whether this error will actually be having a negative effect on their site / whether there's a benefit to getting this changed? Thanks in advance!
Technical SEO | | themegroup0 -
Is blocking RSS Feeds with robots.txt necessary?
Is it necessary to block an rss feed with robots.txt? It seems they are automatically not indexed (http://googlewebmastercentral.blogspot.com/2007/12/taking-feeds-out-of-our-web-search.html) And, google says here that it's important not to block RSS feeds (http://googlewebmastercentral.blogspot.com/2009/10/using-rssatom-feeds-to-discover-new.html) I'm just checking!
Technical SEO | | nicole.healthline0 -
How to set up a rel canonical in big commmerce?
I have no clue how to set this up in the Bigcommerce store platform
Technical SEO | | Firestarter-SEO0