Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Wildcarding Robots.txt for Particular Word in URL
-
Hey All,
So I know that this isn't a standard robots.txt, I'm aware of how to block or wildcard certain folders but I'm wondering whether it's possible to block all URL's with a certain word in it?
We have a client that was hacked a year ago and now they want us to help remove some of the pages that were being autogenerated with the word "viagra" in it. I saw this article and tried implementing it https://builtvisible.com/wildcards-in-robots-txt/ and it seems that I've been able to remove some of the URL's (although I can't confirm yet until I do a full pull of the SERPs on the domain). However, when I test certain URL's inside of WMT it still says that they are allowed which makes me think that it's not working fully or working at all.
In this case these are the lines I've added to the robots.txt
Disallow: /*&viagra
Disallow: /*&Viagra
I know I have the solution of individually requesting URL's to be removed from the index but I want to see if anybody has every had success with wildcarding URL's with a certain word in their robots.txt? The individual URL route could be very tedious.
Thanks!
Jon
-
Hey Paul,
Great answer, for some reason it totally slipped my mind that robots.txt is a crawling directive and not an index one. Yes the pages return a 404 on the headers. I've grabbed a copy of the complete SERPS and will now manually disallow them.
Thanks!
Jon
-
Thank for the endorsement, Christy! Funny, I only just now saw Rand's recent WBF related to this topic, but pleased to see my answer lines up exactly with his info.

P.
-
You need to be aware, Jonathan, that there is absolutely nothing about a robots.txt disallow that will help remove a URL from the search engine indexes. Robots is a crawling directive, NOT an indexing directive. In fact, in most cases, blocking URLs in robots.txt will actually cause them to remain in the index even longer.
I'm assuming you have cleaned up the site so the actual spam URLs no longer resolve. Those URLs should now result in a 404 error page. You must confirm they are actually returning the correct 404 code in the headers. As long as this is the case, it is a matter of waiting while the search engines crawl the spam URLs often enough to recognise they are really gone and remove them from the index. The problem with adding them to the robots.txt is that is actually telling the search engines NOT to crawl them, so they are unlikely to discover that they lead to 404s, hence they may remain in the index even longer.
Unfortunately you can't use a no-index tag on the offending pages, because the pages should no longer exist on the site. I don't think even a careful implementation of a X-Robots noindex directive in htaccess would work, because the URLs should be resulting in a 404.
Make certain the problem URLs return a clean 404, use the Google Search Console Remove URLs tool for as many of them as you can (for example you can request removal for entire directories, if the spam happened to be built that way), and then be patient for the rest. But do NOT block them in robots.txt - you'll just prolong the agony and waste your time.
Hope that all makes sense?
Paul
-
Hi Jon,
Why not just: Disallow: /viagra
-
Jon,
I have never done it with a robots.txt, one easy why that I think you could do it would be on the page level. You could add a noindex nofollow to the page itself.
You can generate it automatically too and have it fired depending on the url by using a substring search on the url as well. That will get them all for sure.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
URL in russian
Hi everyone, I am doing an audit of a site that currently have a lot of 500 errors due to the russian langage. Basically, all the url's look that way for every page in russian: http://www.exemple.com/ru-kg/pешения-для/food-packaging-machines/
Intermediate & Advanced SEO | | alexrbrg
http://www.exemple.com/ru-kg/pешения-для/wood-flour-solutions/
http://www.exemple.com/ru-kg/pешения-для/cellulose-solutions/ I am wondering if this error is really caused by the server or if Google have difficulty reading the russian langage in URL's. Is it better to have the URL's only in english ?0 -
Double hyphen in URL - bad?
Instead of a URL such as domain.com/double-dash/ programming wants to use domain.com/double--dash/ for some reason that makes things easier for them. Would a double dash in the URL have a negative effect on the page ranking?
Intermediate & Advanced SEO | | CFSSEO0 -
Baidu Spider appearing on robots.txt
Hi, I'm not too sure what to do about this or what to think of it. This magically appeared in my companies robots.txt file (literally magically appeared/text is below) User-agent: Baiduspider
Intermediate & Advanced SEO | | IceIcebaby
User-agent: Baiduspider-video
User-agent: Baiduspider-image
Disallow: / I know that Baidu is the Google of China, but I'm not sure why this would appear in our robots.txt all of a sudden. Should I be worried about a hack? Also, would I want to disallow Baidu from crawling my companies website? Thanks for your help,
-Reed0 -
How to deal with URLs and tabbed content
Hi All, We're currently redesigning a website for a new home developer and we're trying to figure out the best way to deal with tabbed content in the URL structure. The design of the site at the moment will have a page for a development and within that you can select your house type, then when on the house type page there will be tabs displayed for the user to see things like the plot map, availability and pricing, specifications, etc. The way our development team are looking at handling this is for the URL to use a hashtag or a query string at the end of it so we can still land users on these specific tabs for PPC for example. My question is really, has anyone had any experience with this? Any recommendations on how to best display the urls for SEO? Thanks
Intermediate & Advanced SEO | | J_Sinclair0 -
If I own a .com url and also have the same url with .net, .info, .org, will I want to point them to the .com IP address?
I have a domain, for example, mydomain.com and I purchased mydomain.net, mydomain.info, and mydomain.org. Should I point the host @ to the IP where the .com is hosted in wpengine? I am not doing anything with the .org, .info, .net domains. I simply purchased them to prevent competitors from buying the domains.
Intermediate & Advanced SEO | | djlittman0 -
Meta NoIndex tag and Robots Disallow
Hi all, I hope you can spend some time to answer my first of a few questions 🙂 We are running a Magento site - layered/faceted navigation nightmare has created thousands of duplicate URLS! Anyway, during my process to tackle the issue, I disallowed in Robots.txt anything in the querystring that was not a p (allowed this for pagination). After checking some pages in Google, I did a site:www.mydomain.com/specificpage.html and a few duplicates came up along with the original with
Intermediate & Advanced SEO | | bjs2010
"There is no information about this page because it is blocked by robots.txt" So I had added in Meta Noindex, follow on all these duplicates also but I guess it wasnt being read because of Robots.txt. So coming to my question. Did robots.txt block access to these pages? If so, were these already in the index and after disallowing it with robots, Googlebot could not read Meta No index? Does Meta Noindex Follow on pages actually help Googlebot decide to remove these pages from index? I thought Robots would stop and prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”. I'm a bit confused about how to use these in both preventing duplicate content in the first place and then helping to address dupe content once it's already in the index. Thanks! B0 -
Robots.txt: Can you put a /* wildcard in the middle of a URL?
We have noticed that Google is indexing the language/country directory versions of directories we have disallowed in our robots.txt. For example: Disallow: /images/ is blocked just fine However, once you add our /en/uk/ directory in front of it, there are dozens of pages indexed. The question is: Can I put a wildcard in the middle of the string, ex. /en/*/images/, or do I need to list out every single country for every language in the robots file. Anyone know of any workarounds?
Intermediate & Advanced SEO | | IHSwebsite0 -
Blocking Dynamic URLs with Robots.txt
Background: My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. For example, a standard category page: www.mysite.com/widgets.html ...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page: http://www.mysite.com/widgets.html?price=1%2C250 http://www.mysite.com/widgets.html?price=2%2C250 http://www.mysite.com/widgets.html?price=3%2C250 As there are literally thousands of these URL variations being indexed, so I'd like to use Robots.txt to disallow these variations. Question: Is this a wise thing to do? Or does Google take into account layered navigation links by default, and I don't need to worry. To implement, I was going to do the following in Robots.txt: User-agent: * Disallow: /*? Disallow: /*= ....which would prevent any dynamic URL with a '?" or '=' from being indexed. Is there a better way to do this, or is this a good solution? Thank you!
Intermediate & Advanced SEO | | AndrewY1