Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Wildcarding Robots.txt for Particular Word in URL
-
Hey All,
So I know that this isn't a standard robots.txt, I'm aware of how to block or wildcard certain folders but I'm wondering whether it's possible to block all URL's with a certain word in it?
We have a client that was hacked a year ago and now they want us to help remove some of the pages that were being autogenerated with the word "viagra" in it. I saw this article and tried implementing it https://builtvisible.com/wildcards-in-robots-txt/ and it seems that I've been able to remove some of the URL's (although I can't confirm yet until I do a full pull of the SERPs on the domain). However, when I test certain URL's inside of WMT it still says that they are allowed which makes me think that it's not working fully or working at all.
In this case these are the lines I've added to the robots.txt
Disallow: /*&viagra
Disallow: /*&Viagra
I know I have the solution of individually requesting URL's to be removed from the index but I want to see if anybody has every had success with wildcarding URL's with a certain word in their robots.txt? The individual URL route could be very tedious.
Thanks!
Jon
-
Hey Paul,
Great answer, for some reason it totally slipped my mind that robots.txt is a crawling directive and not an index one. Yes the pages return a 404 on the headers. I've grabbed a copy of the complete SERPS and will now manually disallow them.
Thanks!
Jon
-
Thank for the endorsement, Christy! Funny, I only just now saw Rand's recent WBF related to this topic, but pleased to see my answer lines up exactly with his info.
P.
-
You need to be aware, Jonathan, that there is absolutely nothing about a robots.txt disallow that will help remove a URL from the search engine indexes. Robots is a crawling directive, NOT an indexing directive. In fact, in most cases, blocking URLs in robots.txt will actually cause them to remain in the index even longer.
I'm assuming you have cleaned up the site so the actual spam URLs no longer resolve. Those URLs should now result in a 404 error page. You must confirm they are actually returning the correct 404 code in the headers. As long as this is the case, it is a matter of waiting while the search engines crawl the spam URLs often enough to recognise they are really gone and remove them from the index. The problem with adding them to the robots.txt is that is actually telling the search engines NOT to crawl them, so they are unlikely to discover that they lead to 404s, hence they may remain in the index even longer.
Unfortunately you can't use a no-index tag on the offending pages, because the pages should no longer exist on the site. I don't think even a careful implementation of a X-Robots noindex directive in htaccess would work, because the URLs should be resulting in a 404.
Make certain the problem URLs return a clean 404, use the Google Search Console Remove URLs tool for as many of them as you can (for example you can request removal for entire directories, if the spam happened to be built that way), and then be patient for the rest. But do NOT block them in robots.txt - you'll just prolong the agony and waste your time.
Hope that all makes sense?
Paul
-
Hi Jon,
Why not just: Disallow: /viagra
-
Jon,
I have never done it with a robots.txt, one easy why that I think you could do it would be on the page level. You could add a noindex nofollow to the page itself.
You can generate it automatically too and have it fired depending on the url by using a substring search on the url as well. That will get them all for sure.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Mass URL changes and redirecting those old URLS to the new. What is SEO Risk and best practices?
Hello good people of the MOZ community, I am looking to do a mass edit of URLS on content pages within our sites. The way these were initially setup was to be unique by having the date in the URL which was a few years ago and can make evergreen content now seem dated. The new URLS would follow a better folder path style naming convention and would be way better URLS overall. Some examples of the **old **URLS would be https://www.inlineskates.com/Buying-Guide-for-Inline-Skates/buying-guide-9-17-2012,default,pg.html
Intermediate & Advanced SEO | | kirin44355
https://www.inlineskates.com/Buying-Guide-for-Kids-Inline-Skates/buying-guide-11-13-2012,default,pg.html
https://www.inlineskates.com/Buying-Guide-for-Inline-Hockey-Skates/buying-guide-9-3-2012,default,pg.html
https://www.inlineskates.com/Buying-Guide-for-Aggressive-Skates/buying-guide-7-19-2012,default,pg.html The new URLS would look like this which would be a great improvement https://www.inlineskates.com/Learn/Buying-Guide-for-Inline-Skates,default,pg.html
https://www.inlineskates.com/Learn/Buying-Guide-for-Kids-Inline-Skates,default,pg.html
https://www.inlineskates.com/Learn/Buying-Guide-for-Inline-Hockey-Skates,default,pg.html
https://www.inlineskates.com/Learn/Buying-Guide-for-Aggressive-Skates,default,pg.html My worry is that we do rank fairly well organically for some of the content and don't want to anger the google machine. The way I would be doing the process would be to edit the URLS to the new layout, then do the redirect for them and push live. Is there a great SEO risk to doing this?
Is there a way to do a mass "Fetch as googlebot" to reindex these if I do say 50 a day? I only see the ability to do 1 URL at a time in the webmaster backend.
Is there anything else I am missing? I believe this change would overall be good in the long run but do not want to take a huge hit initially by doing something incorrectly. This would be done on 5- to a couple hundred links across various sites I manage. Thanks in advance,
Chris Gorski0 -
Link juice through URL parameters
Hi guys, hope you had a fantastic bank holiday weekend. Quick question re URL parameters, I understand that links which pass through an affiliate URL parameter aren't taken into consideration when passing link juice through one site to another. However, when a link contains a tracking URL parameter (let's say gclid=), does link juice get passed through? We have a number of external links pointing to our main site, however, they are linking directly to a unique tracking parameter. I'm just curious to know about this. Thanks, Brett
Intermediate & Advanced SEO | | Brett-S0 -
Large robots.txt file
We're looking at potentially creating a robots.txt with 1450 lines in it. This will remove 100k+ pages from the crawl that are all old pages (I know, the ideal would be to delete/noindex but not viable unfortunately) Now the issue i'm thinking is that a large robots.txt will either stop the robots.txt from being followed or will slow our crawl rate down. Does anybody have any experience with a robots.txt of that size?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Duplicate content on URL trailing slash
Hello, Some time ago, we accidentally made changes to our site which modified the way urls in links are generated. At once, trailing slashes were added to many urls (only in links). Links that used to send to
Intermediate & Advanced SEO | | yacpro13
example.com/webpage.html Were now linking to
example.com/webpage.html/ Urls in the xml sitemap remained unchanged (no trailing slash). We started noticing duplicate content (because our site renders the same page with or without the trailing shash). We corrected the problematic php url function so that now, all links on the site link to a url without trailing slash. However, Google had time to index these pages. Is implementing 301 redirects required in this case?1 -
Should I include URLs that are 301'd or only include 200 status URLs in my sitemap.xml?
I'm not sure if I should be including old URLs (content) that are being redirected (301) to new URLs (content) in my sitemap.xml. Does anyone know if it is best to include or leave out 301ed URLs in a xml sitemap?
Intermediate & Advanced SEO | | Jonathan.Smith0 -
Removing dashes in our URLs?
Hi Forum, Our site has an errant product review module that is resulting in about 9-10 404 errors per day on Google Webmaster Tools. We've found that by changing our product page URLs to only include 2 dashes, the module stops causing 404 errors for that page. Does changing our URL from "oursite.com/girls-pink-yoga-capri.html" to "oursite.com/girlspink-yoga-capri.html" hurt our SEO for a search for "girls pink yoga capri"? If so, by how much (assuming everthing else on the page is optimized properly) Thanks for your input.
Intermediate & Advanced SEO | | pano0 -
Is it safe to redirect multiple URLs to a single URL?
Hi, I have an old Wordress website with about 300-400 original pages of content on it. All relating to my company's industry: travel in Africa. It's a legitimate site with travel stories, photos, advice etc. Nothing spammy about. No adverts on it. No affiliates. The site hasn't been updated for a couple of years and we no longer have a need for it. Many of the stories on it are quite out of date. The site has built up a modest Mozrank value over the last 5 years, and has a few hundreds organically achieved inbound links. Recently I set up a swanky new branded website on ExpressionEngine on a new domain. My intention is to: Shut down the old site Focus all attention on building up content on the new website Ask the people linking to the old site to my new site instead (I wonder how many will actually do so...) Where possible, setup a 301 redirect from pages on the old site to their closest match on the new site Setup a 301 redirect from the old site's home page to new site's homepage Sounds good, right? But there is one issue I need some advice on... The old site has about 100 pages that do not have a good match on the new site. These pages are outdated or inferior quality, so it doesn't really make sense to rewrite them and put them on the new site. I call these my "black sheep pages". So... for these "black sheep pages" should I (A) redirect the urls to the new site's homepage (B) redirect the urls the old site's home page (which in turn, redirects to the new site's homepage, or (C) not redirect the urls, and let them die a lonely 404 death? OPTION A: oldsite.com/page1.php -> newsite.com
Intermediate & Advanced SEO | | AndreVanKets
oldsite.com/page2.php -> newsite.com
oldsite.com/page3.php -> newsite.com
oldsite.com/page4.php -> newsite.com
oldsite.com/page5.php -> newsite.com
oldsite.com -> newsite.com OPTION B: oldsite.com/page1.php -> oldsite.com
oldsite.com/page2.php -> oldsite.com
oldsite.com/page3.php -> oldsite.com
oldsite.com/page4.php -> oldsite.com
oldsite.com/page5.php -> oldsite.com
oldsite.com -> newsite.com OPTION 😄 oldsite.com/page1.php : do not redirect, let page 404 and disappear forever
oldsite.com/page2.php : do not redirect, let page 404 and disappear forever
oldsite.com/page3.php : do not redirect, let page 404 and disappear forever
oldsite.com/page4.php : do not redirect, let page 404 and disappear forever
oldsite.com/page5.php : do not redirect, let page 404 and disappear forever
oldsite.com -> newsite.com My intuition tells me that Option A would pass the most "link juice" to my new site, but I am concerned that it could also be seen by Google as a spammy redirect technique. What would you do? Help 😐1 -
Block an entire subdomain with robots.txt?
Is it possible to block an entire subdomain with robots.txt? I write for a blog that has their root domain as well as a subdomain pointing to the exact same IP. Getting rid of the option is not an option so I'd like to explore other options to avoid duplicate content. Any ideas?
Intermediate & Advanced SEO | | kylesuss12