Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Wildcarding Robots.txt for Particular Word in URL
-
Hey All,
So I know that this isn't a standard robots.txt, I'm aware of how to block or wildcard certain folders but I'm wondering whether it's possible to block all URL's with a certain word in it?
We have a client that was hacked a year ago and now they want us to help remove some of the pages that were being autogenerated with the word "viagra" in it. I saw this article and tried implementing it https://builtvisible.com/wildcards-in-robots-txt/ and it seems that I've been able to remove some of the URL's (although I can't confirm yet until I do a full pull of the SERPs on the domain). However, when I test certain URL's inside of WMT it still says that they are allowed which makes me think that it's not working fully or working at all.
In this case these are the lines I've added to the robots.txt
Disallow: /*&viagra
Disallow: /*&Viagra
I know I have the solution of individually requesting URL's to be removed from the index but I want to see if anybody has every had success with wildcarding URL's with a certain word in their robots.txt? The individual URL route could be very tedious.
Thanks!
Jon
-
Hey Paul,
Great answer, for some reason it totally slipped my mind that robots.txt is a crawling directive and not an index one. Yes the pages return a 404 on the headers. I've grabbed a copy of the complete SERPS and will now manually disallow them.
Thanks!
Jon
-
Thank for the endorsement, Christy! Funny, I only just now saw Rand's recent WBF related to this topic, but pleased to see my answer lines up exactly with his info.
P.
-
You need to be aware, Jonathan, that there is absolutely nothing about a robots.txt disallow that will help remove a URL from the search engine indexes. Robots is a crawling directive, NOT an indexing directive. In fact, in most cases, blocking URLs in robots.txt will actually cause them to remain in the index even longer.
I'm assuming you have cleaned up the site so the actual spam URLs no longer resolve. Those URLs should now result in a 404 error page. You must confirm they are actually returning the correct 404 code in the headers. As long as this is the case, it is a matter of waiting while the search engines crawl the spam URLs often enough to recognise they are really gone and remove them from the index. The problem with adding them to the robots.txt is that is actually telling the search engines NOT to crawl them, so they are unlikely to discover that they lead to 404s, hence they may remain in the index even longer.
Unfortunately you can't use a no-index tag on the offending pages, because the pages should no longer exist on the site. I don't think even a careful implementation of a X-Robots noindex directive in htaccess would work, because the URLs should be resulting in a 404.
Make certain the problem URLs return a clean 404, use the Google Search Console Remove URLs tool for as many of them as you can (for example you can request removal for entire directories, if the spam happened to be built that way), and then be patient for the rest. But do NOT block them in robots.txt - you'll just prolong the agony and waste your time.
Hope that all makes sense?
Paul
-
Hi Jon,
Why not just: Disallow: /viagra
-
Jon,
I have never done it with a robots.txt, one easy why that I think you could do it would be on the page level. You could add a noindex nofollow to the page itself.
You can generate it automatically too and have it fired depending on the url by using a substring search on the url as well. That will get them all for sure.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
All URLs in the site is 302 redirected to itself
Hi everyone, I have a problem with a website wherein all URLs (homepage, inner pages) are 302 redirected. This is based on Screaming Frog crawl. But the weird thing is that they are 302 redirected to themselves which doesn't make any sense. Example:
Intermediate & Advanced SEO | | alex_goldman
https://www.example.com.au/ is 302 redirected to https://www.example.com.au/ https://www.example.com.au/shop is 302 redirected to https://www.example.com.au/shop https://www.example.com.au/shop/dresses is 302 redirected to https://www.example.com.au/shop/dresses Have you encountered this issue? What did you do to fix it? Would be very glad to hear your responses. Cheers!0 -
Duplicate URLs ending with #!
Hi guys, Does anyone know why a site can contain duplicate URLs ending with hastag & exclamation mark e.g. https://site.com.au/#! We are finding a lot of these URLs (as duplicates) and i was wondering what they are from developer standpoint? And do you think it's worth the time and effort adding a rel canonical tag or 301 to these URLs eventhough they're not getting indexed by Google? Cheers, Chris
Intermediate & Advanced SEO | | jayoliverwright0 -
Dilemma about "images" folder in robots.txt
Hi, Hope you're doing well. I am sure, you guys must be aware that Google has updated their webmaster technical guidelines saying that users should allow access to their css files and java-scripts file if it's possible. Used to be that Google would render the web pages only text based. Now it claims that it can read the css and java-scripts. According to their own terms, not allowing access to the css files can result in sub-optimal rankings. "Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings."http://googlewebmastercentral.blogspot.com/2014/10/updating-our-technical-webmaster.htmlWe have allowed access to our CSS files. and Google bot, is seeing our webapges more like a normal user would do. (tested it in GWT)Anyhow, this is my dilemma. I am sure lot of other users might be facing the same situation. Like any other e commerce companies/websites.. we have lot of images. Used to be that our css files were inside our images folder, so I have allowed access to that. Here's the robots.txt --> http://www.modbargains.com/robots.txtRight now we are blocking images folder, as it is very huge, very heavy, and some of the images are very high res. The reason we are blocking that is because we feel that Google bot might spend almost all of its time trying to crawl that "images" folder only, that it might not have enough time to crawl other important pages. Not to mention, a very heavy server load on Google's and ours. we do have good high quality original pictures. We feel that we are losing potential rankings since we are blocking images. I was thinking to allow ONLY google-image bot, access to it. But I still feel that google might spend lot of time doing that. **I was wondering if Google makes a decision saying, hey let me spend 10 minutes for google image bot, and let me spend 20 minutes for google-mobile bot etc.. or something like that.. , or does it have separate "time spending" allocations for all of it's bot types. I want to unblock the images folder, for now only the google image bot, but at the same time, I fear that it might drastically hamper indexing of our important pages, as I mentioned before, because of having tons & tons of images, and Google spending enough time already just to crawl that folder.**Any advice? recommendations? suggestions? technical guidance? Plan of action? Pretty sure I answered my own question, but I need a confirmation from an Expert, if I am right, saying that allow only Google image access to my images folder. Sincerely,Shaleen Shah
Intermediate & Advanced SEO | | Modbargains1 -
Robots.txt: how to exclude sub-directories correctly?
Hello here, I am trying to figure out the correct way to tell SEs to crawls this: http://www.mysite.com/directory/ But not this: http://www.mysite.com/directory/sub-directory/ or this: http://www.mysite.com/directory/sub-directory2/sub-directory/... But with the fact I have thousands of sub-directories with almost infinite combinations, I can't put the following definitions in a manageable way: disallow: /directory/sub-directory/ disallow: /directory/sub-directory2/ disallow: /directory/sub-directory/sub-directory/ disallow: /directory/sub-directory2/subdirectory/ etc... I would end up having thousands of definitions to disallow all the possible sub-directory combinations. So, is the following way a correct, better and shorter way to define what I want above: allow: /directory/$ disallow: /directory/* Would the above work? Any thoughts are very welcome! Thank you in advance. Best, Fab.
Intermediate & Advanced SEO | | fablau1 -
Urls missing from product_cat sitemap
I'm using Yoast SEO plugin to generate XML sitemaps on my e-commerce site (woocommerce). I recently changed the category structure and now only 25 of about 75 product categories are included. Is there a way to manually include urls or what is the best way to have them all indexed in the sitemap?
Intermediate & Advanced SEO | | kisen0 -
Where to put a page ID in a URL?
Hello, My company is going to change URLs to example.com/category or example.com/product. When we will change the URLs to product or category pages somehow we have to check whether the requested page is from category table in DB or from products table (this gives much speed to page load time). So we have to choose how to make the different product and category pages.
Intermediate & Advanced SEO | | komeksimas
Programmers said that we need to insert id to URL. So the question is: Which is the better way to place an id to an URL? example.com/product-name?id=111 example.com/product-name/111 example.com/product_name-111 Or maybe we should use some other punctuation mark to separate id from product name? p.s. I have read Dynamic URLs vs. static URLs by Google and it still didn't answered which is the best for all of the pages. Somehow others solve this problem by typing only the names to the URL, but could anyone tell what that technology should be?0 -
Soft 404's from pages blocked by robots.txt -- cause for concern?
We're seeing soft 404 errors appear in our google webmaster tools section on pages that are blocked by robots.txt (our search result pages). Should we be concerned? Is there anything we can do about this?
Intermediate & Advanced SEO | | nicole.healthline4 -
Could you use a robots.txt file to disalow a duplicate content page from being crawled?
A website has duplicate content pages to make it easier for users to find the information from a couple spots in the site navigation. Site owner would like to keep it this way without hurting SEO. I've thought of using the robots.txt file to disallow search engines from crawling one of the pages. Would you think this is a workable/acceptable solution?
Intermediate & Advanced SEO | | gregelwell0