Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Recovering from robots.txt error
-
Hello,
A client of mine is going through a bit of a crisis. A developer (at their end) added Disallow: / to the robots.txt file. Luckily the SEOMoz crawl ran a couple of days after this happened and alerted me to the error. The robots.txt file was quickly updated but the client has found the vast majority of their rankings have gone.
It took a further 5 days for GWMT to register that the robots.txt file had been updated, and since then we have used "Fetch as Google" and submitted the "URL and linked pages" in GWMT.
In GWMT the "Blocked URLs" section still shows the vast majority of pages as blocked, although the robots.txt file displayed below it is now OK.
I guess what I want to ask is:
- What else is there that we can do to recover these rankings quickly?
- What time scales can we expect for recovery?
- More importantly has anyone had any experience with this sort of situation and is full recovery normal?
Thanks in advance!
-
Great info Rikki
That's good news!
-
Hi Antonio,
I would take a look at your entire site using one of my very favorite tools. It will crawl your site and tell you if you have nofollows or other issues that would cause Googlebot to have trouble indexing your site.
Simply put your site's URL in the box on the tool's page, which you can find at this link:
http://www.feedthebot.com/tools/spider/
Then use the second tool, which displays the number of links (internal, external, nofollow, image, etc.) found on a webpage:
http://www.feedthebot.com/tools/linkcount/
With these two tools you should be able to see whether a nofollow inside a page is creating a real problem.
Check as much of your site as you possibly can this way, as it will show you a lot of information that is very relevant to whether your site can be crawled correctly or not.
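For anyone who wants to run the same kind of link count locally, here is a minimal Python sketch. It is not how the feedthebot tool works internally (that is unknown); it only assumes that "internal" means same host as the page, and the `LinkCounter` and `count_links` names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCounter(HTMLParser):
    """Tally <a href> links as internal/external and count rel="nofollow"."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.counts = {"internal": 0, "external": 0, "nofollow": 0}

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        # Resolve relative links against the page URL before comparing hosts.
        target_host = urlparse(urljoin(self.page_url, href)).netloc
        kind = "internal" if target_host == self.host else "external"
        self.counts[kind] += 1
        # rel can hold several space-separated tokens, e.g. "nofollow noopener".
        if "nofollow" in (attrs.get("rel") or "").lower().split():
            self.counts["nofollow"] += 1


def count_links(html, page_url):
    """Return a dict with internal, external and nofollow link counts."""
    parser = LinkCounter(page_url)
    parser.feed(html)
    return parser.counts


# Hypothetical sample page used only to exercise the counter.
SAMPLE_HTML = """
<a href="/about">About</a>
<a href="http://www.example.com/contact" rel="nofollow">Contact</a>
<a href="http://other.example.org/" rel="nofollow noopener">Partner</a>
"""

print(count_links(SAMPLE_HTML, "http://www.example.com/"))
```

A page where the nofollow count is close to the internal count is worth a closer look, since that usually points to a site-wide setting, not a one-off link.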
This third tool will show you if your robots.txt file is still blocking all or part of your website. It is built to make robots.txt files, but the nice thing is that if you simply put your URL in the top and hit the upload button, it will pull in your existing robots.txt file. This is very helpful when comparing changes that have been made or that you wish to make:
http://www.internetmarketingninjas.com/seo-tools/robots-txt-generator/
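If you keep copies of the old and new robots.txt, the comparison can also be done locally. The sketch below uses Python's standard `difflib`; the two file contents are assumed examples based on this thread (a blanket block, then the fix), and `robots_diff` is a hypothetical helper name.

```python
import difflib

# Assumed "before" and "after" contents, modelled on the mistake in this thread.
OLD_ROBOTS = """User-agent: *
Disallow: /
"""

NEW_ROBOTS = """User-agent: *
Disallow:
"""


def robots_diff(old_text, new_text):
    """Return a unified diff of two robots.txt versions as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile="robots.txt (before)",
            tofile="robots.txt (after)",
            lineterm="",
        )
    )


for line in robots_diff(OLD_ROBOTS, NEW_ROBOTS):
    print(line)
```

Lines starting with `-` were removed and lines starting with `+` were added, which makes a one-character change like a stray `/` easy to spot.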
To check your robots.txt file against anything else that could be blocking your site, I think these will help:
http://moz.com/blog/interactive-guide-to-robots-txt
http://moz.com/learn/seo/robotstxt
http://tools.seobook.com/robots-txt/
http://yoast.com/x-robots-tag-play/
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?hl=de
http://www.searchenginejournal.com/x-robots-tag-simple-alternate-robots-txt-meta-tag/67138/
One point I hope will help you is the not-too-noticeable difference between allowing everything and disallowing everything: simply having a / after Disallow: tells Google not to crawl any of your site, which is how you end up dropping out of their search engine results.
Simply put, the information is below. Websites are set up by default with
Allow: /
Example Robots.txt Format
Allow indexing of everything
User-agent: *
Disallow:
or
User-agent: *
Allow: /
Disallow indexing of everything
User-agent: *
Disallow: /
Disallow indexing of a specific folder
User-agent: *
Disallow: /folder/
Please remember there are multiple ways to block a website. For instance, PHP-based websites are extremely popular, and if you're using WordPress or one of the many other PHP platforms, a single header line like this will also keep pages out of the index:
header("X-Robots-Tag: noindex", true);
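To see how differently a parser treats the robots.txt variants above, here is a minimal Python sketch using the standard library's `urllib.robotparser`. Python's parser is not Googlebot, so treat the result as an approximation; `can_crawl` and the example.com URLs are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt, url, user_agent="Googlebot"):
    """Parse robots.txt text and report whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The variants from the examples above, as plain strings.
ALLOW_ALL = "User-agent: *\nDisallow:\n"
ALLOW_ALL_ALT = "User-agent: *\nAllow: /\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_FOLDER = "User-agent: *\nDisallow: /folder/\n"

# Hypothetical URLs to test against.
PAGE = "http://www.example.com/page.html"
FOLDER_PAGE = "http://www.example.com/folder/page.html"

for label, rules in [
    ("allow all", ALLOW_ALL),
    ("allow all (Allow: /)", ALLOW_ALL_ALT),
    ("block all", BLOCK_ALL),
    ("block folder", BLOCK_FOLDER),
]:
    print(label, can_crawl(rules, PAGE), can_crawl(rules, FOLDER_PAGE))
```

Only the "block all" variant refuses both URLs, and it differs from the first "allow all" variant by a single `/`, which is exactly the mistake discussed in this thread.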
I also want to remind you what Tom Roberts said in the first response about using Twitter. I have quoted him here, but you can read it in full at the top of the page, below the original question:
The most frequently crawled domain on the web is Twitter. If you could legitimately get your key URLs tweeted, either by yourselves or others, this may encourage the Google crawler to revisit the URLs, and consequently re index them. There won't be any harm SEO wise in sending tweets with your URLs, it's a quick and free method and so may be worth giving it a shot
Hope This Helps,
Thomas
-
Hi Antonio,
Sorry to hear you have had the same problem. Due to the nature of our client's business, this error by the developer cost them a lot of lost revenue.
In answer to your questions:
-
It took 19 days in total to recover
-
We took everyone's advice and implemented all of it, so I am unsure what actually helped. I think working with GWMT is the best thing for it. Make sure you submit for a re-crawl as soon as possible and see what is still blocked.
I know how scary the situation is, but things will go back to normal. It's just a matter of playing the waiting game really. Sorry I couldn't be of more help.
Rikki
-
-
Hi Rikki,
I know it's been some time since your post, however I just found it because a couple of weeks ago my developer did exactly the same.
It's been 2 weeks now and our traffic is still only a quarter of what it used to be. My questions are:
1/ How long did it finally take you to completely recover your previous traffic levels (if you did)?
2/ Did you apply any of the advice from the other bloggers? What would you recommend doing, from your experience?
Thanks in advance. I am really worried at this moment, since we've got a peak campaign coming on very soon.
Regards,
Antonio (Citricamente)
-
Hi Rikki,
I really want to say great job with those numbers, though. It's always good to see somebody pulling positive ROI. Good work! If I may ask, what type of development do you specialize in, if you have a specialty?
My reason for asking is there are some excellent hosts that will allow you to run a staging server that changes everything like robots.txt back to follow and index when you hit the production button. Other hosts have similar methods.
In fact, that might be an idea that's worth a little bit of money: a nice WordPress plug-in that gives you a constant reminder that you're in the development phase, does the swap, then deletes itself?
Or use a managed WordPress host if it's WordPress.
You can do so many cool things with Git these days.
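One way to get that kind of safety net without a special host is a small check that runs before each deploy, for example from a Git hook or a build step. The sketch below is only an assumption of how such a guard might look: the `CRAWLERS` list, the function names, the example.com root and the `/dev/` path are all hypothetical, and it relies on Python's standard `urllib.robotparser`, which does not behave identically to Googlebot.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical list of crawlers to check; adjust to the bots that matter to you.
CRAWLERS = ("Googlebot", "Bingbot", "*")


def blocked_crawlers(robots_txt, site_root="http://www.example.com/"):
    """Return the crawlers that this robots.txt bars from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in CRAWLERS if not parser.can_fetch(bot, site_root)]


def check_before_deploy(robots_txt):
    """Raise ValueError if a production robots.txt would block the home page."""
    blocked = blocked_crawlers(robots_txt)
    if blocked:
        raise ValueError("robots.txt blocks the site root for: " + ", ".join(blocked))


# Assumed file contents: a staging file that blocks everything and a
# production file that only blocks a hypothetical /dev/ folder.
STAGING_ROBOTS = "User-agent: *\nDisallow: /\n"
PRODUCTION_ROBOTS = "User-agent: *\nDisallow: /dev/\n"

check_before_deploy(PRODUCTION_ROBOTS)  # passes silently
try:
    check_before_deploy(STAGING_ROBOTS)
except ValueError as error:
    print(error)
```

The design choice here is to fail loudly: a deploy that stops with an error is much cheaper than finding out from a rankings drop a few days later.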
I am extremely happy you have found out there's nothing to worry about. If it is simply the tags, you will have your rank back before you know it. You can also put Webmaster Tools on the manual setting and turn it up to max. I have done it on test sites, and the site was indexed just as well. I would simply make sure I had a reminder telling me to return it to normal afterwards.
You should set the rel="canonical" as well.
Glad I was able to help,
Thomas
-
Hi guys,
Thanks very much for the responses. I guess my gut feeling was right that everything would come back to normal but just needed some reassurance.
I have made real progress with this client, going from online-generated revenue of £15k per month at the start of the year to £105k last month, but it is all phone based, so at the moment his call centre is like a ghost town. It's a shame that this can happen when a developer is trying to block his own dev subdomain and ends up blocking the whole thing. I just hope it doesn't take too long.
We will certainly try the social media route to see if that speeds things along.
-
Please look and see that I have updated my response. I copied it from my dictation software's writing pad and only copied part of it when I meant to copy all of it.
Please read it and let me know if I can be of help.
sincerely,
Thomas
-
Please forgive my first comment. I hit the button too early. I use dictation software, so I save to one page and then paste to another, and I am sincerely sorry I posted that part without the entire thing.
Send me the domain, either privately if you can or through this chat, and I would be more than happy to look into it for you. I can tell you I have made the nofollow/noindex mistake myself while showing an intern something on our own site, and I talk about it below.
However, if you are still having problems you may want to download
Screaming Frog SEO Spider
It will only check 500 links for free, but it gives you invaluable insight.
It is a download and works on Mac, Windows and Linux:
http://www.screamingfrog.co.uk/seo-spider/
If you want to try something web-based:
http://www.internetmarketingninjas.com/tools/
http://www.internetmarketingninjas.com/broken-links-tool/
http://www.internetmarketingninjas.com/seo-tools/robots-txt-generator/
http://www.internetmarketingninjas.com/seo-tools/google-sitemap-generator/
I would also not hesitate to use their DNS tool to check that everything there is okay.
Another tool, or set of tools, that I would strongly recommend and that you can access for free comes from the excellent Internet Marketing Ninjas:
The words used in the metadata tags, in body text and in anchor text in external and internal links all play important roles in on page search engine optimization (SEO). The On-Page Optimization Analysis Free SEO Tool lets you quickly see the important SEO content on your webpage URL the same way a search engine spider views your data. This free SEO onpage optimization tool is multiple onpage SEO tools in one, helpful for reviewing the following onpage optimization information in the source code on the page:
- Metadata tool: Displays text in title tags and meta elements
- Keyword density tool: Reveals onpage SEO keyword statistics for linked and unlinked content
- Keyword optimization tool: Analyzes on page optimization by showing the number of words used in the content, including anchor text of internal and external links
- Link Accounting tool: Displays the number and types of links used
- Header check tool: Shows HTTP Status Response codes for links
- Source code tool: Provides quick access to on-page HTML source code
If you are talking about just the noindex and nofollow settings, I can now happily say I have done the exact same thing.
I was showing somebody how to use the WordPress SEO plug-in when I got distracted and simply did not change the settings back to follow and index. Approximately 2 to 3 days later I noticed a huge loss in ranking for the company brand name.
(Luckily this was my site, not a client's.)
I changed the settings back to the normal follow and index, then submitted my entire website through Google's Webmaster Tools, even clicking yes when asked whether to index all of it. It took approximately two days after that.
Before I knew it, all the rankings had returned to normal. Literally, the keywords I was tracking returned to within the normal fluctuation I see: in many cases sometimes better and sometimes a little bit worse. What I had feared was that they would never come back at all.
Believe me when I say I was extremely thankful for this, and I don't see why you will not get the same results with your site.
I hope this is as simple a mistake as mine, just that one problem; that's the only thing I can give you a testimony of. I would say you have nothing to worry about. But remember to tell Google Webmaster Tools. I also told Bing, but that's up to you.
Sincerely,
Thomas
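A quick way to confirm that a noindex or nofollow setting has really been switched back is to look for the directives in the page itself and in the response headers. The following Python sketch is a simplified assumption of such a check: `RobotsMetaParser`, `indexing_blockers` and the sample pages are hypothetical, and it ignores bot-specific forms such as `<meta name="googlebot">` or an `X-Robots-Tag` value prefixed with a user agent.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def indexing_blockers(html, headers=None):
    """Return noindex/nofollow directives found in the page or its headers."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = set(parser.directives)
    # X-Robots-Tag carries the same directives as an HTTP response header.
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            found.update(
                part.strip().lower() for part in value.split(",") if part.strip()
            )
    return sorted(found & {"noindex", "nofollow"})


# Hypothetical pages: one with the mistaken setting, one after the fix.
BLOCKED_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
FIXED_PAGE = '<html><head><meta name="robots" content="index, follow"></head></html>'

print(indexing_blockers(BLOCKED_PAGE))
print(indexing_blockers(FIXED_PAGE))
print(indexing_blockers(FIXED_PAGE, {"X-Robots-Tag": "noindex"}))
```

The last call shows why the header matters: a page whose HTML looks fine can still be kept out of the index by a header set in PHP or in the server configuration.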
-
It should be as quick as Google re-crawls the robots.txt.
The best thing you can do is add a couple of links on sites that are crawled daily, to encourage Google to visit your client's site as soon as possible.
Could be:
- newspaper sites - comments
- and the like
-
Hey there
I've seen this before and in almost all cases the rankings were returned to their previous state, give or take maybe 1 or 2 places (which would be normal SERP flux).
Unfortunately, I've found that this can often take weeks and there's no real sure-fire way of getting Google to update it quicker. Theoretically, to speed things up you want to get the crawler revisiting the URLs more and more often. Fresh backlinks would do this, but obviously you can't game that sort of thing for web spam reasons. You could also try pinging devices, such as GooglePing, but I'm not convinced by their effectiveness.
The most frequently crawled domain on the web is Twitter. If you could legitimately get your key URLs tweeted, either by yourselves or others, this may encourage the Google crawler to revisit the URLs, and consequently reindex them. There won't be any harm SEO wise in sending tweets with your URLs, it's a quick and free method and so may be worth giving it a shot.
Hope this helps you - I've often found you can't control these things but hopefully some of these theories might work. In the long-run, however, the rankings will return and so for normal SEO purposes, create content and links as per usual.