Spider Indexed Disallowed URLs
-
Hi there,
In order to reduce the huge amount of duplicate content and titles for a client, in August we disallowed all spiders from some areas of the site via the robots.txt file. This was followed by a huge decrease in errors in our SEOmoz crawl report, which, of course, satisfied us.
In the meantime, we haven't changed anything in the back end, robots.txt file, FTP, or website. But our crawl report came in this November and all of a sudden all the errors were back. We've checked the errors and noticed URLs that are definitely disallowed. That these URLs are disallowed is also verified by Google Webmaster Tools and other robots.txt checkers, and when we search for a disallowed URL in Google, it says the page is blocked for spiders. Where did these errors come from? Was it the SEOmoz spider that ignored our disallow rules, or something else? You can see the drop and the increase in errors in the attached image.
Thanks in advance.
-
This was what I was looking for! The pages are indexed by Google, yes, but they aren't being crawled by Googlebot (as Webmaster Tools and the Matt Cutts video tell me); they are probably just occasionally crawled by Rogerbot (not monthly). Thank you very much!
-
Yes, canonicalization or a meta noindex tag would of course be better to pass the possible link juice, but we aren't worried about that. I was worried Google would still see the pages as duplicates (I couldn't really distill that from the article, although it was useful!). Barry Smith answered that last issue in the answer below, but I do want to thank you for your insight.
-
The directives issued in a robots.txt file are just a suggestion to bots, though one that Google does follow.
Malicious bots will ignore them, and occasionally even bots that follow the directives may mess up (probably what happened here).
Google may also index pages that you've blocked if it has found them via a link, as explained here - http://www.youtube.com/watch?v=KBdEwpRQRD0 - or for an overview of what Google does with robots.txt files you can read here - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
I'd suggest you look at other ways of fixing the problem than just blocking 1,500 pages, but I see you've considered what would be required to fix the issues without removing the pages from the crawl and decided the value isn't there.
If WMT is telling you the pages are blocked from being crawled, I'd believe that.
Try searching Google for a URL that should be blocked and see if it's indexed, or do site:http://yoursitehere.com and see if blocked pages come up.
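If you want to double-check how a standards-following bot should interpret your rules, Python's built-in robots.txt parser is a quick way to test (a minimal sketch; the rules and URLs below are placeholders, not your actual robots.txt):

```python
from urllib.robotparser import RobotFileParser

# Parse a placeholder robots.txt that blocks all user agents
# from /private/, mirroring a "User-agent: *" block.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Because the rules target "*", any compliant bot -- Googlebot,
# rogerbot, or otherwise -- should get the same answer.
print(parser.can_fetch("Googlebot", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("rogerbot", "http://example.com/index.html"))          # True
```

If this parser says a URL is blocked but a crawler still reports it, the crawler either isn't honoring the directives or picked the URL up some other way.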
-
The assumptions about what to expect from robots.txt may not be in line with reality. Crawling a page isn't the same thing as indexing its content to appear in the SERPs, and even with robots.txt in place, your pages can be crawled.
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
-
Thanks, Mr. Goyal. Of course we have been thinking about ways to do so and figured out some options, but implementing these solutions would be disastrous from a time/financial perspective. The pages we have blocked from the spiders aren't needed for visibility in the search engines and don't carry much link juice; they are only there for the visitors, so we decided we don't really need them for our SEO efforts. But if these pages do get crawled and the engines notice the huge amount of duplicates, I reckon this would negatively influence our site as a whole.
So the problem we have is focused on our doubts about the legitimacy of the report. If SEOmoz can crawl it, Googlebot probably could too, right? After all, we've used: User-agent: *
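For reference, the kind of block in question looks like this (the paths here are placeholders for the disallowed areas):

```text
User-agent: *
Disallow: /blocked-area/
Disallow: /duplicate-pages/
```

Because the User-agent line is a wildcard, Googlebot and Rogerbot are covered by the same Disallow rules, so a compliant crawler of either kind should skip those paths.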
-
Mark
Are you blocking all bots from spidering these erroneous URLs? Is there a way for you to fix them so that either they don't exist or they are no longer duplicates?
I'd recommend looking at it from that perspective as well, not just with the intent of making those errors disappear from the SEOmoz report.
I hope this helps.
Related Questions
-
Should I disallow crawl of my Job board?
The Moz crawler is telling me we have loads of duplicate content issues. We use a job board plugin on our WordPress site and we have a lot of duplicate or very similar jobs (usually just a different location), but the plugin doesn't allow us to add rel canonical tags to the individual jobs. Should I disallow the /jobs/ URL in the robots.txt file? This would solve the duplicate content issue, but then Google won't be able to crawl any of the individual job listings. Has anyone had experience working with a job board plugin on WordPress and had a similar issue, or can anyone advise on how best to solve our duplicate content? Thanks 🙂
Technical SEO | O2C0
Why my website does not index?
I made some changes to my website, after which I tried FETCH AS GOOGLE in Webmaster Tools, but this is the 2nd day and my new pages are not indexed: www. astrologersktantrik .com
Technical SEO | ramansaab0
Why are my images not being indexed?
I have submitted an image sitemap with over 2,000 images yet only about 35 have been indexed. Could you please help me understand why Google is not indexing my images? www.creative-calendars.com
Technical SEO | nicole20140
Bing is not Indexing my site.
Hi, My website is four months old and has more than 8,000 pages. Bing has indexed only 8 pages to date, and Google also keeps playing hide and seek with it. There was a time when Google indexed almost all the pages of my site, but now only 5,000 pages are indexed. Moreover, when I check my site on Google (by typing site:socktail.com), it shows only 26 pages. Please let me know what I should do. If somebody wants to take a look, my website is http://socktail.com Thanks
Technical SEO | saurabh19050
Will rel canonical tags remove previously indexed URLs?
Hello, 7 days ago we implemented canonical tags to resolve duplicate content issues that had been caused by URL parameters. This "duplicate content" had already been indexed. Now that the URLs have rel canonical tags in place, will Google automatically remove the URLs with the URL parameters from its index? I ask because we have been tracking the approximate number of URLs indexed by doing a site: search in Google, and we have barely noticed a decrease in URLs indexed. Thanks.
Technical SEO | yacpro130
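For reference, the kind of canonical tag described points each parameterized URL at the clean version (the URLs here are placeholders):

```html
<!-- Placed in the <head> of e.g. https://www.example.com/product?sort=price -->
<link rel="canonical" href="https://www.example.com/product" />
```

Note that Google treats the canonical as a strong hint rather than a directive, and duplicates typically drop out of the index gradually as the parameterized URLs are recrawled, which would explain a slow decrease in a site: count.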
How to handle temporary campaign URLs
Hi, We have just run a yearly recurring commercial campaign for which we created optimized URLs (e.g. www.domain.tld/campaign, including the category and brand names after the campaign: www.domain.tld/campaign/womens). This has resulted in 4,500+ URLs with the campaign name being indexed in Google. Now the campaign is over, these URLs no longer exist. How should we handle them? 1.) 301 them to the correct category without the campaign name. 2.) Create a static page at www.domain.tld/campaign to which we 301 all URLs that contain the campaign name. Do you have any other suggestions on the best approach? This is a yearly commercial campaign, so in a year's time we will have the same URLs again. Thanks, Chris
Technical SEO | eCommerceSEO0
Do the engine spiders see this javascript?
http://www.drsfostersmith.com/general.cfm?gid=259 I'm looking at building something like these banners, and wondering whether the engines a) see and b) value links like these in the drop-down selector. I guess I could test it, but I'm wondering if anyone else has before I do.
Technical SEO | ATShock0
IP address URLs being indexed, 301 to domain?
I apologize if this question has been asked before; I couldn't find it in the Q&A. I noticed Google has been indexing our IP address for some pages (i.e., 123.123.123.123/page.html instead of domain.com/page.html). I suspect this is possibly due to a few straggler relative links instead of absolute ones, or possibly something else I'm not thinking of. My less invasive solution a few months back was to ensure canonical tags were on all pages and replace any relative links with absolute ones. This does not seem to be fixing the problem, though; as recently as today new pages were scooped up with the IP address. My next thought is to 301 redirect any IP address URL to the domain, but I was afraid that may be too drastic and that the canonical should be sufficient (which it doesn't seem to be). Has anyone dealt with this issue? Do you think the 301 would be a safe move? Any other suggestions? Thanks.
Technical SEO | KT6840
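One common way to implement such a 301 on an Apache server is a mod_rewrite rule in .htaccess (a sketch only; the IP and domain below are placeholders, and any rule like this should be tested on a staging copy first):

```apache
RewriteEngine On
# If the request arrived via the bare IP address...
RewriteCond %{HTTP_HOST} ^123\.123\.123\.123$
# ...permanently redirect to the same path on the canonical domain.
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

Because the condition only matches Host headers that are the bare IP, requests that already use the domain pass through untouched, so the redirect is generally a safe complement to the canonical tags rather than a replacement for them.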