Spider Indexed Disallowed URLs
-
Hi there,
In order to reduce the huge amount of duplicate content and titles for a client, we disallowed all spiders from some areas of the site in August via the robots.txt file. This was followed by a huge decrease in errors in our SEOmoz crawl report, which, of course, made us happy.
In the meantime, we haven't changed anything in the back end, the robots.txt file, FTP, the website or anything else. But our crawl report came in this November and all of a sudden all the errors were back. We've checked the errors and noticed URLs that are definitely disallowed. The disallowing of these URLs is also confirmed by Google Webmaster Tools and other robots.txt checkers, and when we search for a disallowed URL in Google, it says that it's blocked for spiders. Where did these errors come from? Did the SEOmoz spider ignore our disallow rules or something? You can see the drop and the increase in errors in the attached image.
Thanks in advance.
[Attached image: LAAFj.jpg]
-
This was what I was looking for! The pages are indexed by Google, yes, but they aren't being crawled by Googlebot (as Webmaster Tools and the Matt Cutts video are telling me). They are probably being crawled occasionally by Rogerbot, though (not monthly). Thank you very much!
-
Yes, canonicalization or a meta noindex tag would of course be better for passing any link juice, but we aren't worried about that. I was worried Google would still see the pages as duplicates. (I couldn't really distill that from the article, although it was useful!) Barry Smith addressed that last issue in the answer below, but I do want to thank you for your insight.
-
The directives issued in a robots.txt file are just a suggestion to bots, though one that Google does follow.
Malicious bots will ignore them, and occasionally even bots that follow the directives may mess up (which is probably what's happened here).
Google may also index pages that you've blocked if it has found them via a link, as explained here - http://www.youtube.com/watch?v=KBdEwpRQRD0 - and for an overview of what Google does with robots.txt files you can read here - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
I'd suggest you look at other ways of fixing the problem than just blocking 1500 pages, but I see you've considered what would be required to fix the issues without removing the pages from a crawl and decided the value isn't there.
If WMT is telling you the pages are blocked from being crawled, I'd believe that.
Try searching Google for a URL that should be blocked and see if it's indexed, or do a site:http://yoursitehere.com search and see if blocked pages come up.
-
Your assumptions about what to expect from robots.txt may not be in line with reality. Crawling a page isn't the same thing as indexing its content to appear in SERPs, and even with a robots.txt file in place, your pages can be crawled.
http://www.seomoz.org/blog/serious-robotstxt-misuse-high-impact-solutions
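To make the crawl-versus-index point a bit more concrete, here is a minimal Python sketch (my own illustration, not something taken from the linked article). It assumes a hypothetical site at www.example.com, hypothetical robots.txt rules and a made-up helper called `crawler_can_see_noindex`. The idea it shows: a meta noindex tag only helps if a rule-abiding crawler is allowed to fetch the page, because a page blocked in robots.txt is never read and its tag is never seen.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collects the directives of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def crawler_can_see_noindex(robots_txt, user_agent, url, html):
    """Return True only if a rule-abiding bot may fetch the URL AND the page says noindex."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    if not rules.can_fetch(user_agent, url):
        # Blocked by robots.txt: the page is never fetched, so the tag is never read.
        return False
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives


# Hypothetical inputs, for illustration only.
PAGE = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
BLOCKING_RULES = "User-agent: *\nDisallow: /filter/\n"
OPEN_RULES = "User-agent: *\nDisallow:\n"
URL = "https://www.example.com/filter/red"

print(crawler_can_see_noindex(BLOCKING_RULES, "Googlebot", URL, PAGE))
print(crawler_can_see_noindex(OPEN_RULES, "Googlebot", URL, PAGE))
```

With the blocking rules the helper reports that the noindex tag is invisible to the bot; with the open rules the tag can be read. This only models what the rules ask for, not what any particular search engine actually does with the page.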
-
Thanks, Mr. Goyal. Of course we have thought about other ways to fix this and worked out some options, but implementing these solutions would be disastrous from a time and financial perspective. The pages that we have blocked from the spiders aren't needed for visibility in the search engines and don't carry much link juice; they are only there for the visitors, so we decided we don't really need them for our SEO efforts. But if these pages do get crawled and the engines notice the huge amount of duplicates, I reckon this would have a negative influence on our site as a whole.
So our problem comes down to doubts about the legitimacy of the report. If SEOmoz can crawl these pages, Googlebot probably could too, right, since we've used: User-agent: *
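For anyone wanting to double-check the same thing locally, here is a rough Python sketch using the standard library's `urllib.robotparser`. The rules, paths and the www.example.com host below are made up for illustration (they are not our client's real robots.txt), and `is_allowed` is just a helper name I picked. It shows that a `User-agent: *` group applies to any bot that has no more specific group of its own, so the rules ask Googlebot and rogerbot for exactly the same thing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for a real robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /filter/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("Googlebot", "rogerbot"):
    for path in ("/search/results", "/about"):
        url = "https://www.example.com" + path
        verdict = "allowed" if is_allowed(ROBOTS_TXT, bot, url) else "disallowed"
        print(bot, path, verdict)
```

Note that this only tells you what the file requests. Whether a given spider actually honoured the rules on a given crawl is something only its crawl logs or your server logs can confirm.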
-
Mark,
Are you blocking all bots from spidering these erroneous URLs? Is there a way for you to fix them so that they either no longer exist or are no longer duplicates?
I'd recommend looking at it from that perspective as well, not just with the aim of making those errors disappear from the SEOmoz report.
I hope this helps.