Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by the Googlebot according to our log files versus the number of pages indexed in GWT. Consider:
- Number of pages crawled per log files: 2993
- Crawl frequency (i.e. number of times those pages were crawled): 61438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have it's very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM roughly show 6-7 million pages indexed in Google. Walmart has between 8-9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
-
Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
-
Number indexed: 17,182,818
-
Difference: 3,300,778
-
Number of URLs throwing 404 errors: 2,810,650
-
2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on site (I mentioned them in a separate post here) are at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and submitted is not exactly a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow up question: Does your site really have 17+ million pages?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Blog Page Titles - Page 1, Page 2 etc.
Hi All, I have a couple of crawl errors coming up in MOZ that I am trying to fix. They are duplicate page title issues with my blog area. For example we have a URL of www.ourwebsite.com/blog/page/1 and as we have quite a few blog posts they get put onto another page, example www.ourwebsite.com/blog/page/2 both of these urls have the same heading, title, meta description etc. I was just wondering if this was an actual SEO problem or not and if there is a way to fix it. I am using Wordpress for reference but I can't see anywhere to access the settings of these pages. Thanks
Technical SEO | | O2C0 -
Is there a way to index important pages manually or to make sure a certain page will get indexed in a short period of time??
Hi There! The problem I'm having is that certain pages are waiting already three months to be indexed. They even have several backlinks. Is it normal to have to wait more than three months before these pages get an indexation? Is there anything i can do to make sure these page will get an indexation soon? Greetings Bob
Technical SEO | | rijwielcashencarry0400 -
One page templates
Hi, I have in plan to implement One-page template (http://shapebootstrap.net/wp-content/plugins/shapebootstrap/demo.php?id=380) is someone have experience how this type of page have SEO problems? Best regards
Technical SEO | | komir20040 -
Duplicate Page Errors
Hey guys, I'm wondering if anyone can help... Here is my issue... Our website:
Technical SEO | | TCPReliable
http://www.cryopak.com
It's built on Concrete 5 CMS I'm noticing a ton of duplicate page errors (9530 to be exact). I'm looking at the issues and it looks like it is being caused by the CMS. For instance the home page seems to be duplicating.. http://www.cryopak.com/en/
http://www.cryopak.com/en/?DepartmentId=67
http://www.cryopak.com/en/?DepartmentId=25
http://www.cryopak.com/en/?DepartmentId=4
http://www.cryopak.com/en/?DepartmentId=66 Do you think this is an issue? Is their anyway to fix this issue? It seems to be happening on every page. Thanks Jim0 -
Duplicate Pages Issue
I noticed a problem and I was wondering if anyone knows how to fix it. I was a sitemap for 1oxygen.com, a site that has around 50 pages. The sitemap generator come back with over a 2000 pages. Here is two of the results: http://www.1oxygen.com/portableconcentrators/portableconcentrators/portableconcentrators/services/rentals.htm
Technical SEO | | chuck-layton
http://www.1oxygen.com/portableconcentrators/portableconcentrators/1oxygen/portableconcentrators/portableconcentrators/portableconcentrators/oxusportableconcentrator.htm These are actaully pages somehow. In my FTP there in the first /portableconentrators/ folder there is about 12 html documents and no other folders. It looks like it is creating a page for every possible folder combination. I have no idea why you those pages above actually work, help please???0 -
Tracking a Crawl error
Hi All, If you find a crawl error on your page. How do you find it? The error only says the URL that is wrong but this is not the location. Can i drill down and find out more information? Thank you!
Technical SEO | | wedmonds0 -
Duplicate Page Title
First i had a problem with duplicate title errors, almost every page i had was double because my website linked to both www.funky-lama.com and funky-lama.com I changed this by adding a code to htaccess to redirect everything to www.funky-lama.com, but now my website was crawled again and the errors were actually doubled. all my pages now have duplicate title errors cause of pages like this www.funky-lama.com/160-confetti-gitaar.html funky-lama.com/160-confetti-gitaar.html www.funky-lama.com/1_present-time funky-lama.com/1_present-time
Technical SEO | | funkylama0 -
404 vs. 200?
Is it better to have an error page return a 404 or 200? If I change it to 200, will I still be able to see reports of 404's and/ or broken links? Is there a valid SEO reason that Google would have for not wanting error pages to return 200? In other words, is there any SEO reason to absolutely change it to return a 404? I would rather let it return 200 if no priority reason to change. [title edited by staff to provide clarity]
Technical SEO | | cindyt-170380