Log files vs. GWT: major discrepancy in number of pages crawled

ufmedia

Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but the biggest issue I've spotted is the vast difference between the number of pages Googlebot crawled according to our log files and the number of pages GWT reports as indexed. Consider:

Number of pages crawled per log files: 2,993
Crawl frequency (i.e. number of times those pages were crawled): 61,438
Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
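
For anyone who wants to reproduce the first two numbers without Web Log Explorer, here is a minimal Python sketch. It assumes the logs are in the Apache/Nginx combined format and that matching "Googlebot" in the user agent is good enough. The function name `googlebot_crawl_stats` and the sample lines are made up for illustration, and user agents can be spoofed, so treat the result as an approximation rather than a verified crawl count.

```python
import re
from collections import Counter

# Combined log format: the request line and the user agent are both quoted.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_crawl_stats(lines):
    """Return (unique_urls, total_hits) for requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())


# Illustrative log lines only, not taken from the real site.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2014:13:56:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Oct/2014:13:57:12 +0000] "GET /b HTTP/1.1" 404 0 "-" "{BOT}"',
    '10.0.0.5 - - [10/Oct/2014:13:58:00 +0000] "GET /c HTTP/1.1" 200 900 "-" "Mozilla/5.0"',
]

unique_urls, total_hits = googlebot_crawl_stats(sample)
print(unique_urls, total_hits)
```

The first value corresponds to "pages crawled" and the second to "crawl frequency" in the list above. Note that the log window matters: a few weeks of logs will only show what Googlebot fetched in that period, not everything it has ever crawled.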

We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
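
One way to sanity-check the sitemap side is to count how many distinct URLs the child sitemaps actually list. The sketch below is a rough illustration in Python using only the standard library. It assumes the files follow the sitemaps.org 0.9 schema, and it works on XML strings that are already downloaded. The helper names and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml):
    """Return the sitemap URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.findall("sm:sitemap/sm:loc", NS)]


def listed_urls(sitemap_xml):
    """Return every page URL listed in one sitemap, duplicates included."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


# Hypothetical, already-downloaded documents standing in for the real files.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
index_xml = (
    f"<sitemapindex {XMLNS}>"
    "<sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>"
    "<sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>"
    "</sitemapindex>"
)
downloaded = {
    "https://example.com/sitemap-1.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    ),
    "https://example.com/sitemap-2.xml": (
        f"<urlset {XMLNS}>"
        "<url><loc>https://example.com/b</loc></url>"
        "<url><loc>https://example.com/c</loc></url>"
        "</urlset>"
    ),
}

all_listed = []
for sitemap_url in child_sitemaps(index_xml):
    all_listed.extend(listed_urls(downloaded[sitemap_url]))

total_listed = len(all_listed)
total_unique = len(set(all_listed))
print(total_listed, total_unique)
```

If the unique count comes out well below the listed count, the same URLs are being submitted in more than one sitemap, which would inflate the "submitted" figure.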

danatanseo

I'll reserve my answer until you hear from your dev team. A massive site for sure.

One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.

I'm pretty sure you know what I mean by that, but for others reading this who may not know: I'm using URI loosely to mean the unique Web address of any given resource, and URL to mean the address of a complete Web page (strictly speaking, every URL is also a URI). An example would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a website (although there are certainly exceptions to that).

So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.

So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.

This is a very interesting thread so I want to know more. Cheers!

ufmedia

Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:

Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
Number indexed: 17,182,818
Difference: 3,300,778
Number of URLs throwing 404 errors: 2,810,650
2,810,650 / 3,300,778 = 85%

I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know Google says that 404s don't hurt SEO, but the fact that the number of 404s equals 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.
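
For anyone who wants to check the arithmetic, here it is as a short Python snippet. The four input numbers are the GWT figures quoted above; the variable names are just labels chosen for this sketch.

```python
# Figures as reported in GWT (see the list above).
submitted = 13_882_040          # URLs submitted via XML sitemaps
submitted_indexed = 13_204_476  # submitted URLs that GWT reports as indexed
total_indexed = 17_182_818      # total pages GWT reports as indexed
not_found = 2_810_650           # URLs returning 404 errors

# Indexed pages that did not come from the submitted sitemaps.
difference = total_indexed - submitted

# Share of submitted URLs that were indexed.
indexed_share = submitted_indexed / submitted

# 404 count expressed as a fraction of the indexed-minus-submitted gap.
not_found_share = not_found / difference

print(f"difference: {difference:,}")
print(f"submitted and indexed: {indexed_share:.1%}")
print(f"404s relative to the difference: {not_found_share:.1%}")
```

Keep in mind that the ratio only shows the two numbers are of similar size. It does not show that the 404 URLs are the same URLs that make up the gap; that would need a URL-level comparison.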

(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)

danatanseo

Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?
