Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a fairly deep dive into our log files using Web Log Explorer. Several things have come to light, and one of the issues I've spotted is the vast difference between the number of pages Googlebot crawled according to our log files and the number of pages indexed according to GWT. Consider:
- Number of unique pages crawled per log files: 2,993
- Crawl frequency (i.e. total number of times those pages were crawled): 61,438
- Number of pages indexed per GWT: 17,182,818 (yes, that's right: more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked from the main sitemap.xml page; these sitemaps have been crawled fairly frequently, and I think this is where a lot of the indexed URLs come from. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
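For anyone who wants to reproduce the two log-file figures above without a dedicated tool, here is a minimal sketch. It assumes the server writes the common "combined" access log format and that Googlebot can be identified by its user-agent string (which can be spoofed, so treat the result as an upper bound). The function name `googlebot_crawl_stats` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Combined log format (assumed):
# host ident user [time] "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_crawl_stats(lines):
    """Return (unique_paths, total_hits) for requests whose user agent mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())


# Hypothetical sample lines, not taken from the site discussed in this thread.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 +0000] "GET /products/1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:02:11 +0000] "GET /products/1 HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:05:40 +0000] "GET /sitemap.xml HTTP/1.1" 200 2048 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2014:14:06:02 +0000] "GET /products/2 HTTP/1.1" 200 640 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

unique_pages, total_hits = googlebot_crawl_stats(sample)
print(unique_pages, total_hits)
```

The first number corresponds to "pages crawled" and the second to "crawl frequency". With the figures in the list above, 61,438 hits over 2,993 pages works out to roughly 20.5 crawls per page, which suggests Googlebot is revisiting a small set of URLs rather than spreading out across the site.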
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not: a URI is the unique Web address of any given resource, while "URL" is generally used to refer to the address of a complete Web page. An image is a good example. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
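One rough way to check the pages-versus-resources distinction is to bucket the URLs in a sitemap by file extension. The sketch below assumes a standard sitemaps.org `<urlset>` document; the extension list, the function name `classify_sitemap_urls`, and the example.com URLs are illustrative assumptions, and an extension check is only a heuristic.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from posixpath import splitext
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed list of extensions that usually denote a resource rather than a page.
ASSET_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".pdf", ".css", ".js"}


def classify_sitemap_urls(sitemap_xml):
    """Count sitemap <loc> entries as 'page' or 'asset' based on the path extension."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        path = urlsplit(loc.text.strip()).path
        extension = splitext(path)[1].lower()
        counts["asset" if extension in ASSET_EXTENSIONS else "page"] += 1
    return counts


# Hypothetical sitemap used only to exercise the function.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/widgets/blue</loc></url>
  <url><loc>https://www.example.com/widgets/red.html</loc></url>
  <url><loc>https://www.example.com/images/blue.jpg</loc></url>
  <url><loc>https://www.example.com/docs/manual.pdf</loc></url>
</urlset>"""

counts = classify_sitemap_urls(SAMPLE_SITEMAP)
print(counts["page"], counts["asset"])
```

Run over each of the child sitemaps, a split like this would show how many of the submitted URLs are actual pages and how many are standalone resources.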
So where I'm headed in my thinking is major duplicate content issues, but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number of URLs submitted in XML sitemaps per GWT: 13,882,040 (of which indexed: 13,204,476, or 95.1%)
- Total number indexed: 17,182,818
- Difference between total indexed and submitted: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 404s as a share of that difference: 2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know Google says that 404s don't hurt SEO, but the fact that the number of 404s equals 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.
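The arithmetic behind those figures can be checked with a few lines. This is just a sketch that recomputes the percentages from the GWT numbers quoted in this post; the function and key names are made up for illustration.

```python
def index_gap_summary(submitted, submitted_indexed, total_indexed, not_found):
    """Recompute the sitemap coverage, the indexed-vs-submitted gap, and the 404 share of that gap."""
    gap = total_indexed - submitted
    return {
        "submitted_indexed_pct": round(100 * submitted_indexed / submitted, 1),
        "gap": gap,
        "not_found_share_of_gap_pct": round(100 * not_found / gap, 1),
    }


# Figures as reported in GWT in this post.
summary = index_gap_summary(
    submitted=13_882_040,
    submitted_indexed=13_204_476,
    total_indexed=17_182_818,
    not_found=2_810_650,
)
print(summary)
```

Note that the ratio only shows the two counts are of similar size. It does not show that the 404 URLs are the same URLs that make up the gap; confirming that would take a URL-level comparison of the 404 report against the sitemaps.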
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
Hi. Interesting question. You had me at "log files." Before I give a longer, more detailed answer, I have a follow-up question: does your site really have 17+ million pages?