Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one issue stands out: the vast difference between the number of pages Googlebot crawled according to our log files and the number of pages indexed according to GWT. Consider:
- Number of pages crawled per log files: 2,993
- Crawl frequency (i.e. total number of times those pages were crawled): 61,438
- Number of pages indexed per GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked from the main sitemap.xml page; those sitemap files have been crawled fairly frequently, and I suspect that is how a lot of the URLs got indexed. Even so, would that explain why the logs show relatively few pages crawled while Google reports so many more indexed?
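For anyone who wants to reproduce the two log-file figures above (unique pages crawled and total crawl hits) without a GUI tool, here is a minimal Python sketch. It assumes the server writes the common "combined" access log format; the sample lines, IP addresses, and paths are made up for illustration. Matching on the user-agent string alone does not prove a request really came from Google, so treat the counts as an upper bound.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# host ident user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_crawl_stats(lines):
    """Return (unique_paths, total_hits, hits_per_path) for requests
    whose user-agent string mentions Googlebot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values()), hits

# Hypothetical sample lines, not taken from the site discussed in this thread.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:02:10 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:05:44 +0000] "GET /b.html HTTP/1.1" 404 128 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2014:14:06:01 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

unique_pages, total_hits, per_page = googlebot_crawl_stats(sample)
print("pages crawled:", unique_pages)
print("crawl frequency:", total_hits)
```

Against a real log you would pass an open file handle instead of `sample`. "Pages crawled" here corresponds to distinct request paths and "crawl frequency" to the total number of Googlebot requests.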
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, I'm using the terms loosely: a URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page (strictly speaking, every URL is also a URI). An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
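One rough way to test the duplicate-URL suspicion is to normalize the URLs from the sitemaps and see how many distinct addresses remain. The Python sketch below is only an illustration: the ignored parameter names and the example.com URLs are hypothetical, and which normalizations are safe (trailing slashes, host case, parameter order) depends entirely on how the site actually serves pages. It also only catches URL-level variants, not different URLs that happen to serve the same content.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical parameters that do not change page content on this imaginary site.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}

def normalize(url):
    """Collapse common URL variants into one canonical-looking form."""
    parts = urlsplit(url)
    # Drop ignored parameters and sort the rest so order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Treat "/widgets/" and "/widgets" as the same path; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )

def duplicate_report(urls):
    """Return (total, distinct, groups) where groups maps each normalized
    URL to the list of original URLs, for groups with more than one member."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    duplicates = {key: members for key, members in groups.items() if len(members) > 1}
    return len(urls), len(groups), duplicates

# Hypothetical sitemap entries for illustration only.
sample_urls = [
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets",
    "http://WWW.example.com/widgets?utm_source=feed",
    "http://www.example.com/widgets?color=red",
    "http://www.example.com/gadgets",
]

total, distinct, duplicates = duplicate_report(sample_urls)
print(f"{total} URLs submitted, {distinct} distinct after normalization")
for canonical, members in duplicates.items():
    print(canonical, "<-", len(members), "variants")
```

If the distinct count comes out far below the number of URLs submitted, that would support the duplicate-content theory; if the two are close, the explanation probably lies elsewhere.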
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
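The arithmetic in that list can be checked with a few lines of Python. The figures are the ones quoted above; the variable names are mine, and the last value (indexed URLs that did not come from the sitemaps) is an extra comparison that the list does not state, included only as another way to slice the same numbers.

```python
# Figures quoted from GWT in the list above.
submitted = 13_882_040              # URLs submitted in XML sitemaps
indexed_from_sitemaps = 13_204_476  # of those, how many are indexed
indexed_total = 17_182_818          # total indexed per GWT
not_found = 2_810_650               # URLs returning 404

# Share of submitted sitemap URLs that are indexed.
sitemap_index_rate = indexed_from_sitemaps / submitted

# "Difference" as used above: total indexed minus submitted.
gap = indexed_total - submitted

# 404 count expressed as a share of that difference.
share_404 = not_found / gap

# Extra view (not in the list): indexed URLs that are not sitemap URLs.
indexed_outside_sitemaps = indexed_total - indexed_from_sitemaps

print(f"sitemap index rate: {sitemap_index_rate:.1%}")
print(f"indexed minus submitted: {gap:,}")
print(f"404s as share of that gap: {share_404:.0%}")
print(f"indexed but not from sitemaps: {indexed_outside_sitemaps:,}")
```

Note that the ratio only shows the two quantities are of similar size. It does not by itself show that the 404 URLs are the same URLs that make up the gap; that would need a URL-level comparison.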
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know Google says that 404s don't hurt SEO, but the fact that the number of 404s equals 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?