Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by Googlebot according to our log files and the number of pages indexed according to GWT. Consider:
- Number of pages crawled per log files: 2993
- Crawl frequency (i.e. number of times those pages were crawled): 61438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
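For anyone who wants to reproduce the two log-file numbers above without Web Log Explorer, here is a minimal sketch. It assumes the server writes Apache/Nginx "combined" format logs, and it identifies Googlebot by user-agent string only (which can be spoofed); the function name `summarize_googlebot` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: Apache/Nginx "combined" log format. Adjust the pattern if your
# server logs in a different layout.
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def summarize_googlebot(lines):
    """Return (unique_urls, total_hits, hits_per_url) for Googlebot requests.

    Matching is by user-agent substring only; it does not verify that the
    request really came from Google.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values()), hits

# Hypothetical sample lines, not taken from the site discussed here.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2014:13:55:36 -0700] "GET /page-a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:01:02 -0700] "GET /page-a HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2014:14:05:10 -0700] "GET /page-b HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2014:14:06:00 -0700] "GET /page-c HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

unique_urls, total_hits, per_url = summarize_googlebot(sample)
print(unique_urls, total_hits)  # 2 3
```

Here "pages crawled" corresponds to `unique_urls` and "crawl frequency" to `total_hits`. Keep in mind that a log file only covers the period it was recorded for, whereas the index count reflects everything Google has picked up over time.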
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and the number submitted doesn't look like a coincidence.
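For readers who want to check the figures, the arithmetic can be reproduced in a few lines. This is only a sketch using the numbers quoted in this post; the variable names are mine. Note that the "difference" compares total indexed pages with the number submitted in sitemaps, not with the number of sitemap URLs that were actually indexed.

```python
# Figures as reported in GWT in this thread.
submitted_in_sitemaps = 13_882_040
indexed_from_sitemaps = 13_204_476
indexed_total = 17_182_818
urls_returning_404 = 2_810_650

# Share of submitted sitemap URLs that were indexed.
sitemap_index_rate = indexed_from_sitemaps / submitted_in_sitemaps

# Indexed pages beyond what was submitted in the sitemaps.
difference = indexed_total - submitted_in_sitemaps

# 404 count expressed as a fraction of that difference.
share_404 = urls_returning_404 / difference

print(f"{sitemap_index_rate:.1%}")  # 95.1%
print(f"{difference:,}")            # 3,300,778
print(f"{share_404:.0%}")           # 85%
```

The ratio shows the two numbers are of similar size; it does not by itself show that the 404 URLs are the same URLs that make up the difference. That would need a URL-level comparison.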
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
-
Hi. Interesting question. You had me at "log files."
So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?
Related Questions
-
Can an increase in crawl errors (in GWT) be caused by input fields and jQuery?
Dear Mozzerz We took over www.urgiganten.dk not long ago and last week we opened up for indexation, after having taken the old website down for a couple of months. One week after opening for indexation we saw a huge increase in crawl errors. Google is discovering some weird links to e.g. http://www.urgiganten.dk/30-garmin-urremme/ which returns a 404. In GWT we are told that we are linking to this URL from http://www.urgiganten.dk/garmin-urremme. But nowhere on http://www.urgiganten.dk/garmin-urremme will you find this link. However, you will find the following script in the source code, which is the only code part that contains "/30-garmin-urremme/": Can it be true that Google takes the id and adds it to our domain to form a URL? We have seen quite a lot of these errors not only on Urgiganten.dk but also on some of our other websites!
Technical SEO | | urgiganten0 -
Advice on whether we 301 redirect a page or update existing page?
Hi guys, any advice would be really appreciated. We have an existing page that ranks well for 'red widgets'. The page isn't monetised right now, but we're bringing in a new product onto our site that we optimised for 'blue widgets'. Unfortunately, not enough research was done for this page and we've now realised that consumers actually search for 'red widgets' when looking for the product we're creating as 'blue widgets'. The problem with this is that the 'red widgets' page is in a completely different category of our site than where it needs to be (it needs to be with 'blue widgets'). So, my question is: should we do a 301 redirect from our 'red-widgets' page to our 'blue-widgets' page, and then update and optimise the content there for 'red widgets'? Or should we update the existing red-widgets page to have the right products and content on it, even though it is in the wrong place on our site and users could get confused as to why they are there? If we do a 301 redirect to our new page, will we lose our rankings and have to start again, or is there a better way around this? Thanks! Dave
Technical SEO | | davo230 -
Sub pages Vs subdomain Pagerank flow.
I have a question about PageRank flow. If I have a site, domain.com, and two possible solutions:
Solution #1: the root domain links only to domain.com/video and domain.com/blog.
Solution #2: the root domain links only to video.domain.com and blog.domain.com.
So does PR of domain.com/blog = PR of blog.domain.com, and PR of domain.com/video = PR of video.domain.com? I also don't know why a subdomain of Blogspot or WordPress ranks more easily than a new domain, e.g. keyword.wordpress.com vs. keyword.com. So what does WordPress pass to keyword.wordpress.com?
Technical SEO | | tommytai0 -
Submitting Sitemap File vs Sitemap Index File
Is it better to submit all sitemap files contained in a Sitemap Index File manually to Google, or is it about the same as just submitting the Master Sitemap Index File?
Technical SEO | | AU-SEO0 -
Keywords in file names vs folder names
We understand the value of a keyword phrase included in the URL. Is there more value to having that phrase in the folder name of the URL or the file name or does it matter? Example: http://www.biztoolsone.com/website-design.php or http://www.biztoolsone.com/website-design/ Which is best? Thanks, Wick Smith
Technical SEO | | wcksmith0 -
Subdirectories vs subdomains
Hi SEO gurus 🙂 Does anyone have input on what's better? blog.domain.com vs domain.com/blog, store.domain.com vs domain.com/store, etc. I think the subdir (/xyz) will concentrate authority on the same subdomain so should be better? However, sometimes it is tidier on the server to maintain online stores or blogs in a separate structure, so subdomains work better in that sense. I just want to make sure that doesn't affect SEO? Cheers!
Technical SEO | | hectorpn0 -
Removing Duplicate Pages
Hi everyone. I'm sure this falls under novice SEO questions, but how do I remove duplicate pages from my site? I have not created the pages per se. There may be an internal link on a page that links to the page causing the duplication. Do I remove the internal link? Here is a sample of a duplicate page: http://www.ticketplatform.com/about/ticket-industry-news-details/11-03-07/Ticket_Platform_to_help_LilysProject_com_to_raise_money_for_ALYN_Hospital_in_Israel.aspx?ReturnURL=%2fabout%2fticket-industry-news.aspx http://www.ticketplatform.com/about/ticket-industry-news-details/11-03-07/Ticket_Platform_to_help_LilysProject_com_to_raise_money_for_ALYN_Hospital_in_Israel.aspx?ReturnURL=%2fhome.aspx&CntPageID=1 I know the URL is way too long. Working on it. Thanks for your feedback.
Technical SEO | | ticketplatform0 -
Can Search engines crawl this page
Hi guys, To cut a long story short, we have had to make a copy of our site and put it on another domain, so in essence there are 2 copies of our site on the web. What we have done is put a username and password on the homepage - http://www.ughhwiki.co.uk/ - and now I just want to be 100% sure that the search engines cannot crawl this. Thank you Jon
Technical SEO | | imrubbish0