Log files vs. GWT: major discrepancy in number of pages crawled
-
Following up on this post, I did a pretty deep dive on our log files using Web Log Explorer. Several things have come to light, but one of the issues I've spotted is the vast difference between the number of pages crawled by the Googlebot according to our log files versus the number of pages indexed in GWT. Consider:
- Number of pages crawled per log files: 2,993
- Crawl frequency (i.e. number of times those pages were crawled): 61,438
- Number of pages indexed by GWT: 17,182,818 (yes, that's right - more than 17 million pages)
We have a bunch of XML sitemaps (around 350) that are linked on the main sitemap.xml page; these pages have been crawled fairly frequently, and I think this is where a lot of links have been indexed. Even so, would that explain why we have relatively few pages crawled according to the logs but so many more indexed by Google?
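For anyone who wants to reproduce the two log-file numbers above without Web Log Explorer, here is a minimal sketch. It assumes the server writes Apache/Nginx "combined" format logs and that matching the string "Googlebot" in the user agent is good enough; the function name `googlebot_crawl_stats` and the sample lines are made up for illustration, not taken from the site in question.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_crawl_stats(lines):
    """Return (unique pages crawled, total crawl hits) for Googlebot requests."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # User-agent matching only; it does not verify the request really
        # came from Google (that would need a reverse DNS check).
        if match and "Googlebot" in match.group("agent"):
            hits[match.group("path")] += 1
    return len(hits), sum(hits.values())

# Hypothetical sample lines, for illustration only.
GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Oct/2013:13:55:36 +0000] "GET /a.html HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2013:13:56:01 +0000] "GET /a.html HTTP/1.1" 200 2326 "-" "{GOOGLEBOT_UA}"',
    f'66.249.66.1 - - [10/Oct/2013:13:57:12 +0000] "GET /b.html HTTP/1.1" 404 512 "-" "{GOOGLEBOT_UA}"',
    '203.0.113.9 - - [10/Oct/2013:13:58:40 +0000] "GET /c.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (Windows NT 6.1)"',
]

unique_pages, total_hits = googlebot_crawl_stats(sample)
print(f"pages crawled: {unique_pages}, crawl frequency: {total_hits}")
```

"Pages crawled" here means distinct request paths and "crawl frequency" means total Googlebot hits, which is how the two figures above are read; if the log only covers a short window, both numbers will understate what Google has fetched over the life of the site.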
-
I'll reserve my answer until you hear from your dev team. A massive site for sure.
One other question/comment: just because there are 13 million URLs in your sitemap doesn't necessarily mean there are that many pages on the site. We could be talking about URI versus URL.
I'm pretty sure you know what I mean by that, but for others reading this who may not know, URI is the unique Web address of any given resource, while a URL is generally used to reference a complete Web page. An example of this would be an image. While it certainly has its own unique address on the Web, it most often does not have its very own "page" on a Website (although there are certainly exceptions to that).
So, I could see a site having millions of URIs, but very few sites have 17 million+ pages. To put it into perspective, Alibaba and IBM each show roughly 6-7 million pages indexed in Google. Walmart has between 8 and 9 million.
So where I'm headed in my thinking is major duplicate content issues...but, as I said, I'm going to reserve further comment until you hear back from your developers.
This is a very interesting thread so I want to know more. Cheers!
-
Waiting on an answer from our dev team on that now. In the meantime, here's what I can tell you:
- Number submitted in XML sitemaps per GWT: 13,882,040 (number indexed: 13,204,476, or 95.1%)
- Number indexed: 17,182,818
- Difference: 3,300,778
- Number of URLs throwing 404 errors: 2,810,650
- 2,810,650 / 3,300,778 = 85%
I'm sure the ridiculous number of 404s on the site (I mentioned them in a separate post here) is at least partially to blame. How much, though? I know that Google says that 404s don't hurt SEO, but the fact that the number of 404s is 85% of the difference between the number indexed and the number submitted is not exactly a coincidence.
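The arithmetic behind those figures can be checked with a quick sketch. The numbers are the ones quoted in this post; the variable names are only for illustration.

```python
# Figures as reported in GWT in this thread.
submitted_in_sitemaps = 13_882_040
indexed_from_sitemaps = 13_204_476
total_indexed = 17_182_818
urls_returning_404 = 2_810_650

# Share of submitted sitemap URLs that GWT reports as indexed.
sitemap_index_rate = indexed_from_sitemaps / submitted_in_sitemaps

# Indexed pages that are not accounted for by sitemap submissions.
difference = total_indexed - submitted_in_sitemaps

# How large the 404 count is relative to that difference.
share_404 = urls_returning_404 / difference

print(f"sitemap index rate: {sitemap_index_rate:.1%}")
print(f"difference: {difference:,}")
print(f"404s as share of difference: {share_404:.0%}")
```

Note that this only shows the two numbers are of similar size. It does not show that the 404 URLs are the same URLs that make up the difference; that would take a URL-level comparison.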
(Apologies if these questions seem a bit dense or elementary. I've done my share of SEO, but never on a site this massive.)
-
Hi. Interesting question. You had me at "log files." So before I give a longer, more detailed answer, I have a follow-up question: Does your site really have 17+ million pages?