Indexed pages and current pages - Big difference?
-
Our website shows ~22k pages in the sitemap, but ~56k are showing as indexed on Google through the "site:" command. Firstly, how much attention should we be paying to the discrepancy? If we should be worried, what's the best way to find the cause of the difference?
The domain canonical is set, so we can't really figure out whether we've got a problem or not.
-
Hi Nathan,
The delta between the number of pages returned by the site: operator and the number of pages in your sitemap could be down to a number of issues:
- Your XML sitemap may represent only a percentage of the total number of valid content URLs that your site is capable of generating.
  a) Often sites will only generate XML sitemaps for URLs that someone has decided are "important", when the total number of URLs is much larger.
- Your XML sitemap contains ALL the valid content URLs that your site is capable of generating, but search engines are somehow finding more URLs.
  a) Look in Google Webmaster Tools under Optimization >> HTML improvements >> Duplicate title tags.
     i) Do the pages with duplicate titles have duplicate page content? If so, your publishing platform is allowing multiple URLs to render the same content, which is a bug that needs to be fixed.
  b) Run a crawler like Xenu Link Sleuth or Screaming Frog against your site, and see how many URLs it discovers. Export the results to Excel and look for weird URLs.
     i) The usual culprits for duplicate content include incorrect canonicalization (www vs non-www, URLs ending in /index.html vs just /, etc).
     ii) Look for URLs ending with strange query strings (affiliate tracking, session IDs, etc).
  c) Use the site: operator in other engines (Bing, blekko, etc) and compare the numbers they return. Especially if this number is larger than the number Google is returning, start looking for weird URL patterns.
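If you'd rather script the "look for weird URLs" step than eyeball it in Excel, here is a rough Python sketch of the idea. It assumes you have already exported the crawled URLs and the sitemap URLs as plain lists. The function names and the TRACKING_PARAMS list are made up for illustration, so swap in whatever parameters your own site actually uses.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical examples of query parameters that do not change page content.
TRACKING_PARAMS = {"sessionid", "sid", "affid", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url):
    """Collapse common duplicate-URL variants into one comparable string."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]  # treat www and non-www as the same host
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]  # /foo/index.html -> /foo/
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(kept)
    # Scheme and fragment are ignored on purpose.
    return host + path + ("?" + query if query else "")


def find_duplicates(crawled_urls):
    """Group crawled URLs that normalize to the same page."""
    groups = defaultdict(set)
    for url in crawled_urls:
        groups[normalize(url)].add(url)
    return {key: sorted(urls) for key, urls in groups.items() if len(urls) > 1}


def extra_urls(crawled_urls, sitemap_urls):
    """Crawled URLs that are not listed verbatim in the sitemap."""
    return sorted(set(crawled_urls) - set(sitemap_urls))
```

Any group returned by find_duplicates is a set of URLs that probably render the same content, and extra_urls gives you the candidates for the gap between the sitemap count and the indexed count.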
Also, I'm not sure what you mean by "the domain canonical has been set correctly". If you're referring to use of the canonical link element for every URL, there are plenty of ways that can go wrong. E.g., if your CMS requires that each published URL have rel="canonical", but allows URLs to be published with and without the trailing /index.html, you can end up with a canonical link element on the non-canonical version of the URL, further confusing engines. Something to look into.
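To make the /index.html case concrete, here is a small Python sketch that flags a page whose canonical link element points at itself even though the page is the /index.html variant. It only inspects HTML you have already fetched; the class name, function name, and messages are my own for illustration, not part of any tool.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of every <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values and attr.get("href"):
            self.canonicals.append(attr["href"])


def canonical_problem(page_url, html):
    """Return a short description of a canonical issue, or None if none is found."""
    finder = CanonicalFinder()
    finder.feed(html)
    if not finder.canonicals:
        return "no canonical link element"
    if len(finder.canonicals) > 1:
        return "multiple canonical link elements"
    target = finder.canonicals[0]
    # The case described above: the non-canonical variant declares itself canonical.
    if target == page_url and page_url.endswith("/index.html"):
        return "self-referencing canonical on an /index.html variant"
    return None
```

Run it over both variants of the same page (with and without /index.html). If both come back pointing at themselves, the canonical element isn't consolidating anything.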
-
You might have a duplicate content issue. Check that you have the proper 301 redirect in place and a canonical link element in the head of your pages. If these aren't set properly, the search engines will see the www and non-www versions of your site as duplicates. Also remember that the search engines by default place a trailing / at the end of the URL.
Here are two links that can help if this is the issue.
http://www.webconfs.com/how-to-redirect-a-webpage.php/
http://www.mattcutts.com/blog/rel-canonical-html-head/
Hope this helps. Good Luck
-
Yes, this is a potentially significant problem. The easiest way to troubleshoot is to run the 'site:' command again and go to the last page of results. You should see pages that aren't in your sitemap, very likely duplicated content.
If you are having a rough time troubleshooting, post a link and I'll be glad to take a peek.