Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at the sitemap for one of our content sites and tried to cut out a lot of crap that had gotten in there, such as .php, .xml, and .htm versions of each page. We also pulled the images out to put them in a separate image sitemap.
The sitemap came out at 83,000+ URLs for Google to crawl (it was generated partly with the Yoast WordPress plugin).
In Webmaster Tools, the Index Status section shows that this site has a total of 1.5 million pages indexed.
With our sitemap coming back with 83k and Google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages, empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as the parameters mentioned, you may have heaps of duplicated categories, tags, etc. What I would also do is start searching Google with something like site:www.example.com/directory/ or site:www.example.com/category/directory/directory/ so you are tightly narrowing down the results, then switch to 100 results per page and manually look for clues.
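If you already have a list of URLs from a crawl export or server logs, a rough way to decide which directories to check first is to bucket the URLs by their leading path segments and build the site: query for each bucket. This is only a sketch; the `count_by_directory` and `site_query` helpers and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_directory(urls, depth=2):
    """Count URLs by their first `depth` path segments (a rough bucket)."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlsplit(url).path.split("/") if s]
        prefix = "/" + "/".join(segments[:depth])
        counts[prefix] += 1
    return counts


def site_query(host, prefix):
    """Build the site: search string to paste into Google by hand."""
    return f"site:{host}{prefix}"


# Hypothetical sample; in practice load these from a crawl export.
urls = [
    "http://www.example.com/category/news/post-1",
    "http://www.example.com/category/news/post-2",
    "http://www.example.com/tag/widgets/page/2",
    "http://www.example.com/about",
]

# Largest buckets first, each with the query you would search for.
for prefix, n in count_by_directory(urls).most_common():
    print(n, site_query("www.example.com", prefix))
```

The biggest buckets are usually the best place to start the manual site: checks.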
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where the URLs are coming from. What you do next will depend on that, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs, and Google usually knows how to handle them. If there are a lot of URL parameters, Google them and see if they're exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find, so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as a differentiated landing page or as a page for people to link to.
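To get a feel for how many parameter variants you are dealing with, here is a minimal sketch that tallies parameter names across a list of URLs and groups the URLs that collapse to the same address once the query string is removed. It assumes the parameters don't change the content, which you still need to verify by hand. The helper names and sample URLs are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def parameter_report(urls):
    """Count parameter names and group URLs that share a base address."""
    param_counts = Counter()
    variants = defaultdict(set)
    for url in urls:
        for name, _value in parse_qsl(urlsplit(url).query, keep_blank_values=True):
            param_counts[name] += 1
        variants[strip_query(url)].add(url)
    # Only bases with more than one variant are candidates for a canonical tag.
    duplicates = {base: sorted(v) for base, v in variants.items() if len(v) > 1}
    return param_counts, duplicates


def canonical_tag(base_url):
    """The tag each variant would carry, assuming base_url is the main version."""
    return f'<link rel="canonical" href="{base_url}" />'


# Hypothetical sample; in practice load these from a crawl export.
urls = [
    "http://www.example.com/widgets/?sort=price",
    "http://www.example.com/widgets/?sort=name&page=2",
    "http://www.example.com/widgets/",
    "http://www.example.com/about/",
]

param_counts, duplicates = parameter_report(urls)
print(param_counts.most_common())
for base, group in duplicates.items():
    print(len(group), "variants ->", canonical_tag(base))
```

The parameter names that show up most often are the ones worth looking at in the URL Parameters report.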
Also check for parameters in your analytics program. They could actually be messing up your pageview data, depending on how you report. There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status. Are there a lot of URLs removed? If so, check on these pages and see why they're removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
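Once you have a crawl export, a quick comparison against the sitemap shows what the crawler found that you never meant to publish. The sketch below is one way to do that, plus a crude loop check that flags URLs where the same path segment repeats several times. The function names, the repeat threshold, and the sample data are assumptions for illustration, not features of either crawler. Note that a Yoast sitemap is normally an index of child sitemaps, so you would run `sitemap_urls` on each child file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> values found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text}


def compare(sitemap, crawled):
    """Split URLs into those found only by the crawl and only in the sitemap."""
    sitemap, crawled = set(sitemap), set(crawled)
    return {
        "only_in_crawl": sorted(crawled - sitemap),
        "only_in_sitemap": sorted(sitemap - crawled),
    }


def looks_like_loop(url, max_repeats=2):
    """Flag URLs where any path segment appears more than max_repeats times."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return any(segments.count(s) > max_repeats for s in set(segments))


# Hypothetical sample data standing in for a real sitemap and crawl export.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/about/</loc></url>
</urlset>"""

CRAWLED = [
    "http://www.example.com/",
    "http://www.example.com/tag/widgets/",
    "http://www.example.com/a/b/a/b/a/b/",
]

report = compare(sitemap_urls(SAMPLE_SITEMAP), CRAWLED)
print("Crawled but not in sitemap:", report["only_in_crawl"])
print("In sitemap but not crawled:", report["only_in_sitemap"])
print("Possible loops:", [u for u in CRAWLED if looks_like_loop(u)])
```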
-
Rob,
Your sitemap is only an indication to Google about URLs on your domain. The sitemap does not limit Google to crawling or indexing only the URLs listed in it, nor is it a directive that tells Google to remove URLs it has already crawled from the index. As stated in GWT, use **robots.txt** to specify how search engines should crawl your site, or request **removal** of URLs from Google's search results with the URL removal tool in Google Webmaster Tools, under the "Google Index" link.
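Before publishing new robots.txt rules, it can help to test them against a sample of your URLs. Here is a minimal sketch using Python's standard `urllib.robotparser`. The rules and URLs are invented for the example, and this parser does plain prefix matching, so it may not treat wildcard patterns the way Google does. Keep in mind that robots.txt controls crawling only; it is not a way to request removal.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; replace with the robots.txt you plan to deploy.
ROBOTS_TXT = """\
User-agent: *
Disallow: /tag/
Disallow: /search/
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules would stop user_agent from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


# Hypothetical sample URLs.
urls = [
    "http://www.example.com/",
    "http://www.example.com/tag/widgets/",
    "http://www.example.com/search/?q=widgets",
    "http://www.example.com/category/news/post-1",
]

for url in blocked_urls(ROBOTS_TXT, urls):
    print("blocked:", url)
```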