Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at the sitemap for one of our content sites and tried to cut out a lot of crap that had gotten in there, such as .php, .xml, and .htm versions of each page. We also cut out images to put them in a separate image sitemap.
The sitemap now lists 83,000+ URLs for Google to crawl (it was generated partly with the Yoast WordPress plugin).
In Webmaster Tools, the Index Status section shows that this site has a total of 1.5 million pages indexed.
With our sitemap coming back with 83k and Google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages, empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as the parameters mentioned, you may have heaps of duplicated categories, tags, etc. What I would also do is start searching Google with something like site:www.example.com/directory/ or possibly site:www.example.com/category/directory/directory/ so that you are tightly narrowing down the results, then switch to 100 results per page and manually look for clues.
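If you already have a list of URLs from a crawl or log export, the same narrowing-down can be roughed out in code before you start on the site: searches. This is only a sketch: the sample URLs and the function names (first_segment, count_by_section) are made up for illustration, and it simply counts URLs by their first path segment.

```python
from collections import Counter
from urllib.parse import urlsplit


def first_segment(url: str) -> str:
    """Return the first path segment of a URL, or "(root)" for the homepage."""
    segments = [seg for seg in urlsplit(url).path.split("/") if seg]
    return segments[0] if segments else "(root)"


def count_by_section(urls):
    """Count how many URLs fall under each first-level directory."""
    return Counter(first_segment(url) for url in urls)


# Hypothetical sample; in practice this would be a crawl or log export.
sample_urls = [
    "http://www.example.com/",
    "http://www.example.com/category/widgets/",
    "http://www.example.com/category/gadgets/",
    "http://www.example.com/tag/blue/",
]

counts = count_by_section(sample_urls)
# Largest sections first: these are the directories worth a site: search.
for section, total in counts.most_common():
    print(section, total)
```

The sections with the biggest counts are the ones I would check first with a site:www.example.com/section/ search.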
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where the URLs are coming from. What you do next will depend on that, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs with parameters, and Google usually knows how to handle them. If there are a lot of URL parameters, Google them and see if the resulting pages are exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find, so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as a differentiated landing page or as a page for people to link to.
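To get a feel for how many parameter variants point back at the same page, something like the following could be run over a URL export. It is a minimal sketch, not a full audit: the sample URLs and the helper names (strip_query, group_parameter_variants) are hypothetical, and it assumes that dropping the whole query string gives you the main version of the page, which will not be true for every CMS.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def group_parameter_variants(urls):
    """Map each parameter-free URL to the parameterized variants seen for it."""
    groups = defaultdict(list)
    for url in urls:
        if urlsplit(url).query:  # only URLs that actually carry parameters
            groups[strip_query(url)].append(url)
    return dict(groups)


# Hypothetical sample; in practice this would be a crawl or log export.
sample_urls = [
    "http://www.example.com/widgets/",
    "http://www.example.com/widgets/?sort=price",
    "http://www.example.com/widgets/?sort=name&page=2",
    "http://www.example.com/about/",
]

variants = group_parameter_variants(sample_urls)
for base, dupes in variants.items():
    print(base, "has", len(dupes), "parameter variants")
```

Each key in the result is a candidate target for the canonical tag, assuming the variants really do serve the same content.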
Also check for parameters in your analytics program. They could actually be messing up your pageview data, depending on how you report. There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status. Are there a lot of URLs removed? If so, check on these pages and see why they're removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
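If the crawler export is too large to eyeball, a rough filter can surface URLs that look like loop output. This is a heuristic sketch only: the function names and the max_depth threshold of 8 are my own assumptions, not settings from Xenu or Screaming Frog, and a repeated path segment can be perfectly legitimate on some sites.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_segments(url: str):
    """Split a URL path into its non-empty segments."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]


def looks_like_crawl_loop(url: str, max_depth: int = 8) -> bool:
    """Flag URLs that are very deep or that repeat a path segment.

    Both signals are arbitrary assumptions for illustration.
    """
    segments = path_segments(url)
    if len(segments) > max_depth:
        return True
    most_common = Counter(segments).most_common(1)
    return bool(most_common) and most_common[0][1] > 1


# Hypothetical sample; in practice this would be the crawler's URL export.
crawled = [
    "http://www.example.com/blog/post-1/",
    "http://www.example.com/blog/tag/blog/tag/blog/",
]

suspects = [url for url in crawled if looks_like_crawl_loop(url)]
print(suspects)
```

Anything it flags still needs a manual look; the point is just to shrink the list you have to check by hand.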
-
Rob,
Your sitemap is but an indication to Google about URLs on your domain. The sitemap does not limit Google to crawling or indexing only the URLs listed in it, nor is it a directive that tells Google to remove URLs it has already crawled from the index. As stated in GWT, use **robots.txt** to specify how search engines should crawl your site, or request **removal** of URLs from Google's search results with the URL removal tool in Google Webmaster Tools, under the "Google Index" link.
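Before uploading robots.txt rules, it can help to check which URLs they would actually block. Here is a small sketch using Python's standard urllib.robotparser; the Disallow paths are placeholders I made up, not a recommendation for this particular site, and is_crawlable is just an illustrative helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules; the paths are placeholders, not taken from the site in question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
Disallow: /tag/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


for url in (
    "http://www.example.com/widgets/",
    "http://www.example.com/tag/seo/",
):
    print(url, is_crawlable(url))
```

Note that this parser does simple path-prefix matching, so it may not mirror how Google treats wildcard patterns. Treat it as a sanity check rather than the final word.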