Total Indexed 1.5M vs 83k submitted by sitemap. What?
-
We recently took a good look at one of our content site's sitemap and tried to cut out a lot of crap that had gotten in there such as .php, .xml, .htm versions of each page. We also cut out images to put in a separate image sitemap.
The sitemap generated 83,000+ URLs for google to crawl (this partially used the Yoast Wordpress plugin to generate)
In webmaster tools in the index status section is showing that this site has a total index of 1.5 million.
With our sitemap coming back with 83k and google indexing 1.5 million pages, is this a sign of a CMS gone rogue? Is it an indication that we could be pumping out error pages or empty templates, or junk pages that we're cramming into Google's bot?
I would love to hear what you guys think. Is this normal? Is this something to be concerned about? Should our total index more closely match our sitemap page count?
-
As well as parameters mentioned you may possibly have heaps of duplicating categories, tags etc. What I would also do is start searching Google with something like site:www.example.com/directory/ or possibly site:www.example.com/category/directory/directory/ so you are tightly narrowing down the results, switch to 100 results per page and manually look for clues.
-
If you have 1.5 million pages and you think your sitemap is comprehensive at 83,000 then yes, your CMS is needlessly generating pages. It's usually not a big deal from a ranking standpoint, but it can make other important issues hard to detect. I would clean it up, but that's a business call you'll have to make.
The first step is diagnosing where are the URLs are coming from. What you do next will depend, but I will give you the best advice I can without knowing what types of extraneous URLs you have and how Google is treating them:
First, I'd start with WMT > Crawl > URL Parameters. Quite often your CMS will generate URLs, and Google usually knows how to handle them. If there are a lot of URL parameters, Google them and see if they're exactly the same as other pages. If they are, make sure you have canonical tags in place to point them to the main version. There's more you can do with parameters, but it'll depend on what you find so I won't go into more detail. As a general rule, though, a CMS should not generate a page unless it is uniquely useful as differentiated landing page or a page for people to link to.
Also check for parameters in your analytics program. They could actually be messing up your pageview data depending on how you report.There's a post on fixing that in GA here:
http://blog.crazyegg.com/2013/03/29/remove-url-parameters-from-google-analytics-reports/
Next I'd look at the "Advanced" tab in WMT > Google Index > Index Status . Are there a lot of URLs removed? If so, check on these pages and see why they're removed and why they exist.
I would also run a crawl with Xenu and Screaming Frog to make sure crawlers are finding a reasonable number of pages and that they're not getting stuck in crawl loops. (crawling variations of a page endlessly). These kinds of issues can prevent new pages from being indexed on time because Google is wasting time (your crawl budget) running in circles.
-
Rob,
Your sitemap is but an indication to Google about urls on your domain. The sitemap does not limit google to crawling or indexing only the urls listed on it, nor is it a directive that tells google to remove urls from the index that it has already crawled. As stated in GWT, use **robots.txt **to specify how search engines should crawl your site, or request **removal **of URLs from Google's search results with the URL removal tool Google webmaster tools under the "google index" link.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Specific page does not index
Hi, First question: Working on the indexation of all pages for a specific client, there's one page that refuses to index. Google Search console says there's a robots.txt file, but I can't seem to find any tracks of that in the backend, nor in the code itself. Could someone reach out to me and tell me why this is happening? The page: https://www.brody.be/nl/assistentiewoningen/ Second question: Google is showing another meta description than the one our client gave in in Yoast Premium snippet. Could it be there's another plugin overwriting this description? Or do we have to wait for it to change after a specific period of time? Hope you guys can help
Intermediate & Advanced SEO | | conversal0 -
Editing A Sitemap
Would there be any positive effect from editing a site map down to a more curated list of pages that perform, or that we hope they begin to perform, in organic search? A site I work with has a sitemap with about 20,000 pages that is automatically created out of a Drupal plugin. Of those pages, only about 10% really produce out of search. There are old sections of the site that are thin, obsolete, discontinued and/or noindexed that are still on the sitemap. For instance, would it focus Google's crawl budget more efficiently or have some other effect? Your thoughts? Thanks! Best... Darcy
Intermediate & Advanced SEO | | 945010 -
Sitemap into SE
Hi Moz community experts, I have a question about the sitemap into search engine like here : http://i.imgur.com/gQ0JhuH.jpg. Do you know what I need to do to get the same structure or do decide which pages we want to present into our result. We created a new page and we would like to see it into the resultat when the visitor is searching for our branded keywords. Thank in advance for your support. gQ0JhuH.jpg.
Intermediate & Advanced SEO | | johncurlee0 -
Apps content Google indexation ?
I read some months back that Google was indexing the apps content to display it into its SERP. Does anyone got any update on this recently ? I'll be very interesting to know more on it 🙂
Intermediate & Advanced SEO | | JoomGeek0 -
To index or de-index internal search results pages?
Hi there. My client uses a CMS/E-Commerce platform that is automatically set up to index every single internal search results page on search engines. This was supposedly built as an "SEO Friendly" feature in the sense that it creates hundreds of new indexed pages to send to search engines that reflect various terminology used by existing visitors of the site. In many cases, these pages have proven to outperform our optimized static pages, but there are multiple issues with them: The CMS does not allow us to add any static content to these pages, including titles, headers, metas, or copy on the page The query typed in by the site visitor always becomes part of the Title tag / Meta description on Google. If the customer's internal search query contains any less than ideal terminology that we wouldn't want other users to see, their phrasing is out there for the whole world to see, causing lots and lots of ugly terminology floating around on Google that we can't affect. I am scared to do a blanket de-indexation of all /search/ results pages because we would lose the majority of our rankings and traffic in the short term, while trying to improve the ranks of our optimized static pages. The ideal is to really move up our static pages in Google's index, and when their performance is strong enough, to de-index all of the internal search results pages - but for some reason Google keeps choosing the internal search results page as the "better" page to rank for our targeted keywords. Can anyone advise? Has anyone been in a similar situation? Thanks!
Intermediate & Advanced SEO | | FPD_NYC0 -
Need help creating sitemap
Hello, The details of my question is sitemap related. Below is the background info: we are ecommerce site with around 4000 pages, and 20000 images. we dont have a sitemap implemented on our site yet. i have checked alot of sitemap tools out there, like g-sitecrawler, xml sitemap, a1 sitemap builder etc, and i tried to create sitemaps via them, but all them give different results. the major links are all there, but the results start to vary for level 2, level 3 links and so on. plus no matter how much i read up on sitemaps, the more i am getting confused. i read lots of seomoz articles on sitemaps, and due to my limited seo and technical knowledge, the extra information on these articles gets more confusing. i also just read an article on seomoz that instead of having one sitemap, having multiple smaller sitemaps is very good idea, specially if we are adding lots of new products (which we are). Now my question: My question is having understood the immense value of sitemap (and by having it very poorly implemented before), how can i make sure that i get a very good sitemap (both xml and html sitemap). i do not want to do something again and just repeat old mistakes by having a poorly implemented sitemap for our site. I am hoping that one of the professionals out there, can help me also make and implement the sitemap. If you can please point me to the right direction.
Intermediate & Advanced SEO | | kannu10 -
Directory VS Article Directory
Which got hit harder in penguin update. I was looking at SEER Interactive backlink profile (the SEO company that didn't rank for it's main keyword phrases) and noticed a pretty big trend on why it might not rank for its domain name. SEER was in a majority of anchor text, many coming from directories. i'm guessing THEY were effected because they matched the exact match domain link profile rule I'm not an expert programmer, but if i was playing "Google Programmer" I would think the Algo update went something like. If ((exact match domain) & (certain % anchor text==domain) & (certain % of anchor text== partial domain + services/company)) { tank the rankings } So back to the question, do you think that this update had a lot to do with directories, article directories, or neither. Is article directories still a legit way to get links. (not ezine)
Intermediate & Advanced SEO | | imageworks-2612900 -
Is there a FastTrack to re-index? a site?
Hello... i just started with a new client this week, before working with us his last domain-hosting-webdev provider cancel their account and took off the entire site and left them with a nice "under construction page" (NOT) and added the noindex, nofollow tags. 4 weeks after that, we come into the scene and of course our client it's expecting us to reinsert at least for branded terms the site, and he wants it done on a matter of hours... I tried my best to explain that it's not possible and we are doing everything we can't.... now i ask you guys.. I already created de GWT account, Created a well structured Sitemap and submitted it to google and bing, did the onpage optimizitation at least the basics... there is a way to speed up the process? kind of like "hey you! google bot, forget about the noindex nonsense a come crawl again?" Any help would be great Daniel
Intermediate & Advanced SEO | | daniel.alvarez0