Should all pages on a site be included in either your sitemap or robots.txt?

RossFruin

I don't have any specific scenario here, but I'm curious, as I fairly often come across sites that have, for example, 20,000 pages but only 1,000 in their sitemap. If they consider only 1,000 of their URLs worth including in the sitemap and having indexed, should the others be excluded using robots.txt or a page-level exclusion? Is there a point to having pages that are included in neither, leaving it up to Google to decide?

RossFruin

Thanks guys!

CleverPhD

You bet - Cheers!

Ron_McCabe

CleverPhD,

You are correct. I have found that these little housekeeping issues like eliminating duplicate content really do make a big difference.

Ron

CleverPhD

I think Ron's point was that if you have a bunch of duplicates, the dups are not "real" pages. So if you only count "real" pages, Google can have more pages indexed than you count, because it indexes your "real" pages plus the duplicate versions of them. The issue then is that you have duplicate versions of the same page in Google's index, so which one will rank for a given key term? You could be competing against yourself. That is why it is so important to deal with crawl issues.
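To make the counting concrete, here is a rough Python sketch of how several indexed URLs can collapse into one "real" page. The example.com URLs and the list of ignored query parameters are made up for illustration; which parameters are safe to ignore depends entirely on your own site.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical query parameters assumed not to change the page content.
IGNORED_PARAMS = {"sort", "print", "utm_source", "utm_medium"}


def normalize(url: str) -> str:
    """Reduce a URL to the form we treat as the 'real' page."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    # Treat /widgets and /widgets/ as the same page.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(sorted(kept)), "")
    )


def group_duplicates(urls):
    """Map each 'real' page to every URL variant that points at it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return dict(groups)


# Made-up sample of URLs that might show up as indexed.
indexed = [
    "https://example.com/widgets",
    "https://example.com/widgets/",
    "https://example.com/widgets?sort=price",
    "https://example.com/widgets?print=1",
    "https://example.com/about",
]

groups = group_duplicates(indexed)
print(f"{len(indexed)} indexed URLs, {len(groups)} real pages")
```

With this sample, five indexed URLs map onto two real pages, which is the gap between "pages indexed" and "actual pages" being discussed.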

RossFruin

Thank you. Just curious, how would the number of pages indexed be higher than the number of actual pages?

Ron_McCabe

I think you are looking at the number of pages indexed, which is generally higher than the number of pages on your web site. There is a point to marking up any pages that you do not want indexed (with a noindex; a nofollow by itself does not keep a page out of the index), as well as properly marking up the pages that you do specifically want indexed. It is really important that you eliminate duplicate pages. A common source of these duplicates is improper tags on the blog. Make sure that your tags are set up in a logical hierarchy like your site map. This will assist the search engines when they re-index your pages.

Hope this helps,

Ron

CleverPhD

You want to have as many pages in the index as possible, as long as they are high-quality pages with original content. If you publish quality original articles on a regular basis, you want to have all those pages indexed. Yes, from a practical perspective you may only be able to focus on tweaking the SEO on a portion of them, but if you have good SEO processes in place as you produce those pages, they will rank long term for a broad range of terms and bring traffic.

If you have 20,000 pages because you have an online catalog with 345 different ways to sort the same set of results, or you have keyword search URLs, printer-friendly versions, or shopping cart pages, you do not want those indexed. These pages are typically low-quality/thin-content pages and/or duplicates, and they do you no favors. You would want to use the noindex meta tag or a canonical where appropriate. The reality is that out of the 20,000 pages, probably only a subset are the "originals", and you don't want to waste Google's time crawling the rest.
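As a rough illustration of "noindex or canonical where appropriate", here is a small Python sketch that picks a head tag for a URL. The path prefixes, parameter names, and the choice of which case gets which tag are my assumptions for the example, not a rule; you would adjust them to how your own catalog builds its URLs.

```python
from urllib.parse import parse_qs, urlsplit, urlunsplit

# Hypothetical URL patterns; adjust to your own site.
NOINDEX_PREFIXES = ("/cart", "/search")  # thin pages with no original version
DUPLICATE_PARAMS = {"sort", "print"}     # variants of an existing original page


def head_tag(url: str) -> str:
    """Return the tag to place in <head> for this URL, or "" for an original page."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)

    if parts.path.startswith(NOINDEX_PREFIXES):
        return '<meta name="robots" content="noindex">'

    if params.keys() & DUPLICATE_PARAMS:
        # Assumes the original page is the same URL without its query string.
        original = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        return f'<link rel="canonical" href="{original}">'

    return ""


for url in (
    "https://example.com/cart/checkout",
    "https://example.com/widgets?sort=price",
    "https://example.com/widgets",
):
    print(url, "->", head_tag(url) or "(no extra tag)")
```

Keep in mind that a noindex tag only works if Google is allowed to crawl the page and see it, so those URLs should not also be blocked in robots.txt.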

A good concept to look up here is Crawl Budget, or Crawl Optimization:

http://searchengineland.com/how-i-think-crawl-budget-works-sort-of-59768

http://www.blindfiveyearold.com/crawl-optimization
