Should all pages on a site be included in either your sitemap or robots.txt?
-
I don't have any specific scenario here but just curious as I come across sites fairly often that have, for example, 20,000 pages but only 1,000 in their sitemap. If they only think 1,000 of their URL's are ones that they want included in their sitemap and indexed, should the others be excluded using robots.txt or a page level exclusion? Is there a point to having pages that are included in neither and leaving it up to Google to decide?
-
Thanks guys!
-
You bet - Cheers!
-
Clever PHD,
You are correct. I have found that these little housekeeping issues like eliminating duplicate content really do make a big difference.
Ron
-
I thinks Ron's point was that if you have a bunch of duplicates, the dups are not "real" pages, if you are only counting "real" pages. Therefore, if Google indexes your "real" pages and the dup versions of them, you can have more pages indexed. That is the issue then that you have duplicate versions of the same page in Google's index and so which will rank for a given key term? You could be competing against yourself. That is why it is so important you deal with crawl issues.
-
Thank you. Just curious, how would the number of pages indexed be higher than the number of actual pages?
-
I think you are looking at the pages indexed which is generally a higher number than those on your web site. There is a point to marking things up so that there is a no follow on any pages that you do not want indexed as well as properly marking up the web pages that you do specifically want indexed. It is really important that you eliminate duplicate pages. A common source of these duplicates is improper tags on the blog. Make sure that your tags are set up in a logical hierarchy like your site map. This will assist the search engines when they re index your page.
Hope this helps,
Ron
-
You want to have as many pages in the index as possible, as long as they are high quality pages with original content - if you publish quality original articles on a regular basis, you want to have all those pages indexed. Yes, from a practical perspective you may only be able to focus on tweaking the SEO on a portion of them, but if you have good SEO processes in place as you produce those pages, they will rank long term for a broad range of terms and bring traffic..
If you have 20,000 pages as you have an online catalog and you have 345 different ways to sort the same set of page results, or if you have keyword search URLs, or printer friendly version pages or your shopping cart pages, you do not want those indexed. These pages are typically, low quality/thin content pages and/or are duplicates and those do you no favor. You would want to use the noindex meta tag or canonical where appropriate. The reality is that out of the 20,000 pages, there are probably only a subset that are the "originals" and so you dont want to waste Googles time in crawling those pages.
A good concept here to look up is Crawl Budget or Crawl Optimization
http://searchengineland.com/how-i-think-crawl-budget-works-sort-of-59768
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
E-Commerce Site Collection Pages Not Being Indexed
Hello Everyone, So this is not really my strong suit but I’m going to do my best to explain the full scope of the issue and really hope someone has any insight. We have an e-commerce client (can't really share the domain) that uses Shopify; they have a large number of products categorized by Collections. The issue is when we do a site:search of our Collection Pages (site:Domain.com/Collections/) they don’t seem to be indexed. Also, not sure if it’s relevant but we also recently did an over-hall of our design. Because we haven’t been able to identify the issue here’s everything we know/have done so far: Moz Crawl Check and the Collection Pages came up. Checked Organic Landing Page Analytics (source/medium: Google) and the pages are getting traffic. Submitted the pages to Google Search Console. The URLs are listed on the sitemap.xml but when we tried to submit the Collections sitemap.xml to Google Search Console 99 were submitted but nothing came back as being indexed (like our other pages and products). We tested the URL in GSC’s robots.txt tester and it came up as being “allowed” but just in case below is the language used in our robots:
Intermediate & Advanced SEO | | Ben-R
User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /orders
Disallow: /checkout
Disallow: /9545580/checkouts
Disallow: /carts
Disallow: /account
Disallow: /collections/+
Disallow: /collections/%2B
Disallow: /collections/%2b
Disallow: /blogs/+
Disallow: /blogs/%2B
Disallow: /blogs/%2b
Disallow: /design_theme_id
Disallow: /preview_theme_id
Disallow: /preview_script_id
Disallow: /apple-app-site-association
Sitemap: https://domain.com/sitemap.xml A Google Cache:Search currently shows a collections/all page we have up that lists all of our products. Please let us know if there’s any other details we could provide that might help. Any insight or suggestions would be very much appreciated. Looking forward to hearing all of your thoughts! Thank you in advance. Best,0 -
Competing Pages on Ecommerce Site - Very Frustrating
We have multiple issues with this situation. We rank #1 for "Lace Fabric", #3 for "Lace Trim", and #80 for "Lace". We also rank for "Lace Ribbon", and "Lace Appliques". The Lace Fabric and Lace Trim pages have plenty of backlinks, wherein may lie the problem. We have a similar issue for "Satin". "Silk Satin", "Polyester Satin", "Satin Trim", "Satin Ribbon", etc. This is a very annoying and common pattern. Our backlink profile is sterling, and our competitors with inferior backlink profiles and branded search are outranking us. We outrank them across the board for 2 word terms. Based on my evaluation of TF/CF, PA/DA, Content, etc., we should be on page 1 for "Lace". IMHO, these pages are competing for the head term. Any ideas on how to eliminate this issue to rank for head terms?
Intermediate & Advanced SEO | | GWMSEO0 -
Robots.txt Disallowed Pages and Still Indexed
Alright, I am pretty sure I know the answer is "Nothing more I can do here." but I just wanted to double check. It relates to the robots.txt file and that pesky "A description for this result is not available because of this site's robots.txt". Typically people want the URL indexed and the normal Meta Description to be displayed but I don't want the link there at all. I purposefully am trying to robots that stuff outta there.
Intermediate & Advanced SEO | | DRSearchEngOpt
My question is, has anybody tried to get a page taken out of the Index and had this happen; URL still there but pesky robots.txt message for meta description? Were you able to get the URL to no longer show up or did you just live with this? Thanks folks, you are always great!0 -
Our parent company has included their sitemap links in our robots.txt file - will that have an impact on the way our site is crawled?
Our parent company has included their sitemap links in our robots.txt file. All of their sitemap links are on a different domain and I'm wondering if this will have any impact on our searchability or potential rankings.
Intermediate & Advanced SEO | | tsmith1310 -
Landing pages, are my pages competing?
If I have identified a keyword which generates income and when searched in google my homepage comes up ranked second, should I still create a landing page based on that keyword or will it compete with my homepage and cause it to rank lower?
Intermediate & Advanced SEO | | The_Great_Projects0 -
When Mobile and Desktop sites have the same page URLs, how should I handle the 'View Desktop Site' link on a mobile site to ensure a smooth crawl?
We're about to roll out a mobile site. The mobile and desktop URLs are the same. User Agent determines whether you see the desktop or mobile version of the site. At the bottom of the page is a 'View Desktop Site' link that will present the desktop version of the site to mobile user agents when clicked. I'm concerned that when the mobile crawler crawls our site it will crawl both our entire mobile site, then click 'View Desktop Site' and crawl our entire desktop site as well. Since mobile and desktop URLs are the same, the mobile crawler will end up crawling both mobile and desktop versions of each URL. Any tips on what we can do to make sure the mobile crawler either doesn't access the desktop site, or that we can let it know what is the mobile version of the page? We could simply not show the 'View Desktop Site' to the mobile crawler, but I'm interested to hear if others have encountered this issue and have any other recommended ways for handling it. Thanks!
Intermediate & Advanced SEO | | merch_zzounds0 -
If I only Link to Page via Sitemap, can it still get indexed?
Hi there! I am creating a ton of content for specific geographies. Is it possible for these pages to get indexed if I only put them in my sitemap and don't link to them through my actual site (though the pages will be live). Thanks!
Intermediate & Advanced SEO | | Travis-W
Travis0 -
Is there a maximum amount of pages that should be added on a sitemap daily?
I started a new music site that has a database of 8,000,000 songs and 500,000+ artists that we are cross referencing with free & legal content sources. Each song essentially has its own page. We are about to start adding links to a sitemap and wanted to find the best practices. Should we add all 8,000,000+ links at once? Should we add a maximum amount a day? Maybe max 5,000? What are the pros and cons of slowly adding the pages or adding them all at once. Any risks? At the rate google is crawling our page it will take 8 years to have all of our songs indexed (It would be very hard to crawl all of our songs as our system is more of an app). I wan't to play it safe and not do anything that will come off as spammy. I have been trying to find some actual evidence on what the best course of action is. Thanks in Advance!
Intermediate & Advanced SEO | | mikecrib10