Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Will an XML sitemap override a robots.txt?
-
I have a client whose robots.txt file is blocking an entire subdomain, entirely by accident. Their original solution, not realizing the robots.txt error, was to submit an XML sitemap to get their pages indexed.
I did not think this tactic would work, as the robots.txt would take precedence over the XML sitemap. But it worked... I have no explanation as to how or why.
Does anyone have an answer to this? Or any experience with a website that has had a clear Disallow: / for months, yet somehow has pages in the index?
-
The robots.txt file stops Google from showing further information about the disallowed pages, but it doesn't prevent indexation.
They're still indexed (that's why you're seeing them), but with no meta description or text taken from the page, because Google wasn't allowed to retrieve more information.
If you want them to start showing info, you'll just need to remove that rule from the robots.txt, and soon you'll start seeing those pages' information. If you want them out of the index instead, you can use GWT to remove them after you've added the noindex meta tag to each page, which is the only directive that will prevent indexation.
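To check whether a page actually carries the noindex meta tag mentioned above, something like the following rough Python sketch could be used. The names `RobotsMetaParser`, `has_noindex`, and `SAMPLE_PAGE` are made up for illustration, and the markup is a hypothetical page, not the client's. Keep in mind that a crawler can only see this tag on pages it is allowed to fetch.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so default to "".
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for directive in attr_map.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the markup contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page markup, used only for illustration.
SAMPLE_PAGE = (
    '<html><head><meta name="robots" content="noindex, follow">'
    "</head><body></body></html>"
)
print(has_noindex(SAMPLE_PAGE))
```

This only inspects the meta tag in the HTML; it does not look at HTTP headers or at what a search engine has actually indexed.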
-
I assumed the same thing, but I ran a site: command search while they were prospects, and they had 1 result present with the explanation "A description for this result is not available because of this site's robots.txt – learn more".
They uploaded an XML sitemap before I could tell them to remove the robots.txt, and 1 week later the entire site is now in the index.
I have used robots.txt to properly block websites before; it usually takes 2-3 for all results to drop out of the index, so I don't know how that could explain it either.
-
I agree, the only way I could think this would work would be if the robots.txt file was on the root domain. Also check Webmaster Tools; the sitemaps section will tell you about it with "Error: URL was blocked by robots.txt".
One thing to remember is that robots.txt is technically a suggestion asking search engines not to crawl your site. They can choose to ignore it, though personally I don't know of any cases in which this happened.
-
An XML sitemap shouldn't override robots.txt. If you have Google Webmaster Tools set up, you will see warnings on the sitemaps page that pages blocked by robots.txt are being submitted.
Now, robots.txt does not prevent indexation, just crawling. So if the pages were indexed before they implemented robots.txt, they may continue to be indexed. Google will also display just the URL for pages that it has discovered but can't crawl because of robots.txt.
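If you want to reproduce that kind of warning yourself, here is a minimal Python sketch that lists the sitemap URLs a given robots.txt would block from crawling. It uses the standard library's `urllib.robotparser`, which is only an approximation of how Google evaluates rules. The function name `blocked_sitemap_urls`, the `Googlebot` user agent string, and the example robots.txt and sitemap contents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical inputs: a robots.txt that blocks the whole host, and a
# small sitemap for the same (made-up) subdomain.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://sub.example.com/</loc></url>
  <url><loc>https://sub.example.com/products/widget</loc></url>
</urlset>
"""


def blocked_sitemap_urls(robots_txt, sitemap_xml, user_agent="Googlebot"):
    """Return the sitemap URLs that robots_txt disallows for user_agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # can_fetch() answers "may this agent crawl the URL?", not "is it indexed?".
    return [url for url in urls if not rules.can_fetch(user_agent, url)]


for url in blocked_sitemap_urls(ROBOTS_TXT, SITEMAP_XML):
    print("Blocked by robots.txt:", url)
```

A blocked URL in this list can still show up in the index as a bare URL, as described above; the check only tells you the crawler is not permitted to fetch it.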