Is the robots meta tag more reliable than robots.txt at preventing indexing by Google?
-
What's your experience of using the robots meta tag vs. robots.txt as a standalone solution to prevent Google indexing?
I am pretty sure the robots meta tag is more reliable. Going on my own experience, I have never had any problems with robots meta tags, but plenty with robots.txt as a standalone solution.
Thanks in advance, Luke
-
Hi there,
Regarding the X-Robots-Tag: we have had a couple of sites that were disallowed in robots.txt have their PDF, DOC and similar files get indexed. I understand the reasoning for this. I would like to remove the disallow in robots.txt and use the X-Robots-Tag to noindex all pages as well as PDF, DOC files etc. This is for an nginx configuration. Does anyone know what the written X-Robots-Tag would look like in this case?
-
Test what works for your site.
Use the tools below:
- https://www.deepcrawl.com/ (will give you one free full crawl)
- https://www.screamingfrog.co.uk/seo-spider/ (free up to 500 URLs)
- http://urlprofiler.com/ (14-day free trial)
- https://www.deepcrawl.com/blog/best-practice/noindex-disallow-nofollow/
- https://www.screamingfrog.co.uk/seo-spider/user-guide/general/#robots-txt
- https://www.deepcrawl.com/blog/best-practice/noindex-and-google/
So much info
https://www.deepcrawl.com/blog/tag/robots-txt/
Thomas
-
Hi Luke,
In order to exclude individual pages from search engine indices, the noindex meta tag is actually superior to robots.txt.
The X-Robots-Tag HTTP header is the most flexible option, but it is much harder to use.
Block all web crawlers from all content:
User-agent: *
Disallow: /
Using the robots.txt file, you can tell a spider where it cannot go on your site. You cannot tell a search engine which URLs it cannot show in the search results. This means that not allowing a search engine to crawl a URL (called "blocking" it) does not mean that URL will not show up in the search results. If the search engine finds enough links to that URL, it will include it; it will just not know what's on that page.
If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to crawl that page and find the noindex tag, so the page should not be blocked by robots.txt.
In a nutshell, what a robots.txt file does is tell search engines not to crawl a particular page, file or directory of your website. Using this helps both you and search engines such as Google. By not providing access to certain, unimportant areas of your website, you can save on your crawl budget and reduce load on your server.
Please note that using the robots.txt file to hide your entire website from search engines is definitely not recommended.
See big photo: http://i.imgur.com/MM7hM4g.png
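The point that a noindex tag only works on pages crawlers are allowed to fetch can be checked with a short script. This is only a minimal sketch using Python's standard urllib.robotparser module. The function name blocked_noindex_urls, the sample rules and the example.com URLs are made up for illustration, and Python's parser may not match Googlebot's own matching rules in every case.

```python
from urllib.robotparser import RobotFileParser


def blocked_noindex_urls(robots_txt, urls, agent="Googlebot"):
    """Return the URLs that robots.txt blocks for the given user-agent.

    A crawler that may not fetch a URL can never see a noindex
    directive on it, so these URLs can still end up in the index.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(agent, url)]


# Hypothetical rules and URLs, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

SAMPLE_URLS = [
    "https://example.com/private/report.pdf",
    "https://example.com/public/page.html",
]

if __name__ == "__main__":
    for url in blocked_noindex_urls(SAMPLE_ROBOTS_TXT, SAMPLE_URLS):
        print("noindex can never be seen on:", url)
```

Any URL this reports would need its Disallow rule lifted before a noindex tag or header on it could take effect.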
<!DOCTYPE html>
<html><head>
<meta name="robots" content="noindex" />
(…)
</head>
<body>(…)</body>
</html>
The robots meta tag in the above example instructs all search engines not to show the page in search results. The value of the name attribute (robots) specifies that the directive applies to all crawlers. To address a specific crawler, replace the robots value of the name attribute with the name of the crawler that you are addressing. Specific crawlers are also known as user-agents (a crawler uses its user-agent to request a page). Google's standard web crawler has the user-agent name Googlebot.
To prevent only Googlebot from indexing your page, update the tag as follows:
<meta name="googlebot" content="noindex" />
This tag now instructs Google (but no other search engines) not to show this page in its web search results. Both the name and the content attributes are non-case sensitive.
Search engines may have different crawlers for different properties or purposes. See the complete list of Google's crawlers. For example, to show a page in Google's web search results, but not in Google News, use the following meta tag:
<meta name="googlebot-news" content="noindex" />
If you need to specify multiple crawlers individually, it's okay to use multiple robots meta tags:
<meta name="googlebot" content="noindex">
<meta name="googlebot-news" content="nosnippet">
If competing directives are encountered by our crawlers we will use the most restrictive directive we find.
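To see how the name and content attributes combine in practice, here is a rough sketch that reads the robots meta tags from a page and reports whether a given crawler is told not to index it. It uses only Python's standard html.parser. The class and function names are hypothetical, and treating any matching noindex (or none) as decisive is a simplification of the "most restrictive directive" behaviour described above, not Google's actual logic.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of every <meta> tag, keyed by its lowercased name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").strip().lower()
        if name:
            # Directives are comma-separated and compared case-insensitively.
            values = [v.strip().lower() for v in attr.get("content", "").split(",")]
            self.directives.setdefault(name, set()).update(v for v in values if v)


def is_noindexed(html, crawler="googlebot"):
    """True if a generic "robots" tag or the crawler's own tag says noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    generic = parser.directives.get("robots", set())
    specific = parser.directives.get(crawler.lower(), set())
    # "none" is shorthand for "noindex, nofollow".
    return bool((generic | specific) & {"noindex", "none"})


if __name__ == "__main__":
    page = '<html><head><meta name="googlebot-news" content="noindex" /></head></html>'
    print(is_noindexed(page, "googlebot"))       # web search crawler
    print(is_noindexed(page, "googlebot-news"))  # news crawler
```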
If you want to really hide something from the search engines, and thus from people using search, robots.txt won't suffice.
Indexer directives
Indexer directives are directives that are set on a per page and/or per element basis. Up until July 2007, there were two directives: the microformat rel="nofollow", which means that the link should not pass authority / PageRank, and the Meta Robots tag.
With the Meta Robots tag, you can really prevent search engines from showing pages you want to keep out of the search results. The same result can be achieved with the X-Robots-Tag HTTP header. As described earlier, the X-Robots-Tag gives you more flexibility by also allowing you to control how specific file types are indexed.
Example uses of the X-Robots-Tag
Using the X-Robots-Tag HTTP header
The X-Robots-Tag can be used as an element of the HTTP header response for a given URL. Any directive that can be used in a robots meta tag can also be specified as an X-Robots-Tag. Here's an example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noindex
(…)
Multiple X-Robots-Tag headers can be combined within the HTTP response, or you can specify a comma-separated list of directives. Here's an example of an HTTP header response which has a noarchive X-Robots-Tag combined with an unavailable_after X-Robots-Tag:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 25 Jun 2010 15:00:00 PST
(…)
The X-Robots-Tag may optionally specify a user-agent before the directives. For instance, the following set of X-Robots-Tag HTTP headers can be used to conditionally allow showing of a page in search results for different search engines:
HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
(…)
X-Robots-Tag: googlebot: nofollow
X-Robots-Tag: otherbot: noindex, nofollow
(…)
Directives specified without a user-agent are valid for all crawlers. The section below demonstrates how to handle combined directives. Both the name and the specified values are not case sensitive.
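The header forms above (plain directives, comma-separated lists, and an optional user-agent prefix) can be told apart with a small parser. The sketch below is an illustration only: parse_x_robots_tag and VALUED_DIRECTIVES are hypothetical names, and it assumes unavailable_after is the only directive that carries its own colon-separated value, which is true of the examples in this thread but may not cover every directive a search engine supports.

```python
# Directives that take a value after a colon, so the colon is not a
# user-agent separator. Assumption: only unavailable_after is handled.
VALUED_DIRECTIVES = {"unavailable_after"}


def parse_x_robots_tag(values):
    """Map each user-agent ("*" means all crawlers) to its list of directives.

    `values` is an iterable of raw X-Robots-Tag header values, one per header.
    """
    rules = {}
    for raw in values:
        agent, rest = "*", raw.strip()
        head, sep, tail = rest.partition(":")
        # A leading "name:" is a user-agent unless it is a valued directive.
        if sep and head.strip().lower() not in VALUED_DIRECTIVES:
            agent, rest = head.strip().lower(), tail.strip()
        if rest.partition(":")[0].strip().lower() in VALUED_DIRECTIVES:
            # Keep the date value intact rather than splitting on commas.
            directives = [rest]
        else:
            directives = [d.strip().lower() for d in rest.split(",") if d.strip()]
        rules.setdefault(agent, []).extend(directives)
    return rules


if __name__ == "__main__":
    print(parse_x_robots_tag(["googlebot: nofollow", "otherbot: noindex, nofollow"]))
    print(parse_x_robots_tag(["noarchive", "unavailable_after: 25 Jun 2010 15:00:00 PST"]))
```

Feeding it the header values from a real response (for example a PDF URL) is a quick way to confirm that a server-level noindex is actually being sent.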
- https://moz.com/learn/seo/robotstxt
- https://yoast.com/ultimate-guide-robots-txt/
- https://moz.com/blog/the-wonderful-world-of-seo-metatags
- https://yoast.com/x-robots-tag-play/
- https://www.searchenginejournal.com/x-robots-tag-simple-alternate-robots-txt-meta-tag/67138/
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag
I hope this helps,
Tom
-
If you've recently added the "noindex" meta tag, get the page fetched in Google Webmaster Tools (GWT). Google can't act on the tag if it doesn't see it.
-
Hi Luke,
It's a pretty common misconception that robots.txt will prevent indexing. Its only purpose is actually to prevent crawling; anything disallowed in there is still up for indexing if it's linked to from elsewhere. If you want something deindexed, your best bet is the robots meta tag, but make sure you allow crawling of the URLs to give search engine bots an opportunity to see the tag.