Meta NoIndex tag and Robots Disallow
-
Hi all,
I hope you can spare some time to answer the first of a few questions.
We are running a Magento site - the layered/faceted navigation nightmare has created thousands of duplicate URLs!
Anyway, while tackling the issue, I disallowed in robots.txt any URL with a query-string parameter other than p (which I allowed for pagination).
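The rule described above (any query-string parameter other than p marks a duplicate) can be sketched in Python. This is only an illustration: the function name is_facet_duplicate and the example URLs are hypothetical, and it assumes p is the only parameter that should stay crawlable.

```python
from urllib.parse import urlparse, parse_qs

# Assumption: "p" (pagination) is the only query parameter that is allowed.
ALLOWED_PARAMS = {"p"}


def is_facet_duplicate(url: str) -> bool:
    """Return True if the URL has any query parameter other than the allowed ones."""
    query = urlparse(url).query
    params = parse_qs(query, keep_blank_values=True)
    return any(name not in ALLOWED_PARAMS for name in params)


# Hypothetical example URLs, not taken from a real site.
for example in (
    "http://www.mydomain.com/shoes.html",
    "http://www.mydomain.com/shoes.html?p=2",
    "http://www.mydomain.com/shoes.html?color=red&p=2",
):
    print(example, is_facet_duplicate(example))
```

A check like this could be used to decide which URLs get the meta noindex tag, rather than which URLs get blocked in robots.txt.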
After checking some pages in Google, I ran a site:www.mydomain.com/specificpage.html search, and a few duplicates came up along with the original, showing the message:
"There is no information about this page because it is blocked by robots.txt"
I had also added meta noindex, follow to all these duplicates, but I guess it wasn't being read because of robots.txt.
So coming to my question.
-
Did robots.txt block access to these pages? If so, were they already in the index, so that after I disallowed them in robots.txt, Googlebot could no longer read the meta noindex?
-
Does meta noindex, follow on pages actually help Googlebot decide to remove these pages from the index?
I thought robots.txt would stop and prevent indexation? But I've read this:
"Noindex is a funny thing, it actually doesn’t mean “You can’t index this”, it means “You can’t show this in search results”. Robots.txt disallow means “You can’t index this” but it doesn’t mean “You can’t show it in the search results”."
I'm a bit confused about how to use these, both in preventing duplicate content in the first place and in addressing dupe content once it's already in the index.
Thanks!
B
-
-
There's no real way to estimate how long the re-crawl will take, Ben. You can get a bit of an idea by looking at the crawl rate reported in Google Webmaster Tools.
Yes, asking for a page fetch and then submitting with linked pages for each of the main website sections can help speed up crawl discovery. In addition, make sure you've submitted a current sitemap and that it's being found correctly (also reported in GWT). You should also do the same in Bing Webmaster Tools. Too many sites forget about optimizing for Bing - even if it's only 20% of Google's traffic, there's no point throwing it away.
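For the sitemap step, here is a minimal Python sketch that builds a sitemap listing only the URLs you actually want indexed. The build_sitemap helper and the URLs are hypothetical; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical list of clean URLs (no faceted-navigation duplicates).
wanted_urls = [
    "http://www.mydomain.com/shoes.html",
    "http://www.mydomain.com/shoes.html?p=2",
]
print(build_sitemap(wanted_urls))
```

Keeping the unwanted duplicate URLs out of the sitemap avoids sending the engines a signal that conflicts with the noindex tag.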
Lastly, earning some new links to different sections of the site is another great signal. This can often be effectively & quickly done using social media - especially Google+ as it gets crawled very quickly.
As far as your other question - yes, once you get the unwanted URLs out of the index, you can add the robots.txt disallow back in to optimise your crawl budget. I would strongly recommend you leave the meta-robots no-index tag in place though as a "belt & suspenders" approach to keep pages linking into those unwanted pages from triggering a re-indexing. It's OK to have both in place as long as the de-indexing has already been accomplished, as we've discussed.
Hope that answers your questions?
Paul
-
So once Google has started to see the meta-noindex and has finished de-indexing the pages, I would like to block it from crawling them with robots.txt to conserve my crawl budget.
But there are still internal links on the site that point to these URLs - would they get back into the index in this case?
-
Hi Paul,
Thank you for your detailed answer - so I'm not going crazy.
I did try canonicals, but then realized they are more of a suggestion than a directive. I am still correcting a lot of dupe content and 404s, so I imagine Google views the site as "these guys don't know what they are doing" and may have ignored the canonical suggestion.
So what I have done is remove the robots.txt block on the pages I want de-indexed and add meta noindex, follow to these pages. From what you are saying, they should naturally de-index, after which I will put the robots.txt block back on to keep my crawl budget spent on better areas of the site.
How long, in your opinion, can it take for Googlebot to de-index the pages? Can I help it along at all to speed it up? Fetch page and linking pages as Googlebot?
Thanks again,
Ben
-
You're right to be confused, B. The terminology is unfortunate and misleading.
To answer your questions:
1. Yes.
2. Yes.
A disallow in robots.txt does nothing to remove already-indexed pages. That's not its purpose. Its only purpose is to tell the search crawlers not to waste their time crawling those pages. Even if pages have been blocked in robots, they will remain in the index if already there. Even if never crawled, and blocked in robots.txt, they can still end up indexed if some other indexed page links to them and the crawlers find those pages by following links. Again, nothing in a robots.txt disallow tells the engines to remove a page from the index, just not to waste time crawling it.
Put another way, the robots.txt disallow directive only disallows crawling - it says nothing about what to do if the page gets into the index in other ways.
The meta-robots no-index tag, however, explicitly tells the crawler: "If you arrive at this page, do not add it to the index. If it is already in the index, remove it."
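To make the tag concrete, here is a small Python sketch that reads the robots meta tag out of a page's HTML, roughly the way a crawler would have to before it could act on noindex. The class and function names are made up for illustration, and real crawlers handle many more cases than this.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the page's robots meta tag contains noindex."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical duplicate page carrying the tag discussed in this thread.
page = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(has_noindex(page))
```

The key point is that the directive lives inside the page itself, so it only works if the crawler is allowed to fetch the page.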
And yes - as you suspected - if pages are blocked in robots.txt, the crawler obeys and doesn't visit those pages, so it can't discover the no-index command to drop them from the index. Thus the only way a page could get dropped is if a crawler followed a link from an external site and discovered the page that way. A very inefficient way of trying to get all those pages out of the index.
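The interaction between the two can be sketched with Python's standard urllib.robotparser. Everything here is a simplified assumption for illustration: the /filter/ path, the sample pages, and the crawl_outcome helper are hypothetical, and the substring check for noindex stands in for real HTML parsing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a faceted-navigation path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /filter/
"""

# Hypothetical pages: path -> the robots meta tag found in the page's HTML.
PAGES = {
    "/filter/red-shoes.html": '<meta name="robots" content="noindex, follow">',
    "/shoes.html": '<meta name="robots" content="index, follow">',
}


def crawl_outcome(path, robots_txt=ROBOTS_TXT):
    """Describe what a well-behaved crawler would do with this path."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    # robots.txt is checked first; a blocked page is never downloaded.
    if not rules.can_fetch("Googlebot", path):
        return "blocked: page not fetched, noindex never seen"
    html = PAGES[path]
    # Simplified check; a real crawler parses the HTML properly.
    if "noindex" in html.lower():
        return "fetched: noindex seen, page can be dropped"
    return "fetched: page eligible for the index"


print(crawl_outcome("/filter/red-shoes.html"))
print(crawl_outcome("/filter/red-shoes.html", robots_txt=""))
```

With the disallow in place the duplicate is never fetched, so its noindex tag is never read; with the disallow removed, the same page is fetched and the tag can take effect.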
Bottom line - robots.txt is never the correct tool to deal with duplicate content issues. Its sole purpose is to keep the crawlers from wasting time on unimportant pages so they can spend more time finding (and therefore indexing) more important pages.
The three tools for dealing with duplicate content are meta-robots no-index tags in a page header, 301 redirects, and canonical tags. Which one to use depends on the architecture of your site, your intended purpose, and the site's technical limitations.
Hope that makes sense?
Paul