Robots.txt and canonical tag
-
In the SEOmoz post - http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts, it's being said -
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does it so happen that if a page is disallowed by robots.txt, spiders DO NOT read the html code ?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then ? If the canonical tag is on article page itself, will canonical tag will be seen ?
-
Daylan offered a great answer but I would like to add one exception. When crawlers from the major SEs visit your site they will honor your robots.txt file but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
-
Thanks Daylan for your quick response. I just wanted a second opinion that canonical tag will never be seen if a page is disallowed.
-
Thats correct in most cases:
It works likes this: a robot wants to vists a Web site URL, say http://www.example.com/welcome.html. Before it does so, it firsts checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
More information available here about:
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How long does it take for canonical tags to work
How long on average does it take for a canonical tag to work? Understand that canonicals are just a suggestion, but after adding a canonical tag and submitting the page via Google fetch, assuming Google follows the canonical, would you expect it to work after a day or two or does it take longer? We added canonicals to old PPC landing pages that are ranking organically, though our new landing pages (which we want to rank organically) are not identical and have a bit more content/features. They are similar though. Canonicals were added to the old pages (pointing to new pages) and requested indexing via search console. Old pages are still ranking and new pages not so much. FYI we are unable to 301 old PPC pages due to other non negotiable reasons unfortunately. Thanks.
Technical SEO | | SoulSurfer80 -
301 Redirect Url Within a Canonical Tag
So this might sounds like a silly question... A client of mine has a duplicate content issue which will be fixed using canonical tags. We are also providing them with an updated URL structure meaning rwe will be having to do lots of 301 redirects. The URL structure is a much larger task that than the duplicate content so i planned to set up the canonicals first. Then it occurred to me id be updating the canonical tags with the urls from the old structure which brings me to my question. Will the canonical tags with the old urls redirect credit to the new urls with the 301? Or should i just wait until we have the new url structure in place and use these new urls in the canonicals? Thanks!
Technical SEO | | NickG-1230 -
Hreflang Tags - error: 'en' - no return tags
Hello, We have recently implemented Hreflang tags to improve the findability of our content in each specific language. However, Webmaster tool is giving us this error... Does anyone know what it means and how to solve it? Here I attach a screenshot: http://screencast.com/t/a4AsqLNtF6J Thanks for your help!
Technical SEO | | Kilgray0 -
HTTP Status showing up in opensiteexplorer top pages as blocked by robot.txt file
I am trying to find an answer to this question it has alot of url on this page with no data when i go into the data source and search for noindex or robot.txt but the site is visible in the search engines ?
Technical SEO | | ReSEOlve0 -
Canonical URL Tag: Confusing Use Case
We have a webpage that changes content each evening at mid-night -- let's call this page URL /foo. This allows a user to bookmark URL /foo and obtain new content each day. In our case, the content on URL /foo for a given day is the same content that exists on another URL on our website. Let's say the content for November 5th is URL /nov05, November 6th is /nov06 and so on. This means on November 5th, there are two pages on the website that have almost identical content -- namely /foo and /nov05. This is likely a duplication of content violation in the view of some search engines. Is the Canonical URL Tag designed to be used in this situation? The page /nov05 is the permanent page containing the content for the day on the website. This means page /nov05 should have a Canonical Tag that points to itself and /foo should have a Canonical Tag that points to /nov05. Correct? Now here is my problem. The page at URL /foo is the fourth highest page authority on our 2,000+ page website. URL /foo is a key part of the marketing strategy for the website. It has the second largest number of External Links second only to our home page. I must tell you that I'm concerned about using a Cononical URL Tag that points away from the URL /foo to a permanent page on the website like /nov05. I can think of a lot of things negative things that could happen to the rankings of the page by making a change like this and I am not sure what we would gain. Right now /foo has a Canonical URL Tag that points to itself. Does anyone believe we should change this? If so, to what and why? Thanks for helping me think this through! Greg
Technical SEO | | GregSims0 -
Site blocked by robots.txt and 301 redirected still in SERPs
I have a vanity URL domain that 301 redirects to my main site. That domain does have a robots.txt to disallow the entire site as well. However, for a branded enough search that vanity domain still shows up in SERPs and has the new Google message of: A description for this result is not available because of this site's robots.txt I get why the message is there - that's not my , my question is shouldn't a 301 redirect trump this domain showing in SERPs, ever? Client isn't happy about it showing at all. How can I get the vanity domain out of the SERPs? THANKS in advance!
Technical SEO | | VMLYRDiscoverability0 -
Robots.txt checker
Google seems to have discontinued their robots.txt checker. Is there another tool that I can use to check my text instead? Thanks!
Technical SEO | | theLotter0 -
Robots.txt and robots meta
I have an odd situation. I have a CMS that has a global robots.txt which has the generic User-Agent: *
Technical SEO | | Highland
Allow: / I also have one CMS site that needs to not be indexed ever. I've read in various pages (like http://www.jesterwebster.com/robots-txt-vs-meta-tag-which-has-precedence/22 ) that robots.txt always wins over meta, but I have also read that robots.txt indicates spiderability whereas meta can control indexation. I just want the site to not be indexed. Can I leave the robots.txt as is and still put NOINDEX in the robots meta?0