Robots.txt and canonical tag
-
In the SEOmoz post - http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts, it's being said -
If you have a robots.txt disallow in place for a page, the canonical tag will never be seen.
Does it so happen that if a page is disallowed by robots.txt, spiders DO NOT read the html code ?
-
Thanks Ryan for explaining things very clearly.
-
What we know is there have been many cases where a page that is blocked in robots.txt has appeared in search results. The explanation provided is that robots.txt blocks crawlers during normal site visits, but not necessarily on visits where they are following links from other sites.
-
If spiders follow links to an article on my site, will they read the contents then ? If the canonical tag is on article page itself, will canonical tag will be seen ?
-
Daylan offered a great answer but I would like to add one exception. When crawlers from the major SEs visit your site they will honor your robots.txt file but sometimes they will follow links from other sites to an article on your site, and during that particular visit they will not see the robots.txt file and index your page.
This is one of the reasons why your robots.txt file should be used as minimally as possible, and when it is used you should have a backup process in place such as the canonical or noindex tag on a page.
-
Thanks Daylan for your quick response. I just wanted a second opinion that canonical tag will never be seen if a page is disallowed.
-
Thats correct in most cases:
It works likes this: a robot wants to vists a Web site URL, say http://www.example.com/welcome.html. Before it does so, it firsts checks for http://www.example.com/robots.txt, and finds:
User-agent: *
Disallow: /The "User-agent: *" means this section applies to all robots. The "Disallow: /" tells the robot that it should not visit any pages on the site.
Robots can ignore your /robots.txt. Especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers will pay no attention.
More information available here about:
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Canonical tag not working
I have a weebly site and I put the canonical tag in the header code but the moz crawler still says that I'm missing the canonical tag. Any tips?
Technical SEO | | ctpolarbears0 -
Product Tags
Opencart allows the use of product tags (please note, these are NOT meta tags) which I believe are used for when customers want to search for a product using the search function. So one of my tags could be ''star wars socks'', and when a customer types this into the search it brings up every product containing the tag for socks. This is all good and well, however, these tags appear on the product page itself, right below the Manufacturer/Brand, and above the price (they created pages but I have canonical links in them so that is a non-issue). Will Google look kindly on this or could it be considered as keyword stuffing? Or will Google know they're for search and ignore them? I just need to know whether or not removing them entirely will be a good or bad idea.
Technical SEO | | moon-boots0 -
Robots.txt & meta noindex--site still shows up on Google Search
I have set up my robots.txt like this: User-agent: *
Technical SEO | | RoxBrock
Disallow: / and I have this meta tag in my on a Wordpress site, set up with SEO Yoast name="robots" content="noindex,follow"/> I did "Fetch as Google" on my Google Search Console My website is still showing up in the search results and it says this: "A description for this result is not available because of this site's robots.txt" This site has not shown up for years and now it is ranking above my site that I want to rank for this keyword. How do I get Google to ignore this site? This seems really weird and I'm confused how a site with little content, that has not been updated for years can rank higher than a site that is constantly updated and improved.1 -
Clarification regarding robots.txt protocol
Hi,
Technical SEO | | nlogix
I have a website , and having 1000 above url and all the url already got indexed in Google . Now am going to stop all the available services in my website and removed all the landing pages from website. Now only home page available . So i need to remove all the indexed urls from Google . I have already used robots txt protocol for removing url. i guess it is not a good method for adding bulk amount of urls (nearly 1000) in robots.txt . So just wanted to know is there any other method for removing indexed urls.
Please advice.0 -
Google indexing despite robots.txt block
Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks!
Technical SEO | | zeepartner0 -
Canonical Advice - ?
Hi everyone, I have a bit of problem with duplicate content on a newly launched site and looking for some advice on which pages to canonicalize. Our legacy site had product "information" pages that now 301 to new product information pages. The reason for the legacy having these pages (instead of pages where you can purchase) is because we used our vendors "cart link", which was an iframe inside the website. So in order to get ranked for these products, we created these pages, that had links to the frame where they could buy. The strategy worked, and we got ranked for our products. Now with the new site, we have those same product information pages, but when you click the link to buy, it goes to a page which now is on our actual site, where you can make the purchase, but this page contains the same basic information, though it looks very different. So my question --- the product "information" pages, are the new 301 homes and are the pages with the rank. The purchase pages are new and have no rank, but are essentially duplicate content. Should I put the canonical link element on the purchase page and tell Google to regard the information pages since those are ranked? It just seems weird to me to direct Google away from the place where people can purchase, however, the purchase pages aren't nearly as "pretty" as the information pages are, and wouldn't be the greatest landing pages. We have an automotive site, and the purchase page you have to enter vehicle information. The information page is nicer, and if the visitor is interested, its just one click to get to that page to buy. What to do here? I am fairly new to Moz, and I couldn't determine whether I am permitted to include an example link from our site of what I am referring to. Is that permitted? Thanks for any help anyone can provide.
Technical SEO | | yogitrout1
Kristin0 -
Robots.txt best practices & tips
Hey, I was wondering if someone could give me some advice on whether I should block the robots.txt file from the average user (not from googlebot, yandex, etc)? If so, how would I go about doing this? With .htaccess I'm guessing - but not an expert. What can people do with the information in the file? Maybe someone can give me some "best practices"? (I have a wordpress based website) Thanks in advance!
Technical SEO | | JonathanRolande0 -
Canonical
I am seeing canonical implementation in many sites for non identical pages. Google honoring these implementation and didn't have any issue. Did anyone have different experience? Thanks.
Technical SEO | | gmk15670