Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Block in robots.txt instead of using canonical?
-
When I use a canonical tag for pages that are variations of the same page, it basically means that I don't want Google to index this page. But at the same time, spiders will go ahead and crawl the page. Isn't this a waste of my crawl budget? Wouldn't it be better to just disallow the page in robots.txt and let Google focus on crawling the pages that I do want indexed?
In other words, why should I ever use rel=canonical as opposed to simply disallowing in robots.txt?
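A minimal sketch of what the disallow option means for a crawler that honors robots.txt, using Python's standard `urllib.robotparser`. The `example.com` domain and the `/profile/sub/` URL layout are assumptions made up for illustration, not the actual site discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block the profile sub-pages, keep the main profile crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profile/sub/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a compliant crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs for a main profile page and one of its sub-pages.
main_page = "https://www.example.com/profile/alice"
sub_page = "https://www.example.com/profile/sub/alice-photos"

print(is_crawlable(ROBOTS_TXT, main_page))  # True
print(is_crawlable(ROBOTS_TXT, sub_page))   # False
```

A crawler that honors the rule never requests the sub-page, so it also never reads a rel=canonical tag placed on that page. The two mechanisms cannot be combined on the same URL.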
-
With this info, I would go with robots.txt because, as you say, it outweighs any potential loss given the use of the pages and the absence of links.
Thanks
-
Thanks Robert.
The pages that I'm talking about disallowing do not have rank or links. They are sub-pages of a profile page. If anything, the main page will be linked to, not the sub-pages.
Maybe I should have explained that I'm talking about a large site - around 400K pages. More than 1,000 new pages are created per week. That's why I am concerned about managing crawl budget. The pages that I'm referring to are not linked to anywhere on the site. Sure, Google can potentially get to them if someone decides to link to them on their own site, but this is unlikely and certainly won't happen on a large scale. So I'm not really concerned about losing PageRank on the main profile page if I disallow them. To be clear: we have many thousands of pages with content that we want to rank. The pages I'm talking about are not important in those terms.
So it's really a question of balance... if these pages (there are MANY of them) are included in the crawl (and in our sitemap), potentially it's a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?
I understand that Google designed rel=canonical for this scenario, but that does not mean that it's necessarily the best way to go considering the other options.
-
Thanks Takeshi.
Maybe I should have explained that I'm talking about a large site - around 400K pages. More than 1,000 new pages are created per week. That's why I am concerned about managing crawl budget. The pages that I'm referring to are not linked to anywhere on the site. Sure, Google can potentially get to them if someone decides to link to them on their own site, but this is unlikely (since it's a sub-page of the main profile page, which is where people would naturally link to) and certainly won't happen on a large scale. So I'm not really concerned about link-juice evaporation. According to AJ Kohn, it's not enough to see in Webmaster Tools that Google has indexed all pages on our site. There is also the issue of how often pages are being crawled, which is what we are trying to optimize for.
So it's really a question of balance... if these pages (there are MANY of them) are included in the crawl (and in our sitemap), potentially it's a real waste of crawl budget. Doesn't this outweigh the minuscule, far-fetched potential loss?
Would love to hear your thoughts...
-
I would go with the canonicals. If there are any links going to these duplicate pages, that will prevent any "link juice evaporation" from links which Google can see but can't crawl due to robots.txt. Best to let Google just crawl the page and see the canonical so that it understands that it is a duplicate page.
Having canonicals on all your pages is good practice anyway, as it can prevent inadvertent duplicate content from things like query parameters.
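To illustrate what "crawl the page and see the canonical" involves, here is a rough sketch of how a crawler might read the tag, using Python's standard `html.parser`. The page markup and the `example.com` URL are hypothetical, and this is not a description of how Google actually processes pages.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated values, so split before checking.
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical variation page pointing at its main profile page.
VARIATION_PAGE = """<html><head>
<link rel="canonical" href="https://www.example.com/profile/alice">
</head><body>Variation of the profile page</body></html>"""

print(find_canonical(VARIATION_PAGE))  # https://www.example.com/profile/alice
```

The tag can only be read after the page has been fetched, which is why the canonical approach spends a crawl request on each variation while still letting any links to it count toward the main page.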
Crawl budget can be of some concern if you're talking about a massive number of pages, but start by taking a look at Google Webmaster Tools and seeing how many of your pages are being crawled vs the total number of pages on your site. As long as this ratio isn't small, you should be good. You can also get more crawl budget by building links to raise your domain authority.
-
I don't disagree at all and I think AJ Kohn is a rock star. In SEO, I have learned over time that there are rarely absolutes like always do this or never do that. I based my answer on how you posited the question.
If you read AJ's post, you will note that the rel=canonical issue comes up in the comments and not in the body of his post. Yes, if the page is superfluous, like a cart page or a contact page, use robots.txt to block the crawl. But if you have a page with rank, links, etc. that helps your canonical page, how are you helping yourself by forgoing rel=canonical?
I think his bigger point was that you want to be aware and to understand that the number of times you are crawled is at least partially governed by PageRank, which in turn is governed by all those other things we discussed. If you understand that and keep the crawl focused on better pages, you help yourself.
Does that clarify a bit?
Best
-
Hi, even if you use a robots.txt file to block these pages, Google can still pick up references to them from third-party websites and crawl from there. Such pages will not have a description snippet in the search results and will instead show text that reads:
A description of this result is not available because of this site's robots.txt.
So, to fully stop Google from crawling these pages, you can go in for the page-level meta robots tag along with the robots.txt method. The page-level robots meta tag complements the robots.txt method.
By the way, a robots.txt file can definitely save you some crawl budget. I don't think you should worry much about crawl budget though, as long as your website is super-easy to crawl, with simple text-based internal links, super-fast servers, and so on.
Those are my two cents, my friend.
Best regards,
Devanur Rafi
-
Thanks for the response, Robert.
I have read lots of SEO advice on maximizing your "crawl budget" - making sure your internal link system is built well to send the bots to the right pages. According to my research, since bots only spend a certain amount of time on your site when they are crawling, it is important to do whatever you can to ensure that they don't "waste time" on pages that are not important for SEO. Just as one example, see this post from AJ Kohn.
Do you disagree with this whole approach?
-
Yair
I think that the canonical is the better option. I am unsure about your use of the term "crawl budget," in that there is no fixed number of times a page or a site will be crawled versus a second similar site, for example. I have a huge reference site that is crawled every couple of days, and I have small sites of ten pages that are crawled weekly or less. It depends on the traffic and the behavior of that traffic (which would include the number of inbound links, etc.) and on things like you resubmitting your sitemap.
The canonical tag was created to clarify for the search engine which page you consider to be the relevant one. Go ahead and use it.
Best
Robert