Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How to block "print" pages from indexing
-
I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search.
Can you recommend a way to block this from happening?
Example Article:
Example "Print" page:
http://www.knottyboy.com/lore/article.php?id=052&action=print
-
Donnie, I agree. However, we had the same problem on a website and here's what we did the canonical tag:
Over a period of 3-4 weeks, all those print pages disappeared from the SERP. Now if I take a print URL and do a cache: for that page, it shows me the web version of that page.
So yes, I agree the question was about blocking the pages from getting indexed. There's no real recipe here, it's about getting the right solution. Before canonical tag, robots.txt was the only solution. But now with canonical there (provided one has the time and resources available to implement it vs adding one line of text to robots.txt), you can technically 301 the pages and not have to stop/restrict the spiders from crawling them.
Absolutely no offence to your solution in any way. Both are indeed workable solutions. The best part is that your robots.txt solution takes 30 seconds to implement since you provided the actually disallow code :), so it's better.
-
Thanks Jennifer, will do! So much good information.
-
Sorry, but I have to jump in - do NOT use all of those signals simultaneously. You'll make a mess, and they'll interfere with each other. You can try Robots.txt or NOINDEX on the page level - my experience suggests NOINDEX is much more effective.
Also, do not nofollow the links yet - you'll block the crawl, and then the page-level cues (like NOINDEX) won't work. You can nofollow later. This is a common mistake and it will keep your fixes from working.
-
Josh, please read my and Dr. Pete's comments below. Don't nofollow the links, but do use the meta noindex,follow on the page.
-
Rel-canonical, in practice, does essentially de-index the non-canonical version. Technically, it's not a de-indexation method, but it works that way.
-
You are right Donnie. I've "good answered" you too.
I've gone ahead and updated my robots.txt file. As soon as I am able, I will use no indexon the page, no follow on the links, and rel=canonical.
This is just what I needed, a quick fix until I can make a more permanent solution.
-
Your welcome : )
-
Although you are correct... there is still more then one way to skin a chicken.
-
But the spiders still run on the page and read the canonical link, however with the robot text the spiders will not.
-
Yes, but Rel=Canonical does not block a page it only tells google which page to follow out of two pages.The question was how to block, not how to tell google which link to follow. I believe you gave credit to the wrong answer.
http://en.wikipedia.org/wiki/Canonical_link_element
This is not fair. lol
-
I have to agree with Jen - Robots.txt isn't great for getting indexed pages out. It's good for prevention, but tends to be unreliable as a cure. META NOINDEX is probably more reliable.
One trick - DON'T nofollow the print links, at least not yet. You need Google to crawl and read the NOINDEX tags. Once the ?print pages are de-indexed, you could nofollow the links, too.
-
Yes, it's strongly recommended. It should be fairly simple to populate this tag with the "full" URL of the article based on the article ID. This approach will not only help you get rid of the duplicate content issue, but a canonical tag essentially works like a 301 redirect. So from all search engine perspective you are 301'ing your print pages to the real web urls without redirecting the actual user's who are browsing the print pages if they need to.
-
Ya it is actually really useful. Unfortunately they are out of business now - so I'm hacking it on my own.
I will take your advice. I've shamefully never used rel= canonical before - so now is a good time to start.
-
True but using robots.txt does not keep them out of the index. Only using "noindex" will do that.
-
Thanks Donnie. Much appreciated!
-
I actually remember Lore from a while ago. It's an interesting, easy to use FAQ CMS.
Anyways, I would also recommend implementing Canonical Tags for any possible duplicate content issues. So whether it's the print or the web version, each one of them will contain a canonical tag pointing to the web url of that article in the section of your website.
rel="canonical" href="http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html" /> -
-
Try This.
User-agent: *
Disallow: /*&action=print
-
Theres more then one way to skin a chicken.
-
Rather than using robots.txt I'd use a noindex,follow tag instead to the page. This code goes into the tag for each print page. And it will ensure that the pages don't get indexed but that the links are followed.
-
That would be great. Do you mind giving me an example?
-
you can block in .robot text, every page that ends in action=print
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google Not Indexing Pages (Wordpress)
Hello, recently I started noticing that google is not indexing our new pages or our new blog posts. We are simply getting a "Discovered - Currently Not Indexed" message on all new pages. When I click "Request Indexing" is takes a few days, but eventually it does get indexed and is on Google. This is very strange, as our website has been around since the late 90's and the quality of the new content is neither duplicate nor "low quality". We started noticing this happening around February. We also do not have many pages - maybe 500 maximum? I have looked at all the obvious answers (allowing for indexing, etc.), but just can't seem to pinpoint a reason why. Has anyone had this happen recently? It is getting very annoying having to manually go in and request indexing for every page and makes me think there may be some underlying issues with the website that should be fixed.
Technical SEO | | Hasanovic1 -
Pages are Indexed but not Cached by Google. Why?
Hello, We have magento 2 extensions website mageants.com since 1 years google every 15 days cached my all pages but suddenly last 15 days my websites pages not cached by google showing me 404 error so go search console check error but din't find any error so I have cached manually fetch and render but still most of pages have same 404 error example page : - https://www.mageants.com/free-gift-for-magento-2.html error :- http://webcache.googleusercontent.com/search?q=cache%3Ahttps%3A%2F%2Fwww.mageants.com%2Ffree-gift-for-magento-2.html&rlz=1C1CHBD_enIN803IN804&oq=cache%3Ahttps%3A%2F%2Fwww.mageants.com%2Ffree-gift-for-magento-2.html&aqs=chrome..69i57j69i58.1569j0j4&sourceid=chrome&ie=UTF-8 so have any one solutions for this issues
Technical SEO | | vikrantrathore0 -
Is it good to redirect million of pages on a single page?
My site has 10 lakh approx. genuine urls. But due to some unidentified bugs site has created irrelevant urls 10 million approx. Since we don’t know the origin of these non-relevant links, we want to redirect or remove all these urls. Please suggest is it good to redirect such a high number urls to home page or to throw 404 for these pages. Or any other suggestions to solve this issue.
Technical SEO | | vivekrathore0 -
Why do some URLs for a specific client have "/index.shtml"?
Reviewing our client's URLs for a 301 redirect strategy, we have noticed that many URLs have "/index.shtml." The part we don'd understand is these URLs aren't the homepage and they have multiple folders followed by "/index.shtml" Does anyone happen to know why this may be occurring? Is there any SEO value in keeping the "/index.shtml" in the URL?
Technical SEO | | FranFerrara0 -
Pages removed from Google index?
Hi All, I had around 2,300 pages in the google index until a week ago. The index removed a load and left me with 152 submitted, 152 indexed? I have just re-submitted my sitemap and will wait to see what happens. Any idea why it has done this? I have seen a drop in my rankings since. Thanks
Technical SEO | | TomLondon0 -
Unnecessary pages getting indexed in Google for my blog
I have a blog dapazze.com and I am suffering from a problem for a long time. I found out that Google have indexed hundreds of replytocom links and images attachment pages for my blog. I had to remove these pages manually using the URL removal tool. I had used "Disallow: ?replytocom" in my robots.txt, but Google disobeyed it. After that, I removed the parameter from my blog completely using the SEO by Yoast plugin. But now I see that Google has again started indexing these links even after they are not present in my blog (I use #comment). Google have also indexed many of my admin and plugin pages, whereas they are disallowed in my robots.txt file. Have a look at my robots.txt file here: http://dapazze.com/robots.txt Please help me out to solve this problem permanently?
Technical SEO | | rahulchowdhury0 -
De-indexing millions of pages - would this work?
Hi all, We run an e-commerce site with a catalogue of around 5 million products. Unfortunately, we have let Googlebot crawl and index tens of millions of search URLs, the majority of which are very thin of content or duplicates of other URLs. In short: we are in deep. Our bloated Google-index is hampering our real content to rank; Googlebot does not bother crawling our real content (product pages specifically) and hammers the life out of our servers. Since having Googlebot crawl and de-index tens of millions of old URLs would probably take years (?), my plan is this: 301 redirect all old SERP URLs to a new SERP URL. If new URL should not be indexed, add meta robots noindex tag on new URL. When it is evident that Google has indexed most "high quality" new URLs, robots.txt disallow crawling of old SERP URLs. Then directory style remove all old SERP URLs in GWT URL Removal Tool This would be an example of an old URL:
Technical SEO | | TalkInThePark
www.site.com/cgi-bin/weirdapplicationname.cgi?word=bmw&what=1.2&how=2 This would be an example of a new URL:
www.site.com/search?q=bmw&category=cars&color=blue I have to specific questions: Would Google both de-index the old URL and not index the new URL after 301 redirecting the old URL to the new URL (which is noindexed) as described in point 2 above? What risks are associated with removing tens of millions of URLs directory style in GWT URL Removal Tool? I have done this before but then I removed "only" some useless 50 000 "add to cart"-URLs.Google says themselves that you should not remove duplicate/thin content this way and that using this tool tools this way "may cause problems for your site". And yes, these tens of millions of SERP URLs is a result of a faceted navigation/search function let loose all to long.
And no, we cannot wait for Googlebot to crawl all these millions of URLs in order to discover the 301. By then we would be out of business. Best regards,
TalkInThePark0 -
301 Redirect "wildcard" question
I have been looking at the SEOmoz redirect guide for some advice but I can't seem to find the answer : http://www.seomoz.org/learn-seo/redirection I have lots of URLs from a previous version of a site that look like the following: sitename.com/-c-25.html?sort=2d&page=1 sitename.com/-c-25.html?sort=3a&page=1 etc etc. I want to write a redirect so whenever a URL with the terms "-c-25.html" is requested it redirects to a specified page, regardless of what comes after the question mark. These URLs were created by our previous ecommerce software. The 'c' is for category, and each page of the cateogry created a different URL. I want to do these so I can rediect all of these URLs to the appropraite new cateogry page in a single redirect. Thanks for any help.
Technical SEO | | craigycraig0