Can PDF be seen as duplicate content? If so, how to prevent it?
-
I see no reason why PDFs couldn't be considered duplicate content, but I haven't seen any threads about it.
We publish loads of product documentation provided by manufacturers, as well as white papers and case studies. These give our customers and prospects a better idea of our solutions and help them along their buying process.
However, I'm not sure whether it would be better to make them non-indexable to prevent duplicate content issues. Clearly we would prefer a solution where we benefit from the keywords in the documents.
Does anyone have insight on how to deal with PDFs provided by third parties?
Thanks in advance.
-
It looks like Google is not crawling tabs anymore, so if your PDFs are tabbed within pages, it might not be an issue: https://www.seroundtable.com/google-hidden-tab-content-seo-19489.html
-
Sure, I understand - thanks EGOL
-
I would like to give that to you but it is on a site that I don't share in forums. Sorry.
-
Thanks EGOL
That would be ideal.
For a site with multiple authors, where it is impractical to get a developer involved every time a web page or blog post and its PDF are created, is there a single line of code that could be used to accomplish this in .htaccess?
If so, would you be able to show me an example please?
-
I assigned rel=canonical to my PDFs using .htaccess.
Then, if anyone links to the PDFs, the link value gets passed to the web page.
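For anyone looking for a starting point, here is a minimal sketch of what such .htaccess rules could look like. It is not EGOL's actual setup. It assumes an Apache server with mod_headers enabled, and the file name and URL in the mapping are hypothetical. Since a multi-author site would need one rule per PDF, the sketch uses a small Python helper (canonical_htaccess, a made-up name) to generate the stanzas from a PDF-to-page mapping:

```python
def canonical_htaccess(pdf_to_page):
    """Build .htaccess stanzas that send a rel=canonical Link header per PDF.

    Assumes Apache with mod_headers enabled; <Files> matches on file name.
    """
    stanzas = []
    for pdf_name, page_url in sorted(pdf_to_page.items()):
        stanzas.append(
            f'<Files "{pdf_name}">\n'
            f'  Header set Link "<{page_url}>; rel=\\"canonical\\""\n'
            f'</Files>'
        )
    return "\n\n".join(stanzas)


# Hypothetical mapping: PDF file name -> the web page that should get the credit.
PDF_TO_PAGE = {
    "white-paper.pdf": "https://www.example.com/white-paper/",
}

print(canonical_htaccess(PDF_TO_PAGE))
```

The printed text can be pasted into the .htaccess file of the folder that holds the PDFs. Each new PDF only needs one more entry in the mapping.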
-
Hi all
I've been discussing the topic of making content available as both blog posts and PDF downloads today.
Given that there is a lot of uncertainty and complexity around this issue of potential duplication, my plan is to house all the PDFs in a folder that we block with robots.txt.
Anyone agree / disagree with this approach?
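For reference, the robots.txt rule for this plan is only two lines. Below is a minimal sketch that uses Python's standard urllib.robotparser to check which URLs such a rule would block before it goes live. The /pdfs/ folder name and the example.com URLs are assumptions for illustration, and is_crawlable is a made-up helper name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block every crawler from the /pdfs/ folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdfs/
"""


def is_crawlable(robots_txt, url, agent="Googlebot"):
    """Return True if the given robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "https://www.example.com/pdfs/ultimate-guide.pdf",
    "https://www.example.com/blog/ultimate-guide/",
):
    print(url, "crawlable" if is_crawlable(ROBOTS_TXT, url) else "blocked")
```

Keep in mind that a blocked PDF is never fetched, so a crawler will not see any canonical header set on it either.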
-
Unfortunately, there's no great way to have it both ways. If you want these pages to get indexed for the links, then they're potential duplicates. If Google filters them out, the links probably won't count. Worst case, it could cause Panda-scale problems. Honestly, I suspect the link value is minimal and outweighed by the risk, but it depends quite a bit on the scope of what you're doing and the general link profile of the site.
-
I think you can set it to public or private (logged-in only) and even put a price-tag on it if you want. So yes setting it to private would help to eliminate the dup content issue, but it would also hide the links that I'm using to link-build.
I would imagine that since this guide would link back to our original site, it would be no different than if someone were to copy the content from our site and link back to us with it, thus crediting us as the original source. Especially if we make sure it gets indexed through GWMT before submitting it to other platforms. Any good resources that delve into that?
-
Potentially, but I'm honestly not sure how Scribd's pages are indexed. Don't you need to log in or something to actually see the content on Scribd?
-
What about this instance:
(A) I made an "ultimate guide to X" and posted it on my site as individual HTML pages for each chapter
(B) I made a PDF version with the exact same content that people can download directly from the site
(C) I uploaded the PDF to sites like Scribd.com to help distribute it further, and build links with the links that are embedded in the PDF.
Would those all be dup content? Is (C) recommended or not?
-
Thanks! I am going to look into this. I'll let you know if I learn anything.
-
If they duplicate your main content, I think the header-level canonical may be a good way to go. For the syndication scenario, it's tough, because then you're knocking those PDFs out of the rankings, potentially, in favor of someone else's content.
Honestly, I've seen very few people deal with canonicalization for PDFs, and even those cases were small or obvious (like a page with the exact same content being outranked by the duplicate PDF). It's kind of uncharted territory.
-
Thanks for all of your input Dr. Pete. The example that you use is almost exactly what I have - hundreds of .pdfs on a fifty page site. These .pdfs rank well in the SERPs, accumulate pagerank, and pass traffic and link value back to the main site through links embedded within the .pdf. They also have natural links from other domains. I don't want to block them or nofollow them, but your suggestion of using the header directive sounds pretty good.
-
Oh, sorry - so these PDFs aren't duplicates with your own web/HTML content so much as duplicates with the same PDFs on other websites?
That's more like a syndication situation. It is possible that, if enough people post these PDFs, you could run into trouble, but I've never seen that. More likely, your versions just wouldn't rank. Theoretically, you could use the header-level canonical tag cross-domain, but I've honestly never seen that tested.
If you're talking about a handful of PDFs, they're a small percentage of your overall indexed content, and that content is unique, I wouldn't worry too much. If you're talking about 100s of PDFs on a 50-page website, then I'd control it. Unfortunately, at that point, you'd probably have to put the PDFs in a folder and outright block it. You'd remove the risk, but you'd stop ranking on those PDFs as well.
-
@EGOL: Can you expand a bit on your Author suggestion?
I was wondering if there is a way to do rel=author for a PDF document. I don't know how to do it and don't know if it is possible.
-
To make sure I understand what I'm reading:
- PDFs don't usually rank as well as regular pages (although it is possible)
- It is possible to configure a canonical tag on a PDF
My concern isn't that our PDFs may outrank the original content, but rather that we get slammed by Google for publishing them.
Am I right in thinking a canonical tag prevents the PDF from accumulating link juice? If so, I would prefer not to use it, unless leaving it out leads to Google slamming us.
Has anyone experienced Google retribution for publishing PDFs that come from a third party?
@EGOL: Can you expand a bit on your Author suggestion?
Thanks all!
-
I think it's possible, but I've only seen it in cases that are a bit hard to disentangle. For example, I've seen a PDF outrank a duplicate piece of regular content when the regular content had other issues (including massive duplication with other, regular content). My gut feeling is that it's unusual.
If you're concerned about it, you can canonicalize PDFs with the header-level canonical directive. It's a bit more technically complex than the standard HTML canonical tag:
http://googlewebmastercentral.blogspot.com/2011/06/supporting-relcanonical-http-headers.html
I'm going to mark this as "Discussion", just in case anyone else has seen real-world examples.
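For readers who have not seen it, the header-level canonical described in the Google post linked above is an HTTP response header of the form Link: <URL>; rel="canonical". Here is a minimal sketch for checking that a PDF response actually carries it. It only parses a header value that you have already captured (for example from your browser's network tab); the example.com URLs and the function name canonical_from_link_header are assumptions for illustration:

```python
import re

# One link-value: <target> followed by zero or more ";param" parts.
LINK_RE = re.compile(r'<([^>]*)>\s*((?:;\s*[^;,]+)*)')


def canonical_from_link_header(value):
    """Return the rel=canonical target from a Link header value, or None."""
    for target, params in LINK_RE.findall(value):
        for param in params.split(";"):
            name, _, val = param.strip().partition("=")
            rels = val.strip('" ').lower().split()
            if name.strip().lower() == "rel" and "canonical" in rels:
                return target
    return None


# Hypothetical header value as it might be sent with a PDF response.
header = '<https://www.example.com/white-paper/>; rel="canonical"'
print(canonical_from_link_header(header))
```

If the function returns None for a PDF you expected to be canonicalized, the server rule is probably not matching that file.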
-
I am really interested in hearing what others have to say about this.
I know that .pdfs can be very valuable content. They can be optimized, they rank in the SERPs, they accumulate PR and they can pass linkvalue. So, to me it would be a mistake to block them from the index...
However, I see your point about dupe content... they could also be thin content. Will panda whack you for thin and dupes in your PDFs?
How can canonical be used... what about author?
Anybody know anything about this?
-
Just like any other piece of duplicate content, you can use canonical link elements to specify the original piece of content (if there's indeed more than one identical piece). You could also block these types of files in the robots.txt, or use noindex-follow meta tags.
Regards,
Margarita
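One practical note on the noindex-follow option: a PDF has no HTML head to hold a meta tag, so for PDFs the same directive is normally sent as an X-Robots-Tag HTTP response header. A minimal sketch, assuming an Apache server with mod_headers enabled; the helper name noindex_pdfs_htaccess is made up, and the Python simply prints the .htaccess stanza:

```python
def noindex_pdfs_htaccess(directives=("noindex", "follow")):
    """Build an .htaccess stanza sending X-Robots-Tag for every .pdf file.

    Assumes Apache with mod_headers enabled.
    """
    value = ", ".join(directives)
    return (
        '<FilesMatch "\\.pdf$">\n'
        f'  Header set X-Robots-Tag "{value}"\n'
        '</FilesMatch>'
    )


print(noindex_pdfs_htaccess())
```

Unlike a robots.txt block, this still lets crawlers fetch the PDFs, which is what allows them to read the header and follow the links inside.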