Do you bother cleaning duplicate content from Google's Index?

FashionLux

Hi,

I'm in the process of instructing developers to stop producing duplicate content. However, a lot of duplicate content is already in Google's Index, and I'm wondering if I should bother getting it removed... I'd appreciate it if you could let me know what you'd do...

For example one 'type' of page is being crawled thousands of times, but it only has 7 instances in the index which don't rank for anything. For this example I'm thinking of just stopping Google from accessing that page 'type'.

Do you think this is right?

Do you normally add a meta "noindex, follow" tag to the page, wait for the pages to be removed from Google's Index, and then stop the duplicate content from being crawled?

Or do you just stop the pages from being crawled and let Google sort out its own Index in its own time?

Thanks

FashionLux

Dr-Pete

One tricky point - you don't necessarily want to fix the duplicate URLs before you 301-redirect and clear out the index. This is counter-intuitive and throws many people off. If you cut the crawl paths to the bad URLs, then Google will never crawl them and process the 301-redirects (since those exist on the page level). The same is true for canonical tags. Clear out the duplicates first, THEN clean up the paths. I know it sounds weird, but it's important.
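A rough way to check that ordering, assuming the crawl path is being cut with robots.txt (the thread does not say which mechanism is in use): the sketch below uses Python's standard `urllib.robotparser` to list the duplicate URLs that Googlebot would no longer be allowed to fetch, which means it could never see their 301s or canonical tags. The robots.txt content, the `/print/` path and the function name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def blocked_before_cleanup(robots_txt, duplicate_urls, agent="Googlebot"):
    """Return the duplicate URLs that robots.txt stops `agent` from fetching.

    Any URL in the result cannot be crawled, so a 301 or canonical tag
    placed on it would never be processed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in duplicate_urls if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt and duplicate URLs, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /print/
"""
duplicates = [
    "http://mysite.com/print/page2",
    "http://mysite.com/page2?previouspage=page1",
]

blocked = blocked_before_cleanup(robots_txt, duplicates)
for url in blocked:
    print("Blocked from crawling, redirect will not be seen:", url)
```

If the list is non-empty, the Disallow rule probably wants to wait until those URLs have dropped out of the index.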

For malformed URLs and usability, you could still dynamically 301-redirect. In most cases, those bad URLs shouldn't get indexed, because they have no crawl path in your site. Someone would have to link to them. Google will never mis-type, in other words.

FashionLux

Hi Highland/Dr Pete,

My apologies, I wasn't very clear - fixing the duplicate problem... or rather stopping our site from generating further duplicate content isn't an issue at all. I'm going to instruct our developers to stop generating dupe content by doing things like no longer passing variables in the URLs (mysite.com/page2?previouspage=page1).
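For URLs of that shape that are already indexed, one possible approach is a dynamic 301 that strips the parameter and points at the clean URL. This is only a sketch: it assumes `previouspage` never changes what the page shows, and the set name and function names are hypothetical rather than anything from the site in question.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these query parameters never change the page content.
NON_CONTENT_PARAMS = {"previouspage"}


def canonical_url(url):
    """Return `url` with the non-content query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in NON_CONTENT_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def redirect_for(url):
    """Return (301, target) if `url` is a duplicate, or None if it is already clean."""
    target = canonical_url(url)
    if target == url:
        return None
    return 301, target


print(redirect_for("http://mysite.com/page2?previouspage=page1"))
print(redirect_for("http://mysite.com/page2"))
```

Returning `None` for clean URLs keeps the rule from redirecting a page to itself.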

However, the problem is that in a lot of instances duplicate URLs work, and they need to work - for example, if a user types in the URL but gets one character wrong ('1q' rather than '1'), then from a usability perspective it's the correct thing to serve the content they wanted. You don't want to make the user stop, figure out what they did wrong and redo it - not when you can make it work seamlessly.

My question relates to 'once my site is no longer generating unnecessary duplicate content, what should I do about the duplicate pages that have already made their way into the Index?' and you have both answered the question very well, thank you.

I can manually set up 301 redirects for all of the duplicate pages that I find in the index, and once they disappear from the index I can probably remove those 301s. I was thinking of going down the noindex meta tag route, which is harder to develop.

Thanks guys

FashionLux

Dr-Pete

I DO NOT believe in letting Google sort it out - they don't do it well, and, since Panda (and really even before), they basically penalize sites for their inability to sort out duplicates. I think it's very important to manage your index.

Unfortunately, how to do that can be very complex and depends a lot on the situation. Highland's covered the big ones, but the details can get messy. I wrote a mega-post about it:

http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world

Without giving URLs, can you give us a sense of what kind of duplicates they are (or maybe some generic URL examples)?

Highland

Your options are:

De-index the duplicate pages yourself and save yourself the crawl budget.
301 the duplicates to the pages you want to keep (preferred).
Canonical the duplicate pages, which lets you pick which page remains in the index. The duplicate pages will still be crawled, however.
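To make the three options concrete, here is a minimal sketch of what each one sends to the crawler. The helper names are hypothetical, and how the tags or headers get wired into a real site depends on its framework. The markup itself is the standard robots meta tag, the `rel="canonical"` link element and an HTTP 301 with a `Location` header.

```python
from html import escape


def noindex_meta():
    """Option 1: tag for the duplicate page's <head> asking for de-indexing."""
    return '<meta name="robots" content="noindex, follow">'


def permanent_redirect(preferred_url):
    """Option 2: status code and headers for a 301 to the page you want to keep."""
    return 301, {"Location": preferred_url}


def canonical_link(preferred_url):
    """Option 3: tag for the duplicate page's <head> naming the preferred URL."""
    return '<link rel="canonical" href="{}">'.format(escape(preferred_url, quote=True))


# Hypothetical preferred URL, based on the example domain used in this thread.
preferred = "http://mysite.com/page2"
print(noindex_meta())
print(permanent_redirect(preferred))
print(canonical_link(preferred))
```

Options 1 and 3 only work while the duplicate page can still be crawled, since both tags live in its HTML.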
