Duplicate pages in Google index despite canonical tag and URL Parameter in GWMT

Tinhat

Good morning Moz...

This is a weird one. It seems to be a "bug" with Google, honest...

We migrated our site www.three-clearance.co.uk to a Drupal platform over the new year. The old site used URL-based tracking for heat map purposes, so for instance

www.three-clearance.co.uk/apple-phones.html

..could be reached via

www.three-clearance.co.uk/apple-phones.html?ref=menu or

www.three-clearance.co.uk/apple-phones.html?ref=sidebar and so on.

We told GWMT about the ref parameter and used the canonical tag to indicate our preferred URL. As expected, we encountered no duplicate content issues and everything was good.
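For anyone unfamiliar with the setup, it amounts to mapping every ?ref= variant back to one root URL and declaring that URL as canonical. A minimal sketch in Python, not our actual code: the helper names `canonical_url` and `canonical_link_tag` are made up for illustration, and the `http://` scheme is an assumption.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def canonical_url(url, tracking_params=("ref",)):
    """Return url with tracking-only query parameters (here: ref) removed."""
    parts = urlsplit(url)
    # Keep every query parameter except the tracking ones.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in tracking_params
    ]
    # Rebuild the URL without the tracking parameters and without any fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url):
    """Build the <link rel="canonical"> element for the page served at url."""
    href = escape(canonical_url(url), quote=True)
    return '<link rel="canonical" href="%s" />' % href


# Assumed scheme: http://
print(canonical_link_tag("http://www.three-clearance.co.uk/apple-phones.html?ref=menu"))
```

Every ?ref= variant of a page should produce the same tag as the root URL itself, which is the signal Google is supposed to consolidate on.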

This is the chain of events:

Site migrated to new platform following best practice, as far as I can attest to.
Only known issue was that the verification for both Google Analytics (meta tag) and GWMT (HTML file) didn't transfer as expected, so between relaunch on 22nd Dec and the fix on 2nd Jan we have no GA data, and presumably there was a period where GWMT became unverified.
URL structure and URIs were maintained 100% (which may be a problem, now)
Yesterday I discovered 200-ish 'duplicate meta titles' and 'duplicate meta descriptions' in GWMT. Uh oh, thought I. Expand the report out and the duplicates are in fact ?ref= versions of the same root URL. Double uh oh, thought I.
Run, not walk, to Google and do some Fu:

http://is.gd/yJ3U24 (9 versions of the same page, in the index, the only variation being the ?ref= URI)

Checked Bing and it has indexed each root URL once, as it should.

Situation now:

Site no longer uses the ?ref= parameter (this was intentional and happened when we migrated), although of course some external backlinks still use it.
I 'reset' the URL parameter in GWMT yesterday, given that there's no "delete" option. The "URLs monitored" count went from 900 to 0, but today is at over 1,000 (another wtf moment)

I also resubmitted the XML sitemap and fetched 5 'hub' pages as Google, including the homepage and HTML site-map page.

The ?ref= URLs in the index have the disadvantage of actually working, given that we transferred the URL structure and of course the webserver just ignores the nonsense arguments and serves the page. So I assume Google assumes the pages still exist, and won't drop them from the index but will instead apply a dupe content penalty. Or maybe call us a spam farm. Who knows.

Options that occurred to me (other than maybe making our canonical tags bold or locating a Google bug submission form) include:

A) robots.txt-ing .?ref=. but to me this says "you can't see these pages", not "these pages don't exist", so isn't correct

B) Hand-removing the URLs from the index through a page removal request per indexed URL

C) Apply a 301 to each indexed URL (hello Bing dirty sitemap penalty)

D) Post on SEOMoz because I genuinely can't understand this.

Even if the gap in verification caused GWMT to forget that we had set ?ref= as a URL parameter, the parameter was no longer in use because the verification only went missing when we relaunched the site without this tracking. Google is seemingly 100% ignoring our canonical tags as well as the GWMT URL setting - I have no idea why and can't think of the best way to correct the situation.

Do you?

Edited To Add: As of this morning the "edit/reset" buttons have disappeared from the GWMT URL Parameters page, along with the option to add a new one. There are no messages explaining why and of course the Google help page doesn't mention disappearing buttons (it doesn't even explain what 'reset' does, or why there's no 'remove' option).

Dr-Pete

GWT numbers sometimes ignore parameter handling, oddly, and can be hard to read. I'm only seeing about 40 indexed pages with "ref" in the URL, which hardly seems disastrous.

One note - once the pages get indexed, for whatever reason, de-indexing can take weeks, even if you do everything correctly. Don't change tactics every couple of days, or you're only going to make this worse, long-term. I think canonicals are fine for this, and they should be effective. It just may take Google some time to re-crawl and dislodge the pages.

You actually may want to create an XML sitemap (for Google only) that just contains the "ref=" pages Google has indexed. This can nudge them to re-crawl and honor the canonical. Otherwise, the pages could sit there forever. You could 301-redirect - it would be perfectly valid in this case, since those URLs have no value to visitors. I wouldn't worry about the Bing sitemaps - just don't include the "ref=" URLs in the Bing maps, and you'll be fine.
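A rough sketch of that temporary sitemap, assuming Python and a hand-collected list of the "ref=" URLs that show up in a site: search. The function name `build_sitemap`, the sample URLs, and the `http://` scheme are illustrative assumptions, not something taken from the site.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical list: replace with the ?ref= URLs Google actually shows as indexed.
indexed_ref_urls = [
    "http://www.three-clearance.co.uk/apple-phones.html?ref=menu",
    "http://www.three-clearance.co.uk/apple-phones.html?ref=sidebar",
]

xml_body = '<?xml version="1.0" encoding="UTF-8"?>\n' + build_sitemap(indexed_ref_urls)
print(xml_body)
```

Save the output as a separate file, submit it only in Google Webmaster Tools, and remove it once the "ref=" URLs have dropped out of the index.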

Tinhat

Monday morning, still the same, still no reset/add parameter buttons in GWMT, still not understanding why Google is being so stubborn about this.

Sample: http://www.google.co.uk/search?q=site:three-clearance.co.uk+"Beats+Audio,+the+Sensation+XL"&num=30&hl=en&client=safari&tbo=d&rls=en&filter=0&biw=1920&bih=915

3 identical pages in the index, Google ignoring both GWMT URL parameter and canonical meta tag.

Sigh.

Tinhat

Nope, nice clean site map that GWMT says provides the right number of URLs with no 404s and no ?ref= links.

It's like Google has always indexed these links separately but for some reason has decided to only show them now that they no longer exist.

Vizergy

They aren't in your XML sitemap, are they? You probably generated a new one when you moved the site over... that could possibly be overriding the parameters... maybe... weird...
