Duplicate pages in Google index despite canonical tag and URL Parameter in GWMT
-
Good morning Moz...
This is a weird one. It seems to be a "bug" with Google, honest...
We migrated our site www.three-clearance.co.uk to a Drupal platform over the new year. The old site used URL-based tracking for heat map purposes, so for instance
www.three-clearance.co.uk/apple-phones.html
..could be reached via
www.three-clearance.co.uk/apple-phones.html?ref=menu or
www.three-clearance.co.uk/apple-phones.html?ref=sidebar and so on.
We told GWMT about the ref parameter and used the canonical tag to indicate our preferred URL. As expected, we encountered no duplicate content issues and everything was good.
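For anyone unfamiliar with the setup, here is a minimal sketch of what "canonical tag pointing at the root URL" means. It is illustrative only: the old site's real implementation is not shown in this thread, the helper names `canonical_url` and `canonical_link_tag` are hypothetical, and the `http://` scheme is assumed. Strictly speaking the tag is a `<link rel="canonical">` element in the page head rather than a meta tag.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "ref" is the only tracking parameter that needs stripping.
TRACKING_PARAMS = {"ref"}


def canonical_url(url: str) -> str:
    """Return the URL with tracking parameters (and any fragment) removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element for the page head."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}" />'


if __name__ == "__main__":
    print(canonical_link_tag("http://www.three-clearance.co.uk/apple-phones.html?ref=menu"))
```

Every ?ref= variant of a page should emit the same tag, which is what tells the search engine which version to keep.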
This is the chain of events:
- Site migrated to new platform following best practice, as far as I can attest.
- Only known issue was that the verification for both Google Analytics (meta tag) and GWMT (HTML file) didn't transfer as expected, so between relaunch on the 22nd Dec and the fix on 2nd Jan we have no GA data, and presumably there was a period where GWMT became unverified.
- URL structure and URIs were maintained 100% (which may be a problem, now)
- Yesterday I discovered 200-ish 'duplicate meta titles' and 'duplicate meta descriptions' in GWMT. Uh oh, thought I. Expand the report out and the duplicates are in fact ?ref= versions of the same root URL. Double uh oh, thought I.
- Run, not walk, to Google and do some Fu:
  http://is.gd/yJ3U24 (9 versions of the same page, in the index, the only variation being the ?ref= parameter)
Checked BING and it has indexed each root URL once, as it should.
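To confirm that a list of reported duplicates really is just ?ref= variants of the same root URL, something like the following rough sketch can help. It assumes the URLs have been exported from the GWMT report or a site: search into a plain list by hand; `strip_ref` and `duplicate_groups` are hypothetical helper names, not part of any Google tool.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ref(url):
    """Drop the ref query parameter, keeping any other parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "ref"
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def duplicate_groups(indexed_urls):
    """Map each root URL to its indexed variants, keeping only roots with more than one."""
    groups = defaultdict(set)
    for url in indexed_urls:
        groups[strip_ref(url)].add(url)
    return {
        root: sorted(variants)
        for root, variants in groups.items()
        if len(variants) > 1
    }


if __name__ == "__main__":
    # Sample input built from the example URLs in the post (scheme assumed).
    indexed = [
        "http://www.three-clearance.co.uk/apple-phones.html",
        "http://www.three-clearance.co.uk/apple-phones.html?ref=menu",
        "http://www.three-clearance.co.uk/apple-phones.html?ref=sidebar",
    ]
    for root, variants in duplicate_groups(indexed).items():
        print(root, "->", len(variants), "indexed versions")
```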
Situation now:
- Site no longer uses ?ref= parameter, although of course there still exist some external backlinks that use it. This was intentional and happened when we migrated.
- I 'reset' the URL parameter in GWMT yesterday, given that there's no "delete" option. The "URLs monitored" count went from 900 to 0, but today is at over 1,000 (another wtf moment).
  I also resubmitted the XML sitemap and fetched 5 'hub' pages as Google, including the homepage and HTML site-map page.
- The ?ref= URLs in the index have the disadvantage of actually working, given that we transferred the URL structure and of course the webserver just ignores the nonsense arguments and serves the page. So I assume Google assumes the pages still exist, and won't drop them from the index but will instead apply a dupe content penalty. Or maybe call us a spam farm. Who knows.
Options that occurred to me (other than maybe making our canonical tags bold or locating a Google bug submission form ) include
A) robots.txt-ing .?ref=. but to me this says "you can't see these pages", not "these pages don't exist", so isn't correct
B) Hand-removing the URLs from the index through a page removal request per indexed URL
C) Apply 301 to each indexed URL (hello BING dirty sitemap penalty)
D) Post on SEOMoz because I genuinely can't understand this.
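For reference, option C amounts to logic like the sketch below. The site runs on Drupal, so this Python WSGI middleware is purely an illustration of the redirect rule, not something to drop in; `strip_ref_redirect` is a hypothetical name, and it assumes the application is mounted at the site root.

```python
from urllib.parse import parse_qsl, urlencode

# Assumption: "ref" is the only legacy tracking parameter.
TRACKING_PARAMS = {"ref"}


def strip_ref_redirect(app):
    """WSGI middleware: 301 any request carrying a tracking parameter to the clean URL."""

    def middleware(environ, start_response):
        query = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)
        kept = [(key, value) for key, value in query if key not in TRACKING_PARAMS]
        if len(kept) == len(query):
            # No tracking parameter present: serve the page as normal.
            return app(environ, start_response)
        location = environ.get("PATH_INFO", "/")
        if kept:
            # Preserve any parameters that are not tracking noise.
            location += "?" + urlencode(kept)
        start_response(
            "301 Moved Permanently",
            [("Location", location), ("Content-Length", "0")],
        )
        return [b""]

    return middleware
```

The point of the rule is that a ?ref= URL stops returning a working 200 page, so the search engine has a clear signal that only the root URL exists.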
Even if the gap in verification caused GWMT to forget that we had set ?ref= as a URL parameter, the parameter was no longer in use because the verification only went missing when we relaunched the site without this tracking. Google is seemingly 100% ignoring our canonical tags as well as the GWMT URL setting - I have no idea why and can't think of the best way to correct the situation.
Do you?
Edited To Add: As of this morning the "edit/reset" buttons have disappeared from the GWMT URL Parameters page, along with the option to add a new one. There are no messages explaining why and of course the Google help page doesn't mention disappearing buttons (it doesn't even explain what 'reset' does, or why there's no 'remove' option).
-
-
GWT numbers sometimes ignore parameter handling, oddly, and can be hard to read. I'm only seeing about 40 indexed pages with "ref" in the URL, which hardly seems disastrous.
One note - once the pages get indexed, for whatever reason, de-indexing can take weeks, even if you do everything correctly. Don't change tactics every couple of days, or you're only going to make this worse, long-term.
I think canonicals are fine for this, and they should be effective. It just may take Google some time to re-crawl and dislodge the pages. You actually may want to create an XML sitemap (for Google only) that just contains the "ref=" pages Google has indexed. This can nudge them to re-crawl and honor the canonical. Otherwise, the pages could sit there forever.
You could 301-redirect - it would be perfectly valid in this case, since those URLs have no value to visitors. I wouldn't worry about the Bing sitemaps - just don't include the "ref=" URLs in the Bing maps, and you'll be fine.
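A Google-only sitemap of the indexed "ref=" pages could be generated with something along these lines. This is a minimal sketch: it assumes you have already collected the indexed URLs by hand, `build_sitemap` is a hypothetical helper, and the example URLs are the ones from the question with an assumed `http://` scheme.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML document (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    indexed_ref_urls = [
        "http://www.three-clearance.co.uk/apple-phones.html?ref=menu",
        "http://www.three-clearance.co.uk/apple-phones.html?ref=sidebar",
    ]
    print(build_sitemap(indexed_ref_urls))
```

Submit the result in GWT only, and keep it out of the sitemap you give Bing.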
-
Monday morning, still the same, still no reset/add parameters buttons in GWMT any more, still not understanding why Google is being so stubborn about this.
3 identical pages in the index, Google ignoring both GWMT URL parameter and canonical meta tag.
Sigh.
-
Nope, nice clean site map that GWMT says provides the right number of URLs with no 404s and no ?ref= links.
It's like Google has always indexed these links separately but for some reason has decided to only show them now they no longer exist..
-
They aren't in your XML sitemap, are they? You probably generated a new one when you moved the site over... that could possibly be overriding the parameters... maybe... weird...