Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
De-indexing millions of pages - would this work?
-
Hi all,
We run an e-commerce site with a catalogue of around 5 million products.
Unfortunately, we have let Googlebot crawl and index tens of millions of search URLs, the majority of which are very thin on content or duplicates of other URLs. In short: we are in deep. Our bloated Google index is keeping our real content from ranking; Googlebot does not bother crawling our real content (product pages specifically) and hammers the life out of our servers.
Since having Googlebot crawl and de-index tens of millions of old URLs would probably take years (?), my plan is this:
- 301 redirect all old SERP URLs to a new SERP URL.
- If new URL should not be indexed, add meta robots noindex tag on new URL.
- When it is evident that Google has indexed most "high quality" new URLs, robots.txt disallow crawling of old SERP URLs. Then remove all old SERP URLs directory-style in the GWT URL Removal Tool.
- This would be an example of an old URL:
www.site.com/cgi-bin/weirdapplicationname.cgi?word=bmw&what=1.2&how=2
- This would be an example of a new URL:
www.site.com/search?q=bmw&category=cars&color=blue
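To make the redirect in step 1 concrete, here is a minimal sketch of how an old SERP URL could be translated into its new form before answering with a 301. Only the two example URLs come from the post; the lookup tables `WHAT_TO_CATEGORY` and `HOW_TO_COLOR` and the function name `redirect_target` are hypothetical, since the post does not say how the legacy `what` and `how` codes relate to `category` and `color`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

OLD_PATH = "/cgi-bin/weirdapplicationname.cgi"

# Hypothetical lookup tables: the real meaning of the legacy "what" and
# "how" codes is not given in the post, so these entries are placeholders.
WHAT_TO_CATEGORY = {"1.2": "cars"}
HOW_TO_COLOR = {"2": "blue"}


def redirect_target(old_url):
    """Return the new search URL for an old SERP URL, or None if it is not one."""
    parts = urlsplit(old_url)
    if parts.path != OLD_PATH:
        return None
    old = parse_qs(parts.query)
    new = {}
    if "word" in old:
        new["q"] = old["word"][0]
    category = WHAT_TO_CATEGORY.get(old.get("what", [""])[0])
    if category:
        new["category"] = category
    color = HOW_TO_COLOR.get(old.get("how", [""])[0])
    if color:
        new["color"] = color
    return "/search?" + urlencode(new)


# The web server would send this value in the Location header of a 301 response.
target = redirect_target("/cgi-bin/weirdapplicationname.cgi?word=bmw&what=1.2&how=2")
```

Keeping the translation in one pure function makes it easy to test the mapping against a sample of real old URLs before switching the redirects on.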
I have two specific questions:
- Would Google both de-index the old URL and not index the new URL after 301 redirecting the old URL to the new URL (which is noindexed) as described in point 2 above?
- What risks are associated with removing tens of millions of URLs directory-style in the GWT URL Removal Tool? I have done this before, but then I removed "only" some 50 000 useless "add to cart" URLs. Google itself says that you should not remove duplicate/thin content this way and that using the tool this way "may cause problems for your site".
And yes, these tens of millions of SERP URLs are the result of a faceted navigation/search function let loose for all too long.
And no, we cannot wait for Googlebot to crawl all these millions of URLs in order to discover the 301s. By then we would be out of business.
Best regards,
TalkInThePark
-
Thanks a lot, Tom. Time will tell...
Just one last thing:
What damage are you (and Google) thinking of when advising against removing URLs on a large scale through GWMT? Personally, I think Google says so only because they want to keep as much information as possible in their index.
-
Thanks for the PM, I can now appreciate the problem a little more.
I think it's something that you should not rush. What you've done seems the best thing you can do for now.
Longer term, I'd look at your CMS options!
-
Yes, I have put a conditional meta robots "noindex" on all pages whose URL contains more than 2 GET elements. It is also present on URLs containing parameters of little or no SEO value (e.g. the "price" parameter).
Regarding the nofollow directive, my plan is to not put it in the head but on the individual links pointing to URLs that should not be indexed. If we happen to get a backlink to one of these noindexed pages, I want the link value to get passed on to listed product pages.
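The rule described above can be sketched roughly as follows. This is an illustration, not the site's actual code: the function names, the `nofollow_enabled` switch and the sample URLs are assumptions, and `LOW_VALUE_PARAMS` contains only the one parameter the post names.

```python
from html import escape
from urllib.parse import parse_qsl, urlsplit

MAX_PARAMS = 2                # more than 2 GET elements triggers noindex
LOW_VALUE_PARAMS = {"price"}  # only "price" is named in the post


def should_noindex(url):
    """True if the URL has too many GET parameters or a low-value one."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    names = {name for name, _ in params}
    return len(params) > MAX_PARAMS or bool(names & LOW_VALUE_PARAMS)


def robots_meta(url):
    """Meta tag for the page head; empty string for indexable pages."""
    return '<meta name="robots" content="noindex">' if should_noindex(url) else ""


def render_link(href, text, nofollow_enabled=True):
    """Render an internal link, adding rel="nofollow" on the link itself
    (not in the head) when the target should not be indexed."""
    rel = ' rel="nofollow"' if nofollow_enabled and should_noindex(href) else ""
    return f'<a href="{escape(href, quote=True)}"{rel}>{escape(text)}</a>'


link = render_link("/search?q=bmw&price=100", "BMW by price")
```

Because the page-level tag is a plain noindex and the nofollow sits on individual links, links on a noindexed results page to product pages stay followable, which matches the intent of passing link value on.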
My big worry is what to do if this de-indexation process takes forever...
-
If you could put a conditional meta tag into the source code that shows the nofollow tag when the URL contains more than 3 GET elements, then that might help?
You seem to have already thought hard about your options, and they sound ok. Let's just wait to see whether any Gurus are about to shout stop!
-
Thanks for answering that quickly, Tom!
We cannot robots.txt disallow all URLs. We get quite a lot of organic traffic to these URLs. In July, organic traffic landing on results pages gave us approximately $85 000 in revenue. It is also worth knowing that pages resulting from searching and browsing share the same URL: the search phrase is treated as just another filtering parameter in the URL.
Keeping the same URL structure is part of my preferred, 2-step solution:
- Meta Robots "noindex" unwanted results pages (the overwhelming majority)
- When our Google index has shrunk enough, put rel=nofollow on internal links pointing to those results pages in order to prevent bots from crawling them.
I have actually implemented step 1 (as of yesterday). The solution I was describing in my original post is my last resort solution. I wanted to get a professional opinion on that one in order to know if I should rule it out or not.
Unfortunately, I cannot disclose our company name here (I have a feeling our competitors use Seomoz as well :)). But I'll send you some links in a private message.
-
If I were you I'd keep the same URL structure. You're correct in thinking this won't be a quick fix.
First, use the robots.txt to disallow robots access to the search pages.
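As a rough sketch of what such a rule could look like, and how to check it before deploying, the snippet below feeds a two-line robots.txt to Python's standard `urllib.robotparser`. The disallowed path is taken from the old URL example in the original question; the product URL is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: block every crawler from the legacy search script.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/weirdapplicationname.cgi
"""


def build_parser(robots_txt):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

old_serp = "http://www.site.com/cgi-bin/weirdapplicationname.cgi?word=bmw&what=1.2&how=2"
product = "http://www.site.com/product/12345"  # hypothetical product URL

serp_blocked = not parser.can_fetch("Googlebot", old_serp)
product_allowed = parser.can_fetch("Googlebot", product)
```

One thing to keep in mind: a crawler that respects the disallow never requests the page, so it will not see a meta robots noindex tag or a 301 on that URL.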
Don't remove all results from GWT just yet. This will be a long task and might damage your site's performance.
Could you provide some links to your site? I'll have a closer look.