Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Mass Removal Request from Google Index
-
Hi,
I am trying to cleanse a news website. When this website was first made, the people that set it up copied all kinds of articles they had as a newspaper, including tests, internal communication, and drafts. This site has lots of junk, but this kind of junk was on the initial backup, aka before 1st-June-2012. So, removing all mixed content prior to that date, we can have pure articles starting June 1st, 2012!
Therefore
- My dynamic sitemap now contains only articles with release date between 1st-June-2012 and now
- Any article that has release date prior to 1st-June-2012 returns a custom 404 page with "noindex" metatag, instead of the actual content of the article.
The question is how I can remove from the google index all this junk as fast as possible that is not on the site anymore, but still appears in google results?
I know that for individual URLs I need to request removal from this link
https://www.google.com/webmasters/tools/removalsThe problem is doing this in bulk, as there are tens of thousands of URLs I want to remove. Should I put the articles back to the sitemap so the search engines crawl the sitemap and see all the 404? I believe this is very wrong. As far as I know this will cause problems because search engines will try to access non existent content that is declared as existent by the sitemap, and return errors on the webmasters tools.
Should I submit a DELETED ITEMS SITEMAP using the <expires>tag? I think this is for custom search engines only, and not for the generic google search engine.
https://developers.google.com/custom-search/docs/indexing#on-demand-indexing</expires>The site unfortunatelly doesn't use any kind of "folder" hierarchy in its URLs, but instead the ugly GET params, and a kind of folder based pattern is impossible since all articles (removed junk and actual articles) are of the form:
http://www.example.com/docid=123456So, how can I bulk remove from the google index all the junk... relatively fast?
-
Hi Ioannis,
What about the first suggestion? Can you create a page linking to all of the pages that you'd like to remove, then have Google crawl that page?
Best,
Kristina
-
Thank you Kristina,
I know about the URL structure, I have been trying the past few months to cleanse this site that I was not involved in its creation. It has several more SEO problems that have either been fixed or not yet, but we are talking about more than 50 SEO problems I've found so far - most of these critical.
On the sitemap that I built, the junk pages do not exist, and because this is sitemap I have written myself, I can easily make another containing the articles that I have removed (just reverse a part of my select query for the sitemap to get the ones I have removed).
http://www.neakriti.gr/webservices/sitemap-index.aspx
So far I implemented the last of your suggestions and here is an example:
This is a valid article page
http://www.neakriti.gr/?page=newsdetail&DocID=1314221 - (Status Code: 200)This is a non existent article page (never existed at the first place) - (Status Code: 404)
http://www.neakriti.gr/?page=newsdetail&DocID=12345678This is one of the articles that I removed from sitemap and site - (Status Code: 410)
http://www.neakriti.gr/?page=newsdetail&DocID=894052Also I would like you to take a look at another question about the same site and see that it can relate to this question with garbage articles too...
https://moz.com/community/q/multiple-instances-of-the-same-articleThank you so much!
-
Hi Ioannis,
You're in quite a bind here, without a good URL structure! I don't think there's any one perfect option, but I think all of these will work:
- Create a page on your site that links to every article you would like to delete, keeping those articles 404/410ed. Then, use the Fetch as Googlebot tool, and ask Google to crawl the page plus all of its links. This will get Google to quickly crawl all of those pages, see that they're gone, and remove them from their index. Keep in mind that if you just use a 404, Google may keep the page around for a bit to make sure you didn't just mess up. As Eric said, a 410 is more of a sure thing.
- Create an XML sitemap of those deleted articles, and have Google crawl it. Yes, this will create errors in GSC, but errors in GSC mean that they're concerned you've made a mistake, not that they're necessarily penalizing you. Just mark those guys as fixed and take the sitemap down once Google's crawled it.
- 410 these pages, remove all internal links to them (use a tool like Screaming Frog to make sure you didn't miss any links!), and remove them from your sitemap. That'll distance you from that old, crappy content, and Google will slowly realize that it's been removed as it checks in on its old pages. This is probably the least satisfying option, but it's an option that'll get the job done eventually.
Hope this helps! Let us know what you decide to do.
Best,
Kristina
-
Thank you,
so you suggest that based on my date based query, instead of blocking everything before that date blindly, keep blocking it with 410, while anything that doesn't exist anyway return 404.
Also another question, about the blocked articles that return 410, should I put their URLs back on the xml sitemap or not?
-
Any article that has release date prior to 1st-June-2012 should return a custom 410 page with "noindex" metatag, instead of the actual content of the article.
The error returned should be a "410 gone" and not just a 404. That way Google will treat it differently, and may remove it from the index faster than just returning a 404. Also, you can use the Google removal tool, as well. Don't forget the robots.txt file, as well, there may be directories with the content that you need to disallow.
But overall, using a 410 is going to be better and most likely faster.
-
Thank you for your response.
I defenintelly cannot use noindex because as I explained I changed all articles prior to the minimum given date to return 404. So this content is not visibly available on the web in order to contain a noindex directive. Unless you mean to have it at my custom 404 page, where yes its there.
Also there is no folder to associate in robots, since they are in ugly form of GET params like DOCID=12345. So given that, there are thousands of DocIDs that are junk and removed, and thousands that are the actuall articles.
So I assumed that creating a "deleted articles" sitemap where each <url>will contain an <expires>2016-06-01</expires> tag seemed the most logical thing, but I am afraid its for "custom search engines", rather than for normal de-index requests as its provided bellow</url>
https://developers.google.com/custom-search/docs/indexing#on-demand-indexing
-
Sitemaps is definitely not the way to go for this as you can't just have an expires tag in there and it would make pages go away. The best option to go with is the meta robots and then put them either on nonindex, nofollow, or noindex, follow. With this approach and hopefully with a relative high crawl rate you can make sure that the data from these pages will be removed from the Google Index as soon as possible.
If you still want these pages to be indexed but maybe just not have them crawled anymore, which I don't think you'd like to do based on your explanation then go with robots.txt and excluding the pages in there that you'd like to.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Trying to get Google to stop indexing an old site!
Howdy, I have a small dilemma. We built a new site for a client, but the old site is still ranking/indexed and we can't seem to get rid of it. We setup a 301 from the old site to the new one, as we have done many times before, but even though the old site is no longer live and the hosting package has been cancelled, the old site is still indexed. (The new site is at a completely different host.) We never had access to the old site, so we weren't able to request URL removal through GSC. Any guidance on how to get rid of the old site would be very appreciated. BTW, it's been about 60 days since we took these steps. Thanks, Kirk
Intermediate & Advanced SEO | Jul 29, 2019, 12:34 PM | kbates0 -
How to stop URLs that include query strings from being indexed by Google
Hello Mozzers Would you use rel=canonical, robots.txt, or Google Webmaster Tools to stop the search engines indexing URLs that include query strings/parameters. Or perhaps a combination? I guess it would be a good idea to stop the search engines crawling these URLs because the content they display will tend to be duplicate content and of low value to users. I would be tempted to use a combination of canonicalization and robots.txt for every page I do not want crawled or indexed, yet perhaps Google Webmaster Tools is the best way to go / just as effective??? And I suppose some use meta robots tags too. Does Google take a position on being blocked from web pages. Thanks in advance, Luke
Intermediate & Advanced SEO | Mar 15, 2017, 9:04 PM | McTaggart0 -
My site shows 503 error to Google bot, but can see the site fine. Not indexing in Google. Help
Hi, This site is not indexed on Google at all. http://www.thethreehorseshoespub.co.uk Looking into it, it seems to be giving a 503 error to the google bot. I can see the site I have checked source code Checked robots Did have a sitemap param. but removed it for testing GWMT is showing 'unreachable' if I submit a site map or fetch Any ideas on how to remove this error? Many thanks in advance
Intermediate & Advanced SEO | Nov 23, 2015, 1:10 PM | SolveWebMedia0 -
Why Google isn't indexing my images?
Hello, on my fairly new website Worthminer.com I am noticing that Google is not indexing images from my sitemap. Already 560 images submitted and Google indexed only 3 of them. Altough there is more images indexed they are not indexing any new images, and I have no idea why. Posts, categories and other urls are indexing just fine, but images not. I am using Wordpress and for sitemaps Wordpress SEO by yoast. Am I missing something here? Why Google won't index my images? Thanks, I appreciate any help, David xv1GtwK.jpg
Intermediate & Advanced SEO | Jul 9, 2016, 2:40 PM | Worthminer1 -
Google indexing only 1 page out of 2 similar pages made for different cities
We have created two category pages, in which we are showing products which could be delivered in separate cities. Both pages are related to cake delivery in that city. But out of these two category pages only 1 got indexed in google and other has not. Its been around 1 month but still only Bangalore category page got indexed. We have submitted sitemap and google is not giving any crawl error. We have also submitted for indexing from "Fetch as google" option in webmasters. www.winni.in/c/4/cakes (Indexed - Bangalore page - http://www.winni.in/sitemap/sitemap_blr_cakes.xml) 2. http://www.winni.in/hyderabad/cakes/c/4 (Not indexed - Hyderabad page - http://www.winni.in/sitemap/sitemap_hyd_cakes.xml) I tried searching for "hyderabad site:www.winni.in" in google but there also http://www.winni.in/hyderabad/cakes/c/4 this link is not coming, instead of this only www.winni.in/c/4/cakes is coming. Can anyone please let me know what could be the possible issue with this?
Intermediate & Advanced SEO | Sep 6, 2023, 9:56 AM | abhihan0 -
How is Google crawling and indexing this directory listing?
We have three Directory Listing pages that are being indexed by Google: http://www.ccisolutions.com/StoreFront/jsp/ http://www.ccisolutions.com/StoreFront/jsp/html/ http://www.ccisolutions.com/StoreFront/jsp/pdf/ How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why. If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site? Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content. For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page: http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff As you can see, this results in duplicate content problems. Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result? For example: Disallow: /StoreFront/jsp/ Disallow: /StoreFront/jsp/html/ Disallow: /StoreFront/jsp/pdf/ Can we do this without risking blocking Googlebot from content we do want crawled and indexed? Many thanks in advance for any and all help on this one!
Intermediate & Advanced SEO | Sep 13, 2013, 6:49 PM | danatanseo0 -
Google Indexing Feedburner Links???
I just noticed that for lots of the articles on my website, there are two results in Google's index. For instance: http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html and http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+thewebhostinghero+(TheWebHostingHero.com) Now my Feedburner feed is set to "noindex" and it's always been that way. The canonical tag on the webpage is set to: rel='canonical' href='http://www.thewebhostinghero.com/articles/tools-for-creating-wordpress-plugins.html' /> The robots tag is set to: name="robots" content="index,follow,noodp" /> I found out that there are scrapper sites that are linking to my content using the Feedburner link. So should the robots tag be set to "noindex" when the requested URL is different from the canonical URL? If so, is there an easy way to do this in Wordpress?
Intermediate & Advanced SEO | Jun 14, 2013, 9:44 AM | sbrault740 -
Best way to de-index content from Google and not Bing?
We have a large quantity of URLs that we would like to de-index from Google (we are affected b Panda), but not Bing. What is the best way to go about doing this?
Intermediate & Advanced SEO | Nov 8, 2011, 5:07 AM | nicole.healthline0