Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Disallowed Pages Still Showing Up in Google Index. What do we do?
-
We recently disallowed a wide variety of pages for www.udemy.com which we do not want google indexing (e.g., /tags or /lectures). Basically we don't want to spread our link juice around to all these pages that are never going to rank. We want to keep it focused on our core pages which are for our courses.
We've added them as disallows in robots.txt, but after 2-3 weeks google is still showing them in it's index. When we lookup "site: udemy.com", for example, Google currently shows ~650,000 pages indexed... when really it should only be showing ~5,000 pages indexed.
As another example, if you search for "site:udemy.com/tag", google shows 129,000 results. We've definitely added "/tag" into our robots.txt properly, so this should not be happening... Google showed be showing 0 results.
Any ideas re: how we get Google to pay attention and re-index our site properly?
-
The last time I used a tool, excluding via robots.txt was also sufficient for URL removal.
Recently, Google has updated their documentation to strongly encourage you to use URL removal only for things like exposing confidential information, and not to clean up old pages or errors in your GWT account (see http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1269119). I know many people still use the tool for that type of stuff, but wanted to point out that change.
-
Thank you Keri.
Yes, good idea, but whatever you request, that page or directory must respond with a 404, otherwise, it will be ignored.
- that is why I couldn't do that with the send to a friend URLs
(would have been a nice thing to do)
I guess I could have cheated, and made them return a 404 if it was google, just to dump them all out of the index.
The 15,000 I did request to be removed were individual pages, that returned 404 response code, so thats why I did them one at a time. I could have waited, but if you wait, then google keeps trying to fetch those missing pages and they keep reporting them in your GWT.
That is a good reason to request the removals.
I actually gave up when the number of deletions got to 1.5 million. I figured it was just too hard to do.
-
The last time I looked, you can request removal of an entire directory as well, which should work for the OP.
-
I would have said the same thing, except that a few weeks ago, I removed a rule from the robots file and I changed the affected pages to have a noindex.nofollow and the next day, tens of thousands of those pages appeared in the index and overpowered the content pages.
So my advice, is don't trust noindex,nofollow and just stop the robot going down that tree (as you are doing) and find another way to get those pages out of the index.
You can use the URL removal request tool.
It only seems to allow you to remove 1000 per day.
I have done this before by automating the removal using a macro program.
I think I removed about 15,000 over the space of a month, doing that.
They are fairly fast at removing URLs these days, 24 hours or less.
-
Disallowing in your robots.txt keeps the bots from indexing your pages going forward, but Google may keep returning them in search results. This post has great explanations about ways to remove pages from indices: http://www.seomoz.org/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts
The surefire way to get them out of the index is to remove the disallow from your robots.txt, and add a meta noindex tags on all the pages you want removed. Once they're reindexed by Google, they'll no longer appear in SERPs.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Should I use noindex or robots to remove pages from the Google index?
I have a Magento site and just realized we have about 800 review pages indexed. The /review directory is disallowed in robots.txt but the pages are still indexed. From my understanding robots means it will not crawl the pages BUT if the pages are still indexed if they are linked from somewhere else. I can add the noindex tag to the review pages but they wont be crawled. https://www.seroundtable.com/google-do-not-use-noindex-in-robots-txt-20873.html Should I remove the robots.txt and add the noindex? Or just add the noindex to what I already have?
Intermediate & Advanced SEO | | Tylerj0 -
Google not Indexing images on CDN.
My URL is: http://bit.ly/1H2TArH We have set up a CDN on our own domain: http://bit.ly/292GkZC We have an image sitemap: http://bit.ly/29ca5s3 The image sitemap uses the CDN URLs. We verified the CDN subdomain in GWT. The robots.txt does not restrict any of the photos: http://bit.ly/29eNSXv. We used to have a disallow to /thumb/ which had a 301 redirect to our CDN but we removed both the disallow in the robots.txt as well as the 301. Yet, GWT still reports none of our images on the CDN are indexed.
Intermediate & Advanced SEO | | alphonsehaThe above screenshot is from the GWT of our main domain.The GWT from the CDN subdomain just shows 0. We did not submit a sitemap to the verified subdomain property because we already have a sitemap submitted to the property on the main domain name. While making a search of images indexed from our CDN, nothing comes up: http://bit.ly/293ZbC1While checking the GWT of the CDN subdomain, I have been getting crawling errors, mainly 500 level errors. Not that many in comparison to the number of images and traffic that we get on our website. Google is crawling, but it seems like it just doesn't index the pictures!?
Can anyone help? I have followed all the information that I was able to find on the web but yet, our images on the CDN still can't seem to get indexed.
0 -
Substantial difference between Number of Indexed Pages and Sitemap Pages
Hey there, I am doing a website audit at the moment. I've notices substantial differences in the number of pages indexed (search console), the number of pages in the sitemap and the number I am getting when I crawl the page with screamingfrog (see below). Would those discrepancies concern you? The website and its rankings seems fine otherwise. Total indexed: 2,360 (Search Consule)
Intermediate & Advanced SEO | | Online-Marketing-Guy
About 2,920 results (Google search "site:example.com")
Sitemap: 1,229 URLs
Screemingfrog Spider: 1,352 URLs Cheers,
Jochen0 -
How do you check the google cache for hashbang pages?
So we use http://webcache.googleusercontent.com/search?q=cache:x.com/#!/hashbangpage to check what googlebot has cached but when we try to use this method for hashbang pages, we get the x.com's cache... not x.com/#!/hashbangpage That actually makes sense because the hashbang is part of the homepage in that case so I get why the cache returns back the homepage. My question is - how can you actually look up the cache for hashbang page?
Intermediate & Advanced SEO | | navidash0 -
Proper 301 in Place but Old Site Still Indexed In Google
So i have stumbled across an interesting issue with a new SEO client. They just recently launched a new website and implemented a proper 301 redirect strategy at the page level for the new website domain. What is interesting is that the new website is now indexed in Google BUT the old website domain is also still indexed in Google? I even checked the Google Cached date and it shows the new website with a cache date of today. The redirect strategy has been in place for about 30 days. Any thoughts or suggestions on how to get the old domain un-indexed in Google and get all authority passed to the new website?
Intermediate & Advanced SEO | | kchandler0 -
How to Disallow Tag Pages With Robot.txt
Hi i have a site which i'm dealing with that has tag pages for instant - http://www.domain.com/news/?tag=choice How can i exclude these tag pages (about 20+ being crawled and indexed by the search engines with robot.txt Also sometimes they're created dynamically so i want something which automatically excludes tage pages from being crawled and indexed. Any suggestions? Cheers, Mark
Intermediate & Advanced SEO | | monster990 -
Should pages of old news articles be indexed?
My website published about 3 news articles a day and is set up so that old news articles can be accessed through a "back" button with articles going to page 2 then page 3 then page 4, etc... as new articles push them down. The pages include a link to the article and a short snippet. I was thinking I would want Google to index the first 3 pages of articles, but after that the pages are not worthwhile. Could these pages harm me and should they be noindexed and/or added as a canonical URL to the main news page - or is leaving them as is fine because they are so deep into the site that Google won't see them, but I also won't be penalized for having week content? Thanks for the help!
Intermediate & Advanced SEO | | theLotter0 -
Tool to calculate the number of pages in Google's index?
When working with a very large site, are there any tools that will help you calculate the number of links in the Google index? I know you can use site:www.domain.com to see all the links indexed for a particular url. But what if you want to see the number of pages indexed for 100 different subdirectories (i.e. www.domain.com/a, www.domain.com/b)? is there a tool to help automate the process of finding the number of pages from each subdirectory in Google's index?
Intermediate & Advanced SEO | | nicole.healthline0