Huge Google index with irrelevant pages
-
Hi,
I run a site about sports matches. Every match has its own page, and the pages are generated automatically from the DB. The pages are not duplicates, but over time some look a little similar. After a match finishes, its page has no internal links or sitemap entry, but it is still reachable by direct URL and stays in Google's index. So over time we have accumulated more than 100,000 indexed pages.
Past matches have no significance, they're not linked, and a match can repeat, so it may look like duplicate content. What do you suggest we do
when a match is finished (its page is no longer linked, but still appears in the index and the SERPs)?
-
301 redirect the match page to the match category, which is higher in the hierarchy and is always relevant?
-
Use rel=canonical pointing to the match category?
-
Do nothing?
*A 301 redirect will shrink my index status, and some say a high index status is good...
*Is it safe to 301 redirect 100,000 pages at once? Wouldn't it look strange to Google?
*Would a canonical remove the past match pages from the index?
What do you think?
Thanks,
Assaf.
-
-
Regarding what you've written: blocking a page via robots.txt doesn't remove it from the index. It simply prevents the crawlers from reaching the page. So if you block a page via robots.txt, the page remains in the index; Google just can't go back to the page and see whether anything has changed. That means if you block the page via robots.txt and also add a noindex tag to it, Google won't be able to crawl the page, won't see the noindex tag, and so won't remove the page from the index.
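To make the robots.txt point concrete, here is a minimal Python sketch using the standard library's urllib.robotparser. The robots.txt rules and the example URLs are assumptions for illustration only (they follow the folder layout mentioned in this thread), and the sketch only shows what a rule-obeying crawler is allowed to fetch, not how Google behaves internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules that block the old match folder.
ROBOTS_TXT = """\
User-agent: *
Disallow: /sport/match/
"""


def can_crawl(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A crawler that obeys these rules never downloads the old page, so it
# never gets to read a noindex tag placed in that page's HTML.
blocked = can_crawl("http://www.domain.com/sport/match/T1vT2")
allowed = can_crawl("http://www.domain.com/sport/new-match-dir/T1vT2")
print(blocked, allowed)
```

If the first check comes back False, the page cannot be fetched, which is exactly why a noindex tag on a blocked page goes unseen.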
If you moved all of your old content to a different folder and blocked that folder via robots.txt, Google wouldn't remove those pages from the index. To remove them, you would have to go into Webmaster Tools and use the URL removal tool on that new folder. The folder has to be blocked via robots.txt first: only when Google sees that it is blocked will the removal tool take the whole folder out of the index.
I'm not sure whether this would keep working in the future, though. If you removed a folder from the index and later moved more previously indexed content into it, I don't know what would happen to that content. Either way, Google has to come back and recrawl each old page to see that it has moved to the new folder before removing it from the index, so the content will only be removed once Google recrawls it.
So I still think a better way to remove the content from the index is to add the noindex tag to the old pages. To facilitate the search engines reaching these old pages, I'd make sure there is a way the engines can get to them - make sure there is a path they can take to reach them.
Another good idea I saw in a forum post here a while ago is to create a sitemap containing all of the old pages you have indexed and want removed, and add the noindex tag to those pages. Using the Webmaster Tools sitemap interface, you can then monitor the progress of deindexation over time: check how many pages in the old-content sitemap(s) are reported as indexed at the start, then check again later to see how many are still indexed. That will be a good indicator of how the deindexation is progressing.
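As a rough sketch of the sitemap idea, the Python below builds a sitemap file body from a list of old match URLs. The function name and the example URLs are made up for illustration; in practice the list would come from your own DB.

```python
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical old match pages that should drop out of the index.
old_match_urls = [
    "http://www.domain.com/sport/match/T1vT2",
    "http://www.domain.com/sport/match/T3vT4",
]
sitemap_xml = build_sitemap(old_match_urls)
print(sitemap_xml)
```

Note that the sitemap protocol caps a single file at 50,000 URLs, so 100,000+ old pages would need to be split across several sitemap files.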
-
Dear Mark,
*I've sent you a private message.
I'm starting to understand that I have a much bigger problem.
*My index status contains 120k pages, while only 2,000 are currently relevant.
Your suggestion is that after a match finishes we programmatically add the noindex tag to the page, and Google will then remove it from its index. That could work for relatively new pages, but since very old pages have no links or sitemap entry, it could take a very long time to clear the index because they're rarely crawled, if at all.
- A more aggressive approach would be to change the site architecture and use robots.txt to block the folder that holds all the irrelevant past pages.
So if a match URL today looks like this: www.domain.com/sport/match/T1vT2
block www.domain.com/sport/match/ in robots.txt
and from now on create all new matches in a different folder, like: www.domain.com/sport/new-match-dir/T1vT2
-
Is this a good solution?
-
Wouldn't Google penalize me for removing a directory with 100k pages?
-
If it's a good approach, how long will it take for Google to clear all those pages from its index?
I know it's a long one, and I'd really appreciate your response.
Thanks a lot,
Assaf.
-
there are a bunch of articles out there, but each case is different - here are a few:
http://www.searchenginejournal.com/the-holy-grail-of-panda-recovery-a-1-year-case-study/45683/
You can contact me via private message here on the forum and I can try to take a more in depth look at your site if you can give me some more detailed info.
-
Yes. When the first Panda update was rolled out I lost 50% of my traffic from Google, and I haven't really recovered since.
-
Before we talk about recovering from a Panda hit, are you sure you were hit by Panda?
-
Thanks Mark!
Are there any good articles about how to recover from Panda?
-
Exactly - I'd build a strategy more around promoting pages that will have long lasting value.
If you use the noindex, follow tag, the page will continue to pass link juice throughout the site; it's just that the individual page carrying the tag will not appear in the search results and will not be part of the index. For the tag to work, Google first has to crawl the page and see the tag, so it doesn't happen instantaneously. If they crawl these deeper pages only once every few weeks, once a month, or even less often, it may take a while for the pages to be removed from the index.
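If you want to double-check that a page really carries the tag a crawler would see, a small Python sketch like the one below can read the robots meta tag out of the page's HTML. The class and function names and the sample HTML are assumptions for illustration; it uses only the standard library's html.parser.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # "noindex, follow" -> {"noindex", "follow"}
        for directive in content.split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives present in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical HTML of a finished match page.
page = (
    '<html><head><meta name="robots" content="noindex, follow">'
    "</head><body>Old match</body></html>"
)
directives = robots_directives(page)
print(sorted(directives))
```

Running something like this over a sample of old match pages is a cheap way to confirm the tag was actually rolled out before waiting weeks for a recrawl.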
-
Hi Mark
-
These pages are very important while they are relevant (before the match finishes). They are the source of most of our traffic, which comes from long-tail searches.
-
Some of these pages have inbound links, and it would be a shame to lose all that juice.
-
Would noindex remove the pages from the Google index? How long would it take? Wouldn't a huge batch of noindexed pages also look suspicious?
-
By "evergreen pages", do you mean pages that are always relevant, like the League page, the Sport page, etc.?
Thanks,
Assaf.
-
-
Hi Assaf,
(I'm not stalking you, I just think you've raised another interesting question)
In terms of index status/size, you don't want to create a massive index of empty/low value pages - this is food for Google's Panda algorithm, and will not be good for your site in the long run. It'll get a Panda smack if it hasn't already.
To remove these pages from the index, instead of doing hundreds of thousands of 301 redirects, which your server won't like either, I'd recommend adding the noindex meta tag to the pages.
I'd put a rule in your CMS so that, after a certain point in time, those pages get the noindex tag. Make sure you also have evergreen pages on your site that can serve as landing pages for the search engines and which won't need to be removed after a short period of time. These are the pages you'll want to focus your outreach and link building efforts on.
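A rule like that could look roughly like the Python sketch below. The function name and the 7-day grace period are assumptions picked for illustration, not a recommended value; the point is only that the page template chooses its robots meta tag from the match end time.

```python
from datetime import datetime, timedelta, timezone

# Assumed grace period after the final whistle; tune to your own traffic.
NOINDEX_AFTER = timedelta(days=7)


def robots_meta_for_match(match_end, now):
    """Return the robots meta tag a match page should render at time now."""
    if now - match_end >= NOINDEX_AFTER:
        # Old match: keep passing link value, but ask to leave the index.
        return '<meta name="robots" content="noindex, follow">'
    # Upcoming or recent match: keep the page indexable.
    return '<meta name="robots" content="index, follow">'


# Example with a hypothetical match that ended on 1 January 2024 (UTC).
match_end = datetime(2024, 1, 1, 20, 0, tzinfo=timezone.utc)
print(robots_meta_for_match(match_end, match_end + timedelta(days=1)))
print(robots_meta_for_match(match_end, match_end + timedelta(days=30)))
```

Because the tag is computed at render time, nothing has to be changed on 100,000 pages at once; each old page simply starts serving noindex the next time it is requested.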
Mark