Panda Updates - robots.txt or noindex?
-
Hi,
I have a site that I believe has been impacted by the recent Panda updates. Assuming that Google has crawled and indexed several thousand pages that are essentially the same and the site has now passed the threshold to be picked out by the Panda update, what is the best way to proceed?
Is it enough to block the pages from being crawled in the future using robots.txt, or would I need to remove the pages from the index using the meta noindex tag? Of course if I block the URLs with robots.txt then Googlebot won't be able to access the page in order to see the noindex tag.
Does anyone have any previous experience of doing something similar?
Thanks very much.
-
This is a good read: http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world. I think you should be careful with robots.txt, because blocking the bot's access will not cause Google to remove the content from its index. The pages will simply stay listed with a message saying Google is not quite sure what is on the page. I would use noindex to clear the pages out of the index first, before attempting robots.txt exclusion.
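To make the ordering concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It only checks whether a crawler that obeys robots.txt would be allowed to fetch a page, which is the precondition for it ever seeing a noindex tag on that page. The domain `www.example.com`, the `/duplicates/` folder, and the helper name `can_see_noindex` are hypothetical and not taken from the thread.

```python
from urllib.robotparser import RobotFileParser


def can_see_noindex(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if a robots.txt-obeying crawler may fetch `url`.

    A crawler can only read a meta noindex tag on pages it is allowed
    to fetch, so a False result means the tag would go unseen.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt contents, for illustration only.
BLOCKED = "User-agent: *\nDisallow: /duplicates/\n"
OPEN = "User-agent: *\nDisallow:\n"

PAGE = "https://www.example.com/duplicates/page-1"

print(can_see_noindex(OPEN, PAGE))     # crawlable, so noindex can be read
print(can_see_noindex(BLOCKED, PAGE))  # blocked, so noindex stays invisible
```

Under these assumptions, the Disallow rule would only be added once the noindexed pages have actually dropped out of the index.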
-
Yes, both, because if a page is linked to from another site, Google will spider that other site and follow the link to your page without hitting the robots.txt, and the page could get indexed if there is no noindex on it.
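For anyone who wants to verify that the noindex is actually present on the pages in question, a rough sketch with Python's standard `html.parser` might look like this. The class and function names are hypothetical, and it only inspects meta tags in the HTML you pass in; it does not look at HTTP headers such as X-Robots-Tag.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for valueless attributes.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


sample = '<html><head><meta name="robots" content="NOINDEX, follow" /></head></html>'
print(has_noindex(sample))
```

Running this over a sample of the near-duplicate URLs is one way to confirm the tag was deployed before touching robots.txt.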
-
Indeed try both.
Irving +1
-
Both. Block the lowest-quality, lowest-traffic pages with noindex and block the folder in robots.txt.
Related Questions
-
What should I do after a failed request for validation (error with noindex, nofollow) in new Google Search Console?
Hi guys, we have the following situation: after an error message in the new Google Search Console for a large number of pages with a noindex, nofollow tag, validation was requested before the problem was fixed (an incredibly stupid decision, taken before asking the SEO team for advice). Google started the validation, crawled 9 URLs and changed the status to "Failed". All other URLs are still in "pending" status. The problem has been fixed for more than 10 days, but apparently Google doesn't crawl the pages and none of the URLs is back in the index. We tried pinging several pages and HTML sitemaps, but with no result. Do you think we should request re-validation or wait longer? Is there something more we could do to speed up the process?
Intermediate & Advanced SEO | | ParisChildress0 -
Robots.txt - Googlebot - Allow... what's it for?
Hello - I just came across this in robots.txt for the first time, and was wondering why it is used? Why would you have to proactively tell Googlebot to crawl JS/CSS and why would you want it to? Any help would be much appreciated - thanks, Luke
User-Agent: Googlebot
Allow: /.js
Allow: /.css
Intermediate & Advanced SEO | | McTaggart0 -
New Google Update on 21st April
Hi, we all know that the mobile update is coming on the 21st April, and if your site isn't mobile friendly in Google's eyes you will be removed from the mobile index. Will this affect tablets? Most of our pages are mobile friendly but there are a few which aren't. However, these are tablet friendly. I haven't heard Google mention anything about tablet rankings. Thanks, Andy
Intermediate & Advanced SEO | | Andy-Halliday0 -
Should I use meta noindex and robots.txt disallow?
Hi, we have an alternate "list view" version of every one of our search results pages. The list view has its own URL, indicated by a URL parameter. I'm concerned about wasting our crawl budget on all these list view pages, which effectively doubles the number of pages that need crawling. When they were first launched, I had the noindex meta tag placed on all list view pages, but I'm concerned that they are still being crawled. Should I therefore go ahead and also apply a robots.txt disallow on that parameter to ensure that no crawling occurs? Or will Googlebot/Bingbot also stop crawling those pages over time? I assume that noindex still means "crawl"... Thanks 🙂
Intermediate & Advanced SEO | | ntcma0 -
Is this Panda?
Hi, I have a page that has been fluctuating a lot the last few days, here are the results: 5/14: #17 (this is where it had been ranking for about 2 months). 5/15: #34 5/16: #33 5/18: #9 5/19: #35 5/20: #13 5/21: #37 I have only made minor changes to the page, and the link profile seems to look good. Here's the page: www.thesandiegocriminallawyer.com/dui.html (targeted KW: San Diego DUI Lawyer, San Diego DUI Attorney). The page has a lot of high-quality, original, and well-cited content. Any thoughts on what could be causing so much back and forth? I should state that none of the other rankings for this site (overall) have been impacted. Just this page for DUI related searches (San Diego DUI Lawyer, San Diego DUI attorney, San Diego Drunk Driving Lawyer, etc.).
Intermediate & Advanced SEO | | mrodriguez14400 -
Robots.txt: Syntax URL to disallow
Has anyone ever experienced "collateral damage" when disallowing URLs? Some old URLs are still present on our website, and while we are "cleaning" them off the site (which takes time), I would like to avoid their indexation through the robots.txt file. The old URL syntax is "/brand//13" while the new one is "/brand/samsung/13" (note that there are 2 slashes in the URL after the word "brand"). Do I risk erasing the new, good URLs from the SERPs if I add the line "Disallow: /brand//" to the robots.txt file? I don't think so, but thank you to everyone who can help me clear this up 🙂
Intermediate & Advanced SEO | | Kuantokusta0 -
NOINDEX content still showing in SERPS after 2 months
I have a website that was likely hit by Panda or some other algorithm change. The hit finally occurred in September of 2011. In December my developer set the following meta tag on all pages that do not have unique content: <meta name="robots" content="NOINDEX" /> It's been 2 months now and I feel I've been patient, but Google is still showing 10,000+ pages when I do a search for site:http://www.mydomain.com. I am looking for a quicker solution. Adding this many pages to the robots.txt does not seem like a sound option. The pages have been removed from the sitemap (for about a month now). I am trying to determine the best of the following options or find better options. (1) 301 all the pages I want out of the index to a single URL based on the page type (location and product). The 301 worries me a bit because I'd have about 10,000 or so pages all 301ing to one or two URLs. However, I'd get some link juice to that page, right? (2) Issue an HTTP 404 code on all the pages I want out of the index. The 404 code seems like the safest bet, but I am wondering if that will have a negative impact on my site, with Google seeing 10,000+ 404 errors all of a sudden. (3) Issue an HTTP 410 code on all pages I want out of the index. I've never used the 410 code, and while most of those pages are never coming back, eventually I will bring a small percentage back online as I add fresh new content. This one scares me the most, but I am interested to hear if anyone has ever used a 410 code. Please advise and thanks for reading.
Intermediate & Advanced SEO | | NormanNewsome0 -
Subdomains - duplicate content - robots.txt
Our corporate site provides MLS data to users, with the end goal of generating leads. Each registered lead is assigned to an agent, essentially in a round robin fashion. However, we also give each agent a domain of their choosing that points to our corporate website. The domain can be whatever they want, but upon loading it is immediately redirected to a subdomain. For example, www.agentsmith.com would be redirected to agentsmith.corporatedomain.com. Finally, any leads generated from agentsmith.easystreetrealty-indy.com are always assigned to Agent Smith instead of the agent pool (by parsing the current host name). In order to avoid being penalized for duplicate content, any page that is viewed on one of the agent subdomains always has a canonical link pointing to the corporate host name (www.corporatedomain.com). The only content difference between our corporate site and an agent subdomain is the phone number and contact email address where applicable. Two questions: (1) Can/should we use robots.txt or robots meta tags to tell crawlers to ignore these subdomains, but obviously not the corporate domain? (2) If the answer to question 1 is yes, would it be better for SEO to do that, or to leave it how it is?
Intermediate & Advanced SEO | | EasyStreet0