New CMS system - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website recently switched to a new CMS.
Over the last 10 years or so, we've used 3 different CMS platforms on our current domain. As you'd expect, this has resulted in lots of URLs.
Until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and GWMT, I've been able to locate and redirect all pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to Google Webmaster Tools' 'Not Found' report, there are literally over 100,000 additional URLs out there that it's still trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, only using page-level robots tags to disallow where necessary.
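To clarify what I mean (the paths below are just examples, not our real directories): today we only use page-level tags like

    <meta name="robots" content="noindex">

and I'm wondering whether there's any benefit to adding robots.txt rules for the old directories as well, e.g.:

    User-agent: *
    # example: block an entire legacy directory
    Disallow: /old-cms-directory/
    # example: Google and Bing support wildcards for old URL patterns
    Disallow: /*.asp$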
Thanks!
-
Great stuff... thanks again for your advice, much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (webmaster tools, for Google and Bing, is probably the best place to check) and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional, and that none of them are valuable pages or are occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and the problem could be more than just a big chunk of 404s.
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
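In case it's useful context: we're on Apache, so our 301s go in .htaccess along these lines (the paths here are placeholders, not our real URLs):

    # one-off redirect for a single legacy page
    Redirect 301 /old-cms/widgets.html http://www.example.com/products/widgets/
    # pattern redirect for a whole legacy section
    RewriteEngine On
    RewriteRule ^old-section/(.*)$ /new-section/$1 [R=301,L]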
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if robots.txt is your only option, it can help "stop the bleeding". Sometimes you use the best tool you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (a 404 is a perfectly valid signal for SEO), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may head off problems from Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
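For example, if everything disposable sits under one folder (the folder name here is just a placeholder), the block is a single rule:

    User-agent: *
    # block the legacy folder so a directory removal request will stick
    Disallow: /old-cms-folder/

Then submit /old-cms-folder/ through the Remove URLs tool in Webmaster Tools - after double-checking that nothing in that folder needs to stay indexed.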
-
Absolutely. Not Founds and no-content pages are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific... my exact question is: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or ranking?
Thanks!
-
It's a loaded question without knowing exactly what you are doing... but let me offer this advice: stop the bleeding with robots.txt. That's the easiest way to quickly resolve that many "Not Founds".
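To illustrate the flow (every path below is made up):

    # robots.txt - initial "stop the bleeding" state
    User-agent: *
    Disallow: /old-cms-a/
    Disallow: /old-cms-b/
    # when you find real content inside a blocked path, 301 it to the new URL
    # and delete that Disallow line so the engines can actually see the redirect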
Then you can slowly pick away at the issue and figure out whether some of the "Not Founds" really do have content and are just being sent to the wrong place.
On a recent project we had over 200,000 additional URLs showing as "not found". We stopped the bleeding, and then slowly, over the course of a month, spending a couple of hours a week, we found another 5,000 pages of real content that we redirected correctly and removed from robots.txt.
Good luck.