New CMS system - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website recently switched to a new CMS.
Over the last 10 years or so, we've used three different CMSes on our current domain. As expected, this has left behind a lot of URLs.
Until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and GWMT, I've been able to locate and redirect all pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to Google Webmaster Tools' 'Not Found' report, there are literally over 100,000 additional URLs out there it's trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently, we allow everything, only using page-level robots meta tags to disallow where necessary.
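For illustration, this is the sort of robots.txt rule I'm considering (the directory names here are placeholders, not our real structure):

```text
# Block crawling of retired CMS folders while leaving the rest open
User-agent: *
Disallow: /old-cms/
Disallow: /archive/
```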
Thanks!
-
Great stuff! Thanks again for your advice, much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (webmaster tools, for Google and Bing, is probably the best place to check) and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
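If it helps, here's a rough sketch of how you could triage an export of those 'Not Found' URLs, grouping them by top-level folder to see whether the 404s cluster in a few legacy directories (worth a folder-level fix) or are scattered (worth individual 301s). The URLs and folder names below are made up, not your actual data:

```python
from collections import Counter
from urllib.parse import urlparse

def top_level_dirs(urls):
    """Count 404 URLs by their first path segment."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        counts[segments[0] if segments else "/"] += 1
    return counts

# Hypothetical sample of a Webmaster Tools "Not Found" export
urls = [
    "http://example.com/old-cms/page1.html",
    "http://example.com/old-cms/page2.html",
    "http://example.com/article.php?id=7",
]

print(top_level_dirs(urls).most_common())
# -> [('old-cms', 2), ('article.php', 1)]
```

If one or two folders account for most of the 100K, that's where a block-and-remove approach pays off.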
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using Robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if Robots.txt is your only option, it can help "stop the bleeding". Sometimes, you use the best you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (it's a perfectly valid signal for SEO), but if you create 100,000 all at once, that can sometimes raise red flags with Google. Some kind of mass removal may prevent problems caused by Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than Robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
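As a sanity check before requesting the removal, you can parse your robots.txt locally and confirm the folder really is blocked. A quick sketch, with a placeholder folder name:

```python
import urllib.robotparser

# Hypothetical robots.txt rules -- substitute your real file's contents
rules = """User-agent: *
Disallow: /old-cms/
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/old-cms/page.html"))  # -> False
print(rp.can_fetch("*", "http://example.com/new-page.html"))      # -> True
```

If `can_fetch` still returns True for anything under the folder, fix the rule before submitting the removal request.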
-
Absolutely. Not Founds and no-content pages are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific. My exact question is: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or rankings?
Thanks!
-
That's a loaded question without knowing exactly what you are doing, but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many "Not Founds".
Then you can slowly pick away at the issue and figure out whether some of the "Not Founds" really have content and are just sending visitors to the wrong area.
On a recent project we had over 200,000 additional URLs "not found". We stopped the bleeding, and then slowly, over the course of a month, spending a couple of hours a week, found another 5,000 pages of content that we redirected correctly, removing the matching robots.txt blocks as we went.
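For what it's worth, those redirects looked roughly like this (the folder names are placeholders, not the actual project). Once a legacy URL turns out to have a real new home, 301 it and remove the matching Disallow from robots.txt so the engines can actually see the redirect:

```apache
# Hypothetical mod_rewrite sketch -- adjust the patterns to your own paths
RewriteEngine On
RewriteRule ^old-cms/products/(.*)$ /products/$1 [R=301,L]
```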
Good luck.