New CMS - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website recently switched to a new CMS.
Over the last 10 years or so, we've used 3 different CMS platforms on our current domain. As expected, this has resulted in lots of leftover URLs.
Up until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and GWMT, I've been able to locate and redirect all pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to Google Webmaster Tools' 'Not Found' report, there are literally over 100,000 additional URLs out there that it's trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, only using page-level robots tags to disallow where necessary.
Thanks!
-
Great stuff. Thanks again for your advice - much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (Webmaster Tools, for both Google and Bing, is probably the best place to check), and how that number compares to your overall index. In most cases, it's a temporary problem, and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
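For anyone curious, the redirect rules themselves are simple. A minimal sketch of the sort of thing I'm applying - assuming an Apache server, with hypothetical old and new paths standing in for the real mappings from the GWMT report:

    # .htaccess (Apache, mod_alias) - hypothetical paths for illustration
    # One permanent (301) redirect per legacy URL, mapped to its new home:
    Redirect 301 /old-cms/widgets.asp /products/widgets/
    Redirect 301 /old-cms/about-us.asp /about/
    # RedirectMatch takes a regex, handy when an old section maps cleanly:
    RedirectMatch 301 ^/old-cms/news/(.*)$ /blog/$1

The 301 status code is the important part, since that's what passes the old URLs' equity to their new counterparts.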
Any idea what kind of impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if robots.txt is your only option, it can help "stop the bleeding". Sometimes you use the best tool you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (it's a perfectly valid signal for SEO), but if you create 100,000 all at once, that can sometimes raise red flags with Google. Some kind of mass removal may prevent problems from Google crawling thousands of 'Not Founds' all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
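To make that concrete - assuming, purely hypothetically, that the dead pages all live under /old-cms/ - the robots.txt block you'd put in place before submitting the removal request is just:

    User-agent: *
    # Retired CMS directory; as noted above, the block needs to be in
    # place before the Webmaster Tools directory-removal request.
    Disallow: /old-cms/

Once that's live, the folder-removal request clears the whole directory from the index far faster than waiting for the 404s to be re-crawled.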
-
Absolutely. 'Not Founds' and no-content pages are a concern, and cleaning them up will help your rankings.
-
Thanks a lot! I should have been a little more specific. My exact question would be: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will fixing them help my indexation and/or rankings?
Thanks!
-
Loaded question without knowing exactly what you are doing, but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many 'Not Founds'.
Then you can slowly pick away at the issue and figure out whether some of the 'Not Founds' really have content and are just sending visitors to the wrong area.
On a recent project, we had over 200,000 additional URLs coming up 'Not Found'. We stopped the bleeding, and then slowly, over the course of a month, spending a couple of hours a week, we found another 5,000 pages of real content that we redirected correctly and then removed from the robots.txt block.
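If the leftover URLs aren't confined to a single folder, wildcard rules can still stop the bleeding. A sketch, with hypothetical patterns standing in for whatever the old CMS platforms actually left behind:

    User-agent: *
    # Legacy CMS #1 served .asp pages: * matches any run of characters,
    # and $ anchors the match to the end of the URL.
    Disallow: /*.asp$
    # Legacy CMS #2 built every page from a query-string ID.
    Disallow: /*?pageid=

That keeps the crawlers from piling up fresh 'Not Founds' while you work through the backlog, though it won't by itself remove anything already indexed.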
Good luck.
-
Related Questions
-
Keeping SEO benefit of an old URL by changing content
We have a blog post written in Oct 2012 that accounts for 30-40% of our traffic (174K pageviews per year, 80% bounce rate). We are considering updating the content, but are concerned that it will fall off the search engines' map if it is updated to include information that is not exactly the same, but relevant. The URL would be the same, and the original blog content would be shortened, with a link to the full blog. The new content would include other FDA products under investigation. Here is the blog: http://myadvocates.com/blog/fda-issues-warning-about-so-called-brain-supplement-prevagen
On-Page Optimization | jgodwin
-
I'm seeing a dot after the / on a new project - never seen this before. Any issues using this format?
Hi, I've got a new project and I'm seeing a dot after the forward slash, something I've never seen before. What does it mean? Are there any SEO issues regarding it? Is it bad practice, or fine to proceed using this format? Example below:
www.domain.co.uk/.cool-new-product
Thanks, Dan
On-Page Optimization | Dan-Lawrence
-
How to exclude URL filter searches in robots.txt
When I look through my Moz reports, I can see it has included 'pages' it shouldn't have, i.e. URLs generated by filtering rules, such as this one: http://www.mydomain.com/brands?color=364&manufacturer=505. How can I exclude all of these filters in robots.txt? I think it'll be: Disallow: /*?color=$ Is that the correct syntax with the $ sign in it? Thanks!
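Following up on my own question: if I'm reading Google's and Bing's wildcard documentation right, the $ anchors a pattern to the end of the URL, so Disallow: /*?color=$ would only match URLs ending in exactly ?color=. Dropping the anchor lets the rule match the parameter with whatever value follows - a sketch using my parameter names:

    User-agent: *
    # Block the filter parameters wherever they appear, with any value.
    # No trailing $ - that would anchor the match to the end of the URL.
    Disallow: /*?color=
    Disallow: /*?manufacturer=
    # Catch the same parameters when they follow an ampersand:
    Disallow: /*&color=
    Disallow: /*&manufacturer=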
On-Page Optimization | neenor
-
Keyword in URL: Ranking Factor?
I've got a site about a specific topic, which we'll call "themes" for the sake of this discussion. I personally like to keep the URL structure short and clean (for usability purposes, but mainly because I'm a perfectionist and a minimalist). I feel that adding "themes" to the URL structure is a bit redundant. However, nearly every keyword phrase that my site should rank for includes the word "themes". So I'm wondering how much I'm handicapping myself by not including the keyword "themes" in the URL? The domain name itself sort of includes the keyword... although it's in Italian (I chose the domain for its brand-ability, not for the keyword). A quick example: My URL structure: www.themo.com/topic/abc My competitor's URL structure: www.sitesample.com/themes/topic/abc For many of the keywords, the competitors with the keyword in the URL rank highest. But I'm not sure how much emphasis to place on this, because from my understanding Google doesn't pay as much attention to URL keywords anymore... and those sites might just be ranking high because they've been around for so long (which also happens to be the reason they coincidentally include the keyword in the URL - they started their sites when that was a major ranking factor). Thoughts? Should I just trash my perfectionism and add the keyword to the URL structure? (By the way, the site is only a couple of months old and doesn't have any significant backlinks to inner pages yet, so changing the URL structure wouldn't be a big deal if I decided to do that.)
On-Page Optimization | JABacchetta
-
Prevent Indexing of URLs Based on Tags
I started my website as a blog over at Posterous, but decided to turn it into a full-scale business website with a self-hosted WordPress theme. Shortly after transitioning from Posterous to WordPress, I noticed that Google was indexing not only my old blog posts, but also the URLs of my blog posts based on the tags they have. Is there any reason why this is a problem? I'm sure it shouldn't qualify as duplicate content, but for some reason it just feels a bit sloppy to me to have all of these pages indexed. Is this a non-issue? Should I just be more discriminating with my use of tags if it bothers me?
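If I do decide the tag archives should stay out of search, one option - a sketch assuming WordPress's default /tag/ permalink base - would be to block them from being crawled at all:

    User-agent: *
    # Keep crawlers out of WordPress tag archives (default /tag/ base).
    Disallow: /tag/

Though as the main thread above points out, robots.txt is better at prevention than cure: for tag pages already in the index, a meta robots noindex on the archives (a setting most WordPress SEO plugins expose) is probably the more reliable route.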
On-Page Optimization | williammarlow
-
Keyword in URL?
I have a website that has been live for about 8 years. I do not have any significant rankings for my main keywords, but am now starting SEO on my site. I am contemplating changing the URL to contain the main keyword prefixed by my brand name. Any views on the ranking benefits and/or CTR benefits?
Example:
Main volume keyword - 'car leasing'
Current URL - www.bobleasing.co.uk (made-up name); thinking of changing to www.bobcarleasing.co.uk (made-up name). Any advice would be much appreciated. John
On-Page Optimization | Johnnyh
-
How to deal with tracking numbers in URLs
I am working on a site at the minute that has links like this: Jobs in London. The URL looks like: domain.com/jobs-in-london/. However, my developers insist that they need to use tracking codes, so every time someone clicks on the above link, they are redirected (301) to a new URL that looks like: domain.com/search/1234567 (a unique search ID). This is killing me when I am trying to get internal pages, like /jobs-in-london/, ranked. What should I do?
On-Page Optimization | MirandaP
-
What to do with old content in light of the Panda update?
Let's say you operate a laptop review website. After several years, the individual product review URLs (like site.com/dell/xp1234-review/) aren't receiving much traffic, though they may have a few links here and there. In general, and considering the Panda update, would the best option be to 301 the old URLs back to the category page (site.com/dell/), or just keep them where they are? Are there any potential issues, like excessive 301s, which could slow down the site or appear fishy to search engines?
On-Page Optimization | BryanPhelps-BigLeapWeb