New CMS - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website recently switched to a new CMS.
Over the last 10 years or so, we've used three different CMS platforms on our current domain. As you'd expect, this has left us with a lot of legacy URLs.
Up until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and Google Webmaster Tools (GWMT), I've been able to locate and redirect all of the pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to GWMT's 'Not Found' report, there are literally over 100,000 additional URLs out there that it's trying to find.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, only using page-level robots tags to disallow where necessary.
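To make that concrete: the page-level tags we use look like <meta name="robots" content="noindex">, and the robots.txt rule I'm considering would be something along these lines (the directory name below is made up for the example):

# hypothetical legacy directory left over from one of the old CMS platforms
User-agent: *
Disallow: /old-cms-directory/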
Thanks!
-
Great stuff... thanks again for your advice, much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (Webmaster Tools, for both Google and Bing, is probably the best place to check), and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
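For what it's worth, the 301s are simple one-to-one rules in our .htaccess, along these lines (both paths here are invented for the example):

# hypothetical mapping from an old CMS URL to its new counterpart
Redirect 301 /old-cms/sample-page.html https://www.example.com/new-section/sample-page/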
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using Robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if Robots.txt is your only option, it can help "stop the bleeding". Sometimes you use the best tool you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (it's a perfectly valid signal for SEO), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may prevent problems caused by Google crawling thousands of 'not founds' all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than Robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
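For example, if the legacy URLs all sit under one directory (the folder name here is just a placeholder), the robots.txt block would be:

User-agent: *
Disallow: /legacy-folder/

With that rule in place, you can then request removal of /legacy-folder/ as a whole in Webmaster Tools rather than URL by URL.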
-
Absolutely. 'Not founds' and no-content pages are a concern, and cleaning them up will help your ranking.
-
Thanks a lot! I should have been a little more specific... my exact question would be: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or ranking?
Thanks!
-
That's a loaded question without knowing exactly what you are doing... but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many 'not founds'.
Then you can slowly pick away at the issue and figure out whether some of the 'not founds' really have content and are simply being routed to the wrong place....
On a recent project we had over 200,000 additional URLs coming up 'not found'. We stopped the bleeding, and then slowly over the course of a month, spending a couple of hours a week, we found another 5,000 pages of real content that we redirected correctly and removed from the robots.txt block....
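If it helps, a quick script can handle the first pass of that triage. This is just a sketch - the export file name and CSV layout are assumptions, so adjust them to whatever Webmaster Tools gives you - but something like it will flag which old URLs still return real content:

import csv
import requests

# Hypothetical export from Webmaster Tools: one URL per row.
INPUT_FILE = "not_found_urls.csv"

with open(INPUT_FILE, newline="") as f:
    for row in csv.reader(f):
        url = row[0].strip()
        if not url.startswith("http"):
            continue  # skip header or blank rows
        try:
            resp = requests.get(url, allow_redirects=False, timeout=10)
        except requests.RequestException as err:
            print("ERROR", url, err)
            continue
        # 200 = real content that may deserve a 301; 404 = safe to leave blocked.
        print(resp.status_code, url)

Anything that comes back 200 is a candidate for a proper 301; the rest can stay blocked.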
Good luck.