New CMS system - 100,000 old URLs - use robots.txt to block?
-
Hello.
My website has recently switched to a new CMS system.
Over the last 10 years or so, we've used three different CMS systems on our current domain. As expected, this has resulted in lots of legacy URLs.
Until this most recent iteration, we were unable to 301 redirect or use any page-level indexation techniques like rel="canonical".
Using SEOmoz's tools and Google Webmaster Tools, I've been able to locate and redirect all pertinent, PageRank-bearing "older" URLs to their new counterparts. However, according to the Google Webmaster Tools 'Not Found' report, there are literally over 100,000 additional URLs out there that Google is still trying to crawl.
My question is: is there an advantage to using robots.txt to stop search engines from looking for some of these older directories? Currently we allow everything, using only page-level robots meta tags to disallow where necessary.
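For reference, a directory-level block in robots.txt might look like the sketch below. The folder names here are purely hypothetical — substitute whatever paths the retired CMS platforms actually used:

```
User-agent: *
# Block directories left over from the retired CMS platforms
Disallow: /old-cms/
Disallow: /legacy-articles/
# Wildcard patterns (honored by Google and Bing) for old query-string URLs
Disallow: /*?sessionid=
```

One caveat worth knowing up front: robots.txt prevents crawling, not indexing, so URLs that are already in the index can linger there after being blocked.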
Thanks!
-
Great stuff... thanks again for your advice, much appreciated!
-
It can be really tough to gauge the impact - it depends on how suddenly the 404s popped up, how many you're seeing (webmaster tools, for Google and Bing, is probably the best place to check) and how that number compares to your overall index. In most cases, it's a temporary problem and the engines will sort it out and de-index the 404'ed pages.
I'd just make sure that all of these 404s are intentional and none are valuable pages or occurring because of issues with the new CMS itself. It's easy to overlook something when you're talking about 100K pages, and it could be more than just a big chunk of 404s.
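One quick way to sanity-check 100K Not Founds (exported from the Webmaster Tools crawl-error report, say) is to bucket them by top-level directory and eyeball the counts — patterns from a single old CMS or a misbehaving template tend to jump out. A minimal Python sketch, with hypothetical URLs:

```python
from urllib.parse import urlparse
from collections import Counter

def bucket_by_top_dir(urls):
    """Group URLs by their first path segment to expose old-CMS patterns."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        top = "/" + segments[0] + "/" if segments else "/"
        counts[top] += 1
    return counts

# Hypothetical sample of 'Not Found' URLs from a crawl-error export
not_found = [
    "http://example.com/old-cms/page1.html",
    "http://example.com/old-cms/page2.html",
    "http://example.com/archive/2004/story.asp",
]
print(bucket_by_top_dir(not_found))
```

If one bucket holds tens of thousands of URLs that should still resolve, that points at a CMS issue rather than intentional 404s.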
-
Thanks for the advice! The previous website did have a robots.txt file with a few wildcards declared. A lot of the URLs I'm seeing are NOT indexed anymore and haven't been for many years.
So, I think the 'stop the bleeding' method will work, and I'll just have to proceed with investigating and applying 301s as necessary.
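If the site happens to run on Apache (an assumption — the stack isn't stated in the thread), those 301s can be applied in bulk with mod_alias rules in .htaccess rather than one at a time. The patterns below are hypothetical examples only:

```
# Map an old CMS URL pattern to its new counterpart (paths are hypothetical)
RedirectMatch 301 ^/old-cms/articles/(.*)\.asp$ /articles/$1

# One-off redirect for a single legacy page
Redirect 301 /index-old.html /
```

Pattern-based rules like the first one are what make 100K-scale cleanups tractable: one regex can cover an entire retired directory.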
Any idea what kind of an impact this is having on our rankings? I submitted a valid sitemap, crawl paths are good, and major 301s are in place. We've been hit particularly hard in Bing.
Thanks!
-
I've honestly had mixed luck with using Robots.txt to block pages that have already been indexed. It tends to be unreliable at a large scale (good for prevention, poor for cures). I endorsed @Optimize, though, because if Robots.txt is your only option, it can help "stop the bleeding". Sometimes, you use the best you have.
It's a bit trickier with 404s ("Not Found"). Technically, there's nothing wrong with having 404s (a 404 is a perfectly valid signal for SEO), but if you create 100,000 of them all at once, that can sometimes raise red flags with Google. Some kind of mass removal may head off problems from Google crawling thousands of Not Founds all at once.
If these pages are isolated in a folder, then you can use Google Webmaster Tools to remove the entire folder (after you block it). This is MUCH faster than Robots.txt alone, but you need to make sure everything in the folder can be dumped out of the index.
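A sketch of that pairing, with a hypothetical folder name — block the directory first, then request its removal through Webmaster Tools:

```
User-agent: *
# 1. Block the retired directory so engines stop crawling it
Disallow: /legacy-cms/
# 2. Then, in Google Webmaster Tools, submit a removal request for the
#    /legacy-cms/ directory as a whole (not URL by URL)
```

The order matters: the removal tool expects the content to already be blocked (or gone), so the Disallow has to be in place before the folder-level request is submitted.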
-
Absolutely. Not Founds and pages with no content are a concern, and cleaning them up will help your ranking.
-
Thanks a lot! I should have been a little more specific. My exact question is: if I move the crawlers' attention away from these 'Not Found' pages, will that benefit the indexation of the now-valid pages? Are the 'Not Founds' really a concern? Will this help my indexation and/or rankings?
Thanks!
-
It's a loaded question without knowing exactly what you're doing, but let me offer this advice: stop the bleeding with robots.txt. This is the easiest way to quickly resolve that many "Not Founds".
Then you can slowly pick away at the issue and figure out whether some of the "Not Founds" really have content and the CMS is just sending them to the wrong area.
On a recent project we had over 200,000 additional URLs coming up "not found". We stopped the bleeding, and then slowly, over the course of a month and a couple of hours a week, found another 5,000 pages of real content that we redirected correctly and removed from the robots.txt blocks.
Good luck.