2.3 million 404s in GWT - learn to live with 'em?
-
So I’m working on optimizing a directory site. Total size: 12.5 million pages in the XML sitemap. This is orders of magnitude larger than any site I’ve ever worked on – heck, every other site I’ve ever worked on combined would be a rounding error compared to this.
Before I was hired, the company brought in an outside consultant to iron out some of the technical issues on the site. To his credit, he was worth the money: indexation and organic Google traffic have steadily increased over the last six months. However, some issues remain. The company has access to a quality (i.e. paid) source of data for directory listing pages, but the last time the data was refreshed some months back, it threw 1.8 million 404s in GWT. That number has kept climbing since; we now have 2.3 million 404s in GWT.
Based on what I’ve been able to determine, links on this site break after a data refresh for one of two reasons: the page just doesn’t exist anymore (i.e. it wasn’t found in the data refresh, so the page was simply deleted), or the URL had to change due to some technical issue (the page still exists, just under a different link). With other sites I’ve worked on, 404s aren’t that big a deal: set up a 301 redirect in htaccess and problem solved. In this instance, setting up that many 301 redirects, even if it could somehow be automated, just isn’t an option due to the potential bloat in the htaccess file.
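For the second case (page still exists under a new URL), a lookup map is one way to avoid writing a rule per URL. The following is only a sketch under stated assumptions: the /listing/ prefix, the map name movedlistings, and the map file path are all hypothetical, and it requires access to the server or virtual-host config, because RewriteMap cannot be declared in htaccess.

```apache
# Server or virtual-host config only: RewriteMap cannot be declared in .htaccess.
RewriteEngine On

# Hypothetical map file; each line holds "old-slug new-path", for example:
#   acme-plumbing /listing/acme-plumbing-supply/
RewriteMap movedlistings "txt:/etc/apache2/maps/moved-listings.txt"

# Redirect only when the requested slug has an entry in the map.
# URLs with no entry fall through and are handled as before (e.g. a 404).
RewriteCond ${movedlistings:$1} !=""
RewriteRule ^/listing/(.+)$ ${movedlistings:$1} [R=301,L]
```

The map file could be regenerated by whatever process runs the data refresh, so the rules themselves never grow. For a very large map, converting the text file to a dbm map with Apache's httxt2dbm tool may be worth testing.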
Based on what I’ve read here and here, 404s in and of themselves don’t really hurt the site indexation or ranking. And the more I consider it, the really big sites – the Amazons and eBays of the world – have to contend with broken links all the time due to product pages coming and going. Bottom line, it looks like if we really want to refresh the data on the site on a regular basis – and I believe that is priority one if we want the bot to come back more frequently – we’ll just have to put up with broken links on the site on a more regular basis.
So here’s where my thought process is leading:
- Go ahead and refresh the data. Make sure the XML sitemaps are refreshed as well – hopefully this will help the site stay current in the index.
- Keep an eye on broken links in GWT. Implement 301s for really important pages (i.e. content-rich stuff that is really mission-critical). Otherwise, just learn to live with a certain number of 404s being reported in GWT on more or less an ongoing basis.
- Watch the overall trend of 404s in GWT. At least make sure they don’t increase. Hopefully, if we can make sure that the sitemap is updated when we refresh the data, the 404s reported will decrease over time.
We do have an issue with the site creating some weird pages with content that lives within tabs on specific pages. Once we can clamp down on those and a few other technical issues, I think keeping the data refreshed should help with our indexation and crawl rates.
Thoughts? If you think I’m off base, please set me straight.
-
I was actually thinking about some type of wildcard rule in htaccess. This might actually do the trick! Thanks for the response!
-
Hi,
Sounds like you’ve taken on a massive job with 12.5 million pages, but I think you can implement a simple fix to get things started.
You’re right to think about that sitemap. Make sure it’s being dynamically updated as the data refreshes; otherwise it will be responsible for a lot of your 404s.
I understand you don’t want to add 2.3 million separate redirects to your htaccess, so what about a simple rule - if the request starts with ^/listing/ (one of your directory pages), is not a file and is not a dir, then redirect back to the homepage. Something like this:
# does the request start with /listing/ or whatever structure you are using
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
# is it NOT a file and NOT a dir
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# all true? Redirect
RewriteRule .* / [L,R=301]
This way you can specify a certain URL structure for the pages which tend to turn into 404s. Any 404s outside of that first rule will still serve a 404 code and show your 404 page, and you can fix those manually, but the pages which tend to disappear will all be redirected back to the homepage if they’re not found.
You could still implement your 301s for important pages or simply recreate the page if it’s worth doing so, but you will have dealt with a large chunk of your non-existing pages.
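As a rough sketch of how the two could sit together in one htaccess file at the site root: explicit 301s for important pages come first, and the catch-all only sees what they did not match. The slugs below are made-up placeholders, not real pages on your site.

```apache
RewriteEngine On

# 1. Explicit 301s for important pages (hypothetical slugs), checked first.
RewriteRule ^listing/acme-plumbing/?$ /listing/acme-plumbing-supply/ [R=301,L]
RewriteRule ^listing/old-category/?$ /listing/new-category/ [R=301,L]

# 2. Catch-all for any other /listing/ URL that is not a file or directory.
RewriteCond %{REQUEST_URI} ^/listing/ [NC]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule .* / [R=301,L]
```

Keep in mind that -f and -d test the filesystem. If your listing pages are generated by a script rather than stored as files, the catch-all would match live listings too, so the existence check would need to happen in the application instead.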
I think it’s a big job and those missing pages are only part of it, but it should help you to sift through all of the data to get to the important bits: you can mark a lot of URLs as fixed and start giving your attention to the important pages which need some work.
Hope that helps,
Tom