Old URLs that 301 to 404s are not being de-indexed.
-
We have a scenario on a domain that recently moved to enforcing SSL. If a page is requested over non-SSL (http), the server automatically redirects to the SSL (https) URL with a good old-fashioned 301. This works fine except for pages that no longer exist, where the result is a 301 chained to a 404.
Here's what I mean.
Case 1 - Good page:
http://domain.com/goodpage -> 301 -> https://domain.com/goodpage -> 200
Case 2 - Bad page that no longer exists:
http://domain.com/badpage -> 301 -> https://domain.com/badpage -> 404
Google is correctly re-indexing all the "good" pages and just displaying search results going directly to the https version.
Google is stubbornly hanging on to all the "bad" pages and serving up the original URL (http://domain.com/badpage) unless we submit a removal request. But there are hundreds of these pages, and this is starting to suck. Note: the load balancer does the SSL enforcement, not the CMS, so we can't detect the 404 and serve it before the redirect; the CMS is what returns the 404.
Any ideas on the best way to approach this problem? Or any idea why Google is holding on to all the old "bad" pages that no longer exist, given that we've clearly indicated with 301s that no one is home at the old address?
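For reference, here's a minimal Python sketch (assumes the `requests` library; the URL list is illustrative) that follows each legacy http URL and prints its redirect chain and final status, which is handy for auditing hundreds of these at once:

```python
import requests

# Illustrative list of legacy http:// URLs to audit.
legacy_urls = [
    "http://domain.com/goodpage",
    "http://domain.com/badpage",
]

for url in legacy_urls:
    # allow_redirects=True follows the 301 to the https version.
    resp = requests.get(url, allow_redirects=True, timeout=10)
    for hop in resp.history:
        print(f"{hop.url} [{hop.status_code}] -> ", end="")
    print(f"{resp.url} [{resp.status_code}]")
```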
-
I don't think 404 vs. 410 is the answer here. The basis for that thought is the following:
========
"if we see a page and we get a 404, we are gonna protect that page for 24 hours in the crawling system, so we sort of wait and we say maybe that was a transient 404, maybe it really wasn’t intended to be a page not found.”
“If we see a 410, then the site crawling system says, OK we assume the webmasters knows what they’re doing because they went off the beaten path to deliberately say this page is gone,” he said. “So they immediately convert that 410 to an error, rather than protecting it for 24 hours."
========
I'm thinking the deeper issue is why the 301s are not being respected. If a link points to http://domain.com/badpage and we use a 301 to point to https://domain.com/badpage, shouldn't the crawler (Google or otherwise) respect the 301? Why still index and serve up a URL that responds with a 301? To me, this is baffling. Whether we serve up a 404 or a 410, either way we are saying "this page is gone," yet we're still seeing the original http://domain.com/badpage in the index.
Does that make sense? Or is there more clarification required?
-
sym_admin is right--you'll want to find the source of those pages, as Google is apparently still discovering and requesting them from somewhere. If there are links to those pages anywhere, you will need to remove them. Also, if you're able, I would change those URLs so that they serve up a "410 Gone" error rather than a 404 (a rough sketch of one way to do that is below).
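If your CMS layer is scriptable, here's a minimal sketch of that 410 approach, shown in Python with Flask purely for illustration (the framework choice and the path list are assumptions, not the poster's actual stack):

```python
from flask import Flask, abort

app = Flask(__name__)

# Hypothetical set of paths that have been permanently removed.
GONE_PATHS = {"/badpage", "/old-press-release"}

@app.route("/<path:page>")
def serve(page):
    if f"/{page}" in GONE_PATHS:
        # 410 tells crawlers the page is deliberately and permanently
        # gone, rather than a possibly transient 404.
        abort(410)
    return f"Content for /{page}"  # stand-in for real CMS rendering

if __name__ == "__main__":
    app.run()
```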
-
Read these three, then do what you've got to do...
https://www.searchcommander.com/how-to-bulk-remove-urls-google/
https://productforums.google.com/forum/#!topic/webmasters/uYFJnsyiH8w
https://moz.com/community/q/404-redirects-to-the-homepage-is-this-good-bad-ugly
For proper removal, please ensure that there are no INTERNAL links anywhere on your website pointing to 404 addresses -- from the sitemap, buttons, text, or images (the whole nine yards). A rough link-checking sketch is below.
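Here's a rough Python sketch for that check (assumes the `requests` and `beautifulsoup4` libraries; the start URL is hypothetical). It gathers the internal links on one page and flags any that resolve to a 404 or 410:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

START = "https://domain.com/"  # hypothetical page to scan

resp = requests.get(START, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")
site = urlparse(START).netloc

for a in soup.find_all("a", href=True):
    link = urljoin(START, a["href"])
    if urlparse(link).netloc != site:
        continue  # only check internal links
    status = requests.head(link, allow_redirects=True, timeout=10).status_code
    if status in (404, 410):
        print(f"Dead internal link: {link} [{status}]")
```

Extending this to walk the whole site (and the XML sitemap) is straightforward from here.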
Good luck!