How can I tell Google that a page has not changed?

bimp

Hello,

We have a website with many thousands of pages. Some of them change frequently, some never. Our problem is that Googlebot is generating far too much traffic: half of our page views come from Googlebot.

We would like to tell Googlebot to stop crawling pages that never change. This one, for instance:

http://www.prinz.de/party/partybilder/bilder-party-pics,412598,9545978-1,VnPartypics.html

As you can see, there is almost no content on the page, and the picture will never change. So I am wondering whether it makes sense to tell Google that there is no need to come back.

The following header fields might be relevant. Currently our webserver answers with the following headers:

Cache-Control: no-cache, must-revalidate, post-check=0, pre-check=0, public
Pragma: no-cache
Expires: Thu, 19 Nov 1981 08:52:00 GMT

Does Google honor these fields? Should we remove no-cache, must-revalidate, and Pragma: no-cache, and set Expires to, e.g., 30 days in the future?
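For illustration, the header change asked about here could be sketched as follows. This is a minimal sketch, not the site's actual code: the function name `static_page_headers` and the 30-day default are assumptions, and whether Googlebot adjusts its crawl frequency based on these headers is exactly the open question of this thread.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def static_page_headers(max_age_days=30, now=None):
    """Build cache headers for a page that is not expected to change.

    Hypothetical helper: replaces the no-cache / must-revalidate headers
    with a far-future Expires and a matching Cache-Control max-age.
    """
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=max_age_days)
    return {
        "Cache-Control": f"public, max-age={max_age_days * 24 * 3600}",
        # HTTP dates must be in GMT, e.g. "Tue, 31 Jan 2012 00:00:00 GMT"
        "Expires": format_datetime(expires, usegmt=True),
    }


if __name__ == "__main__":
    for name, value in static_page_headers().items():
        print(f"{name}: {value}")
```

The Pragma: no-cache header would simply be dropped for these pages; it is not set here at all.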

I also read that a web page that has not changed should answer with 304 instead of 200. Does it make sense to implement that? Unfortunately, that would be quite hard for us.

Maybe Google would then also spend more time on pages that have actually changed, instead of wasting it on unchanged pages.

Do you have any other suggestions for how we can reduce Googlebot traffic on irrelevant pages?

Thanks for your help

Cord

Dr-Pete

Unfortunately, I don't think there are many reliable options, in the sense that Google will always honor them. I don't think they gauge crawl frequency by the "expires" field - or, at least, it carries very little weight. As John and Rob mentioned, you can set the "changefreq" in the XML sitemap, but again, that's just a hint to Google. They seem to frequently ignore it.

If it's really critical, a 304 probably is a stronger signal, but I suspect even that's hit or miss. I've never seen a site implement it on a large scale (100s or 1000s of pages), so I can't speak to that.
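For anyone wondering what the 304 approach involves, here is a rough sketch of the logic, assuming you can look up a last-modified timestamp for each page. The function name `conditional_response` and the plain dict of request headers are assumptions for illustration, not any particular framework's API. The server sends Last-Modified with every response; when a crawler comes back with If-Modified-Since and the page is not newer, the server answers 304 with no body.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(request_headers, last_modified):
    """Return (status, headers) for a GET, honoring If-Modified-Since.

    last_modified must be a timezone-aware UTC datetime.
    """
    # HTTP dates have one-second resolution, so drop microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}

    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            since_dt = parsedate_to_datetime(since)
        except (TypeError, ValueError):
            since_dt = None  # unparseable header: send the full page
        # Only compare when the parsed date carries a timezone.
        if since_dt is not None and since_dt.tzinfo is not None:
            if last_modified <= since_dt:
                return 304, headers  # Not Modified: send no body
    return 200, headers  # send the full page as usual
```

The hard part on a large site is usually not this comparison but knowing a trustworthy last-modified time for every page.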

Three broader questions/comments:

(1) If you currently list all of these pages in your XML sitemap, consider taking them out. The XML sitemap doesn't have to contain every page on your site, and in many cases, I think it shouldn't. If you list these pages, you're basically telling Google to re-crawl them (regardless of the changefreq setting).

(2) You may have overly complex crawl paths. In other words, it may not be the quantity of pages that's at issue, but how Google accesses those pages. They could be getting stuck in a loop, etc. It's going to take some research on a large site, but it'd be worth running a desktop crawler like Xenu or Screaming Frog. This could represent a site architecture problem (from an SEO standpoint).

(3) Should all of these pages even be indexed at all, especially as time passes? More and more (especially post-Panda), more indexed pages is often worse. If Googlebot is really hitting you that hard, it might be time to canonicalize some older content or 301-redirect it to newer, more relevant content. If it's not active at all, you could even NOINDEX or 404 it.

bimp

Thanks for the answers so far. The tips don't really solve my problem yet, though: I don't want to lower the general crawl rate in Webmaster Tools, because pages that change frequently should also be crawled frequently. We do have XML sitemaps, although we did not include picture pages like the one in our example. There are tens, maybe hundreds, of thousands of these pages. If everyone agrees on this, we can of course include these pages in our XML sitemaps. Using "meta refresh" to indicate that the page never changes seems a bit odd to me, but I'll look into it.

But what about the HTTP headers I asked about? Does anyone have any ideas on that?

RobMay

Your best bet is to build an Excel report using a crawl tool (like Xenu, Screaming Frog, Moz, etc.) and export that data. Then map out the pages you want to log and mark as 'not changing'.

Make sure to build (or already have) a functioning XML sitemap file for the site, and as John said, state which URLs NEVER change. Over time, this will tell Googlebot that it isn't necessary to crawl those URLs, as they never change.

You could also place a META REFRESH tag on those individual pages, and set that to never as well.

Hope some of this helps! Cheers

john4math

If you have Google Webmaster Tools set up, go to Site configuration > Settings, and you can set a custom crawl rate for your site. That will change it site-wide, so if you have other pages that change frequently, that might not be so great for you.

Another thing you could try is to generate a sitemap and set a change frequency of never (or yearly) for all of the pages you don't expect to change. That might also slow down Google's crawl rate for those pages.
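As a rough sketch of the sitemap idea, something like the following would emit a changefreq value per URL. The function name `build_sitemap` is an assumption, and the homepage entry in the usage example is hypothetical; the picture-page URL is the one from the question. Keep in mind that changefreq is only a hint to Google.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, changefreq) pairs and return it as a string.

    The sitemap protocol allows these changefreq values:
    always, hourly, daily, weekly, monthly, yearly, never.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, changefreq in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    pages = [
        # Hypothetical frequently changing page.
        ("http://www.prinz.de/", "daily"),
        # Picture page from the question that never changes.
        ("http://www.prinz.de/party/partybilder/bilder-party-pics,"
         "412598,9545978-1,VnPartypics.html", "never"),
    ]
    print(build_sitemap(pages))
```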
