Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Would you rate-control Googlebot? How much crawling is too much crawling?
-
One of our sites is very large - over 500M pages. Google has indexed 1/8th of the site - and they tend to crawl between 800k and 1M pages per day.
A few times a year, Google will significantly increase their crawl rate - overnight hitting 2M pages per day or more. This creates big problems for us, because at 1M pages per day Google is consuming 70% of our API capacity, and the API overall is at 90% capacity. At 2M pages per day, 20% of our page requests are 500 errors.
I've lobbied for an investment / overhaul of the API configuration to allow for more Google bandwidth without compromising user experience. My tech team counters that it's a wasted investment - as Google will crawl to our capacity whatever that capacity is.
Questions to Enterprise SEOs:
*Is there any validity to the tech team's claim? I thought Google's crawl rate was based on a combination of PageRank and the frequency of page updates. This indicates there is some upper limit - which we perhaps haven't reached - but which would stabilize once reached.
*We've asked Google to rate-limit our crawl rate in the past. Is that harmful? I've always looked at a robust crawl rate as a good problem to have.
- Is 1.5M Googlebot API calls a day desirable, or something any reasonable Enterprise SEO would seek to throttle back?
*What about setting a longer refresh rate in the sitemaps? Would that reduce the daily crawl demand? We could set increase it to a month, but at 500M pages Google could still have a ball at the 2M pages/day rate.
Thanks
-
I agree with Matt that there can probably be a reduction of pages, but that aside, how much of an issue this is comes down to what pages aren't being indexed. It's hard to advise without the site, are you able to share the domain? If the site has been around for a long time, that seems a low level of indexation. Is this a site where the age of the content matters? For example Craigslist?
Craig
-
Thanks for your response. I get where you're going with that. (Ecomm store gone bad.) It's not actually an Ecomm FWIW. And I do restrict parameters - the list is about a page and a half long. It's a legitimately large site.
You're correct - I don't want Google to crawl the full 500M. But I do want them to crawl 100M. At the current crawl rate we limit them to, it's going to take Google more than 3 months to get to each page a single time. I'd actually like to let them crawl 3M pages a day. Is that an insane amount of Googlebot bandwidth? Does anyone else have a similar situation?
-
Gosh, that's a HUGE site. Are you having Google crawl parameter pages with that? If so, that's a bigger issue.
I can't imagine the crawl issues with 500M pages. A site:amazon.com search only returns 200M. Ebay.com returns 800M so your site is somewhere in between these two? (I understand both probably have a lot more - but not returning as indexed.)
You always WANT a full site crawl - but your techs do have a point. Unless there's an absolutely necessary reason to have 500M indexed pages, I'd also seek to cut that to what you want indexed. That sounds like a nightmare ecommerce store gone bad.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
My last site crawl shows over 700 404 errors all with void(0 added to the ends of my posts/pages.
Hello, My last site crawl shows over 700 404 errors all with void(0 added to the ends of my posts/pages. I have contacted my theme company but not sure what could have done this. Any ideas? The original posts/pages are still correct and working it just looks like it did duplicates and added void(0 to the end of each post/page. Questions: There is no way to undo this correct? Do I have to do a redirect on each of these? Will this hurt my rankings and domain authority? Any suggestions would be appreciated. Thanks, Wade
Intermediate & Advanced SEO | | neverenoughmusic.com0 -
Move domain to new domain, for how much time should I keep forwarding?
I'm not sure but my website looks like is not getting it's juice as supposed to be. As we already know, google preferred https sites and this is what happened to mine, it was been crawling as https but when the time came to move my domain to new domain, I used 301 or domain forwarding service, unfortunately they didn't have a way to forward from https to new https, they only had regular http to https, when users clicked to my old domain from google search my site was returned to "site does not exist", I used hreflang at least that google would detect my new domain been forwarding and yes it worked but now I'm wondering, for how much time should I keep the forwarding the old domain to the new one, my site looks like is not going up, I have changed all the external links, any help would be appreciated. Thanks!
Intermediate & Advanced SEO | | Fulanito1 -
[Very Urgent] More 100 "/search/adult-site-keywords" Crawl errors under Search Console
I just opened my G Search Console and was shocked to see more than 150 Not Found errors under Crawl errors. Mine is a Wordpress site (it's consistently updated too): Here's how they show up: Example 1: URL: www.example.com/search/adult-site-keyword/page2.html/feed/rss2 Linked From: http://an-adult-image-hosting.com/search/adult-site-keyword/page2.html Example 2 (this surprised me the most when I looked at the linked from data): URL: www.example.com/search/adult-site-keyword-2.html/page/3/ Linked From: www.example.com/search/adult-site-keyword-2.html/page/2/ (this is showing as if it's from our own site) http://a-spammy-adult-site.com/search/adult-site-keyword-2.html Example 3: URL: www.example.com/search/adult-site-keyword-3.html Linked From: http://an-adult-image-hosting.com/search/adult-site-keyword-3.html How do I address this issue?
Intermediate & Advanced SEO | | rmehta10 -
Is it normal for Bing rankings to fluctuate so much on a daily basis?
Hi all, I launched a new website in Aug 2015, and have had some success with ranking organically on Google (position 2 - 5 for all of my target terms). However I'm still not getting any traction on Bing. I know that they use completely different algorithms so it's not unusual to rank well on one but not the other, but the ranking behaviour that I see seems quite odd. We've been bouncing in and out of the top 50 for quite some time, with shifts of 30+ positions often on a daily basis (see attached). This seems to be the case for our full range of target terms, and not just the most competitive ones. I'm hoping someone can advise on whether this is normal behaviour for a relatively young website, or if it more likely points to an issue with how Bing is crawling my site. I'm using Bing Webmaster tools and there aren't any crawl or sitemap issues, or significant seo flags. Thanks dhYgh
Intermediate & Advanced SEO | | Tinhat0 -
Prevent Google from crawling Ajax
With Google figuring out how to make Ajax and JS more searchable/indexable, I am curious on thoughts or techniques to prevent this. Here's my Situation, we have a page that we do not ever want to be indexed/crawled or other. Currently we have the nofollow/noindex command, but due to technical changes for our site the method in which this information is being implemented if it is ever displayed it will not have the ability to block the content from search. It is also the decision of the business to not list the file in robots.txt due to the sensitivity of the content. Basically, this content doesn't exist unless something super important happens, and even if something super important happens, we do not want Google to know of its existence. Since the Dev team is planning on using Ajax/JS to pull in this content if the business turns it on, the concern is that it will be on the homepage and Google could index it. So the questions that I was asked; if Google can/does index, how long would that piece of content potentially appear in the SERPs? Can we block Google from caring about and indexing this section of content on the homepage? Sorry for the vagueness of this question, it's very sensitive in nature and I am trying to avoid too many specifics. I am able to discuss this in a more private way if necessary. Thanks!
Intermediate & Advanced SEO | | Shawn_Huber0 -
Will a disclaimer affect Crawling?
Hello everyone! My German users will have to get a disclaimer according to German laws, now my question is the following: Will a disclaimer affect crawling? What's the best practice to have regarding this? Should I have special care in this? What's the best disclaimer technique? A Plain HTML page? Something overlapping the site? Thank you all!
Intermediate & Advanced SEO | | NelsonF0 -
Google Analytics: how to filter out pages with low bounce rate?
Hello here, I am trying to find out how I can filter out pages in Google Analytics according to their bounce rate. The way I am doing now is the following: 1. I am working inside the Content > Site Content > Landing Pages report 2. Once there, I click the "advanced" link on the right of the filter field. 3. Once there, I define to "include" "Bounce Rate" "Greater than" "0.50" which should show me which pages have a bounce rate higher of 0.50%.... instead I get the following warning on the graph: "Search constraints on metrics can not be applied to this graph" I am afraid I am using the wrong approach... any ideas are very welcome! Thank you in advance.
Intermediate & Advanced SEO | | fablau0 -
How to stop Google crawling after 301 redirect?
I have removed all pages from my old website and set 301 redirect to new website. But, I have verified old website with Google webmaster tools' HTML verification file which enable me to track all data and existence of pages in Google search for my old website. I was assumed that, Google will stop crawling and DE-indexed all pages after 301 redirect. Because, I have set 301 redirect before 3 months. Now, I'm able to see Google bot activity on my website with help of Google webmaster tools. You can find out attachment to know more about it. How can it possible & How Google can crawl removed pages? You can see following image to know more about it. First & Second
Intermediate & Advanced SEO | | CommercePundit0