Some bots excluded from crawling client's domain
-
Hi all!
My client is in healthcare in the US and for HIPAA reasons, blocks traffic from most international sources.
a. I don't think this is good for SEO
b. The site won't allow Moz bot or Screaming Frog bot to crawl it. It's so frustrating.
We can't figure out what mechanism they're using to do this. Any help as we start down the rabbit hole to remedy it is much appreciated.
thank you!
-
The main reason it's not good is that Google crawls from different data centers around the world. One day they may see the site as up; the next, they may think it's gone entirely.
Typically you use a user-agent lance to pierce these kinds of setups. In Screaming Frog, for example, you can pre-select from a variety of user-agents (including 'Googlebot' and Chrome), but you can also write your own user-agent
Write a long one that looks like an encryption key. Tell your client the user-agent you have defined and let them create an exemption for it within their spam-defense system. Insert that user-agent (which no one else has or uses) into Screaming Frog and use it to let the crawler pierce the defense grid
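A minimal sketch of generating that kind of secret user-agent token (the tool name and format here are illustrative, not anything the firewall expects):

```python
import secrets

def make_crawler_user_agent(tool: str = "ScreamingFrog-audit") -> str:
    # A long random token no one else can guess. Share it only with the
    # client's firewall team so they can whitelist this exact string.
    token = secrets.token_hex(32)  # 64 hex characters
    return f"{tool}/{token}"

ua = make_crawler_user_agent()
print(ua)
```

Paste the resulting string into Screaming Frog under Configuration > User-Agent > Custom, and have the client exempt it verbatim.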
Typically you would want to exempt 'Googlebot' (as a user-agent) from these defense systems, but that comes with a risk. Anyone with basic scripting knowledge, or who knows how to install Chrome extensions, can alter the user-agent of their script (or web browser; it's entirely under the user's control) with ease. It is widely known that many sites make an exception for 'Googlebot', so it becomes a common vulnerability. For example, lots of publishers create URLs which Google can access and index, yet if you are a bog-standard user they ask you to turn off ad-blockers or pay a fee
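To see how trivial the spoof is: claiming to be Googlebot is a one-line header change in any HTTP client (the URL here is a placeholder):

```python
import urllib.request

# Any script can claim to be Googlebot; the header is entirely
# client-controlled, which is why a user-agent allowlist alone is weak.
req = urllib.request.Request(
    "https://www.example.com/",
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
)
print(req.get_header("User-agent"))
```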
Download a user-agent switcher extension for Chrome, set your user-agent to "Googlebot", and sail right through. Not ideal from a defense perspective
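If the client does want to exempt real Googlebot safely, Google's documented method is forward-confirmed reverse DNS rather than trusting the header. A sketch of that check (network-dependent, so treat it as illustrative):

```python
import socket

def verify_googlebot(client_ip: str) -> bool:
    """Verify a claimed Googlebot via forward-confirmed reverse DNS:
    reverse-resolve the IP, check the hostname is a Google crawl domain,
    then forward-resolve the hostname and confirm it maps back to the IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
    except OSError:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        resolved = socket.gethostbyname(hostname)
    except OSError:
        return False
    return resolved == client_ip
```

A spoofed request from a random IP fails the reverse-DNS step, so the user-agent string itself never has to be trusted.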
For this reason I have often wished (and I am really hoping someone from Google might be reading) that in Search Console you could register a custom user-agent string with Google. You could then exempt it, safe in the knowledge that no one else knows it, and Google would use your custom string to identify itself when accessing your site and content. Then everyone could be safe, indexable and happy
We're not there yet