Robots.txt gone wild
-
Hi guys, a site we manage, http://hhhhappy.com received an alert through web master tools yesterday that it can't be crawled. No changes were made to the site.
Don't know a huge amount about the robots.txt configuration expect that using Yoast by default it sets it not to crawl wp admin folder and nothing else. I checked this against all other sites and the settings are the same. And yet 12 hours later after the issue Happy is still not being crawled and meta data is not showing in search results. Any ideas what may have triggered this?
-
Hi Radi!
Have Matt and/or Martijn answered your question? If so, please mark one or both of their responses "Good Answer."
Otherwise, what's still tripping you up?
-
Have you checked the downtime of the site recently? Sometimes it could be that Google isn't able to reach your robots.txt file and because of that they'll stop crawling your site temporarily.
-
Are you getting the message in Search Console that there were errors crawling your page?
This typically means that your host was temporarily down when Google landed on your page. These types of things happen all the time and are no big deal.
Your homepage cache shows a crawl date of today so I'm assuming things are working properly ... if you really want to find out, try doing a "Fetch" of your site in Search Console.
Crawl > Fetch as Google > Fetch (big red button)
You should get a status of "Complete." If you get anything else there should be an error message with it. If so, paste that here.
I have checked the site headers, cache, crawlability with Screaming Frog, and everything is fine. This seems like one of those temporary messages but if the problem persists definitely let us know!
-
Our host has just offered this response which does not get me any closer:
Hi Radi,
It looks like your site has its own robots.txt file, which is not blocking any user agents. The only thing it's doing is blocking bots from indexing your admin area:
<code>User-agent: * Disallow: /wp-admin/</code>
This is a standard robots.txt file, and you shouldn't be having any issues with Google indexing your site from a hosting standpoint. To test this, I curled the site as Googlebot and received a 200OK response:
<code>curl -A "Googlebot/2.1" -IL [hhhhappy.com](http://hhhhappy.com) HTTP/1.1 200 OK Date: Sat, 05 Mar 2016 22:17:26 GMT Content-Type: text/html; charset=UTF-8 Connection: keep-alive Set-Cookie: __cfduid=d3177a1baa04623fb2573870f1d4b4bac1457216246; expires=Sun, 05-Mar-17 22:17:26 GMT; path=/; domain=.[hhhhappy.com](http://hhhhappy.com); HttpOnly X-Cacheable: bot Cache-Control: max-age=10800, must-revalidate X-Cache: HIT: 17 X-Cache-Group: bot X-Pingback: [http://hhhhappy.com/xmlrpc.php](http://hhhhappy.com/xmlrpc.php) Link: <[http://hhhhappy.com/](http://hhhhappy.com/)>; rel=shortlink Expires: Thu, 19 Nov 1981 08:52:00 GMT X-Type: default X-Pass-Why: Set-Cookie: X-Mapping-fjhppofk=2C42B261F74DA203D392B5EC5BF07833; path=/ Server: cloudflare-nginx CF-RAY: 27f0f02445920f09-IAD</code>
I didn't see any plugins on your site that looked like they would overwrite robots.txt, but I urge you to take another look at them, and then dive into your site's settings for the meta value that Googlebot would pick up. Everything on our end seems to be giving the green light.
Please let us know if you have any other questions or issues in the meantime.
Cheers,
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Meta robots
Hi, I am checking a website for SEO and I've noticed that a lot of pages from the blog have the following meta robots: meta name="robots" content="follow" Normally these pages should be indexed, since search engines will index and follow by default. In this case however, a lot of pages from this blog are not indexed. Is this because the meta robots is specified, but only contains follow? So will search engines only index and follow by default if there is no meta robots specified at all? And secondly, if I would change the meta robots, should I just add index or remove the meta robots completely from the code? Thanks for checking!
Intermediate & Advanced SEO | | Mat_C0 -
Robots.txt Allowed
Hello all, We want to block something that has the following at the end: http://www.domain.com/category/product/some+demo+-text-+example--writing+here So I was wondering if doing: /*example--writing+here would work?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Robots.txt for Facet Results
Hi Does anyone know how to properly add facets URL's to Robots txt? E.g. of our facets URL - http://www.key.co.uk/en/key/platform-trolleys-trucks#facet:-10028265807368&productBeginIndex:0&orderBy:5&pageView:list& Everything after the # will need to be blocked on all pages with a facet. Thank you
Intermediate & Advanced SEO | | BeckyKey0 -
Pages getting into Google Index, blocked by Robots.txt??
Hi all, So yesterday we set up to Remove URL's that got into the Google index that were not supposed to be there, due to faceted navigation... We searched for the URL's by using this in Google Search.
Intermediate & Advanced SEO | | bjs2010
site:www.sekretza.com inurl:price=
site:www.sekretza.com inurl:artists= So it brings up a list of "duplicate" pages, and they have the usual: "A description for this result is not available because of this site's robots.txt – learn more." So we removed them all, and google removed them all, every single one. This morning I do a check, and I find that more are creeping in - If i take one of the suspecting dupes to the Robots.txt tester, Google tells me it's Blocked. - and yet it's appearing in their index?? I'm confused as to why a path that is blocked is able to get into the index?? I'm thinking of lifting the Robots block so that Google can see that these pages also have a Meta NOINDEX,FOLLOW tag on - but surely that will waste my crawl budget on unnecessary pages? Any ideas? thanks.0 -
Robots.txt: Syntax URL to disallow
Did someone ever experience some "collateral damages" when it's about "disallowing" some URLs? Some old URLs are still present on our website and while we are "cleaning" them off the site (which takes time), I would like to to avoid their indexation through the robots.txt file. The old URLs syntax is "/brand//13" while the new ones are "/brand/samsung/13." (note that there is 2 slash on the URL after the word "brand") Do I risk to erase from the SERPs the new good URLs if I add to the robots.txt file the line "Disallow: /brand//" ? I don't think so, but thank you to everyone who will be able to help me to clear this out 🙂
Intermediate & Advanced SEO | | Kuantokusta0 -
If i disallow unfriendly URL via robots.txt, will its friendly counterpart still be indexed?
Our not-so-lovely CMS loves to render pages regardless of the URL structure, just as long as the page name itself is correct. For example, it will render the following as the same page: example.com/123.html example.com/dumb/123.html example.com/really/dumb/duplicative/URL/123.html To help combat this, we are creating mod rewrites with friendly urls, so all of the above would simply render as example.com/123 I understand robots.txt respects the wildcard (*), so I was considering adding this to our robots.txt: Disallow: */123.html If I move forward, will this block all of the potential permutations of the directories preceding 123.html yet not block our friendly example.com/123? Oh, and yes, we do use the canonical tag religiously - we're just mucking with the robots.txt as an added safety net.
Intermediate & Advanced SEO | | mrwestern0 -
My site was number 1 for competetive keyword on Google now completely gone- what should I do?
WHat is the best way to determine if I have somehow been delisted from Google?
Intermediate & Advanced SEO | | TBKO0 -
Block all search results (dynamic) in robots.txt?
I know that google does not want to index "search result" pages for a lot of reasons (dup content, dynamic urls, blah blah). I recently optimized the entire IA of my sites to have search friendly urls, whcih includes search result pages. So, my search result pages changed from: /search?12345&productblue=true&id789 to /product/search/blue_widgets/womens/large As a result, google started indexing these pages thinking they were static (no opposition from me :)), but i started getting WMT messages saying they are finding a "high number of urls being indexed" on these sites. Should I just block them altogether, or let it work itself out?
Intermediate & Advanced SEO | | rhutchings0