Google has deindexed 40% of my site because it's having problems crawling it
-
Hi
Last week I got my fifth email saying 'Google can't access your site'. The first one arrived in early November. Since then my site has gone from almost 80k indexed pages to fewer than 45k, and the number keeps dropping even though we post about 100 new articles daily (it's an online newspaper).
The site I'm talking about is http://www.gazetaexpress.com/
We have to deal with DDoS attacks most of the time, so our server guy has implemented a firewall to protect the site from these attacks. We suspect that the firewall is blocking Googlebot from crawling and indexing our site. But then things get more interesting: some parts of the site are being crawled regularly and others not at all. If the firewall were stopping Googlebot from crawling the site, why are some parts crawled with no problems while others aren't?
In the screenshot attached to this post you will see how Google Webmasters is reporting these errors.
In this link, it says that if the 'Error' status happens again you should contact Google Webmaster support, because something is preventing Google from fetching the site. I used the feedback form in Google Webmasters to report this error about two months ago but haven't heard back. Did I use the wrong form to contact them? If so, how can I reach them and tell them about my problem?
If you need more details feel free to ask. I will appreciate any help.
Thank you in advance
-
Great news - strange that these 608 errors didn't appear while crawling the site with Screaming Frog.
-
We found the problem. It was the website compression (GZIP). I found this after crawling my site with Moz and seeing lots of pages with a 608 error code. Then I searched Google and found a response by Dr. Pete to another question here in Moz Q/A (http://moz.com/community/q/how-do-i-fix-608-s-please)
After we removed the GZIP, Google could crawl the site with no problems.
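In case it helps anyone checking for the same thing, here is a minimal Python sketch of how a response could be tested for a compression mismatch. It assumes you have already captured the response headers and the raw, undecoded body with a client that does not decompress automatically. The function name `check_gzip_body` and the result strings are made up for illustration, and this is not necessarily what Moz or Google check internally.

```python
import gzip
import zlib


def check_gzip_body(headers: dict, body: bytes) -> str:
    """Compare a raw (undecoded) response body with its Content-Encoding header.

    `headers` is assumed to be a plain dict using the exact key
    "Content-Encoding"; real HTTP header names are case-insensitive.
    """
    encoding = headers.get("Content-Encoding", "").strip().lower()
    looks_gzipped = body[:2] == b"\x1f\x8b"  # gzip magic number

    if encoding != "gzip":
        # The header does not announce gzip, so the body should be plain.
        if looks_gzipped:
            return "mismatch: gzipped body without header"
        return "ok: not compressed"

    try:
        gzip.decompress(body)
    except (OSError, EOFError, zlib.error):
        # Not gzip at all, truncated, or a corrupted stream.
        return "broken: header says gzip but body does not decompress"
    return "ok: valid gzip"
```

A "broken" or "mismatch" result on some URLs but not others would be consistent with only parts of a site being crawlable.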
-
Dirk
Thanks a lot for your help. Unfortunately the problem remains the same. More than 65% of the site has been de-indexed and it's making our work very difficult.
I'm hoping that somebody here might have an idea of what is causing this so we can find a way to fix it.
Thank you all for your time.
-
Hi
Not sure if the indexing problem is solved now, but I did a few other checks. Most of the tools I used were able to fetch the problem URL without much trouble, even from California IPs and while simulating Googlebot.
I noticed that some of the pages (for example http://www.gazetaexpress.com/fun/) are quite empty if you browse them without JavaScript active. Navigating through the site with JavaScript is extremely slow, and a lot of links don't seem to respond. When trying to go from /fun/ to /sport/ without JavaScript, I got a 504 Gateway Time-out.
Normally Google is now capable of indexing content by executing the JavaScript, but it's always better to have a non-JavaScript fallback that can always be indexed (http://googlewebmastercentral.blogspot.be/2014/05/understanding-web-pages-better.html) - the article states explicitly:
- If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
This could be the reason for the strange errors when trying to fetch like Google.
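To put a rough number on "quite empty without JavaScript", something like the following Python sketch could be used. It only counts words in the raw HTML that sit outside `<script>` and `<style>` tags, which is an assumption on my part about what a non-rendering crawler would see; the names `VisibleTextParser` and `visible_word_count` are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text found outside <script> and <style> elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(html: str) -> int:
    """Return the number of words present in the HTML before any script runs."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return sum(len(chunk.split()) for chunk in parser.chunks)
```

A page whose raw HTML gives a count near zero depends entirely on JavaScript rendering, and with it on the server answering every resource request in time.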
Hope this helps,
Dirk
-
Hi Dirk
Thanks a lot for your reply.
Today we turned off the firewall for a couple of hours and tried to fetch the site as Google. It didn't work. The results were the same as before.
This problem is getting pretty ugly, since Google has now stopped showing our mobile results as 'mobile-friendly' even though we have a mobile version of the site. We are using rel=canonical and rel=alternate, plus 302 redirects from desktop pages to mobile ones for users browsing via smartphone.
Any other idea what might be causing this?
Thanks in advance
-
Hi,
It seems that your pages are extremely heavy to load - I did 2 tests, on your homepage and on the /moti-sot page.
Your homepage needed a whopping 73 seconds to load (http://www.webpagetest.org/result/150312_YV_H5K/1/details/) - the moti-sot page is quicker, but 8 seconds is still rather high (http://www.webpagetest.org/result/150312_SK_H9M/)
I sometimes noticed a crash of the Shockwave Flash plugin, but I'm not sure if this is related to your problem. I crawled your site with Screaming Frog, but it didn't really find any indexing problems - while you have a lot of pages very deep in your site structure, the bot didn't seem to have any specific trouble accessing your pages. Websniffer returns a normal 200 code when checking your site - even with user agent "Google".
So I guess you're right about the firewall - maybe it's blocking the IP addresses used by Googlebot. Do you have reporting from the firewall showing which traffic is blocked? Try searching for the user agent Googlebot in your log files and see if this traffic is rejected. The fact that some sections are indexed and others are not could be related to the configuration of the firewall and/or the IP addresses Googlebot uses to check your site (the bot does not always use the same IP address).
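The log check could be sketched roughly like this in Python. It assumes your access log uses the common Apache/Nginx "combined" format, and the regular expression and the function name `googlebot_status_by_section` are my own illustration, so adjust them to your real log layout. It groups Googlebot requests by the first path segment (e.g. /fun or /sport) and counts the status codes for each.

```python
import re
from collections import Counter, defaultdict

# Assumed layout: IP, two ident fields, [timestamp], "request", status, size,
# "referrer", "user agent" (the "combined" access log format).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_by_section(lines):
    """Count HTTP status codes per top-level site section for Googlebot hits."""
    stats = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "googlebot" not in match["agent"].lower():
            continue  # unparseable line or a different user agent
        # "/fun/some-article?x=1" -> "/fun"
        first_segment = match["path"].lstrip("/").split("/", 1)[0].split("?", 1)[0]
        stats["/" + first_segment][match["status"]] += 1
    return stats
```

You would feed it the lines of your log file (for example `googlebot_status_by_section(open("access.log"))`, where the file name is a placeholder). Two caveats: the user agent string can be faked, and requests dropped by the firewall before they reach the web server will not appear in this log at all, so a section with no Googlebot hits is as suspicious as one full of 5xx codes.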
Hope this helps,
Dirk