Google has deindexed 40% of my site because it's having problems crawling it
-
Hi
Last week I got my fifth email saying 'Google can't access your site'. The first one arrived in early November. Since then my site has gone from almost 80k indexed pages to fewer than 45k, and the number keeps dropping even though we post about 100 new articles daily (it's an online newspaper).
The site I'm talking about is http://www.gazetaexpress.com/
We have to deal with DDoS attacks most of the time, so our server guy has implemented a firewall to protect the site from them. We suspect that the firewall is blocking Googlebot from crawling and indexing our site. But then things get more interesting: some parts of the site are being crawled regularly and others not at all. If the firewall were stopping Googlebot from crawling the site, why are some parts crawled with no problems while others aren't?
In the screenshot attached to this post you will see how Google Webmasters is reporting these errors.
In this link, it says that if the 'Error' status happens again you should contact Google Webmaster support because something is preventing Google from fetching the site. I used the Feedback form in Google Webmasters to report this error about two months ago but haven't heard back. Did I use the wrong form to contact them? If so, how can I reach them and tell them about my problem?
If you need more details feel free to ask. I will appreciate any help.
Thank you in advance
-
Great news - strange that these 608 errors didn't appear while crawling the site with Screaming Frog.
-
We found the problem. It was the website compression (GZIP). I found this after crawling my site with Moz and seeing lots of pages with a 608 error code. Then I searched Google and found a response by Dr. Pete to another question here in Moz Q&A (http://moz.com/community/q/how-do-i-fix-608-s-please).
After we turned off GZIP, Google could crawl the site with no problems.
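For anyone who wants to test for this kind of compression problem themselves, here is a minimal Python sketch. It assumes the trouble was a response body that does not decode as the gzip encoding its header declares; the thread does not confirm the exact failure, and the function name `check_gzip_body` is made up for illustration.

```python
import gzip
import zlib


def check_gzip_body(body: bytes, content_encoding: str) -> str:
    """Classify a raw response body against its Content-Encoding header.

    Hypothetical helper: returns "ok", "not gzip-encoded", or a
    "broken gzip: ..." message naming the exception that was raised.
    """
    if content_encoding.strip().lower() != "gzip":
        return "not gzip-encoded"
    try:
        # Raises if the bytes are not valid gzip or the stream is truncated.
        gzip.decompress(body)
    except (OSError, EOFError, zlib.error) as exc:
        return f"broken gzip: {exc.__class__.__name__}"
    return "ok"
```

To feed it real data, one option is `urllib.request` with an `Accept-Encoding: gzip` request header, since urllib hands back the body without decompressing it. Pass in the bytes from `response.read()` and the value of the `Content-Encoding` response header.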
-
Dirk
Thanks a lot for your help. Unfortunately the problem remains the same. More than 65% of the site has been de-indexed and it's making our work very difficult.
I'm hoping that somebody here might have an idea of what is causing this so we can find a way to fix it.
Thank you all for your time.
-
Hi
Not sure if the indexing problem is solved now, but I did a few other checks. Most of the tools I used were able to fetch the problem URL without much trouble, even from California IPs and while simulating Googlebot.
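A check like this can also be scripted. The sketch below is only an illustration, not the tooling used above: it sends a request with a Googlebot-style User-Agent string and reports the HTTP status. The function names are hypothetical. It imitates the user agent only, not Google's IP addresses, so a firewall that filters by IP would not show up in this test.

```python
import urllib.error
import urllib.request

# User-Agent string in the style Googlebot sends.
GOOGLEBOT_UA = (
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)


def build_googlebot_request(url: str) -> urllib.request.Request:
    """Build a GET request that identifies itself as Googlebot."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})


def fetch_status(url: str, timeout: float = 30.0) -> int:
    """Return the HTTP status code the server gives a Googlebot-like client."""
    try:
        with urllib.request.urlopen(
            build_googlebot_request(url), timeout=timeout
        ) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still useful.
        return err.code
```

Calling `fetch_status("http://www.gazetaexpress.com/fun/")` and comparing the result with the same request under a normal browser user agent would show whether the server treats the two differently.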
I noticed that some of the pages (for example http://www.gazetaexpress.com/fun/) are quite empty if you browse them without JavaScript active. Navigating through the site with JavaScript is extremely slow, and a lot of links don't seem to respond. When I tried to go from /fun/ to /sport/ without JavaScript, I got a 504 Gateway Time-out.
Normally Google is now capable of indexing content by executing the JavaScript, but it's always better to have a non-JavaScript fallback that can always be indexed (http://googlewebmastercentral.blogspot.be/2014/05/understanding-web-pages-better.html). The article states explicitly:
- If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources.
This could be the reason for the strange errors when you try to fetch as Google.
Hope this helps,
Dirk
-
Hi Dirk
Thanks a lot for your reply.
Today we turned off the firewall for a couple of hours and tried to fetch the site as Google. It didn't work. The results were the same as before.
This problem is getting pretty ugly, since Google has now stopped labelling our mobile results as 'mobile-friendly' even though we have a mobile version of the site. We use rel=canonical and rel=alternate, plus 302 redirects that send smartphone users from desktop pages to the mobile ones.
Any other ideas about what might be causing this?
Thanks in advance
-
Hi,
It seems that your pages are extremely heavy to load. I ran two tests: one on your homepage and one on the /moti-sot page.
Your homepage needed a whopping 73 seconds to load (http://www.webpagetest.org/result/150312_YV_H5K/1/details/). The moti-sot page is quicker, but 8 seconds is still rather high (http://www.webpagetest.org/result/150312_SK_H9M/).
I sometimes noticed a crash of the Shockwave Flash plugin, but I'm not sure if this is related to your problem. I crawled your site with Screaming Frog, but it didn't really find any indexing problems. While you have a lot of pages very deep in your site structure, the bot didn't seem to have any specific trouble accessing your pages. Websniffer returns a normal 200 code when checking your site, even with the user agent "Google".
So I guess you're right about the firewall. Maybe it's blocking the IP addresses used by Googlebot. Do you have reporting from the firewall showing which traffic is blocked? Try searching for the user agent Googlebot in your log files and see whether this traffic is rejected. The fact that some sections are indexed and others are not could be related to the configuration of the firewall and/or the IP addresses Googlebot uses to check your site (the bot does not always use the same IP address).
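As a rough idea of what that log search could look like, here is a small Python sketch. It assumes the web server writes the common "combined" access log format, which may not match your setup. The function name and the grouping by first path segment are illustrative choices, made so you can see whether sections such as /fun/ and /sport/ are treated differently.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip - - [time] "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_status_by_section(lines):
    """Count (site section, HTTP status) pairs for requests claiming to be Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or "googlebot" not in match.group("agent").lower():
            continue
        # Drop any query string, then keep the first path segment as the section.
        path = match.group("path").split("?", 1)[0]
        section = "/" + path.lstrip("/").split("/", 1)[0]
        counts[(section, match.group("status"))] += 1
    return counts
```

You could run it with something like `googlebot_status_by_section(open("access.log"))`. Two caveats: requests that the firewall drops before they reach the web server will not appear in this log at all, so the firewall's own reporting is still needed, and the user agent can be faked, so Google recommends a reverse DNS lookup on the IP to confirm that a hit really came from Googlebot.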
Hope this helps,
Dirk