GWT False Reporting or GoogleBot has weird crawling ability?
-
Hi, I hope someone can help me.
I have launched a new website and am trying hard to make everything perfect. I have been using Google Webmaster Tools (GWT) to check that everything is as it should be, but the crawl errors it reports do not match my site. I mark them as fixed, and when I check again the next day it reports the same or similar errors.
Example:
http://www.mydomain.com/category/article/ (this would be a correct structure for the site).
GWT reports:
http://www.mydomain.com/category/article/category/article/ 404 (it does not exist, never has and never will). I have been to the pages listed as linking to this URL and they do not contain links in this form. I have checked the page source, and all links on those pages have the correct structure; I cannot reproduce this type of crawl.
This happens across most of the site. I have a few hundred pages, all ending in a trailing slash, and most of them are reported in this manner, making it look like I have close to 1,000 404 errors, yet I cannot reproduce the crawl using many different methods.
The site uses an .htaccess file with redirects and a rewrite condition.
Rewrite Condition:
# Need to redirect when no trailing slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !\.(html|shtml)$
RewriteCond %{REQUEST_URI} !(.*)/$
RewriteRule ^(.*)$ /$1/ [L,R=301]
The above condition forces the trailing slash on folders.
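To reason about what the trailing-slash rule can and cannot do, here is a minimal Python sketch of its logic. It is an approximation, not Apache itself: the function name `trailing_slash_redirect` and the `is_existing_file` flag are hypothetical, the file-existence check is passed in rather than read from disk, and the extension test is applied to the request path instead of the mapped filename.

```python
import re
from typing import Optional


def trailing_slash_redirect(request_uri: str, is_existing_file: bool = False) -> Optional[str]:
    """Return the 301 target the rule would issue, or None if no redirect applies."""
    # RewriteCond %{REQUEST_FILENAME} !-f  -> skip requests for real files
    if is_existing_file:
        return None
    # RewriteCond %{REQUEST_FILENAME} !\.(html|shtml)$  -> skip .html/.shtml
    if re.search(r"\.(html|shtml)$", request_uri):
        return None
    # RewriteCond %{REQUEST_URI} !(.*)/$  -> skip paths that already end in "/"
    if request_uri.endswith("/"):
        return None
    # RewriteRule ^(.*)$ /$1/ [L,R=301]  -> append a single trailing slash
    return "/" + request_uri.lstrip("/") + "/"


print(trailing_slash_redirect("/category/article"))
print(trailing_slash_redirect("/category/article/"))
```

Under these assumptions the rule only ever appends one slash to the path it was given. It never repeats path segments, so on its own it would not turn /category/article/ into /category/article/category/article/.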
Then we are using redirects in this manner:
Redirect 301 /article.html http://www.domain.com/article/
In addition to the above, we had a development site while I was building the new site, at http://dev.slimandsave.co.uk. It had been spidered without my knowledge, and by the time I noticed it was too late. So when I put the site live I left the development domain in place (http://dev.domain.com) and redirected it like so:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteRule ^ - [E=protossl]
RewriteCond %{HTTPS} on
RewriteRule ^ - [E=protossl:s]
RewriteRule ^ http%{ENV:protossl}://www.domain.com%{REQUEST_URI} [L,R=301]
</IfModule>
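As a rough illustration of what the dev-domain redirect does, the sketch below rebuilds the target URL the same way the rules do. The function name `dev_redirect_target` and the `https_on` flag are hypothetical stand-ins for the `protossl` environment variable and the `%{HTTPS}` check.

```python
def dev_redirect_target(request_uri: str, https_on: bool = False) -> str:
    """Build the 301 target for any request that hits the dev domain."""
    # E=protossl sets an empty variable; E=protossl:s sets it to "s" when HTTPS is on
    protossl = "s" if https_on else ""
    # The original request path is appended unchanged to the live hostname
    return f"http{protossl}://www.domain.com{request_uri}"


print(dev_redirect_target("/category/article/"))
print(dev_redirect_target("/category/article/category/article/"))
```

If this reading is right, the redirect passes the path through untouched. Any malformed URL that was crawled on the dev domain would therefore be redirected to the same malformed path on the live domain, where it would return a 404.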
Is there anything that I have done that would cause this type of redirect 'loop' ?
Any help greatly appreciated.
-
Yeah - do this!
-
Anyone any thoughts on this?
-
Sorry, I should also add that the URL structure Google generates looks like this:
http://www.domain.com/category/article/
http://www.domain.com/category/article/same-category/differentarticle/
http://www.domain.com/category/article/same-category/another-different-article/
http://www.domain.com/category/article/another-different-category/differentarticle/
and so on. It is as if the crawler reaches a category article, then moves sideways to another article and appends that article's path to the current URL instead of replacing the current path.
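One possible explanation for this pattern, offered as an assumption rather than a diagnosis, is a relative link (an href with no leading slash) somewhere in the page templates. Crawlers resolve such links against the current URL, and when that URL ends in a trailing slash the link's path is appended to it. The sketch below uses Python's standard `urllib.parse.urljoin` to show the resolution; the href values are made-up examples based on the URLs above.

```python
from urllib.parse import urljoin

# The page being crawled (ends in a trailing slash)
base = "http://www.domain.com/category/article/"

# Relative href: no leading slash, so it is resolved against the current folder
relative = urljoin(base, "same-category/differentarticle/")

# Root-relative href: leading slash, so it is resolved against the site root
root_relative = urljoin(base, "/same-category/differentarticle/")

print(relative)       # http://www.domain.com/category/article/same-category/differentarticle/
print(root_relative)  # http://www.domain.com/same-category/differentarticle/
```

If this is the cause, searching the rendered HTML (including any links written by JavaScript) for href values that start without a slash or scheme would be a reasonable first check.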
-
It doesn't sound like GWT is reporting falsely. You may want to check your trailing-slash URL rewrite. What you are describing sounds like URLs being written incorrectly somewhere, which would generate the bad URLs that show up in GWT.
Your 301 looks OK. If the dev site was spidered and indexed, you should add the dev site to GWT, use the URL removal tool to remove it from the index, then remove the site and redirect.