GWT False Reporting or GoogleBot has weird crawling ability?
-
Hi I hope someone can help me.
I have launched a new website and trying hard to make everything perfect. I have been using Google Webmaster Tools (GWT) to ensure everything is as it should be but the crawl errors being reported do not match my site. I mark them as fixed and then check again the next day and it reports the same or similar errors again the next day.
Example:
http://www.mydomain.com/category/article/ (this would be a correct structure for the site).
GWT reports:
http://www.mydomain.com/category/article/category/article/ 404 (It does not exist, never has and never will) I have been to the pages listed to be linking to this page and it does not have the links in this manner. I have checked the page source code and all links from the given pages are correct structure and it is impossible to replicate this type of crawl.
This happens accross most of the site, I have a few hundred pages all ending in a trailing slash and most pages of the site are reported in this manner making it look like I have close to 1000, 404 errors when I am not able to replicate this crawl using many different methods.
The site is using a htacess file with redirects and a rewrite condition.
Rewrite Condition:
Need to redirect when no trailing slash
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !.(html|shtml)$
RewriteCond %{REQUEST_URI} !(.)/$
RewriteRule ^(.)$ /$1/ [L,R=301]The above condition forces the trailing slash on folders.
Then we are using redirects in this manner:
Redirect 301 /article.html http://www.domain.com/article/
In addition to the above we had a development site whilst I was building the new site which was http://dev.slimandsave.co.uk now this had been spidered without my knowledge until it was too late. So when I put the site live I left the development domain in place (http://dev.domain.com) and redirected it like so:
<ifmodule mod_rewrite.c="">RewriteEngine on
RewriteRule ^ - [E=protossl]
RewriteCond %{HTTPS} on
RewriteRule ^ - [E=protossl:s]RewriteRule ^ http%{ENV:protossl}://www.domain.com%{REQUEST_URI} [L,R=301]</ifmodule>
Is there anything that I have done that would cause this type of redirect 'loop' ?
Any help greatly appreciated.\
-
Yeah - do this!
-
Anyone any thoughts on this?
-
Sorry I also should add that the url structure that google generates is like this:
http://www.domain.com/category/article/
http://www.domain.com/category/article/same-category/differentarticle/
http://www.domain.com/category/article/same-category/another-different-article/
http://www.domain.com/category/article/another-different-category/differentarticle/
etc, it is like it gets to a category article and then moves sideways and somehow adds the move onto the current url without keeping hold of the suffix of the URL
-
Doesn't sound like GWT is false reporting. May want to check your trailing slash URL rewrite. It seems like there is an issue there as what you are describing sounds like the URLs are being written incorrectly and causing the incorrect URLs to be generated and show up in GWT.
Your 301 looks ok and if the dev site was spidered and indexed, you should just add the site to GWT and then use the URL removal tool to remove the site from the index, then remove the site and redirect.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Huge number of crawl anomalies and 404s - non- existent urls
Hi there, Our site was redesigned at the end of January 2020. Since the new site was launched we have seen a big drop in impressions (50-60%) and also a big drop in total and organic traffic (again 50-60%) when compared to the old site. I know in the current climate some businesses will see a drop in traffic, however we are a tech business and some of our core search terms have increased in search volume as a result of remote-working. According to search console there are 82k urls excluded from coverage - the majority of these are classed as 'crawl anomaly' and there are 250+ 404's - almost all of the urls are non-existent, they have our root domain with a string of random characters on the end. Here are a couple of examples: root.domain.com/96jumblestorebb42a1c2320800306682 root.domain.com/01sportsplazac9a3c52miz-63jth601 root.domain.com/39autoparts-agency26be7ff420582220 root.domain.com/05open-kitchenaf69a7a29510363 Is this a cause for concern? I'm thinking that all of these random fake urls could be preventing genuine pages from being indexed / or they could be having an impact on our search visibility. Can somebody advise please? Thanks!
Technical SEO | | nicola-10 -
How google crawls images and which url shows as source?
Hi, I noticed that some websites host their images to a different url than the one their actually website is hosted but in the end google link to the one that the site is hosted. Here is an example: This is a page of a hotel in booking.com: http://www.booking.com/hotel/us/harrah-s-caesars-palace.en-gb.html When I try a search for this hotel in google images it shows up one of the images of the slideshow. When I click on the image on Google search, if I choose the Visit Page button it links to the url above but the actual image is located in a totally different url: http://r-ec.bstatic.com/images/hotel/840x460/135/13526198.jpg My question is can you host your images to one site but show it to another site and in the end google will lead to the second one?
Technical SEO | | Tz_Seo0 -
Weird Cigarette URLs showing up in Google Webmaster Tools
Hi there, I'm noticing a bunch of URLs showing up in my google webmaster tools that are all cigarette related (they are appearing as 404s in the crawl error report). They are throwing 404 errors which is why they are listed here... Anyone have any idea of what this could be? I recently switched from Wordpress to Shopify and these weird URLs just started appearing on my webmaster tools in the last week. Kinda bizarre / a little alarming! Thanks,
Technical SEO | | TheBatesMillStore
Bianca0 -
20 000 duplicates in Moz crawl due to Joomla URL parameters. How to fix?
We have a problem of massive duplicate content in Joomla. Here is an example of the "base" URL: http://www.binary-options.biz/index.php/Web-Pages/binary-options-platforms.html For some reason Joomla creates many versions of this URL, for example: http://www.binary-options.biz/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html or http://www.binary-options.biz/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html?q=/index.php/Web-Pages/binary-options-platforms.html So it lists the URL parameter ?q= and then repeats part of the beforegoing URL. This leads to tens of thousands duplicate pages in our content heavy site. Any ideas how to fix this? Thanks so much!
Technical SEO | | Xmanic0 -
I have 3500 pages crawled by Google, - why is SEOMOZ only able to crawl 400 of these ?
I added my site almost two weeks ago to the PRO DashBoard, and so far only 404 pages has been crawled, - but I know for a fact that there is 3500 pages that should be crawled. Other search engines has no problem in crawling and indexing these pages, so what can be wrong here ?
Technical SEO | | haybob270 -
Application/x-msdownload in crawl report?
In a crawl report for my company's site, the "content_type_header" column usually contains "text/html". But there are some random pages with "application/x-msdownload"... what does "application/x-msdownload" mean? <colgroup><col width="121"></colgroup>
Technical SEO | | JimLynch
| |0 -
Very Weird Type of Penguin Penalization
One of my client's sites has a bunch of bad links from blog networks with exact-match anchor text. Since Penguin, they have been completely removed from Google for that keyword. But here's the weird part: It's only the homepage that has been removed, and only for that keyword. If I put other keywords into Google, our homepage comes up. So the site hasn't been banned, and that page hasn't even been banned because it still comes up with all of our other keywords. It's only when you put in the keyword that has all the anchor text that the homepage doesn't come up anywhere. (I went all the way to the end). Has this happened to anyone else, and does it warrant a re-inclusion request since the site and even that page haven't technically been banned?
Technical SEO | | UnderRugSwept0 -
What is the largest page size a searchbot will crawl?
When setting up pagination, what should we limit the page size to? When will a searchbot stop crawling a particular page?
Technical SEO | | nicole.healthline0