Rogerbot Ignoring Robots.txt?
-
Hi guys,
We're trying to block Rogerbot from spending 8000-9000 of our 10000 pages per week for our site crawl on our zillions of PhotoGallery.asp pages. Unfortunately our e-commerce CMS isn't tremendously flexible so the only way we believe we can block rogerbot is in our robots.txt file.
Rogerbot keeps crawling all these PhotoGallery.asp pages so it's making our crawl diagnostics really useless.
I've contacted the SEOMoz support staff and they claim the problem is on our side. This is the robots.txt we are using:
User-agent: rogerbot
Disallow:/PhotoGallery.asp
Disallow:/pindex.asp
Disallow:/help.asp
Disallow:/kb.asp
Disallow:/ReviewNew.asp
User-agent: *
Disallow:/cgi-bin/
Disallow:/myaccount.asp
Disallow:/WishList.asp
Disallow:/CFreeDiamondSearch.asp
Disallow:/DiamondDetails.asp
Disallow:/ShoppingCart.asp
Disallow:/one-page-checkout.asp
Sitemap: http://store.jrdunn.com/sitemap.xml
For some reason the Wysiwyg edit is entering extra spaces but those are all single spaced.
Any suggestions? The only other thing I thought of to try is to something like "Disallow:/PhotoGallery.asp*" with a wildcard.
-
I have just encountered an interesting thing about Moz Link Search and its bot: if you do a search for Domains linking to Google.com , you find a list of about 900 000 domains, among which I was surprised to find webcache.googleusercontent.com
See the proof below in attache screen shot.
At the same time, the webcache.googleusercontent.com policy for robots is as shown in the second attachment.
In my opinion, there is only one possible explanation: Moz Bot does ignore robots.txt files...
-
Thanks Cyrus,
No, for some reason the editor double-spaced the file when I pasted. Other than that, it's the same though.
Yes, I actually tried ordering the exclusions both ways. Neither works.
The robots.txt checkers report no errors. I had actually checked them before posting.
Before I posted this, I was pretty convinced the problem wasn't in our robots.txt but the Seomoz support staff says essentially, "We don't think the problem is with Rogerbot, so it must be in your robots.txt file, but we can't look at that, so if by some chance your robots.txt file is fine, then there's nothing we can do for you because we're just going to assume the problem is on your side."
I figured, with everything I've already tried, and if the fabulous SEOMoz community can't come up with a solution, that'll be the best I can do.
-
Hi Kelly,
Thanks for letting us know. Could be a couple of things right off the bat. Is this your exact robots.txt file? If so, it's missing some formatting like proper spacing to be perfectly compliant. You can run a check of your robots.txt file at serveral places.
http://tool.motoricerca.info/robots-checker.phtml
http://www.searchenginepromotionhelp.com/m/robots-text-tester/robots-checker.php
http://www.sxw.org.uk/computing/robots/check.html
Also, it's generally a good idea to put specific inclusions towards the bottom, so I might flip the order and put the rogerbot directives last and the User-agent: * first.
Hope this helps. Let us know if any of this points in the right direction.
-
Thanks so much for the tip. Unfortunately still unsuccessful. (shrug)
-
Try
Disallow: /PhotoGallery.asp
I put wild cards all over usually just to be sure and had no issues so far.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Unsolved Rogerbot blocked by cloudflare and not display full user agent string.
Hi, We're trying to get MOZ to crawl our site, but when we Create Your Campaign we get the error:
Moz Pro | | BB_NPG
Ooops. Our crawlers are unable to access that URL - please check to make sure it is correct. If the issue persists, check out this article for further help. robot.txt is fine and we actually see cloudflare is blocking it with block fight mode. We've added in some rules to allow rogerbot but these seem to be getting ignored. If we use a robot.txt test tool (https://technicalseo.com/tools/robots-txt/) with rogerbot as the user agent this get through fine and we can see our rule has allowed it. When viewing the cloudflare activity log (attached) it seems the Create Your Campaign is trying to crawl the site with the user agent as simply set as rogerbot 1.2 but the robot.txt testing tool uses the full user agent string rogerbot/1.0 (http://moz.com/help/pro/what-is-rogerbot-, rogerbot-crawler+shiny@moz.com) albeit it's version 1.0. So seems as if cloudflare doesn't like the simple user agent. So is it correct the when MOZ is trying to crawl the site it uses the simple string of just rogerbot 1.2 now ? Thanks
Ben Cloudflare activity log, showing differences in user agent strings
2022-07-01_13-05-59.png0 -
Meta Robots query
Hi guys, I was ranking really well on my home page for certain keywords which has all dropped pretty dramatically over the last 3/4 weeks - I think the issue is since since the configuration of Yoast SEO Wordpress plugin. In March (when my rankings were strong) my crawl test showed the top data in the attached image, and in May (now the rankings have dropped severly) they show the bottom data. I don't fully understand canonical and Meta Robots so I am hoping someone can shed some light on the following points. 1. Will the change result in my loss of rankings.
Moz Pro | | RocketStats
2. How can I put it back to how it was in March? PS. I haven't had any Google penalties. Thanks,
Joshua RfTar0 -
Rogerbot crawls my site and causes error as it uses urls that don't exist
Whenever the rogerbot comes back to my site for a crawl it seems to want to crawl urls that dont exist and thus causes errors to be reported... Example:- The correct url is as follows: /vw-baywindow/cab_door_slide_door_tailgate_engine_lid_parts/cab_door_seals/genuine_vw_brazil_cab_door_rubber_68-79_10330/ But it seems to want to crawl the following: /vw-baywindow/cab_door_slide_door_tailgate_engine_lid_parts/cab_door_seals/genuine_vw_brazil_cab_door_rubber_68-79_10330/?id=10330 This format doesn't exist anywhere and never has so I have no idea where its getting this url format from The user agent details I get are as follows: IP ADDRESS: 107.22.107.114
Moz Pro | | spiralsites
USER AGENT: rogerbot/1.0 (http://moz.com/help/pro/what-is-rogerbot-, rogerbot-crawler+pr1-crawler-17@moz.com)0 -
Meta-Robots noFollow and Blocked by Meta-Robots
On my most recent campaign report, I have 2 Notices that we can't find any cause for: Meta-Robots nofollow-
Moz Pro | | gfiedel
http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/?replytocom=92
"noindex nofollow" for the page: http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/ Blocked by Meta-Robots -Meta-Robots nofollow-
http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/?replytocom=92
"noindex nofollow" for the page: http://www.fateyes.com/the-effect-of-social-media-on-the-serps-social-signals-seo/ We are unable to locate any code whatsoever that may explain this. Any ideas anyone?0 -
Its been over a month, rogerbot hasn't crawled the entire website yet. Any ideas?
Rogerbot has stopped crawling the website at 308 pages past week and has not crawled the website with over 1000+ pages. Any ideas on what I can do to get this fixed & crawling?
Moz Pro | | TejaswiNaidu0 -
How do you tell SEOmoz to ignore a subdomain?
We have a subdomain that I don't want to show up in our root SEOmoz campaign. How do I tell SEOmoz to ignore it?
Moz Pro | | wcsjohn0 -
Is seomoz rogerbot only crawling the subdomains by links or as well by id?
I´m new at seomoz and just set up a first campaign. After the first crawling i got quite a few 404 errors due to deleted (spammy) forum threads. I was sure there are no links to these deleted threads so my question is weather the seomoz rogerbot is only crawling my subdomains by links or as well by ids (the forum thread ids are serially numbered from 1 to x). If the rogerbot crawls as well serially numbered ids do i have to be concerned by the 404 error on behalf of the googlebot as well?
Moz Pro | | sauspiel0 -
To block with robots.txt or canonicalize?
I'm working with an apt community with a large number of communities across the US. I'm running into dup content issues where each community will have a page such as "amenities" or "community-programs", etc that are nearly identical (if not exactly identical) across all communities. I'm wondering if there are any thoughts on the best way to tackle this. The two scenarios I came up with so far are: Is it better for me to select the community page with the most authority and put a canonical on all other community pages pointing to that authoritative page? or Should i just remove the directory all-together via robots.txt to help keep the site lean and keep low quality content from impacting the site from a panda perspective? Is there an alternative I'm missing?
Moz Pro | | JonClark150