Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
May know what's the meaning of these parameters in .htaccess?
-
Begin HackRepair.com Blacklist
RewriteEngine on
Abuse Agent Blocking
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bolt\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CazoodleBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Default\ Browser\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} discobot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ecxi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GT::WWW [NC,OR]
RewriteCond %{HTTP_USER_AGENT} heritrix [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTP::Lite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IDBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} id-search.org [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IRLbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ISC\ Systems\ iRc\ Search\ 2.1 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww [NC,OR]
RewriteCond %{HTTP_USER_AGENT} libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Link [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LinksManager.com_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} linkwalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} lwp-trivial [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Maxthon$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MFC_Tear_Sample [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^microsoft.url [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*Indy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} panscient.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PeoplePal [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PHPCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PleaseCrawl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Rippers\ 0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SBIder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SeaMonkey$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Steeler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Toata\ dragostea\ mea\ pentru\ diavola [NC,OR]
RewriteCond %{HTTP_USER_AGENT} URI::Fetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} User-Agent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webalta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCollage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Wells\ Search\ II [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WEP\ Search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} zermelo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(.)Zeus.Webster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]
RewriteRule ^. - [F,L]Abuse bot blocking rule end
End HackRepair.com Blacklist
-
Now it's clear. Thanks a lot ThompsonPaul!
-
Thanks!
Typically these blacklists are created and maintained by security specialists who have done testing on the different bots to determine which are legit/beneficial and which are crapbots. They then provide these lists for others to use. Often the lists are amalgamations of bots detected and analysed on a number of different sites and by a number of different specialists to act as a double-check for each other.
You do need to be careful that you are using a well-curated list, as carelessly blocking bots can cause problems for legitimate bots. You would check out the creator of such a list the same way you'd check out the creator of a plugin you're considering using - check reviews, look at comments and responses on the post that provides the blacklist etc.
That answer your question?
Paul
-
Hi ThompsonPaul,
Wow! Superb explanation. One thing I just want to clarify, how would I know if these bots are "bad bots".
Thanks a lot!
-
As Lynn mentions, these entries form a blacklist for "bad bots". These are bots that are identified as being harmful (or at least non-helpful) to the real use of a website. Bots are essentially spiders that crawl and record the pages of your site the same way the GoogleBot does.There are 2 main reasons for blocking them
-
Too many unnecessary bots can put a real strain on server resources, causing the site to slow down for real users. This can be especially problematic with bad bots as they do not respect the entries in your robots.txt file and so will crawl even blocked pages. This can mean huge numbers of extra pages get crawled, leading to even more load.
-
Many (most?) of these bots are collecting data for nefarious purposes. Some are scrapers to collect your site content in order to re-use it illegally on another site, some are scanning for certain files/plugins on your site known to be insecure so they can target them for attack, etc.
Best case scenario, these bots waste your bandwidth and can cause site slowdowns on low-powered (e.g. shared) servers. Worst case, they can actually cause harm to your site.
There are literally many thousands of these types of bots out there, and their creators often change their identifying user agents just to get around these types of blacklists. But many have been around for some time and still use the same identifier. So having a blacklist to block the most common of them is actually very good security practice. To be totally proactive however, you'd need to update the list every couple of months.
Bottom line - those entries are providing some security and overload protection for your site, and there's essentially no downside to having them in place even if they're not catching everything.
Hope that helps - if any of my explanation isn't clear, just holler
Paul
-
-
Thanks Lynn! I'll just remove these parameters and leave this one:
BEGIN WordPress
<ifmodule mod_rewrite.c="">RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
Rewritecond %{http_host} ^domain.com [NC]
Rewriterule ^(.*)$ http://www.domain.com/$1 [R=301,NC]</ifmodule>END WordPress
-
I dont use something like this myself. I suppose if you are having some problem with bots it might be useful, maybe someone else can chime in if they have some experience with this kind of blocking.
-
Thanks Lynn! Is this really necessary?
-
HI,
It is checking to see if the visiting user agent contains any of these strings (NC is telling it non case sensitive) and if it does to return a 403 forbidden message.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Forwarded vanity domains, suddenly resolving to 404 with appended URL's ending in random 5 characters
We have several vanity domains that forward to various pages on our primary domain.
Intermediate & Advanced SEO | | SS.Digital
e.g. www.vanity.com (301)--> www.mydomain.com/sub-page (200) These forwards have been in place for months or even years and have worked fine. As of yesterday, we have seen the following problem. We have made no changes in the forwarding settings. Now, inconsistently, they sometimes resolve and sometimes they do not. When we load the vanity URL with Chrome Dev Tools (Network Pane) open, it shows the following redirect chains, where xxxxx represents a random 5 character string of lower and upper case letters. (e.g. VGuTD) EXAMPLE:
www.vanity.com (302, Found) -->
www.vanity.com/xxxxx (302, Found) -->
www.vanity.com/xxxxx (302, Found) -->
www.vanity.com/xxxxx/xxxxx (302, Found) -->
www.mydomain.com/sub-page/xxxxx (404, Not Found) This is just one example, the amount of redirects, vary wildly. Sometimes there is only 1 redirect, sometimes there are as many as 5. Sometimes the request will ultimately resolve on the correct mydomain.com/sub-page, but usually it does not (as in the example above). We have cross-checked across every browser, device, private/non-private, cookies cleared, on and off of our network etc... This leads us to believe that it is not at the device or host level. Our Registrar is Godaddy. They have not encountered this issue before, and have no idea what this 5 character string is from. I tend to believe them because per our analytics, we have determined that this problem only started yesterday. Our primary question is, has anybody else encountered this problem either in the last couple days, or at any time in the past? We have come up with a solution that works to alleviate the problem, but to implement it across hundreds of vanity domains will take us an inordinate amount of time. Really hoping to fix the cause of the problem instead of just treating the symptom.0 -
Why do people put xml sitemaps in subfolders? Why not just the root? What's the best solution?
Just read this: "The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://example.com/catalog/sitemap.xml can include any URLs starting with http://example.com/catalog/ but can not include URLs starting with http://example.com/images/." here: http://www.sitemaps.org/protocol.html#location Yet surely it's better to put the sitemaps at the root so you have:
Intermediate & Advanced SEO | | McTaggart
(a) http://example.com/sitemap.xml
http://example.com/sitemap-chocolatecakes.xml
http://example.com/sitemap-spongecakes.xml
and so on... OR this kind of approach -
(b) http://example/com/sitemap.xml
http://example.com/sitemap/chocolatecakes.xml and
http://example.com/sitemap/spongecakes.xml I would tend towards (a) rather than (b) - which is the best option? Also, can I keep the structure the same for sitemaps that are subcategories of other sitemaps - for example - for a subcategory of http://example.com/sitemap-chocolatecakes.xml I might create http://example.com/sitemap-chocolatecakes-cherryicing.xml - or should I add a sub folder to turn it into http://example.com/sitemap-chocolatecakes/cherryicing.xml Look forward to reading your comments - Luke0 -
Need a layman's definition/analogy of the difference between schema and structured data
I'm currently writing a blog post about schema. However I want to set the record straight that schema is not exactly the same as structured data, although both are often used interchangeably. I understand this schema.org is a vocabulary of global identifiers for properties and things. Structured data is what Google officially stated as "a standard way to annotate your content so machines can understand it..." Does anybody know of a good analogy to compare the two? Thanks!
Intermediate & Advanced SEO | | RosemaryB0 -
Do 404s really 'lose' link juice?
It doesn't make sense to me that a 404 causes a loss in link juice, although that is what I've read. What if you have a page that is legitimate -- think of a merchant oriented page where you sell an item for a given merchant --, and then the merchant closes his doors. It makes little sense 5 years later to still have their merchant page so why would removing them from your site in any way hurt your site? I could redirect forever but that makes little sense. What makes sense to me is keeping the page for a while with an explanation and options for 'similar' products, and then eventually putting in a 404. I would think the eventual dropping out of the index actually REDUCES the overall link juice (ie less pages), so there is no harm in using a 404 in this way. It also is a way to avoid the site just getting bigger and bigger and having more and more 'bad' user experiences over time. Am I looking at it wrong? ps I've included this in 'link building' because it is related in a sense -- link 'paring'.
Intermediate & Advanced SEO | | friendoffood0 -
What's the deal with significantLinks?
http://schema.org/significantLink Schema.org has a definition for "non-navigation links that are clicked on the most." Presumably this means something like the big green buttons on Moz's homepage. But does anyone know how they affect anything? In http://moz.com/blog/schemaorg-a-new-approach-to-structured-data-for-seo#comment-142936, Jeremy Nelson says " It's quite possible that significant links will pass anchor text as well if a previous link to the page was set in navigation, effictively making obselete the first-link-counts rule, and I am interested in putting that to test." This is a pretty obscure comment but it's one of the only results I could find on the subject. Is this BS? I can't even make out what all of it is saying. So what's the deal with significantLinks and how can we use them to SEO?
Intermediate & Advanced SEO | | NerdsOnCall0 -
Two Pages with the Same Name Different URL's
I was hoping someone could give me some insight into a perplexing issue that I am having with my website. I run an 20K product ecommerce website and I am finding it necessary to have two pages for my content: 1 for content category pages about wigets one for shop pages for wigets 1st page would be .com/shop/wiget/ 2nd page would be .com/content/wiget/ The 1st page would be a catalogue of all the products with filters for the customer to narrow down wigets. So ultimately the URL for the shop page could look like this when the customer filters down... .com/shop/wiget/color/shape/ The second page would be content all about the Wigets. This would be types of wigets colors of wigets, how wigets are used, links to articles about wigets etc. Here are my questions. 1. Is it bad to have two pages about wigets on the site, one for shopping and one for information. The issue here is when I combine my content wiget with my shop wiget page, no one buys anything. But I want to be able to provide Google the best experience for rankings. What is the best approach for Google and the customer? 2. Should I rel canonical all of my .com/shop/wiget/ + .com/wiget/color/ etc. pages to the .com/content/wiget/ page? Or, Should I be canonicalizing all of my .com/shop/wiget/color/etc pages to .com/shop/wiget/ page? 3. Ranking issues. As it is right now, I rank #1 for wiget color. This page on my site would be .com/shop/wiget/color/ . If I rel canonicalize all of my pages to .com/content/wiget/ I am going to loose my rankings because all of my shop/wiget/xxx/xxx/ pages will then point to .com/content/wiget/ page. I am just finding with these massive ecommerce sites that there is WAY to much potential for duplicate content, not enough room to allow Google the ability to rank long tail phrases all the while making it completely complicated to offer people pages that promote buying. As I said before, when I combine my content + shop pages together into one page, my sales hit the floor (like 0 - 15 dollars a day), when i just make a shop page my sales are like (1k+ a day). But I have noticed that ever since Penguin and Panda my rankings have fallen from #1 across the board to #15 and lower for a lot of my phrase with the exception of the one mentioned above. This is why I want to make an information page about wigets and a shop page for people to buy wigets. Please advise if you would. Thanks so much for any insight you can give me!
Intermediate & Advanced SEO | | SKP0 -
How to prevent 404's from a job board ?
I have a new client with a job listing board on their site. I am getting a bunch of 404 errors as they delete the filled jobs. Question: Should we leave the the jobs pages up for extra content and entry points to the site and put a notice like this job has been filled, please search our other job listings ? Or should I no index - no follow these pages ? Or any other suggestions - it is an employment agency site. Overall what would be the best practice going forward - we are looking at probably 20 jobs / pages per month.
Intermediate & Advanced SEO | | jlane90 -
Soft 404's from pages blocked by robots.txt -- cause for concern?
We're seeing soft 404 errors appear in our google webmaster tools section on pages that are blocked by robots.txt (our search result pages). Should we be concerned? Is there anything we can do about this?
Intermediate & Advanced SEO | | nicole.healthline4