Regular Expressions for Filtering BOT Traffic?
-
I've set up a filter to remove bot traffic from Analytics. I relied on a regular expression posted in an article, and it eliminates what appears to be most of them.
However, there are other bots I would like to filter but I'm having a hard time determining the regular expressions for them.
How do I determine what the regular expression is for additional bots so I can apply them to the filter?
I read an Analytics "how to" but it's over my head and I'm hoping for some "dumbed down" guidance.
-
No problem, feel free to reach out if you have any other RegEx related questions.
Regards,
Chris
-
I will definitely do that for Rackspace bots, Chris.
Thank you for taking the time to walk me through this and tweak my filter.
I'll give the site you posted a visit.
-
If you copy and paste my RegEx, it will filter out the rackspace bots. If you want to learn more about Regular Expressions, here is a site that explains them very well, though it may not be quite kindergarten speak.
-
Crap.
Well, I guess the vernacular is what I need to know.
Knowing what to put where is the trick, isn't it? Is there a dummies guide somewhere that spells this out in kindergarten speak?
I could really see myself botching this filtering business.
-
Not unless there's a . after the word servers in the name. The \ is escaping the . at the end of stumbleupon inc.
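To see why the escape matters, here is a minimal sketch using Python's `re` module as a stand-in for the Analytics filter engine (an assumption; the two engines may differ in details). The test string "stumbleupon incx" is made up for illustration.

```python
import re

# Unescaped: "." is a wildcard that matches any single character.
loose = re.compile(r"^stumbleupon inc.$")
# Escaped: "\." matches only a literal period.
strict = re.compile(r"^stumbleupon inc\.$")

for name in ["stumbleupon inc.", "stumbleupon incx"]:
    print(name, bool(loose.search(name)), bool(strict.search(name)))
```

Both patterns match the real name, but only the unescaped one also matches the made-up name, because its "." accepts the "x". A name with no period in it, like rackspace cloud servers, has nothing to escape.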
-
Does it need the \. before the )?
-
Ok, try this:
^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.|rackspace cloud servers)$|gomez
Just added rackspace as another match, it should work if the name is exactly right.
Hope this helps,
Chris
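A quick way to check an expression like this before pasting it into a filter is to run it against a few provider names. This is only a sketch: it uses Python's `re` module rather than Analytics itself, writes the periods escaped as discussed in the thread, and the helper name `is_filtered` and every test name other than "rackspace cloud servers" are made up for illustration. The comparison here is case-sensitive; whether your filter is depends on its settings.

```python
import re

# The expression from the post, with literal periods escaped.
pattern = re.compile(
    r"^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\."
    r"|stumbleupon inc\.|rackspace cloud servers)$|gomez"
)

def is_filtered(provider: str) -> bool:
    """Return True if the provider name would be caught by the pattern."""
    return pattern.search(provider) is not None

for name in [
    "rackspace cloud servers",      # exact name from the report
    "rackspace cloud servers inc",  # hypothetical near miss
    "gomez networks",               # hypothetical name containing "gomez"
]:
    print(name, is_filtered(name))
```

The names inside ^( ... )$ must match the whole service provider value exactly, which is why the near miss is not caught. The gomez alternative sits outside the anchors, so it matches any name that contains it.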
-
Agreed! That's why I suggest using it in combination with the variables you mentioned above.
-
rackspace cloud servers
Maybe my problem is I'm not looking in the right place.
I'm in audience>technology>network and the column shows "service provider."
-
How is it titled in the ISP report exactly?
-
For example,
Since I implemented the filter four days ago, rackspace cloud servers have visited my site 848 times, visited 1 page each time, spent 0 seconds on the page, and bounced 100% of the time.
What is the regular expression for rackspace?
-
Time on page can be a tricky one because sometimes actual visits can record 00:00:00 due to the way it is measured. I'd recommend using other factors like the ones I mentioned above.
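One way to picture "combine several factors" is a small check that flags a provider only when every signal agrees. This is a rough sketch, not anything Analytics does for you: the `ProviderRow` fields, the `looks_like_bot` helper, the thresholds, and the second row are all hypothetical. Only the first row's numbers come from the thread.

```python
from dataclasses import dataclass

@dataclass
class ProviderRow:
    provider: str
    visits: int
    pages_per_visit: float
    avg_seconds_on_page: float
    bounce_rate: float  # 0.0 to 1.0

def looks_like_bot(row: ProviderRow, min_visits: int = 100) -> bool:
    """Flag a provider only when volume and all behavior signals agree.

    The thresholds are illustrative assumptions, not recommended values.
    """
    signals = [
        row.pages_per_visit <= 1.0,
        row.bounce_rate >= 0.99,
        row.avg_seconds_on_page == 0,  # weak on its own, so never used alone
    ]
    return row.visits >= min_visits and all(signals)

rows = [
    ProviderRow("rackspace cloud servers", 848, 1, 0, 1.0),  # from the thread
    ProviderRow("example isp", 12, 1, 0, 1.0),               # hypothetical
]
for row in rows:
    print(row.provider, looks_like_bot(row))
```

The low-volume row is not flagged even though its time on page is zero, which reflects the caution above about real visits sometimes recording 00:00:00.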
-
"...a combination of operating system, location, and some other factors can do the trick."
Yep, combined with those, look for "Avg. Time on Page = 00:00:00"
-
Ok, can you provide some information on the bots that are getting through that you want to sort out? If they can be filtered by ISP organization like the ones in your current RegEx, you can simply add them to the list: (microsoft corp| ... ... |stumbleupon inc\.|ispnamefromyourbots|ispname2|etc.)$|gomez
Otherwise, you might need to get creative and find another way to isolate them (a combination of operating system, location, and some other factors can do the trick). When adding to the list, make sure to escape special characters like . or / by using a \ before them, or else your RegEx will fail.
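If escaping by hand feels error-prone, the list can be assembled by a short script. This is a sketch under assumptions: the helper names `escape_name` and `build_filter_pattern` are made up, the set of characters treated as special is my own choice, and the output should still be checked in Analytics before you rely on it.

```python
import re

# Characters escaped in this sketch: common regex metacharacters plus "/".
SPECIAL = set(r".^$*+?()[]{}|\/")

def escape_name(name: str) -> str:
    """Put a backslash before each special character in a provider name."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in name)

def build_filter_pattern(exact_names, substrings=("gomez",)) -> str:
    """Build ^(name1|name2|...)$|substring from plain provider names."""
    exact = "|".join(escape_name(n) for n in exact_names)
    loose = "|".join(escape_name(s) for s in substrings)
    return f"^({exact})$|{loose}"

isp_names = [
    "microsoft corp",
    "inktomi corporation",
    "yahoo! inc.",
    "google inc.",
    "stumbleupon inc.",
    "rackspace cloud servers",
]

pattern_text = build_filter_pattern(isp_names)
re.compile(pattern_text)  # raises re.error if the pattern is malformed
print(pattern_text)
```

To filter another provider, add its name to `isp_names` exactly as it appears in the service provider column and paste the printed pattern into the filter.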
-
Sure. Here's the post for filtering the bots.
Here's the RegEx posted: ^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez
-
If you give me an idea of how you are isolating the bots I might be able to help come up with a RegEx for you. What is the RegEx you have in place to sort out the other bots?
Regards,
Chris