Regular Expressions for Filtering BOT Traffic?
-
I've set up a filter to remove bot traffic from Analytics. I relied on a regular expression posted in an article, which eliminates what appears to be most of the bots.
However, there are other bots I would like to filter but I'm having a hard time determining the regular expressions for them.
How do I determine what the regular expression is for additional bots so I can apply them to the filter?
I read an Analytics "how to" but it's over my head, and I'm hoping for some "dumbed down" guidance.
-
No problem, feel free to reach out if you have any other RegEx related questions.
Regards,
Chris
-
I will definitely do that for Rackspace bots, Chris.
Thank you for taking the time to walk me through this and tweak my filter.
I'll give the site you posted a visit.
-
If you copy and paste my RegEx, it will filter out the rackspace bots. If you want to learn more about Regular Expressions, here is a site that explains them very well, though it may not be quite kindergarten speak.
-
Crap.
Well, I guess the vernacular is what I need to know.
Knowing what to put where is the trick, isn't it? Is there a dummies guide somewhere that spells this out in kindergarten speak?
I could really see myself botching this filtering business.
-
Not unless there's a . after the word servers in the name. The \. is escaping the . at the end of stumbleupon inc.
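To make the escaping point concrete, here is a minimal sketch in Python. It assumes Python's `re` module treats the period the same way the Analytics filter does, which holds for this basic case but is not guaranteed for every regex feature. The test name "stumbleupon incx" is made up purely for illustration.

```python
import re

# Unescaped: "." is a wildcard that matches any single character.
loose = re.compile(r"^stumbleupon inc.$")

# Escaped: "\." matches only a literal period.
strict = re.compile(r"^stumbleupon inc\.$")


def matches(pattern, name):
    """Return True if the pattern matches the service provider name."""
    return pattern.search(name) is not None


# Both patterns accept the real name.
print(matches(loose, "stumbleupon inc."))
print(matches(strict, "stumbleupon inc."))

# Only the unescaped pattern accepts a name ending in some other character.
print(matches(loose, "stumbleupon incx"))
print(matches(strict, "stumbleupon incx"))
```

So a name with no period in it, like rackspace cloud servers, needs no backslash at all.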
-
Does it need the \. before the )?
-
Ok, try this:
^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.|rackspace cloud servers)$|gomez
Just added rackspace as another match, it should work if the name is exactly right.
Hope this helps,
Chris
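Before pasting an expression like this into a filter, it can help to try it against a few service provider names. The following is a rough sketch in Python, written with the periods escaped; it assumes the filter compares the pattern to the lowercase name exactly as the report shows it. The function name `is_filtered` and the sample names other than rackspace cloud servers are hypothetical.

```python
import re

# The expression from the post above, with literal periods escaped.
# "^(...)$" requires an exact match on one of the listed names;
# "|gomez" additionally matches any name that contains "gomez".
BOT_ISP_PATTERN = re.compile(
    r"^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\.|"
    r"stumbleupon inc\.|rackspace cloud servers)$|gomez"
)


def is_filtered(service_provider: str) -> bool:
    """Return True if the filter would catch this service provider name."""
    return BOT_ISP_PATTERN.search(service_provider) is not None


# Sample names for illustration only.
for name in [
    "rackspace cloud servers",
    "rackspace cloud servers inc",
    "gomez networks",
    "comcast cable",
]:
    print(name, "->", is_filtered(name))
```

The second sample is not caught, because the name has to match exactly between `^` and `$`. That is why the exact title in the report matters.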
-
Agreed! That's why I suggest using it in combination with the variables you mentioned above.
-
rackspace cloud servers
Maybe my problem is I'm not looking in the right place.
I'm in audience>technology>network and the column shows "service provider."
-
How is it titled in the ISP report exactly?
-
For example,
Since I implemented the filter four days ago, rackspace cloud servers have visited my site 848 times, visited 1 page each time, spent 0 seconds on the page, and bounced 100% of the time.
What is the reg expression for rackspace?
-
Time on page can be a tricky one because sometimes actual visits can record 00:00:00 due to the way it is measured. I'd recommend using other factors like the ones I mentioned above.
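One way to act on that caution is to require several signals to agree before treating a provider as a bot. Here is a minimal sketch in Python, assuming the report rows have been exported by hand; the `ProviderRow` type, the `looks_like_bot` function, the `min_visits` threshold, and the "example isp" row are all hypothetical. Only the rackspace numbers come from this thread.

```python
from dataclasses import dataclass


@dataclass
class ProviderRow:
    service_provider: str
    visits: int
    pages_per_visit: float
    avg_time_on_page_seconds: float
    bounce_rate: float  # 0.0 to 1.0


def looks_like_bot(row: ProviderRow, min_visits: int = 100) -> bool:
    """Flag a provider only when every signal agrees and volume is high.

    A single real visit can record 00:00:00, so zero time on page is
    never used on its own. The min_visits threshold is an arbitrary
    assumption, not a recommended value.
    """
    signals = [
        row.pages_per_visit <= 1.0,
        row.avg_time_on_page_seconds == 0,
        row.bounce_rate >= 0.99,
    ]
    return row.visits >= min_visits and all(signals)


# Numbers reported in this thread.
rackspace = ProviderRow("rackspace cloud servers", 848, 1.0, 0, 1.0)
# Made-up row: a few real one-page visits that also recorded zero time.
small_isp = ProviderRow("example isp", 3, 1.0, 0, 1.0)

print(looks_like_bot(rackspace))
print(looks_like_bot(small_isp))
```

Anything flagged this way is a candidate to review by hand, not proof of a bot.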
-
"...a combination of operating system, location, and some other factors can do the trick."
Yep, combined with those, look for "Avg. Time on Page = 00:00:00"
-
Ok, can you provide some information on the bots that are getting through this that you want to sort out? If they are able to be filtered through the ISP organization as the ones in your current RegEx, you can simply add them to the list: (microsoft corp| ... ... |stumbleupon inc.|ispnamefromyourbots|ispname2|etc.)$|gomez
Otherwise, you might need to get creative and find another way to isolate them (a combination of operating system, location, and some other factors can do the trick). When adding to the list, make sure to escape special characters like . or / by using a \ before them, or else your RegEx will fail.
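If the list keeps growing, the expression can be generated instead of typed, so that no special character is left unescaped. This is a small sketch in Python under stated assumptions: the helper names `escape_name` and `build_filter` are made up, and the set of characters to escape is my own conservative choice, not an official list from Analytics.

```python
import re

# Characters given a leading backslash. This set is an assumption chosen
# to cover the common regex metacharacters plus "/".
SPECIAL_CHARS = set(".^$*+?()[]{}|\\/")


def escape_name(name: str) -> str:
    """Put a backslash before each special character in an ISP name."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in name)


def build_filter(exact_names, substrings=()):
    """Build ^(name1|name2|...)$ plus optional 'contains' alternatives."""
    exact = "|".join(escape_name(n) for n in exact_names)
    parts = [f"^({exact})$"] + [escape_name(s) for s in substrings]
    return "|".join(parts)


pattern = build_filter(
    ["microsoft corp", "google inc.", "rackspace cloud servers"],
    substrings=["gomez"],
)
print(pattern)
# prints: ^(microsoft corp|google inc\.|rackspace cloud servers)$|gomez
```

The printed expression is what would be pasted into the filter. Names go in exactly as they appear in the ISP report.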
-
Sure. Here's the post for filtering the bots.
Here's the regex posted: ^(microsoft corp|inktomi corporation|yahoo! inc\.|google inc\.|stumbleupon inc\.)$|gomez
-
If you give me an idea of how you are isolating the bots I might be able to help come up with a RegEx for you. What is the RegEx you have in place to sort out the other bots?
Regards,
Chris