Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is there a way to filter all computers on a specific IPv6 network in Google Analytics?
Is there a quick way of filtering the IP addresses for all the computers on a network that's using IPv6? I want to filter out visits to our websites from the devices on our office network, but each computer (and phone and tablet) seems to have a different address. It _looks _like they all start the same way, though. One computer is xxxx:xxxx:xxxx:xxxx:aaaa:aaaa:aaaa:aaaa, another is xxxx:xxxx:xxxx:xxxx:bbbb:bbbb:bbbb:bbbb, my phone is xxxx:xxxx:xxxx:xxxx:cccc:cccc:cccc:cccc, etc. Does this mean that xxxx:xxxx:xxxx:xxxx is the address for our network as a whole, and I can just set up a Google Analytics filter for "IP addresses starting with..."? Or would doing that also filter out hits from, like, every visitor within a 20 mile radius of our office? If I need to simply put in the individual addresses for each and every device, I will. I'm just hoping it doesn't come to that. Thanks!
Reporting & Analytics | | BrianAlpert780 -
Query on google analytic benchmarking report
Hi All, First I select My Industry Vertical - ABCD then I select Region - US ( all region) then size by daily session - 10000-99999 finally reports appears "Blank" but when I decrease daily session from 10000-99999 to 1000-4999 or less then report is perfect. So what does it mean? My Avg daily session is 70k to 80k. So how to analysis benchmarking in this case? Thanks!
Reporting & Analytics | | pragnesh96390 -
404 Status Codes in Google Search Console
Hi all, I've noticed in Google Search Console under 'Crawl errors' - 1. Why does the status code '410' come up as an 'error' in the crawl report? 2. Why are some articles labelled as '404' error when they have been completely deleted and should be a '410' - there are roughly around 1000-2000 of these. Thanks!
Reporting & Analytics | | lucwiesman0 -
In Google Search Console: Total Clicks differ from the sum clicks on search queries
Hi.This may be a google technical question, but I've searched a lot and I couldn't find any certain information about that.The problem appears when you look at two stats in Search Console ( WebMasters Tools, some months ago), particulary in the Search Analysis.1- Total clicks2- Sum of clicks at every search query.I attached an image to make me clear.8IZsxs6.png
Reporting & Analytics | | NachoRetta2 -
Changes to cookies, August 24 or 25? (organic search not being cookied?)
Hi folks. We've seen a precipitous change in the referral data that we're able to gather from cookies. Specifically: On August 24 or 25, traffic to our site that was cookied as coming from organic search dropped by ~25%, and traffic coming in with no referrer data at all (i.e., it appears to be "direct") rose by roughly the same amount. As far as we can see, we haven't changed anything in our systems that would have caused this (we're not just mis-reading the cookie info), so I'm looking for external reasons. Has anyone else seen this? Or have any ideas why it would happen?
Reporting & Analytics | | RobM4160 -
Best way to generate analytics reports for listing style website
I'm working on a website that includes dedicated pages for ~40 local businesses, and I need to be able to generate and export some basic reports I can send to each business. The data I need for each report is split between general sitewide data: total number of visitors to the site for month. number of visitors to each main category page what country they are from (main countries) - top 5 traffic source / keywords avg time on site As well as specific data for each individual page: how many people viewed specific total pages Time spent on individual page. Would this be possible with custom reports in analytics? I can see the number of different reports being difficult to maintain, especially as site grows. Anyone had expereince on a similar site of ideas on the best way to do this? Thanks
Reporting & Analytics | | zeald0 -
Google Analytics Goal Tracking Head Match w/ Query Strings
Hello, I have what should be a simple question here but there is a small nuisance I am trying to make sure I have configured correctly. We have a product based website w/ no e-commerce because they sell through a dealer network. All these product pages have "Where to Buy" links and the URL after you click where to buy always uses the query string ?r=XXX. Example: www.mysite.com/product/category/subcategory/product-name?r=12345 I want to setup a goal in GA with a URL and configure head match on the "?r" but which of the following is exactly how it should be configured with the "Goal URL" ?r= ?r r= Does it matter, because I had it setup as "?r" and it was never registering any goals. Do I need to leave off the "?" and just have it be r= Thanks in advance for the respones.
Reporting & Analytics | | Bevelwise0 -
Number of Visitor Entries to page via search engine
Hi, I wanted to figure out the most optimal way to track the number of visitors that comes to a specific page on my blog via search engine only. I know Google Analytics has a "top landing page" filter, but that includes all visitors that comes in directly or other channels. Is there a way to figure out how many visitors a certain page received that was generated through only search engine? Does SEOmoz have this capability?
Reporting & Analytics | | kevinyu10290