Ways to analyze a 1M rows dataset of search queries
-
Hi,
I have this large dataset, about 1 million search queries with visits, bounce rate and a few other metrics. I'm trying to explore this data to find keyword "buckets" (such as include product name, location name, transactional objective, informational, etc.), as well as explore the density of certain keywords (keywords as in instances of a single word amongst all queries)
My idea was to use Excel and a macro to split all queries in separate words (also clearing punctuation and uppercase/lowercase), then storing this word in a new worksheet, adding to another column the visit counts from the row where the word was extracted (as to give a sense of weight). Before adding the word to the new worksheet, the script will look if the word already existed, if so it would just add the current value of visits to the existing visit counts etc.
In the end it will create sort of a "dictionary" of all the keywords in all search queries ranked by weight (= visits from search query including this keyword)
This would help me get started I believe, because I can't segment and analyze 1M raw search queries...
My issue is: this VBA has been running on my (fast) PC for the last 24hr and it doesn't seem to get to an end. Obviously excel+VBA is not the best way to do text mining and manipulation in such a large dataset (although it's just a 30mb file)
What would you do if you had this dataset and would like to mine the text/semantic as I am doing? Any idea of tools? process?
I'm considering dumping this data into a MySQL db and doing the processing through PHP (the only backend language I'm versed in), and getting the "summified" data stored into another table, which I'll then be able to export to a Excel for analysis. But I'm afraid that I'll be facing memory limit issues and such...
In the meantime, I'm definitely interested into knowing what you guys would do if you had this data and wanted to simply start exploring its constituencies
Thanks!
-
Yeah, Access can process any number of rows. It's Microsoft's database program. You can upload data, and then create queries. They have a design view where you can construct queries in a WYSIWYG fashion, or if you want, you can write your own SQL.
-
Thanks a lot John!
I'm going to try this out tonight!
So, I assume, Access won't have the same processing limitations with 1 million rows, will it?
Once I'll be done with the "discovery phase" I'm going through with this keyword list, I'll definitely use Advanced filters (in Excel) as you recommend to understand keyword groups in details
-
I had a similar problem going through my search query reports. If you're already familiar with VB you could do this with a Microsoft Access database rather than setting up a MySQL one w/PHP. I've been working on creating an Access database that I can import my data into, and have it spit out all sorts of useful info (for example negative keywords and placements), but it's only in its early stages right now.
If you just want to see it for a few terms and don't mind doing it one at a time, in the past I've filtered data like this in Excel without VB using advanced filters. I found that using advanced filters rather than VB sped up the process quite a bit; I'd imagine because it's an innate Excel function. Using 4 filters you can match whole words in the queries. For example, to find queries with "blah", you'd set a filter for "blah", "* blah", "blah " and " blah *". Then you can use the Subtotal command to do calculations over the visible rows and calculate the data.
More about advanced filters: http://office.microsoft.com/en-us/excel-help/filter-by-using-advanced-criteria-HP005200178.aspx
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can Anyone help me with Google search console soft 404 error?
Hello everyone, I just build one site on WordPress and submitted it to the Google search console along with the sitemap. Some URLs got indexed but one of my URL is showing error in Search console. My post https://hotpass.site/create-subdomain-godaddy/ is showing soft error 404 webpage not available in Google search console. I have rechecked the site and seo settings. Nothing is there. I have even checked it with LIVE URL test. But still Search console is not accepting this URL. Can anyone help me with this? Help will be appriciated.
Reporting & Analytics | | Pauline210 -
What's more accurate? GA queries data or Moz/SEMRush keyword data for rankings
What do you guys think? What's more accurate? GA queries data or Moz/SEMRush keyword data for rankings? Any thoughts appreciated.
Reporting & Analytics | | znotes0 -
How Reliable is the Avg Query Position in GA?
Hi Moz Community, I wanted to know how reliable the average position data is for queries in Google Analytics search console report. I know this report is fairly new this year and the numbers are calculated a bit differently than they were in the old search engine optimization report. I want to know what the biggest differences are between this search console report vs. the old SEO report in GA. I'm also pretty confused about how GA reports on the average position. Obviously it's an average position of whatever date range your choose. But for instance, if your site shows multiples landing pages for one search query will it roll that into the average or just take the landing page that ranks higher? Does the position average take into account video or photo serp results and is this the average across mobile, desktop and tablet? This number has always been a guess since it's sampled data but I want to know how accurate it is. I read this article in 2014 (linked below) but I'm not sure if it all still applies now that that data might be presented differently. https://moz.com/ugc/testing-the-accuracy-of-avg-position-for-search-queries-in-google-webmaster-tools Any answers or discussions would be great here. Thanks
Reporting & Analytics | | znotes1 -
Does the new Google Analytics Search Console Beta tool use API to pull more data?
So my client has been asking for definitive proof of why the search query data provided on Google Search Console does not exactly match up the data presented directly in the Search Console itself. The simple answer is that the Google Search Console is limited to 1000 rows of data. However our client is requesting a Google article/documentation of why the new Search Console beta tool has no row limit (hence much more data for a big website). I know that the Google Search Console API was available before Google announced the new Search Console Beta tool in Google Analytics. I also know this API could pull in more data than the 1000 row limit. However is there any article available (preferably from Google) that Google Analytics is pulling this Search Console data via API? Thanks!
Reporting & Analytics | | RosemaryB0 -
How to interpret search "clicks" in GA?
I am trying to make some sense of the data in the "Landing Pages" report under "Search Engine Optimization" in GA versus the data under the "Landing Pages" report in "Behavior | Site Content". For example, the SEO report says my page http://www.asiantraveltips.com/blog/bangkok-skytrain-bts-mrt-lines/ received 22,000 search impressions in the past 30 days and 900 "clicks" (12.42%). What are these "clicks" when the Content "Landing Pages" report says the same page only got 382 "sessions" in the same period??? What are these "clicks" if not clicks on the search results link that should be reflected in landing page sessions on the corresponding page?
Reporting & Analytics | | Gavin.Atkinson0 -
No Search Data in Google Search Console (Search Analytics)
Wondering if anyone has experienced 0 data in their Google Search Console (Search Analytics) and found possible solutions to retrieve data? We have a few clients who, prior to the update to Google Search Console, were getting data regularly in terms of the Search Queries report, but ever since the update to Google Search Console, they are no longer receiving data. As an FYI, both the www and non-www versions of the website are verified in Search Console and the XML Sitemaps and Robots.txt files are clean, tested and working fine. Any insights or experience of sites showing 0 data in Search Analytics? Any possible solutions would be greatly appreciated. Thanks!
Reporting & Analytics | | SEO5Team0 -
Difference between site: search and Total Indexed in Google Webmaster Tools.
This morning I did a search on Google for my site using the site: operator. I noticed that the number of results returned was significantly different than the "Total indexed" in Google Webmaster Tools. What is the difference and is it normal to have two very different numbers here?
Reporting & Analytics | | Gordian0 -
Organic search on google
Hi there, pl take a look at this link, there is a section which says shared results and has a star against two agencies and once article. can pl someone let me know what those starts are and how to get them? Thank you 🙂 so sorry - the link is http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=digital+mareketing+agency+los+angeles Vijay
Reporting & Analytics | | vijayvasu0