What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that, I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much, Kate, for the explanation. It is quite helpful for better understanding the process.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case - particularly in terms of the optimizations around compression and speed.
Each month we crawl between 1-2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40-90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. These resulting views of the data total over 12 terabytes (and this is all raw-text compressed data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
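To make the "compute a link graph, then calculate metrics" step concrete, here is a toy sketch of my own (not SEOmoz's actual code, and their metrics are far more involved): build an adjacency list from extracted (source, target) link pairs, then run a simple PageRank-style iteration over it.

```python
from collections import defaultdict


def build_link_graph(edges):
    """Build an adjacency list from (source_url, target_url) pairs."""
    graph = defaultdict(set)
    nodes = set()
    for src, dst in edges:
        graph[src].add(dst)
        nodes.update((src, dst))
    return graph, nodes


def pagerank(graph, nodes, damping=0.85, iterations=20):
    """Compute a PageRank-style score per URL by power iteration.

    Dangling nodes (pages with no outlinks) are ignored here for
    simplicity; a real implementation redistributes their rank mass.
    """
    n = len(nodes)
    rank = {url: 1.0 / n for url in nodes}
    for _ in range(iterations):
        new_rank = {url: (1.0 - damping) / n for url in nodes}
        for src, targets in graph.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

On a tiny cycle a → b → c → a every page ends up with an equal score; at web scale the same idea has to be sharded across many machines, which is part of why the 40-60 processing instances mentioned above are needed.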
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
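The reason serving from S3 can stay fast is that the views are precomputed: the API only ever has to fetch and decompress a ready-made blob, never run the metrics on demand. A minimal sketch of that pattern (a plain dict stands in for S3 here, and the view shape is made up for illustration):

```python
import json
import zlib

# Toy stand-in for S3: key -> compressed bytes.
_store = {}


def put_view(url, view):
    """Precompute step: serialize and compress a view, keyed by URL."""
    _store[url] = zlib.compress(json.dumps(view).encode("utf-8"))


def get_view(url):
    """API step: fetch and decompress the precomputed view on request."""
    blob = _store.get(url)
    if blob is None:
        return None
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

The expensive work all happens at index-build time; request time is a single key lookup, which is what lets a fleet of stateless API instances scale horizontally.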
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site, and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (so calculating the aggregates and deltas you see in the overview sections).
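One simple way duplicate pages can be grouped (a sketch of the general technique, not necessarily what their C++ issue processing does) is to normalize each page's content and bucket URLs by a content hash:

```python
import hashlib
import re
from collections import defaultdict


def normalize(text):
    """Crude normalization: lowercase and collapse whitespace so that
    trivially different copies of a page hash to the same value."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def group_duplicates(pages):
    """Return groups of URLs whose normalized content is identical.

    pages: dict mapping url -> page text.
    """
    buckets = defaultdict(list)
    for url, text in pages.items():
        digest = hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()
        buckets[digest].append(url)
    # Only buckets holding more than one URL are duplicate groups.
    return [urls for urls in buckets.values() if len(urls) > 1]
```

Real-world duplicate detection usually goes further (near-duplicate shingling, ignoring boilerplate markup), but exact-hash bucketing is the cheap first pass.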
We use S3 for archival of all old crawls, Cassandra for some of the details you see in detailed views, and a lot of the overviews and aggregates are served from the web app's database.
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate -
It is nowhere near that many. I attached an image from when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
I will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, and I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from SEOmoz staff. You could also try the help desk, but it may be a stretch.
All I can say is it takes tremendous amounts of resources. Google does it very well, but we all know they generate over $30 billion in revenue annually.
There are numerous crawl programs available, but the problem is the server hardware to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.