What kind of data storage and processing is needed
-
Hi,
So after reading a few posts here I have realised it is a big deal to crawl the web and index all the links.
For that, I appreciate seomoz.org's efforts.
I was wondering what kind of infrastructure they might need to get this done?
cheers,
Vishal
-
Thank you so much, Kate, for the explanation. It is quite helpful for understanding the process better.
-
Hi vishalkhialani!
I thought I would answer your question with some detail that might satisfy your curiosity (although I know more detailed blog posts are in the works).
For Linkscape:
At the heart of our architecture is our own column-oriented data store - much like Vertica, although far more specialized for our use case - particularly in terms of the optimizations around compression and speed.
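To illustrate why a column-oriented layout matters (this is just a toy Python sketch with made-up data, not our actual store): grouping values of the same field together lets each column be compressed on its own, which is where most of the size and speed wins tend to come from.

```python
# Toy illustration only -- not Moz's actual data store.
# A column-oriented layout groups values of the same field together,
# so repetitive columns (domains, status codes) can be compressed
# independently. This prints both sizes so you can compare.
import zlib

# Fake link records: (source domain, target path, HTTP status, anchor id).
rows = [("moz.com", f"/blog/post-{i}", 200, i % 7) for i in range(10000)]

# Row-oriented: serialize each record in full, then compress the whole blob.
row_blob = "\n".join("\t".join(map(str, r)) for r in rows).encode()

# Column-oriented: serialize and compress each field on its own.
columns = list(zip(*rows))
column_blobs = [
    zlib.compress("\n".join(map(str, col)).encode()) for col in columns
]

print("row-oriented compressed:   ", len(zlib.compress(row_blob)), "bytes")
print("column-oriented compressed:", sum(len(b) for b in column_blobs), "bytes")
```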
Each month we crawl between 1 and 2 petabytes of data, strip out the parts we care about (links, page attributes, etc.), compute a link graph of how all those sites link to one another (typically between 40 and 90 billion URLs), and then calculate our metrics using those results. Once we have all of that, we precompute lots of views of the data, which is what gets displayed in Open Site Explorer or retrieved via the Linkscape API. The resulting views of the data total over 12 terabytes (and this is all compressed raw text data - so it is a LOT of information). Making this fast and scalable is certainly a challenge.
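For a sense of what "compute a link graph and then calculate metrics" means, here is a deliberately tiny, hypothetical sketch - nothing like our real distributed pipeline - that builds an adjacency list and runs a PageRank-style iteration over it:

```python
# Toy illustration, not Linkscape's real pipeline: build a tiny link graph
# and run a PageRank-style iteration over it. The real system does this
# over tens of billions of URLs across a cluster of machines.

links = {
    "a.com/": ["b.com/", "c.com/"],
    "b.com/": ["c.com/"],
    "c.com/": ["a.com/"],
}

pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(20):  # fixed number of iterations keeps the example simple
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}\t{score:.3f}")
```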
For the crawling, we operate 10-20 boxes that crawl all the time.
For processing, we spin up between 40-60 instances to create the link graph, metrics and views.
And the API serves the index from S3 (Amazon's cloud storage) with 150-200 instances (but this was only 10 a year ago, so we are seeing a lot of growth). All of this is Linux and C++ (with some Python thrown in here and there).
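As a rough idea of the serving pattern (a hypothetical sketch only - the bucket name, shard layout, and use of boto3 here are assumptions for illustration, not our actual API code):

```python
# Hypothetical sketch of serving precomputed views from S3 -- not the
# actual Linkscape API. Views are stored as compressed shards, and an API
# instance fetches the shard a given URL hashes into.
import hashlib
import zlib

import boto3

BUCKET = "example-linkscape-views"  # invented name, not a real bucket
NUM_SHARDS = 4096

s3 = boto3.client("s3", region_name="us-east-1")

def shard_for(url: str) -> int:
    """Map a URL to one of NUM_SHARDS precomputed view shards."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def fetch_view(url: str) -> bytes:
    """Pull the compressed shard from S3 and return its decompressed contents."""
    key = f"views/shard-{shard_for(url):05d}.z"
    obj = s3.get_object(Bucket=BUCKET, Key=key)
    return zlib.decompress(obj["Body"].read())
```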
For custom crawl:
We use similar crawling algorithms to Linkscape, only we keep the crawls per site and also compute issues (like which pages are duplicates of one another). Then each of those crawls is processed and precomputed to be served quickly and easily within the web app (that is, calculating the aggregates and deltas you see in the overview sections).
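To give a flavour of the duplicate-page idea (again, a toy sketch, not our real issue-processing code): grouping pages whose normalized content hashes to the same value is the simplest version of this.

```python
# Toy sketch of one way to flag duplicate pages within a single site's
# crawl -- not the custom crawl's actual issue detection. Pages whose
# normalized body text hashes to the same value are grouped as duplicates.
import hashlib
from collections import defaultdict

def normalize(body: str) -> str:
    """Collapse whitespace and lowercase so trivial differences don't matter."""
    return " ".join(body.lower().split())

def find_duplicates(pages: dict) -> list:
    """pages maps URL -> body text; returns groups of URLs with identical content."""
    groups = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha1(normalize(body).encode("utf-8")).hexdigest()
        groups[digest].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]

crawl = {
    "example.com/a": "Widgets for sale",
    "example.com/a?ref=nav": "Widgets   for sale",
    "example.com/b": "About our company",
}
print(find_duplicates(crawl))  # [['example.com/a', 'example.com/a?ref=nav']]
```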
We use S3 for archival of all old crawls, Cassandra for some of the details you see in the detailed views, and a lot of the overviews and aggregates are served from the web app's database.
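Purely as an illustration of that storage split (the keyspace and table names below are invented, and the Python is just for consistency with the sketches above), writing per-page details to Cassandra with the DataStax driver looks roughly like this:

```python
# Illustrative sketch of the storage split described above; the keyspace,
# table, and columns are made up and would need to exist already. Per-page
# details go to Cassandra, while raw crawl archives would go to S3.
from cassandra.cluster import Cluster  # DataStax Python driver

cluster = Cluster(["127.0.0.1"])          # assumed local node for the example
session = cluster.connect("crawl_demo")   # hypothetical keyspace

insert = session.prepare(
    "INSERT INTO page_details (site, url, status, issues) VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("example.com", "example.com/a", 200, ["duplicate_content"]))
```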
Most of the code here is Ruby, except for the crawling and issue processing which is C++. All of it runs on Linux.
Hope that helps explain! Definitely let me know if you have more questions though!
Kate -
It is nowhere near that many. I attached an image of when I saw Rand moving the server to the new building. I think this may be the reason why there have been so many issues with the Linkscape crawl recently.
-
@keri and @Ryan
Will ask them. My guess is around a thousand server instances.
-
Good answer from Ryan, but I caution that even then you may not get a direct answer. It might be similar to asking Google just how many servers they have. SEOmoz is fairly open with information, but that may be a bit beyond the scope of what they are willing to answer.
-
A question of this nature would probably be best as your one private question per month. That way you will be sure to receive a direct reply from an SEOmoz staff member. You could also try the help desk, but it may be a stretch.
All I can say is that it takes tremendous amounts of resources. Google does it very well, but we all know they generate over $30 billion in revenue annually.
There are numerous crawl programs available, but the problem is the server hardware needed to run them.
I am only responding because I think your question may otherwise go unanswered and I wanted to point you in a direction where you can receive some info.