Site Spider/ Crawler/ Scraper Software
Short of coding up your own web crawler, does anyone know of, or have any experience with, a good piece of software that runs through all the pages on a single domain?
(And potentially on linked domains 1 hop away...)
This could be either server or desktop based.
Useful capabilities would include:
- Scraping (XPath parameters)
- Number of clicks from homepage (site architecture)
- HTTP headers
- Multithreading
- Use of proxies
- Robots.txt compliance option
- CSV output
- Anything else you can think of...
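For anyone weighing the "code your own" route against the tools in this thread, here is a minimal Python sketch of a few of the capabilities listed above: clicks from homepage (breadth-first depth), one HTTP header, a robots.txt check, and CSV output. It uses only the standard library. The names `crawl`, `to_csv`, `LinkParser` and the `fetch` callback are made up for illustration, and it does not cover XPath scraping, multithreading or proxies, so treat it as a rough starting point rather than a finished crawler.

```python
import csv
import io
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, robots_txt="", user_agent="*", max_pages=100):
    """Breadth-first crawl restricted to the start URL's domain.

    fetch(url) is a caller-supplied (hypothetical) function that returns
    (status, headers_dict, html). Because the crawl is breadth-first, the
    depth recorded for each page is its minimum number of clicks from the
    start page.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    domain = urlparse(start_url).netloc

    seen = {start_url}
    queue = deque([(start_url, 0)])
    rows = []
    while queue and len(rows) < max_pages:
        url, depth = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # skip pages that robots.txt disallows
        status, headers, html = fetch(url)
        rows.append((url, depth, status, headers.get("Content-Type", "")))

        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return rows


def to_csv(rows):
    """Serialises the crawl rows to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "clicks_from_home", "status", "content_type"])
    writer.writerows(rows)
    return out.getvalue()
```

To run it against a live site, `fetch` could wrap `urllib.request.urlopen` and `robots_txt` could be the text of the site's /robots.txt. Passing the fetcher in keeps the crawl logic testable without touching the network.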
Perhaps an opportunity for an additional SEOmoz tool here, since they do this already!
Cheers!
Note:
I've had a look at:
- Nutch: http://nutch.apache.org/
- Heritrix: https://webarchive.jira.com/wiki/display/Heritrix/Heritrix
- Scrapy: http://doc.scrapy.org/en/latest/intro/overview.html
- Mozenda (does scraping but doesn't appear extensible)
Any experience/ preferences with these or others?
-
Hey Alex,
Screaming Frog is hands down the best desktop crawling software and it has most of what you are looking for.
-Mike