Good alternatives to Xenu's Link Sleuth and AuditMyPc.com Sitemap Generator
-
I am working on scraping title tags from websites with 1-5 million pages. Xenu's Link Sleuth seems to be the best option for this, at this point. Sitemap Generator from AuditMyPc.com seems to be working too, but it starts handing up, when a sitemap file, the tools is working on,becomes too large. So basically, the second one looks like it wont be good for websites of this size. I know that Scrapebox can scrape title tags from list of url, but this is not needed, since this comes with both of the above mentioned tools.
I know about DeepCrawl.com also, but this one is paid, and it would be very expensive with this amount of pages and websites too (5 million ulrs is $1750 per month, I could get a better deal on multiple websites, but this obvioulsy does not make sense to me, it needs to be free, more or less). Seo Spider from Screaming Frog is not good for large websites.
So, in general, what is the best way to work on something like this, also time efficient. Are there any other options for this?
Thanks.
-
import.io and it's free
-
Another idea that I have here, is to look for sitemaps of these websites. There may be a way to get a list of all the urls, right away, without crawling. Look at /robots.txt, /sitemap.xml, search for sitemap in Google, things like that. If there is urls, title tags can be scraped with Scrapebox, and as far as their website is saying, it can be done relatively fast.
# # Edit:
I had somebody suggesting http://inspyder.com, around $40 and free trial. May be a good option too.
-
So there is probably no way to tell, whether I have all the urls of the site, or what percentage I have... I may have 80 or even less percent of the total site, and not know about it, I would assume. This is one of the parts of working on the sites (I've never needed it, but I am working on something like this now), and there is no good tool, which would do the work.
I have a website with 33,500,000 pages. I've been running the tool for close to 5 hours, and I have around 125,000 urls, so far. This means, that it would take 1340 hours to do the entire site. This is close to two months of running the program 24 hours a day, which does not make sense. And besides that I was planning to do it on up to 100 sites. Definitely not something that can be done, and I would say that it should be possible, software-wise.
I will try your method, and see what I will get. I dont have too much time for experimenting with it too. I need to work, and generate results...
# # Edit
I will now how the number of urls compares to the 33,500,000 figure, obviously, but whats indexed in Google is not necessarily the complete website too. The method that you are suggesting is not perfect, but I dont have two months to wait too, obviously...
-
You will crawl some of the same URLs - that's why you remove duplicates at the end. There's no way to keep it from re-crawling some of the URLs, as far as I know.
But yes, get it to recognize 600-800k URLs and then split the file. (Export, put the links in as an html file and start over.) Let me break it down the best I can:
-
Crawl your main (seed) URL until you've recognized 800k.
-
Pause/stop and then export the results.
-
Create an html file with the URLs from the export - separated 50k to 100k at a time.
-
Recrawl those files in Xenu with the "file" option.
-
Build them back up to 800k or so recognized URLs again and repeat.
After a few (4-6) iterations of this, you'll have most URLs crawled on most sites no matter how large. Doing it this way, I think you could expect to crawl about 2-3 million URLs a day. If you really paid attention to it and created smaller files but ran them more frequently, you could get 4-5 million, I think. I've crawled close to that in a day for a scrape once.
-
-
Thanks. It is good to hear, that there is a way to do, of what I am trying to do, especially on 50 or more sites, large.
I've been running Xenu on a 33,500,000 pages site for a little over 4 hours and 15 minutes, and I have something like this, so far:
Close to 500,000 urls recognized, and only 115,000 processed, it looks like. I am manually saving it to a file, every now and then, as there is no way to auto save, as far as I was checking (there could be though, I am not sure, there is no too many options there).
I am not sure, based on your advice, how I could speed it up this process. Should I wait from this point, then stop the program, and divide the file into 8 separate files, and load it to the program separately? Then the program will recognize these separate files as one, and it will continue crawling for new urls? If possible, please give better information on how this would need to be done, as I dont fully understand. I also dont see how this could do this large website in one day, or lets say even five days...
# # Edit:
I actually got to understanding what you mean, get 8 separate files (can be 6 or, lets say 10) and run them all at the same time. But still, how will the program know not to crawl and download the same urls, on all the files? In general, I would like to ask for better explanation, on how this needs to be done.
Thanks.
-
Let Xenu crawl until you have about 800k links. Then export the file and add it back as 8 x 100k lists of URLs. You can then run it again and repeat the process. By the time you have split it 4-5 times, you can then export everything, put it into one file and remove duplicates.
Xenu, done this way, with 100 threads, is probably the fastest way to do the whole thing. I think you could get the 5M results in under 1 day of work this way.
-
Ok. So it looks like Screaming Frog may be a good way to go too, if not better. Xenu is free, which is a big plus. On the top of that Creaming Frog's Seo Spider is based on a yearly subscription, and not a one time fee. For those who dont know, there is a version of Xenu for large sites, which can be found on their website. They also have a support group at groups.yahoo.com (find it through there), I am not sure if it is still active.
Xenu upgraded to the version for larger sites may be the best way to go, since it is free. I've been testing AuditMyPc.com Sitemap Creator and the better version of Xenu, and the first one already hanged up (I discontinued using it). They were both collecting the info at about the same speed, but Xenu is working better (does not hang up, looks like it should be good). Either way, this will take quite a lot of time, with it, as previously mentioned.
-
I agree with Moosa and Danny - in terms of I use Screaming Frog (full paid version) on a stripped down windows machine with an SSD and 16GB of performance RAM. I have also download the 64 bit version of Java and increased the memory allocation for Screaming Frog to 12GB (default limit is 512mb) - here's how - http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ (look at the section Increasing Memory on Windows 32 & 64-bit)
I did this as I was having issues crawling a large site - after I put this system in place it eats any site I have thrown at it so far so it works well for me personally. In terms of speed of crawl large sites such as you mention will still take a while - you can set crawl speed in Screaming Frog, but you need to be careful as you can overload the server of the site you are crawling and cause issues...
Another option would be to buy a server and configure it for Screaming Frog and other tools you may use - this gives you options to grow the system as your needs grow. It all depends on budget and how often you crawl large sites - obviously buying a server such as a windows instance on Amazon EC2 will cost more in the long run but it takes the strain away from your own systems and networks plus you should effectively never hit capacity on the server as you can just upgrade. It will also allow you to remote desktop in on whatever system you use - yes even a Mac
Hope this helps
-
I believe when you are talking about 1 to 5 million URLs it is going to take time no matter what tool you use but if you ask me screaming frog is a better tool and if you have a paid version of it you still can crawl websites with few million URLs in it.
Xenu is not a bad choice either but it’s kind of confusing and there is a possibility that it can broke.
Hope this helps!
-
I was facing similar issue with huge sites, that have over 100s of thousands of pages. But ever since I upgraded my computer with RAM and SSD it run way better on huge sites as well. I tried several scrappers and I still believe Xenu is the best one and most recommended by SEO experts. Also you might want to check this post on Moz Blog about Xenu's
http://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finderGood luck!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
URL Structure On Site - Currently it's domain/product-name NOT domain/category/product name is this bad?
I have a eCommerce site and the site structure is domain/product-name rather than domain/product-category/product-name Do you think this will have a negative impact SEO Wise? I have seen that some of my individual product pages do get better rankings than my categories.
Technical SEO | | the-gate-films0 -
Yoast's Magento Guide "Nofollowing unnecessary link" is that really a good idea?
I have been following Yoast's Magento guide here: https://yoast.com/articles/magento-seo/ Under section 3.2 it says: Nofollowing unnecessary links Another easy step to increase your Magento SEO is to stop linking to your login, checkout, wishlist, and all other non-content pages. The same goes for your RSS feeds, layered navigation, add to wishlist, add to compare etc. I always thought that nofollowing internal links is a bad idea as it just throwing link juice out the window. Why would Yoast recommend to do this? To me they are suggesting link sculpting via nofollowing but that has not worked since 2009!
Technical SEO | | PaddyDisplays0 -
3,638 mystery inbound links from google.com
Hello, can anybody help me. I my webmaster tools, under "Links to your Site" and "Who links the most" It shows Google as the top inbound link source, with 3,669 out of 4,14, all these links are pointing to one sub page of content. Yet when I click on the source it only shows me one Google Plus page. My traffic hell by 60% when I first noticed these links. we have not been building unnatural links and we have not had any Manual link building warning. This is the site: iunlock.org Two questions:- I have deleted this one linked to page what else can I do to reverse this impact Does anybody know what has happened? See attached WWwnb1W.png
Technical SEO | | Jack4ireland0 -
Duplicate Content - What's the best bad idea?
Hi all, I have 1000s of products where the product description is very technical and extremely hard to rewrite or create an unique one. I'll probably will have to use the contend provided by the brands, which can already be found in dozens of other sites. My options are: Use the Google on/off tags "don't index
Technical SEO | | Carlos-R
" Put the content in an image Are there any other options? We'd always write our own unique copy to go with the technical bit. Cheers0 -
Will syndicated content hurt a website's ranking potential?
I work with a number of independent insurance agencies across the United States. All of these agencies have setup their websites through one preferred insurance provider. The websites are customizable to a point, but the content for the entire website is mostly the same. Therefore, literally hundreds of agency sites have essentially the same content. The only thing that changes is a few "wildcards" in the copy where the agency fills in their city, state, services areas, company history, etc. My questions is: will this syndicated content hurt their ranking potential? I've been toying with the idea of further editing the content to make it more unique to an agency, but I would hate to waste a lot of hours doing this if it won't help anything. Would you expect this approach to be beneficial or a waste of time? Thank you for your help!
Technical SEO | | copyjack0 -
What's the best URL Structure if my company is in multiple locations or cities?
I have read numerous intelligent, well informed responses to this question but have yet to hear a definitive answer from an authority. Here's the situation. Let's say I have a company who's URL is www.awesomecompany.com who provides one service called 'Awesome Service' This company has 20 franchises in the 20 largest US cities. They want a uniform online presence, meaning they want their design to remain consistent across all 20 domains. My question is this; what's the best domain or url structure for these 20 sites? Subdomain - dallas.awesomecompany.co Unique URL - www.dallasawesomecompany.com Directory - www.awesomecompany.com/dallas/ Here's my thoughts on this question but I'm really hoping someone b*tch slaps me and tells me I'm wrong: Of these three potential solutions these are how I would rank them and why: Subdomains Pros: Allows me to build an entire site so if my local site grows to 50+ pages, it's still easy to navigate Allows me to brand root domain and leverage brand trust of root domain (let's say the franchise is starbucks.com for instance) Cons: This subdomain is basically a brand new url in google's eyes and any link building will not benefit root domain. Directory Pros Fully leverages the root domain branding and fully allows for further branding If the domain is an authority site, ranking for sub pages will be achieved much quicker Cons While this is a great solution if you just want a simple map listing and contact info page for each of your 20 locations, what if each location want's their own "about us" page and their own "Awesome Service" page optimized for their respective City (i.e. Awesome Service in Dallas)? The Navigation and potentially the URL is going to start to get really confusing and cumbersome for the end user. Think about it, which is preferable?: dallas.awesomcompany.com/awesome-service/ www.awesomecompany.com/dallas/awesome-service (especially when www.awesomecompany.com/awesome-service/ already exists Unique URL Pros Potentially quicker rankings achieved than a subdomain if it's an exact match domain name (i.e. dallasawesomeservice.com) Cons Does not leverage the www.awesomecompany.com brand Could look like an imposter It is literally a brand new domain in Google's eyes so all SEO efforts would start from scratch Obviously what goes without saying is that all of these domains would need to have unique content on them to avoid duplicate content penalties. I'm very curious to hear what you all have to say.
Technical SEO | | BrianJGomez0 -
Rel cannonical on all my URL's
Hi, sorry if this question has already been asked, but I can't seem to find the correct answer. In my crawling report for the domain: http://www.wellbo.de I get rel cannonical notices. I have redirected all pages of http://wellbo.de to http://www.wellbo.de with a 301 redirect. Where is my error? Why do I get these notices? I hope the image helps. Ep7Rw.jpg
Technical SEO | | wellbo0 -
Best Dynamic Sitemap Generator
Hello Mozers, Could you please share the best Dynamic Sitemap Generator you are using. I have found this place: http://www.seotools.kreationstudio.com/xml-sitemap-generator/free_dynamic_xml_sitemap_generator.php Thanks in advanced for your help.
Technical SEO | | SEOPractices0