Good alternatives to Xenu's Link Sleuth and AuditMyPc.com Sitemap Generator
-
I am working on scraping title tags from websites with 1-5 million pages. Xenu's Link Sleuth seems to be the best option for this at this point. Sitemap Generator from AuditMyPc.com seems to work too, but it starts hanging when the sitemap file it is building becomes too large, so it does not look suitable for websites of this size. I know that Scrapebox can scrape title tags from a list of URLs, but that is not needed, since both of the tools mentioned above already collect them.
I know about DeepCrawl.com as well, but it is paid, and it would be very expensive for this number of pages and websites (5 million URLs is $1750 per month; I could get a better deal on multiple websites, but this obviously does not make sense for me, since it needs to be more or less free). SEO Spider from Screaming Frog is not good for large websites.
So, in general, what is the best and most time-efficient way to work on something like this? Are there any other options?
Thanks.
-
import.io and it's free
-
Another idea I have is to look for the sitemaps of these websites. There may be a way to get a list of all the URLs right away, without crawling: look at /robots.txt and /sitemap.xml, search for the sitemap in Google, things like that. If the URLs are there, the title tags can be scraped with Scrapebox, and according to their website it can be done relatively fast.
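For anyone who wants to script the sitemap lookup, here is a minimal sketch in Python. It assumes you have already downloaded the robots.txt and sitemap bodies as text (the download step is left out), and the function names and example.com URLs are made up for illustration. It only reads `Sitemap:` lines and the `<loc>` entries directly under each `<url>` or `<sitemap>` element.

```python
import xml.etree.ElementTree as ET


def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the URLs given on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_txt.splitlines():
        # Split at the first colon only, so the "https:" in the URL survives.
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def locs_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> of every <url> (or <sitemap>) entry in the document."""
    root = ET.fromstring(xml_text)
    locs = []
    for entry in root:          # each <url> or <sitemap> element
        for child in entry:     # direct children only, so nested image locs are skipped
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return locs


# Hypothetical sample data, standing in for files fetched from a real site.
ROBOTS = "User-agent: *\nDisallow: /tmp\nSitemap: https://example.com/sitemap.xml\n"
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

print(sitemaps_from_robots(ROBOTS))
print(locs_from_sitemap(SITEMAP))
```

If the file turns out to be a sitemap index, each `<loc>` points to another sitemap rather than a page, so the same function would have to be run again on each of those files.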
# # Edit:
Somebody suggested http://inspyder.com, which costs around $40 and has a free trial. It may be a good option too.
-
So there is probably no way to tell whether I have all the URLs of the site, or what percentage I have. I could have 80 percent of the site or even less and not know it, I would assume. This is one part of working on sites (I've never needed it before, but I am working on something like this now), and there is no good tool that does the job.
I have a website with 33,500,000 pages. I've been running the tool for close to 5 hours, and I have around 125,000 URLs so far. At that rate it would take 1340 hours to do the entire site. That is close to two months of running the program 24 hours a day, which does not make sense. Besides that, I was planning to do this on up to 100 sites. It is definitely not something that can be done this way, and I would say it should be possible, software-wise.
I will try your method and see what I get. I don't have much time for experimenting with it either. I need to work and generate results...
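The estimate above can be checked with a few lines of Python. This is just the arithmetic from the post, assuming the crawl rate stays constant; the function name is made up.

```python
def hours_to_crawl(total_pages: int, pages_done: int, hours_elapsed: float) -> float:
    """Extrapolate total crawl time from the rate observed so far."""
    pages_per_hour = pages_done / hours_elapsed
    return total_pages / pages_per_hour


hours = hours_to_crawl(33_500_000, 125_000, 5)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.0f} days")  # 1340 hours, about 56 days
```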
# # Edit
I will know how the number of URLs compares to the 33,500,000 figure, obviously, but what is indexed in Google is not necessarily the complete website either. The method you are suggesting is not perfect, but I don't have two months to wait, obviously...
-
You will crawl some of the same URLs - that's why you remove duplicates at the end. There's no way to keep it from re-crawling some of the URLs, as far as I know.
But yes, get it to recognize 600-800k URLs and then split the file. (Export, put the links in as an html file and start over.) Let me break it down the best I can:
-
Crawl your main (seed) URL until you've recognized 800k.
-
Pause/stop and then export the results.
-
Create an html file with the URLs from the export - separated 50k to 100k at a time.
-
Recrawl those files in Xenu with the "file" option.
-
Build them back up to 800k or so recognized URLs again and repeat.
After a few (4-6) iterations of this, you'll have most URLs crawled on most sites no matter how large. Doing it this way, I think you could expect to crawl about 2-3 million URLs a day. If you really paid attention to it and created smaller files but ran them more frequently, you could get 4-5 million, I think. I've crawled close to that in a day for a scrape once.
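The "create an html file" step above could be scripted roughly like this. It is only a sketch: it assumes you have already pulled the URLs out of the Xenu export into a plain Python list, and the function names and the `xenu_batch_NNN.html` file naming are my own invention. Each output file is a bare page of links that can be opened through the "file" option mentioned in the steps.

```python
from html import escape
from pathlib import Path


def chunk(urls, size):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def write_link_files(urls, out_dir, size=100_000):
    """Write the URLs as plain HTML link pages, `size` links per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, part in enumerate(chunk(urls, size), start=1):
        # Escape each URL so characters like & and " do not break the markup.
        links = "\n".join(
            f'<a href="{escape(u, quote=True)}">{escape(u)}</a><br>' for u in part
        )
        path = out / f"xenu_batch_{i:03d}.html"
        path.write_text(f"<html><body>\n{links}\n</body></html>\n", encoding="utf-8")
        paths.append(path)
    return paths
```

With 800k URLs and the default size of 100,000, `write_link_files(urls, "batches")` would produce 8 files; pass `size=50_000` for the smaller batches.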
-
-
Thanks. It is good to hear that there is a way to do what I am trying to do, especially on 50 or more large sites.
I've been running Xenu on a 33,500,000 pages site for a little over 4 hours and 15 minutes, and I have something like this, so far:
Close to 500,000 URLs recognized and only 115,000 processed, it looks like. I am manually saving to a file every now and then, as there is no way to auto-save as far as I can tell (there could be, I am not sure; there are not too many options there).
I am not sure, based on your advice, how I could speed up this process. Should I wait from this point, then stop the program, divide the file into 8 separate files, and load them into the program separately? Will the program then recognize these separate files as one and continue crawling for new URLs? If possible, please give more detail on how this needs to be done, as I don't fully understand. I also don't see how this could get through this large website in one day, or let's say even five days...
# # Edit:
I actually came to understand what you mean: get 8 separate files (it can be 6 or, let's say, 10) and run them all at the same time. But still, how will the program know not to crawl and download the same URLs across all the files? In general, I would like to ask for a better explanation of how this needs to be done.
Thanks.
-
Let Xenu crawl until you have about 800k links. Then export the file and add it back as 8 x 100k lists of URLs. You can then run it again and repeat the process. By the time you have split it 4-5 times, you can then export everything, put it into one file and remove duplicates.
Xenu, done this way, with 100 threads, is probably the fastest way to do the whole thing. I think you could get the 5M results in under 1 day of work this way.
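The final "put it into one file and remove duplicates" step is easy to do outside of Xenu. A minimal sketch, assuming each export has already been reduced to a list of URL strings (the function name is made up):

```python
def merge_unique(*url_lists):
    """Merge several URL lists, keeping the first occurrence of each URL."""
    seen = set()
    unique = []
    for urls in url_lists:
        for url in urls:
            url = url.strip()
            if url and url not in seen:  # skip blank lines and repeats
                seen.add(url)
                unique.append(url)
    return unique


run_1 = ["https://example.com/a", "https://example.com/b"]
run_2 = ["https://example.com/b ", "", "https://example.com/c"]
print(merge_unique(run_1, run_2))
```

This only treats exact string matches as duplicates, so variants such as a trailing slash or a different letter case would still count as separate URLs.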
-
OK. So it looks like Screaming Frog may be a good way to go too, if not better. Xenu is free, which is a big plus. On top of that, Screaming Frog's SEO Spider is based on a yearly subscription, not a one-time fee. For those who don't know, there is a version of Xenu for large sites, which can be found on their website. They also have a support group at groups.yahoo.com (find it through there); I am not sure if it is still active.
Xenu upgraded to the version for larger sites may be the best way to go, since it is free. I've been testing AuditMyPc.com Sitemap Creator and the large-site version of Xenu, and the first one has already hung (I stopped using it). They were both collecting the info at about the same speed, but Xenu is working better (it does not hang, and looks like it should be good). Either way, this will take quite a lot of time, as previously mentioned.
-
I agree with Moosa and Danny. I use Screaming Frog (full paid version) on a stripped-down Windows machine with an SSD and 16GB of performance RAM. I have also downloaded the 64-bit version of Java and increased the memory allocation for Screaming Frog to 12GB (the default limit is 512MB). Here's how: http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ (look at the section Increasing Memory on Windows 32 & 64-bit)
I did this because I was having issues crawling a large site. Since I put this system in place it eats any site I have thrown at it, so it works well for me personally. In terms of crawl speed, large sites such as the ones you mention will still take a while. You can set the crawl speed in Screaming Frog, but you need to be careful, as you can overload the server of the site you are crawling and cause issues...
Another option would be to buy a server and configure it for Screaming Frog and any other tools you use, which gives you room to grow the system as your needs grow. It all depends on budget and how often you crawl large sites. Obviously a server such as a Windows instance on Amazon EC2 will cost more in the long run, but it takes the strain off your own systems and networks, and you should effectively never hit capacity on the server, as you can just upgrade. It will also allow you to remote desktop in from whatever system you use, yes, even a Mac.
Hope this helps
-
I believe that with 1 to 5 million URLs it is going to take time no matter what tool you use, but if you ask me, Screaming Frog is the better tool, and with the paid version you can still crawl websites with a few million URLs.
Xenu is not a bad choice either, but it's kind of confusing and there is a possibility that it can break.
Hope this helps!
-
I was facing a similar issue with huge sites that have hundreds of thousands of pages. But ever since I upgraded my computer's RAM and added an SSD, it runs much better on huge sites as well. I tried several scrapers, and I still believe Xenu is the best one and the most recommended by SEO experts. You might also want to check this post on the Moz Blog about Xenu:
http://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finder
Good luck!