Good alternatives to Xenu's Link Sleuth and AuditMyPc.com Sitemap Generator
-
I am working on scraping title tags from websites with 1-5 million pages. Xenu's Link Sleuth seems to be the best option for this, at this point. Sitemap Generator from AuditMyPc.com seems to be working too, but it starts handing up, when a sitemap file, the tools is working on,becomes too large. So basically, the second one looks like it wont be good for websites of this size. I know that Scrapebox can scrape title tags from list of url, but this is not needed, since this comes with both of the above mentioned tools.
I know about DeepCrawl.com also, but this one is paid, and it would be very expensive with this amount of pages and websites too (5 million ulrs is $1750 per month, I could get a better deal on multiple websites, but this obvioulsy does not make sense to me, it needs to be free, more or less). Seo Spider from Screaming Frog is not good for large websites.
So, in general, what is the best way to work on something like this, also time efficient. Are there any other options for this?
Thanks.
-
import.io and it's free
-
Another idea that I have here, is to look for sitemaps of these websites. There may be a way to get a list of all the urls, right away, without crawling. Look at /robots.txt, /sitemap.xml, search for sitemap in Google, things like that. If there is urls, title tags can be scraped with Scrapebox, and as far as their website is saying, it can be done relatively fast.
# # Edit:
I had somebody suggesting http://inspyder.com, around $40 and free trial. May be a good option too.
-
So there is probably no way to tell, whether I have all the urls of the site, or what percentage I have... I may have 80 or even less percent of the total site, and not know about it, I would assume. This is one of the parts of working on the sites (I've never needed it, but I am working on something like this now), and there is no good tool, which would do the work.
I have a website with 33,500,000 pages. I've been running the tool for close to 5 hours, and I have around 125,000 urls, so far. This means, that it would take 1340 hours to do the entire site. This is close to two months of running the program 24 hours a day, which does not make sense. And besides that I was planning to do it on up to 100 sites. Definitely not something that can be done, and I would say that it should be possible, software-wise.
I will try your method, and see what I will get. I dont have too much time for experimenting with it too. I need to work, and generate results...
# # Edit
I will now how the number of urls compares to the 33,500,000 figure, obviously, but whats indexed in Google is not necessarily the complete website too. The method that you are suggesting is not perfect, but I dont have two months to wait too, obviously...
-
You will crawl some of the same URLs - that's why you remove duplicates at the end. There's no way to keep it from re-crawling some of the URLs, as far as I know.
But yes, get it to recognize 600-800k URLs and then split the file. (Export, put the links in as an html file and start over.) Let me break it down the best I can:
-
Crawl your main (seed) URL until you've recognized 800k.
-
Pause/stop and then export the results.
-
Create an html file with the URLs from the export - separated 50k to 100k at a time.
-
Recrawl those files in Xenu with the "file" option.
-
Build them back up to 800k or so recognized URLs again and repeat.
After a few (4-6) iterations of this, you'll have most URLs crawled on most sites no matter how large. Doing it this way, I think you could expect to crawl about 2-3 million URLs a day. If you really paid attention to it and created smaller files but ran them more frequently, you could get 4-5 million, I think. I've crawled close to that in a day for a scrape once.
-
-
Thanks. It is good to hear, that there is a way to do, of what I am trying to do, especially on 50 or more sites, large.
I've been running Xenu on a 33,500,000 pages site for a little over 4 hours and 15 minutes, and I have something like this, so far:
Close to 500,000 urls recognized, and only 115,000 processed, it looks like. I am manually saving it to a file, every now and then, as there is no way to auto save, as far as I was checking (there could be though, I am not sure, there is no too many options there).
I am not sure, based on your advice, how I could speed it up this process. Should I wait from this point, then stop the program, and divide the file into 8 separate files, and load it to the program separately? Then the program will recognize these separate files as one, and it will continue crawling for new urls? If possible, please give better information on how this would need to be done, as I dont fully understand. I also dont see how this could do this large website in one day, or lets say even five days...
# # Edit:
I actually got to understanding what you mean, get 8 separate files (can be 6 or, lets say 10) and run them all at the same time. But still, how will the program know not to crawl and download the same urls, on all the files? In general, I would like to ask for better explanation, on how this needs to be done.
Thanks.
-
Let Xenu crawl until you have about 800k links. Then export the file and add it back as 8 x 100k lists of URLs. You can then run it again and repeat the process. By the time you have split it 4-5 times, you can then export everything, put it into one file and remove duplicates.
Xenu, done this way, with 100 threads, is probably the fastest way to do the whole thing. I think you could get the 5M results in under 1 day of work this way.
-
Ok. So it looks like Screaming Frog may be a good way to go too, if not better. Xenu is free, which is a big plus. On the top of that Creaming Frog's Seo Spider is based on a yearly subscription, and not a one time fee. For those who dont know, there is a version of Xenu for large sites, which can be found on their website. They also have a support group at groups.yahoo.com (find it through there), I am not sure if it is still active.
Xenu upgraded to the version for larger sites may be the best way to go, since it is free. I've been testing AuditMyPc.com Sitemap Creator and the better version of Xenu, and the first one already hanged up (I discontinued using it). They were both collecting the info at about the same speed, but Xenu is working better (does not hang up, looks like it should be good). Either way, this will take quite a lot of time, with it, as previously mentioned.
-
I agree with Moosa and Danny - in terms of I use Screaming Frog (full paid version) on a stripped down windows machine with an SSD and 16GB of performance RAM. I have also download the 64 bit version of Java and increased the memory allocation for Screaming Frog to 12GB (default limit is 512mb) - here's how - http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ (look at the section Increasing Memory on Windows 32 & 64-bit)
I did this as I was having issues crawling a large site - after I put this system in place it eats any site I have thrown at it so far so it works well for me personally. In terms of speed of crawl large sites such as you mention will still take a while - you can set crawl speed in Screaming Frog, but you need to be careful as you can overload the server of the site you are crawling and cause issues...
Another option would be to buy a server and configure it for Screaming Frog and other tools you may use - this gives you options to grow the system as your needs grow. It all depends on budget and how often you crawl large sites - obviously buying a server such as a windows instance on Amazon EC2 will cost more in the long run but it takes the strain away from your own systems and networks plus you should effectively never hit capacity on the server as you can just upgrade. It will also allow you to remote desktop in on whatever system you use - yes even a Mac
Hope this helps
-
I believe when you are talking about 1 to 5 million URLs it is going to take time no matter what tool you use but if you ask me screaming frog is a better tool and if you have a paid version of it you still can crawl websites with few million URLs in it.
Xenu is not a bad choice either but it’s kind of confusing and there is a possibility that it can broke.
Hope this helps!
-
I was facing similar issue with huge sites, that have over 100s of thousands of pages. But ever since I upgraded my computer with RAM and SSD it run way better on huge sites as well. I tried several scrappers and I still believe Xenu is the best one and most recommended by SEO experts. Also you might want to check this post on Moz Blog about Xenu's
http://moz.com/blog/xenu-link-sleuth-more-than-just-a-broken-links-finderGood luck!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Does anyone know the linking of hashtags on Wix sites does it negatively or postively impact SEO. It is coming up as an error in site crawls 'Pages with 404 errors' Anyone got any experience please?
Does anyone know the linking of hashtags on Wix sites does it negatively or positively impact SEO. It is coming up as an error in site crawls 'Pages with 404 errors' Anyone got any experience please? For example at the bottom of this blog post https://www.poppyandperle.com/post/face-painting-a-global-language the hashtags are linked, but they don't go to a page, they go to search results of all other blogs using that hashtag. Seems a bit of a strange approach to me.
Technical SEO | | Mediaholix0 -
'domain:example.com/' is this line with a '/' at the end of the domain valid in a disavow report file ?
Hi everyone Just out of curiosity, what would happen if in my disavow report I have this line : domain:example.com**/** instead of domain:example.com as recommended by google. I was just wondering if adding a / at the end of a domain would automatically render the line invalid and ignored by Google's disavow backlinks tool. Many thanks for your thoughts
Technical SEO | | LabeliumUSA0 -
Will sitemap generated in Yoast for a combined wordpress/magento site map entire site ?
Hi For an ecommerce site thats been developed via a combination of wordpress and magento and has yoast installed, will the sitemap (& other yoast features) map (& apply to) the entire site or just wordpress aspects ? In other words does one need to do anything else to have a full sitemap for a combined magento/wordpress site or will Yoast cover it all ? This link seems to suggest should be fine but seeing if anyone else encountered this and had problems or if straightforward ? http://fishpig.co.uk/wordpress-integration/docs/plugins.html cheers dan
Technical SEO | | Dan-Lawrence0 -
Amazon Product Descriptions and our website's product descriptions
I am updating our product descriptions site-wide. I wanted to also update our amazon listings for those same products. Is that considered duplicate content if it would be on amazon and our site? Is there any reason why I wouldn't want to do that? Is google product ads also a problem?
Technical SEO | | EcomLkwd0 -
Outlinks not link with profitok.com/index.php
Dear sir my website www.profitok.com is not indexing in google can you give me the right answer wht to do
Technical SEO | | Luckykhullar0 -
I cannot find a way to implement to the 2 Link method as shown in this post: http://searchengineland.com/the-definitive-guide-to-google-authorship-markup-123218
Did Google stop offering the 2 link method of verification for Authorship? See this post below: http://searchengineland.com/the-definitive-guide-to-google-authorship-markup-123218 And see this: http://www.seomoz.org/blog/using-passive-link-building-to-build-links-with-no-budget In both articles the authors talk about how to set up Authorship snippets for posts on blogs where they have no bio page and no email verification just by linking directly from the content to their Google+ profile and then by linking the from the the Google+ profile page (in the Contributor to section) to the blog home page. But this does not work no matter how many ways I trie it. Did Google stop offering this method?
Technical SEO | | jeff.interactive0 -
Generating a Sitemap for websites linked to a wordpress blog
Greetings, I'm looking for a way to generate a sitemap that will include static pages on my home directory, as well as my wordpress blog. The site that I'm trying to build this for is in a temporary folder, and can be accessed at http://www.lifewaves.com/Website 3.0 I plan on moving the contents of this folder to the root directory for lifewaves.com whenever we are go for launch. What I'm wondering is, is there a way to build a sitemap or sitemap index that will point to the static pages of my site, as well as the wordpress blog while taking advantage of the built in wordpress hierarchy? If so, what's an easy way to do this. I have generated a sitemap using Yoast, but I can't seem to find any xml files within the wordpress folder. Within the plugin is a button that I can click to access the sitemap index, but it just brings me to the homepage of my blog. Can I build a sitemap index that points to a sitemap for the static pages as well as the sitemap generated by yoast? Thank you in advance for your help!! P.S. I'm kind of a noob.
Technical SEO | | WaveMaker0 -
Does 'framing' a website create duplicate content?
Something I have not come across before, but hope others here are able offer advice based on experience: A client has independently created a series of mini-sites, aimed at targeting specific locations. The tactic has worked very well and they have achieved a large amount of well targeted traffic as a result. Each mini-site is different but then in the nav, if you want to view prices or go to the booking page, that then links to what at first appears to be their main site. However, you then notice that the URL is actually situated on the mini-site. What they have done is 'framed' the main site so that it appears exactly the same even when navigating through this exact replica site. Checking the code, there is almost nothing there - in fact there is actually no content at all. Below the head, there is a piece of code: <frameset rows="*" framespacing=0 frameborder=0> <frame src="[http://www.example.com](view-source:http://www.yellowskips.com/)" frameborder=0 marginwidth=0 marginheight=0> <noframes>Your browser does not support frames. Click [here](http://www.example.com) to view.noframes> frameset> Given that main site content does not appear to show in the source code, do we have an issue with duplicate content? This issue is that these 'referrals' are showing in Analytics, despite the fact that the code does not appear in the source, which is slightly confusing for me. They have done this without consultation and I'm very concerned that this could potentially be creating duplicate content of their ENTIRE main site on dozens of mini-sites. I should also add that there are no links to the mini-sites from the main site, so if you guys advise that this is creating duplicate content, I would not be worried about creating a link-wheel if I advise them to link directly to the main site rather than the framed pages. Thanks!
Technical SEO | | RiceMedia0