How to extract URLs from a site (without bringing the server down!)
-
Hi everybody.
One of my clients is migrating to a new ecommerce platform, and we need to get a list of urls from the existing site to start mapping out the 301 redirects. Usually, I'd use a tool like Xenu or Integrity to crawl and output a list.
However, the database and server setup is so bad that it can't handle the requests from these tools and it sends the site down. This, unsurprisingly, is one of the reasons for the migration.
Does anybody know of a way to get a full list of urls without having to make a bunch of http requests which will kill the site? Any advice would be much appreciated!
-
Just a follow-up to my endorsement. It looks like Screaming Frog will let you control the number of pages crawled per second, but to do a full crawl you'll need to get the paid version (the free version only crawls 500 URLs):
http://www.screamingfrog.co.uk/seo-spider/
It's a good tool, and nice to have around, IMO.
-
Copy the site, set it up on a staging server and run http://www.xml-sitemaps.com/ on it?
-
why not find the links to the site, becauase you will only need to 301 the urls with extenal links. let teh rest 404. i use Bing WMT as it has a most complete collection IMO. they also export to a csv
-
Thanks Yannick, I don't know why I didn't think of using a scraper! Can you recommend any good code (PHP perhaps)?
-
-
Scrape Google?
-
Make your own scraper and keep the requests per second really low ?
-
Maybe the site has an automated sitemap somewhere ?
-
Google webmaster tools -> download "internal links" table
-
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Site-wide Links
Hey y'all, I know this question has been asked many times before but I wanted to see what your stance was on this particular case. The organisation I work for is a group of 12 companies - each with its own website. On some of the sites we have a link to the other sites within the group on every single page of that site. Our organic search traffic has dropped a bit but not significantly and we haven't received any manual penalties from Google. It's also worth mentioning that the referral traffic for these sites from the other sites I control is quite good and the bounce rate is extremely low. If you were in my shoes would you remove the links, put a nofollow tag on the links or leave the links as they are? Thanks guys 🙂
Technical SEO | | AAttias0 -
Why does my site rank so badly
its my turn to ask the interminable question why does my site rank so badly? site is: marriagerecords.org.uk. it was #1 for 'marriage records' on google for about 6 months. then it was 5th to 10th for about 2 months. now it is nowhere for this phrase and anything else, none of the pages I have written rank for anything. I have spent hours upon hours researching original content and I have got some great backlinks from sites like wrexham.gov.uk and somerset.gov.uk (some dont show in opensiteexplorer yet). im guessing im over-optimizing something but i'd love some concrete fixes if anyone could suggest any. thanks, tom
Technical SEO | | lethal0r0 -
Server Connectivity
Hey there When we go to our webmaster tools there is a orange triangle. The issue is that Google's robot can not access our site. Does anyone know why this could be? Thanks!
Technical SEO | | Comunicare0 -
What happens when a link goes to a dead url on my site?
I noticed in Open Site Explorer, I have several incoming links going to dead urls because i re-organized my site. For example, there might be an incoming link to: sample.php?ID=8 The problem is that I moved the file to /subdir1 so it would be nice if it could link to /subdir1/sample.php?ID=8 BUT, on top of that, I have also changed the url to seo-friendly urls. So, really, it should link to /Category_Descripton/ProductName/8 and then get re-written to /subdir1/sample.php?ID=8 So, what are the implications of having these incoming links to dead urls other than the bad user experience. What are the implications from an SEO standpoint? What's the best way to fix this? Thanks.
Technical SEO | | webtarget0 -
Way to find how many sites within a given set link to a specific site?
Hi, Does anyone have an idea on how to determine how many sites within a list of 50 sites link to a specific site? Thanks!
Technical SEO | | SparkplugDigital0 -
Google and QnA sites
My website has a QnA site - a bit like this one except it's not private to premium members. It is a page with a left colomn for category links and it has a list of recently asked questions, each question is a link to view the full question and answers etc. Does google know this is a QnA ? Or will it say - hey, there are far too many links on this page, tut tut. Is there anything I can do to help it understand what the page is.
Technical SEO | | borderbound0 -
We're working on a site that is a beer company. Because it is required to have an age verification page, how should we best redirect the bots (useragents) to the actual homepage (thus skipping ahead of the age verification without allowing all browsers)?
This question is about useragents and alcohol sites that have an age verification screen upon landing on the site.
Technical SEO | | OveritMedia0 -
I have a site that has both http:// and https:// versions indexed, e.g. https://www.homepage.com/ and http://www.homepage.com/. How do I de-index the https// versions without losing the link juice that is going to the https://homepage.com/ pages?
I can't 301 https// to http:// since there are some form pages that need to be https:// The site has 20,000 + pages so individually 301ing each page would be a nightmare. Any suggestions would be greatly appreciated.
Technical SEO | | fthead90