How to extract URLs from a site (without bringing the server down!)
-
Hi everybody.
One of my clients is migrating to a new ecommerce platform, and we need to get a list of urls from the existing site to start mapping out the 301 redirects. Usually, I'd use a tool like Xenu or Integrity to crawl and output a list.
However, the database and server setup is so bad that it can't handle the requests from these tools and it sends the site down. This, unsurprisingly, is one of the reasons for the migration.
Does anybody know of a way to get a full list of urls without having to make a bunch of http requests which will kill the site? Any advice would be much appreciated!
-
Just a follow-up to my endorsement. It looks like Screaming Frog will let you control the number of pages crawled per second, but to do a full crawl you'll need to get the paid version (the free version only crawls 500 URLs):
http://www.screamingfrog.co.uk/seo-spider/
It's a good tool, and nice to have around, IMO.
-
Copy the site, set it up on a staging server and run http://www.xml-sitemaps.com/ on it?
-
why not find the links to the site, becauase you will only need to 301 the urls with extenal links. let teh rest 404. i use Bing WMT as it has a most complete collection IMO. they also export to a csv
-
Thanks Yannick, I don't know why I didn't think of using a scraper! Can you recommend any good code (PHP perhaps)?
-
-
Scrape Google?
-
Make your own scraper and keep the requests per second really low ?
-
Maybe the site has an automated sitemap somewhere ?
-
Google webmaster tools -> download "internal links" table
-
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
URL is invalid: Why?
Hello everyone, I am currently listing my company on business directories. For some websites however when I add my website URL, it comes up as URL is invalid. What could be the reason for this? I have tried different variations like www., http:// and https://. Kind Regards,
Technical SEO | | SMCCoachHire
Aqib0 -
How are Server side redirects perceived compared to direct links (on a Directory site)
Hi, Im creating some listings for a client on a relevant b2b directory (a good quality directory) I asked if the links are 'followed' or no 'followed' and they said they are 'server side redirects' so no direct links. Does anyone know how these are likely to be perceived by Google ? All BEst Dan
Technical SEO | | Dan-Lawrence1 -
Can the Hosting location of image files have a negative effect if 'off-site' such as on the devs own media server ?
Hi Can the Hosting location of image files have a negative effect if 'off-site' such as if they are on the developers own media server ? As opposed to on the actual websites server or file structure ? In the case i'm looking at the image files are hosted on a totally separate server (a media subdomain of the developers site server) from the subject sites dedicated server. Will engines still attribute the properties of files hosted in this manner to the main website (such as file name, alt attributes, etc etc) ? Or should they really be on the subject sites server own media folder ? Cheers Dan
Technical SEO | | Dan-Lawrence0 -
Moved a site and changed URL structures: Looking for help with pay
Hi Gents and Ladies Before I get started, here is the website in question. www.moldinspectiontesting.ca. I apologize in advance if I miss any important or necessary details. This might actually seem like several disjointed thoughts. It is very late where I am and I am a very exhausted. No on to this monster of a post. **The background story: ** My programmer and I recently moved the website from a standalone CMS to Wordpress. The owners of the site/company were having major issues with their old SEO/designer at the time. They felt very abused and taken by this person (which I agree they were - financially, emotionally and more). They wanted to wash their hands of the old SEO/designer completely. They sought someone out to do a minor redesign (the old site did look very dated) and transfer all of their copy as affordably as possible. We took the job on. I have my own strengths with SEO but on this one I am a little out of my element. Read on to find out what that is. **Here are some of the issues, what we did and a little more history: ** The old site had a terribly unclean URL structure as most of it was machine written. The owners would make changes to one central location/page and the old CMS would then generate hundreds of service area pages that used long, parameter heavy url's (along with duplicate content). We could not duplicate this URL structure during the transfer and went with a simple, clean structure. Here is an example of how we modified the url's... Old: http://www.moldinspectiontesting.ca/service_area/index.cfm?for=Greater Toronto Area New: http://www.moldinspectiontesting.ca/toronto My programmer took to writing 301 redirects and URL rewrites (.htaccess) for all their service area pages (which tally in the hundreds). As I hinted to above, the site also suffers from a overwhelming amount of duplicate copy which we are very slowly modifying so that it becomes unique. It's also currently suffering from a tremendous amount of keyword cannibalization. This is also a result of the old SEO's work which we had to transfer without fixing first (hosting renewal deadline with the old SEO/designer forced us to get the site up and running in a very very short window). We are currently working on both of these issues now. SERPs have been swinging violently since the transfer and understandably so. Changes have cause and effect. I am bit perplexed though. Pages are indexed one day and ranking very well locally and then apparently de-indexed the next. It might be worth noting that they had some de-index problems in the months prior to meeting us. I suspect this was in large part to the duplicate copy. The ranking pages (on a url basis) are also changing up. We will see a clean url rank and then drop one week and then an unclean version rank and drop off the next (for the same city, same web search). Sometimes they rank along side each other. The terms they want to rank for are very easy to rank on because they are so geographically targeted. The competition is slim in many cases. This time last year, they were having one of the best years in the company's 20+ year history (prior to being de-indexed). **On to the questions: ** **What should we do to reduce the loss in these ranked pages? With the actions we took, can I expect the old unclean url's to drop off over time and the clean url's to pick up the ranks? Where would you start in helping this site? Is there anything obvious we have missed? I planned on starting with new keyword research to diversify what they rank on and then following that up with fresh copy across the board. ** If you are well versed with this type of problem/situation (url changes, index/de-index status, analyzing these things etc), I would love to pick your brain or even bring you on board to work with us (paid).
Technical SEO | | mattylac0 -
Structure of urls
**Hallo from Athens, Greece. We have to implement the following project and i need your help: ** We will build a company guide for the whole country and company local guides for each city for the same client. **Information of the country guide is the sum of information of local guides, so when a user is at the country guide he sees information from companies from all cities and when the user is at city guide he sees info only for the city. ** The problem is the structure of the url we should have. Should the page of presentation of each company should have structure as domain.gr/id/company? or city.domain.gr/id/company and the one to be canonical to the other? is this good for seo? Should both urls be included in the sitemap? Thank you
Technical SEO | | herculesopa0 -
How can I see the SEO of a URL? I need to know the progress of a specific landing-page of my web. Not a keyword, an url please. Thanks.
I need to know the evolution on SEO of a specific landing-page (an URL) of my web. Not a keyword, a url. Thanks. (Necesito saber si es posible averiguar el progreso de una URL específica en el posicionamiento de Google. Es decir, lo que hace SEOmoz con las palabras clave pero al revés. Yo tengo una url concreta que quiero posicionar en las primeras posiciones de Google pero quiero ver cómo va progresando en función a los cambios que le voy aplicando. Muchas gracias)
Technical SEO | | online_admiral0 -
How can you get the right site links for your site?
Hello all, I have been trying to get Google to list relevant site links for my site when you type in our brand name, Loco2 or for when Loco2 comes up in a search result. Different things come up when you search Loco2 and Loco 2. We would like site links to look like how they do when you search Loco 2. However Loco2 is our brand name, NOT Loco 2. Does anyone know why Google is doing this and whether we can influence results? We have done as much as possible via Google webmaster, in terms of specifying the links we DO NOT want Google to list for Loco2. However, when you search "Loco2", results only show simple site links. Ideally what we want is: Loco2 to be recognised as the brand NOT Loco 2 The same results (substantial, identical) for Loco2 as for Loco 2 (think o2 and o 2) For the site links to reflect the main pages of our site (Times & Tickets, Engine Room forum etc.) Many thanks in advance! Anila
Technical SEO | | anilababla0 -
Should we introduce subfolders into the URLs on a new site?
A site we are working on currently gives no indication of the subfolders in the URL. Eg. the site uses: www.examplesite.com/brand-name Rather than: www.examplesite.com/popular-products/brand-name There are breadcrumbs on site to show the user what part of the site they are in and how they navigated there. We are building a new site and have to decide what route to take: Since the site is already performing relatively well in the SERPs and the URLs are nice and short this way, is it a good idea to keep them like this or is it better for usability to include the subfolders? This post suggests that we would be best off to keep the URLs as they are - particularly since less would be changed http://www.seomoz.org/blog/should-i-change-my-urls-for-seo Thanks in advance for your opinions! Liz @lizstraws
Technical SEO | | oneresult0