Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How to extract URLs from a site (without bringing the server down!)
-
Hi everybody.
One of my clients is migrating to a new ecommerce platform, and we need to get a list of urls from the existing site to start mapping out the 301 redirects. Usually, I'd use a tool like Xenu or Integrity to crawl and output a list.
However, the database and server setup is so bad that it can't handle the requests from these tools and it sends the site down. This, unsurprisingly, is one of the reasons for the migration.
Does anybody know of a way to get a full list of urls without having to make a bunch of http requests which will kill the site? Any advice would be much appreciated!
-
Just a follow-up to my endorsement. It looks like Screaming Frog will let you control the number of pages crawled per second, but to do a full crawl you'll need to get the paid version (the free version only crawls 500 URLs):
http://www.screamingfrog.co.uk/seo-spider/
It's a good tool, and nice to have around, IMO.
-
Copy the site, set it up on a staging server and run http://www.xml-sitemaps.com/ on it?
-
why not find the links to the site, becauase you will only need to 301 the urls with extenal links. let teh rest 404. i use Bing WMT as it has a most complete collection IMO. they also export to a csv
-
Thanks Yannick, I don't know why I didn't think of using a scraper! Can you recommend any good code (PHP perhaps)?
-
-
Scrape Google?
-
Make your own scraper and keep the requests per second really low ?
-
Maybe the site has an automated sitemap somewhere ?
-
Google webmaster tools -> download "internal links" table
-
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Does anyone know the linking of hashtags on Wix sites does it negatively or postively impact SEO. It is coming up as an error in site crawls 'Pages with 404 errors' Anyone got any experience please?
Does anyone know the linking of hashtags on Wix sites does it negatively or positively impact SEO. It is coming up as an error in site crawls 'Pages with 404 errors' Anyone got any experience please? For example at the bottom of this blog post https://www.poppyandperle.com/post/face-painting-a-global-language the hashtags are linked, but they don't go to a page, they go to search results of all other blogs using that hashtag. Seems a bit of a strange approach to me.
Technical SEO | | Mediaholix0 -
URL Structure On Site - Currently it's domain/product-name NOT domain/category/product name is this bad?
I have a eCommerce site and the site structure is domain/product-name rather than domain/product-category/product-name Do you think this will have a negative impact SEO Wise? I have seen that some of my individual product pages do get better rankings than my categories.
Technical SEO | | the-gate-films0 -
Can anyone tell me why some of the top referrers to my site are porn site?
We noticed today that 4 of the top referring sites are actually porn sites. Does anyone know what that is all about? Thanks!
Technical SEO | | thinkcreativegroup1 -
Does my "spam" site affect my other sites on the same IP?
I have a link directory called Liberty Resource Directory. It's the main site on my dedicated IP, all my other sites are Addon domains on top of it. While exploring the new MOZ spam ranking I saw that LRD (Liberty Resource Directory) has a spam score of 9/17 and that Google penalizes 71% of sites with a similar score. Fair enough, thin content, bunch of follow links (there's over 2,000 links by now), no problem. That site isn't for Google, it's for me. Question, does that site (and linking to my own sites on it) negatively affect my other sites on the same IP? If so, by how much? Does a simple noindex fix that potential issues? Bonus: How does one go about going through hundreds of pages with thousands of links, built with raw, plain text HTML to change things to nofollow? =/
Technical SEO | | eglove0 -
Staging site and "live" site have both been indexed by Google
While creating a site we forgot to password protect the staging site while it was being built. Now that the site has been moved to the new domain, it has come to my attention that both the staging site (site.staging.com) and the "live" site (site.com) are both being indexed. What is the best way to solve this problem? I was thinking about adding a 301 redirect from the staging site to the live site via HTACCESS. Any recommendations?
Technical SEO | | melen0 -
Trailing Slashes In Url use Canonical Url or 301 Redirect?
I was thinking of using 301 redirects for trailing slahes to no trailing slashes for my urls. EG: www.url.com/page1/ 301 redirect to www.url.com/page1 Already got a redirect for non-www to www already. Just wondering in my case would it be best to continue using htacces for the trailing slash redirect or just go with Canonical URLs?
Technical SEO | | upick-1623910 -
Delete old site but redirect domain to a new domain and site
I just have a quick query and I have a feeling about what the answer is so just wanted to see what you guys thought... Basically I am working on a client site. This client has a few other websites that are divisions of their company. However these divisions/websites are no longer used. They are wanting to delete the websites but redirect the domains to their name main website. They believe this will pass on SEO benefits as these old division sites are old and have a good PR and history. I'm unsure for DEFINITE, which way is correct?
Technical SEO | | Weerdboil0 -
Using a third party server to host site elements
Hi guys - I have a client who are recently experiencing a great deal of more traffic to their site. As a result, their web development agency have given them a server upgrade to cope with the new demand. One thing they have also done is put all website scripts, CSS files, images, downloadable content (such as PDFs) - onto a 3rd party server (Amazon S3). Apparently this was done so that my clients server just handles the page requests now - and all other elements are then grabbed from the Amazon s3 server. So basically, this means any HTML content and web pages are still hosted through my clients domain - but all other content is accessible through an Amazon s3 server URL. I'm wondering what SEO implications this will have for my clients domain? While all pages and HTML content is still accessible thorugh their domain name, each page is of course now making many server calls to the Amazon s3 server through external URLs (s3.amazonaws.com). I imagine this will mean any elements sitting on the Amazon S3 server can no longer contribute value to the clients SEO profile - because that actual content is not physically part of their domain anymore. However what I am more concerned about is whether all of these external server calls are going to have a negative effect on the web pages value overall. Should I be advising my client to ensure all site elements are hosted on their own server, and therefore all elements are accessible through their domain? Hope this makes sense (I'm not the best at explaining things!)
Technical SEO | | zealmedia0