What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs with query parameters that only produce duplicate content, system files, and so on. That may help bring the number of pages down.
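As a rough illustration of that kind of filtering, here is a minimal Python sketch. The parameter names, path prefixes, and file extensions are only assumptions for the example; the real lists depend on the site being crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rules for illustration; adjust them to the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_PREFIXES = ("/wp-admin/", "/cgi-bin/")
SKIPPED_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize_url(url):
    """Drop the fragment and ignored query parameters so duplicates collapse."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


def should_crawl(url):
    """Return False for system paths and non-HTML file types."""
    path = urlsplit(url).path.lower()
    if path.startswith(SKIPPED_PREFIXES):
        return False
    return not path.endswith(SKIPPED_SUFFIXES)
```

Running every discovered link through `normalize_url` before checking whether it was already seen, and through `should_crawl` before queueing it, keeps duplicate and irrelevant URLs out of the crawl.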
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
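For reference, pulling those basic fields out of a page takes very little code. The sketch below uses only Python's standard library; the class and function names are made up for this example, and it covers just the title and meta description.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collect the <title> text and the meta description of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = (attrs.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.title += data


def extract_page_info(url, html):
    """Return the URL, title, and description of an already downloaded page."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title.strip(), "description": parser.description}
```

Other elements (an `h1`, a canonical link, and so on) could be collected the same way by adding branches to `handle_starttag`.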
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The question, of course, is: what do you want out of the crawl?
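To give an idea of the size, here is a minimal sketch of such a crawler in Python, using only the standard library. It is an illustration under assumptions, not a production tool: `fetch_html`, `crawl`, and `max_pages` are names invented for this example, links and titles are found with simple regular expressions, and there is no robots.txt handling, rate limiting, or retry logic.

```python
import csv
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen

LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch_html(url):
    """Default fetcher (an assumption): plain urllib, 10 s timeout, no retries."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, out_file, fetch=fetch_html, max_pages=1000):
    """Breadth-first crawl of one host, writing each row to out_file as it goes."""
    host = urlsplit(start_url).netloc
    writer = csv.writer(out_file)
    writer.writerow(["url", "title"])
    queue = deque([start_url])
    seen = {start_url}
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        pages += 1
        match = TITLE_RE.search(html)
        writer.writerow([url, match.group(1).strip() if match else ""])
        for href in LINK_RE.findall(html):
            # Resolve relative links and drop fragments before deduplicating.
            link = urldefrag(urljoin(url, href))[0]
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Results are written row by row instead of being held in memory, which is part of what keeps resource use down. For millions of pages, the `seen` set and the queue would likely need to move to disk (a SQLite table, for instance), since they are the parts that keep growing.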