What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for - exclude query parameters which just result in duplicate content, system files, etc. That may help to bring the amount down.
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write. In 100 lines of code you can probably code one. The question is of course: what do you want out of the crawl?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
My last site crawl shows over 700 404 errors all with void(0 added to the ends of my posts/pages.
Hello, My last site crawl shows over 700 404 errors all with void(0 added to the ends of my posts/pages. I have contacted my theme company but not sure what could have done this. Any ideas? The original posts/pages are still correct and working it just looks like it did duplicates and added void(0 to the end of each post/page. Questions: There is no way to undo this correct? Do I have to do a redirect on each of these? Will this hurt my rankings and domain authority? Any suggestions would be appreciated. Thanks, Wade
Intermediate & Advanced SEO | | neverenoughmusic.com0 -
Client has an inexplicable jump in crawled pages being reported in Google Search Console
Recently a client of mine noticed an inexplicable jump in crawled pages being reported in Google Search Console. We researched the following culprits and found nothing: Rel=canonicals are put in place No SSL/non SSL duplication We used a tool to extrapolate search query page data from Google Search Insights; nothing unusual No dynamic pages being made on the website All necessary landing pages are in the XML sitemap Could this be a glitch in GSC? We are wondering what the heck is going on. 7eaeS
Intermediate & Advanced SEO | | BigChad20 -
Link Removal Request Sent to Google, Bad Pages Gone from Index But Still Appear in Webmaster Tools
| On June 14th the number of indexed pages for our website on Google Webmaster tools increased from 676 to 851 pages. Our ranking and traffic have taken a big hit since then. The increase in indexed pages is linked to a design upgrade of our website. The upgrade was made June 6th. No new URLS were added. A few forms were changed, the sidebar and header were redesigned. Also, Google Tag Manager was added to the site. My SEO provider, a reputable firm endorsed by MOZ, believes the extra 175 pages indexed by Google, pages that do not offer much content, may be causing the ranking decline. My developer submitted a page removal request to Google via Webmaster tools around June 20th. Now when a Google search is done for site:www.nyc-officespace-leader.com 851 results display. Would these extra pages cause a drop in ranking? My developer issued a link removal request for these pages around June 20th and the number in the Google search results appeared to drop to 451 for a few days, now it is back up to 851. In Google Webmaster Tools it is still listed as 851 pages. My ranking drop more and more everyday. At the end of displayed Google Search Results for site:www.nyc-officespace-leader.comvery strange URSL are displaying like:www.nyc-officespace-leader.com/wp-content/plugins/... If we can get rid of these issues should ranking return to what it was before?I suspect this is an issue with sitemaps and Robot text. Are there any firms or coders who specialize in this? My developer has really dropped the ball. Thanks everyone!! Alan |
Intermediate & Advanced SEO | | Kingalan10 -
Moving career site to new URL from main site. Will it hurt SEO for main page?
For one of our clients we are building a career site and putting it under a different URL and hosting service (mainly due to security concerns of hosting it under the same host and domain). almost 100% of the incoming traffic to their current career section (which it is in a sub-folder) receives traffic for branded keywords (brand + job/career/employment), that is, there are no job position specific keywords. The client is now worried that after moving the site, the inbound traffic to the main site will be severely affected as well as the SERP results. My questions are, will the non-career related SERPs be affected? I don't see how will they be but I could be wrong If no, how could we reassure her that the SEO to the main site wont be affected? are there any case studies of a similar case (splitting part of the website under a new URL and hosting service?) Thank you for your help. PS: this is my first post so please forgive me if this has been asked before. I could not find a good response.
Intermediate & Advanced SEO | | rflores0 -
50,000 backlinks in webmaster tools from one site???
Hi All, I'm new to evaluating backlinks, but I just saw I got over 50,000 links from a backlink that was added on ONE page at this site here: http://www.netnewspublisherDOTcom. I presume this is not a good thing, and if I contact them to remove the one link on the one page, it won't solve the other 49,999 links that Google is seeing pointing to us, so what do I do??. Should I contact them and ask to remove it and see if they don't and then disavow? Or would you just tell Google to disavow the whole site? Thanks!
Intermediate & Advanced SEO | | mlm120 -
Urgent Site Migration Help: 301 redirect from legacy to new if legacy pages are NOT indexed but have links and domain/page authority of 50+?
Sorry for the long title, but that's the whole question. Notes: New site is on same domain but URLs will change because URL structure was horrible Old site has awful SEO. Like real bad. Canonical tags point to dev. subdomain (which is still accessible and has robots.txt, so the end result is old site IS NOT INDEXED by Google) Old site has links and domain/page authority north of 50. I suspect some shady links but there have to be good links as well My guess is that since that are likely incoming links that are legitimate, I should still attempt to use 301s to the versions of the pages on the new site (note: the content on the new site will be different, but in general it'll be about the same thing as the old page, just much improved and more relevant). So yeah, I guess that's it. Even thought the old site's pages are not indexed, if the new site is set up properly, the 301s won't pass along the 'non-indexed' status, correct? Thanks in advance for any quick answers!
Intermediate & Advanced SEO | | JDMcNamara0 -
Whats the best search parameters on Open Site Explorer for identifying un-natural back links?
Using open site explorer, what parameters will best narrow down low quality back links(or back links that could be viewed as un-natural by Google)? ie. blog networks, link schemes, etc.
Intermediate & Advanced SEO | | Stromme0 -
How to best utilize network of 50 sites to increase traffic on main site
Hey All, First off I wanna thank everyone who has responded to all my previous questions! Love to see a community that is so willing to help those who are learning the ropes! Anyways back to my point. We have a main site that is a PR 3 and our main focal point for lead generation. We recently acquired 50 additional sites (all with a PR of 1-3) that we would like to use as our own little back linking campaign with. All the domains are completely relevant to our main site as well as specific pages within our main site. I know that reciprocal links will get me no where and that google is quickly on to the attempted 3 way link exchange. My question is how do I best link these 50 sites to not only maintain there own integrity and PR but also assist our main site. Thanks All!
Intermediate & Advanced SEO | | deuce1s0