What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs with query parameters that only produce duplicate content, system files, and so on. That may help bring the number of pages down.
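As a rough illustration of that kind of filtering, here is a minimal Python sketch that normalizes URLs before they are queued. The parameter names, path prefixes, and file extensions are hypothetical placeholders, not rules for any particular site, so swap in whatever actually produces duplicates on the site you are crawling.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical rules for illustration; adjust them to the site being crawled.
DROP_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIP_PREFIXES = ("/admin/", "/cgi-bin/", "/wp-includes/")
SKIP_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize(url):
    """Return a canonical form of url, or None if it should not be crawled."""
    parts = urlsplit(url)
    path = parts.path or "/"
    lower = path.lower()
    # Skip system paths and static files entirely.
    if lower.startswith(SKIP_PREFIXES) or lower.endswith(SKIP_EXTENSIONS):
        return None
    # Drop parameters that only create duplicates, then sort the rest so
    # that ?a=1&b=2 and ?b=2&a=1 map to the same URL.
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in DROP_PARAMS]
    kept.sort()
    # The fragment is always removed; it never changes the fetched page.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, urlencode(kept), ""))
```

Queueing only the normalized form means each distinct page is requested once, however many parameter variations link to it.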
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The question, of course, is what you want to get out of the crawl.
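To give an idea of what such a crawler might look like, here is a minimal sketch using only the Python standard library. It is not a replacement for a dedicated tool: it assumes a single host, does not check robots.txt or rate-limit requests, and the names `PageParser`, `fetch`, and `crawl` are made up for this example. It collects the URL, title, and meta description mentioned above and writes each row out as soon as the page is parsed, so only the set of seen URLs and the queue stay in memory.

```python
import csv
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects the title, meta description, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, out, fetch=fetch, max_pages=1000):
    """Breadth-first crawl of one host; writes url, title, description rows to out."""
    host = urlsplit(start_url).netloc
    writer = csv.writer(out)
    seen = {start_url}
    queue = deque([start_url])
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = PageParser()
        parser.feed(html)
        writer.writerow([url, parser.title.strip(), parser.description.strip()])
        crawled += 1
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

You would call it with an open file, for example `crawl("http://example.com/", open("pages.csv", "w", newline=""))`, and raise `max_pages` as needed. For millions of pages the `seen` set itself becomes large, so at that scale you would likely want to keep it on disk or in a database instead.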