What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters only produce duplicate content, system files, and so on. That may help to bring the number of pages down.
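As a rough illustration of that kind of filtering, here is a minimal Python sketch that drops duplicate-producing query parameters and skips non-page files before a URL is queued. The parameter names and file extensions are only examples I made up; you would replace them with whatever applies to the site you are crawling.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical examples; adjust both lists to the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize_url(url):
    """Return a canonical form of url, or None if the URL should be skipped."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return None  # mailto:, javascript:, etc.
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None  # static or system files, not content pages
    # Keep only the parameters that change the content, in a stable order.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    # Lowercase the host and drop the fragment so equivalent URLs compare equal.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

For example, `normalize_url("http://Example.com/shoes?sort=price&color=red#top")` gives `http://example.com/shoes?color=red`, so the sorted and unsorted versions of the same listing are only crawled once.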
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
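For those basic fields, a sketch using only Python's standard library might look like the following. The `PageSummary` class and `summarize` function are names invented for this example, and collecting `<h1>` text stands in for "a few HTML elements"; swap in whichever tags you actually need.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects the title, meta description, and h1 text from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1 = []
        self._current = None  # "title" or "h1" while inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("title", "h1"):
            self._current = tag
            if tag == "h1":
                self.h1.append("")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1[-1] += data


def summarize(html):
    """Return the extracted fields for one HTML document as a dict."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description.strip(),
        "h1": [text.strip() for text in parser.h1],
    }
```

Because it keeps only a few short strings per page, the memory cost per page stays small, which matters more than raw speed once the crawl runs into the millions.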
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is: what do you want out of the crawl?
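To give an idea of the size involved, here is a bare-bones breadth-first crawler in Python using only the standard library. It is a sketch under simplifying assumptions, not a production tool: links are found with a simple regular expression, there is no robots.txt handling or rate limiting, and the `fetch_page` and `max_pages` parameters are names I chose for the example.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen

# Simplification: a regex finds quoted href values; a real parser is more robust.
LINK_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def fetch(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=1000):
    """Breadth-first crawl of a single host; yields (url, html) pairs."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        fetched += 1
        yield url, html
        for href in LINK_RE.findall(html):
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
```

Since `crawl` is a generator, you can write each page's data to disk as it arrives instead of holding the whole crawl in memory. For millions of pages, the `seen` set and the queue are the parts that would eventually need to move to disk or a database.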