What is the best tool to crawl a site with millions of pages?
-
I want to crawl a site that has so many pages that Xenu and Screaming Frog keep crashing at some point after 200,000 pages.
What tools will allow me to crawl a site with millions of pages without crashing?
-
Don't forget to exclude pages that don't contain the information you are looking for: URLs whose query parameters just produce duplicate content, system files, etc. That may help bring the number of pages down.
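As a rough illustration of that kind of filtering, here is a minimal Python sketch that drops duplicate-producing query parameters and skips non-page files before a URL is queued. The parameter names and file extensions are only assumptions for the example; the right lists depend on the site being crawled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical settings: adjust both lists to the site being crawled.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}
SKIPPED_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".gif", ".pdf", ".zip")


def normalize(url):
    """Return a canonical form of url, or None if the URL should be skipped."""
    parts = urlsplit(url)
    # Skip static and system files that carry no page content.
    if parts.path.lower().endswith(SKIPPED_EXTENSIONS):
        return None
    # Keep only query parameters that change the page content.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    kept.sort()  # same parameters in a different order map to one URL
    # Lowercase the host and drop the fragment, which never reaches the server.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )
```

Running every discovered link through `normalize` and queueing only the non-`None` results that have not been seen before keeps parameter variants of the same page from being crawled twice.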
-
Only basic stuff: URL, Title, Description, and a few HTML elements.
I am aware that building a crawler would be fairly easy, but is there one out there that already does it without consuming too many resources?
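For those basic fields, a sketch using only Python's standard library might look like the following. The class and function names are made up for the example, and it collects just the title, the meta description and the link targets; other HTML elements would need their own branches.

```python
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collects the title, meta description and link targets of one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text is only recorded while inside <title>...</title>.
        if self._in_title:
            self.title += data


def extract(html):
    """Parse one HTML document and return the fields as a dict."""
    parser = PageInfo()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "links": parser.links,
    }
```

Because each page is parsed and reduced to a small dict, nothing from the page itself has to stay in memory once its row has been written out.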
-
For what purpose do you want to crawl the site?
A web crawler isn't really hard to write; you can probably code one in about 100 lines. The real question is: what do you want out of the crawl?
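To give an idea of the size involved, here is a minimal single-host crawler sketch in Python using only the standard library. It is an illustration under simplifying assumptions, not a production tool: it ignores robots.txt and rate limiting, assumes pages are UTF-8, and keeps the set of seen URLs in memory, which for millions of pages would probably have to move to disk or a database. The names `crawl`, `http_fetch` and `max_pages` are invented for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class _Links(HTMLParser):
    """Collects the <title> text and every <a href> on a page."""

    def __init__(self):
        super().__init__()
        self.title, self.links, self._in_title = "", [], False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def http_fetch(url):
    """Default fetcher: returns the page body as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=http_fetch, max_pages=1000):
    """Breadth-first crawl of one host; yields (url, title) as pages are fetched."""
    host = urlsplit(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        fetched += 1
        page = _Links()
        page.feed(html)
        yield url, page.title.strip()
        for href in page.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
```

Since `crawl` is a generator, each `(url, title)` row can be written to a CSV file as soon as it is produced, for example with `csv.writer(f).writerow(row)` inside a `for row in crawl(start_url):` loop, so the results themselves never accumulate in memory.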