Tools that crawl 2 million page sites
-
Our site is about 2million pages deep, 50% of which is stale content. Yes, I know - OMG #unhygienic. Even if we get approval to get rid of half of it. SEOMoz Pro Elite only crawls 20k deep - what can i do to crawl and diagnose the whole site. Are there any tools anyone can suggest. SEOMoz??
-
That's good to know. It sounds like that's probably the best way. I also use Screaming Frog (http://www.screamingfrog.co.uk/seo-spider/) to try and crawl sites and with dedicated 2Gigs of ram, it's able to crawl around 50k pages. If your site is structured in sub-folders, you might be able to break it into parts and then crawl. But then if not, the SEOMOZ Enterprise looks like the way to go.
-
There is an enterprise version of SEOmoz which will do 1 million pages a month and up to 30k keywords which is well worth looking into if you have a enormous web property.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Unsolved Moz can't crawl my site
Moz is being blocked from crawling the following site - https://www.cleanchain.com. When looking at Robot.txt, the following is disallowing access but don't know whether this is preventing Moz from crawling too? User-agent: *
Moz Pro | | danhart2020
Disallow: /adeci/
Disallow: /core/
Disallow: /connectors/
Disallow: /assets/components/ Could something else be preventing the crawl?0 -
Why my site not crawl?
hi all me allow rogerbot in robots.txt but rogerbot can't crawl my site my site: toska-co.ir
Moz Pro | | jahanidawodi0 -
Filter Pages
Howdy Moz Forum!! I have a headache of a job over here in the UK and I'd welcome any advice! - It's sunny today, only 1 of 5 days in a year and i'm stuck on this! I have a client that currently has 22,000 pages indexed to Google with almost 4000 showing as duplicate content. The site has a "jobs" and "candidates" list. This can cause all sorts of variations such as job title, language, location etc. The filter pages all seem to be indexed. Plus the static pages are indexed. For example if there were 100 jobs at Moz being advertised, it is displaying the jobs on the following URL structure - /moz
Moz Pro | | Slumberjac
/moz/moz-jobs
/moz/moz-jobs/page/2
/moz/moz-jobs/page/3
/moz/moz-jobs/page/4
/moz/moz-jobs/page/5 ETC ETC Imagine this with some going up to page/250 I have checked GA data and can see that although there are tons of pages indexed this way, non of them past the "/moz/moz-jobs" URL get any sort of organic traffic. So, my first question! - Should I use rel-canonical tags on all the /page/2 & /page/3 etc results and point them all at the /moz/moz-jobs parent page?? The reason for this is these pages have the same title and content and fall very close to "duplicate" content even though it does pull in different jobs... I hope i'm making sense? There is also a lot of pages indexed in a way such as- https://www.examplesite.co.uk/moz-jobs/search/page/9/?candidate_search_type=seo-consulant&candidate_search_language=blank-language These are filter pages... and as far as I'm concerned shouldn't really be indexed? Second question! - Should I "no follow" everything after /page in this instance? To keep things tidy? I don't want all the variations indexed! Any help or general thoughts would be much appreciated! Thanks.0 -
How to know exactly which page links to a 404 page on my website?
Hi Moz users, Sometimes I get 404 crawl errors using Moz Pro and when my website has a few dozen pages it is hard for me to find the original page that links to a 404 page. Is there a way to find this automatically using Moz or do I have to look for it manually? I just need to find the original link and delete it to fix my 404 issue. Please let me know thank you for you help. -Marc
Moz Pro | | marcandre0 -
How do a run a MOZ crawl of my site before waiting for the scheduled weekly crawl?
Greetings: I have just updated my site and would like to run a crawl immediately. How can I do so before waiting for the next MOZ crawl? Thanks,
Moz Pro | | Kingalan1
Alan0 -
Is www.domain.com/page the same url as www.domain.com/page/ for Google? (extra slash at end of url)
Dear all, in open site explorer there is a difference the url's 'www.domain.com/page' and 'www.domain.com/page/' (extra slash at end). There can be different values in pageauthority etc. in the open site explorer tool, but is this also the case for Google? Thanks for replying, Regards, Ben
Moz Pro | | HMK-NL0 -
Indexed sites
Hello I´m a newbie here on seomoz. My question is just simple but i need to be sure about that:Refering to the report which you can create in the "crawl diagnostic summary". Does the CSV export lists all and ONLY the URLS which are in the google index?If not: does a report is available where all and ONLY the URLS are listed which are in google index?Many thanks! Henrik
Moz Pro | | drgoodcat0 -
Crawl Diagnostics Report
I'm a bit concerned about the results I'm getting from the Crawl Diagnostics Report. I've updated the site with canonical urls to remove duplicate content and when I check the site - it all displays the right values, but the report, which has just finished crawling is still showing a lot of pages as duplicate content. Simple example: http://www.domain.com http://www.domain.com/ Both of them are in the duplicate content section although both have canonical url set as: Does each crawl check the entire site from the beginning or just the pages it didn't have a chance to crawl the last time? This is just one of 333 duplicate content pages, which have canonical url pointing to the right page. Can someone please explain?
Moz Pro | | coremediadesign0