Exclude status codes in Screaming Frog
-
I have a very large ecommerce site I'm trying to spider using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It literally was just the fact that the exclude rule had to be set before the crawl began and could not be changed during the crawl. Hopefully this gets changed, because sometimes during a crawl you find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on a PC? Do you have any other rules in include, or perhaps a conflicting rule in exclude? Try running a single exclude rule, and also try it on another small site to test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
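If it helps with testing a single rule, here is a minimal Python sketch for sanity-checking an exclude regex against a few sample URLs before starting a crawl. It assumes the pattern has to match the whole URL; the Spider's actual matching behaviour may differ by version, so check the user guide. The example.com URLs and the `is_excluded` helper are made up for illustration and are not part of Screaming Frog.

```python
import re


def is_excluded(url, exclude_patterns):
    """Return True if any exclude pattern matches the entire URL.

    Assumption: whole-URL matching, which is why the pattern ends in '.*'.
    """
    return any(re.fullmatch(pattern, url) for pattern in exclude_patterns)


# Hypothetical rule: skip everything under /clearance/ on an example site.
patterns = [r"http://www\.example\.com/clearance/.*"]

sample_urls = [
    "http://www.example.com/clearance/lamp-123",
    "http://www.example.com/lamps/lamp-123",
]

for url in sample_urls:
    status = "excluded" if is_excluded(url, patterns) else "crawled"
    print(url, "->", status)
```

If a URL you expected to be skipped comes back as crawled here, the regex itself is the likely culprit (unescaped dots, a missing trailing `.*`, or http vs https) rather than the Mac version.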
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job's still running, it now appears as though there's almost a 1-to-1 correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code You can right click and remove after though!"
Thanks again Michael. Your thoroughness and follow-through are appreciated.
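Since the status code is only known after a URL has been crawled, another option is to filter the redirects out of the exported crawl data afterwards. Below is a rough Python sketch of that idea. It assumes the export is a CSV with `Address` and `Status Code` columns; those column names and the sample rows are assumptions for illustration, so adjust them to match your actual export.

```python
import csv
import io


def drop_redirects(rows):
    """Keep only rows whose status code is not in the 3xx range."""
    kept = []
    for row in rows:
        code = row.get("Status Code", "").strip()
        if code.isdigit() and 300 <= int(code) < 400:
            continue  # skip redirects
        kept.append(row)
    return kept


# Made-up sample standing in for an exported crawl file.
sample_export = """Address,Status Code
http://www.example.com/,200
http://www.example.com/old-lamp,301
http://www.example.com/new-lamp,200
http://www.example.com/temp,302
"""

rows = list(csv.DictReader(io.StringIO(sample_export)))
kept = drop_redirects(rows)

for row in kept:
    print(row["Address"], row["Status Code"])
```

For a real export, swap the `io.StringIO(sample_export)` part for an opened file. This doesn't save any crawl time or memory, but it does keep the redirected URLs out of the inventory.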
-
Took another look, also looked at documentation/online and don't see any way to exclude URLs from crawl based on response codes. As I see it you would only want to exclude on name or directory as response code is likely to be random throughout a site and impede a thorough crawl.
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would ensure you are running the right version of Java if you are on a 64bit machine. The 32bit version functions but you cannot increase the memory allocation which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
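A quick way to check which Java you have is to run `java -version` and look at the VM line. The Python sketch below does the same check programmatically. It assumes a HotSpot-style JVM, which typically prints "64-Bit" in its version banner; other JVMs may word it differently. The two sample banners are illustrative only, and the helper names are made up.

```python
import subprocess


def is_64bit_jvm(version_output):
    """Guess JVM bitness from 'java -version' text (assumes a HotSpot-style banner)."""
    return "64-Bit" in version_output


def read_java_version():
    """Run 'java -version'; the banner is normally written to stderr."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return result.stderr + result.stdout


# Illustrative banners, not captured from any particular machine.
SAMPLE_64 = 'java version "1.7.0_25"\nJava HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)'
SAMPLE_32 = 'java version "1.7.0_25"\nJava HotSpot(TM) Client VM (build 23.25-b01, mixed mode, sharing)'

if __name__ == "__main__":
    try:
        banner = read_java_version()
        print("64-bit JVM" if is_64bit_jvm(banner) else "No 64-bit marker found")
    except FileNotFoundError:
        print("java was not found on the PATH")
```

If no 64-bit marker shows up on a 64-bit machine, installing the 64-bit version of Java is the first thing to try before raising the memory allocation.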