Exclude status codes in Screaming Frog
-
I have a very large ecommerce site I'm trying to spider using screaming frog. Problem is I keep hanging even though I have turned off the high memory safeguard under configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the search by depth is a possiblity, but it will take quite a bit of manual labor as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider b/c they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It literally was just the fact that it had to be done before the crawl began and could not be changed during the crawl. Hopefully this is changed because sometimes during a crawl you find things you want to exclude that you may have not known of their existence before hand.
-
Are you sure it's just on Mac,have you tried on PC? Do you have any other rules in include or perhaps a conflicting rule in exclude? Try running a single exclude rule, also on another small site to test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
-
does the exclude function work on mac. i have tried every possible way to exclude folders and have not been successful while running an analysis
-
That's exactly the problem, the redirects are disbursed randomly throughout the site. Although, and the job's still running, it now appears as though there's almost a 1-2-1 correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code
You can right click and remove after though!"
Thanks again Michael. Your thoroughness and follow through is appreciated.
-
Took another look, also looked at documentation/online and don't see any way to exclude URLs from crawl based on response codes. As I see it you would only want to exclude on name or directory as response code is likely to be random throughout a site and impede a thorough crawl.
-
Thank you Michael.
You're right. I was on a 64 bit machine running a 32 bit verson of java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would ensure you are running the right version of Java if you are on a 64bit machine. The 32bit version functions but you cannot increase the memory allocation which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Crawl errors - 2,513 not found. Response code 404
Hi,
Technical SEO | | JamesHancocks1
I've just inherited a website that I'll be looking after. I've looked in the Search Console in the Crawl errors section and discovered thousands of urls that point to non- existent pages on Desktop. There's 1,128 on Smartphone.
Some are odd and make no sense. for example: | bdfqgnnl-z3543-qh-i39634-imbbfuceonkqrihpbptd/ | Not sure why these have are occurring but what's the best way to deal with them to improve our SEO? | northeast/ | 404 | 8/29/18 |
| | 2 | blog/2016/06/27/top-tips-for-getting-started-with-the-new-computing-curriculum/ | 404 | 8/10/18 |
| | 3 | eastmidlands | 404 | 8/21/18 |
| | 4 | eastmidlands/partner-schools/pingle-school/ | 404 | 8/27/18 |
| | 5 | z3540-hyhyxmw-i18967-fr/ | 404 | 8/19/18 |
| | 6 | northeast/jobs/maths-teacher-4/ | 404 | 8/24/18 |
| | 7 | qfscmpp-z3539-i967-mw/ | 404 | 8/29/18 |
| | 8 | manchester/jobs/history-teacher/ | 404 | 8/5/18 |
| | 9 | eastmidlands/jobs/geography-teacher-4/ | 404 | 8/30/18 |
| | 10 | resources | 404 | 8/26/18 |
| | 11 | blog/2016/03/01/world-book-day-how-can-you-get-your-pupils-involved/ | 404 | 8/31/18 |
| | 12 | onxhtltpudgjhs-z3548-i4967-mnwacunkyaduobb/ | Cheers.
Thanks in advance,
James.0 -
Https and 404 code that goes into htaccess
The 404 error code we put into htaccess files for our websites does not work correctly for our https site. We recently changed one of our http sites to https. When we went to create a 404.html page for it by creating an htaccess folder with the 404 error code in it, once we uploaded the file all of our webpages were displaying incorrectly, as if the css was not attached. The 404 code we used works successfully for our other 404.html pages for our other sites (www.telfordinc.com/404.html). However, it does not work for the https site. Below is the 404 error code we are using for our https site (currently not uploaded until pages display correctly) ErrorDocument 404 /404-error.html RewriteEngine on RewriteCond %{HTTP_REFERER} !^$ RewriteCond %{HTTP_REFERER} !^http://(www.)?privatemoneyhardmoneyloan.com/.*$ [NC] RewriteRule .(gif|jpg|js|css)$ - [F] Options +FollowSymLinks RewriteEngine On RewriteBase / RewriteCond %{HTTP_HOST} !^www.privatemoneyhardmoneyloan.com$ [NC] RewriteRule ^(.*)$ http://www.privatemoneyhardmoneyloan.com/$1 [R=301,L] So we want to know if there is a different 404 error code that goes into the htaccess file for an https vs. http? Appreciate your feedback on this issue
Technical SEO | | Manifestation0 -
What if an old site goes into PENDINGDELETE status
Hi, I have an old domain which accidentally was set as PENDINGDELETE by the registry. It's now not resolving to any ip address any more. Actually I was relocating from the old domain to a new domain. Just one month before it become PENDINGDELETE, I have submitted a "Chang of Address" in Google Webmasters Tools as well as setup the web server to 301 redirect all urls on old domain to the new domain. I have some sub-questions for this case: 1. What will happen to the effectiveness of the "Change of Address" in Google Webmasters Tool after old domain is dropped. As a domain is deleted, I have no way to maintain the verified ownership of the it in case Google asks me to reverify. 2. Suppose during last month before it's deleted, Googlebot had crawled 50% of urls on old domains, detected the 301 redirects and save them to its index. When Googlebot crawls those 50% urls again after the old domain is deleted, as those urls are not resolving to any web server, will Googlebot retain the last 301 redirects or drop the 301 redirects as well? 3. After a domain is deleted, how soon will Google purge urls on old domain from its index? Thank you. Best regards Jack Zhao
Technical SEO | | Bull1350 -
Paging Links Code - Best Way?
Currently we are using previous 1 2 3 next for our link to other inventory pages, with some variation of this javascript code javascript:__doPostBack('ctl00$phMain$dlPagesTop$ctl01$lnkPageTop','') . Can search engines even index the other pages with this javascript? Is there a better way to do this?
Technical SEO | | CFSSEO0 -
Schema coding
Hi, I was wondering if you may know if you have to keep to the and coding when adding schema code to the site. For example if I'm already using H and P tags can I add the "itemprop" to those or do they have to be in aor as in the example below: <span itemprop="name">Kenmore White 17" Microwavespan>
Technical SEO | | DragonSearch
Product description:
<span itemprop="description">0.7 cubic feet countertop microwave. Has six preset cooking categories and convenience features like Add-A-Minute and Child Lock.span> So could I code it like this? <h1 itemprop="name">Kenmore White 17" Microwaveh1>
Product description:
<p itemprop="description">0.7 cubic feet countertop microwave. Has six preset cooking categories and convenience features like Add-A-Minute and Child Lock.p> Thank you,
Etela0 -
How can I exclude display ads from robots.txt?
Google has stated that you can do this to get spiders to content only, and faster. Our IT guy is saying it's impossible.
Technical SEO | | GregBeddor
Do you know how to exlude display ads from robots.txt? Any help would be much appreciated.0 -
Coding - where to start?
I've been working in digital marketing for a few years and over the past 2, I've concentrated on getting my head around (or trying to!) the world of SEO. Everything from reading this forum, listening to webinars and re-reading the beginners guide to SEO has provided me with the confidence to put in actionable insights into my work such as on page optimisation, keyword research, link build and passing this knowledge onto co-workers However, one area I am beginning to realise is just as important is having the ability to code for yourself. Can anyone out there point me in the right direction of where to start, what I program I should concentrate on? and what I shuld focus my energies on? Any help would be great Simon
Technical SEO | | simonsw0 -
How to implement 503 status?
Hello, I want to know how to implement 503 status for a website hosted on Windows server? Will this code trigger automatically as per some specific condition or is it done manually? THanks & Regards
Technical SEO | | IM_Learner0