Exclude status codes in Screaming Frog
-
I'm trying to spider a very large ecommerce site using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider because they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. The issue was simply that the exclude had to be set up before the crawl began and could not be changed during the crawl. Hopefully this gets changed, because during a crawl you sometimes find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on a PC? Do you have any other rules in Include, or perhaps a conflicting rule in Exclude? Try running a single exclude rule, and also try it on another small site to test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job is still running, it now appears as though there's almost a one-to-one correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code
You can right click and remove after though!"
Thanks again, Michael. Your thoroughness and follow-through are appreciated.
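For anyone who would rather do the "remove after" step outside the tool, here is a rough sketch of filtering redirects out of an exported crawl. It assumes the crawl has been exported to CSV with columns named "Address" and "Status Code"; those column names, the function name, and the sample URLs are my assumptions for illustration, so check them against your own export.

```python
import csv
import io


def drop_redirects(rows, status_field="Status Code"):
    """Return the rows whose status code is not in the 3xx range.

    The column name is an assumption; adjust it to match your export.
    Rows with a missing or non-numeric status are kept untouched.
    """
    kept = []
    for row in rows:
        try:
            code = int(row[status_field])
        except (KeyError, TypeError, ValueError):
            kept.append(row)
            continue
        if not 300 <= code < 400:
            kept.append(row)
    return kept


# Hypothetical sample standing in for an exported crawl file.
sample = io.StringIO(
    "Address,Status Code\n"
    "http://example.com/,200\n"
    "http://example.com/old-page,301\n"
    "http://example.com/missing,404\n"
)
rows = list(csv.DictReader(sample))
kept = drop_redirects(rows)
```

This only cleans up the inventory afterwards. The redirected URLs still get crawled, so it does nothing for crawl time or memory.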
-
I took another look, and also checked the documentation and searched online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory, since response codes are likely to be scattered randomly throughout a site and excluding on them would impede a thorough crawl.
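If you go the name or directory route, it can help to test your patterns against a handful of known URLs before starting a long crawl. Below is a minimal sketch that assumes the exclude rules are regular expressions that must match the whole URL. Python's regex engine is not identical to the one the spider uses, and the patterns and URLs here are made up for illustration, so treat it as a sanity check only.

```python
import re


def is_excluded(url, patterns):
    """Return True if any pattern matches the entire URL.

    Full-URL matching is an assumption about how exclude rules are applied.
    """
    return any(re.fullmatch(pattern, url) for pattern in patterns)


# Hypothetical exclude rules: one directory and one query parameter.
patterns = [
    r"http://www\.example\.com/duplicates/.*",
    r".*\?sort=.*",
]

# Hypothetical URLs to check the rules against.
urls = [
    "http://www.example.com/duplicates/lamp-1",
    "http://www.example.com/lamps?sort=price",
    "http://www.example.com/lamps",
]

results = {url: is_excluded(url, patterns) for url in urls}
```

With these made-up inputs, the first two URLs would be excluded and the third would still be crawled.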
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it, and the scan has now been running for more than 24 hours without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would make sure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase its memory allocation, which is why you could be running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.