Exclude status codes in Screaming Frog
-
I have a very large ecommerce site that I'm trying to spider using Screaming Frog. The problem is that the crawl keeps hanging, even though I have turned off the high memory safeguard under Configuration.
The site has approximately 190,000 pages according to the results of a Google site: command.
- The site architecture is almost completely flat. Limiting the crawl by depth is a possibility, but it would take quite a bit of manual labor, as there are literally hundreds of directories one level below the root.
- There are many, many duplicate pages. I've been able to exclude some of them from being crawled using the exclude configuration parameters.
- There are thousands of redirects. I haven't been able to exclude those from the spider b/c they don't have a distinguishing character string in their URLs.
Does anyone know how to exclude files using status codes? I know that would help.
If it helps, the site is kodylighting.com.
Thanks in advance for any guidance you can provide.
-
Thanks for your help. It was literally just the fact that the exclude had to be set before the crawl began and could not be changed during the crawl. Hopefully this will change, because during a crawl you sometimes find things you want to exclude that you didn't know existed beforehand.
-
Are you sure it's just on Mac? Have you tried on PC? Do you have any other rules in include, or perhaps a conflicting rule in exclude? Try running a single exclude rule, and also try it on another small site to test.
Also from support if failing on all fronts:
- Mac version, please make sure you have the most up to date version of the OS which will update Java.
- Please uninstall, then reinstall the spider ensuring you are using the latest version and try again.
To be sure - http://www.youtube.com/watch?v=eOQ1DC0CBNs
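If you want to sanity-check a single exclude rule before starting a crawl, here is a minimal Python sketch. It assumes an exclude rule is a regular expression that has to match the whole URL, which is my understanding of how the exclude behaved; check the user guide for your version. The example.com URLs and both patterns are made up for illustration.

```python
import re

# Hypothetical exclude rules, one regular expression per entry.
# Assumption: a rule only excludes a URL when it matches the WHOLE URL.
EXCLUDE_PATTERNS = [
    r"http://www\.example\.com/clearance/.*",
    r".*\?sort=.*",
]


def is_excluded(url, patterns=EXCLUDE_PATTERNS):
    """Return True if any pattern matches the entire URL."""
    return any(re.fullmatch(pattern, url) for pattern in patterns)


def split_urls(urls, patterns=EXCLUDE_PATTERNS):
    """Split a list of URLs into (kept, dropped) under the given rules."""
    kept = [u for u in urls if not is_excluded(u, patterns)]
    dropped = [u for u in urls if is_excluded(u, patterns)]
    return kept, dropped
```

Under that whole-URL assumption, a bare rule such as `/clearance/` excludes nothing, because it matches only part of the URL. That kind of mismatch is worth ruling out before reinstalling anything.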
-
Does the exclude function work on Mac? I have tried every possible way to exclude folders and have not been successful while running an analysis.
-
That's exactly the problem: the redirects are dispersed randomly throughout the site. Although the job's still running, it now appears as though there's almost a 1-to-1 correlation between pages and redirects on the site.
I also heard from Dan Sharp via Twitter. He said "You can't, as we'd have to crawl a URL to see the status code
You can right click and remove after though!"
Thanks again, Michael. Your thoroughness and follow-through are appreciated.
-
I took another look, and also checked the documentation and online, and I don't see any way to exclude URLs from a crawl based on response codes. As I see it, you would only want to exclude by name or directory, since response codes are likely to be scattered randomly throughout a site and excluding on them would impede a thorough crawl.
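Since the status code is only known once a URL has been crawled, the filtering has to happen after the fact. As a rough sketch, assuming you export the crawl to CSV and that the export has "Address" and "Status Code" columns (those column names are an assumption, so adjust them to match your file), something like this Python would drop the redirected rows:

```python
import csv
import io


def is_redirect(code_text):
    """Return True for a 3xx status code given as text, e.g. '301'."""
    code_text = code_text.strip()
    return code_text.isdigit() and 300 <= int(code_text) <= 399


def drop_redirects(csv_text, status_column="Status Code"):
    """Return the rows of a crawl export whose status is not a 3xx redirect.

    The default column name is an assumption about the export layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if not is_redirect(row.get(status_column) or "")
    ]
```

To use it, read the exported file into a string and pass it to `drop_redirects`. Rows with a blank or non-numeric status are kept, so nothing is silently lost.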
-
Thank you Michael.
You're right. I was on a 64-bit machine running a 32-bit version of Java. I updated it and the scan has been running for more than 24 hours now without hanging. So thank you.
If anyone else knows of a way to exclude files using status codes I'd still like to learn about it. So far the scan is showing me 20,000 redirected files which I'd just as soon not inventory.
-
I don't think you can filter out on response codes.
However, first I would make sure you are running the right version of Java if you are on a 64-bit machine. The 32-bit version works, but you cannot increase the memory allocation with it, which could be why you are running into problems. Take a look at http://www.screamingfrog.co.uk/seo-spider/user-guide/general/ under Memory.
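A quick way to see which Java you have is to run `java -version` and look for "64-Bit" in the banner. Here is a small Python sketch of that check. It assumes a HotSpot or OpenJDK style banner, which names the VM as "64-Bit" on 64-bit builds; other JVMs may word it differently, so treat a negative result as a hint rather than proof.

```python
import subprocess


def is_64bit_jvm(version_output):
    """Return True if a `java -version` banner mentions a 64-bit VM.

    Assumption: the banner contains the text '64-Bit' on 64-bit builds.
    """
    return "64-Bit" in version_output


def installed_java_version_output():
    """Return the banner from the `java` found on the PATH."""
    # `java -version` writes its banner to stderr, not stdout.
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    )
    return result.stderr
```

Calling `is_64bit_jvm(installed_java_version_output())` checks the Java on your PATH, which may not be the one the spider actually uses if several are installed.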