I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
How to use Google search console's 'Name change' tool?
Hi There, I'm having trouble performing a 'Name change' for a new website (rebrand and domain change) in Google Search console. Because the 301 redirects are in place (a requirement of the name change tool), Google can no longer verify the site, which means I can't complete the name change? To me, step two (301 redirect) conflicts with step there (site verification) - or is there a way to perform a 301 redirect and have the tool verify the old site? Any pointers in the right direction would be much appreciated. Cheers Ben
Technical SEO | | cmscss0 -
Brushing up on my SEO skills - how do I check my website to see if Javascript is blocking search engines from crawling the links within a javascript-enabled drop down menu?
I set my user agent in my Chrome browser to Googlebot and I disable javascript within my Chrome settings, but then what?
Technical SEO | | MagnitudeSEO0 -
Site address change: new site isn't showing up in Google, old site is gone.
We just transitioned mccacompanies.com to confluentstrategies.com. The problem is that when I search for the old name, the old website doesn't come up anymore to redirect people to the new site. On the local card, Google has even taken off the website altogether. (I'm currently still trying to gain access to manage the business listing) When I search for confluent strategies, the website doesn't come up at all. But if I use the site: operator, it is in the index. Basically, my client has effectively disappeared off the face of the Google. (In doing other name changes, this has never happened to me before) What can I do?
Technical SEO | | MichaelGregory0 -
How can I get the most out of uploading a print magazine to my client's website?
Hi Mozers, My client is just about to launch a print magazine for her watch business. There is so much valuable content in the magazine and we want to feature it on the website both for SEO purposes and also for those who prefer to read articles online instead of reading a physical magazine. My question is: what is the best method of displaying the magazine to get the most from search rankings and also to capitalise on the beautiful imagery from the magazine. The best option that I can think of is to upload the magazine as a flipbook and create a separate page on the website to display each article so that search engine crawlers can index the content. I do understand that this could be problematic if users are only spending time reading the flipbook and not so much time on the article pages. Do you guys have any suggestions about how to get the most out of this opportunity for my client? THANK YOU IN ADVANCE. Meaghan
Technical SEO | | StoryScout0 -
Optimizing Flash websites
Does anyone know the latest on how well Google is indexing flash sites? I'm talking to a prospect now whose site is 100% flash, but looking at it all the content within the site is indexed and returning in the results. This is promising enough, but does anyone know exactly how well Google is indexing flash content these days and are there still any problems with that? Thanks!
Technical SEO | | MattBarker0 -
Why this page doesn't get indexed?
Hi, I've just taken over development and SEO for a site and we're having difficulty getting some key pages indexed on our site. They are two clicks away from the homepage, but still not getting indexed. They are recently created pages, with unique content on. The architecture looks like this:Homepage >> Car page >> Engine specific pageWhenever we add a new car, we link to its 'Car page' and it gets indexed very quickly. However the 'Engine pages' for that car don't get indexed, even after a couple of weeks. An example of one of these index pages are - http://www.carbuzz.co.uk/car-reviews/Volkswagen/Beetle-New/2.0-TSISo, things we've checked - 1. Yes, it's not blocked by robots.txt2. Yes, it's in the sitemap (http://www.carbuzz.co.uk/sitemap.xml)3. Yes, it's viewable to search spiders (e.g. the link is present in the html source)This page doesn't have a huge amount of unique content. We're a review aggregator, but it still does have some. Any suggestions as to why it isn't indexed?Thanks, David
Technical SEO | | soulnafein0 -
Pictures 'being stolen'
Helping my wife with ecommerce site. Selling clothes. Some photos are given by producer, but at times they are not too good. Some are therefore taking their own photos and i suspect ppl are copying them and using them on their own site. Is there anyting to do about this - watermarking of course, but can they be 'marked' in anyway linking to your site ?
Technical SEO | | danlae0 -
Removing pages from website
Hello all, I am fairly new to the SEOmoz community. But i am working for a company which organizes exhibitons, events and training in Holland. A lot of these events are only given ones ore twice and then we do not organise them any more because they are no longer relevant. Every event has its own few webpages which provide information about the event and are being indexed by Google. In the past we did not remove any of these events. I was looking in the CMS and saw a lot of events of 2008 and older which are being indexed. To clean the website and the CMS i am thinking of removing these pages of old events. The risk is that these pages have some links to them and are getting some traffic, so if i remove them there is a risk of losing traffic and rankings. What would be the wise thing to do? Make a folder with archive or something? Regards, Ruud
Technical SEO | | RuudHeijnen0