I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Please who can help me rank my website www.shopdocuments.com ?
Hi , i bought my website this month , i built it myself now i want it to grow seo on google pages and stuff tried google ads but they always block my account don't know why so somebody help me please any company? individual willing to work it for me ? even if paid company no problem
Technical SEO | | planetdocs1 -
Can we re rank our Penalyzed website in Google?
Hello This is Maqbul, from India. I have a jobs portal blog [ bharatrecruit.com]. It was getting around 50K to 100K Views a Day and made me $100 a day. But after a few months, my competitor made negative SEO with 12,000 Spammy backlinks. Suddenly my site was hit by Google and now it is getting 200 to 300 Pageviews a day. So the question is I did not disavow bad links for a long time like 3 to 4 months. Now I disavow all the bad links but the website is not ranking. Can we re-rank this site or create another website. Please reply must. None of the bloggers can answer this. Thanks, Regards Maqbul
Technical SEO | | vinaso960 -
My Website's Home Page is Missing on Google SERP
Hi All, I have a WordPress website which has about 10-12 pages in total. When I search for the brand name on Google Search, the home page URL isn't appearing on the result pages while the rest of the pages are appearing. There're no issues with the canonicalization or meta titles/descriptions as such. What could possibly the reason behind this aberration? Looking forward to your advice! Cheers
Technical SEO | | ugorayan0 -
'sameAs' Mark up for different spellings of a Product/Keyword, is it possible?
Hi There, I've seen that for social media profiles you can mark them up to be the 'sameAs', example below: - <code><scripttype="application ld+json"="">{ "@context":"http://schema.org", "@type":"Organization", "name":"Your Organization Name", "url":"http://www.your-site.com", "sameAs":[ "http://www.facebook.com/your-profile", "http://www.twitter.com/yourProfile", "http://plus.google.com/your_profile" ] }</scripttype="application></code> My question is can you do something similar for your product/keyword? For example when you can spell the word in different ways e.g. Whisky (English) or Whiskey (Irish/US). I've had a look at schema.org but I'm not sure if I'm headed down the wrong path? Thanks
Technical SEO | | Jon-S0 -
We just can't figure out the right anchor text to use
We have been trying everything we can with anchor text. We have read here that we should try naturalistic language. Our competitors who are above us in Google search results don't do any of this. They only use their names or a single term like "austin web design". Is what we are doing hurting our listings? We don't have any black hat links. Here's what we are doing now. We are going crazy trying to figure this out. We are afraid to do anything in fear it will damage our position. Bob | pallasart web design | 31 | 1,730 |
Technical SEO | | pallasart
| website by pallasart a texas web design company in austin | 15 | 1,526 |
| website by the austin design company pallasart | 14 | 1,525 |
| created by pallasart a web design company in austin texas | 13 | 1,528 |
| created by an austin web design company pallasart | 12 | 1,499 |
| website by pallasart web design an austin web design company | 12 | 1,389 |
| website by pallasart an austin web design company | 11 | 1,463 |
| pallasart austin web design | 9 | 2,717 |
| website created by pallasart a web design company in austin texas | 9 | 1,369 |
| website by pallasart | 8 | 910 |
| austin web design | 5 | 63 |
| pallasart website design austin |0 -
Are sidewide badge links can harm your website?
Hey all, I wanted to check if links that have built naturally over the past years, linking from a badge (image) sitewide, can harm the linked website? Here is some more information: 1. It's from a competition that the winners were able to add the badge with the link to their site (the link to our website was to a subpage, not homepage). 2. There are around 15 websites with the badge as a link. The website has around 200 root domain links. There will not be any more websites with the badge, just these 15. 3. The sitewide links percentage are 5% of the overall number of pages linked to our website. Based on the last penguin update (4th of October, 2013), can our website be harmed from the badge link building?
Technical SEO | | stevanl0 -
Ecommerce website: Product page setup & SKU's
I manage an E-commerce website and we are looking to make some changes to our product pages to try and optimise them for search purposes and to try and improve the customer buying experience. This is where my head starts to hurt! Now, let's say I am selling a T shirt that comes in 4 sizes and 6 different colours. At the moment my website would have 24 products, each with pretty much the same content (maybe differing references to the colour & size). My idea is to change this and have 1 main product page for the T-shirt, but to have 24 product SKU's/variations that exist to give the exact product details. Some different ways I have been considering to do this: a) have drop-down fields on the product page that ask the customer to select their Tshirt size and colour. The image & price then changes on the page. b) All product 24 product SKUs sre listed under the main product with the 'Add to Cart' open next to each one. Each one would be clickable so a page it its own right. Would I need to set up a canonical links for each SKU that point to the top level product page? I'm obviously looking to minimise duplicate content but Im not exactly sure on how to set this up - its a big decision so I need to be 100% clear before signing off on anything. . Any other tips on how to do this or examples of good e-commerce websites that use product SKus well? Kind regards Tom
Technical SEO | | DHS_SH0 -
Walking into a site I didn't build, easy way to fix this # indexing problem?
I recently joined a team with a site without a) Great content b) Not much of any search traffic I looked and all their url's are built in this way: Normal looking link -> not actually a new page but # like: /#content-title And it has no h1 tag. Page doesn't refresh. My initial thought is to gut the site and build it in wordpress, but first have to ask, is there a way to make a site with /#/ content loading friendly to search engines?
Technical SEO | | andrewhyde0