I can't crawl the archive of this website with Screaming Frog
-
Hi
I'm trying to crawl this website (http://zeri.info/) with Screaming Frog but because of some technical issue with their site (i can't find what is causing it) i'm able to crawl only the first page of each category (ex. http://zeri.info/sport/) and then it will go to crawl each page of their archive (hundreds of thousands of pages) but it won't crawl the links inside these pages.
Thanks a lot!
-
I think the issue comes from the way you handle the pagination and or the way your render archived pages.
Example: First archive page of Aktualehttp://zeri.info/arkiva/?formkey=7301c1be1634ffedb1c3780e5063819b6ec19157&acid=aktuale
Clicking on page 2 adds the date
http://zeri.info/arkiva/?from=2016-06-01&until=2016-06-16&acid=aktuale&formkey=cc0a40ca389eb511b1369a9aa9da915826d6ca44&faqe=2#archive-results => I assume that you're only listing the articles published from June 1st till today.
If I check all the different section & the number of articles listed in each archive I get approx. 1200 pages - add some additional pages linked on these pages and you get to the 2K pages you mentioned.
There seems to be no possibility to reach the previously published content without executing a search - which Screaming Frog can't do. It's quite possible that this is causing issues for Google bot as well so I would try to fix this.
If you really want to crawl the full site in the mean time - add another rule in url rewriting - this time selecting 'regex replace' -
add regex: from=2016-06-01
replace regex from=2010-01-01 (replace by the earliest date of publishing)This way - the system will call url http://zeri.info/arkiva/?from=2010-06-01&until=2016-06-16&acid=kultura&formkey=5932742bd5dd77799524ba31b94928810908fc07&faqe=2 rather than the original one - listing all the articles instead of only the june articles.
Hope this helps.
Dirk
-
I can't make it work. After removing 'fromkey' parameter i was able to crawl 1.7k and it stopped there. The site has more than 400k pages so .. something must be wrong
I want to crawl only the root domain without subdomains and all i can crawl is around 2k pages.
Do you have any idea what might be happening?
-
Great it worked. Just a small note - if Screaming Frog is getting confused by all these parameters, it could well be that Googlebot (while more sophisticated) is also having these issues. You could consider to exclude the formkey parameter in the Search Console (Crawl > URL parameters)
DIrk
-
Dirk, thanks a lot.
I just added "formkey" to be removed as a parameter and it seems to be working. I crawled 1k pages until now and i'm going to monitor how it goes.
The site has more than 400k pages so the process to crawl them all will take time (and i'm going to have to crawl each sector so i can create sitemaps for them).
Thanks again
Gjergj -
In the menu 'url rewriting' you can simply put the parameters the site should ignore (like date, formkey,..). I removed the formkey parameter and I checked the pages of the archive in Screaming Frog.
It is clearly able to detect all the internal links on the page - so will crawl them.
How are you certain that the pages below are not crawled - could you give a specific example of page that should be crawled but isn't?
Dirk
-
I've tried changing settings to respect noindex & canonical .. it will stop crawling the archive pages but still it won't crawl the links inside those pages. (i've added NOINDEX, FOLLOW in all archive pagination pages)
What do you mean by rewriting the url to ignore the formkey? How do you think it should be.
Gjergji
-
It think Screaming Frog is going nuts on the formkey value in the url which is constantly changing when changing pages.
Could you modify the settings of the spider to respect noindex & respect canonical - looks like this is solving the issue.
Alternatively you could rewrite the url to ignore the formkey (remove parameter)
Dirk
-
Hi Logan
I've tried going back to default configuration but it didn't help .. still i don't believe Screaming Frog is to blame, i think there is something wrong with the way the site has been developed (they are using a custom CMS) .. but i can't find the reason why this is happening. As soon as i find the solution then i can ask the guys who developed this site to make the necessary changes.
Thanks a lot.
-
Hi Dirk
Thanks a lot for replying back. The issue is that Screaming Frog is crawling the archive pages (like these examples) but it won't crawl the articles that are listed inside these pages.
The hierarchy of the site goes like this:
Homepage
- Categories (with about 20 articles in them)
- Archive of that category (with all the remaining articles, which in this case means thousands since they are a news website)Screaming Frog will crawl the homepage and categories ... but after it goes to the archive it won't crawl the articles inside archive, instead it will only crawl the pages (pagination) of that archive.
Thanks again.
-
Try going to File > Default Conif > Clear Default Configuration. This happens to me sometimes as well as I've edited settings over time. Clearing it out and going back to default settings is usually quicker than clicking through the settings to identify which one is causing the problem.
-
Did you put in some special filters - just tried to crawl the site & it seems to work just fine?
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can't get my site recognised for keyword
My site prettycool.co.uk and primary we sell fascinators, the problem is I can't get the word fascinators to be listed by Google. We are on the 1st page for most colours ie. pink fascinators, blue fascinators etc. but for the term fascinators even if we fetch we are listed for a couple of hours and then disappear. I've checked for keyword stuffing but our site sell fascinators and we need to have this word in our site and other sites have a lot more references to the term and are listed on the 1st or 2nd pages. We used to be listed on page 1 for many years but the last 2 or 3 years dropped back to page 4 but now nothing. Any help or suggestions would be fantastic!
Technical SEO | | Rutts0 -
Mobile first - what about content that you don't want to display on mobile?
ANOTHER mobile first question. Have searched the forum and didn't see something similar. Feel free to passive- aggressively link to an old thread. TL;DR - Some content would just clutter the page on mobile but is worth having on desktop. Will this now be ignored on desktop searches? Long form: We have a few ecommerce websites. We're toying with the idea of placing a lot more text on our collection/category pages. Primarily to try and set the scene for our products and sell the company a bit more effectively. It's also, obviously, an opportunity to include a couple of long tail keywords. Because mobile screens are small (duh) and easily cluttered, we're inclined _not _to display this content on mobile. In this case; will any SEO benefit be lost entirely, even to searchers on desktop? Sorry if I've completely misunderstood mobile-first indexing! Just an in-house marketing manager trying to keep up! cries into keyboard Thanks for your time.
Technical SEO | | MSGroup
Ross0 -
Can you help me figure out why my website is in conflict with guidelines of Google ?
Hi, Since few weeks now we received this message saying that our website is in conflict with the guidelines of Google's Webmaster. Here is the website for which we received the message from Google: http://www.gocustomized.es/. This url redirect to https://www.gocustomized.es/ I thought after reading your some messages from the forum that the problem was our website reviews which appear on all pages of the site. And I know that the review of the websites shouldn't be considered like the review of the product. But we remove the reviews and our request is still declined. Thanks in advance for your help.
Technical SEO | | steph_ba0 -
Can someone evaluate this page so I can continue adding others?
Hi, I am adding a bunch of similar category stickers and I am not looking into that good SEO for these since there will be hundreds of them coming but I just want to include the relevant keywords that people perhaps use in the Google image search to take them to our site. They are all related to JDM (Japanese Domestic Motors) so I decided to include JDM at the end of all the SEO titles. I am writing totally different short descriptions for all of these stickers and the Related Products are changing as well. I just want to achieve something like Amazon or eBay listings do - not the perfect SEO since I cannot spend too much time with each sticker optimizing it but I don't want to NOINDEX, FOLLOW them either - hence the different related products for all items and also unique short descriptions. If you check one of the pages: http://www.redrockdecals.com/rising-sun-wakaba-leaf-sticker-red-black-jdm Do you think I should be in the safe side so I don't hurt my overall SEO? Thanks!!
Technical SEO | | speedbird12290 -
How to properly change your website's address in Webmaster Tools?
Hi There,We've launched a new website and as part of the update have changed our domain name - now we need to tell Google of the changes: Both sites were verified in Webmaster Tools From the old site's gear icon, we chose "Change of address" As part of the "Change of address" checklist Google presented, we added 301 redirects to redirect the old domain to the new one But now that the 301 redirects are in place, Google can no longer verify the old site And because it can no longer verify the old site, Google won't let us complete the change of address form How do we tell Google of the change of address in this instance - and has anyone else encountered this?CheersBen
Technical SEO | | cmscss0 -
New Page Showing Up On My Reports w/o Page Title, Words, etc - However, I didn't create it
I have a WordPress site and I was doing a crawl for errors and it is now showing up as of today that this page : https://thinkbiglearnsmart.com/event-registration/?event_id=551&name_of_event=HTML5 CSS3 is new and has no page title, words, etc. I am not even sure where this page or URL came from. I was messing with the robots.txt file to allow some /category/ posts that were being hidden, but I didn't re-allow anything with the above appendages. I just want to make sure that I didn't screw something up that is now going to impact my rankings - this was just a really odd message to come up as I didn't create this page recently - and that shouldnt even be a page accessible to the public. When I edit the page - it is using an Event Espresso (WordPress plugin) shortcode - and I don't want to noindex this page as it is all of my events. Sorry this post is confusing, any help or insight would be appreciated! I am also interested in hiring someone for some hourly consulting work on SEO type issues if anyone has any references. Thank you!
Technical SEO | | webbmason0 -
I have a penalized site and don't know what the cause is
I have a site which appears to have a Google indexation penalty. According to Google because its violating the T/Cs. Here are some background details about the site: The site is a online poker + deposit methods related site on a .co.uk TLD. It has 30+ uniquely written pages, and no advertising at the moment. In June of 2010, June 10 to be precisely, I bought this site from a fellow webmaster/affiliate. After the site 's ownership changed I tried accessing the server, but I couldn't log into it . I noticed that this host had serious problems and the IP was unreachable. After trying for some time the previous owner got me all the content in Word files and I created a new hosting account and re-launched the site on June 28. Between a couple of days after June 10 and June 28, the site was unreachable, and completely de-indexed from Google. When I re-launched the site, I used the default Wordpress Template Twenty Ten, and created new pages with the Word files I received from the previous owner. I waited a bit, but noticed the site didn't get re-indexed. So on August 18th I moved the content of domain xxx.com to yyy.co.uk/xxx/ and 301-ed all the former locations, hoping that this might help yyy.co.uk get indexed..... but nothing. On October 28 of 2010 I submitted my first reconsideration request, which was processed on November 17th without any change. At that time Google didn't say if anything was wrong like now, so I just waited... and waited... and waited some more. At some point I was ready to let this one go, as I didn't/don't see any problems with it. In fact, it used to be indexed before. By now, I removed all links pointing to it that I had control off, and there are hardly any left over. The site as well doesn't have any outgoing links left, so that can't be it either. I also removed a kind-a duplicate keyword heavy menu from the sidebar, as well as the widgets from the footer. Finally I also fixed a problem caused by Yoast Wordpress SEO Plugin, but I only installed this plugin recently, so that could not be the problem that caused the penalty. So after another reconsideration request Google again let me know this site still has issues, but I really have no clue which, or how to find out. I don't feel like doing any work on this site, as there is no guarantee that it will ever lose its penalty. What should I do now?
Technical SEO | | VisualSense0 -
Issue with 'Crawl Errors' in Webmaster Tools
Have an issue with a large number of 'Not Found' webpages being listed in Webmaster Tools. In the 'Detected' column, the dates are recent (May 1st - 15th). However, looking clicking into the 'Linked From' column, all of the link sources are old, many from 2009-10. Furthermore, I have checked a large number of the source pages to double check that the links don't still exist, and they don't as I expected. Firstly, I am concerned that Google thinks there is a vast number of broken links on this site when in fact there is not. Secondly, why if the errors do not actually exist (and never actually have) do they remain listed in Webmaster Tools, which claims they were found again this month?! Thirdly, what's the best and quickest way of getting rid of these errors? Google advises that using the 'URL Removal Tool' will only remove the pages from the Google index, NOT from the crawl errors. The info is that if they keep getting 404 returns, it will automatically get removed. Well I don't know how many times they need to get that 404 in order to get rid of a URL and link that haven't existed for 18-24 months?!! Thanks.
Technical SEO | | RiceMedia0