Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Can't crawl website with Screaming frog... what is wrong?
-
Hello all - I've just been trying to crawl a site with Screaming Frog and can't get beyond the homepage - have done the usual stuff (turn off JS and so on) and no problems there with nav and so on- the site's other pages have indexed in Google btw.
Now I'm wondering whether there's a problem with this robots.txt file, which I think may be auto-generated by Joomla (I'm not familiar with Joomla...) - are there any issues here? [just checked... and there isn't!]
If the Joomla site is installed within a folder such as at
e.g. www.example.com/joomla/ the robots.txt file MUST be
moved to the site root at e.g. www.example.com/robots.txt
AND the joomla folder name MUST be prefixed to the disallowed
path, e.g. the Disallow rule for the /administrator/ folder
MUST be changed to read Disallow: /joomla/administrator/
For more information about the robots.txt standard, see:
http://www.robotstxt.org/orig.html
For syntax checking, see:
http://tool.motoricerca.info/robots-checker.phtml
User-agent: *
Disallow: /administrator/
Disallow: /bin/
Disallow: /cache/
Disallow: /cli/
Disallow: /components/
Disallow: /includes/
Disallow: /installation/
Disallow: /language/
Disallow: /layouts/
Disallow: /libraries/
Disallow: /logs/
Disallow: /modules/
Disallow: /plugins/
Disallow: /tmp/ -
For anyone wondering; The answer above by Ecommerce Site (odd name btw) works - 21-Nov-2016.
- topic:timeago_earlier,2 years
-
This is the best I could find to so someone who had a similar problem with Joomla-
"In the premium version you can slow down the crawl rate under 'speed' in the configuration. In the free lite version, you can crawl the site and then right click on any URLs with a 403 response and press 're-spider'. The server will generally then allow you to crawl these pages (and return a 200 ok response) as you're not requesting too many at once, so you might have to re-spider them individually."
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Can Google Crawl & Index my Schema in CSR JavaScript
We currently only have one option for implementing our Schema. It is populated in the JSON which is rendered by JavaScript on the CLIENT side. I've heard tons of mixed reviews about if this will work or not. So, does anyone know for sure if this will or will not work. Also, how can I build a test to see if it does or does not work?
Intermediate & Advanced SEO | Jan 27, 2020, 2:39 PM | MJTrevens0 -
Can't generate a sitemap with all my pages
I am trying to generate a site map for my site nationalcurrencyvalues.com but all the tools I have tried don't get all my 70000 html pages... I have found that the one at check-domains.com crawls all my pages but when it writes the xml file most of them are gone... seemingly randomly. I have used this same site before and it worked without a problem. Can anyone help me understand why this is or point me to a utility that will map all of the pages? Kindly, Greg
Intermediate & Advanced SEO | Dec 7, 2016, 11:52 PM | Banknotes0 -
Duplicate Content through 'Gclid'
Hello, We've had the known problem of duplicate content through the gclid parameter caused by Google Adwords. As per Google's recommendation - we added the canonical tag to every page on our site so when the bot came to each page they would go 'Ah-ha, this is the original page'. We also added the paramter to the URL parameters in Google Wemaster Tools. However, now it seems as though a canonical is automatically been given to these newly created gclid pages; below https://www.google.com.au/search?espv=2&q=site%3Awww.mypetwarehouse.com.au+inurl%3Agclid&oq=site%3A&gs_l=serp.3.0.35i39l2j0i67l4j0i10j0i67j0j0i131.58677.61871.0.63823.11.8.3.0.0.0.208.930.0j3j2.5.0....0...1c.1.64.serp..8.3.419.nUJod6dYZmI Therefore these new pages are now being indexed, causing duplicate content. Does anyone have any idea about what to do in this situation? Thanks, Stephen.
Intermediate & Advanced SEO | Oct 20, 2015, 9:00 PM | MyPetWarehouse0 -
Alt tag for src='blank.gif' on lazy load images
I didn't find an answer on a search on this, so maybe someone here has faced this before. I am loading 20 images that are in the viewport and a bit below. The next 80 images I want to 'lazy-load'. They therefore are seen by the bot as a blank.gif file. However, I would like to get some credit for them by giving a description in the alt tag. Is that a no-no? If not, do they all have to be the same alt description since the src name is the same? I don't want to mess things up with Google by being too aggressive, but at the same time those are valid images once they are lazy loaded, so would like to get some credit for them. Thanks! Ted
Intermediate & Advanced SEO | Dec 29, 2014, 2:38 PM | friendoffood0 -
Would you rate-control Googlebot? How much crawling is too much crawling?
One of our sites is very large - over 500M pages. Google has indexed 1/8th of the site - and they tend to crawl between 800k and 1M pages per day. A few times a year, Google will significantly increase their crawl rate - overnight hitting 2M pages per day or more. This creates big problems for us, because at 1M pages per day Google is consuming 70% of our API capacity, and the API overall is at 90% capacity. At 2M pages per day, 20% of our page requests are 500 errors. I've lobbied for an investment / overhaul of the API configuration to allow for more Google bandwidth without compromising user experience. My tech team counters that it's a wasted investment - as Google will crawl to our capacity whatever that capacity is. Questions to Enterprise SEOs: *Is there any validity to the tech team's claim? I thought Google's crawl rate was based on a combination of PageRank and the frequency of page updates. This indicates there is some upper limit - which we perhaps haven't reached - but which would stabilize once reached. *We've asked Google to rate-limit our crawl rate in the past. Is that harmful? I've always looked at a robust crawl rate as a good problem to have. Is 1.5M Googlebot API calls a day desirable, or something any reasonable Enterprise SEO would seek to throttle back? *What about setting a longer refresh rate in the sitemaps? Would that reduce the daily crawl demand? We could set increase it to a month, but at 500M pages Google could still have a ball at the 2M pages/day rate. Thanks
Intermediate & Advanced SEO | Jun 29, 2015, 5:23 AM | lzhao0 -
Is there any negative SEO effect of having comma's in URL's?
Hello, I have a client who has a large ecommerce website. Some category names have been created with comma's in - which has meant that their software has automatically generated URL's with comma's in for every page that comes beneath the category in the site hierarchy. eg. 1 : http://shop.deliaonline.com/store/music,-dvd-and-games/dvds-and-blu_rays/ eg. 2 : http://shop.deliaonline.com/store/music,-dvd-and-games/dvds-and-blu_rays/action-and-adventure/ etc... I know that URL's with comma's in look a bit ugly! But is there 'any' SEO reason why URL's with comma's in are any less effective? Kind Regs, RB
Intermediate & Advanced SEO | Jul 3, 2012, 5:30 AM | RichBestSEO0 -
Any way to find which domains are 301 redirected to competitors' websites?
By looking at the work from an SEO collegue it became clear that his weak linkbuilding graph probably is not the cause for his good rankings for a pretty competitive keyword. (also no social mentions where found) I was wondering what it could be, site structure and other on page optimization factors seems to be ok and I don't think there will be exceptionally good or bad user behavior... Finally I looked at the competitors and found that they have more links, better content en better design, so I got a little stuck. The only reason I can think of is that he is doing 301 redirects (or is rel=canonical tags). Is there a way to trace these redirects back to the source in order to include this important variable in your competitor research? thnx
Intermediate & Advanced SEO | Oct 30, 2011, 3:52 AM | djingel10