Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Robots.txt: how to exclude sub-directories correctly?
-
Hello here,
I am trying to figure out the correct way to tell SEs to crawl this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But because I have thousands of sub-directories in almost infinite combinations, I can't list the following rules in any manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following a correct, better and shorter way to define what I want above?
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both. You add a meta robots noindex tag and remove the page from the sitemap.
-
But Google is still free to index a link/page even if it is not included in the XML sitemap.
-
Install the Yoast WordPress SEO plugin and use it to restrict what is indexed and what is allowed in the sitemap.
-
I am using WordPress with the Enfold theme (ThemeForest).
I want some files to be accessible to Google, but they should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot).
But because of this I am not able to pass the Mobile-Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)
Is it possible to allow access but use a tag like noindex in the robots.txt file? Or is there any other way out?
-
Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks
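The result above is consistent with how Google documents rule matching: `*` matches any run of characters, a trailing `$` anchors the end of the path, the longest matching rule wins, and an allow rule wins a tie. Here is a minimal Python sketch of that logic, assuming those documented rules. The helper names `pattern_to_regex` and `is_allowed` are hypothetical, and other crawlers may resolve conflicts differently, so treat this as an illustration and not a replacement for testing in Webmaster Tools.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL path, as Google documents.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = ".*".join(re.escape(piece) for piece in core.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path: str) -> bool:
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, pattern) pairs. The longest
    matching pattern wins; on a tie, allow beats disallow.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


RULES = [("allow", "/directory/$"), ("disallow", "/directory/*")]

print(is_allowed(RULES, "/directory/"))                # True
print(is_allowed(RULES, "/directory/sub-directory/"))  # False
```

Both patterns are 12 characters long, so for `/directory/` the two rules tie and the allow rule wins. Any deeper URL fails the `$` anchor and only the disallow rule applies.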

-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime, if anyone else has more ideas about all this and can confirm it, that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory:
/* | equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored.
I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs from spidering the /directory/ index, which I want crawled?
-
As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no allow, so you don't need the first line, just:
disallow: /directory/*
You can test it out here - https://support.google.com/webmasters/answer/156449?rd=1
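Whether the single disallow rule is enough on its own can be checked with a short Python sketch. It assumes Google-style matching, where `*` matches any run of characters (including none) and a trailing `$` anchors the end of the path. The function name `rule_matches` is hypothetical and used only for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches `path`.

    Assumption: patterns match from the start of the path, '*' matches
    any run of characters (including none), and a trailing '$' anchors
    the end of the path.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(piece) for piece in core.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


print(rule_matches("/directory/*", "/directory/"))                # True
print(rule_matches("/directory/*", "/directory/sub-directory/"))  # True
print(rule_matches("/directory/", "/directory/"))                 # True
```

Under these assumptions the wildcard also matches the bare `/directory/` URL, so `disallow: /directory/*` by itself would block the index page as well as the sub-directories. That is the case the `allow: /directory/$` line is meant to cover for crawlers that support it.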