Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots.txt: how to exclude sub-directories correctly?
-
Hello here,
I am trying to figure out the correct way to tell SEs to crawls this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But with the fact I have thousands of sub-directories with almost infinite combinations, I can't put the following definitions in a manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following way a correct, better and shorter way to define what I want above:
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both. You add a meta robots to noindex and remove from the sitemap.
-
But google is still free to index a link/page even if it is not included in xml sitemap.
-
Install Yoast Wordpress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.
-
I am using wordpress, Enfold theme (themeforest).
I want some files to be accessed by google, but those should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot)
But due to this I am not able to pass Mobile Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)
Is its possible to allow access, but use a tag like noindex in the robots.txt file. Or is there any other way out.
-
Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks
-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime if anyone else has more ideas about all this and can confirm me that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory
|
/*
| equivalent to / | equivalent to / | Equivalent to "/" -- the trailing wildcard is ignored. |I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs to spider the /directory/ index as I'd like to?
-
As long as you dont have directories somewhere in /* that you want indexed then I think that will work. There is no allow so you don't need the first line just
disallow: /directory/*
You can test out here- https://support.google.com/webmasters/answer/156449?rd=1
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
H1 and Schema Codes Set Up Correctly?
Greetings: It was pointed out to me that the h1 tags on my website (www.nyc-officespace-leader.com) all had exactly the same text and that duplication may be contributing to the very low page authority for most URLs. The duplicate h1 appears in line 54-54 (see below) of the home page: www.nyc-officespace-leader.com: itemscope itemtype="http://schema.org/LocalBusiness" style="position:absolute;top:-9999em;"> <span<br>itemprop="name">Metro Manhattan Office Space</span<br> <img< p="">But the above refers to schema" so is this really duplicate H1 or is there an exception if the H1 is within a schema? Also, I was told that the company street address and city and state were set up incorrectly as part of an alt tag. However these items also appear as schema in lines 49-68 shown below: Dangerous for me to perform surgery on the code without being certain about these key items!! Could ask my developer, however they may be uncomfortable considering that they set this up in the 1st place. So the view of neutral professionals would be highly welcome! itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
Intermediate & Advanced SEO | | Kingalan1
<span<br>itemprop="streetAddress">347 5th Ave #1008
<span<br>itemprop="addressLocality">New York
<span<br>itemprop="addressRegion">NY
<span<br>itemprop="postalCode">10016<div<br>itemprop="brand" itemscope itemtype="http://schema.org/Organization">
---------------------------------------------------------------------------</div<br></span<br></span<br></span<br></span<br></img<>0 -
How do I know if I am correctly solving an uppercase url issue that may be affecting Googlebot?
We have a large e-commerce site (10k+ SKUs). https://www.flagandbanner.com. As I have begun analyzing how to improve it I have discovered that we have thousands of urls that have uppercase characters. For instance: https://www.flagandbanner.com/Products/patriotic-paper-lanterns-string-lights.asp. This is inconsistently applied throughout the site. I directed our website vendor to fix the issue and they placed 301 redirects via a rule to the web.config file. Any url that contains an uppercase character now displays as a lowercase. However, as I use screaming frog to monitor our site, I see all these 301 redirects--thousands of them. The XML sitemap still shows the the uppercase versions. We have had indexing issues as well. So I'm wondering what is the most effective way to make sure that I'm not placing an extra burden on Googlebot when they index our site? Should I have just not cared about the uppercase issue and let it alone?
Intermediate & Advanced SEO | | webrocket0 -
What do you add to your robots.txt on your ecommerce sites?
We're looking at expanding our robots.txt, we currently don't have the ability to noindex/nofollow. We're thinking about adding the following: Checkout Basket Then possibly: Price Theme Sortby other misc filters. What do you include?
Intermediate & Advanced SEO | | ThomasHarvey0 -
Meta Robot Tag:Index, Follow, Noodp, Noydir
When should "Noodp" and "Noydir" meta robot tag be used? I have hundreds or URLs for real estate listings on my site that simply use "Index", Follow" without using Noodp and Noydir. Should the listing pages use these Noodp and Noydr also? All major landing pages use Index, Follow, Noodp, Noydir. Is this the best setting in terms of ranking and SEO. Thanks, Alan
Intermediate & Advanced SEO | | Kingalan10 -
Dilemma about "images" folder in robots.txt
Hi, Hope you're doing well. I am sure, you guys must be aware that Google has updated their webmaster technical guidelines saying that users should allow access to their css files and java-scripts file if it's possible. Used to be that Google would render the web pages only text based. Now it claims that it can read the css and java-scripts. According to their own terms, not allowing access to the css files can result in sub-optimal rankings. "Disallowing crawling of Javascript or CSS files in your site’s robots.txt directly harms how well our algorithms render and index your content and can result in suboptimal rankings."http://googlewebmastercentral.blogspot.com/2014/10/updating-our-technical-webmaster.htmlWe have allowed access to our CSS files. and Google bot, is seeing our webapges more like a normal user would do. (tested it in GWT)Anyhow, this is my dilemma. I am sure lot of other users might be facing the same situation. Like any other e commerce companies/websites.. we have lot of images. Used to be that our css files were inside our images folder, so I have allowed access to that. Here's the robots.txt --> http://www.modbargains.com/robots.txtRight now we are blocking images folder, as it is very huge, very heavy, and some of the images are very high res. The reason we are blocking that is because we feel that Google bot might spend almost all of its time trying to crawl that "images" folder only, that it might not have enough time to crawl other important pages. Not to mention, a very heavy server load on Google's and ours. we do have good high quality original pictures. We feel that we are losing potential rankings since we are blocking images. I was thinking to allow ONLY google-image bot, access to it. But I still feel that google might spend lot of time doing that. **I was wondering if Google makes a decision saying, hey let me spend 10 minutes for google image bot, and let me spend 20 minutes for google-mobile bot etc.. or something like that.. , or does it have separate "time spending" allocations for all of it's bot types. I want to unblock the images folder, for now only the google image bot, but at the same time, I fear that it might drastically hamper indexing of our important pages, as I mentioned before, because of having tons & tons of images, and Google spending enough time already just to crawl that folder.**Any advice? recommendations? suggestions? technical guidance? Plan of action? Pretty sure I answered my own question, but I need a confirmation from an Expert, if I am right, saying that allow only Google image access to my images folder. Sincerely,Shaleen Shah
Intermediate & Advanced SEO | | Modbargains1 -
Should canonical links be included or excluded in a sitemap?
Our company is in the process of updating our sitemap. Should we include or exclude canonical links.
Intermediate & Advanced SEO | | WebRiverGroup0 -
Robots.txt is blocking Wordpress Pages from Googlebot?
I have a robots.txt file on my server, which I did not develop, it was done by the web designer at the company before me. Then there is a word press plugin that generates a robots.txt file. How Do I unblock all the wordpress pages from googlebot?
Intermediate & Advanced SEO | | ENSO0