Robots.txt: how to exclude sub-directories correctly?
-
Hello there,
I am trying to figure out the correct way to tell search engines to crawl this:
http://www.mysite.com/directory/
But not this:
http://www.mysite.com/directory/sub-directory/
or this:
http://www.mysite.com/directory/sub-directory2/sub-directory/...
But since I have thousands of sub-directories with almost infinite combinations, I can't list definitions like the following in any manageable way:
disallow: /directory/sub-directory/
disallow: /directory/sub-directory2/
disallow: /directory/sub-directory/sub-directory/
disallow: /directory/sub-directory2/subdirectory/
etc...
I would end up having thousands of definitions to disallow all the possible sub-directory combinations.
So, is the following a correct, shorter way to define what I want above?
allow: /directory/$
disallow: /directory/*
Would the above work?
Any thoughts are very welcome! Thank you in advance.
Best,
Fab.
-
I mentioned both. You add a meta robots noindex tag and remove the page from the sitemap.
-
But Google is still free to index a link/page even if it is not included in the XML sitemap.
-
Install Yoast Wordpress SEO plugin and use that to restrict what is indexed and what is allowed in a sitemap.
-
I am using WordPress with the Enfold theme (ThemeForest).
I want some files to be accessed by google, but those should not be indexed.
Here is an example: http://prntscr.com/h8918o
I have currently blocked some JS directories/files using robots.txt (check screenshot)
But due to this I am not able to pass Mobile Friendly Test on Google: http://prntscr.com/h8925z (check screenshot)
Is it possible to allow access but use a tag like noindex in the robots.txt file? Or is there any other way out?
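For what it's worth, robots.txt itself has no noindex directive that Google officially supports. A common alternative is to stop blocking the JS in robots.txt (so the mobile-friendly test can render the page) and instead serve those files with an X-Robots-Tag: noindex HTTP header, which keeps them fetchable but out of the index. A minimal Python sketch of that header decision; the suffix list and paths here are illustrative assumptions, not anything specific to Enfold:

```python
# Decide extra HTTP headers so crawlers may fetch assets (needed for
# Google's mobile-friendly rendering) while keeping them out of the
# index via the X-Robots-Tag header. Suffix list is an assumption.
NOINDEX_SUFFIXES = (".js", ".css")

def extra_headers(path):
    """Return extra HTTP headers for a response serving `path`."""
    if path.lower().endswith(NOINDEX_SUFFIXES):
        # Fetchable and renderable, but excluded from the index.
        return {"X-Robots-Tag": "noindex"}
    return {}

print(extra_headers("/wp-content/themes/enfold/js/avia.js"))
print(extra_headers("/directory/"))
```

In practice you would set this header in your web server config (e.g. Apache or nginx) rather than in application code, but the rule is the same.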
-
Yes, everything looks good, Webmaster Tools gave me the expected results with the following directives:
allow: /directory/$
disallow: /directory/*
Which allows this URL:
http://www.mysite.com/directory/
But doesn't allow the following one:
http://www.mysite.com/directory/sub-directory2/...
This page also gives an update similar to mine:
https://support.google.com/webmasters/answer/156449?hl=en
I think I am good! Thanks
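For anyone who wants to sanity-check the same logic offline, here is a rough Python sketch of Google-style robots.txt pattern matching: `*` matches any run of characters, a trailing `$` anchors the end, the longest matching rule wins, and on a tie Google prefers the least restrictive (allow). This is a simplification of the documented behavior, not a full implementation:

```python
import re

def robots_pattern_to_regex(pattern):
    """Translate a Google-style robots.txt path pattern to a regex.
    '*' matches any run of characters; a trailing '$' anchors the end.
    Simplified sketch (ignores percent-encoding edge cases)."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile("^" + regex + ("$" if anchored else ""))

def is_allowed(path, rules):
    """rules: list of (directive, pattern) pairs. The longest matching
    pattern wins; on a tie, allow beats disallow."""
    matches = [(len(pat), directive) for directive, pat in rules
               if robots_pattern_to_regex(pat).match(path)]
    if not matches:
        return True  # no rule matches -> crawling is allowed
    matches.sort(key=lambda m: (m[0], m[1] == "allow"), reverse=True)
    return matches[0][1] == "allow"

rules = [("allow", "/directory/$"), ("disallow", "/directory/*")]
print(is_allowed("/directory/", rules))                # allowed
print(is_allowed("/directory/sub-directory/", rules))  # disallowed
```

This reproduces what Webmaster Tools reported: the directory index is crawlable, everything beneath it is not.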
-
Thank you Michael, it is my understanding then that my idea of doing this:
allow: /directory/$
disallow: /directory/*
Should work just fine. I will test it within Google Webmaster Tools, and let you know if any problems arise.
In the meantime, if anyone else has more ideas about all this and can confirm it, that would be great!
Thank you again.
-
I've always stuck to Disallow and followed -
"This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be disallowed into a separate directory, say "stuff", and leave the one file in the level above this directory:"
http://www.robotstxt.org/robotstxt.html
From https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt this seems contradictory: the pattern-matching table there notes that /* is equivalent to / (the trailing wildcard is ignored). I think this post will be very useful for you - http://moz.com/community/q/allow-or-disallow-first-in-robots-txt
-
Thank you Michael,
Google and other SEs actually recognize the "allow:" command:
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
The fact is: if I don't specify that, how can I be sure that the following single command:
disallow: /directory/*
Doesn't prevent SEs from spidering the /directory/ index, as I'd like?
-
As long as you don't have directories somewhere in /* that you want indexed, then I think that will work. There is no allow, so you don't need the first line - just:
disallow: /directory/*
You can test out here- https://support.google.com/webmasters/answer/156449?rd=1
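One caveat worth noting: under Google's wildcard rules, * can match the empty string, so disallow: /directory/* on its own also matches /directory/ itself - which is exactly why the allow: /directory/$ line matters. A quick sketch of the simplified matching rule makes this visible:

```python
import re

def matches(pattern, path):
    """Simplified Google-style robots.txt pattern match:
    '*' is a wildcard, a trailing '$' anchors the end of the path."""
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = "^" + "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.match(regex + ("$" if anchored else ""), path) is not None

# '*' can match the empty string, so the disallow rule alone
# also covers the directory index itself:
print(matches("/directory/*", "/directory/"))                # True
print(matches("/directory/*", "/directory/sub-directory/"))  # True
```

So dropping the allow line would block the directory index as well; the pair of rules together is what carves out the exception.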