How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
-
Today's sitemap webinar made me think about the disallow feature, seems opposite of sitemaps, but it also seems both are kind of ignored in varying ways by the engines.
I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file.
For example, I have folders containing site comps for clients that I really don't want showing up in the SERPS. Is it better to not have these folders on the domain at all?
There are also security issues I've heard of that make sense, simply look at a site's robots file to see what they are hiding. It makes it easier to hunt for files when they know the directory the files are contained in. Do I concern myself with this?
Another example is a folder I have for my xml sitemap generator. I imagine google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
-
Hi,
Usin;
User-agent: *
Disallow: /folder/subfolderis fine, however if you have information stored in your website that you certainly want crawled make sure it is in your site map and use ...
User-agent: *
allow: /folder/subfolderadding a no follow attribute to all of your pages wont be practical, if a spam crawler ignores the robots.txt it will ignore your no follow attribute. If anything new occurs with robots.txt check large website's robots.txt as they always update to new trends i.e
Hope this helps:)
-
Hi Jay,
There's actually a recent similar discussion at http://www.seomoz.org/q/what-reasons-exist-to-use-noindex-robots-txt regarding deciding what to block via robots.
For site comps for clients, you could also password-protect those to help hide them, or do a different domain that you have entirely excluded in robots. I've also seen services like Basecamp used for posting comps. It all depends on how much you want to hide the comps.
You do want your sitemap itself to be crawled, but I'm presuming this is in the root directory so that shouldn't be a problem. Folders like your sitemap generator and other purely-framework folders can certainly be disallowed. Blocking the files that list the version of your website (if you're using a CMS) can help prevent people from searching for opportunities to hack that version and finding your site.
Also, just do a site:domain.com search on your domain, see what's indexed, see what content from there you don't want indexed, and use that as a starting point.
Are you running on a content management system, or a custom site? For a CMS, here are example robots.txt files for several popular CMSs. http://www.stayonsearch.com/robots-txt-guide
-
You may also want to think about slapping a robots noindex on the individual pages as well.
-
You can type the following syntax:
after User-agent: *
Disallow: /foldername/subfoldername
also, you can name your sitemaps in the robots.txt file.
They can be defined as
Sitemap: http://www.yourdomain.com/sitemap.xml
If you have multiple sitemaps, you can have multiple sitemaps listed.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Sudden Indexation of "Index of /wp-content/uploads/"
Hi all, I have suddenly noticed a massive jump in indexed pages. After performing a "site:" search, it was revealed that the sudden jump was due to the indexation of many pages beginning with the serp title "Index of /wp-content/uploads/" for many uploaded pieces of content & plugins. This has appeared approximately one month after switching to https. I have also noticed a decline in Bing rankings. Does anyone know what is causing/how to fix this? To be clear, these pages are **not **normal /wp-content/uploads/ but rather "index of" pages, being included in Google. Thank you.
Technical SEO | | Tom3_150 -
Adding your sitemap to robots.txt
Hi everyone, Best practice question: When adding your sitemap to your robots.txt file, do you add the whole sitemap at once or do you add different subcategories (products, posts, categories,..) separately? I'm very curious to hear your thoughts!
Technical SEO | | WeAreDigital_BE0 -
Robots.txt on subdomains
Hi guys! I keep reading conflicting information on this and it's left me a little unsure. Am I right in thinking that a website with a subdomain of shop.sitetitle.com will share the same robots.txt file as the root domain?
Technical SEO | | Whittie0 -
Blocked URL's by robots.txt
In Google Webmaster Tools shows me 10,936 Blocked URL's by robots.txt and it is very strange when you go to the "Index Status" section where shows that since April 2012 robots.txt blocked many URL's. You can see more precise on the image attached (chart WMT) I can not explain why I have blocked URL's ? because I have nothing in robots.txt.
Technical SEO | | meralucian37
My robots.txt is like this: User-agent: * I thought I was penalized by Penguin in April 2012 because constantly i'am losing visitors now reaching over 40%. It may be a different penalty? Any help is welcome because i'm already so saturated. Mera robotstxt.jpg0 -
Robots.txt Question
In the past, I had blocked a section of my site (i.e. domain.com/store/) by placing the following in my robots.txt file: "Disallow: /store/" Now, I would like the store to be indexed and included in the search results. I have removed the "Disallow: /store/" from the robots.txt file, but approximately one week later a Google search for the URL produces the following meta description in the search results: "A description for this result is not available because of this site's robots.txt – learn more" Is there anything else I need to do to speed up the process of getting this section of the site indexed?
Technical SEO | | davidangotti0 -
Backlinks go to "example.com" our homepage is "example.com/default.html" am I losing internal link power?
Hey everyone! Thanks again for everybodies contributions to my questions over the last few months. As the title states, our homepage is at "example.com/default.html" but everybody that backlinks to us (as expected) to "example.com" does that mean that I am probably losing a lot of the power of my links??
Technical SEO | | TylerAbernethy0 -
Is it better to delete web pages that I don't want anymore or should I 301 redirect all of the pages I delete to the homepage or another live page?
Is it better for SEO to delete web pages that I don't want anymore or should I 301 redirect all of the pages I delete to the homepage or another live page?
Technical SEO | | CustomOnlineMarketing0 -
Robots.txt file getting a 500 error - is this a problem?
Hello all! While doing some routine health checks on a few of our client sites, I spotted that a new client of ours - who's website was not designed built by us - is returning a 500 internal server error when I try to look at the robots.txt file. As we don't host / maintain their site, I would have to go through their head office to get this changed, which isn't a problem but I just wanted to check whether this error will actually be having a negative effect on their site / whether there's a benefit to getting this changed? Thanks in advance!
Technical SEO | | themegroup0