How do I use the Robots.txt "disallow" command properly for folders I don't want indexed?
-
Today's sitemap webinar made me think about the disallow feature, seems opposite of sitemaps, but it also seems both are kind of ignored in varying ways by the engines.
I don't need help semantically, I got that part. I just can't seem to find a contemporary answer about what should be blocked using the robots.txt file.
For example, I have folders containing site comps for clients that I really don't want showing up in the SERPS. Is it better to not have these folders on the domain at all?
There are also security issues I've heard of that make sense, simply look at a site's robots file to see what they are hiding. It makes it easier to hunt for files when they know the directory the files are contained in. Do I concern myself with this?
Another example is a folder I have for my xml sitemap generator. I imagine google isn't going to try to index this or count it as content, so do I need to add folders like this to the disallow list?
-
Hi,
Usin;
User-agent: *
Disallow: /folder/subfolderis fine, however if you have information stored in your website that you certainly want crawled make sure it is in your site map and use ...
User-agent: *
allow: /folder/subfolderadding a no follow attribute to all of your pages wont be practical, if a spam crawler ignores the robots.txt it will ignore your no follow attribute. If anything new occurs with robots.txt check large website's robots.txt as they always update to new trends i.e
Hope this helps:)
-
Hi Jay,
There's actually a recent similar discussion at http://www.seomoz.org/q/what-reasons-exist-to-use-noindex-robots-txt regarding deciding what to block via robots.
For site comps for clients, you could also password-protect those to help hide them, or do a different domain that you have entirely excluded in robots. I've also seen services like Basecamp used for posting comps. It all depends on how much you want to hide the comps.
You do want your sitemap itself to be crawled, but I'm presuming this is in the root directory so that shouldn't be a problem. Folders like your sitemap generator and other purely-framework folders can certainly be disallowed. Blocking the files that list the version of your website (if you're using a CMS) can help prevent people from searching for opportunities to hack that version and finding your site.
Also, just do a site:domain.com search on your domain, see what's indexed, see what content from there you don't want indexed, and use that as a starting point.
Are you running on a content management system, or a custom site? For a CMS, here are example robots.txt files for several popular CMSs. http://www.stayonsearch.com/robots-txt-guide
-
You may also want to think about slapping a robots noindex on the individual pages as well.
-
You can type the following syntax:
after User-agent: *
Disallow: /foldername/subfoldername
also, you can name your sitemaps in the robots.txt file.
They can be defined as
Sitemap: http://www.yourdomain.com/sitemap.xml
If you have multiple sitemaps, you can have multiple sitemaps listed.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is this a true rel=nofollow for the whole article? "printfriendly.com" is part of the URL which is why I'm confused.
Is the rel=nofollow tag on this article a true NoFollow for the whole article (and all the external links to other sites in the article), or is it just for a specific part of the page? Here is the article: https://www.aplaceformom.com/blog/americans-are-not-ready-for-retirement/ The reason I ask is that I'm confused about the code since it has "printfriendly.com..." as a portion of the URL. Your help is greatly appreciated. Thanks!
Technical SEO | | dklarse0 -
Using http: shorthand inside canonical tag ("//" instead of "http:") can cause harm?
HI, I am planning to launch a new site, and shortly after to move to HTTPS. to save the need to change over 5,000 canonical tags in pages the webmaster suggested we implement inside the rel canonical "//" instead of the absolute path, would that do any damage or be a problem? oranges-south-dakota" />
Technical SEO | | Kung_fu_Panda0 -
How ro write a robots txt file to point to your site map
Good afternoon from still wet & humid wetherby UK... I want to write a robots text file that instruct the bots to index everything and give a specific location to the sitemap. The sitemap url is:http://business.leedscityregion.gov.uk/CMSPages/GoogleSiteMap.aspx Is this correct: User-agent: *
Technical SEO | | Nightwing
Disallow:
SITEMAP: http://business.leedscityregion.gov.uk/CMSPages/GoogleSiteMap.aspx Any insight welcome 🙂0 -
Properly Moving Blog from Index to its Own Page
Right now I have a website that is exclusively a blog. I want to create pages outside of the blog and move the blog to a page other than the index file e.g.) from domain.com to domain.com/blog I will have the blog post pages stay in the root directory. e.g.) domain.com/blog-post Any suggestions how to properly tell SE's and other websites that the blog has moved?
Technical SEO | | Bartell0 -
Robots.txt blocking site or not?
Here is the robots.txt from a client site. Am I reading this right --
Technical SEO | | 540SEO
that the robots.txt is saying to ignore the entire site, but the
#'s are saying to ignore the robots.txt command? See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file To ban all spiders from the entire site uncomment the next two lines: User-Agent: * Disallow: /0 -
Switching Site to a Domain Name that's in Use
I'm comfortable with the steps of moving a site to a new domain name as recommended by Google. However, in this case, the domain name I'm asked to move to is not really "new" ... meaning it's currently hosting a website and has been for a long time. So my question is, do I do this in steps and take the old website down first in order to "free up" the domain name in they eyes of search engines to avoid large numbers of 404s and then (in step 2) switch to the "new" domain in a few months? Thanks.
Technical SEO | | R2iSEO0 -
Robots.txt
Hi there, My question relates to the robots.txt file. This statement: /*/trackback Would this block domain.com/trackback and domain.com/fred/trackback ? Peter
Technical SEO | | PeterM220 -
What is the sense of robots.txt?
Using robots.txt to prevent search engine from indexing the page is not a good idea. so what is the sense of robots.txt? just for attracting robots to crawl sitemap?
Technical SEO | | jallenyang0