Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Robots.txt, does it need preceding directory structure?
-
Do you need the entire preceding path in robots.txt for it to match?
e.g:
I know that if I add Disallow: /fish to robots.txt, it will block:
/fish
/fish.html
/fish/salmon.html
/fishheads
/fishheads/yummy.html
/fish.php?id=anything
But would it block?:
en/fish
en/fish.html
en/fish/salmon.html
en/fishheads
en/fishheads/yummy.html
en/fish.php?id=anything
(Examples taken from the Robots.txt Specifications.)
I'm hoping it actually won't match; that way, writing this particular robots.txt will be much easier!
Basically, I want to block many URLs that contain BTS-, such as:
http://www.example.com/BTS-something
http://www.example.com/BTS-somethingelse
http://www.example.com/BTS-thingybob
But I have other pages that I do not want blocked, in subfolders that also contain BTS-, such as:
http://www.example.com/somesubfolder/BTS-thingy
http://www.example.com/anothersubfolder/BTS-otherthingy
Thanks for listening
-
Yes, this is what I thought, but I wanted a second opinion.
Although I wouldn't actually need a wildcard after BTS-, as just leaving it open is the same as using a wildcard:
/fish* is equivalent to "/fish"; the trailing wildcard is ignored.
(Source: https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt)
Thanks for the link, I'll take a look.
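To make the matching rule concrete, here is a rough Python sketch of it. It is a simplified illustration, not Google's actual implementation: the function name rule_matches is made up for this example, and it only assumes that a rule is anchored at the start of the URL path, that * matches any run of characters, and that a trailing $ anchors the end.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Simplified sketch: the pattern is anchored at the start of the path,
    '*' matches any run of characters, and a trailing '$' anchors the end.
    """
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for wildcards.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored_end:
        regex += "$"
    # re.match only matches at the beginning of the string, which gives
    # the "starts with" behaviour described above.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for path in ("/fish.html", "/fishheads/yummy.html", "/en/fish.html"):
        print(path, rule_matches("/fish", path), rule_matches("/fish*", path))
    for path in ("/BTS-something", "/somesubfolder/BTS-thingy"):
        print(path, rule_matches("/BTS-", path))
```

Under these assumptions, /fish and /fish* give the same answer for every path, and /BTS- matches /BTS-something but not /somesubfolder/BTS-thingy, because the path has to start with the rule.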
-
You're right that **Disallow: /fish** in the robots.txt file blocks all of those initial links. If you wanted to block the same pages inside the /en/ folder, you would need Disallow: /en/fish
You could use a wildcard in the robots.txt file to do something along the lines of Disallow: /BTS-*
This _should_ work, but it's always worth checking with a tool to make sure it's all implemented correctly. Distilled published a post a while back about a JS bookmarklet that checks whether a page is blocked by robots.txt: http://www.distilled.net/blog/seo/js-bookmarklet-for-checking-if-a-page-is-blocked-by-robots-txt/
In addition to this, you could use the 'blocked URLs' tool in Google Webmaster Tools (GWT) to see whether the pages are successfully blocked once you've implemented the rules.
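Another way to run this kind of check offline is Python's standard-library urllib.robotparser. This is only a minimal sketch, and it is neither the Distilled bookmarklet nor the GWT tool. The robots.txt text below is a hypothetical file for the question's site, and is_blocked is a helper name invented for the example. Note that this parser does plain prefix matching and does not implement the * wildcard, so the rule is written as /BTS- without the trailing star.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for the site in the question (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /BTS-
"""


def is_blocked(url: str, robots_txt: str = ROBOTS_TXT, agent: str = "*") -> bool:
    """Return True if the given robots.txt text disallows the URL for the agent."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    urls = (
        "http://www.example.com/BTS-something",
        "http://www.example.com/BTS-somethingelse",
        "http://www.example.com/somesubfolder/BTS-thingy",
        "http://www.example.com/anothersubfolder/BTS-otherthingy",
    )
    for url in urls:
        print(url, "blocked" if is_blocked(url) else "allowed")
```

A local check like this only shows how one parser reads the rules, so it is still worth confirming the result in the search engine's own testing tool.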
Hope this helps!