Robots.txt usage
-
Hey Guys,
I am about make an important improvement to our site's robots.txt
we have large number of properties on our site and we have different views for them. List, gallery and map view. By default list view shows up and user can navigate through gallery view.
We donot want gallery pages to get indexed and want to save our crawl budget for more important pages.
this is one example of our site:
http://www.holiday-rentals.co.uk/France/r31.htm
When you click on "gallery view" URL of this site will remain same in your address bar: but when you mouse over the "gallery view" tab it will show you URL with parameter "view=g". there are number of parameters: "view=g, view=l and view=m".
http://www.holiday-rentals.co.uk/France/r31.htm?view=l
http://www.holiday-rentals.co.uk/France/r31.htm?view=g
http://www.holiday-rentals.co.uk/France/r31.htm?view=m
Now my question is:
I If restrict bots by adding "Disallow: ?view=" in our robots.txt will it effect the list view too?
Will be very thankful if yo look into this for us.
Many thanks
Hassan
I will test this on some other site within our network too before putting it to important one's. to measure the impact but will be waiting for your recommendations. Thanks
-
Others are right by the way canonical may be better, but if you insist on robots restriction you should add two schemas to each parameter:
disallow:?view=m disallow:?view=m*
so that you block the urls that contain the parameter at the end and block the ones that have it in the middle as well.
-
I had a similar issue with my website: there were many ways of sorting a likst of items (date, title, etc) which ended up causing duplicate content, we solved the issue a couple of days ago by restricting the "sorted" pages using the robots.txt file. HOWEVER, this morning i found this text in the Google Webmaster Tools support section:
Google no longer recommends blocking crawler access to duplicate content on your website, whether with a robots.txt file or other methods. If search engines can't crawl pages with duplicate content, they can't automatically detect that these URLs point to the same content and will therefore effectively have to treat them as separate, unique pages. A better solution is to allow search engines to crawl these URLs, but mark them as duplicates by using the
rel="canonical"
link element, the URL parameter handling tool, or 301 redirects. In cases where duplicate content leads to us crawling too much of your website, you can also adjust the crawl rate setting in Webmaster Tools.source:
http://www.google.com/support/webmasters/bin/answer.py?answer=66359I havent seen any negative effect on my site (yet), but I would agree with SuperlativB in the sense that YOU might be better off using "canonical" tags on these links
http://www.holiday-rentals.co.uk/...?view=l
-
For these paratmeters are not at the very end os the url you should add * after the letter of the parameter as well in the restriction
you got my point, thanks for looking into this. Since our search page load with list view by default and it is not in URL but still v=l represents the list view.
I want to disallow both parameters "view=g, view=m" in any URL from bots.
If these parameters are sometimes in between and some time at the end of URL what will be the work around for for both cases, you suggest?
Thanks for looking into this...
-
You can do the restriction you want but if i get it right m stands for map view g stands for gallery view and l stands for list view. So if you want list view to be indexed and map and gallery view not to be indexed you should add two lines of distriction:
disallow:?view=m disallow:?view=g
if these paratmeters are not at the very end os the url you should add * after the letter of the parameter as well in the restriction
-
Sounds like this is something canonical could solve for you. If you disallow ?view=* you would disallow all "?view" on your homepage, if you are unsure you should go for exact match rather that all.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Do I need a separate robots.txt file for my shop subdomain?
Hello Mozzers! Apologies if this question has been asked before, but I couldn't find an answer so here goes... Currently I have one robots.txt file hosted at https://www.mysitename.org.uk/robots.txt We host our shop on a separate subdomain https://shop.mysitename.org.uk Do I need a separate robots.txt file for my subdomain? (Some Google searches are telling me yes and some no and I've become awfully confused!
Technical SEO | | sjbridle0 -
3,511 Pages Indexed and 3,331 Pages Blocked by Robots
Morning, So I checked our site's index status on WMT, and I'm being told that Google is indexing 3,511 pages and the robots are blocking 3,331. This seems slightly odd as we're only disallowing 24 pages on the robots.txt file. In light of this, I have the following queries: Do these figures mean that Google is indexing 3,511 pages and blocking 3,331 other pages? Or does it mean that it's blocking 3,331 pages of the 3,511 indexed? As there are only 24 URLs being disallowed on robots.text, why are 3,331 pages being blocked? Will these be variations of the URLs we've submitted? Currently, we don't have a sitemap. I know, I know, it's pretty unforgivable but the old one didn't really work and the developers are working on the new one. Once submitted, will this help? I think I know the answer to this, but is there any way to ascertain which pages are being blocked? Thanks in advance! Lewis
Technical SEO | | PeaSoupDigital0 -
Will an XML sitemap override a robots.txt
I have a client that has a robots.txt file that is blocking an entire subdomain, entirely by accident. Their original solution, not realizing the robots.txt error, was to submit an xml sitemap to get their pages indexed. I did not think this tactic would work, as the robots.txt would take precedent over the xmls sitemap. But it worked... I have no explanation as to how or why. Does anyone have an answer to this? or any experience with a website that has had a clear Disallow: / for months , that somehow has pages in the index?
Technical SEO | | KCBackofen0 -
Robots.txt to disallow /index.php/ path
Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
Technical SEO | | Mikkehl0 -
I accidentally blocked Google with Robots.txt. What next?
Last week I uploaded my site and forgot to remove the robots.txt file with this text: User-agent: * Disallow: / I dropped from page 11 on my main keywords to past page 50. I caught it 2-3 days later and have now fixed it. I re-imported my site map with Webmaster Tools and I also did a Fetch as Google through Webmaster Tools. I tweeted out my URL to hopefully get Google to crawl it faster too. Webmaster Tools no longer says that the site is experiencing outages, but when I look at my blocked URLs it still says 249 are blocked. That's actually gone up since I made the fix. In the Google search results, it still no longer has my page title and the description still says "A description for this result is not available because of this site's robots.txt – learn more." How will this affect me long-term? When will I recover my rankings? Is there anything else I can do? Thanks for your input! www.decalsforthewall.com
Technical SEO | | Webmaster1230 -
Un-Indexing a Page without robots.txt or access to HEAD
I am in a situation where a page was pushed live (Went live for an hour and then taken down) before it was supposed to go live. Now normally I would utilize the robots.txt or but I do not have access to either and putting a request in will not suffice as it is against protocol with the CMS. So basically I am left to just utilizing the and I cannot seem to find a nice way to play with the SE to get this un-indexed. I know for this instance I could go to GWT and do it but for clients that do not have GWT and for all the other SE's how could I do this? Here is the big question here: What if I have a promotional page that I don't want indexed and am met with these same limitations? Is there anything to do here?
Technical SEO | | DRSearchEngOpt0 -
Mobile site: robots.txt best practices
If there are canonical tags pointing to the web version of each mobile page, what should a robots.txt file for a mobile site have?
Technical SEO | | bonnierSEO0 -
Subdomain Removal in Robots.txt with Conditional Logic??
I would like to see if there is a way to add conditional logic to the robots.txt file so that when we push from DEV to PRODUCTION and the robots.txt file is pushed, we don't have to remember to NOT push the robots.txt file OR edit it when it goes live. My specific situation is this: I have www.website.com, dev.website.com and new.website.com and somehow google has indexed the DEV.website.com and NEW.website.com and I'd like these to be removed from google's index as they are causing duplicate content. Should I: a) add 2 new GWT entries for DEV.website.com and NEW.website.com and VERIFY ownership - if I do this, then when the files are pushed to LIVE won't the files contain the VERIFY META CODE for the DEV version even though it's now LIVE? (hope that makes sense) b) write a robots.txt file that specifies "DISALLOW: DEV.website.com/" is that possible? I have only seen examples of DISALLOW with a "/" in the beginning... Hope this makes sense, can really use the help! I'm on a Windows Server 2008 box running ColdFusion websites.
Technical SEO | | ErnieB0