Best practice for disallowing URLs with robots.txt
-
Hi Everybody,
We are currently trying to tidy up the crawl errors that appear when we crawl the site. At first glance we were very worried, to say the least: 17,000+ errors. But after looking more closely at the report, we found that the majority of these errors were caused by bad URLs featuring:
- Currency - For example: "directory/currency/switch/currency/GBP/uenc/aHR0cDovL2NlbnR1cnlzYWZldHkuY29tL3dvcmt3ZWFyP3ByaWNlPTUwLSZzdGFuZGFyZHM9NzEx/"
- Color - For example: "?color=91"
- Price - For example: "?price=650-700"
- Order - For example: "?dir=desc&order=most_popular"
- Page - For example: "?p=1&standards=704"
- Login - For example: "customer/account/login/referer/aHR0cDovL2NlbnR1cnlzYWZldHkuY29tL2NhdGFsb2cvcHJvZHVjdC92aWV3L2lkLzQ1ODczLyNyZXZpZXctZm9ybQ,,/"
My question now, as a novice at working with robots.txt, is: what would be the best practice for disallowing URLs featuring these from being crawled?
Any advice would be appreciated!
-
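If you are looking to disallow URL parameters, you could use something like the following as a convention.
Disallow: /*? blocks every URL that contains a query string (in crawlers that support the * wildcard, such as Googlebot); a rule that begins with /? only matches query strings on the homepage itself. If you want to be more precise, target specific parameters instead, for example Disallow: /*?dir=, Disallow: /*?order= or Disallow: /*?p=, plus the matching /*&dir= style rules for parameters that are not first in the query string. There have been a few Moz questions of this type over the last few years, if you do look to remove the parameters.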
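To make the parameter-specific approach a bit more concrete, here is a minimal sketch in Python that writes out the rules for a list of parameters. It assumes the crawler honours the * wildcard (Googlebot does), and the helper name build_rules and the parameter list are my own illustration taken from the examples in the question, not anything Magento or Moz provides.

```python
# Hypothetical helper: build robots.txt rules that stop crawling of URLs
# carrying any of the given query parameters. Assumes the crawler supports
# the '*' wildcard in Disallow paths.

# Parameter names taken from the example URLs in the question.
FILTER_PARAMS = ["color", "price", "dir", "order", "p"]


def build_rules(params, user_agent="*"):
    """Return robots.txt text with two Disallow rules per parameter."""
    lines = [f"User-agent: {user_agent}"]
    for name in params:
        # Parameter is first in the query string: /path?name=...
        lines.append(f"Disallow: /*?{name}=")
        # Parameter comes later in the query string: /path?x=1&name=...
        lines.append(f"Disallow: /*&{name}=")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_rules(FILTER_PARAMS))
```

Including the "=" in each rule keeps "?p=" from also catching "?price=", which matters if you only want to block one of the two.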
Also try to ensure that the product pages you have listed are well canonicalised and point to the original product. A good review on how to do this can be found here. In most cases this will be enough to remove any indexation or duplicate-content issues.
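As a rough sketch of what "pointing to the original" means for filtered URLs, the Python below strips the filter parameters from a URL to get the address a canonical tag would reference. The parameter set, the function name canonical_url and the example.com domain are assumptions for illustration only; whether paginated pages (p) should canonicalise to the first page is a judgment call for your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed filter/sort parameters, taken from the examples in the question.
FILTER_PARAMS = {"color", "price", "dir", "order", "p", "standards"}


def canonical_url(url):
    """Return the URL with filter parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FILTER_PARAMS
    ]
    # An empty query string is dropped entirely by urlunsplit.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Build the <link> element that would go in the page <head>."""
    return f'<link rel="canonical" href="{canonical_url(url)}">'


if __name__ == "__main__":
    print(canonical_tag("http://example.com/workwear?price=650-700&color=91"))
```

In practice the shop platform or a plugin normally emits this tag for you; the sketch is only meant to show which URL it should end up containing.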
-
First, I assume you have Webmaster Tools set up?
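It has a robots.txt tester tool in which you can try out different rules to make sure you get the syntax right. For example, color would be blocked by Disallow: /*?color= (the leading /* lets the rule match on any path, and no trailing * is needed because rules already match as prefixes), and you would follow a similar format for the other parameters.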
If you are confused, I highly recommend reading through Moz's robots.txt best practices guide before you make any changes. Be sure to test everything in Webmaster Tools (Search Console) > robots.txt Tester.
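If you want a quick local sanity check before pasting rules into the tester, something like the following Python sketch can help. It is a simplified approximation of wildcard matching ('*' matches any run of characters, a trailing '$' anchors the end, everything else is a prefix match) and it ignores Allow rules and rule precedence, so treat the tester as the final word. The function names and sample paths are made up for the example.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters and a trailing '$' anchors the end of
    the URL; everything else is literal and is matched from the start of
    the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(url_path, disallow_patterns):
    """True if any non-empty Disallow pattern matches the path plus query."""
    return any(
        rule_to_regex(pattern).match(url_path)
        for pattern in disallow_patterns
        if pattern  # an empty Disallow value blocks nothing
    )


if __name__ == "__main__":
    rules = ["/*?color=", "/*&color=", "/customer/account/login/"]
    samples = [
        "/workwear?color=91",
        "/workwear?price=650-700&color=91",
        "/customer/account/login/referer/abc/",
        "/workwear",
    ]
    for path in samples:
        print(path, "->", "blocked" if is_blocked(path, rules) else "allowed")
```

One thing this makes easy to see: a rule that starts with /? only matches the homepage with a query string, so it would not catch a category URL such as /workwear?color=91.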
Let me know if you run into any problems.