Blocking Dynamic URLs with Robots.txt
-
Background:
My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it results in many URL variations of the same page being crawled by Google. For example, a standard category page:
...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page:
http://www.mysite.com/widgets.html?price=1%2C250
http://www.mysite.com/widgets.html?price=2%2C250
http://www.mysite.com/widgets.html?price=3%2C250
As there are literally thousands of these URL variations being indexed, I'd like to use robots.txt to disallow them.
Question:
Is this a wise thing to do? Or does Google take layered navigation links into account by default, so that I don't need to worry?
-
To implement, I was going to do the following in Robots.txt:
User-agent: *
Disallow: /*?
Disallow: /*=
...which would prevent any dynamic URL with a '?' or '=' from being indexed. Is there a better way to do this, or is this a good solution?
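For reference, here is a minimal Python sketch of how these two rules would match URLs. It assumes the wildcard behaviour Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL). The helper names `rule_to_regex` and `is_disallowed` are made up for illustration and are not part of any library, and the sketch ignores `Allow` rules and rule precedence.

```python
from __future__ import annotations

import re
from urllib.parse import urlsplit


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate one robots.txt path rule into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL. Everything else is matched literally,
    starting from the beginning of the path.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(url: str, disallow_rules: list[str]) -> bool:
    """Return True if any Disallow rule matches the URL's path plus query."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return any(rule_to_regex(rule).match(target) for rule in disallow_rules)


# The two rules proposed in the question.
RULES = ["/*?", "/*="]

for url in [
    "http://www.mysite.com/widgets.html",
    "http://www.mysite.com/widgets.html?price=1%2C250",
]:
    print(url, "->", "blocked" if is_disallowed(url, RULES) else "allowed")
```

Under those assumptions, the plain category URL stays crawlable and every URL with a query string is blocked. A narrower rule such as `Disallow: /*?price=` (a hypothetical alternative, not something from the original post) would block only the price filter and leave other query strings alone.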
Thank you!
-
-
If you are happy for no URLs with query strings to be indexed, your robots.txt will work fine.
Do any of your URLs with question marks in them have links pointing to them? If so, you might want to be careful about blocking Google from indexing them. I would think you'd lose the benefits those links would pass to your site.
-
Tait,
Thanks for the answer. I think the canonical tag would be ideal, but in terms of implementation, it would require some substantial code modification to the site / PHP code as I have a lot of categories, and adding this manually to each one would be very time consuming.
Would preventing the spiders from indexing any URLs with a "?" or "&" (which would only be dynamic URL variations) cause any problems? Or is this just not an ideal best practice?
Thanks!
-
I don't know if there's a good solution with robots.txt given your URL structure. However, you could use the rel=canonical link tag in the page's head section to tell Google to treat many of your URLs as the same page. This would help you avoid duplicate content penalties.
More on rel=canonical:
http://www.google.com/support/webmasters/bin/answer.py?answer=139394
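As a rough illustration of the rel=canonical idea, the tag does not have to be added to each category by hand. It can be derived from the requested URL in one shared template. The sketch below is in Python rather than the site's PHP, the function names `canonical_url` and `canonical_link_tag` are hypothetical, and it assumes every query parameter is a filter or sort variation that should collapse to the base category URL.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(request_url: str) -> str:
    """Drop the query string and fragment so that every filtered
    variation maps back to the same base category URL.

    Assumption: no query parameter changes which page this is.
    """
    scheme, netloc, path, _query, _fragment = urlsplit(request_url)
    return urlunsplit((scheme, netloc, path, "", ""))


def canonical_link_tag(request_url: str) -> str:
    """Build the <link rel="canonical"> element for the page head."""
    href = escape(canonical_url(request_url), quote=True)
    return f'<link rel="canonical" href="{href}" />'


print(canonical_link_tag("http://www.mysite.com/widgets.html?price=1%2C250"))
```

If some parameters do produce genuinely different pages (pagination, for example), they would need to be kept in the canonical URL instead of being stripped.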