Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
No indexing url including query string with Robots txt
-
Dear all,
how can I block url/pages with query strings like page.html?dir=asc&order=name with robots txt?
Thanks!
-
Dear all, what is the best option? And are the option below good? A: Disallow
- sort-order (Only URLs with value = asc)
"A single URL may contain many parameters for each of which you can specify settings. More restrictive settings override less restrictive settings. For example, here are three parameters and their settings"
source:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687
B: User-agent:
Googlebot Disallow: /*.=name$
for example www.sub.domain.com/collection.html?dir=desc&order=name source: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
Thanks!
-
You could always just use rel="canonical" which would be much better than completely blocking all URL parameters.
-
Hey,
Should that second URL be www.sub.domain.com/collection/adresboeken.html?whatever=something If so, then by using /collection/?* you are saying that anything within /collection/ with a query string should not be indexed. If adresboeken.html always has a query string, it may not get indexed.
The other options I'd consider before using robots.txt are telling Google to ignore dir=desc&order=color in Google Webmaster Tools parameter handling. This is the best way to handle query string issues. (Assuming you are trying to influence Google. Clearly Google Webmaster Tools won't affect Bing!)
Another idea is to set a canonical URL on /collection/adresboeken.html referencing /collection/adresboeken.html without the query string. This tells the search engines that the query strings do not make a unique URL. (adresboeken.html?dir=desc&order=color is the same as adresboeken.html?dir=desc&order=price is the same as adresboeken.html?dir=asc&order=color is the same as adresboeken.html, and so on).
I hope that helps. Thanks,
Matthew -
Hi,
Robots.txt works mainly on 2 rules. Those are User-agent: and Disallow:
User-agent: the name of the robot you need to block
Disallow: the url or folder or other url with conditions you need to block.
As you have asked in your question you need to block a url with a condition. But you have to remember that Robot.txt is giving so critical results if you did not use it correctly.
Anyway in your question, you wanted to block url/pages with query strings like page.html?dir=asc&order=name
so you have to use following:
User-agent: *
Disallow: /*?
So the above will block all the urls with a question mark (?) for all the search robots. This will not block only page.html?dir=asc&order=name it will alos block comments.html?dir=asc&order=name
So use it so carefully.
Hope this is the what you have looked for. If need more help you may ask.
Regards
Prasad
-
Dear all,
thanks for responding. If I have a pages like
1. www.sub.domain.com/collection.html exists, I want to index it, and
2. www.sub.domain.com/collection.html?dir=desc&order=color which I don't want to index
Is this the way to do this in de robots.txt?:
Disallow: /collection/?*
Thanks!
-
Hi,
Here is an article explaining how to do this in robots.txt:
http://sanzon.wordpress.com/2008/04/29/advanced-usage-of-robotstxt-w-querystrings/Depending on what you are trying to do, it might also be worth investigating parameter handling in Google Webmaster Tools:
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=1235687Thanks,
Matthew
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Removing a site from Google index with no index met tags
Hi there! I wanted to remove a duplicated site from the google index. I've read that you can do this by removing the URL from Google Search console and, although I can't find it in Google Search console, Google keeps on showing the site on SERPs. So I wanted to add a "no index" meta tag to the code of the site however I've only found out how to do this for individual pages, can you do the same for a entire site? How can I do it? Thank you for your help in advance! L
Technical SEO | | Chris_Wright1 -
Google tries to index non existing language URLs. Why?
Hi, I am working for a SAAS client. He uses two different language versions by using two different subdomains.
Technical SEO | | TheHecksler
de.domain.com/company for german and en.domain.com for english. Many thousands URLs has been indexed correctly. But Google Search Console tries to index URLs which were never existing before and are still not existing. de.domain.com**/en/company
en.domain.com/de/**company ... and an thousand more using the /en/ or /de/ in between. We never use this variant and calling these URLs will throw up a 404 Page correctly (but with wrong respond code - we`re fixing that 😉 ). But Google tries to index these kind of URLs again and again. And, I couldnt find any source of these URLs. No Website is using this as an out going link, etc.
We do see in our logfiles, that a Screaming Frog Installation and moz.com w opensiteexplorer were trying to access this earlier. My Question: How does Google comes up with that? From where did they get these URLs, that (to our knowledge) never existed? Any ideas? Thanks 🙂0 -
Query Strings causing Duplicate Content
I am working with a client that has multiple locations across the nation, and they recently merged all of the location sites into one site. To allow the lead capture forms to pre-populate the locations, they are using the query string /?location=cityname on every page. EXAMPLE - www.example.com/product www.example.com/product/?location=nashville www.example.com/product/?location=chicago There are thirty locations across the nation, so, every page x 30 is being flagged as duplicate content... at least in the crawl through MOZ. Does using that query string actually cause a duplicate content problem?
Technical SEO | | Rooted1 -
My old URL's are still indexing when I have redirected all of them, why is this happening?
I have built a new website and have redirected all my old URL's to their new ones but for some reason Google is still indexing the old URL's. Also, the page authority for all of my pages has dropped to 1 (apart from the homepage) but before they were between 12 to 15. Can anyone help me with this?
Technical SEO | | One2OneDigital0 -
Robots.txt on http vs. https
We recently changed our domain from http to https. When a user enters any URL on http, there is an global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http-Version with robots.txt? Strangely, I cannot find a single ressource about this...
Technical SEO | | zeepartner0 -
Block Domain in robots.txt
Hi. We had some URLs that were indexed in Google from a www1-subdomain. We have now disabled the URLs (returning a 404 - for other reasons we cannot do a redirect from www1 to www) and blocked via robots.txt. But the amount of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot install Webmaster Tools for this subdomain to tell Google to back off... Any ideas why this could be and whether it's normal? I can send you more domain infos by personal message if you want to have a look at it.
Technical SEO | | zeepartner0 -
Can I Disallow Faceted Nav URLs - Robots.txt
I have been disallowing /*? So I know that works without affecting crawling. I am wondering if I can disallow the faceted nav urls. So disallow: /category.html/? /category2.html/? /category3.html/*? To prevent the price faceted url from being cached: /category.html?price=1%2C1000
Technical SEO | | tylerfraser
and
/category.html?price=1%2C1000&product_material=88 Thanks!0 -
Subdomain Removal in Robots.txt with Conditional Logic??
I would like to see if there is a way to add conditional logic to the robots.txt file so that when we push from DEV to PRODUCTION and the robots.txt file is pushed, we don't have to remember to NOT push the robots.txt file OR edit it when it goes live. My specific situation is this: I have www.website.com, dev.website.com and new.website.com and somehow google has indexed the DEV.website.com and NEW.website.com and I'd like these to be removed from google's index as they are causing duplicate content. Should I: a) add 2 new GWT entries for DEV.website.com and NEW.website.com and VERIFY ownership - if I do this, then when the files are pushed to LIVE won't the files contain the VERIFY META CODE for the DEV version even though it's now LIVE? (hope that makes sense) b) write a robots.txt file that specifies "DISALLOW: DEV.website.com/" is that possible? I have only seen examples of DISALLOW with a "/" in the beginning... Hope this makes sense, can really use the help! I'm on a Windows Server 2008 box running ColdFusion websites.
Technical SEO | | ErnieB0