Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
How to stop URLs that include query strings from being indexed by Google
-
Hello Mozzers
Would you use rel=canonical, robots.txt, or Google Webmaster Tools to stop the search engines indexing URLs that include query strings/parameters? Or perhaps a combination?
I guess it would be a good idea to stop the search engines crawling these URLs because the content they display will tend to be duplicate content and of low value to users.
I would be tempted to use a combination of canonicalization and robots.txt for every page I do not want crawled or indexed, yet perhaps Google Webmaster Tools is the best way to go, or at least just as effective? And I suppose some use meta robots tags too.
Does Google take a position on being blocked from web pages?
Thanks in advance, Luke
-
Without a specific example, there are a couple of options here. I am going to assume that you have an ecommerce site where parameters are used for sort functions on search results or for different options on a given product.
I know you may not be able to do this, but using parameters in this case is just a bad idea to start with. If you can (and I know this can be difficult), find a way to rework this so that your site functions without the use of parameters.
You could use canonicals, but then Google would still be crawling all those pages and then working through the canonical link to find out which page is canonical. That is a big waste of Google's time. Why waste Googlebot's time crawling a bunch of pages that you do not want crawled anyway? I would rather Googlebot focus on crawling your most important pages.
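For reference, here is a minimal sketch of what the canonical approach amounts to. It assumes a hypothetical shop where parameters such as sort and color only re-order or filter a listing; the function names, the parameter list and example.com are my own placeholders, not anything from the question.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that only sort or filter a page without changing its core content.
NON_CANONICAL_PARAMS = {"sort", "order", "color", "sessionid"}

def canonical_url(url: str) -> str:
    """Return the URL with non-canonical parameters and any fragment removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

def canonical_link_tag(url: str) -> str:
    """Build the <link rel="canonical"> element to place in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url(url), quote=True)}">'

print(canonical_link_tag("https://www.example.com/shoes?sort=price&color=red"))
# <link rel="canonical" href="https://www.example.com/shoes">
```

Note that Googlebot still has to fetch the parameter URL to see this tag, which is the crawl cost described above.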
You can use the robots.txt file to stop Google from crawling sections of your site. The only issue with this is that if some of your pages with a bunch of parameters in them are ranking, once you tell Google to stop crawling them, you would then lose that traffic.
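Before adding a blanket rule, it can help to check which of your ranking URLs it would catch. This is a rough sketch, assuming a rule like `Disallow: /*?` and the wildcard matching Google documents for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL). The helper names and sample paths are hypothetical, and the sketch ignores Allow rules and rule precedence.

```python
import re
from typing import Iterable, List

def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into an anchored regular expression."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # '*' matches any run of characters; everything else is matched literally.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def blocked_paths(paths: Iterable[str], disallow_patterns: Iterable[str]) -> List[str]:
    """Return the paths (including query string) matched by at least one Disallow pattern."""
    # An empty Disallow value blocks nothing, so skip it.
    regexes = [rule_to_regex(p) for p in disallow_patterns if p]
    return [path for path in paths if any(r.match(path) for r in regexes)]

# Hypothetical pages that currently send traffic, checked against "Disallow: /*?".
ranking_pages = ["/shoes", "/shoes?sort=price", "/search?q=boots"]
print(blocked_paths(ranking_pages, ["/*?"]))
# ['/shoes?sort=price', '/search?q=boots']
```

Anything in the printed list is a page that Google would stop crawling, so it is worth comparing against your analytics before the rule goes live.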
It is not that Google does not "like" robots.txt being used to block it, or that it does not "like" the use of the canonical tag. It is just that these are directives that Google will follow in a certain way, so if they are implemented incorrectly or in the wrong sequence they can cause negative results, because you have basically told Google to do something without fully understanding what will happen.
Here is what I would do. Long version for long-term success:
1. Look at Google Analytics (or other analytics) and Moz tools and see what pages are ranking and sending you traffic. Make a note of your results.
2. Think of the simplest way that you could organize your site that would be logical to your users and would allow Google to crawl every page you deem important. Creating a hierarchical sitemap is a good way to do this. How does this relate to what you found in #1?
3. Rework your URL structure to reflect what you found in #2 without using parameters. If you have to use parameters, then make sure Google can crawl your basic sitemap without using any of the parameters. Then use robots.txt to block the crawling of any parameters on your site. You have now ensured that Google can crawl and will rank pages without parameters, and you are not hiding any important pages or page information on a page that uses parameters.
   There are other reasons not to use parameters (e.g. URLs without them are easier for users to remember, tend to be shorter, etc.), so think about whether you want to get rid of them.
4. 301 redirect all your main traffic pages from the old URL structure to the new URL structure. Show 404s for all the other old pages, including the ones with parameters. That way all the good pages will move to the new URL structure and the bad ones will go away.
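The redirect step above (301 for the main traffic pages, 404 for the rest of the old URLs) could be sketched roughly as follows. The mapping, the paths and the function name are all hypothetical; in practice this logic usually lives in your web server or CMS configuration rather than in application code.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Hypothetical mapping from old URLs (path plus query) that earned traffic to their new homes.
REDIRECTS = {
    "/products.php?id=42": "/shoes/trail-runner",
    "/category.php?cat=7": "/shoes",
}

def respond_to_old_url(url: str) -> Tuple[int, Optional[str]]:
    """Return (status, location) for a request to a URL from the old structure."""
    parts = urlsplit(url)
    key = parts.path + ("?" + parts.query if parts.query else "")
    target = REDIRECTS.get(key)
    if target is not None:
        return 301, target  # a page worth keeping: send users and Google to the new URL
    return 404, None        # everything else from the old structure simply goes away

print(respond_to_old_url("https://www.example.com/products.php?id=42"))
# (301, '/shoes/trail-runner')
```

Only exact matches are redirected here, so a sorted or filtered variant of an old page falls through to the 404.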
Now, if you are stuck using parameters, I would do a variant of the above. Still see if there are any important or well-ranked pages that use parameters. Consider whether there is a way to use the canonical on those pages to point Google to the page that should rank. For all the other pages, I would use the noindex directive to get them out of the Google index, then later use robots.txt to block Google from crawling them. You want to do this in sequence because if you block Google first, it will never see the noindex directive.
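To make the sequence concrete, here is a small sketch of the decision for a single URL. The `still_indexed` flag is something you would have to check yourself (for example with a site: search); the set of pages to keep, the function name and the returned labels are hypothetical. The noindex itself can be served as a robots meta tag or as an `X-Robots-Tag: noindex` response header.

```python
from urllib.parse import urlsplit

# Hypothetical parameterised URLs that rank well and are handled with rel=canonical instead.
KEEP_WITH_CANONICAL = {"/shoes?size=10"}

def plan_for(url: str, still_indexed: bool) -> str:
    """Suggest the next action for a URL, following the noindex-first, block-later order."""
    parts = urlsplit(url)
    if not parts.query:
        return "index normally"
    key = f"{parts.path}?{parts.query}"
    if key in KEEP_WITH_CANONICAL:
        return "keep crawlable, rel=canonical to preferred page"
    if still_indexed:
        # Google must be able to crawl the page to see the noindex, so no robots.txt block yet.
        return "serve noindex, keep crawlable"
    return "block in robots.txt"

print(plan_for("https://www.example.com/shoes?sort=price", still_indexed=True))
# serve noindex, keep crawlable
```

The robots.txt block only appears once the page has already dropped out of the index, which is the ordering described above.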
Now, everything I said above is generally "correct", but depending on your situation, things may need to be tweaked. I hope the information I gave helps you work out the best options for your site and your customers.
Good luck!