Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Robots Disallow Backslash - Is it right command
-
Bit skeptical, as due to dynamic url and some other linkage issue, google has crawled url with backslash and asterisk character
ex - www.xyz.com/\/index.php?option=com_product
www.xyz.com/\"/index.php?option=com_product
Now %5c is the encoded version of \ - backslash & %22 is encoded version of asterisk
Need to know for command :-
User-agent: * Disallow: \As am disallowing all backslash url through this - will it only remove the backslash url which are duplicates or the entire site,
-
Thanks, you seem lucky to me.. Almost after 2 month i have got the code for making all these encoded url's redirect correctly. Finally, now if one types
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
then he's redirected through 301 to the correct url
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
-
Hello Gagan,
I think the best way to handle this would be using the rel canonical tag or rewriting the URLs to get rid of the parameters and replace them with something more user-friendly.
The rel canonical tag would be the easiest way out of those two. I notice the version without the encoding (e.g. http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 ) have a rel canonical tag that correctly references itself as the canonical version. However, the encoded URLs (e.g. http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10) which is actually http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10 does NOT have a rel canonical tag.
If the version with the backslash had a rel canonical tag stating that the following URL is canonical it would solve your issue, I think.
Canonical URL:
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 -
Sure, If i show you some url they are crawled as :-
Sample Incorrect URLs crawled and reported as duplicate one in Google Webmaster & Moz too
|
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
| http://www.mycarhelpline.com/\"/index.php?option=com_newcar&view=category&Itemid=2 |
|
Correct URL
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
http://www.mycarhelpline.com/index.php?option=com_newcar&view=search&Itemid=2
What we found online
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits. URLs cannot contain spaces.
%22 reflects - " and %5c as \ (forward slash)
We intend to remove these duplicate one created having %22 and %5c within them..
Many thanks
-
I am not entirely sure I understood your question as intended, but I will do my best to answer.
I would not put this in my robots.txt flie because it could possibly be misunderstood as a forward slash, in which case your entire domain would be blocked:
Disallow: \
We can possibly provide you with some alternative suggestions on how to keep Google from crawling those pages if you could share some real examples.
It may be best to rewrite/redirect those URls instead since they don't seem to be the canonical version you intend to be presented to the user.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt blocked internal resources Wordpress
Hi all, We've recently migrated a Wordpress website from staging to live, but the robots.txt was deleted. I've created the following new one: User-agent: *
Intermediate & Advanced SEO | | Mat_C
Allow: /
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
Allow: /wp-admin/admin-ajax.php However, in the site audit on SemRush, I now get the mention that a lot of pages have issues with blocked internal resources in robots.txt file. These blocked internal resources are all cached and minified css elements: links, images and scripts. Does this mean that Google won't crawl some parts of these pages with blocked resources correctly and thus won't be able to follow these links and index the images? In other words, is this any cause for concern regarding SEO? Of course I can change the robots.txt again, but will urls like https://example.com/wp-content/cache/minify/df983.js end up in the index? Thanks for your thoughts!2 -
Robots.txt & Disallow: /*? Question!
Hi, I have a site where they have: Disallow: /*? Problem is we need the following indexed: ?utm_source=google_shopping What would the best solution be? I have read: User-agent: *
Intermediate & Advanced SEO | | vetofunk
Allow: ?utm_source=google_shopping
Disallow: /*? Any ideas?0 -
Should I disallow all URL query strings/parameters in Robots.txt?
Webmaster Tools correctly identifies the query strings/parameters used in my URLs, but still reports duplicate title tags and meta descriptions for the original URL and the versions with parameters. For example, Webmaster Tools would report duplicates for the following URLs, despite it correctly identifying the "cat_id" and "kw" parameters: /Mulligan-Practitioner-CD-ROM
Intermediate & Advanced SEO | | jmorehouse
/Mulligan-Practitioner-CD-ROM?cat_id=87
/Mulligan-Practitioner-CD-ROM?kw=CROM Additionally, theses pages have self-referential canonical tags, so I would think I'd be covered, but I recently read that another Mozzer saw a great improvement after disallowing all query/parameter URLs, despite Webmaster Tools not reporting any errors. As I see it, I have two options: Manually tell Google that these parameters have no effect on page content via the URL Parameters section in Webmaster Tools (in case Google is unable to automatically detect this, and I am being penalized as a result). Add "Disallow: *?" to hide all query/parameter URLs from Google. My concern here is that most backlinks include the parameters, and in some cases these parameter URLs outrank the original. Any thoughts?0 -
Should comments and feeds be disallowed in robots.txt?
Hi My robots file is currently set up as listed below. From an SEO point of view is it good to disallow feeds, rss and comments? I feel allowing comments would be a good thing because it's new content that may rank in the search engines as the comments left on my blog often refer to questions or companies folks are searching for more information on. And the comments are added regularly. What's your take? I'm also concerned about the /page being blocked. Not sure how that benefits my blog from an SEO point of view as well. Look forward to your feedback. Thanks. Eddy User-agent: Googlebot Crawl-delay: 10 Allow: /* User-agent: * Crawl-delay: 10 Disallow: /wp- Disallow: /feed/ Disallow: /trackback/ Disallow: /rss/ Disallow: /comments/feed/ Disallow: /page/ Disallow: /date/ Disallow: /comments/ # Allow Everything Allow: /*
Intermediate & Advanced SEO | | workathomecareers0 -
Recovering from robots.txt error
Hello, A client of mine is going through a bit of a crisis. A developer (at their end) added Disallow: / to the robots.txt file. Luckily the SEOMoz crawl ran a couple of days after this happened and alerted me to the error. The robots.txt file was quickly updated but the client has found the vast majority of their rankings have gone. It took a further 5 days for GWMT to file that the robots.txt file had been updated and since then we have "Fetched as Google" and "Submitted URL and linked pages" in GWMT. In GWMT it is still showing that that vast majority of pages are blocked in the "Blocked URLs" section, although the robots.txt file below it is now ok. I guess what I want to ask is: What else is there that we can do to recover these rankings quickly? What time scales can we expect for recovery? More importantly has anyone had any experience with this sort of situation and is full recovery normal? Thanks in advance!
Intermediate & Advanced SEO | | RikkiD220 -
How to Disallow Tag Pages With Robot.txt
Hi i have a site which i'm dealing with that has tag pages for instant - http://www.domain.com/news/?tag=choice How can i exclude these tag pages (about 20+ being crawled and indexed by the search engines with robot.txt Also sometimes they're created dynamically so i want something which automatically excludes tage pages from being crawled and indexed. Any suggestions? Cheers, Mark
Intermediate & Advanced SEO | | monster990 -
Using 2 wildcards in the robots.txt file
I have a URL string which I don't want to be indexed. it includes the characters _Q1 ni the middle of the string. So in the robots.txt can I use 2 wildcards in the string to take out all of the URLs with that in it? So something like /_Q1. Will that pickup and block every URL with those characters in the string? Also, this is not directly of the root, but in a secondary directory, so .com/.../_Q1. So do I have to format the robots.txt as //_Q1* as it will be in the second folder or just using /_Q1 will pickup everything no matter what folder it is on? Thanks.
Intermediate & Advanced SEO | | seo1234560 -
Blocking Dynamic URLs with Robots.txt
Background: My e-commerce site uses a lot of layered navigation and sorting links. While this is great for users, it ends up in a lot of URL variations of the same page being crawled by Google. For example, a standard category page: www.mysite.com/widgets.html ...which uses a "Price" layered navigation sidebar to filter products based on price also produces the following URLs which link to the same page: http://www.mysite.com/widgets.html?price=1%2C250 http://www.mysite.com/widgets.html?price=2%2C250 http://www.mysite.com/widgets.html?price=3%2C250 As there are literally thousands of these URL variations being indexed, so I'd like to use Robots.txt to disallow these variations. Question: Is this a wise thing to do? Or does Google take into account layered navigation links by default, and I don't need to worry. To implement, I was going to do the following in Robots.txt: User-agent: * Disallow: /*? Disallow: /*= ....which would prevent any dynamic URL with a '?" or '=' from being indexed. Is there a better way to do this, or is this a good solution? Thank you!
Intermediate & Advanced SEO | | AndrewY1