Robots Disallow Backslash - Is it right command
-
Bit skeptical, as due to dynamic url and some other linkage issue, google has crawled url with backslash and asterisk character
ex - www.xyz.com/\/index.php?option=com_product
www.xyz.com/\"/index.php?option=com_product
Now %5c is the encoded version of \ - backslash & %22 is encoded version of asterisk
Need to know for command :-
User-agent: * Disallow: \As am disallowing all backslash url through this - will it only remove the backslash url which are duplicates or the entire site,
-
Thanks, you seem lucky to me.. Almost after 2 month i have got the code for making all these encoded url's redirect correctly. Finally, now if one types
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
then he's redirected through 301 to the correct url
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
-
Hello Gagan,
I think the best way to handle this would be using the rel canonical tag or rewriting the URLs to get rid of the parameters and replace them with something more user-friendly.
The rel canonical tag would be the easiest way out of those two. I notice the version without the encoding (e.g. http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 ) have a rel canonical tag that correctly references itself as the canonical version. However, the encoded URLs (e.g. http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10) which is actually http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10 does NOT have a rel canonical tag.
If the version with the backslash had a rel canonical tag stating that the following URL is canonical it would solve your issue, I think.
Canonical URL:
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10 -
Sure, If i show you some url they are crawled as :-
Sample Incorrect URLs crawled and reported as duplicate one in Google Webmaster & Moz too
|
http://www.mycarhelpline.com/\"/index.php?option=com_latestnews&view=list&Itemid=10
| http://www.mycarhelpline.com/\"/index.php?option=com_newcar&view=category&Itemid=2 |
|
Correct URL
http://www.mycarhelpline.com/index.php?option=com_latestnews&view=list&Itemid=10
http://www.mycarhelpline.com/index.php?option=com_newcar&view=search&Itemid=2
What we found online
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits. URLs cannot contain spaces.
%22 reflects - " and %5c as \ (forward slash)
We intend to remove these duplicate one created having %22 and %5c within them..
Many thanks
-
I am not entirely sure I understood your question as intended, but I will do my best to answer.
I would not put this in my robots.txt flie because it could possibly be misunderstood as a forward slash, in which case your entire domain would be blocked:
Disallow: \
We can possibly provide you with some alternative suggestions on how to keep Google from crawling those pages if you could share some real examples.
It may be best to rewrite/redirect those URls instead since they don't seem to be the canonical version you intend to be presented to the user.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Internal search pages (and faceted navigation) solutions for 2018! Canonical or meta robots "noindex,follow"?
There seems to conflicting information on how best to handle internal search results pages. To recap - they are problematic because these pages generally result in lots of query parameters being appended to the URL string for every kind of search - whilst the title, meta-description and general framework of the page remain the same - which is flagged in Moz Pro Site Crawl - as duplicate, meta descriptions/h1s etc. The general advice these days is NOT to disallow these pages in robots.txt anymore - because there is still value in their being crawled for all the links that appear on the page. But in order to handle the duplicate issues - the advice varies into two camps on what to do: 1. Add meta robots tag - with "noindex,follow" to the page
Intermediate & Advanced SEO | | SWEMII
This means the page will not be indexed with all it's myriad queries and parameters. And so takes care of any duplicate meta /markup issues - but any other links from the page can still be crawled and indexed = better crawling, indexing of the site, however you lose any value the page itself might bring.
This is the advice Yoast recommends in 2017 : https://yoast.com/blocking-your-sites-search-results/ - who are adamant that Google just doesn't like or want to serve this kind of page anyway... 2. Just add a canonical link tag - this will ensure that the search results page is still indexed as well.
All the different query string URLs, and the array of results they serve - are 'canonicalised' as the same.
However - this seems a bit duplicitous as the results in the page body could all be very different. Also - all the paginated results pages - would be 'canonicalised' to the main search page - which we know Google states is not correct implementation of canonical tag
https://webmasters.googleblog.com/2013/04/5-common-mistakes-with-relcanonical.html this picks up on this older discussion here from 2012
https://moz.com/community/q/internal-search-rel-canonical-vs-noindex-vs-robots-txt
Where the advice was leaning towards using canonicals because the user was seeing a percentage of inbound into these search result pages - but i wonder if it will still be the case ? As the older discussion is now 6 years old - just wondering if there is any new approach or how others have chosen to handle internal search I think a lot of the same issues occur with faceted navigation as discussed here in 2017
https://moz.com/blog/large-site-seo-basics-faceted-navigation1 -
Can you disallow links via Search Console?
Hey guys, Is it possible in anyway to nofollow links via search console (not disavow) but just nofollow external links pointing to your site? Cheers.
Intermediate & Advanced SEO | | lohardiu90 -
Search engine blocked by robots-crawl error by moz & GWT
Hello Everyone,. For My Site I am Getting Error Code 605: Page Banned by robots.txt, X-Robots-Tag HTTP Header, or Meta Robots Tag, Also google Webmaster Also not able to fetch my site, tajsigma.com is my site Any expert Can Help please, Thanx
Intermediate & Advanced SEO | | falguniinnovative0 -
Why Google won’t display the right page title?
I'm using a WordPress site with the WordPress SEO plugin by Yoast. Why Google won’t display the right page title for my CATEGORY pages? This is s example category page I'm having a problem with: http://bit.ly/1DReQPP
Intermediate & Advanced SEO | | soralsokal
In the source code of this category page you can see the title is:
<title>Yoga Übungen - Mit visuellen Guides, Videos & viel Inspiration</title> Now when I check the title in the SERPS it only gives me the the category name 'Yoga Übungen'. See screenshot here: http://awesomescreenshot.com/05942buz60
This happens with ALL the category pages on my site. Google uses the category name instead of the title provided in the source code. I found an article from Yoast dealing with this issue: https://yoast.com/google-page-title/ It's correct, that Google sometimes chooses a different page title, but this article doesn't address the exclusive category problem. For 'normal' pages or posts, Google always shows the title which I've setup in Yoast (and which is in the source code). I don't understand what goes wrong for category pages? Do any of you have a similar problem or experience with that?0 -
How to handle a blog subdomain on the main sitemap and robots file?
Hi, I have some confusion about how our blog subdomain is handled in our sitemap. We have our main website, example.com, and our blog, blog.example.com. Should we list the blog subdomain URL in our main sitemap? In other words, is listing a subdomain allowed in the root sitemap? What does the final structure look like in terms of the sitemap and robots file? Specifically: **example.com/sitemap.xml ** would I include a link to our blog subdomain (blog.example.com)? example.com/robots.xml would I include a link to BOTH our main sitemap and blog sitemap? blog.example.com/sitemap.xml would I include a link to our main website URL (even though it's not a subdomain)? blog.example.com/robots.xml does a subdomain need its own robots file? I'm a technical SEO and understand the mechanics of much of on-page SEO.... but for some reason I never found an answer to this specific question and I am wondering how the pros do it. I appreciate your help with this.
Intermediate & Advanced SEO | | seo.owl0 -
Should comments and feeds be disallowed in robots.txt?
Hi My robots file is currently set up as listed below. From an SEO point of view is it good to disallow feeds, rss and comments? I feel allowing comments would be a good thing because it's new content that may rank in the search engines as the comments left on my blog often refer to questions or companies folks are searching for more information on. And the comments are added regularly. What's your take? I'm also concerned about the /page being blocked. Not sure how that benefits my blog from an SEO point of view as well. Look forward to your feedback. Thanks. Eddy User-agent: Googlebot Crawl-delay: 10 Allow: /* User-agent: * Crawl-delay: 10 Disallow: /wp- Disallow: /feed/ Disallow: /trackback/ Disallow: /rss/ Disallow: /comments/feed/ Disallow: /page/ Disallow: /date/ Disallow: /comments/ # Allow Everything Allow: /*
Intermediate & Advanced SEO | | workathomecareers0 -
MOZ crawl report says category pages blocked by meta robots but theyr'e not?
I've just run a SEOMOZ crawl report and it tells me that the category pages on my site such as http://www.top-10-dating-reviews.com/category/online-dating/ are blocked by meta robots and have the meta robots tag noindex,follow. This was the case a couple of days ago as I run wordpress and am using the SEO Category updater plugin. By default it appears it makes categories noindex, follow. Therefore I edited the plugin so that the default was index, follow as I want google to index the category pages so that I can build links to them. When I open the page in a browser and view source the tags show as index, follow which adds up. Why then is the SEOMOZ report telling me they are still noindex,follow? Presumably the crawl is in real time and should pick up the new follow tag or is it perhaps because its using data from an old crawl? As yet these pages aren't indexed by google. Any help is much appreciated! Thanks Sam.
Intermediate & Advanced SEO | | SamCUK0 -
Old pages still crawled by SE returning 404s. Better to put 301 or block with robots.txt ?
Hello guys, A client of ours has thousand of pages returning 404 visibile on googl webmaster tools. These are all old pages which don't exist anymore but Google keeps on detecting them. These pages belong to sections of the site which don't exist anymore. They are not linked externally and didn't provide much value even when they existed What do u suggest us to do: (a) do nothing (b) redirect all these URL/folders to the homepage through a 301 (c) block these pages through the robots.txt. Are we inappropriately using part of the crawling budget set by Search Engines by not doing anything ? thx
Intermediate & Advanced SEO | | H-FARM0