Disallow statement - is this tiny anomaly enough to render Disallow invalid?
-
Google site search (site:'hbn.hoovers.com') indicates 171,000 results for this subdomain. That is not a desired result - this site has 100% duplicate content. We don't want SEs spending any time here.
Robots.txt is set up mostly right to disallow all search engines from indexing this site. That asterisk at the end of the disallow statement looks pretty harmless - but could that be why the site has been indexed?
User-agent: * Disallow: /*
-
Interesting. I'd never heard that before.
We've never had GA or GWT on these mirror sites before, so it's hard to say what Google is doing these days.
But the goal is definitely to make them and their contents invisible to SEs. We'll get GWT on there and start removing URLs.
Thanks!
-
The additional asterisk shouldn't do you any harm, although standard practice seems to be just putting the "/".
Does it seem like Google is still crawling this subdomain when you look at webmasters crawl stats? While the disallow function in robots.txt will usually stop bots from crawling, it doesn't prevent them from indexing or keeping pages indexed that were before the disallow was put in place. If you want these pages removed from the index, you can request it through webmasters and also use meta robots noindex as opposed to the robots.txt file. Moz has a good article about it here: http://moz.com/blog/robot-access-indexation-restriction-techniques-avoiding-conflicts
If you're just worried about bots crawling the subdomain, it's possible they've already stopped crawling it, but continue to index it due to history or additional indicators suggesting they should index it.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Disallowing WP 'author' page archives
Hey Mozzers. I want to block my author archive pages, but not the primary page of each author. For example, I want to keep /author/jbentz/ but get rid of /author/jbentz/page/4/. Can I do that in robots by using a * where the author name would be populated. ' So, basically... my robots file would include something like this... Disallow: /author/*/page/ Will this work for my intended goal... or will this just disallow all of my author pages?
Technical SEO | | Netrepid0 -
All other things equal, do server rendered websites rank higher than JavaScript web apps that follow the AJAX Crawling Spec?
I instinctively feel like server rendered websites should rank higher since Google doesn't truly know that the content its getting from an AJAX site is what the user is seeing and Google isn't exactly sure of the page load time (and thus user experience). I can't find any evidence that would prove this, however. A website like Monocle.io uses pushstate, loads fast, has good page titles, etc., but it is a JavaScript single page application. Does it make any difference?
Technical SEO | | jeffwhelpley0 -
Disallow: /404/ - Best Practice?
Hello Moz Community, My developer has added this to my robots.txt file: Disallow: /404/ Is this considered good practice in the world of SEO? Would you do it with your clients? I feel he has great development knowledge but isn't too well versed in SEO. Thank you in advanced, Nico.
Technical SEO | | niconico1011 -
Robots.txt to disallow /index.php/ path
Hi SEOmoz, I have a problem with my Joomla site (yeah - me too!). I get a large amount of /index.php/ urls despite using a program to handle these issues. The URLs cause indexation errors with google (404). Now, I fixed this issue once before, but the problem persist. So I thought, instead of wasting more time, couldnt I just disallow all paths containing /index.php/ ?. I don't use that extension, but would it cause me any problems from an SEO perspective? How do I disallow all index.php's? Is it a simple: Disallow: /index.php/
Technical SEO | | Mikkehl0 -
Will invalid HTML code generated by WordPress affect SEO efforts?
Hi all, I'm new to SEOmoz and SEO in general really. I run a small but well regarded freelance website and graphic design business, and until very recently had an employee who handled the SEO side of things. I'm now looking to step into this role myself and hopefully learn the in's and out's of SEO. I've no doubt there will be much to learn, but the SEOmoz tools and it's community seem excellent and helpful. My question then is basically, if WordPress generated HTML code can have an effect on SEO, when it's reported as invalid by tools such as the W3C HTML validator? I'm used to hand coding the majority of my websites for clients, where creating valid HTML and CSS code is something I can do with relative ease. A new client however wants to use WordPress - for ease of updating the site content themselves. The client does however consider any potential SEO implications to be a very important factor in choosing a hand coded vs. WordPress based website. I am aware that WordPress itself is just a means of generating HTML code, and that to the search engines there is no difference between this and the hand coded websites I usually produce. However if WordPress is generating HTML that is being reported as invalid, would this make the search engines penalise the site? On a second note, will the search engines look negatively on a WordPress site where it is being used as a standard website, and the content may not be updated as frequently, as say, a blog? Thanks for your time, and I look forward to hearing your suggestions.
Technical SEO | | SavilleWolf0 -
Allow or Disallow First in Robots.txt
If I want to override a Disallow directive in robots.txt with an Allow command, do I have the Allow command before or after the Disallow command? example: Allow: /models/ford///page* Disallow: /models////page
Technical SEO | | irvingw0 -
How long does it take for traffic to bounce back from and accidental robots.txt disallow of root?
We accidentally uploaded a robots.txt disallow root for all agents last Tuesday and did not catch the error until yesterday.. so 6 days total of exposure. Organic traffic is down 20%. Google has since indexed the correct version of the robots.txt file. However, we're still seeing awful titles/descriptions in the SERPs and traffic is not coming back. GWT shows that not many pages were actually removed from the index but we're still seeing drastic rankings decreases. Anyone been through this? Any sort of timeline for a recovery? Much appreciated!
Technical SEO | | bheard0 -
Disallowing https URLs
It there a problem disallowing all https URLs to be indexed in order to avoid duplication? This is the article recommending this practice - http://blog.leonardchallis.com/seo/serve-a-different-robots-txt-for-https/ Thanks!
Technical SEO | | theLotter0