No index tag robots.txt
-
Hi Mozzers,
A client's website has a lot of internal directories defined as /node/*.
I already added the rule 'Disallow: /node/*' to the robots.txt file to prevent bots from crawling these pages.
However, the pages are already indexed and appear in the search results.
A Deepcrawl article says you can simply add the rule 'Noindex: /node/*' to the robots.txt file, but other sources claim the only way is to add a noindex directive in the meta robots tag of every page.
Can someone tell me which is the best way to prevent these pages from getting indexed? Small note: there are more than 100 pages.
Thanks!
Jens
-
Hi Jens
I don't know Drupal, but if it's like WordPress, the setting will add a noindex tag to the page.
Do it for one page, then take a look at the code.
Go to the page: right click > View Source
Then go to the three dots at the top right in Chrome and search for noindex. It will look like the attached screenshot. (Ignore the piece crossed out in red.)
Best Regards Nigel
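For anyone who would rather script this check than view source by hand, here is a minimal Python sketch of the same idea. It only parses an HTML string for a meta robots tag. The names `RobotsMetaParser` and `has_noindex` and the sample markup are made up for illustration, and the exact tag your CMS emits may differ.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> / <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # attrs is a list of (name, value) pairs; value can be None
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page source carries a meta robots noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page source, standing in for what View Source would show
sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(sample))
```

To run this against a live page you would fetch the HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `has_noindex`.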
-
Hi Guys,
In Drupal, among the advanced meta tag settings, there is an option:
'Prevents search engines from indexing this page'
Do you happen to know whether these tags are seen as valid by search bots?
Thanks again guys!
-
For the sake of balance, it's probably worth mentioning that I'm with David in that I've seen a robots.txt noindex work. It was used relatively recently by a large publisher when they had an article they had to take down but which Google was holding on to. That's irrelevant nuance in this situation, but I think David deserves more credit than he got here.
In terms of this specific fix I agree with Nigel: remove the Disallow and add a noindex (prompt Google to crawl the pages, with a sitemap if they don't seem to be shifting). You can re-add the Disallow if you think it's necessary, but once all of the appropriate pages have a noindex tag they should stay out of the index. If they are heavily linked to on the site, disallowing them could also result in a loss of link equity (it stops at the link to the disallowed pages). So if you think you can achieve this with just a noindex, you might want to leave it at that.
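If the pages don't seem to be shifting, the sitemap mentioned above can be generated with a few lines of Python. This is only a sketch: `build_sitemap` is a made-up helper, and the example.com URLs are placeholders for the real /node/ addresses.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs; substitute the noindexed /node/ pages you want recrawled
node_urls = [
    "https://www.example.com/node/1",
    "https://www.example.com/node/2",
]
print(build_sitemap(node_urls))
```

The resulting file can be submitted in Search Console so the noindexed pages get recrawled sooner. Note that the string has no XML declaration, so add one if you write it to disk.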
-
Hi David
I'd rather listen to John Mueller - he has specifically said to not use it:
https://www.seroundtable.com/google-do-not-use-noindex-in-robots-txt-20873.html
I wouldn't be advising people to use it on that basis whether it has worked for you this time or not. It's not best practice.
That's all. (Sorry Jens!)
Regards
Nigel
-
Thanks a lot for your answers guys!
-
Hi Nigel,
I agree that what you said is the best solution in this case, but noindex can definitely be done in robots.txt.
I'm not sure of the questionable sites you've seen it mentioned on, but I'd consider Stone Temple and Deep Crawl to be reputable sources.
That said, I always like to test things for myself!
I tried robots.txt noindex on one of my own big sports news websites a little while ago because I didn't want to manually set thousands of old posts to noindex. The robots.txt noindex worked fine.
Cheers,
David
-
Hi Jens/David
You should not use a noindex in Robots.txt. You can put it on the page as a robots tag, but not in Robots.txt
I have never ever seen it used in the Robots.txt - I have seen it mentioned a few times on some questionable sites and the odd mention many years ago but it's bad practice in my opinion.
Read more about Robots.txt here: https://moz.com/learn/seo/robotstxt
If you follow what I have said, that is the correct solution.
Regards Nigel
-
Hi Nigel and Jens,
Just to clarify - noindex is valid in robots.txt for Google but it's not recognized by Bing.
Here's a case study by Stone Temple on using noindex in robots.txt: https://www.stonetemple.com/does-google-respect-robots-txt-noindex-and-should-you-use-it/
From their case study, it was found to be pretty effective, but not 100%. It would be a good solution for large websites, but if you're only looking at 100+ pages I would do as Nigel said above and implement the meta robots noindex tags.
Cheers,
David
-
Hi Jens
You can't add a noindex in the Robots.txt file.
Firstly you need to add a noindex tag to all of the pages in the /node/ directory.
Then remove the Disallow rule from the Robots.txt. You need to do this for Google to see the noindex tags!
If you have a noindex tag and a Disallow, then the directory is blocked, so Google can't see the tags!
Once all the pages have gone from search, add the Disallow back to the Robots.txt file so that Google doesn't waste crawl budget trying to crawl them.
This will solve your problem.
Regards
Nigel
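The reason for that order can be checked with Python's standard `urllib.robotparser`: while the rule blocking /node/ is in place, a compliant crawler never fetches the page and so never sees its noindex tag. This is a rough sketch, not a model of Googlebot. `is_crawlable` is a made-up helper, the URLs are placeholders, and because Python's parser does not treat `*` in a path as a wildcard the way Google does, the rule is written as the plain prefix `/node/`.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if robots_txt lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Step 1: the current file blocks /node/, so the noindex tag is never seen
blocked = "User-agent: *\nDisallow: /node/\n"

# Step 2: with the rule removed, crawlers can fetch the page and read the tag
open_rules = "User-agent: *\nDisallow:\n"

page = "https://www.example.com/node/123"  # placeholder URL
print(is_crawlable(blocked, page))
print(is_crawlable(open_rules, page))
```

Once the pages have dropped out of the index, the same check confirms whether the blocking rule has been put back correctly.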