Can I rely on just robots.txt?
-
We have a test version of a client's website on a separate server before it goes onto the live server.
Some code from the test site has somehow managed to get Google to index the test site, which isn't great!
Would simply adding a robots.txt file to the root of the test site that blocks everything be good enough, or will I also have to put the noindex and nofollow meta tags on all pages of the test site?
-
You can do the inbound link check right here using SEOMoz's Open Site Explorer tool to check for links to the dev site, whether it's in a subdomain, subfolder or a separate site.
Good luck!
Paul
-
That's a great help, cheers.
Where's the best place to do an inbound link check?
-
You're actually up against a bit of a sticky wicket here, SS. You do need the no-index, no-follow meta tags on each page as Irving mentions.
HOWEVER! If you also add a robots.txt directive that blocks the site, the search crawlers will not crawl your pages and therefore will never see the noindex meta tag telling them to remove the incorrectly indexed pages from their index.
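To make that concrete, here is a rough Python sketch using the standard library's urllib.robotparser. The robots.txt content and the dev.example.com URL are placeholders made up for illustration, not the poster's actual setup. It only shows that a blanket Disallow stops a well-behaved crawler from fetching a page at all, which is why the noindex tag on that page would go unseen.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the dev site: block every crawler from everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# dev.example.com is a placeholder host, not a real dev site.
page = "https://dev.example.com/about.html"
if not can_crawl(ROBOTS_TXT, page):
    print("Crawler is blocked, so it never gets to read the noindex tag on", page)
```

With an empty robots.txt the same function reports the page as crawlable, which is the state you want while waiting for the bots to pick up the noindex tags.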
My recommendation is for a belt & suspenders approach.
- Implement the meta noindex, nofollow tags throughout the dev site, but do NOT immediately implement the robots.txt exclusion. Wait a day or two until the pages get recrawled and the bots discover the noindex meta tags.
- Use the Remove URL tools in both Google and Bing Webmaster Tools to request removal of all the dev pages you are aware have been indexed.
- Then add the exclusion directive to the robots.txt file to keep the crawlers out from then on (leaving the noindex, nofollow tags in place).
- Check back in the SERPs periodically to confirm that no other dev pages have been indexed. If they have, do another manual removal request.
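Before moving on to the robots.txt step, it can help to confirm the tags really are on every dev page. Below is a minimal sketch of such a check using Python's built-in html.parser. The class and function names are my own, and the sample HTML strings are invented; in practice you would feed it the saved source of each dev page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Split "noindex, nofollow" into individual lowercase directives.
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the page source carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Invented sample pages for illustration only.
tagged = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
untagged = "<html><head><title>Dev home</title></head></html>"
print(has_noindex(tagged), has_noindex(untagged))
```

Any page where this comes back False still needs the tag added before the robots.txt exclusion goes in.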
Does that make sense?
Paul
P.S. As a last measure, run an inbound links check on the dev pages that got indexed to find out which external pages are linking to them. Get those inbound links removed ASAP so the search engines aren't getting any signals to index the dev site. The other option would be to simply password-protect the directory the dev site is in. A little less convenient, but guaranteed to keep the crawlers out.
-
Cheers, I thought as much.
-
You cannot rely on robots.txt alone; you need to add the meta noindex tag to the pages as well to ensure that they will not get indexed.