Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Google indexing despite robots.txt block
-
Hi
This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch
This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt
Any clues why this is or what I could do to resolve it?
Thanks!
-
It sounds like Martijn solved your problem, but I still wanted to add that robots.txt exclusions keep search bots from reading pages that are disallowed, but it does not stop those pages from being returned in search results. When those pages do appear, a lot of times they'll have a page description along the lines of "A description of this page is not available due to this sites robots.txt".
If you want to ensure that pages are kept out of search engines results, you have to use the noindex meta tag on each page.
-
Yes, I think the crucial point is that addressing googlebot wouldn't resolve the specific problem I have here.
I would have tried adressing googlebot otherwise. But to be honest, I wouldn't have expected a much different result than specifying all user agents. Googlebot should be part of that exclusion in any case.
-
I thought that value was a bit outdated, turns out to be still accepted. Although it probably only address this issue for him in Google and I assume it will still remain one in other search engines.
Besides that the problem offered a way better solution in allowing Google not on the HTTPS site.
-
Specifically for Googlebot. I'm pretty surprised people would disagree - Stephan Spencer recommended this in a personal conversation with me.
-
Did you mean a noindex tags for robots or a specific one for googlebot? With the second one I probably get the downvotes.
-
People who are disagreeing with this, explain your reasoning.
-
A noindex tag specific to Googlebot would also be a good idea.
-
You're welcome, it was mostly due to noticing that the first snippet, the homepage, had no snippet and the rest of the pages did have one. That led me to looking at their URL structure. Good luck fixing it!
-
100 points for you Martijn, thanks! I'm pretty sure you've found the problem and I'll go about fixing it. Gotta get used to having https used more frequently now...
-
Hi Phillipp,
You almost got me with this one, but it's fairly simple. In your question you're pointing at the robots.txt of your HTTP page. But it's mostly your HTTP**S **pages that are indexed and if you look at that robots.txt file it's pretty clear why these pages are indexed: https://www1.swisscom.ch/robots.txt all the pages that are indexed match with one of your Allow statements are the complete Disallow. Hopefully that provides you with the insight on how to fix your issue.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Is there a limit to how many URLs you can put in a robots.txt file?
We have a site that has way too many urls caused by our crawlable faceted navigation. We are trying to purge 90% of our urls from the indexes. We put no index tags on the url combinations that we do no want indexed anymore, but it is taking google way too long to find the no index tags. Meanwhile we are getting hit with excessive url warnings and have been it by Panda. Would it help speed the process of purging urls if we added the urls to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the urls, but not purge them from the index? The list could be in excess of 100MM urls.
Technical SEO | | kcb81780 -
Blocking Affiliate Links via robots.txt
Hi, I work with a client who has a large affiliate network pointing to their domain which is a large part of their inbound marketing strategy. All of these links point to a subdomain of affiliates.example.com, which then redirects the links through a 301 redirect to the relevant target page for the link. These links have been showing up in Webmaster Tools as top linking domains and also in the latest downloaded links reports. To follow guidelines and ensure that these links aren't counted by Google for either positive or negative impact on the site, we have added a block on the robots.txt of the affiliates.example.com subdomain, blocking search engines from crawling the full subddomain. The robots.txt file is the following code: User-agent: * Disallow: / We have authenticated the subdomain with Google Webmaster Tools and made certain that Google can reach and read the robots.txt file. We know they are being blocked from reading the affiliates subdomain. However, we added this affiliates subdomain block a few weeks ago to the robots.txt, but links are still showing up in the latest downloads report as first being discovered after we added the block. It's been a few weeks already, and we want to make sure that the block was implemented properly and that these links aren't being used to negatively impact the site. Any suggestions or clarification would be helpful - if the subdomain is being blocked for the search engines, why are the search engines following the links and reporting them in the www.example.com subdomain GWMT account as latest links. And if the block is implemented properly, will the total number of links pointing to our site as reported in the links to your site section be reduced, or does this not have an impact on that figure?From a development standpoint, it's a much easier fix for us to adjust the robots.txt file than to change the affiliate linking connection from a 301 to a 302, which is why we decided to go with this option.Any help you can offer will be greatly appreciated.Thanks,Mark
Technical SEO | | Mark_Ginsberg0 -
Unnecessary pages getting indexed in Google for my blog
I have a blog dapazze.com and I am suffering from a problem for a long time. I found out that Google have indexed hundreds of replytocom links and images attachment pages for my blog. I had to remove these pages manually using the URL removal tool. I had used "Disallow: ?replytocom" in my robots.txt, but Google disobeyed it. After that, I removed the parameter from my blog completely using the SEO by Yoast plugin. But now I see that Google has again started indexing these links even after they are not present in my blog (I use #comment). Google have also indexed many of my admin and plugin pages, whereas they are disallowed in my robots.txt file. Have a look at my robots.txt file here: http://dapazze.com/robots.txt Please help me out to solve this problem permanently?
Technical SEO | | rahulchowdhury0 -
De-indexed from Google
Hi Search Experts! We are just launching a new site for a client with a completely new URL. The client can not provide any access details for their existing site. Any ideas how can we get the existing site de-indexed from Google? Thanks guys!
Technical SEO | | rikmon0 -
Should I block robots from URLs containing query strings?
I'm about to block off all URLs that have a query string using robots.txt. They're mostly URLs with coremetrics tags and other referrer info. I figured that search engines don't need to see these as they're always better off with the original URL. Might there be any downside to this that I need to consider? Appreciate your help / experiences on this one. Thanks Jenni
Technical SEO | | ShearingsGroup0 -
How to block "print" pages from indexing
I have a fairly large FAQ section and every article has a "print" button. Unfortunately, this is creating a page for every article which is muddying up the index - especially on my own site using Google Custom Search. Can you recommend a way to block this from happening? Example Article: http://www.knottyboy.com/lore/idx.php/11/183/Maintenance-of-Mature-Locks-6-months-/article/How-do-I-get-sand-out-of-my-dreads.html Example "Print" page: http://www.knottyboy.com/lore/article.php?id=052&action=print
Technical SEO | | dreadmichael0 -
How to remove a sub domain from Google Index!
Hello, I have a website having many subdomains having same copy of content i think its harming my SEO for that site since abc and xyz sub domains do have same contents. Thus i require to know i have already deleted required subdomain DNS RECORDS now how to have those pages removed from Google index as well ? The DNS Records no more exists for those subdomains already.
Technical SEO | | anand20100 -
Does Google index XML files?
Does Google or other search engines include XML files in their index? More specifically, I am wondering how Google knows the difference between an xml filetype and an RSS feed.
Technical SEO | | nicole.healthline0