Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Block Domain in robots.txt
-
Hi.
We had some URLs that were indexed in Google from a www1-subdomain. We have now disabled the URLs (returning a 404 - for other reasons we cannot do a redirect from www1 to www) and blocked via robots.txt. But the amount of indexed pages keeps increasing (for 2 weeks now). Unfortunately, I cannot install Webmaster Tools for this subdomain to tell Google to back off...
Any ideas why this could be and whether it's normal?
I can send you more domain infos by personal message if you want to have a look at it.
-
Hi Philipp,
I have not heard of Google going rogue like this before, however I have seen it with other search engines (Baidu).
I would first verify that the robots.txt is configured correctly, and verify there is no links anywhere to the domain. The reason I mentioned this prior, was due to this official notification on Google: https://support.google.com/webmasters/answer/156449?rd=1
While Google won't crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
My next thought would be, did Google start crawling the site before the robots.txt blocked them from doing so? This may have caused Google to start the indexing process which is not instantaneous, then you have the new urls appear after the robots.txt went into effect. The solution is add the meta tag noindex, or block put an explicit block on the server as I mention above.
If you are worried about duplicate content issues you maybe able to at least canonical the subdomain urls to the correct url.
Hope that helps and good luck
-
Hi Don
Thanks for your hint. It doesn't look like there are any links to the www1 subdomain. Also, since we've let the www1-Subdomain return 404's and blocked it with robots, the indexed pages increased from 39'300 to 45'100 so this is more than anybody would link to... Really strange why Google just ignores robots and keeps indexing...
-
Hi Phil,
Is it possible that google is find the links on another site (like somebody else has your links on their site)? Depending on your situation a good catch all block is to secure the www1 domain with (.htaccess/**.**htpasswd ) this would force anybody (even bots) to provide credentials to see or explore the site. Of course everybody who needs access to the site would have the credentials. So in theory you shouldn't see any more urls getting indexed.
Hope that helps,Don
-
Thanks for the resource Chris! The strange thing is that Google keeps indexing new URLs even though it is clearly blocked via robots.txt...
But I guess I'll just wait for these 90 days to pass then...
-
Phillip,
If you've deleted the URLs, there's not much else for you to do. You're experiencing the lag between when Google crawls and indexes pages new pages and when it finds and removes a 404 URL from it's index.
You should think 90 days as an approximate time frame for your page count in the index to start dropping. Here's more from google:
https://support.google.com/webmasters/answer/1663419
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Do I need a separate robots.txt file for my shop subdomain?
Hello Mozzers! Apologies if this question has been asked before, but I couldn't find an answer so here goes... Currently I have one robots.txt file hosted at https://www.mysitename.org.uk/robots.txt We host our shop on a separate subdomain https://shop.mysitename.org.uk Do I need a separate robots.txt file for my subdomain? (Some Google searches are telling me yes and some no and I've become awfully confused!
Technical SEO | | sjbridle0 -
URL Structure On Site - Currently it's domain/product-name NOT domain/category/product name is this bad?
I have a eCommerce site and the site structure is domain/product-name rather than domain/product-category/product-name Do you think this will have a negative impact SEO Wise? I have seen that some of my individual product pages do get better rankings than my categories.
Technical SEO | | the-gate-films0 -
Robots.txt on http vs. https
We recently changed our domain from http to https. When a user enters any URL on http, there is an global 301 redirect to the same page on https. I cannot find instructions about what to do with robots.txt. Now that https is the canonical version, should I block the http-Version with robots.txt? Strangely, I cannot find a single ressource about this...
Technical SEO | | zeepartner0 -
Robots.txt and Multiple Sitemaps
Hello, I have a hopefully simple question but I wanted to ask to get a "second opinion" on what to do in this situation. I am working on a clients robots.txt and we have multiple sitemaps. Using yoast I have my sitemap_index.xml and I also have a sitemap-image.xml I do put them in google and bing by hand but wanted to have it added into the robots.txt for insurance. So my question is, when having multiple sitemaps called out on a robots.txt file does it matter if one is before the other? From my reading it looks like you can have multiple sitemaps called out, but I wasn't sure the best practice when writing it up in the file. Example: User-agent: * Disallow: Disallow: /cgi-bin/ Disallow: /wp-admin/ Disallow: /wp-content/plugins/ Sitemap: http://sitename.com/sitemap_index.xml Sitemap: http://sitename.com/sitemap-image.xml Thanks a ton for the feedback, I really appreciate it! :) J
Technical SEO | | allstatetransmission0 -
I accidentally blocked Google with Robots.txt. What next?
Last week I uploaded my site and forgot to remove the robots.txt file with this text: User-agent: * Disallow: / I dropped from page 11 on my main keywords to past page 50. I caught it 2-3 days later and have now fixed it. I re-imported my site map with Webmaster Tools and I also did a Fetch as Google through Webmaster Tools. I tweeted out my URL to hopefully get Google to crawl it faster too. Webmaster Tools no longer says that the site is experiencing outages, but when I look at my blocked URLs it still says 249 are blocked. That's actually gone up since I made the fix. In the Google search results, it still no longer has my page title and the description still says "A description for this result is not available because of this site's robots.txt – learn more." How will this affect me long-term? When will I recover my rankings? Is there anything else I can do? Thanks for your input! www.decalsforthewall.com
Technical SEO | | Webmaster1230 -
Checkout on different domain
Is it a bad SEO move to have a your checkout process on a separate domain instead of the main domain for a ecommerce site. There is no real content on the checkout pages and they are completely new pages that are not indexed in the search engines. Do to the backend architecture it is impossibe for us to have them on the same domain. An example is this page: http://www.printingforless.com/2/Brochure-Printing.html One option we've discussed to not pass page rank on to the checkout domain by iFraming all of the links to the checkout domain. We could also move the checkout process to a subdomain instead of a new domain. Please ignore the concerns with visitors security and conversion rate. Thanks!
Technical SEO | | PrintingForLess.com0 -
Subdomain and Domain Rankings
I have read here that domain names with keywords might add a boost to your search rank For instance using a completely inane example monkey-fights.com might get a boost compared to mfl.com (monkey fighting league) when searching for "monkey fights" There seems to be a hot debate as to how much bonus the first domain might get over the second, but leaving that aside for the moment. Question 1. Would monkey-fights.mfl.com get the same kind of bonus as a root domain bonus? Question 2. If the answer to 1 above was yes would a 301 redirect from the suddomain URL to root domain URL retain that bonus I was just thinking on how hard it is to get root domains these days that are not either being squatted on etc. and if this might be a way to get the same bonus, or maybe subdomains are less bonus prone and so it would be a waste of time Thanks
Technical SEO | | bThere0 -
Using hyphenated sub-domains or non-hyphenated sub-domains? What is the question! I Any takers?
For our corporate business level domain, we are exploring using a hyphenated sub-domain foir a project. Something like www.go-figure.extreme.com I thought from a user perspective it seems cluttered. The domain length might also be an issue with the new Algorithm big G has launched in recent past. I know with past experience, hyphenated domains usually take longer to index, as they are used by spammers more frequently and can take longer to get out of the supplementary index. Our company site has over 90 million viewers / year, so our brand is well established and traffic isn't an issue. This is for a corporate level project and I didn't have the answer! Will this work? anyone have any experience testing this. Any thoughts will help! Thanks, Rob
Technical SEO | | RobMay0