Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
Will disallowing URLs in the robots.txt file stop those URLs being indexed by Google?
-
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URLs that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google.
Our developer has told us that these urls are created by a module and are not "real" pages in the CMS.
They would like to add the following to our robots.txt file
Disallow: /catalog/product/gallery/
QUESTION: If these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index?
We don't want these pages to be found.
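For reference, here is a minimal sketch of how to check which URLs the proposed rule would block from crawling, using Python's standard `urllib.robotparser`. The `User-agent: *` line and the example.com URLs are assumptions added for illustration; the post only gives the `Disallow` line. The check only says whether crawling is permitted; it says nothing about whether a URL is already in the index.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: the post only gives the Disallow line; the
# "User-agent: *" group header is added here so the rule applies to all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalog/product/gallery/
"""


def is_crawl_allowed(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if ROBOTS_TXT permits user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only.
gallery_url = "https://www.example.com/catalog/product/gallery/id/123/"
product_url = "https://www.example.com/catalog/product/view/id/123/"

print(is_crawl_allowed(gallery_url))  # gallery pages fall under the Disallow rule
print(is_crawl_allowed(product_url))  # other product pages are not affected
```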
-
That's why I mentioned: "eventually". But thanks for the added information. Hopefully it's clear now for the original poster.
-
Looking at this video - https://www.youtube.com/watch?v=KBdEwpRQRD0&feature=youtu.be - Matt Cutts advises using the noindex tag on every individual page. However, this is very time consuming if you're dealing with a large volume of pages.
The other option he recommends is to use the robots.txt file together with the URL removal tool in GWMT. Although this is the second-choice option, it does seem easier for us to implement than the noindex tag.
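If tagging every page by hand is the obstacle, the same noindex directive can also be sent as an `X-Robots-Tag` HTTP response header, applied by path rather than per page. Below is a rough sketch assuming a Python WSGI application; the thread does not say what the CMS runs on, so `NoindexMiddleware`, `demo_app` and `response_headers` are hypothetical names used only for illustration. The path prefix is the one from the proposed Disallow rule. Note that a crawler has to be able to fetch the URL to see the header, so the path should not also be blocked in robots.txt.

```python
NOINDEX_PREFIX = "/catalog/product/gallery/"  # path taken from the thread


class NoindexMiddleware:
    """WSGI middleware that adds 'X-Robots-Tag: noindex' for matching paths."""

    def __init__(self, app, prefix=NOINDEX_PREFIX):
        self.app = app
        self.prefix = prefix

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            # Append the header only for URLs under the configured prefix.
            if path.startswith(self.prefix):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return self.app(environ, patched_start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>gallery image</body></html>"]


def response_headers(app, path):
    """Call a WSGI app for `path` and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured.update(headers)

    list(app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response))
    return captured


wrapped = NoindexMiddleware(demo_app)
print(response_headers(wrapped, "/catalog/product/gallery/id/1/"))
```

On a real site the equivalent rule would more likely live in the web server or CMS configuration; the middleware form is used here only because it is self-contained.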
-
Hi,
Yes, if you disallow a URL in the robots.txt, it will stop being shown in the search results after some time, even if the page was already indexed. When you disallow URLs in the robots.txt, Google will stop crawling those pages and will eventually stop indexing them.
-
Hi Nico
Great response thanks.
This is certainly something I'm taking into consideration and will question my developer about this.
-
Thanks Thomas.
I'm now finding out from my developer if we are able to noindex these pages with the meta robots tag.
If this isn't possible, it's likely that we'll add the URLs to the robots.txt as you did.
Either way, I think it will be progress, to different degrees.
-
I don't think Martijn's statement is quite correct, as I have had a different experience in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though.
Completely agree. I have done the same for a website I am working with. Ideally we would noindex with meta robots, but that isn't possible. So instead we added the URLs to the robots.txt. The number of indexed pages has dropped, yet when you search for them exactly, the result just says the description can't be reached.
So I was happy with the results as they're now not ranking for the terms they were.
-
I don't think Martijn's statement is quite correct, as I have had a different experience in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though.
In September 2015 I catapulted a website from ~3.000 to 130.000 indexed pages (roughly). 127.000 were essentially canonicalised duplicates (yes, it did make sense) but also blocked by robots.txt - but put into the index nonetheless. The problem was a dynamically generated parameter, always different, always blocked by robots.
The title was equal to the link text; the description became "A description for this result is not available because of this site's robots.txt – learn more." (If Google cannot crawl a URL Google will usually take titles from links pointing to that URL). No sign of disappearing. In fact, Google was happy to add more and more to its index ...
At the start of December 2015 I removed the robots.txt block - Google could now read the canonicals or noindex on the URLs ... the pages only began dropping out, slowly and in bunches of a few thousand in March 2016 - probably due to the very low relevancy and crawl budget assigned to them. Right now there are still about 24.000 pages in the index.
So my answer would be: No - disabling crawling in the robots.txt will NOT remove a page from the index. For that you need to noindex them (which sometimes also works if done in robots.txt, I've heard). Disallowing URLs in the robots.txt will very likely drop pages to the end of useful results, though, as Andy described. (I don't know if this has any influence on the general evaluation of the site as a whole; I'd guess not.)
Regards
Nico
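The distinction above can be sketched as a small check: a noindex only helps if the crawler is allowed to fetch the page and read it. The following uses only Python's standard library; `deindex_status`, the status strings, the sample robots.txt and the example.com URL are hypothetical and for illustration, and the check models only the meta robots tag, not every way a page can be excluded.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))


def deindex_status(robots_txt, url, html, user_agent="Googlebot"):
    """Describe whether a noindex on `url` can actually be seen by the crawler."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    crawlable = robots.can_fetch(user_agent, url)

    meta = MetaRobotsParser()
    meta.feed(html)
    has_noindex = "noindex" in meta.directives

    if has_noindex and crawlable:
        return "noindex is visible to the crawler"
    if has_noindex and not crawlable:
        return "noindex is hidden behind the robots.txt block"
    if not crawlable:
        return "crawl blocked, no noindex: URL may stay indexed"
    return "crawlable and indexable"


# Assumed inputs, for illustration only.
BLOCKING_ROBOTS = "User-agent: *\nDisallow: /catalog/product/gallery/\n"
NOINDEX_PAGE = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
GALLERY_URL = "https://www.example.com/catalog/product/gallery/id/1/"

print(deindex_status(BLOCKING_ROBOTS, GALLERY_URL, NOINDEX_PAGE))
print(deindex_status("", GALLERY_URL, NOINDEX_PAGE))
```

The first call corresponds to the situation described in the post before the robots.txt block was lifted; the second to the situation afterwards.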
-
Thanks Martijn. This is what I was assuming would happen. However, I got a confusing message from my developer which said the following,
"won't remove the URL's from the index but it will mean that they will only show up for very specific searches that customers are extremely unlikely to use. It will also increase Asgard's crawl budget as Google and Bing won't try to crawl these URLs. Would you be happy with this solution?"
I would tend to still agree with your statement though.
-
Yes, they will be eventually. As you are disallowing Google from crawling the URLs, it will probably soon start hiding the descriptions for some of these image pages, since it can't crawl them anymore. Then at some point it will stop looking at them at all.