Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Will disallowing URL's in the robots.txt file stop those URL's being indexed by Google
-
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URL's that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google.
Our developer has told us that these urls are created by a module and are not "real" pages in the CMS.
They would like to add the following to our robots.txt file
Disallow: /catalog/product/gallery/
QUESTION: If the these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index?
We don't want these pages to be found.
-
That's why I mentioned: "eventually". But thanks for the added information. Hopefully it's clear now for the original poster.
-
Looking at this video - https://www.youtube.com/watch?v=KBdEwpRQRD0&feature=youtu.be Matt Cutts advises to use the noindex tag on every individual page. However, this is very time consuming if you're dealing wit a large volume of pages.
The other option he recommends is to use the robots.txt file as well as the URL removal tool in GWMT, Although this is the second choice option, it does seem easier for us to implement than the noindex tag.
-
Hi,
Yes, if you put any url in the robots.txt it will not be shown in the search results after some time even if your pages were already indexed. Because when your disallow urls in the robots.txt , Google will stop crawling that page and eventually will stop indexing those pages.
-
Hi Nico
Great response thanks.
This is certainly something I'm taking into consideration and will question my developer about this.
-
Thanks Thomas.
I'm now finding out from my developer is we are able to noindex these pages with the meta robots.
If this is something that isn't possible, it's likely that we'll add to the robots.txt as you did.
Either way I think will be progress to different degrees.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
Completely agree, I have done the same for a website I am doing work with, ideally we would noindex with meta robots however that isn't possible. So instead we added to the robots.txt, the number of indexed pages have dropped, yet when you search exactly it just says the description can't be reached.
So I was happy with the results as they're now not ranking for the terms they were.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
In September 2015 I catapulted a website from ~3.000 to 130.000 indexed pages (roughly). 127.000 were essentially canonicalised duplicates (yes, it did make sense) but also blocked by robots.txt - but put into the index nonetheless. The problem was a dynamically generated parameter, always different, always blocked by robots.
The title was equal to the link text; the description became "A description for this result is not available because of this site's robots.txt – learn more." (If Google cannot crawl a URL Google will usually take titles from links pointing to that URL). No sign of disappearing. In fact, Google was happy to add more and more to its index ...
At the start of December 2015 I removed the robots.txt block - Google could now read the canonicals or noindex on the URLs ... the pages only began dropping out, slowly and in bunches of a few thousand in March 2016 - probably due to the very low relevancy and crawl budget assigned to them. Right now there are still about 24.000 pages in the index.
So my answer would be: No - disabling crawling in the robots.txt will NOT remove a page from the index. For that you need to noindex them (which sometimes also works if done in robots.txt, I've heard). Disallowing URLs in the robots.txt will very likely drop pages to the end of useful results, though, as Andy described. (I don't know if this has any influence on the general evaluation of the site as a whole; I'd guess not.)
Regards
Nico
-
Thanks Martijn. This is what I was assuming would happen. However, I got a confusing message from my developer which said the following,
"won't remove the URL's from the index but it will mean that they will only show up for very specific searches that customers are extremely unlikely to use. It will also increase Asgard's crawl budget as Google and Bing won't try to crawl these URLs. Would you be happy with this solution?"
I would tend to still agree with your statement though.
-
Yes they will be eventually. As you disallow Google to crawl the URLs it will probably start hiding the descriptions for some of these image pages soon as they can't crawl them anymore. Then at some point they'll stop looking at them at all.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
If my website do not have a robot.txt file, does it hurt my website ranking?
After a site audit, I find out that my website don't have a robot.txt. Does it hurt my website rankings? One more thing, when I type mywebsite.com/robot.txt, it automatically redirect to the homepage. Please help!
Intermediate & Advanced SEO | | binhlai0 -
Does content revealed by a 'show more' button get crawled by Google?
I have a div on my website with around 500 words of unique content in, automatically when the page is first visited the div has a fixed height of 100px, showing a couple of hundred words and fading out to white, with a show more button, which when clicked, increases the height to show the full content. My question is, does Google crawl the content in that div when it renders the page? Or disregard it? Its all in the source code. Or worse, do they consider this cloaking or hidden content? It is only there to make the site more useable for customers, so i don't want to get penalised for it. Cheers
Intermediate & Advanced SEO | | SEOhmygod0 -
Partial Match or RegEx in Search Console's URL Parameters Tool?
So I currently have approximately 1000 of these URLs indexed, when I only want roughly 100 of them. Let's say the URL is www.example.com/page.php?par1=ABC123=&par2=DEF456=&par3=GHI789= All the indexed URLs follow that same kinda format, but I only want to index the URLs that have a par1 of ABC (but that could be ABC123 or ABC456 or whatever). Using URL Parameters tool in Search Console, I can ask Googlebot to only crawl URLs with a specific value. But is there any way to get a partial match, using regex maybe? Am I wasting my time with Search Console, and should I just disallow any page.php without par1=ABC in robots.txt?
Intermediate & Advanced SEO | | Ria_0 -
Why Google isn't indexing my images?
Hello, on my fairly new website Worthminer.com I am noticing that Google is not indexing images from my sitemap. Already 560 images submitted and Google indexed only 3 of them. Altough there is more images indexed they are not indexing any new images, and I have no idea why. Posts, categories and other urls are indexing just fine, but images not. I am using Wordpress and for sitemaps Wordpress SEO by yoast. Am I missing something here? Why Google won't index my images? Thanks, I appreciate any help, David xv1GtwK.jpg
Intermediate & Advanced SEO | | Worthminer1 -
A few questions on Google's Structured Data Markup Helper...
I'm trying to go through my site and add microdata with the help of Google's Structured Data Markup Helper. I have a few questions that I have not been able to find an answer for. Here is the URL I am referring to: http://www.howlatthemoon.com/locations/location-chicago My company is a bar/club, with only 4 out of 13 locations serving food. Would you mark this up as a local business or a restaurant? It asks for "URL" above the ratings. Is this supposed to be the URL that ratings are on like Yelp or something? Or is it the URL for the page? Either way, neither of those URLs are on the page so I can't select them. If it is for Yelp should I link to it? How do I add reviews? Do they have to be on the page? If I make a group of days for Day of the Week for Opening hours, such as Mon-Thu, will that work out? I have events on this page. However, when I tried to do the markup for just the event it told me to use itemscope itemtype="http://schema.org/Event" on the body tag of the page. That is just a small part of the page, I'm not sure why I would put the event tag on the whole body? Any other tips would be much appreciated. Thanks!
Intermediate & Advanced SEO | | howlusa0 -
Do you add 404 page into robot file or just add no index tag?
Hi, got different opinion on this so i wanted to double check with your comment is. We've got /404.html page and I was wondering if you would add this page to robot text so it wouldn't be indexed or would you just add no index tag? What would be the best approach? Thanks!
Intermediate & Advanced SEO | | Rubix0 -
Do I need to use canonicals if I will be using 301's?
I just took a job about three months and one of the first things I wanted to do was restructure the site. The current structure is solution based but I am moving it toward a product focus. The problem I'm having is the CMS I'm using isn't the greatest (and yes I've brought this up to my CMS provider). It creates multiple URL's for the same page. For example, these two urls are the same page: (note: these aren't the actual urls, I just made them up for demonstration purposes) http://www.website.com/home/meet-us/team-leaders/boss-man/
Intermediate & Advanced SEO | | Omnipress
http://www.website.com/home/meet-us/team-leaders/boss-man/bossman.cmsx (I know this is terrible, and once our contract is up we'll be looking at a different provider) So clearly I need to set up canonical tags for the last two pages that look like this: http://www.omnipress.com/boss-man" /> With the new site restructure, do I need to put a canonical tag on the second page to tell the search engine that it's the same as the first, since I'll be changing the category it's in? For Example: http://www.website.com/home/meet-us/team-leaders/boss-man/ will become http://www.website.com/home/MEET-OUR-TEAM/team-leaders/boss-man My overall question is, do I need to spend the time to run through our entire site and do canonical tags AND 301 redirects to the new page, or can I just simply redirect both of them to the new page? I hope this makes sense. Your help is greatly appreciated!!0 -
Disallowed Pages Still Showing Up in Google Index. What do we do?
We recently disallowed a wide variety of pages for www.udemy.com which we do not want google indexing (e.g., /tags or /lectures). Basically we don't want to spread our link juice around to all these pages that are never going to rank. We want to keep it focused on our core pages which are for our courses. We've added them as disallows in robots.txt, but after 2-3 weeks google is still showing them in it's index. When we lookup "site: udemy.com", for example, Google currently shows ~650,000 pages indexed... when really it should only be showing ~5,000 pages indexed. As another example, if you search for "site:udemy.com/tag", google shows 129,000 results. We've definitely added "/tag" into our robots.txt properly, so this should not be happening... Google showed be showing 0 results. Any ideas re: how we get Google to pay attention and re-index our site properly?
Intermediate & Advanced SEO | | udemy0