Will disallowing URL's in the robots.txt file stop those URL's being indexed by Google
-
I found a lot of duplicate title tags showing in Google Webmaster Tools. When I visited the URL's that these duplicates belonged to, I found that they were just images from a gallery that we didn't particularly want Google to index. There is no benefit to the end user in these image pages being indexed in Google.
Our developer has told us that these urls are created by a module and are not "real" pages in the CMS.
They would like to add the following to our robots.txt file
Disallow: /catalog/product/gallery/
QUESTION: If the these pages are already indexed by Google, will this adjustment to the robots.txt file help to remove the pages from the index?
We don't want these pages to be found.
-
That's why I mentioned: "eventually". But thanks for the added information. Hopefully it's clear now for the original poster.
-
Looking at this video - https://www.youtube.com/watch?v=KBdEwpRQRD0&feature=youtu.be Matt Cutts advises to use the noindex tag on every individual page. However, this is very time consuming if you're dealing wit a large volume of pages.
The other option he recommends is to use the robots.txt file as well as the URL removal tool in GWMT, Although this is the second choice option, it does seem easier for us to implement than the noindex tag.
-
Hi,
Yes, if you put any url in the robots.txt it will not be shown in the search results after some time even if your pages were already indexed. Because when your disallow urls in the robots.txt , Google will stop crawling that page and eventually will stop indexing those pages.
-
Hi Nico
Great response thanks.
This is certainly something I'm taking into consideration and will question my developer about this.
-
Thanks Thomas.
I'm now finding out from my developer is we are able to noindex these pages with the meta robots.
If this is something that isn't possible, it's likely that we'll add to the robots.txt as you did.
Either way I think will be progress to different degrees.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
Completely agree, I have done the same for a website I am doing work with, ideally we would noindex with meta robots however that isn't possible. So instead we added to the robots.txt, the number of indexed pages have dropped, yet when you search exactly it just says the description can't be reached.
So I was happy with the results as they're now not ranking for the terms they were.
-
I don' think Martijn's statement is quite correct as I have made different experiences in an accidental experiment. Crawling is not the same as indexing. Google will put pages it cannot crawl into the index ... and they will stay there unless removed somehow. They will probably only show up for specific searches, though
In September 2015 I catapulted a website from ~3.000 to 130.000 indexed pages (roughly). 127.000 were essentially canonicalised duplicates (yes, it did make sense) but also blocked by robots.txt - but put into the index nonetheless. The problem was a dynamically generated parameter, always different, always blocked by robots.
The title was equal to the link text; the description became "A description for this result is not available because of this site's robots.txt – learn more." (If Google cannot crawl a URL Google will usually take titles from links pointing to that URL). No sign of disappearing. In fact, Google was happy to add more and more to its index ...
At the start of December 2015 I removed the robots.txt block - Google could now read the canonicals or noindex on the URLs ... the pages only began dropping out, slowly and in bunches of a few thousand in March 2016 - probably due to the very low relevancy and crawl budget assigned to them. Right now there are still about 24.000 pages in the index.
So my answer would be: No - disabling crawling in the robots.txt will NOT remove a page from the index. For that you need to noindex them (which sometimes also works if done in robots.txt, I've heard). Disallowing URLs in the robots.txt will very likely drop pages to the end of useful results, though, as Andy described. (I don't know if this has any influence on the general evaluation of the site as a whole; I'd guess not.)
Regards
Nico
-
Thanks Martijn. This is what I was assuming would happen. However, I got a confusing message from my developer which said the following,
"won't remove the URL's from the index but it will mean that they will only show up for very specific searches that customers are extremely unlikely to use. It will also increase Asgard's crawl budget as Google and Bing won't try to crawl these URLs. Would you be happy with this solution?"
I would tend to still agree with your statement though.
-
Yes they will be eventually. As you disallow Google to crawl the URLs it will probably start hiding the descriptions for some of these image pages soon as they can't crawl them anymore. Then at some point they'll stop looking at them at all.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Google slow to index pages
Hi We've recently had a product launch for one of our clients. Historically speaking Google has been quick to respond, i.e when the page for the product goes live it's indexed and performing for branded terms within 10 minutes (without 'Fetch and Render'). This time however, we found that it took Google over an hour to index the pages. we found initially that press coverage ranked until we were indexed. Nothing major had changed in terms of the page structure, content, internal linking etc; these were brand new pages, with new product content. Has anyone ever experienced Google having an 'off' day or being uncharacteristically slow with indexing? We do have a few ideas what could have caused this, but we were interested to see if anyone else had experienced this sort of change in Google's behaviour, either recently or previously? Thanks.
Intermediate & Advanced SEO | | punchseo0 -
How will changing my website's page content affect SEO?
Our company is looking to update the content on our existing web pages and I am curious what the best way to roll out these changes are in order to maintain good SEO rankings for certain pages. The infrastructure of the site will not be modified except for maybe adding a couple new pages, but existing domains will stay the same. If the domains are staying the same does it really matter if I just updated 1 page every week or so, versus updating them all at once? Just looking for some insight into how freshening up the content on the back end pages could potentially hurt SEO rankings initially. Thanks!
Intermediate & Advanced SEO | | Bankable1 -
How to stop URLs that include query strings from being indexed by Google
Hello Mozzers Would you use rel=canonical, robots.txt, or Google Webmaster Tools to stop the search engines indexing URLs that include query strings/parameters. Or perhaps a combination? I guess it would be a good idea to stop the search engines crawling these URLs because the content they display will tend to be duplicate content and of low value to users. I would be tempted to use a combination of canonicalization and robots.txt for every page I do not want crawled or indexed, yet perhaps Google Webmaster Tools is the best way to go / just as effective??? And I suppose some use meta robots tags too. Does Google take a position on being blocked from web pages. Thanks in advance, Luke
Intermediate & Advanced SEO | | McTaggart0 -
Does content revealed by a 'show more' button get crawled by Google?
I have a div on my website with around 500 words of unique content in, automatically when the page is first visited the div has a fixed height of 100px, showing a couple of hundred words and fading out to white, with a show more button, which when clicked, increases the height to show the full content. My question is, does Google crawl the content in that div when it renders the page? Or disregard it? Its all in the source code. Or worse, do they consider this cloaking or hidden content? It is only there to make the site more useable for customers, so i don't want to get penalised for it. Cheers
Intermediate & Advanced SEO | | SEOhmygod0 -
How can a recruitment company get 'credit' from Google when syndicating job posts?
I'm working on an SEO strategy for a recruitment agency. Like many recruitment agencies, they write tons of great unique content each month and as agencies do, they post the job descriptions to job websites as well as their own. These job websites won't generally allow any linking back to the agency website from the post. What can we do to make Google realise that the originator of the post is the recruitment agency and they deserve the 'credit' for the content? The recruitment agency has a low domain authority and so we've very much at the start of the process. It would be a damn shamn if they produced so much great unique content but couldn't get Google to recognise it. Google's advice says: "Syndicate carefully: If you syndicate your content on other sites, Google will always show the version we think is most appropriate for users in each given search, which may or may not be the version you'd prefer. However, it is helpful to ensure that each site on which your content is syndicated includes a link back to your original article. You can also ask those who use your syndicated material to use the noindex meta tag to prevent search engines from indexing their version of the content." - But none of that can happen. Those big job websites just won't do it. A previous post here didn't get a sufficient answer. I'm starting to think there isn't an answer, other than having more authority than the websites we're syndicating to. Which isn't going to happen any time soon! Any thoughts?
Intermediate & Advanced SEO | | Mark_Reynolds0 -
Is there any SEO advantage to sharing links on twitter using google's url shortener goo.gl/
Hi is there any advantage to using <cite class="vurls">goo.gl/</cite> to shorten a URL for Twitter instead of other ones? I had a thought that <cite class="vurls">goo.gl/</cite> might allow google to track click throughs and hence judge popularity.
Intermediate & Advanced SEO | | S_Curtis0 -
Robots.txt file - How to block thosands of pages when you don't have a folder path
Hello.
Intermediate & Advanced SEO | | Unity
Just wondering if anyone has come across this and can tell me if it worked or not. Goal:
To block review pages Challenge:
The URLs aren't constructed using folders, they look like this:
www.website.com/default.aspx?z=review&PG1234
www.website.com/default.aspx?z=review&PG1235
www.website.com/default.aspx?z=review&PG1236 So the first part of the URL is the same (i.e. /default.aspx?z=review) and the unique part comes immediately after - so not as a folder. Looking at Google recommendations they show examples for ways to block 'folder directories' and 'individual pages' only. Question:
If I add the following to the Robots.txt file will it block all review pages? User-agent: *
Disallow: /default.aspx?z=review Much thanks,
Davinia0 -
Why is google automatically showing my competitor's result even when customer types in our brand name on the query?
This is little weird. We run a website specific to mobile phones called as 91mobiles.com. The site has gained lot of user interest and trust in the last 2 years [pre-dominantly indian users]. Lot of our users type mobile phone model and 91mobiles as the query. Example "Sony Xperia P 91mobiles". Google is showing gsmarena.com results on top and then shows our own results below! What's annoying is the fact that google also bolds the term gsmarena [denoting that it's a synonym]. Any idea why this is happening? We are very sure that we are not doing anything wrong..We have worked really hard for the last 2 years to reach where we are..and it's kind of hard to see gsmarena siphoning away our traffic for no reason at all [even when customer types in 91mobiles a part of the query to quality it]...Can some experts here demystify this? Thanks.
Intermediate & Advanced SEO | | Gaadi0