Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Is there a limit to how many URLs you can put in a robots.txt file?
-
We have a site that has way too many urls caused by our crawlable faceted navigation. We are trying to purge 90% of our urls from the indexes. We put no index tags on the url combinations that we do no want indexed anymore, but it is taking google way too long to find the no index tags. Meanwhile we are getting hit with excessive url warnings and have been it by Panda.
Would it help speed the process of purging urls if we added the urls to the robots.txt file? Could this cause any issues for us? Could it have the opposite effect and block the crawler from finding the urls, but not purge them from the index? The list could be in excess of 100MM urls.
-
Hi Kristen,
I did this recently and it worked. The important part is that you need to block the pages in robots.txt or add a noindex tag to the pages to stop them from being indexed again.
I hope this helps.
-
Hi all, Google Webmaster Tools has a great tool for this. If you go into WMT and select "Google index", then "remove URLs". You can use regex to remove a large batch of URLs then block them in robots.txt to make sure they stay out of the index.
I hope this helps.
-
Great thanks for the input. Per Kristen's post I am worried that it could just block the URLs altogether and they will never get purged from the index.
-
Yes, we have done that and are seeing traction on those urls, but we can't get rid of these old urls as fast as we would like.
Thanks for your input
-
Thanks Kristen, thats what I was afraid I would do. Other than Fetch is there a way to send Google these URLs in mass? There are over 100 million URLs so Fetch is not scalable. They are picking them up slowly, but at current pace it will take a few months and I would like to find a way to make it purge faster.
-
You could add them to the robots.txt but it you have to remember that Google will only read the first 500kb (source) - as far as I understand with the number of url's you want to block you'll pass this limit.
As Google bot is able to understand basic regex expressions it's probably better to use regex (you will probably be able to block all these url's with a few lines of code.
More info here & on Moz: https://moz.com/blog/interactive-guide-to-robots-txtDirk
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Robots.txt Syntax for Dynamic URLs
I want to Disallow certain dynamic pages in robots.txt and am unsure of the proper syntax. The pages I want to disallow all include the string ?Page= Which is the proper syntax?
Technical SEO | | btreloar
Disallow: ?Page=
Disallow: ?Page=*
Disallow: ?Page=
Or something else?0 -
Category URL Pagination where URLs don't change between pages
Hello, I am working on an e-commerce site where there are categories with multiple pages. In order to avoid pagination issues I was thinking of using rel=next and rel=prev and cannonical tags. I noticed a site where the URL doesn't change between pages, so whether you're on page 1,2, or 3 of the same category, the URL doesn't change. Would this be a cleaner way of dealing with pagination?
Technical SEO | | whiteonlySEO0 -
Is there a limit to Internal Redirect?
I know Google says there is no limit to it but I have seen on many websites that too many 301 redirects can be a problem and might negatively affect your rankings in SERPs. I wanted to know especially from people who worked on large ecommerce site. How do they manage internal redirect from one URL to other and how many according to you are too many. I mean if you get a website that contain 300 plus 301 redirections within the website, how will you deal with that? Please let me know if the question is not clear.
Technical SEO | | MoosaHemani0 -
Url folder structure
I work for a travel site and we have pages for properties in destinations and am trying to decide how best to organize the URLs basically we have our main domain, resort pages and we'll also have articles about each resort so the URL structure will actually get longer:
Technical SEO | | Vacatia_SEO
A. domain.com/main-keyword/state/city-region/resort-name
_ domain.com/family-condo-for-rent/orlando-florida/liki-tiki-village_ _ domain.com/main-keyword-in-state-city/resort-name-feature _
_ domain.com/family-condo-for-rent/orlando-florida/liki-tiki-village/kid-friend-pool_ B. Another way to structure would be to remove the location and keyword folders and combine. Note that some of the resort names are long and spaces are being replaced dynamically with dashes.
ex. domain.com/main-keyword-in-state-city/resort-name
_ domain.com/family-condo-for-rent-in-orlando-florida/liki-tiki-village_ _ domain.com/main-keyword-in-state-city/resort-name-feature_
_ domain.com/family-condo-for-rent-in-orlando-florida/liki-tiki-village-kid-friend-pool_ Question: is that too many folders or should i combine or break up? What would you do with this? Trying to avoid too many dashes.0 -
Google indexing despite robots.txt block
Hi This subdomain has about 4'000 URLs indexed in Google, although it's blocked via robots.txt: https://www.google.com/search?safe=off&q=site%3Awww1.swisscom.ch&oq=site%3Awww1.swisscom.ch This has been the case for almost a year now, and it does not look like Google tends to respect the blocking in http://www1.swisscom.ch/robots.txt Any clues why this is or what I could do to resolve it? Thanks!
Technical SEO | | zeepartner0 -
Can too many pages hurt crawling and ranking?
Hi, I work for local yellow pages in Belgium, over the last months we introduced a succesfull technique to boost SEO traffic: we have created over 150k of new pages, all targeting specific keywords and all containing unique content, a site architecture to enable google to find these pages through crawling, xml sitemaps, .... All signs (traffic, indexation of xml sitemaps, rankings, ...) are positive. So far so good. We are able to quickly build more unique pages, and I wonder how google will react to this type of "large scale operation": can it hurt crawling and ranking if google notices big volumes of content (unique content)? Please advice
Technical SEO | | TruvoDirectories0 -
Can I Disallow Faceted Nav URLs - Robots.txt
I have been disallowing /*? So I know that works without affecting crawling. I am wondering if I can disallow the faceted nav urls. So disallow: /category.html/? /category2.html/? /category3.html/*? To prevent the price faceted url from being cached: /category.html?price=1%2C1000
Technical SEO | | tylerfraser
and
/category.html?price=1%2C1000&product_material=88 Thanks!0 -
Trailing Slashes In Url use Canonical Url or 301 Redirect?
I was thinking of using 301 redirects for trailing slahes to no trailing slashes for my urls. EG: www.url.com/page1/ 301 redirect to www.url.com/page1 Already got a redirect for non-www to www already. Just wondering in my case would it be best to continue using htacces for the trailing slash redirect or just go with Canonical URLs?
Technical SEO | | upick-1623910