Trying to reduce pages crawled to within 10K limit via robots.txt

AspenFasteners

Our site has far more pages than our 10K-page PRO account allows, and most of them are not SEO-worthy. In fact, only about 2,000 pages have SEO value. Limitations of the store software mean robots.txt is the only tool I have to sculpt the rogerbot site crawl. However, I am having trouble getting this to work. Our biggest problem is the 35K individual product pages and the related shopping cart links (at least another 35K); these aren't needed, as they duplicate the SEO-worthy content in the product category pages.

The signature of a product page is that it is contained within a folder ending in -p. So I made the following addition to robots.txt:

User-agent: rogerbot
Disallow: /-p/

However, the latest crawl results show the 10K limit is still being exceeded. I went to Crawl Diagnostics and clicked on Export Latest Crawl to CSV. To my dismay I saw the report was overflowing with product page links:

e.g. www.aspenfasteners.com/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm

The value in the column "Search Engine blocked by robots.txt" is FALSE. Does this mean blocked for all search engines? If so, FALSE is correct. If it means "blocked for rogerbot", then the page shouldn't even be in the report, as the report seems to contain only 10K pages.
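The rule and the sample URL above can be checked locally. This is only a rough sketch using Python's standard-library urllib.robotparser, which does plain prefix matching; it is a stand-in and an assumption, not rogerbot's actual matcher. The second URL, with a literal /-p/ folder at the root, is made up for contrast.

```python
from urllib.robotparser import RobotFileParser

# The rule exactly as quoted in the post.
ROBOTS_TXT = """\
User-agent: rogerbot
Disallow: /-p/
"""

# The product URL quoted in the post (scheme added so it parses as a URL).
PRODUCT_URL = (
    "http://www.aspenfasteners.com/"
    "3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm"
)

# Hypothetical URL whose path really does start with /-p/.
LITERAL_URL = "http://www.aspenfasteners.com/-p/anything.htm"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() returns True when the URL is NOT blocked for that user agent.
product_allowed = parser.can_fetch("rogerbot", PRODUCT_URL)
literal_allowed = parser.can_fetch("rogerbot", LITERAL_URL)

print("product page allowed:", product_allowed)
print("literal /-p/ path allowed:", literal_allowed)
```

Under prefix matching, /-p/ only covers paths that begin with those exact characters, so a folder that merely ends in -p would not be caught, which would be consistent with the FALSE in the report.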

Any thoughts or hints on how to attain my goal would REALLY be appreciated; I've been trying for weeks now. Honestly, virtual beers for everyone!

Carlo

andresgmontero

Wow, thank you! Many of the robots.txt testers still show wildcards as disallowed. Good to know, thank you!

AspenFasteners

Hi Andres!

Sorry, I thought I answered this earlier. If I understand correctly, wildcards ARE allowed, according to this reply to my question on the topic: http://www.seomoz.org/q/does-rogerbot-read-url-wildcards-in-robots-txt

Hope THIS reply sticks this time!

andresgmontero

Hi, as far as I know, wildcard characters (like "*") are not allowed there; each line must be an allow, a disallow, a comment, or a blank line. So before you get angry at Roger for not listening to you, go to Google Webmaster Tools > Crawler Access and test the robots.txt file. Hope it works.
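For testing a pattern offline, here is a minimal sketch of a wildcard matcher. It assumes Google-style semantics ("*" matches any run of characters, a trailing "$" anchors the end of the path); whether rogerbot follows these rules is not confirmed here. The function name rule_matches, the /*-p/ pattern, and the category path are hypothetical, used only for illustration.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path.

    Assumed semantics (Google-style, not confirmed for rogerbot):
    '*' matches any run of characters, a trailing '$' anchors the end,
    and anything else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, like a prefix rule.
    return re.match(regex, path) is not None


# Path of the product URL quoted in the question.
PRODUCT_PATH = "/3-Star-tm-Bulbing-Type-Blind-Rivets-Anodized-p/rv006-316x039354-coan.htm"
# Hypothetical category path, for contrast only.
CATEGORY_PATH = "/some-category/index.htm"

plain_hits_product = rule_matches("/-p/", PRODUCT_PATH)
wildcard_hits_product = rule_matches("/*-p/", PRODUCT_PATH)
wildcard_hits_category = rule_matches("/*-p/", CATEGORY_PATH)

print(plain_hits_product, wildcard_hits_product, wildcard_hits_category)
```

If the crawler does honor wildcards, a pattern along the lines of /*-p/ would cover folders ending in -p while leaving other paths alone; if it does not, the pattern would be read literally and match nothing useful.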
