Google: How to See URLs Blocked by Robots?
-
Google Webmaster Tools says we have 17K out of 34K URLs that are blocked by our Robots.txt file.
How can I see the URLs that are being blocked?
Here's our Robots.txt file.
User-agent: * Disallow: /swish.cgi Disallow: /demo Disallow: /reviews/review.php/new/ Disallow: /cgi-audiobooksonline/sb/order.cgi Disallow: /cgi-audiobooksonline/sb/productsearch.cgi Disallow: /cgi-audiobooksonline/sb/billing.cgi Disallow: /cgi-audiobooksonline/sb/inv.cgi Disallow: /cgi-audiobooksonline/sb/new_options.cgi Disallow: /cgi-audiobooksonline/sb/registration.cgi Disallow: /cgi-audiobooksonline/sb/tellfriend.cgi Disallow: /*?gdftrk
-
It seems you might be asking two different questions here, Larry.
You ask which URLs are blocked by your robots file. You then answered your own question by listing the entries in your robots file which are the actual URLs that it is blocking.
If in fact what you want to know is which pages exist on your website but are not currently indexed, that's a much bigger question and requires a lot more work to answer.
There is no way Webmaster Tools can give you that answer, because if it was aware of the URL it would already be indexing it.
HOWEVER! It is possible to do it if you are willing to do some of the work on your own to collect and manipulate data using several tools. Essentially, you have to do it in three steps:
- create a list of all the URLs that Google says are indexed. (This info comes from Google's SERPs.)
- then create a separate list of all of the URLs that actually exist on your website. (This must come from a 3rd-party tool you run against your site yourself.)
- From there, you will use Excel to subtract the indexed URLs from the known URLs, leaving a list of non-indexed URLS, which is what you asked for.
I actually laid out this process step-by-step in response to an earlier question, so you can read the process there http://www.seomoz.org/q/how-to-determine-which-pages-are-not-indexed
Is that what you were looking for?
Paul
-
Okay, well the robots.txt will only be excluding robots from the folders and URLs specified and as I say, there's no way to download a list of all the URLs that Google is not indexing from webmaster tools.
If you have exact URLs in mind which you think might be getting excluded, you can test individual URLs in Google Webmaster Tools in:
Health > Blocked URLs > URLs Specify the URLs and user-agents to test against.
Beyond this, if you want to know if there are URLs that shouldn't be excluded in the folders you have specified, I would run a crawl of your website using SEOMoz' crawl test or Screaming Frog. Then sort the URLs alphabetically and make sure that all of the URLs in the folders you have excluded via robots.txt are ones that you want to exclude.
-
I want to make sure that Google is indexing all of our pages we want them to. I.E. That all of the NOT indexed URLs are valid.
-
Hi Larry
Why do you want to find those URLs out for my understanding? Are you concerned that the robots.txt is blocking URLs it shouldn't be?
As for downloading a list of URLs which aren't indexed from Google Webmaster Tools, which is what I think you would really like, this isn't possible at the moment.
-
Liz; Perhaps my post was unclear or I am misunderstanding your answer.
I want to find out the specific URLs that Google says it isn't indexing because of our Robots.txt file.
-
If you want to see if Google has indexed individual pages which are supposed to be excluded, you can check the URLs in your robots.txt using the site: command.
E.g. type the following into Google:
site:http://www.audiobooksonline.com/swish.cgi
site:http://www.audiobooksonline.com/reviews/review.php/new/
...continue for all the URLs in your robots.txtJust from searching on the last example above (site:http://www.audiobooksonline.com/reviews/review.php/new/) I can see that you have results indexed. This is probably because you added the robots.txt after it was already indexed.
To get rid of these results you need to take the culprit line out of the robots.txt, add the robots meta tag set to noindex to all pages you want removed, submit a URL removal request via webmaster tools, check it has been nonidexed then you can add the line back into the robots.txt.
This is the tag:
I hope that makes sense and is useful!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Blocking Dynamic Search Result Pages From Google
Hi Mozzerds, I have a quick question that probably won't have just one solution. Most of the pages that Moz crawled for duplicate content we're dynamic search result pages on my site. Could this be a simple fix of just blocking these pages from Google altogether? Or would Moz just crawl these pages as critical crawl errors instead of content errors? Ultimately, I contemplated whether or not I wanted to rank for these pages but I don't think it's worth it considering I have multiple product pages that rank well. I think in my case, the best is probably to leave out these search pages since they have more of a negative impact on my site resulting in more content errors than I would like. So would blocking these pages from the Search Engines and Moz be a good idea? Maybe a second opinion would help: what do you think I should do? Is there another way to go about this and would blocking these pages do anything to reduce the number of content errors on my site? I appreciate any feedback! Thanks! Andrew
Intermediate & Advanced SEO | | drewstorys0 -
Which URL is better for SEO?
We have a URL structure question: Because we have websites in multiple countries and in multiple languages, we need to add additional elements to our URL structure. Of the two following options, what would be better for SEO? Option 1: www.abccompany.com/abc-ca-en/home.htm Option 2: www.abccompany.com/home.abc.ca.en.htm
Intermediate & Advanced SEO | | northwoods-2603420 -
How can I get a list of every url of a site in Google's index?
I work on a site that has almost 20,000 urls in its site map. Google WMT claims 28,000 indexed and a search on Google shows 33,000. I'd like to find what the difference is. Is there a way to get an excel sheet with every url Google has indexed for a site? Thanks... Mike
Intermediate & Advanced SEO | | 945010 -
HTML for URL markup
Hi, We are changing our URLs to be more SEO friendly. Is there any negative impact or pitfall of using <base> HTML-tag? Our developers are considering it as a possible solution for relative URLs inside HTML-markup in the Friendly URL context.
Intermediate & Advanced SEO | | theLotter0 -
Blocking poor quality content areas with robots.txt
I found an interesting discussion on seoroundtable where Barry Schwartz and others were discussing using robots.txt to block low quality content areas affected by Panda. http://www.seroundtable.com/google-farmer-advice-13090.html The article is a bit dated. I was wondering what current opinions are on this. We have some dynamically generated content pages which we tried to improve after panda. Resources have been limited and alas, they are still there. Until we can officially remove them I thought it may be a good idea to just block the entire directory. I would also remove them from my sitemaps and resubmit. There are links coming in but I could redirect the important ones (was going to do that anyway). Thoughts?
Intermediate & Advanced SEO | | Eric_edvisors0 -
How to See Image Metadata?
We sell 1000s of audiobooks and get our cover images and descriptions from the publisher’s sites. When I download a cover image such as this one (http://www.audiobooksonline.com/media/Alex-Cross-Run-James-Patterson.jpg)
Intermediate & Advanced SEO | | lbohen
I always rename and re-size it before installing at our Web store. Would this process result in any publisher’s metadata in the image we use at our Web store and/or anything else Google would not like?
Is there an online utility that would allow me to see metadata in our images?0 -
Why am I seeing this in ahrefs?
I'm working on diagnosing the reason for a traffic drop for a site. When I look at the referring domains report in ahrefs I see a huge drop in the number of referring domains that happens exactly on the day of the traffic drop. However, when I look at the new/lost backlinks report there is no coinciding loss in links. How is this possible?
Intermediate & Advanced SEO | | MarieHaynes0