Google: How to See URLs Blocked by Robots?
-
Google Webmaster Tools says we have 17K out of 34K URLs that are blocked by our Robots.txt file.
How can I see the URLs that are being blocked?
Here's our Robots.txt file.
User-agent: * Disallow: /swish.cgi Disallow: /demo Disallow: /reviews/review.php/new/ Disallow: /cgi-audiobooksonline/sb/order.cgi Disallow: /cgi-audiobooksonline/sb/productsearch.cgi Disallow: /cgi-audiobooksonline/sb/billing.cgi Disallow: /cgi-audiobooksonline/sb/inv.cgi Disallow: /cgi-audiobooksonline/sb/new_options.cgi Disallow: /cgi-audiobooksonline/sb/registration.cgi Disallow: /cgi-audiobooksonline/sb/tellfriend.cgi Disallow: /*?gdftrk
-
It seems you might be asking two different questions here, Larry.
You ask which URLs are blocked by your robots file. You then answered your own question by listing the entries in your robots file which are the actual URLs that it is blocking.
If in fact what you want to know is which pages exist on your website but are not currently indexed, that's a much bigger question and requires a lot more work to answer.
There is no way Webmaster Tools can give you that answer, because if it was aware of the URL it would already be indexing it.
HOWEVER! It is possible to do it if you are willing to do some of the work on your own to collect and manipulate data using several tools. Essentially, you have to do it in three steps:
- create a list of all the URLs that Google says are indexed. (This info comes from Google's SERPs.)
- then create a separate list of all of the URLs that actually exist on your website. (This must come from a 3rd-party tool you run against your site yourself.)
- From there, you will use Excel to subtract the indexed URLs from the known URLs, leaving a list of non-indexed URLS, which is what you asked for.
I actually laid out this process step-by-step in response to an earlier question, so you can read the process there http://www.seomoz.org/q/how-to-determine-which-pages-are-not-indexed
Is that what you were looking for?
Paul
-
Okay, well the robots.txt will only be excluding robots from the folders and URLs specified and as I say, there's no way to download a list of all the URLs that Google is not indexing from webmaster tools.
If you have exact URLs in mind which you think might be getting excluded, you can test individual URLs in Google Webmaster Tools in:
Health > Blocked URLs > URLs Specify the URLs and user-agents to test against.
Beyond this, if you want to know if there are URLs that shouldn't be excluded in the folders you have specified, I would run a crawl of your website using SEOMoz' crawl test or Screaming Frog. Then sort the URLs alphabetically and make sure that all of the URLs in the folders you have excluded via robots.txt are ones that you want to exclude.
-
I want to make sure that Google is indexing all of our pages we want them to. I.E. That all of the NOT indexed URLs are valid.
-
Hi Larry
Why do you want to find those URLs out for my understanding? Are you concerned that the robots.txt is blocking URLs it shouldn't be?
As for downloading a list of URLs which aren't indexed from Google Webmaster Tools, which is what I think you would really like, this isn't possible at the moment.
-
Liz; Perhaps my post was unclear or I am misunderstanding your answer.
I want to find out the specific URLs that Google says it isn't indexing because of our Robots.txt file.
-
If you want to see if Google has indexed individual pages which are supposed to be excluded, you can check the URLs in your robots.txt using the site: command.
E.g. type the following into Google:
site:http://www.audiobooksonline.com/swish.cgi
site:http://www.audiobooksonline.com/reviews/review.php/new/
...continue for all the URLs in your robots.txtJust from searching on the last example above (site:http://www.audiobooksonline.com/reviews/review.php/new/) I can see that you have results indexed. This is probably because you added the robots.txt after it was already indexed.
To get rid of these results you need to take the culprit line out of the robots.txt, add the robots meta tag set to noindex to all pages you want removed, submit a URL removal request via webmaster tools, check it has been nonidexed then you can add the line back into the robots.txt.
This is the tag:
I hope that makes sense and is useful!
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
My url disappeared from Google but Search Console shows indexed. This url has been indexed for more than a year. Please help!
Super weird problem that I can't solve for last 5 hours. One of my urls: https://www.dcacar.com/lax-car-service.html Has been indexed for more than a year and also has an AMP version, few hours ago I realized that it had disappeared from serps. We were ranking on page 1 for several key terms. When I perform a search "site:dcacar.com " the url is no where to be found on all 5 pages. But when I check my Google Console it shows as indexed I requested to index again but nothing changed. All other 50 or so urls are not effected at all, this is the only url that has gone missing can someone solve this mystery for me please. Thanks a lot in advance.
Intermediate & Advanced SEO | | Davit19850 -
Trailing Slashes on URLs
Hi we currently have a site on Wordpress which has two version of each URL trailing slash on URLs and one without it. Example: www.domain.com/page (preferred version - based on link data) www.domain.com/page**/** The non-slash version of the URL has most of the external links pointing to them, so we are going to pick that as the preferred version. However, currently, each version of every URL has rel canonical tag pointing to the non-preferred version. E.g. www.domain.com/page the rel canonical tag is: www.domain.com/page/ What would be the best way to clean up this setup? Cheers.
Intermediate & Advanced SEO | | cathywix0 -
Search Results Pages Blocked in Robots.txt?
Hi I am reviewing our robots.txt file. I wondered if search results pages should be blocked from crawling? We currently have this in the file /searchterm* Is it a good thing for SEO?
Intermediate & Advanced SEO | | BeckyKey0 -
Silly Question still - Because I am paying high to google adwords is it possible google can't rank me high in organic?
Hello All, My ecommerce site gone in penalty more than 3 years before and within 3 months I got message from google penalty removed. Since then till date my organic ranking is very worst. In this 3 years I improved my site onpage very great. If I compare my site with all other competitors who are ranking in top 10 then my onpage that includes all schema, reviews, sitemap, header tags, meta's etc, social media, site structure, most imp speed, google page speed insight score, pingdom, w3c errors, alexa rank, global rank, UI, offers, design, content, code to text raito, engagement rate, page views, time on site etc all my sites always good compare to competitors. They also have few backlinks I do have few backlinks only. I am doing very high google adwords and my conversion rate is very very good. But do you think because I am paying since last 3 year high to google because of that google have some setting or strategy that those who perform well in adwords so not to bring up in organic? Is it possible I can talk with google on this? If yes then what will be the medium of conversation? Pls give some valuable inputs I am performing very much in paid so user end site is very very well. Thanks!
Intermediate & Advanced SEO | | pragnesh96390 -
Ranking on google but not Bing?
Any reason why I could be ranking for Google but not Bing?
Intermediate & Advanced SEO | | edward-may0 -
Why is this url redirecting to our site?
I was doing an audit on our site and searching for duplicate content using some different terms from each of our pages. I came across the following result: www.sswug.org/url/32639 redirects to our website. Is that normal? There are hundreds of these url's in google all with the exact same description. I thought it was odd. Any ideas and what is the consequence of this?
Intermediate & Advanced SEO | | Sika220 -
Is there a way to contact Google besides the google product forum?
Our traffic from google has dropped more than 35% and continues to fall. We have been on this forum and google's webmaster forum trying to get help. We received great advice, have waited months, but instead of our traffic improving, it has worsened. We are being penalized by google for many keywords such as trophies, trophies and awards and countless others - we were on page one previously. We filed two reconsideration requests and were told both times that there were no manual penalties. Some of our pages continue to rank well, so it is not across the board (but all of our listings went down a bit). We have made countless changes (please see below). Our busy season was from March to May and we got clobbered. Google, as most people know, is a monopoly when it comes to traffic, so we are getting killed. At first we thought it was Penquin, but it looks like we started getting killed late last year. Lots of unusual things happened - we had a large spike in traffic for two days, then lost our branded keywords, then our main keywords. Our branded keywords came back pretty quickly, but nothing else did. We have received wonderful advice and made most of the changes. We are a very reputable company and have a feeling we are being penalized for something other than spamming. For example, we have a mobile site we added late last year and a wholesale system was added around the same time. Since the date does not coincide with Penquin, we think there is some major technical driver, but have no idea what to do at this point. The webmasters have all been helpful, but nothing is working. We are trying to find out what one does in a situation as we are trying to avoid closing our business. Thank you! Changes Made: 1. We had many crawl errors so we reduced them significantly 2. We had introduced a mobile website in January which we
Intermediate & Advanced SEO | | trophycentraltrophiesandawards
thought may have been the cause (splitting traffic, duplicate content, etc.),
so we had our mobile provider add the site to their robots.txt file. 3. We were told by a webmaster that their were too many
links from our search provider, so we have them put the search pages in a
robots.txt file. 4. We were told that we had too much duplicate content. This was / is true, as we have hundred of legitate products that are similar:
example trophies and certificates that are virtually the same but are
for different sports or have different colors and sizes. Still, we added more content and added no index tags to many products. We compared our % of dups to competitors and it is far less. 5. At the recommendation of another webmaster, we changed
many pages that might have been splitting traffic. 6. Another webmaster told us that too many people were
linking into our site with the same text, namely Trophy Central and that it
might have appeared we were trying to game the system somehow. We have never bought links and don't even have a webmaster although over the last 10 years have worked with programmers and seo companies (but we don't think any have done anything unusual). 7. At the suggestion of another webmaster, we have tried to
improve our link profile. For example,
we found Yahoo was not linking to our url. 8. We were told to setup a 404 page, so we did 9. We were told to ensure that all of the similar domains
were pointing to www.trophycentral.com/ so we setup redirects 10. We were told that a site that we have linking to us from too many places so we reduced it to 1. Our key pages have A rankings from SEOMOZ for the selected keywords. We have made countless other changes recommended by experts
but have seen no improvements (actually got worse). I am the
president of the company and have made most of the above recent changes myself. Our website is trophycentral.com0 -
Export list of urls in google's index?
Is there a way to export an exact list of urls found in Google's index?
Intermediate & Advanced SEO | | nicole.healthline0