How can I get a list of every url of a site in Google's index?
-
I work on a site that has almost 20,000 urls in its site map. Google WMT claims 28,000 indexed and a search on Google shows 33,000. I'd like to find what the difference is.
Is there a way to get an excel sheet with every url Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tags that are not entered in the sitemap but not yet set to noindex could easily produce extra pages like this.
The next step is parameterization. Anything going on there with search URLs or product URLs? eg ?refid=1235134&q=search+term or ?prod=152134&variant=blue
If you really want to scrape through Google, get a list of your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c". etc. This should allow you to dive deeper into the site map to see what Google really has indexed. For URL subfolders with tons of URLs like domain.com/product/a, you'll want to do the same thing at a subfolder level instead of root URLs.
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, WMT API doesn't have it. The site site:xxxx.com search is where are got one of the two too high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of G/A organic landing pages is 10,000 urls.... 1/2 as many as the site map and 1/3rd as many as Google says indexed. On scraping google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There a couple of solutions, neither of them provide you with 100% of data. The best would be to export a list of landing pages from Google Analytics or your favorite web analytics tool segmented by organic search/ Google. This would provide you with a list of pages that received traffic via search and so are indexed. If you cross reference them with your sitemaps that might already help you out a bit. Besides that you could crawl and scrape the URLS for a site:xxx.com search.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
List of SEO "to do's" to increase organic rankings
We are looking for a complete list of all white hat SEO "to do's" that an SEO firm should do in order to help increase Google/Bing/Yahoo organic rankings. We would like to use this list to be sure that the SEO company/individual we choose uses all these white hat items as part of an overall SEO strategy to increase organic rankings. Can anyone please point me in the right direction as to where we can obtain this complete list? If this is not the best approach, please let me know what is, as I am not an SEO person. Thank you kindly in advance
Intermediate & Advanced SEO | | RetractableAwnings.com0 -
How can I remove my old sites URL from showing up in Google?
Hi everyone. We have had a new site up for over a year now. When I search site:sqlsentry.net the old url still shows up and while those pages are redirected to .com I'd like to get the .net URL's out of google forever. What is the best way I can go about that?
Intermediate & Advanced SEO | | Sika220 -
Https & http urls in Google Index
Hi everyone, this question is a two parter: I am now working for a large website - over 500k monthly organic traffic. The site currently has both http and https urls in Google's index. The website has not formally converted to https. The https began with an error and has evolved unchecked over time. Both versions of the site (http & https) are registered in webmaster tools so I can clearly track and see that as time passes http indexation is decreasing and https has been increasing. The ratio is at about 3:1 in favor of https at this time. Traffic over the last year has slowly dipped, however, over the last two months there has been a steady decline in overall visits registered through analytics. No single page appears to be the culprit, this decline is occurring across most pages of the website, pages which traditionally draw heavy traffic - including the home page. Considering that Google is giving priority to https pages, could it be possible that the split is having a negative impact on traffic as rankings sway? Additionally, mobile activity for the site has steadily increased both from a traffic and a conversion standpoint. However that traffic has also dipped significantly over the last two months. Looking at Google's mobile usability error's page I see a significant number of errors (over 1k). I know Google has been testing and changing mobile ranking factors, is it safe to posit that this could be having an impact on mobile traffic? The traffic declines are 9-10% MOM. Thank you. ~Geo
Intermediate & Advanced SEO | | Geosem0 -
Google's form for "Small sites that should rank better" | Any experiences or results?
Back in August of 2013 Google created a form that allowed people to submit small websites that "should be ranking better in Google". There is more info about it in this article http://www.seroundtable.com/google-small-site-survey-17295.html Has anybody used it? Any experiences or results you can share? *private message if you do not want to share publicly...
Intermediate & Advanced SEO | | GregB1230 -
Huge Google index on E-commerce site
Hi Guys, Refering back to my original post I would first like to thank you guys for all the advice. We implemented canonical url's all over the site and noindexed some url's with robots.txt and the site already went from 100.000+ url's indexed to 87.000 urls indexed in GWT. My question: Is there way to speed this up?
Intermediate & Advanced SEO | | ssiebn7
I do know about the way to remove url's from index (with noindex of robots.txt condition) but this is a very intensive way to do so. I was hoping you guys maybe have a solution for this.. 🙂0 -
Google Site Extended Listing Not Indexed
I am trying to get the new Site map to be picked up by Google for the extended listing as its pulling from the old links and returning 404 errors. How can I get the site listing indexed quickly and have the extended listing get updated to point to the right places. This is the site - http://epaperflip.com/Default.aspx This is the search with the extended listing and some 404's - Broad Match search for "epaperflip"
Intermediate & Advanced SEO | | Intergen0 -
Removing Dynamic "noindex" URL's from Index
6 months ago my clients site was overhauled and the user generated searches had an index tag on them. I switched that to noindex but didn't get it fast enough to avoid being 100's of pages indexed in Google. It's been months since switching to the noindex tag and the pages are still indexed. What would you recommend? Google crawls my site daily - but never the pages that I want removed from the index. I am trying to avoid submitting hundreds of these dynamic URL's to the removal tool in webmaster tools. Suggestions?
Intermediate & Advanced SEO | | BeTheBoss0 -
Does Google crawl the pages which are generated via the site's search box queries?
For example, if I search for an 'x' item in a site's search box and if the site displays a list of results based on the query, would that page be crawled? I am asking this question because this would be a URL that is non existent on the site and hence am confused as to whether Google bots would be able to find it.
Intermediate & Advanced SEO | | pulseseo0