How can I get a list of every url of a site in Google's index?
-
I work on a site that has almost 20,000 URLs in its sitemap. Google WMT claims 28,000 indexed, and a site: search on Google shows 33,000. I'd like to find out where the difference comes from.
Is there a way to get an Excel sheet with every URL Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tags that are not entered in the sitemap but not yet set to noindex could easily produce extra pages like this.
The next step is parameterization. Is anything going on there with search URLs or product URLs? For example: ?refid=1235134&q=search+term or ?prod=152134&variant=blue
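One rough way to check for parameterization is to count which query-string parameters show up across a list of known URLs. The sketch below assumes you already have such a list (from a crawl export or server logs, say); the function name `parameter_counts` and the sample URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl


def parameter_counts(urls):
    """Count how many URLs carry each query-string parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so that "?q=" still counts as using "q".
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample URLs modeled on the patterns mentioned above.
sample_urls = [
    "http://domain.com/search?refid=1235134&q=search+term",
    "http://domain.com/item?prod=152134&variant=blue",
    "http://domain.com/item?prod=152134&variant=red",
    "http://domain.com/about",
]
counts = parameter_counts(sample_urls)
```

Parameters that appear on a large share of URLs but are missing from the sitemap are good candidates for explaining extra indexed pages.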
If you really want to scrape through Google, get the list of URLs from your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c", etc. This should let you dig deeper than the sitemap and see what Google really has indexed. For subfolders with tons of URLs, like domain.com/product/a, you'll want to do the same thing at the subfolder level instead of at the root.
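Building the list of queries is the easy part and can be sketched as follows. This only generates the query strings; it does not send anything to Google. The helper name `inurl_queries` is hypothetical, and including the digits 0-9 alongside a-z is my own assumption, not something from the suggestion above.

```python
import string


def inurl_queries(domain, subfolder=""):
    """Build one inurl: query per leading character, at root or subfolder level."""
    folder = subfolder.strip("/")
    prefix = f"{domain}/{folder}/" if folder else f"{domain}/"
    # Assumption: paths may start with a digit as well as a letter.
    characters = string.ascii_lowercase + string.digits
    return [f"inurl:{prefix}{ch}" for ch in characters]


root_queries = inurl_queries("domain.com")
product_queries = inurl_queries("domain.com", "product")
```

Here `root_queries` starts with `inurl:domain.com/a` and `product_queries` starts with `inurl:domain.com/product/a`, matching the two levels described above.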
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, the WMT API doesn't have it. The site:xxxx.com search is where I got one of the two too-high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of Google Analytics organic landing pages comes to 10,000 URLs: half as many as the sitemap and a third as many as Google says are indexed. On scraping Google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There are a couple of solutions, though neither will give you 100% of the data. The best would be to export a list of landing pages from Google Analytics or your favorite web analytics tool, segmented by organic search / Google. This gives you a list of pages that received traffic via search and so must be indexed. Cross-referencing them with your sitemaps might already help you out a bit. Besides that, you could crawl and scrape the URLs from a site:xxx.com search.
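The cross-referencing step could look something like the sketch below. It assumes both lists are already loaded as plain Python lists of URLs; the function names and the normalization rules (lowercase host, no trailing slash, query string and scheme ignored) are my own simplifications and may need adjusting for your site.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    # Drop the trailing slash, but keep "/" for the home page.
    path = parts.path.rstrip("/") or "/"
    # Scheme, query string, and fragment are deliberately ignored here.
    return f"{parts.netloc.lower()}{path}"


def cross_reference(sitemap_urls, landing_urls):
    """Split URLs into those in both lists and those in only one."""
    sitemap = {normalize(u) for u in sitemap_urls}
    landing = {normalize(u) for u in landing_urls}
    return {
        "in_both": sitemap & landing,
        "landing_only": landing - sitemap,
        "sitemap_only": sitemap - landing,
    }


# Hypothetical example data.
result = cross_reference(
    ["http://xxx.com/a/", "http://xxx.com/b"],
    ["http://XXX.com/a", "http://xxx.com/tag/c"],
)
```

The `landing_only` set is the interesting one: those pages got organic traffic, so they are indexed, but they are not in the sitemap.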