How can I get a list of every URL of a site in Google's index?
-
I work on a site that has almost 20,000 URLs in its sitemap. Google WMT claims 28,000 indexed, and a search on Google shows 33,000. I'd like to find out where the difference comes from.
Is there a way to get an Excel sheet with every URL Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tag pages that are left out of the sitemap but not yet set to noindex could easily produce extra indexed pages like this.
The next step is parameterization. Is anything going on there with search URLs or product URLs, e.g. ?refid=1235134&q=search+term or ?prod=152134&variant=blue?
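If you already have a list of URLs from somewhere (a crawl export, server logs, etc.), a rough way to check for parameter bloat is to group them by path and count how many query-string variants each path has. This is only a minimal sketch: the function name `parameter_report` and the example.com URLs are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def parameter_report(urls):
    """Group URLs by scheme + host + path and summarize their query strings.

    Returns {base_url: {"variants": count of URLs seen for that path,
                        "params": set of query-parameter names seen}}.
    """
    report = {}
    for url in urls:
        parts = urlsplit(url)
        base = f"{parts.scheme}://{parts.netloc}{parts.path}"
        entry = report.setdefault(base, {"variants": 0, "params": set()})
        entry["variants"] += 1
        # keep_blank_values=True so that "?q=" still counts as parameter "q"
        entry["params"].update(
            name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
        )
    return report


if __name__ == "__main__":
    # Hypothetical sample data, not taken from any real site.
    sample = [
        "http://example.com/product?prod=152134&variant=blue",
        "http://example.com/product?prod=152134&variant=red",
        "http://example.com/search?refid=1235134&q=search+term",
        "http://example.com/about",
    ]
    for base, info in parameter_report(sample).items():
        print(base, info["variants"], sorted(info["params"]))
```

Paths with a high variant count and several parameter names are the ones most likely to be inflating the indexed total.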
If you really want to scrape through Google, get a list of the URLs in your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c", etc. This should allow you to dive deeper into the sitemap to see what Google really has indexed. For subfolders with tons of URLs, like domain.com/product/a, you'll want to do the same thing at the subfolder level instead of at the root.
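To build that list of queries from a sitemap, something like the sketch below could work. It only generates the query strings and does not send anything to Google. The names `split_prefix` and `inurl_queries` and the `limit` threshold are my own assumptions: a prefix is split one character deeper whenever it covers more than `limit` sitemap URLs, which gives the root-level and subfolder-level queries described above.

```python
from urllib.parse import urlsplit


def split_prefix(prefix, items, limit):
    """Return prefixes that each cover at most `limit` items where possible."""
    longer = [item for item in items if len(item) > len(prefix)]
    if len(items) <= limit or not longer:
        return [prefix]
    # Bucket the items by the prefix extended with one more character.
    buckets = {}
    for item in longer:
        buckets.setdefault(item[: len(prefix) + 1], []).append(item)
    prefixes = []
    for sub_prefix, members in sorted(buckets.items()):
        prefixes.extend(split_prefix(sub_prefix, members, limit))
    return prefixes


def inurl_queries(sitemap_urls, host, limit=500):
    """Build 'inurl:' queries for one host from a list of sitemap URLs."""
    root = host + "/"
    # Compare on host + path only; scheme and query string are ignored.
    items = sorted({urlsplit(u).netloc + urlsplit(u).path for u in sitemap_urls})
    items = [item for item in items if item.startswith(root)]
    return ["inurl:" + prefix for prefix in split_prefix(root, items, limit)]


if __name__ == "__main__":
    # Hypothetical sitemap entries for illustration.
    urls = [
        "http://domain.com/about",
        "http://domain.com/blog",
        "http://domain.com/product/a1",
        "http://domain.com/product/a2",
        "http://domain.com/product/b1",
    ]
    for query in inurl_queries(urls, "domain.com", limit=2):
        print(query)
```

A sensible value for `limit` depends on how many results you can actually page through for a single query, so treat the default as a placeholder.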
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, the WMT API doesn't have it. The site:xxxx.com search is where I got one of the two too-high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of Google Analytics organic landing pages comes to 10,000 URLs: half as many as the sitemap and a third as many as Google says are indexed. On scraping Google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There are a couple of solutions, though neither of them will give you 100% of the data. The best would be to export a list of landing pages from Google Analytics, or your favorite web analytics tool, segmented by organic search/Google. This gives you a list of pages that received traffic via search and so must be indexed. If you cross-reference them with your sitemaps, that might already help you out a bit. Besides that, you could crawl and scrape the URLs from a site:xxx.com search.
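The cross-referencing step can be done with plain set operations once both lists are exported. Here is a minimal sketch; the function names are hypothetical, and it assumes it is fine to compare on path only (dropping scheme, host, query string, and trailing slash), since analytics exports often list paths while sitemaps list full URLs.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a full URL or a bare path to its path, without a trailing slash."""
    path = urlsplit(url).path.rstrip("/")
    return path or "/"


def cross_reference(sitemap_urls, landing_pages):
    """Compare sitemap URLs against organic landing pages from analytics."""
    sitemap = {normalize(u) for u in sitemap_urls}
    landed = {normalize(u) for u in landing_pages}
    return {
        "in_both": sorted(sitemap & landed),
        # Received search traffic, so indexed, but missing from the sitemap.
        "traffic_but_not_in_sitemap": sorted(landed - sitemap),
        # Listed in the sitemap but no organic landings recorded.
        "in_sitemap_without_traffic": sorted(sitemap - landed),
    }


if __name__ == "__main__":
    # Hypothetical sample data for illustration.
    sitemap_urls = ["http://example.com/", "http://example.com/about/",
                    "http://example.com/products"]
    landing_pages = ["/", "/about", "/tag/widgets"]
    for bucket, paths in cross_reference(sitemap_urls, landing_pages).items():
        print(bucket, paths)
```

The "traffic_but_not_in_sitemap" bucket is the interesting one here, since those pages are indexed without being in the sitemap.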