Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
How can I get a list of every url of a site in Google's index?
-
I work on a site that has almost 20,000 urls in its site map. Google WMT claims 28,000 indexed and a search on Google shows 33,000. I'd like to find what the difference is.
Is there a way to get an excel sheet with every url Google has indexed for a site?
Thanks... Mike
-
If this is still an issue you're facing, have you checked the sitemap settings to see which page types are getting included? For example, a site with a few thousand tags that are not entered in the sitemap but not yet set to noindex could easily produce extra pages like this.
The next step is parameterization. Anything going on there with search URLs or product URLs? eg ?refid=1235134&q=search+term or ?prod=152134&variant=blue
If you really want to scrape through Google, get a list of your sitemap and scrape queries like "inurl:domain.com/a", "inurl:domain.com/b", "inurl:domain.com/c". etc. This should allow you to dive deeper into the site map to see what Google really has indexed. For URL subfolders with tons of URLs like domain.com/product/a, you'll want to do the same thing at a subfolder level instead of root URLs.
-
You can do that with a tool like Scrapebox or Outwit. Go slow, or else you'll need to use proxies to get Google to respond fast enough. As another commenter mentioned, it's probably against TOS.
-
You could probably write a macro to do this, although just because you could doesn't mean you should. I don't think it is advisable because you do not want to violate any terms of use for anyone. That is never a good thing.
-
Yes, WMT API doesn't have it. The site site:xxxx.com search is where are got one of the two too high numbers. Thanks... Mike
-
Hi Marijn,
Thanks for the suggestions. 2.5 years of G/A organic landing pages is 10,000 urls.... 1/2 as many as the site map and 1/3rd as many as Google says indexed. On scraping google, do you know of a tool for that?
Thanks... Mike
-
Might be something you can get from the WMT API.
Also, to really see how many pages are indexed, do a site:xxxx.com search, go to the last page, include omitted results, go to the last page again, and add up how many you have. That's probably the most accurate number.
-
Hi Mike,
There a couple of solutions, neither of them provide you with 100% of data. The best would be to export a list of landing pages from Google Analytics or your favorite web analytics tool segmented by organic search/ Google. This would provide you with a list of pages that received traffic via search and so are indexed. If you cross reference them with your sitemaps that might already help you out a bit. Besides that you could crawl and scrape the URLS for a site:xxx.com search.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Image Audit: Getting a list of *ALL* Images on a Site?
Hello! We are doing an image optimization audit, and are therefore trying to find a way to get a list of all images on a site. Screaming Frog seems like a great place to start (as per this helpful article: https://moz.com/ugc/how-to-perform-an-image-optimization-audit), but unfortunately, it doesn't include images in CSS. 😞 Does the community have any ideas for how we try to otherwise get list of images? Thanks in advance for any tips/advice.
Intermediate & Advanced SEO | | mirabile0 -
Change Google's version of Canonical link
Hi My website has millions of URLs and some of the URLs have duplicate versions. We did not set canonical all these years. Now we wanted to implement it and fix all the technical SEO issues. I wanted to consolidate and redirect all the variations of a URL to the highest pageview version and use that as the canonical because all of these variations have the same content. While doing this, I found in Google search console that Google has already selected another variation of URL as canonical and not the highest pageview version. My questions: I have millions of URLs for which I have to do 301 and set canonical. How can I find all the canonical URLs that Google has autoselected? Search Console has a daily quota of 100 or something. Is it possible to override Google's version of Canonical? Meaning, if I set a variation as Canonical and it is different than what Google has already selected, will it change overtime in Search Console? Should I just do a 301 to highest pageview variation of the URL and not set canonicals at all? This way the canonical that Google auto selected might get redirected to the highest pageview variation of the URL. Any advice or help would be greatly appreciated.
Intermediate & Advanced SEO | | SDCMarketing0 -
What's the best possible URL structure for a local search engine?
Hi Mozzers, I'm working at AskMe.com which is a local search engine in India i.e if you're standing somewhere & looking for the pizza joints nearby, we pick your current location and share the list of pizza outlets nearby along with ratings, reviews etc. about these outlets. Right now, our URL structure looks like www.askme.com/delhi/pizza-outlets for the city specific category pages (here, "Delhi" is the city name and "Pizza Outlets" is the category) and www.askme.com/delhi/pizza-outlets/in/saket for a category page in a particular area (here "Saket") in a city. The URL looks a little different if you're searching for something which is not a category (or not mapped to a category, in which case we 301 redirect you to the category page), it looks like www.askme.com/delhi/search/pizza-huts/in/saket if you're searching for pizza huts in Saket, Delhi as "pizza huts" is neither a category nor its mapped to any category. We're also dealing in ads & deals along with our very own e-commerce brand AskMeBazaar.com to make the better user experience and one stop shop for our customers. Now, we're working on URL restructure project and my question to you all SEO rockstars is, what can be the best possible URL structure we can have? Assume, we have kick-ass developers who can manage any given URL structure at backend.
Intermediate & Advanced SEO | | _nitman0 -
URL Injection Hack - What to do with spammy URLs that keep appearing in Google's index?
A website was hacked (URL injection) but the malicious code has been cleaned up and removed from all pages. However, whenever we run a site:domain.com in Google, we keep finding more spammy URLs from the hack. They all lead to a 404 error page since the hack was cleaned up in the code. We have been using the Google WMT Remove URLs tool to have these spammy URLs removed from Google's index but new URLs keep appearing every day. We looked at the cache dates on these URLs and they are vary in dates but none are recent and most are from a month ago when the initial hack occurred. My question is...should we continue to check the index every day and keep submitting these URLs to be removed manually? Or since they all lead to a 404 page will Google eventually remove these spammy URLs from the index automatically? Thanks in advance Moz community for your feedback.
Intermediate & Advanced SEO | | peteboyd0 -
404's - Do they impact search ranking/how do we get rid of them?
Hi, We recently ran the Moz website crawl report and saw a number of 404 pages from our site come back. These were returned as "high priority" issues to fix. My question is, how do 404's impact search ranking? From what Google support tells me, 404's are "normal" and not a big deal to fix, but if they are "high priority" shouldn't we be doing something to remove them? Also, if I do want to remove the pages, how would I go about doing so? Is it enough to go into Webmaster tools and list it as a link no to crawl anymore or do we need to do work from the website development side as well? Here are a couple of examples that came back..these are articles that were previously posted but we decided to close out: http://loyalty360.org/loyalty-management/september-2011/let-me-guessyour-loyalty-program-isnt-working http://loyalty360.org/resources/article/mark-johnson-speaks-at-motivation-show Thanks!
Intermediate & Advanced SEO | | carlystemmer0 -
Does Google Read URL's if they include a # tag? Re: SEO Value of Clean Url's
An ECWID rep stated in regards to an inquiry about how the ECWID url's are not customizable, that "an important thing is that it doesn't matter what these URLs look like, because search engines don't read anything after that # in URLs. " Example http://www.runningboards4less.com/general-motors#!/Classic-Pro-Series-Extruded-2/p/28043025/category=6593891 Basically all of this: #!/Classic-Pro-Series-Extruded-2/p/28043025/category=6593891 That is a snippet out of a conversation where ECWID said that dirty urls don't matter beyond a hashtag... Is that true? I haven't found any rule that Google or other search engines (Google is really the most important) don't index, read, or place value on the part of the url after a # tag.
Intermediate & Advanced SEO | | Atlanta-SMO0 -
How is Google crawling and indexing this directory listing?
We have three Directory Listing pages that are being indexed by Google: http://www.ccisolutions.com/StoreFront/jsp/ http://www.ccisolutions.com/StoreFront/jsp/html/ http://www.ccisolutions.com/StoreFront/jsp/pdf/ How and why is Googlebot crawling and indexing these pages? Nothing else links to them (although the /jsp.html/ and /jsp/pdf/ both link back to /jsp/). They aren't disallowed in our robots.txt file and I understand that this could be why. If we add them to our robots.txt file and disallow, will this prevent Googlebot from crawling and indexing those Directory Listing pages without prohibiting them from crawling and indexing the content that resides there which is used to populate pages on our site? Having these pages indexed in Google is causing a myriad of issues, not the least of which is duplicate content. For example, this file <tt>CCI-SALES-STAFF.HTML</tt> (which appears on this Directory Listing referenced above - http://www.ccisolutions.com/StoreFront/jsp/html/) clicks through to this Web page: http://www.ccisolutions.com/StoreFront/jsp/html/CCI-SALES-STAFF.HTML This page is indexed in Google and we don't want it to be. But so is the actual page where we intended the content contained in that file to display: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff As you can see, this results in duplicate content problems. Is there a way to disallow Googlebot from crawling that Directory Listing page, and, provided that we have this URL in our sitemap: http://www.ccisolutions.com/StoreFront/category/meet-our-sales-staff, solve the duplicate content issue as a result? For example: Disallow: /StoreFront/jsp/ Disallow: /StoreFront/jsp/html/ Disallow: /StoreFront/jsp/pdf/ Can we do this without risking blocking Googlebot from content we do want crawled and indexed? Many thanks in advance for any and all help on this one!
Intermediate & Advanced SEO | | danatanseo0 -
There's a website I'm working with that has a .php extension. All the pages do. What's the best practice to remove the .php extension across all pages?
Client wishes to drop the .php extension on all their pages (they've got around 2k pages). I assured them that wasn't necessary. However, in the event that I do end up doing this what's the best practices way (and easiest way) to do this? This is also a WordPress site. Thanks.
Intermediate & Advanced SEO | | digisavvy0