What sources can we use to compile as comprehensive a list as possible of pages indexed in Google?
-
As part of a Panda recovery initiative, we are trying to compile as comprehensive a list as possible of the URLs currently indexed by Google.
Using the site:domain.com operator, Google reports that approximately 21k pages are indexed. Scraping the results, however, stops after 240 links have been listed.
Are there any other sources we could use to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool, but for sources that would let us confidently determine the list of URLs currently held in the Google index.
Thank you /Thomas
-
We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
-
Thanks Pete,
As always very much appreciate your input.
1/ We aren't using any parameters, and with filter=0 we get the same results. In the test I just ran, I was only able to pull 350 pages out of 18.5k through the web interface. If anyone has any other thoughts on this, please let me know.
2/ That is a great idea. Unfortunately, most of our pages live in the root directory to keep the URL slugs short, so this one will not help us.
3/ Another good idea. As I understand it, this approach shows how well the pages you want are covered in the Google index, but it won't help determine which superfluous pages are currently in the index, unless I misunderstood you?
4/ We are using Screaming Frog and I agree it's a fantastic tool. Its crawl shows no more than 300 pages, which is the index size we are ultimately aiming for.
Overall, we are seeing continuous yet small drops in the index size with our approach of returning 410 response codes for unwanted pages and submitting dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
We are just trying to get a more complete list of what is currently in the index to speed up delisting.
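To illustrate the goal: once URL lists from several sources have been collected, the superfluous pages are simply the indexed URLs that are not on the wanted list. The following is only a rough sketch; the function names, the normalization rules, and the sample URLs are assumptions made for illustration and are not taken from any particular tool.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Lower-case scheme and host and drop the fragment so that
    trivially different spellings of one page compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"


def superfluous(indexed_sources, wanted):
    """Merge several lists of indexed URLs and return those not wanted."""
    indexed = set()
    for source in indexed_sources:
        indexed.update(normalize(u) for u in source)
    return sorted(indexed - {normalize(u) for u in wanted})


# Hypothetical sample data: two partial exports of indexed URLs.
scraped = ["http://Example.com/a", "http://example.com/old"]
other_export = ["http://example.com/old#top", "http://example.com/b"]
wanted_pages = ["http://example.com/a", "http://example.com/b"]

to_delist = superfluous([scraped, other_export], wanted_pages)
print(to_delist)
```

The resulting list could then feed the 410 responses and the dedicated "removal" sitemap described above.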
Thank you for the reference to the Panda post. I remember reading it before and will give it another go right now.
One final question: in your experience dealing with Panda penalties, have you seen scenarios where the delisting/penalizing of a site seems to have happened only for a particular Google ccTLD, or only for the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds This is what we are currently experiencing, and we are trying to find out whether other people have observed something similar.
Best /Thomas
-
If you're willing to piece together multiple sources, I can definitely give you some starting points:
(1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).
(2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:
site:example.com/blog
site:example.com/shop
site:example.com/uk
As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).
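As a rough sketch of the chunking idea above, the queries can be generated from a list of sections and URL parameters. The helper names, the example sections, and the parameter names are hypothetical; `filter=0` is the setting mentioned in (1). This only builds the query strings and URLs and does not fetch anything.

```python
from urllib.parse import urlencode


def site_queries(domain, sections=(), params=()):
    """Build one query per site section and per URL parameter."""
    queries = [f"site:{domain}"]
    queries += [f"site:{domain}/{s.strip('/')}" for s in sections]
    queries += [f"site:{domain} inurl:{p}" for p in params]
    return queries


def search_url(query):
    """Search URL for a query with omitted-results filtering off."""
    return "https://www.google.com/search?" + urlencode(
        {"q": query, "filter": 0}
    )


# Hypothetical sections and parameters for example.com.
for q in site_queries("example.com", ["blog", "shop", "uk"], ["page"]):
    print(q, "->", search_url(q))
```

Each chunk returns its own limited set of results, so the smaller the chunks, the more of the index you can see in total.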
(3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner count of what kind of pages are indexed, and you'll know where your gaps are.
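A minimal sketch of that sitemap split, assuming you already have a flat list of URLs. The `cluster_of` rule (first path segment, with single-segment pages grouped as "root") and the file names are hypothetical; on a real site you would swap in whatever rule separates major pages, categories, and products.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from urllib.parse import urlparse

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def cluster_of(url):
    """Hypothetical rule: cluster by first path segment; pages
    directly under the root go into the 'root' cluster."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if len(segments) > 1 else "root"


def build_sitemaps(urls):
    """Return a dict mapping sitemap file name to its XML content."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[cluster_of(url)].append(url)
    sitemaps = {}
    for name, members in clusters.items():
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in sorted(members):
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        sitemaps[f"sitemap-{name}.xml"] = ET.tostring(urlset, encoding="unicode")
    return sitemaps


# Hypothetical URL list.
sitemaps = build_sitemaps([
    "https://example.com/about",
    "https://example.com/shop/shoes",
    "https://example.com/blog/post-1",
])
for filename in sorted(sitemaps):
    print(filename)
```

Submitting each file separately lets you compare submitted versus indexed counts per cluster rather than for the site as a whole.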
(4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC only and harder to use. Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
-
Does anyone have any insight on this? If the answer is simply that there is no better approach than looking at the limited data available through the Google UI, that would be helpful to know as well.