What Sources to use to compile an as comprehensive list of pages indexed in Google?
-
As part of a Panda recovery initiative we are trying to get an as comprehensive list of currently URLs indexed by Google as possible.
Using the site:domain.com operator Google displays that approximately 21k pages are indexed. Scraping the results however ends after the listing of 240 links.
Are there any other sources we could be using to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool but sources that would be confidently allow us to determine a list of URLs currently hold in the Google index.
Thank you /Thomas
-
We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
-
Thanks Pete,
As always very much appreciate your input.
1/ We aren't using any parameters and when using the filter=0 we are getting the same results. For my just done test I was only able to pull 350 pages out of 18.5k pages using the web interface. If anyone has any other thoughts on this please let me now.
2/ That is a great idea. Most of our pages live in the root directory to keep the URL slugs short so unfortunately this one will not help us.
3/ Another good idea. I understand this approach is helpful to see your coverage of wanted pages in the Google index but won't be able to help you determine superfluous pages currently in the Google index unless I misunderstood you?
4/ We are using ScreamingFrog and I agree its a fantastic tool. The index size with ScreamingFrog is showing not more than 300 pages which is our final goal.
Overall we are seeing continuous yet small drops to the index size using our approach of returning 410 response codes for unwanted pages and dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
We are just trying to get a more complete list of whats currently in the index to speed up delisting.
Thank you for your reference to the Panda post I remember reading it before and will give it another go right now.
One final question, in your experience dealing with Panda penalties, have you seen scenarios where it seems the delisting/penalizing of a site has only happened for a particular CCTLD of google or just the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds It is what we are currently experiencing and trying to see if other people have observed something similar.
Best /Thomas
-
If you're willing to piece together multiple sources, I can definitely give you some starting points:
(1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).
(2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:
site:example.com/blog
site:example.com/shop
site:example.com/uk
As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).
(3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner could of what kind of pages are indexed, and you'll know where your gaps are.
(4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC only and harder to use. Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
-
Does anyone have any insight on this? If the answer is simply there is no better approach than look at the limited data available through the Google UI this would be helpful as well.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Do uncrawled but indexed pages affect seo?
It's a well known fact that too much thin content can hurt your SEO, but what about when you disallow google to crawl some places and it indexes some of them anyways (No title, no description, just the link) I am building a shopify store and it's imposible to change the robots.txt using shopify, and they disallow for example, the cart. Disallow: /cart But all my pages are linking there, so google has the uncrawled cart in it's index, along with many other uncrawled urls, can this hurt my SEO or trying to remove that from their index is just a waste of time? -I can't change anything from the robots.txt -I could try to nofollow those internal links What do you think?
Intermediate & Advanced SEO | | cuarto7150 -
Why isn't Google indexing this site?
Hello, Moz Community My client's site hasn't been indexed by Google, although it was launched a couple of months ago. I've ran down the check points in this article https://moz.com/ugc/8-reasons-why-your-site-might-not-get-indexed without finding a reason why. Any sharp SEO-eyes out there who can spot this quickly? The url is: http://www.oldermann.no/ Thank you
Intermediate & Advanced SEO | | Inevo
INEVO, digital agency0 -
Why Google isn't indexing my images?
Hello, on my fairly new website Worthminer.com I am noticing that Google is not indexing images from my sitemap. Already 560 images submitted and Google indexed only 3 of them. Altough there is more images indexed they are not indexing any new images, and I have no idea why. Posts, categories and other urls are indexing just fine, but images not. I am using Wordpress and for sitemaps Wordpress SEO by yoast. Am I missing something here? Why Google won't index my images? Thanks, I appreciate any help, David xv1GtwK.jpg
Intermediate & Advanced SEO | | Worthminer1 -
Drop in indexed pages!
Hi everybody! I've been working on http://thewilddeckcompany.co.uk/ for a little while now. Until recently, everything was great - good rankings for the key terms of 'bird hides' and 'pond dipping platforms'. However, rankings have tanked over the past few days. I can't point my finger at it yet, but a site:thewilddeckcompany.co.uk search shows only three pages have been indexed. There's only 10 on the site, and it was fine beforehand. Any advice would be much appreciated,
Intermediate & Advanced SEO | | Blink-SEO0 -
What may cause a page not to be indexed (be de-indexed)?
Hi All, I have a main category page, a landing page, that does not appear in the SERPS at all (even if I serach for a whole sentence from it). This page once ranked high. What may cause such a punishment for a specific page? Thanks
Intermediate & Advanced SEO | | BeytzNet0 -
How to find all indexed pages in Google?
Hi, We have an ecommerce site with around 4000 real pages. But our index count is at 47,000 pages in Google Webmaster Tools. How can I get a list of all pages indexed of our domain? trying to locate the duplicate content. Doing a "site:www.mydomain.com" only returns up to 676 results... Any ideas? Thanks, Ben
Intermediate & Advanced SEO | | bjs20100 -
Wordpress blog in a subdirectory not being indexed by Google
HI MozzersIn my websites sitemap.xml, pages are listed, such as /blog/ and /blog/textile-fact-or-fiction-egyptian-cotton-explained/These pages are visible when you visit them in a browser and when you use the Google Webmaster tool - Fetch as Google to view them (see attachment), however they aren't being indexed in Google, not even the root directory for the blog (/blog/) is being indexed, and when we query:site: www.hilden.co.uk/blog/ It returns 0 results in Google.Also note that:The Wordpress installation is located at /blog/ which is a subdirectory of the main root directory which is managed by Magento. I'm wondering if this causing the problem.Any help on this would be greatly appreciated!AnthonyToTOHuj.png?1
Intermediate & Advanced SEO | | Tone_Agency0 -
Help! Why did Google remove my images from their index?
I've been scratching my head over this one for a while now and I can't seem to figure it out. I own a website that is user-generated content. Users submit images to my sites of graphic resources (for designers) that they have created to share with our community. I've been noticing over the past few months that I'm getting completely dominated in Google Images. I used to get a ton of traffic from Google Images, but now I can't find my images anywhere. After diving into Analytics I found this: http://cl.ly/140L2d14040Q1R0W161e and realized sometime about a year ago my image traffic took a dive. We've gone back through all the change logs and can't find where we made any changes to the site structure that could have caused this. We are stumped. Does anyone know of any historical Google updates that could have caused this last year around the end of April 2010? Any help or insight would be greatly appreciated!
Intermediate & Advanced SEO | | shawn810