What sources should we use to compile a comprehensive list of the pages indexed in Google?
-
As part of a Panda recovery initiative, we are trying to compile as comprehensive a list as possible of the URLs currently indexed by Google.
Using the site:domain.com operator, Google reports that approximately 21k pages are indexed. Scraping the results, however, stops after about 240 links have been listed.
Are there any other sources we could be using to make the list more comprehensive? To be clear, we are not looking for external crawlers like the SEOmoz crawl tool, but for sources that would confidently allow us to determine the list of URLs currently held in the Google index.
Thank you /Thomas
-
We don't usually take private info in public questions, but if you want to, Private Message me the domain (via my profile). I'm really curious about (1) and I'd love to take a peek.
-
Thanks Pete,
As always very much appreciate your input.
1/ We aren't using any parameters, and when adding filter=0 we get the same results. In my most recent test I was only able to pull 350 pages out of 18.5k using the web interface. If anyone has any other thoughts on this, please let me know.
2/ That is a great idea. Most of our pages live in the root directory to keep the URL slugs short, so unfortunately this one will not help us.
3/ Another good idea. I understand this approach is helpful for seeing how well your wanted pages are covered in the Google index, but it won't help determine which superfluous pages are currently in the index, unless I misunderstood you?
4/ We are using Screaming Frog and I agree it's a fantastic tool. The Screaming Frog crawl shows no more than 300 pages, which is the index size we are ultimately aiming for.
Overall we are seeing continuous yet small drops in the index size using our approach of returning 410 response codes for unwanted pages and submitting dedicated sitemaps to speed up delisting. See http://www.seomoz.org/q/panda-recovery-what-is-the-best-way-to-shrink-your-index-and-make-google-aware
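For anyone following along, here is a minimal sketch of the 410 side of this (a Python/Flask app is assumed purely for illustration; our actual setup differs, and the retired_urls.txt file is hypothetical):

```python
# Minimal sketch: serve 410 Gone for URLs we want dropped from the index.
# Flask and the retired_urls.txt list are illustrative assumptions only.
from flask import Flask, abort, request

app = Flask(__name__)

# Paths we want Google to forget, one per line (hypothetical file).
with open("retired_urls.txt") as f:
    RETIRED_PATHS = {line.strip() for line in f if line.strip()}

@app.before_request
def gone_for_retired_pages():
    # 410 tells crawlers the page is intentionally gone, which tends to
    # drop URLs from the index faster than a plain 404 would.
    if request.path in RETIRED_PATHS:
        abort(410)
```

The dedicated sitemaps then list those same retired URLs so Google re-crawls them sooner and sees the 410s.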
We are just trying to get a more complete list of what's currently in the index to speed up delisting.
Thank you for the reference to the Panda post; I remember reading it before and will give it another read right now.
One final question: in your experience dealing with Panda penalties, have you seen scenarios where the delisting/penalizing of a site has only happened for a particular Google ccTLD, or only for the homepage? See http://www.seomoz.org/q/panda-penguin-penalty-not-global-but-only-firea-for-specific-google-cctlds This is what we are currently experiencing, and we are trying to find out whether other people have observed something similar.
Best /Thomas
-
If you're willing to piece together multiple sources, I can definitely give you some starting points:
(1) First, dropping from 21K pages indexed in Google to 240 definitely seems odd. Are you hitting omitted results? You may have to shut off filtering in the URL (&filter=0).
(2) You can also divide the site up logically and run "site:" on sub-folders, parameters, etc. Say, for example:
site:example.com/blog
site:example.com/shop
site:example.com/uk
As long as there's some logical structure, you can use it to break the index request down into smaller chunks. Don't forget to use inurl: for URL parameters (filters, pagination, etc.).
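If you want to automate those chunked checks, here is a rough sketch (the section list, headers, and result-count regex are illustrative assumptions; Google's markup changes often and it rate-limits or blocks automated queries, so treat the numbers as rough approximations — &filter=0 is included per point (1)):

```python
# Sketch: check approximate indexation per site section with site:/inurl: queries.
# Sections, headers, and the regex are assumptions; Google may block automated requests.
import re
import time
import urllib.parse
import requests

SECTIONS = [
    "site:example.com/blog",
    "site:example.com/shop",
    "site:example.com/uk",
    "site:example.com inurl:page=",   # a URL parameter, e.g. pagination
]

for query in SECTIONS:
    params = {
        "q": query,
        "filter": "0",   # turn off the omitted-results filtering
        "num": "100",
    }
    url = "https://www.google.com/search?" + urllib.parse.urlencode(params)
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30).text
    # "About 1,234 results" -- very rough, and the wording/markup can change.
    match = re.search(r"About ([\d,]+) results", html)
    count = match.group(1) if match else "unknown"
    print(f"{query}: ~{count} results")
    time.sleep(10)  # be gentle; heavy querying gets blocked quickly
```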
(3) This takes a while, but split up your XML sitemaps into logical clusters - say, one for major pages, one for top-level topics/categories, one for sub-categories, one for products. That way, you'll get a cleaner count of what kinds of pages are indexed, and you'll know where your gaps are.
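As a rough illustration of point (3), here's a small script that writes one sitemap per cluster plus a sitemap index (the cluster names and URL lists are placeholders):

```python
# Sketch: write one sitemap per logical cluster plus a sitemap index file.
# Cluster names and URL lists are placeholders.
from xml.sax.saxutils import escape

CLUSTERS = {
    "major-pages": ["https://example.com/", "https://example.com/about"],
    "categories": ["https://example.com/widgets/", "https://example.com/gadgets/"],
    "products": ["https://example.com/widgets/blue-widget"],
}

NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'

for name, urls in CLUSTERS.items():
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    with open(f"sitemap-{name}.xml", "w") as f:
        f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<urlset {NS}>\n{entries}\n</urlset>\n')

# The index file points Google at each cluster sitemap; Webmaster Tools then
# reports submitted vs. indexed per sitemap, which is where the gaps show up.
index_entries = "\n".join(
    f"  <sitemap><loc>https://example.com/sitemap-{name}.xml</loc></sitemap>"
    for name in CLUSTERS
)
with open("sitemap-index.xml", "w") as f:
    f.write(f'<?xml version="1.0" encoding="UTF-8"?>\n<sitemapindex {NS}>\n{index_entries}\n</sitemapindex>\n')
```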
(4) Run a desktop crawler on the site, like Xenu or Screaming Frog (Xenu is free, but PC-only and harder to use; Screaming Frog has a yearly fee, but it's an excellent tool). This won't necessarily tell you what Google has indexed, but it will help you see how your site is being crawled and where problems are occurring.
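Since the end goal is spotting superfluous indexed URLs, here's one last sketch that diffs a crawler export against whatever indexed-URL list you manage to assemble (the file names and one-URL-per-line format are assumptions):

```python
# Sketch: diff the URLs your crawler can reach against URLs you believe are indexed.
# File names and formats are assumptions (one URL per line).
def load_urls(path):
    with open(path) as f:
        return {line.strip().rstrip("/") for line in f if line.strip()}

crawled = load_urls("crawl_export.txt")    # e.g. exported from your desktop crawler
indexed = load_urls("indexed_urls.txt")    # assembled from site: queries, logs, etc.

# Indexed but not reachable by crawling: prime candidates for 410/noindex cleanup.
superfluous = indexed - crawled
# Crawlable but apparently not indexed: coverage gaps worth investigating.
missing = crawled - indexed

print(f"{len(superfluous)} indexed-but-not-crawlable URLs")
print(f"{len(missing)} crawlable-but-not-indexed URLs")
```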
I wrote a mega-post a while back on all the different kinds of duplicate content. Sometimes, just seeing examples can help you catch a problem you might be having. It's at:
http://www.seomoz.org/blog/duplicate-content-in-a-post-panda-world
-
Does anyone have any insight on this? If the answer is simply that there is no better approach than looking at the limited data available through the Google UI, that would be helpful to know as well.