Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies.
How can I make a list of all URLs indexed by Google?
-
I started working for this eCommerce site 2 months ago, and my SEO site audit revealed a massive spider trap.
The site should have been 3500-ish pages, but Google has over 30K pages in its index. I'm trying to find an effective way of making a list of all URLs indexed by Google.
Anyone?
(I basically want to build a sitemap with all the indexed spider trap URLs, then set up 301 redirects on those, then ping Google with the "defective" sitemap so they can see what the site really looks like and remove those URLs, shrinking the site back to around 3500 pages.)
-
If you can get a developer to create a list of all the pages Google has crawled within a date range, then you can use this Python script to check whether each page is indexed.
http://searchengineland.com/check-urls-indexed-google-using-python-259773
The script uses the info: search feature to check the URLs.
You will have to install Python, Tor and Polipo for this to work. It is quite technical, so if you aren't a technical person you may need help.
Depending on how many URLs you have and how long you decide to wait before checking each URL, it can take a few hours.
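To get a feel for "a few hours", here is a rough back-of-the-envelope sketch. It is not part of the linked script; the function name, the 5-second delay and the 1-second per-request time are all assumptions chosen for illustration.

```python
def estimated_hours(url_count: int, delay_seconds: float, request_seconds: float = 1.0) -> float:
    """Rough wall-clock time for checking URLs one at a time.

    request_seconds is an assumed average time per lookup, not a measured value.
    """
    return url_count * (delay_seconds + request_seconds) / 3600


# Hypothetical example: 3,500 URLs with a 5-second wait between checks.
print(round(estimated_hours(3500, 5), 1))  # 5.8
```

Longer waits are gentler on the search engine but scale the total time linearly, so it is worth checking only the URLs you actually suspect.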
-
Thanks for your input guys! I've almost landed on the following approach:
- Use this http://www.chrisains.com/seo-tools/extract-urls-from-web-serps/ to collect a number (3-600) of URLs based on the various problem URL-footprints.
- Make XML "problem sitemaps" based on above URLs
- Implement 301s
- Ping the search engines with the XML "problem sitemaps", so that these may discover changes and see what the site really looks like (ideally reducing the # of indexed pages by about 85%)
- Track SE traffic as well as index for each URL footprint once a week for 6-8 weeks and follow progress
- If progress is not satisfactory, then go the URL Profiler route.
Any thoughts before I go ahead?
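The "problem sitemap" step in the plan above could be sketched roughly like this. It is a minimal illustration using only Python's standard library, not a tested production tool; the function name and the example URLs are made up. The namespace and the 50,000-URL limit per file come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # limit per sitemap file in the sitemaps.org protocol


def build_sitemap(urls):
    """Return sitemap XML (UTF-8 bytes) listing the given URLs once each."""
    # Strip whitespace, drop blanks, and de-duplicate while keeping order.
    unique_urls = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    if len(unique_urls) > MAX_URLS_PER_SITEMAP:
        raise ValueError("Too many URLs for one sitemap; split into several files.")

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in unique_urls:
        # ElementTree escapes characters such as "&" in query strings.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical spider-trap URLs for one footprint.
example = build_sitemap([
    "https://www.example.com/shoes?color=red&sort=price",
    "https://www.example.com/shoes?color=red&sort=price",  # duplicate, dropped
    "https://www.example.com/shoes?color=blue&sort=price",
])
print(example.decode("utf-8"))
```

Building one file per URL footprint keeps the weekly tracking simple, since each sitemap then maps to exactly one problem pattern.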
-
URL Profiler will do this, as will the other recommended scraper tools.
-
URL Profiler might be worth checking out:
It does require that you use a proxy, since Google does not like you scraping their search results.
-
I'm sorry to confirm that Google does not want everyone to know what they have in their index. We as SEOs complain about that.
It's hard to believe that you couldn't get all your pages with a scraper (because it just searches and collects the SERPs).
-
I tried this and a few others: http://www.chrisains.com/seo-tools/extract-urls-from-web-serps/. This gave me about 500-1000 URLs at a time, but involved a lot of cutting and pasting back and forth.
I imagine there must be a much easier way of doing this...
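One way to take some of the pain out of the cut and paste is to merge the batches with a short script. The sketch below is only an illustration: the function name and the example.com URLs are hypothetical, and it assumes each batch is pasted as plain text with one URL per line.

```python
from urllib.parse import urlsplit


def merge_url_batches(batches, domain):
    """Combine pasted batches of URLs into one de-duplicated list.

    Keeps first-seen order and drops blank lines, malformed URLs,
    and URLs that are not on `domain` or one of its subdomains.
    """
    seen = set()
    merged = []
    for batch in batches:
        for line in batch.splitlines():
            url = line.strip()
            if not url:
                continue
            try:
                host = (urlsplit(url).hostname or "").lower()
            except ValueError:
                continue  # skip malformed URLs
            if host != domain and not host.endswith("." + domain):
                continue
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Two hypothetical batches pasted from a SERP extractor.
batch_a = "https://www.example.com/p?color=red\nhttps://other.com/x\n"
batch_b = "https://www.example.com/p?color=red\nhttps://example.com/q\n"
print(merge_url_batches([batch_a, batch_b], "example.com"))
```

The merged list can then be sorted or grouped by URL pattern to see which footprints account for most of the indexed pages.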
-
Well, there are some scrapers that might do that job.
To do it the right way you will need proxies and a scraper.
My recommendation is Gscraper or Scrapebox and a list of (at least) 10 proxies. Then just run a scrape with "site:mydomain.com" and see what you get.
(Before buying proxies or a scraper, check whether the free tools give you something like what you want.)
-
I used Screaming Frog to discover the spider trap (and more), but as far as I know, I cannot use Screaming Frog to import all URLs that Google actually has in its index (or can I?).
A list of URLs actually in Google's index is what I'm after.
-
Hi Sverre,
Have you tried Screaming Frog SEO Spider? Here is a link to it: https://www.screamingfrog.co.uk/seo-spider/
It's really helpful for crawling all the pages on your site that are accessible to spiders. You might need the premium version to crawl more than 500 pages.
Also, have you checked for the common duplicate page issues? Here is a Moz tutorial: https://moz.com/learn/seo/duplicate-content
Hope it helps.
GR.