Why do I have so many extra indexed pages?
-
Stats:
Webmaster Tools indexed pages: 96,995
site: operator search: 97,800 pages
Sitemap submitted: 18,832
Sitemap indexed: 9,746
I went through the search results up to page 28 and every item shown was correct. How do I figure out where these extra 80,000 items are coming from? I tried crawling the site with Screaming Frog a while back, but it locked up because of the number of URLs. The site is a Magento site, so there are a million URLs, but I checked and all of the canonicals are set up properly. Where should I start looking?
-
It ended up being my search results. I was able to use the site operator to break it down.
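For anyone wanting to do the same kind of breakdown, one approach is to run the site operator against one path prefix at a time and compare the result counts. Here is a minimal Python sketch that only builds the query strings. The path prefixes are assumptions for illustration (`/catalogsearch/` is Magento's default search results path; the others are placeholders), not the actual site's structure.

```python
def site_queries(host, prefixes):
    """Build one `site:` query per path prefix."""
    queries = []
    for prefix in prefixes:
        # Normalise so every prefix has exactly one leading slash.
        path = "/" + prefix.lstrip("/")
        queries.append(f"site:{host}{path}")
    return queries


# Hypothetical prefixes; replace with the site's real top-level paths.
PREFIXES = ["/catalogsearch/", "/catalog/", "/customer/", "/checkout/"]

for query in site_queries("www.site.com", PREFIXES):
    print(query)
```

Each printed line can be pasted into Google; a prefix that returns far more results than expected is a good place to start digging.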
-
To make sure Screaming Frog can handle the crawl, you could chunk up the site and crawl it in parts, e.g. one subdirectory at a time. This can be done within the 'Configuration' menu under 'Include'. There are loads of tutorials online.
You can also use 'Exclude' to make sure it doesn't crawl unnecessary pages, images or scripts. For example, on WordPress I often block wp-content.
It definitely sounds like a problem with query parameters being indexed, though, and it's often good to make sure these are addressed in Search Console.
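The include/exclude idea above can be tried out on a handful of URLs before starting a long crawl. This is a rough Python sketch, not Screaming Frog's own logic: the patterns and URLs are hypothetical, and it assumes each pattern has to match the whole URL, so check the crawler's documentation for its exact matching rules.

```python
import re

# Hypothetical patterns, written so that they match the whole URL.
INCLUDE = [r"https?://www\.site\.com/category-a/.*"]
EXCLUDE = [r".*\?.*", r".*/wp-content/.*"]


def should_crawl(url, include=INCLUDE, exclude=EXCLUDE):
    """Return True if the URL matches an include pattern and no exclude pattern."""
    # An empty include list means "everything is in scope".
    if include and not any(re.fullmatch(p, url) for p in include):
        return False
    return not any(re.fullmatch(p, url) for p in exclude)


for url in [
    "https://www.site.com/category-a/shoes.html",
    "https://www.site.com/category-a/shoes.html?color=red",
    "https://www.site.com/category-b/hats.html",
]:
    print(url, should_crawl(url))
```

Crawling one include pattern at a time keeps each crawl small enough to finish, and the exclude list keeps parameterised URLs out of it.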
-
1. Your first one is interesting. I actually haven't been in there before. There are 96 rows and every one of them is set to 'Let Googlebot decide'. Do you think I should change that?
2. I'm not sure how many images we have, but it is a lot. No, we do not have an image sitemap.
I tried Screaming Frog and it couldn't handle it. After about 1.5 million URLs it kept locking up. I just set up a free trial for DeepCrawl. It can only do 10,000 URLs, but I will see if it turns up anything worthwhile.
-
- Have you checked out the parameters settings in Google Search Console to find out how many pages Google has found for your site with the same parameters? That might give some insights on that side.
- How many images do you have across the site? Do you have image sitemaps for these kinds of pages?
What I would advise, and what you've already been trying, is to get a full crawl using either Screaming Frog or DeepCrawl. This will give you better insight into how many pages a search engine can really find.
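Once a crawl (even a partial one) has produced a list of URLs, a quick way to see which parameters generate the most pages is to count parameter names across that list. This is a minimal Python sketch using only the standard library; the sample URLs are made up for illustration and would be replaced by the crawler's export.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def count_parameters(urls):
    """Count how many URLs contain each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True so that "?color=" is still counted.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


# Hypothetical sample; in practice, load the URL column of a crawl export.
SAMPLE = [
    "http://www.site.com/shoes.html?color=red&count=20",
    "http://www.site.com/shoes.html?count=40",
    "http://www.site.com/catalogsearch/result/?q=boots",
]

for name, total in count_parameters(SAMPLE).most_common():
    print(name, total)
```

The parameters at the top of the list are the ones worth checking first with an 'inurl:' query or in the Search Console parameter settings.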
-
I wouldn't say it is doing fine. Before I started they launched a new site and messed up the 301 redirects. Traffic hasn't recovered yet.
For robots.txt I am using the Inchoo example file (http://inchoo.net/ecommerce/ultimate-magento-robots-txt-file-examples/). Maybe it is a parameters issue, but I can't figure out how to see all my indexed pages.
I tried doing a search for both inurl:= site:www.site.com and inurl:? site:www.site.com and nothing showed up unless I am missing something.
I can't figure out how to check whether some of the canonicalized URLs are indexed. The pages are all identical, though.
We have fewer than 100 out-of-stock items.
-
As long as your organic traffic is doing fine, I wouldn't be too concerned. That being said:
- Is your robots.txt or search console disallowing crawler access to parameters like '?count=' or '?color='?
- Is your robots.txt disallowing crawler access to URLs that carry a 'noindex' but were indexed before the noindex was added? If so, Google can't recrawl them to see the noindex.
- You can also take a couple of parameters from your site and test whether any URLs have been indexed, using the 'inurl:parameter site:www.site.com' query.
- Are some of the canonicalized URLs indexed anyway? This may indicate that the page content is different enough for Google to index both versions.
- If there are a ton of articles that go in and out of stock and use dynamic IDs, Google may keep these in its index. Do out-of-stock articles return a 404 or are they kept alive?