Moz Q&A is closed.
After more than 13 years, and tens of thousands of questions, Moz Q&A closed on 12th December 2024. Whilst we’re not completely removing the content - many posts will still be possible to view - we have locked both new posts and new replies. More details here.
Crawled page count in Search console
-
Hi Guys,
I'm working on a project (premium-hookahs.nl) where I stumble upon a situation I can’t address. Attached is a screenshot of the crawled pages in Search Console.
History:
Doing to technical difficulties this webshop didn’t always no index filterpages resulting in thousands of duplicated pages. In reality this webshops has less than 1000 individual pages. At this point we took the following steps to result this:
- Noindex filterpages.
- Exclude those filterspages in Search Console and robots.txt.
- Canonical the filterpages to the relevant categoriepages.
This however didn’t result in Google crawling less pages. Although the implementation wasn’t always sound (technical problems during updates) I’m sure this setup has been the same for the last two weeks. Personally I expected a drop of crawled pages but they are still sky high. Can’t imagine Google visits this site 40 times a day.
To complicate the situation:
We’re running an experiment to gain positions on around 250 long term searches. A few filters will be indexed (size, color, number of hoses and flavors) and three of them can be combined. This results in around 250 extra pages. Meta titles, descriptions, h1 and texts are unique as well.
Questions:
- - Excluding in robots.txt should result in Google not crawling those pages right?
- - Is this number of crawled pages normal for a website with around 1000 unique pages?
- - What am I missing?
-
Ben,
I doubt that crawlers are going to access the robots.txt file for each request, but they still have to validate any url they find against the list of the blocked ones.
Glad to help,
Don
-
Hi Don,
Thanks for the clear explanation. I always though disallow in robots.txt would give a sort of map to Google (at the start of a site crawl) with the pages on the site that shouldn’t be crawled. So he therefore didn’t have to “check the locked cars”.
If I understand you correctly, google checks the robots.txt with every single page load?
That could definitely explain high number of crawled pages per day.
Thanks a lot!
-
Hi Bob,
About the nofollow vs blocked. In the end I suppose you have the same results, but in practice it works a little differently. When you nofollow a link it tells the crawler as soon as it encounters the link not to request or follow that link path. When you block it via robots the crawler still attempts to access the url only to find it not accessible.
Imagine if I said go to the parking lot and collect all the loose change in all the unlocked cars. Now imagine how much easier that task would be if all the locked cars had a sign in the window that said "Locked", you could easily ignore the locked cars and go directly to the unlocked ones. Without the sign you would have to physically go check each car to see if it will open.
About link juice, if you have a link, juice will be passed regardless of the type of link. (You used to be able to use nofollow to preserve link juice but no longer). This is bit unfortunate for sites that use search filters because they are such a valuable tool for the users.
Don
-
Hi Don,
You're right about the sitemap, noted it on the to do list!
Your point about nofollow is intersting. Isn't excluding in robots.txt giving the same result?
Before we went on with the robots.txt we didn't implant nofollow because we didn't want any linkjuice to pass away. Since we got robots.txt I assume this doesn’t matter anymore since Google won’t crawl those pages anyway.
Best regards,
Bob
-
Hi Bob,
You can "suggest" a crawl rate to Google by logging into your webmasters tools on Google and adjusting it there.
As for indexing pages.. I looked at your robots and site. It really looks like you need to employ some No Follow on some of your internal linking, specifically on the product page filters, that alone could reduce the total number of URLS that the crawlers even attempts to look at.
Additionally your sitemap http://premium-hookahs.nl/sitemap.xml shows a change frequency of daily, and probably should be broken out between Pages / Images so you end up using two sitemaps one for images and one for pages. You may also want to review what is in there. Using ScreamingFrog (free) the sitemap I made (link) only shows about 100 urls.
Hope it helps,
Don
-
Hi Don,
Just wanted to add a quick note: your input made go through the indexation state of the website again which was worse than I through it was. I will take some steps to get this resolved, thanks!
Would love to hear your input about the number of crawled pages.
Best regards,
Bob
-
Hello Don,
Thanks for your advice. What would your advice be if the main goal would be the reduction of crawled pages per day? I think we got the right pages in the index and the old duplicates are mostly deindexed. At this point I’m mostly worried about Google spending it’s crawlbudget on the right pages. Somehow it still crawls 40.000 pages per day while we only got around 1000 pages that should be crawled. Looking at the current setup (with almost everything excluded though robots.txt) I can’t think of pages it does crawl to reach the 40k. And 40 times a day sounds like way to many crawled pages for a normal webshop.
Hope to hear from you!
-
Hello Bob,
Here is some food for thought. If you disallow a page in Robots.txt, google for example will not crawl that page. That does not however mean they will remove it from the index if it had previously been crawled. It simply treats it as inaccessible and moves on. It will take some time, months before Google finally says, we have no fresh crawls of page x, its time to remove it from the index.
On the other hand if you specifically allow Google to crawl those pages and show a no-index tag on it, Google now has a new directive it can act upon immediately.
So my evaluation of the situation would be to do 1 of 2 things.
1. Remove the disallow from robots and allow Google to crawl the pages again. However, this time use no-index, no-follow tags.
2. Remove the disallow from robots and allow Google to crawl the pages again, but use canonical tags to the main "filter" page to prevent further indexing the specific filter pages.
Which option is best depends on the amount of urls being indexed, a few thousand canonical would be my choice. A few hundred thousand, then no index would make more sense.
Whichever option, you will have to insure Google re-crawls, and then allow them time to re-index appropriately. Not a quick fix, but a fix none the less.
My thoughts and I hope it makes sense,
Don
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Does anyone know how to fix this structured data error on search console? Invalid value in field "itemtype"
I'm getting the same structured data error on search console form most of my websites, Invalid value in field "itemtype" I take off all the structured data but still having this problem, according to Search console is a syntax problem but I can't find what is causing this. Any guess, suggestion or solution for this?
Intermediate & Advanced SEO | | Alexanders0 -
URL structure - Page Path vs No Page Path
We are currently re building our URL structure for eccomerce websites. We have seen a lot of site removing the page path on product pages e.g. https://www.theiconic.co.nz/liberty-beach-blossom-shirt-680193.html versus what would normally be https://www.theiconic.co.nz/womens-clothing-tops/liberty-beach-blossom-shirt-680193.html Should we be removing the site page path for a product page to keep the url shorter or should we keep it? I can see that we would loose the hierarchy juice to a product page but not sure what is the right thing to do.
Intermediate & Advanced SEO | | Ashcastle0 -
Google Adsbot crawling order confirmation pages?
Hi, We have had roughly 1000+ requests per 24 hours from Google-adsbot to our confirmation pages. This generates an error as the confirmation page cannot be viewed after closing or by anyone who didn't complete the order. How is google-adsbot finding pages to crawl that are not linked to anywhere on the site, in the sitemap or linked to anywhere else? Is there any harm in a google crawler receiving a higher percentage of errors - even though the pages are not supposed to be requested. Is there anything we can do to prevent the errors for the benefit of our network team and what are the possible risks of any measures we can take? This bot seems to be for evaluating the quality of landing pages used in for Adwords so why is it trying to access confirmation pages when they have not been set for any of our adverts? We included "Disallow: /confirmation" in the robots.txt but it has continued to request these pages, generating a 403 page and an error in the log files so it seems Adsbot doesn't follow robots.txt. Thanks in advance for any help, Sam
Intermediate & Advanced SEO | | seoeuroflorist0 -
Multiple pages optimised for the same keywords but pages are functionally different and visually different
Hi MOZ community! We're wondering what the implications would be on organic ranking by having 2 pages, which have quite different functionality were optimised for the same keywords. So, for example, one of the pages in question is
Intermediate & Advanced SEO | | TrueluxGroup
https://www.whichledlight.com/categories/led-spotlights
and the other page is
https://www.whichledlight.com/t/led-spotlights both of these pages are basically geared towards the keyword led spotlights the first link essentially shows the options for led spotlights, the different kind of fittings available, and the second link is a product search / results page for all products that are spotlights. We're wondering what the implications of this could be, as we are currently looking to improve the ranking for the site particularly for this keyword. Is this even safe to do? Especially since we're at the bottom of the hill of climbing the ranking ladder of this keyword. Give us a shout if you want any more detail on this to answer more easily 🙂0 -
Google indexing only 1 page out of 2 similar pages made for different cities
We have created two category pages, in which we are showing products which could be delivered in separate cities. Both pages are related to cake delivery in that city. But out of these two category pages only 1 got indexed in google and other has not. Its been around 1 month but still only Bangalore category page got indexed. We have submitted sitemap and google is not giving any crawl error. We have also submitted for indexing from "Fetch as google" option in webmasters. www.winni.in/c/4/cakes (Indexed - Bangalore page - http://www.winni.in/sitemap/sitemap_blr_cakes.xml) 2. http://www.winni.in/hyderabad/cakes/c/4 (Not indexed - Hyderabad page - http://www.winni.in/sitemap/sitemap_hyd_cakes.xml) I tried searching for "hyderabad site:www.winni.in" in google but there also http://www.winni.in/hyderabad/cakes/c/4 this link is not coming, instead of this only www.winni.in/c/4/cakes is coming. Can anyone please let me know what could be the possible issue with this?
Intermediate & Advanced SEO | | abhihan0 -
How Do You Remove Video Thumbnails From Google Search Result Pages?
This is going to be a long question, but, in a nutshell, I am asking if anyone knows how to remove video thumbnails from Google's search result pages? We have had video thumbnails show up next to many of our organic listings in Google's search result pages for several months. To be clear, these are organic listings for our site, not results from performing a video search. When you click on the thumbnail or our listing title, you go to the same page on our site - a list of products or the product page. Although it was initially believed that these thumbnails drew the eye to our listings and that we would receive more traffic, we are actually seeing severe year over year declines in traffic to our category pages with thumbnails vs. category pages without thumbnails (where average rank remained relatively constant). We believe this decline is due to several things: An old date stamp that makes our listing look outdated (despite the fact that we can prove Google has spidered and updated their cache of these pages as recent as 2 days ago). We have no idea where Google is getting this datestamp from. An unrelated thumbnail to the page title, etc. - sometimes a picture of a man's face when the category is for women's handbags A difference in intent - user intends to shop or browse, not watch a video. They skip our listing because it looks like a video even though both the thumbnail and our listing click through to a category page of products. So we want to remove these video thumbnails from Google's search results without removing our pages from the index. Does anyone know how to do this? We believed that this connection between category page and video was happening in our video sitemap. We have removed all reference to video and category pages in the sitemap. After making this change and resubmitting the sitemap in Webmaster Tools, we have not seen any changes in the search results (it's been over 2 weeks). I've been reading and it appears many believe that Google can identify video embedded in pages. That makes sense. We can certainly remove videos from our category pages to truly remove the connection between category page URL and video thumbnail. However, I don't believe this is enough because in some cases you can find video thumbnails next to listings where the page has not had a video thumbnail in months (example: search for "leather handbags" and find www.ebags.com/category/handbags/m/leather - that video does not exist on that page and has not for months. Similarly, do a search for "handbags" and find www.ebags.com/department/handbags. That video has not been on that page since 2010. Any ideas?
Intermediate & Advanced SEO | | SharieBags0 -
Best practice for removing indexed internal search pages from Google?
Hi Mozzers I know that it’s best practice to block Google from indexing internal search pages, but what’s best practice when “the damage is done”? I have a project where a substantial part of our visitors and income lands on an internal search page, because Google has indexed them (about 3 %). I would like to block Google from indexing the search pages via the meta noindex,follow tag because: Google Guidelines: “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don't add much value for users coming from search engines.” http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769 Bad user experience The search pages are (probably) stealing rankings from our real landing pages Webmaster Notification: “Googlebot found an extremely high number of URLs on your site” with links to our internal search results I want to use the meta tag to keep the link juice flowing. Do you recommend using the robots.txt instead? If yes, why? Should we just go dark on the internal search pages, or how shall we proceed with blocking them? I’m looking forward to your answer! Edit: Google have currently indexed several million of our internal search pages.
Intermediate & Advanced SEO | | HrThomsen0 -
Should the sitemap include just menu pages or all pages site wide?
I have a Drupal site that utilizes Solr, with 10 menu pages and about 4,000 pages of content. Redoing a few things and we'll need to revamp the sitemap. Typically I'd jam all pages into a single sitemap and that's it, but post-Panda, should I do anything different?
Intermediate & Advanced SEO | | EricPacifico0