Working out exactly how Google is crawling my site when I have lots of pages
-
I am trying to work out exactly how Google is crawling my site, including its entry points and the path it takes from there. The site has millions of pages, with hundreds of thousands indexed. I have simple log files with a timestamp and the URL that Googlebot requested. Unfortunately there are hundreds of thousands of entries even for a single day, and because the site is so large I am finding it hard to work out the spider's paths. Is there any simple way to work this out using the log files and Excel or other tools?
Also, I was expecting the bot to go through each level almost instantaneously, e.g. main page --> category page --> subcategory page (with the same timestamp), but this does not appear to be the case. Does the bot follow a path right through to the deepest level it can reach or is allowed to reach for that crawl, and then return to the higher-level category pages at a later time? Any help would be appreciated.
Cheers
-
Can you explain how you built your sitemap for this, please?
-
I've run into the same issue on a site with 40k+ pages - far from your overall page count, but the flow may well be the same.
The site I was working on had a structure about 5 levels deep. Some of the areas in the last level were out of reach and didn't get indexed. More than that, even a few areas on level 2 were missing from the Google index, and Googlebot didn't visit those either.
I created a large XML sitemap and a dynamic HTML sitemap with all the pages from the site and submitted the XML sitemap via Webmaster Tools, but that didn't solve the issue: the same areas stayed out of the index and didn't get hit. The huge HTML sitemap was impossible to follow from a user's point of view, so I didn't keep it online for long, and I doubt it would have worked that way either.
What finally solved the issue was to spot the exact areas that were left out and identify the "head" of those areas - that is, the few pages that acted as gateways for the entire module. I then built a few outside links that pointed directly to those gateway pages, and a few that pointed to the main internal pages of the modules that were left out.
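One way to spot the areas that were left out is to compare the URLs in the sitemap against the URLs Googlebot actually requested, grouped by top-level section. The following is only a minimal sketch: the function names `section` and `coverage` and the example URLs are made up for illustration, and it assumes both lists are already loaded as plain URL strings.

```python
from collections import Counter
from urllib.parse import urlparse

def section(url, depth=1):
    """Return the first `depth` path segments of a URL, e.g. '/shoes'."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return "/" + "/".join(parts[:depth])

def coverage(sitemap_urls, crawled_urls, depth=1):
    """Map each section to (pages in sitemap, pages Googlebot requested)."""
    # Compare on the path only, so logs that store '/a/1' still match
    # sitemap entries that store 'http://example.com/a/1'.
    crawled_paths = {urlparse(u).path for u in crawled_urls}
    total = Counter()
    hit = Counter()
    for url in sitemap_urls:
        sec = section(url, depth)
        total[sec] += 1
        if urlparse(url).path in crawled_paths:
            hit[sec] += 1
    return {sec: (total[sec], hit[sec]) for sec in total}

# Hypothetical data: section /b is in the sitemap but never crawled.
sitemap = [
    "http://example.com/a/1",
    "http://example.com/a/2",
    "http://example.com/b/1",
    "http://example.com/b/2",
]
crawled = ["/a/1"]

report = coverage(sitemap, crawled)
missed = [sec for sec, (pages, hits) in report.items() if hits == 0]
print(report)
print(missed)
```

Sections with zero hits are candidates for the gateway-page treatment described above; raising `depth` narrows the report down to sub-areas.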
Those pages gained authority fast, and within a few days we saw Googlebot crawling through the night.
All pages are now indexed and even ranking well.
If you can spot some entry pages that can lead the spider to the rest, you can try this approach - it should work for you too.
As for the links, I started with social network links, a few posts on the site's blog linking to those pages (so internal links), and only a couple of outside links - articles with in-content links to those pages. Overall we are talking about 20-25 social network links (Twitter, Facebook, Digg, StumbleUpon and Delicious), about 10 blog posts published over a 2-3 day span, and about 10 articles on outside sources.
Since you have far more pages, you will probably need more gateways, and that means more links - but overall it isn't very time consuming and it can solve your issue... hopefully.
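On the original question of reading hundreds of thousands of log entries: rather than tracing individual paths, it can help to summarise the log by hour and by URL depth, which shows whether deep pages are fetched in the same window as the category pages above them. This is a rough sketch only. It assumes each log line looks like `2013-01-05 10:15:02 /shoes/boots/item-1` (date, time, URL separated by spaces), which may not match your log format, and the function names are made up for illustration.

```python
from collections import Counter
from datetime import datetime
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments: '/' is 0, '/shoes/' is 1, and so on."""
    return len([p for p in urlparse(url).path.split("/") if p])

def hits_by_hour_and_depth(lines):
    """Count Googlebot requests per (hour, URL depth) pair."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumed layout: fields[0] is the date, fields[1] the time,
        # fields[2] the requested URL.
        stamp = datetime.strptime(fields[0] + " " + fields[1],
                                  "%Y-%m-%d %H:%M:%S")
        hour = stamp.strftime("%Y-%m-%d %H:00")
        counts[(hour, url_depth(fields[2]))] += 1
    return counts

# Hypothetical sample; in practice pass an open log file instead.
sample = [
    "2013-01-05 10:15:02 /",
    "2013-01-05 10:15:40 /shoes/",
    "2013-01-05 10:59:59 /shoes/boots/item-1",
    "2013-01-05 11:02:10 /shoes/boots/item-2",
]

for (hour, depth), n in sorted(hits_by_hour_and_depth(sample).items()):
    print(hour, "depth", depth, "hits", n)
```

The resulting counts are small enough to paste into Excel and pivot by hour and depth, even when the raw log is far too large to open there.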