How Does Google's "index" find the location of pages in the "page directory" to return?
-
This is my understanding of how Google's search works, and I am unsure about one thing in specific:
- Google continuously crawls websites and stores each page it finds (let's call it "page directory")
- Google's "page directory" is a cache so it isn't the "live" version of the page
- Google has separate storage called "the index" which contains all the keywords searched. These keywords in "the index" point to the pages in the "page directory" that contain the same keywords.
- When someone searches a keyword, that keyword is accessed in the "index" and returns all relevant pages in the "page directory"
- These returned pages are given ranks based on the algorithm
The one part I'm unsure of is how Google's "index" knows the location of relevant pages in the "page directory". The keyword entries in the "index" point to the "page directory" somehow. I'm thinking each page has a url in the "page directory", and the entries in the "index" contain these urls. Since Google's "page directory" is a cache, would the urls be the same as the live website (and would the keywords in the "index" point to these urls)?
For example if webpage is found at wwww.website.com/page1, would the "page directory" store this page under that url in Google's cache?
The reason I want to discuss this is to know the effects of changing a pages url by understanding how the search process works better.
-
Yeah that makes sense. I also have a lot of experience with databases and the back ends of websites so I know your language.
I'm wondering how Google correlates the url with the page entries then. Maybe each page entry would have a url field so Google knows the location of the live version to constantly update that entry in the "page directory" database?
-
That is a question that no one here can answer. We cant speak for how Google does things internally.
but.... as a web / database programmer for 14+ years let me tell you how its "generally" done
Usually when you have to link to separate sets of data together (ie. database or tables) there is usually a unique_id created to link them which usually is never changed. So when a new record is created that record will live with that ID for its life, also known as a (unique identifier which tends to be an auto-incremented number that is dynamically generated and can not be repeated).
Since records tend to be linked this way, any other fields that exist in the record (firstName, lastName, Url, blah blah) then can be changed without the original ID being disturbed.
So to answer your question from my experience I would assume Google links from a unique identifier of some sort and not the URL directly.
Hope I didn't lose you, its my favorite subject...but no one here speaks that language to much
-
That makes sense, thanks for getting back to me so fast!
Perhaps you can help answer my next question. I have a client who used to host his domain at "www.oldurl.com", and has migrated his website to "www.newurl.com". He wants to use his old domain "www.oldurl.com", so he setup forwarding/masking so that when someone tries to access "www.oldurl.com" they are forwarded to "www.newurl.com" but the url shown to the user is "www.oldurl.com".
My client want his old url "www.oldurl.com" to be ranked in Google, but from what I understand his new url will be ranked. I know masking is really bad for SEO, and I want to educate my client as to why on the technical side. I have read Google see's all the content as duplicate with masking. Do you know the details as to why?
-
Hey Cesar,
Thanks for the links! Really useful info there.
Unfortunately they I couldn't find the answer I was looking for so I'll be more specific in what I'm asking.
From what I understand Google uses two database systems. One contains keywords and the other contains cached pages. How does a keyword entry point to a page entry? Does it use a unique id number, or does it use the url that page is using in the "live" vesion on the web?
-
Just because you create a new page and delete the old one, Google won't know immediately about it. So if Google crawls the new page before it's had a chance to crawl the old one, then it will indeed consider the new page to be duplicate content. Then when it tries to crawl the old page, it will discover that it no longer exists. However, as long as links to the old page exist, it will continue to try to crawl that page. Eventually it may de-index the old page if it keeps returning an error.
Bottom line, if you are moving content to a new URL, be sure to include a 301 redirect on the old page so that Google (and other search engines) know that the piece of content has moved. You can also do this with canonical tags, but 301s are more effective.
-
Thanks for the response and links Takeshi. Maybe I can rephrase the question to be more clear. Let's say a piece of content (or page) is at the url "www.oldurl.com/page". During a migration this same piece of content now at the url "www.newurl.com/page". The "www.oldurl.com" doesn't exist anymore so there isn't duplicate content in the live web.
Would Google create a new entry in it's "page directory" (what is the industry standard name for this directory?) and give it the url "www.newurl.com/page"?
If it does create a new entry, would Google keep the old entry "www.oldurl.com/page" although the old url doesn't exist in the "live" web anymore?
-
Wow you just asked questions that would require about 10,000,000,000 answers
Lets start here
- Video from the man himself Mr. Matt Cutts - Matt Cutts (Works for Google)
- Great Web 2.0 Page create from Google themself - (Google Them self)
- Older but still relevant description about how "backlinks" affect PR - (Google Them self)
-
This a pretty confusing question, and the terminology you use is different from industry standard. Check out these links for a quick overview of how Google works:
- http://www.google.com/insidesearch/howsearchworks/thestory/
- http://www.googleguide.com/google_works.html
If you are just worried about changing a page's url, just be sure to put in a 301 redirect from the old page to the new page. That way, even if Google has an older version of the page indexed, it will automatically redirect the user to the new page as well as help Google discover the new location of the page.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Home Pages of Several Websites are disappearing / reappearing in Google Index
Hi, I periodically use the Google site command to confirm that our client's websites are fully indexed. Over the past few months I have noticed a very strange phenomenon which is happening for a small subset of our client's websites... basically the home page keeps disappearing and reappearing in the Google index every few days. This is isolated to a few of our client's websites and I have also noticed that it is happening for some of our client's competitor's websites (over which we have absolutely no control). In the past I have been led to believe that the absence of the home page in the index could imply a penalty of some sort. This does not seem to be the case since these sites continue to rank the same in various Google searches regardless of whether or not the home page is listed in the index. Below are some examples of sites of our clients where the home page is currently not indexed - although they may be indexed by the time you read this and try it yourself. Note that most of our clients are in Canada. My questions are: 1. has anyone else experienced/noticed this? 2. any thoughts on whether this could imply some sort of penalty? or could it just be a bug in Google? 3. does Google offer a way to report stuff like this? Note that we have been building websites for over 10 years so we have long been aware of issues like www vs. non-www, canonicalization, and meta content="noindex" (been there done that in 2005). I could be wrong but I do not believe that the site would keep disappearing and reappearing if something like this was the issue. Please feel free to scrutinize the home pages to see if I have overlooked something obvious - I AM getting old. site:dietrichlaw.ca - this site has continually ranked in the top 3 for [kitchener personal injury lawyers] for many years. site:burntucker.com - since we took over this site last year it has moved up to page 1 for [ottawa personal injury lawyers] site:bolandhowe.com - #1 for [aurora personal injury lawyers] site:imranlaw.ca - continually ranked in the top 3 for [mississauga immigration lawyers]. site:canadaenergy.ca - ranks #3 for [ontario hydro plans] Thanks in advance! Jim Donovan, President www.wethinksolutions.com
Technical SEO | | wethink0 -
What's Moz's Strategy behind their blog main categories?
I've only just noticed that the Moz' blog categories have been moved within a pull down menu. See it underneath : 'Explore Posts by Category' on any blog page. This means that the whole list of categories under that pull-down is not crawlable by bots, and therefore no link-juice flows down onto those category pages. I imagine that the main drive behind that move is to sculpt page rank so that the business/money pages or areas of the website get greater link equity as opposed to just wasting it all throwing it down to the many categories ? it'd be good to hear about more from Rand or anyone in his team as to how they came onto engineering this and why. One of the things I wonder is: with the sheer amount of content that Moz produces, is it possible to contemplate an effective technical architecture such as that? I know they do a great job at interlinking content from one post onto another, so effectively one can argue that that kind of supersedes the need for hierarchical page rank distribution via categories... but I wonder : "is it working better this way vs having crawlable blog category links on the blog section? have they performed tests" some insights or further info on this from Moz would be very welcome. thanks in advance
Technical SEO | | carralon
David0 -
Are image pages considered 'thin' content pages?
I am currently doing a site audit. The total number of pages on the website are around 400... 187 of them are image pages and coming up as 'zero' word count in Screaming Frog report. I needed to know if they will be considered 'thin' content by search engines? Should I include them as an issue? An answer would be most appreciated.
Technical SEO | | MTalhaImtiaz0 -
Help! Pages not being indexed
Hi Mozzers, I need your help.
Technical SEO | | bshanahan
Our website (www.barnettcapitaladvisors.com) stopped being indexed in search engines following a round of major changes to URLs and content. There were a number of dead links for a few days before 301 redirects were properly put in place. And now, only 3 pages show up in bing when I do the search "site:barnettcapitaladvisors.com". A bunch of pages show up in Google for that search, but they're not any of the pages we want to show up. Our home page and most important services pages are nowhere in search results. What's going on here?
Our sitemap is at http://www.barnettcapitaladvisors.com/sites/default/files/users/AndrewCarrillo/sitemap/sitemap.xml
Robots.txt is at: http://www.barnettcapitaladvisors.com/robots.txt Thanks!0 -
Lots of Pages Dropped Out of Google's Index?
Until yesterday, my website had about 1200 pages indexed in Google. I did lots of changes: removed low quality content, rewrote passable content to make it better, wrote high quality content, got lots of likes and shares on social networks, etc. Now this morning I see that out of 1252 pages submitted, only 691 are indexed. Is that a temporary situation related to the recent updates? Anyone seeing this? What should I interpret about this?
Technical SEO | | sbrault740 -
132 pages reported as having Duplicate Page Content but I'm not sure where to go to fix the problems?
I am seeing “Duplicate Page Content” coming up in our
Technical SEO | | danatanseo
reports on SEOMOZ.org Here’s an example: http://www.ccisolutions.com/StoreFront/product/williams-sound-ppa-r35-e http://www.ccisolutions.com/StoreFront/product/aphex-230-master-voice-channel-processor http://www.ccisolutions.com/StoreFront/product/AT-AE4100.prod These three pages are for completely unrelated products.
They are returning “200” status codes, but are being identified as having
duplicate page content. It appears these are all going to the home page, but it’s
an odd version of the home page because there’s no title. I would understand if these pages 301-redirected to the home page if they were obsolete products, but it's not a 301-redirect. The referring page is
listed as: http://www.ccisolutions.com/StoreFront/category/cd-duplicators None of the 3 links in question appear anywhere on that page. It's puzzling. We have 132 of these. Can anyone help me figure out
why this is happening and how best to fix it? Thanks!0 -
Moz Crawl Reporting Duplicate content on "template" styled pages
We have a lot of detail pages on our site that reference specific scholarships. Each page has a different Title and Description. They also have unique information all regarding the same data points. The pages are displayed in a similar structure to the user so the data is easy to read. My problem is a lot of these pages are being reported as duplicate content when they certainly are not. Most of them are reported as duplicates when they have the same sponsor. They may have the same contact information listed. These two are being reported as duplicate of each other. They share some data but they are definitely different scholarships. http://www.collegexpress.com/scholarships/adelaide-mcclelland-garden-club-scholarship/9254/ http://www.collegexpress.com/scholarships/mary-wannamaker-witt-and-lee-hampton-witt-memorial-scholarship/10785/ Would it help to add a Canonical for each page to themselves? Any other suggestions would be great. Thanks
Technical SEO | | GeorgeLaRochelle0 -
Additional product information: the product's sales page or a blog post?
I want to go in-depth about different customizations for custom caps, which is one of the products we offer. I just don't know whether it would be better--from an SEO perspective--to expand the caps sales page we already have or to write a blog post to give the site another valuable indexed page. From a user standpoint, I don't think it's as important, because if I do it the blog way, I can't just put a link on the page saying, Want more customizations? Visit our blog post. Any opinions?
Technical SEO | | UnderRugSwept1