How Does Google's "index" find the location of pages in the "page directory" to return?
-
This is my understanding of how Google's search works, and I am unsure about one thing in specific:
- Google continuously crawls websites and stores each page it finds (let's call it "page directory")
- Google's "page directory" is a cache so it isn't the "live" version of the page
- Google has separate storage called "the index" which contains all the keywords searched. These keywords in "the index" point to the pages in the "page directory" that contain the same keywords.
- When someone searches a keyword, that keyword is accessed in the "index" and returns all relevant pages in the "page directory"
- These returned pages are given ranks based on the algorithm
The one part I'm unsure of is how Google's "index" knows the location of relevant pages in the "page directory". The keyword entries in the "index" point to the "page directory" somehow. I'm thinking each page has a url in the "page directory", and the entries in the "index" contain these urls. Since Google's "page directory" is a cache, would the urls be the same as the live website (and would the keywords in the "index" point to these urls)?
For example if webpage is found at wwww.website.com/page1, would the "page directory" store this page under that url in Google's cache?
The reason I want to discuss this is to know the effects of changing a pages url by understanding how the search process works better.
-
Yeah that makes sense. I also have a lot of experience with databases and the back ends of websites so I know your language.
I'm wondering how Google correlates the url with the page entries then. Maybe each page entry would have a url field so Google knows the location of the live version to constantly update that entry in the "page directory" database?
-
That is a question that no one here can answer. We cant speak for how Google does things internally.
but.... as a web / database programmer for 14+ years let me tell you how its "generally" done
Usually when you have to link to separate sets of data together (ie. database or tables) there is usually a unique_id created to link them which usually is never changed. So when a new record is created that record will live with that ID for its life, also known as a (unique identifier which tends to be an auto-incremented number that is dynamically generated and can not be repeated).
Since records tend to be linked this way, any other fields that exist in the record (firstName, lastName, Url, blah blah) then can be changed without the original ID being disturbed.
So to answer your question from my experience I would assume Google links from a unique identifier of some sort and not the URL directly.
Hope I didn't lose you, its my favorite subject...but no one here speaks that language to much
-
That makes sense, thanks for getting back to me so fast!
Perhaps you can help answer my next question. I have a client who used to host his domain at "www.oldurl.com", and has migrated his website to "www.newurl.com". He wants to use his old domain "www.oldurl.com", so he setup forwarding/masking so that when someone tries to access "www.oldurl.com" they are forwarded to "www.newurl.com" but the url shown to the user is "www.oldurl.com".
My client want his old url "www.oldurl.com" to be ranked in Google, but from what I understand his new url will be ranked. I know masking is really bad for SEO, and I want to educate my client as to why on the technical side. I have read Google see's all the content as duplicate with masking. Do you know the details as to why?
-
Hey Cesar,
Thanks for the links! Really useful info there.
Unfortunately they I couldn't find the answer I was looking for so I'll be more specific in what I'm asking.
From what I understand Google uses two database systems. One contains keywords and the other contains cached pages. How does a keyword entry point to a page entry? Does it use a unique id number, or does it use the url that page is using in the "live" vesion on the web?
-
Just because you create a new page and delete the old one, Google won't know immediately about it. So if Google crawls the new page before it's had a chance to crawl the old one, then it will indeed consider the new page to be duplicate content. Then when it tries to crawl the old page, it will discover that it no longer exists. However, as long as links to the old page exist, it will continue to try to crawl that page. Eventually it may de-index the old page if it keeps returning an error.
Bottom line, if you are moving content to a new URL, be sure to include a 301 redirect on the old page so that Google (and other search engines) know that the piece of content has moved. You can also do this with canonical tags, but 301s are more effective.
-
Thanks for the response and links Takeshi. Maybe I can rephrase the question to be more clear. Let's say a piece of content (or page) is at the url "www.oldurl.com/page". During a migration this same piece of content now at the url "www.newurl.com/page". The "www.oldurl.com" doesn't exist anymore so there isn't duplicate content in the live web.
Would Google create a new entry in it's "page directory" (what is the industry standard name for this directory?) and give it the url "www.newurl.com/page"?
If it does create a new entry, would Google keep the old entry "www.oldurl.com/page" although the old url doesn't exist in the "live" web anymore?
-
Wow you just asked questions that would require about 10,000,000,000 answers
Lets start here
- Video from the man himself Mr. Matt Cutts - Matt Cutts (Works for Google)
- Great Web 2.0 Page create from Google themself - (Google Them self)
- Older but still relevant description about how "backlinks" affect PR - (Google Them self)
-
This a pretty confusing question, and the terminology you use is different from industry standard. Check out these links for a quick overview of how Google works:
- http://www.google.com/insidesearch/howsearchworks/thestory/
- http://www.googleguide.com/google_works.html
If you are just worried about changing a page's url, just be sure to put in a 301 redirect from the old page to the new page. That way, even if Google has an older version of the page indexed, it will automatically redirect the user to the new page as well as help Google discover the new location of the page.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
My website is currently failing Google's mobile friendly test. What are my options?
What can I tell my developer so I pass this test? What will they need to develop A web mockup? Is there an easy code to implement?
Technical SEO | | pmull0 -
How to know how much pages are indexed on Google?
I have a big site, there are a way to know what page are not indexed? I know that you can use site: but with a big site is a mess to check page by page. This is a tool or a system to check a entire site and automatically find non-indexed pages?
Technical SEO | | markovald0 -
Sitemap issue? 404's & 500's are regenerating?
I am using the WordPress SEO plugin by Yoast to generate a sitemap on http://www.atozqualityfencing.com. Last month, I had an associate create redirects for over 200 404 errors. She did this via the .htaccess file. Today, there are the same amount of 404s along with a number of 503 errors. This new Wordpress website was constructed on a subdirectory and made live by simply entering some code into the .htaccess file in order to direct browsers to the content we wanted live. In other words, the content actually resides in a subdirectory titled "newsite" but is shown live on the main url. Can you tell me why we are having these 404 & 503 errors? I have no idea where to begin looking.
Technical SEO | | JanetJ0 -
Google not indexing my website
Hi guys, We have this website http://www.m-health-expo.nl/ but it is not indexed by google. In webmaster tools google says that it can not fetch the site due to the robots.txt but i do not see any faults in it. http://www.m-health-expo.nl/robots.txt Do you see something strange, it really bothers me.
Technical SEO | | RuudHeijnen0 -
No existing pages in Google index
I have a real estate portal. I have a few categories - for example: flats, houses etc. Url of category looks like that: mydomain.com/flats/?page=1 Each category has about 30-40 pages - BUT in Google index I found url like: mydomain.com/flats/?page=1350 Can you explain it? This url contains just headline etc - but no content! (it´s just generated page by PHP) How is it possible, that Google can find and index these pages? (on the web, there are no backlinks on these pages) thanks
Technical SEO | | visibilitysk0 -
Is a Rel="cacnonical" page bad for a google xml sitemap
Back in March 2011 this conversation happened. Rand: You don't want rel=canonicals. Duane: Only end state URL. That's the only thing I want in a sitemap.xml. We have a very tight threshold on how clean your sitemap needs to be. When people are learning about how to build sitemaps, it's really critical that they understand that this isn't something that you do once and forget about. This is an ongoing maintenance item, and it has a big impact on how Bing views your website. What we want is end state URLs and we want hyper-clean. We want only a couple of percentage points of error. Is this the same with Google?
Technical SEO | | DoRM0 -
Cantags within links affect Google's perception of them?
Hi, All! This might be really obvious, but I have little coding experience, so when in doubt - ask... One of our client site's has navigation that looks (in part) like this: <a <span="">href</a><a <span="">="http://www.mysite.com/section1"></a> <a <span="">src="images/arrow6.gif" width="13" height="7" alt="Section 1">Section 1</a><a <span=""></a> WC3 told us the tags invalidate, and while I ignored most of their comments because I didn't think it would impact on what search engines saw, because thesetags are right in the links, it raised a question. Anyone know if this is for sure a problem/not a problem? Thanks in advance! Aviva B
Technical SEO | | debi_zyx0 -
We changed the URL structure 10 weeks ago and Google hasn't indexed it yet...
We recently modified the whole URL structure on our website, which resulted in huge amount of 404 pages changing them to nice human readable urls. We did this in the middle of March - about 10 weeks ago... We used to have around 5000 404 pages in the beginning, but this number is decreasing slowly. (We have around 3000 now). On some parts of the website we have also set up a 301 redirect from the old URLs to the new ones, to avoid showing a 404 page thus making the “indexing transmission”, but it doesn’t seem to have made any difference. We've lost a significant amount of traffic, because of the URL changes, as Google removed the old URLs, but hasn’t indexed our new URLs yet. Is there anything else we can do to get our website indexed with the new URL structure quicker? It might also be useful to know that we are a page rank 4 and have over 30,000 unique users a month so I am sure Google often comes to the site quite often and pages we have made since then that only have the new url structure are indexed within hours sometimes they appear in search the next day!
Technical SEO | | jack860