How Does Google's "index" find the location of pages in the "page directory" to return?
-
This is my understanding of how Google's search works, and I am unsure about one thing in particular:
- Google continuously crawls websites and stores each page it finds (let's call this the "page directory")
- Google's "page directory" is a cache, so it isn't the "live" version of a page
- Google keeps separate storage called "the index", which contains keywords extracted from the crawled pages. Each keyword in "the index" points to the pages in the "page directory" that contain it.
- When someone searches for a keyword, that keyword is looked up in the "index", which returns all relevant pages from the "page directory"
- The returned pages are then ranked by the algorithm
The one part I'm unsure of is how Google's "index" knows the location of relevant pages in the "page directory". The keyword entries in the "index" must point into the "page directory" somehow. I'm thinking each page in the "page directory" has a URL, and the entries in the "index" contain these URLs. Since Google's "page directory" is a cache, would the URLs be the same as the live website's (and would the keywords in the "index" point to these URLs)?
For example, if a webpage is found at www.website.com/page1, would the "page directory" store this page under that URL in Google's cache?
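To make my mental model concrete, here's a rough sketch (purely illustrative, with made-up URLs and keywords; I have no idea whether Google actually does it this way):

```python
# A toy "page directory": cached copies of crawled pages, keyed by URL.
page_directory = {
    "www.website.com/page1": "<html>...cached copy of the page...</html>",
}

# A toy "index": each keyword points at the cached pages containing it.
index = {
    "coffee": ["www.website.com/page1"],
    "roastery": ["www.website.com/page1"],
}

def search(keyword):
    # Look the keyword up in the "index", then fetch each matching
    # page from the "page directory" cache.
    return [page_directory[url] for url in index.get(keyword, [])]

print(search("coffee"))  # returns the cached page(s) for ranking
```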
The reason I want to discuss this is to understand the search process well enough to know the effects of changing a page's URL.
-
Yeah, that makes sense. I also have a lot of experience with databases and website back ends, so I speak your language.
I'm wondering, then, how Google correlates the URL with the page entries. Maybe each page entry has a URL field, so Google knows the location of the live version and can keep that entry in the "page directory" database up to date?
-
That is a question no one here can answer; we can't speak for how Google does things internally.
But... as a web/database programmer of 14+ years, let me tell you how it's "generally" done.
Usually, when you have to link two separate sets of data together (i.e. databases or tables), a unique ID is created to link them, and that ID is never changed. When a new record is created, it lives with that ID for its whole life. This is the record's unique identifier, which tends to be an auto-incremented number that is dynamically generated and can never be repeated.
Since records are linked this way, any other field in the record (firstName, lastName, url, and so on) can be changed without the original ID being disturbed.
So to answer your question: from my experience, I would assume Google links records with a unique identifier of some sort, not with the URL directly.
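To illustrate, here's a minimal sketch of that general pattern (hypothetical table and column names, nothing Google-specific):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A "page directory" table: every page gets an auto-incremented unique ID
# that never changes for the life of the record.
cur.execute("""
    CREATE TABLE pages (
        page_id INTEGER PRIMARY KEY AUTOINCREMENT,
        url     TEXT NOT NULL
    )
""")

# An "index" table: keywords link to pages by ID, not by URL.
cur.execute("""
    CREATE TABLE keywords (
        keyword TEXT NOT NULL,
        page_id INTEGER NOT NULL REFERENCES pages(page_id)
    )
""")

cur.execute("INSERT INTO pages (url) VALUES (?)", ("www.website.com/page1",))
page_id = cur.lastrowid
cur.execute("INSERT INTO keywords VALUES (?, ?)", ("coffee", page_id))

# The URL field can change freely; the ID (and every keyword pointing
# at it) is never disturbed.
cur.execute("UPDATE pages SET url = ? WHERE page_id = ?",
            ("www.website.com/new-page1", page_id))
```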
Hope I didn't lose you; it's my favorite subject... but not many people here speak that language much.
-
That makes sense, thanks for getting back to me so fast!
Perhaps you can help answer my next question. I have a client who used to host his site at "www.oldurl.com" and has migrated it to "www.newurl.com". He still wants to use his old domain, so he set up forwarding with masking: when someone tries to access "www.oldurl.com" they are forwarded to "www.newurl.com", but the URL shown to the user remains "www.oldurl.com".
My client wants his old URL "www.oldurl.com" to rank in Google, but from what I understand the new URL is what will be ranked. I know masking is really bad for SEO, and I want to educate my client on the technical reasons why. I have read that with masking, Google sees all the content as duplicate. Do you know the details as to why?
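From what I can tell, this kind of masking is usually done with a frame: if that's what his registrar set up, then "www.oldurl.com" is serving a thin page along these lines (a hedged sketch, hypothetical markup):

```html
<!-- What frame-based masking typically serves at www.oldurl.com: the
     address bar keeps the old URL while the frame loads the new site,
     so the same content appears to live under both domains. -->
<html>
  <head><title>Old Site</title></head>
  <frameset rows="100%,*" frameborder="no" border="0">
    <frame src="https://www.newurl.com/" />
  </frameset>
</html>
```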
-
Hey Cesar,
Thanks for the links! Really useful info there.
Unfortunately, I couldn't find the answer I was looking for, so I'll be more specific in what I'm asking.
From what I understand, Google uses two database systems: one contains keywords and the other contains cached pages. How does a keyword entry point to a page entry? Does it use a unique ID number, or the URL that the page uses in the "live" version on the web?
-
Just because you create a new page and delete the old one doesn't mean Google will know about it immediately. So if Google crawls the new page before it has had a chance to re-crawl the old one, it will indeed consider the new page duplicate content. Then, when it tries to crawl the old page, it will discover that the page no longer exists. However, as long as links to the old page exist, Google will keep trying to crawl it, and it may eventually de-index the old page if it keeps returning an error.
Bottom line: if you are moving content to a new URL, be sure to put a 301 redirect on the old URL so that Google (and other search engines) know the content has moved. You can also signal this with canonical tags, but 301s are more effective.
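For example, on an Apache server the 301 can be a one-liner in the old site's .htaccess (a sketch with hypothetical paths):

```apache
# Permanently redirect the old page to its new home (hypothetical URLs)
Redirect 301 /page https://www.newurl.com/page
```

The canonical-tag alternative is a `<link rel="canonical" href="https://www.newurl.com/page">` in the old page's head, but as noted above, the 301 is the stronger signal.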
-
Thanks for the response and links, Takeshi. Maybe I can rephrase the question to be clearer. Let's say a piece of content (or page) is at the URL "www.oldurl.com/page". After a migration, that same piece of content is now at "www.newurl.com/page". "www.oldurl.com" doesn't exist anymore, so there is no duplicate content on the live web.
Would Google create a new entry in its "page directory" (what is the industry-standard name for this directory?) and file it under the URL "www.newurl.com/page"?
If it does create a new entry, would Google keep the old entry for "www.oldurl.com/page" even though the old URL no longer exists on the "live" web?
-
Wow, you just asked questions that would require about 10,000,000,000 answers.
Let's start here:
- A video from the man himself, Matt Cutts (works for Google)
- A great Web 2.0 page created by Google themselves
- An older but still relevant description of how "backlinks" affect PageRank, also from Google
-
This is a pretty confusing question, and the terminology you use differs from the industry standard. Check out these links for a quick overview of how Google works:
- http://www.google.com/insidesearch/howsearchworks/thestory/
- http://www.googleguide.com/google_works.html
If you are just worried about changing a page's URL, be sure to put in a 301 redirect from the old URL to the new one. That way, even if Google still has the old version of the page indexed, the redirect will automatically send users to the new page and help Google discover its new location.
Related Questions
-
Google Search Console "Change of Address": just 301s on the source domain?
Hi all. New here, so please be gentle. 🙂 I've developed a new site, and my client also wanted to rebrand from .co.nz to .nz. On the source (.co.nz) domain, I've set up a load of 301 redirects to the relevant new pages on the new domain (the URL structure is changing as well).
Technical SEO | WebGuyNZ
E.g., on the old domain: https://www.mysite.co.nz/myonlinestore/t-shirt.html
In the .htaccess on the old/source domain, I've set up 301s (using RewriteRule), so that when https://www.mysite.co.nz/myonlinestore/t-shirt.html is accessed, it does a 301 to:
https://mysite.nz/shop/clothes/t-shirt
All these 301s are working fine; I've checked in dev tools and a 301 is being returned. My question is: are 301s on the source domain only enough to start a "Change of Address" in Google's Search Console? Their wording indicates it's enough, but I'm concerned that maybe I also need redirects on the target domain. I.e., does the Search Console Change of Address process work this way?
It looks at the source domain URL (already in Google's index), sees the 301, then updates the index (and hopefully passes the link juice) to the new URL. Also, I've set up both the source and target Search Console properties as Domain Properties. Does that mean I no longer need to specify whether the source and target properties are HTTP or HTTPS? I couldn't see that option when I created the properties. Thanks!
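For reference, a hedged sketch of the kind of per-URL RewriteRule described above (assuming Apache mod_rewrite on the source domain, using the example URLs):

```apache
RewriteEngine On
# One old .co.nz URL, permanently (301) redirected to its new home
RewriteRule ^myonlinestore/t-shirt\.html$ https://mysite.nz/shop/clothes/t-shirt [R=301,L]
```
-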
Site's meta description is not being shown in Google Search results; instead, our privacy policy is getting indexed.
We re-launched our new site and put the redirects in place. Our site is https://www.fico.com/en. When I search for "fico" in Google, I see the privacy policy indexed as the meta description instead of our actual meta description. I have edited the meta description and requested that Google re-index our site. Not sure what to do next? Thanks for your advice.
Technical SEO | gosheen
-
Google dropping pages from SERPs even though indexed and cached. (Shift over to https suspected.)
Anybody know why pages that have previously been indexed (and that are still present in Google's cache) are now not appearing in Google SERPs? All the usual suspects (noindex, robots, duplication filter, 301s) have been ruled out. We shifted our site over from http to https last week and it appears to have started then, although we have also been playing around with our navigation structure a bit. Here are a few examples:
Example 1:
- Live URL: https://www.normanrecords.com/records/149002-memory-drawings-there-is-no-perfect-place
- Cached copy: http://webcache.googleusercontent.com/search?q=cache:https://www.normanrecords.com/records/149002-memory-drawings-there-is-no-perfect-place
- SERP (1): https://www.google.co.uk/search?q=memory+drawings+there+is+no+perfect+place
- SERP (2): https://www.google.co.uk/search?q=memory+drawings+there+is+no+perfect+place+site%3Awww.normanrecords.com
Example 2:
- SERP: https://www.google.co.uk/search?q=deaf+center+recount+site%3Awww.normanrecords.com
- Live URL: https://www.normanrecords.com/records/149001-deaf-center-recount-
- Cached copy: http://webcache.googleusercontent.com/search?q=cache:https://www.normanrecords.com/records/149001-deaf-center-recount-
These are pages that have been linked to prominently from our homepage (Moz PA of 68) for days, are present and correct in our sitemap (https://www.normanrecords.com/catalogue_sitemap.xml), have unique content, have decent on-page optimisation, etc. We moved over to https on 11 Aug. There were some initial wobbles (e.g. 301s from normanrecords.com to www.normanrecords.com got caught up in a nasty loop due to the conflicting 301 from http to https), but these were spotted and resolved within minutes. There have been some other changes to the structure of the site (e.g. a reduction in the navigation options), but nothing I know of that would cause pages to drop like this. For the first example (Memory Drawings), we were ranking on the first page right up until this morning and had been receiving Google traffic ever since the page was added on 4 Aug. Any help very much appreciated! At the very end of my tether / understanding here... Cheers, Nathon
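For what it's worth, the redirect loop described above is commonly avoided by collapsing the scheme and host rules into a single hop; a hedged sketch, assuming Apache mod_rewrite:

```apache
# Send any non-https or non-www request straight to the canonical
# https://www host in one hop, so the http->https and non-www->www
# rules cannot chain or conflict.
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.normanrecords.com/$1 [R=301,L]
```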
Technical SEO | nathonraine
-
"nofollow pages" or "duplicate content"?
We have a huge site with lots of geographical pages in this structure:
- domain.com/country/resort/hotel
- domain.com/country/resort/hotel/facts
- domain.com/country/resort/hotel/images
- domain.com/country/resort/hotel/excursions
- domain.com/country/resort/hotel/maps
- domain.com/country/resort/hotel/car-rental
The problem is that the text on e.g. /excursions is often exactly the same on .../alcudia/hotel-sea-club/excursions and .../alcudia/hotel-beach-club/excursions: the two hotels offer the same excursions, and the intro text on the pages is exactly the same throughout the entire site. This is also a problem on the /images and /car-rental pages. I think in most cases the only differences on these pages are the title, description, and H1. These pages do not attract a lot of visits through search engines. But to avoid them being flagged as duplicate content (we have more than 4,000 of these pages: /excursions, /maps, /car-rental, /images), do I add a nofollow tag to them, do I block them in robots.txt, or should I just leave them and live with them being flagged as duplicate content? I'm waiting for our web team to add a function to insert a geographical name in the text, so I could add e.g. #HOTELNAME# in the text and thereby avoid the duplicate text. Right now we have intros like "When you visit the hotel ..." instead of "When you visit Alcudia Sea Club". But until the web team has fixed these geo-tags, what should I do? What would you do, and why?
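The placeholder idea could be as simple as this (a hypothetical sketch of the #HOTELNAME# substitution):

```python
# Hypothetical sketch of the #HOTELNAME# geo-placeholder idea above
intro_template = "When you visit #HOTELNAME#, these excursions are available..."

def render_intro(hotel_name: str) -> str:
    return intro_template.replace("#HOTELNAME#", hotel_name)

print(render_intro("Alcudia Sea Club"))    # unique intro per hotel page
print(render_intro("Alcudia Beach Club"))
```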
Technical SEO | alsvik
-
I'm redesigning a website which will have a new URL format. What's the best way to redirect all the old URLs to the new ones? Is there an automated, fast way to do this?
For example, the new URL will be https://oregonoptimalhealth.com/about_us.html, while the old ones were like http://www.oregonoptimalhealth.com/home/ooh/smartlist_1/services.html. I have to redirect almost 100 old pages to the correct new pages. What's the best and easiest way to do this?
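One common approach is to generate the redirect rules rather than write them by hand; a sketch, assuming a simple two-column CSV mapping of old paths to new URLs (hypothetical file names):

```python
import csv

# redirects.csv is assumed to hold rows like:
#   /home/ooh/smartlist_1/services.html,https://oregonoptimalhealth.com/about_us.html
with open("redirects.csv", newline="") as src, open("redirects.htaccess", "w") as out:
    for old_path, new_url in csv.reader(src):
        # Emit one Apache "Redirect 301" directive per page
        out.write(f"Redirect 301 {old_path} {new_url}\n")
```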
Technical SEO | PolarisMarketing
-
Website stability and its effect on SEO
What is the best way to recover from previous website stability issues? We had page-load-time and site-stability problems over the course of several months, and as a result our keyword rankings plummeted. Now that the issues have been resolved, what's the best/quickest way to regain our rankings for specific keywords? Thanks, Eric
Technical SEO | MediaCause
-
Can I format my H1 to be smaller than H2s and H3s on the same page?
I would like to create a web design with a 12px H1 and sub-headings on the page more like 24px. Will search engines see this and dislike it? The reason for doing it is that I want to put a generic page title in the banner and more poetic headings above the main body. Example: small H1: "Wholesale coffee, online coffee shop and London roastery"; large H2: "Respect the bean..." Thanks, Scott
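The sizing itself would just be CSS along these lines (a sketch):

```css
/* Visually demote the H1 and promote the sub-headings */
h1 { font-size: 12px; }
h2, h3 { font-size: 24px; }
```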
Technical SEO | Crumpled_Dog
-
Can JavaScript affect Google's index/ranking?
We changed our website template about a month ago and since then have experienced a huge drop in rankings, especially for our home page. We kept the same URL structure across the entire website, pretty much the same content, and the same on-page SEO. We expected some rank drop, but not one this big. We used to rank near the top of the second page with the homepage, and now we've lost about 20-25 positions. What we changed: the new homepage structure is more user-friendly, with much better organized information, and we have a slider presenting our main services. 80% of the content on the homepage sits inside the slideshow and three tabs, but all of these elements are JavaScript. The content is unique and SEO-optimized, but when I disable JavaScript it becomes completely unavailable. Could this be the reason for the huge rank drop? I used Webmaster Tools' Fetch as Googlebot and it looks like Google reads what's inside the JavaScript slideshow perfectly, so I didn't worry until I found this on SEOMoz: "Try to avoid ... using javascript ... since the search engines will ... not indexed them ..." One more weird thing: although we have no duplicate content and the entire website has been cached, for a few pages (including the homepage) the picture snippet is from the old website. All main URLs are the same; we removed some old ones we don't need anymore, so we kept all the inbound links. The 301 redirects are properly set, but still we have a huge rank drop. Also (not sure if this is important or not), the robots.txt file disallows some folders like images, modules, templates (Joomla components). We still have some HTML errors and warnings, but far fewer than on the old website. Any advice would be much appreciated, thank you!
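For what it's worth, one common way to rule JavaScript out as the culprit is progressive enhancement: keep the slide/tab copy in the HTML itself and let the script only add behaviour on top. A hedged sketch (hypothetical markup and file name):

```html
<div class="slider">
  <div class="slide">
    <h2>Our main services</h2>
    <!-- This copy lives in the markup, so crawlers and no-JS visitors
         still see it even if the slider script never runs. -->
    <p>Service descriptions go here...</p>
  </div>
</div>
<script src="slider.js"></script> <!-- hypothetical enhancement script -->
```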
Technical SEO | echo1