XML Sitemap Questions For Big Site
-
Hey Guys,
I have a few questions about XML sitemaps.
-
For a social site that is going to have personal accounts created, what is the best way to get them indexed? When it comes to profiles, I found out that Twitter (https://twitter.com/i/directory/profiles) and Facebook (https://www.facebook.com/find-friends?ref=pf) have directory pages, but Google Plus has XML index pages (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml).
-
If we go the XML route, how would we automatically add new profiles to the sitemap? Or is the only option to keep updating the XML sitemap with third-party software (sitemapwriter)?
-
If a user chooses not to have their profile indexed (by default it will be indexable), how do we go about deindexing that profile? Is there an automatic way of doing this?
-
Lastly, has anyone dabbled with Google Sitemap Generator (https://code.google.com/p/googlesitemapgenerator/)? If so, do you recommend it?
Thank you!
-
-
Thanks for the input guys!
I believe Twitter and Facebook don't run sitemaps for their profiles. What they have is a directory of all their profiles (Twitter: https://twitter.com/i/directory/profiles Facebook: https://www.facebook.com/find-friends?ref=pf) that they use to get the profiles crawled. However, I feel the best approach is XML sitemaps, and Google Plus actually does this with its profiles (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml). Quite frankly, I would rather follow Google than FB or Twitter... I'm just wondering how the hell they keep up that monster! Do they create a new sitemap every time one hits 50k? When do they update their sitemaps: daily, weekly, or monthly, and how?
One other question I have is whether there are any penalties for getting a lot of pages crawled at once. Meaning one day we have 10 pages and the next we have 10,000 or 50,000 pages...
Thanks again guys!
-
I guess the way I was explaining it was for scalability on a large site. You have to think that a site like FB or Twitter, with hundreds of millions of users, still has the limit of 50k records per sitemap. So if they are running sitemaps, they have thousands of them.
-
I'm not a web developer, so this may be wrong, but I feel like it might be easier to just add every user to the XML sitemap and then add a noindex robots meta tag to the pages of users who don't want their profiles indexed.
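As a rough illustration of the noindex idea, here is a minimal Python sketch that picks the robots meta tag for a profile page based on the user's preference. The function name and the `profile_is_indexable` flag are made up for the example; your own template system would do the equivalent. Keep in mind that a crawler has to be able to fetch the page to see the tag, so the profile must not also be blocked in robots.txt.

```python
def robots_meta_tag(profile_is_indexable: bool) -> str:
    """Return the robots meta tag to place in the <head> of a profile page.

    `profile_is_indexable` is a hypothetical per-user setting; profiles are
    assumed to be indexable unless the user opts out.
    """
    content = "index, follow" if profile_is_indexable else "noindex"
    return f'<meta name="robots" content="{content}">'


# A user who opted out gets a noindex tag on their profile page.
print(robots_meta_tag(False))
```

Since the tag is chosen at render time, a user changing their setting takes effect the next time the page is crawled, with no separate removal step in your own code.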
-
If it were me and someone were asking me to design a system like that, I would design it in a few parts.
First I would create an application that handles the sitemap minus profiles, just for your TOS, sign-up pages, terms, and whatever other pages like that.
Then I would design a system that handles the actual profiles. It would be pretty complex and resource-intensive as the site grew, but the main idea flows like this:
Start generation, grab the user record with id 1 in the database, check to see if indexable (move to next if not), see what pages are connected, write to xml file, loop back and start with record #2.
There are a few concessions you have to make. You need to keep track of the number of records in a file so you know when to start another one, because you can only have 50k records in one file.
The way I would handle the whole process for a large site is this: sync the required tables via a weekly or daily cron to another instance (server). Call the PHP script (because that is what I use) that creates the first sitemap for the normal site-wide pages. At the end of that script, execute the script that generates the user profile sitemaps. Because the sitemap protocol links files through a sitemap index rather than from one sitemap to the next, list the location of every generated file in a sitemap index file; as you grow it might take anywhere from 2 to 10,000 sitemap files.
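To make the loop concrete, here is a minimal sketch of the profile part in Python (the answer above uses PHP; this is just an illustration, not the poster's actual script). It skips non-indexable users, starts a new file every 50,000 URLs, and lists the files in a sitemap index, which is how the sitemap protocol ties multiple files together. The `/profiles/` URL pattern, the file names, and the `(username, indexable)` tuples are all assumptions for the example; real code would read from the synced database and write the files to disk.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit in the sitemap protocol
XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def chunk(items, size):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def render_urlset(urls):
    """Render one sitemap file containing the given page URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{XMLNS}">\n{entries}</urlset>\n'
    )


def render_index(sitemap_urls):
    """Render a sitemap index that points at each sitemap file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{XMLNS}">\n{entries}</sitemapindex>\n'
    )


def build_profile_sitemaps(users, base_url, max_urls=MAX_URLS_PER_SITEMAP):
    """Return {filename: xml} for an iterable of (username, indexable) pairs.

    The URL pattern and file names are hypothetical.
    """
    profile_urls = (
        f"{base_url}/profiles/{quote(name, safe='')}"
        for name, indexable in users
        if indexable  # users who opted out are left out entirely
    )
    files = {}
    for n, batch in enumerate(chunk(profile_urls, max_urls), start=1):
        files[f"sitemap-profiles-{n}.xml"] = render_urlset(batch)
    sitemap_names = list(files)  # snapshot before adding the index itself
    files["sitemap-index.xml"] = render_index(
        f"{base_url}/{name}" for name in sitemap_names
    )
    return files
```

Run from a daily or weekly cron job, a script like this regenerates everything from scratch, so new profiles and opt-outs are picked up on the next run without any manual editing.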
One thing I would make sure to do is get a list of crawler IP addresses and add an allow/deny rule in your .htaccess. That way the sitemaps are visible only to the search engines.