XML Sitemap Questions For Big Site
-
Hey Guys,
I have a few question about XML Sitemaps.
-
For a social site that is going to have presonal accounts created, what is the best way to get them indexed? When it comes to profiles I found out that twitter (https://twitter.com/i/directory/profiles) and facebook (https://www.facebook.com/find-friends?ref=pf) have directory pages, but Google plus has xml index pages (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml).
-
If we go the XML route, how would we automatically add new profiles to the sitemap? Or is the only option to keep updating your xml profiles using a third party software (sitemapwriter)?
-
If a user chooses to not have their profile indexed (by default it will be index-able), how do we go about deindexing that profile? Is their an automatic way of doing this?
-
Lastly, has anyone dappled with google sitemap generator (https://code.google.com/p/googlesitemapgenerator/) if so do you recommend it?
Thank you!
-
-
Thanks for the input guys!
I believe Twitter and Facebook don't run sitemaps for their profiles, what they have is a directory for all their profiles (twitter: https://twitter.com/i/directory/profiles Facebook: https://www.facebook.com/find-friends?ref=pf) and use that to get their profiles crawled, however I feel the best approach is through xml sitemaps and Google plus actually does this with their profiles (http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml) and quite frankly I would rather follow Google then FB or Twitter... I'm just now wondering how the hell they upkeep that monster! Does it create a new sitemap everything one hits 50k? When do they update their sitemap? daily, weekly, or monthly and how?
One other question I have is if their is any penalties to getting a lot of pages crawled at once? Meaning one day we have 10 pages and the next we have 10,000 pages or 50,000 pages...
Thanks again guys!
-
I guess the way I was explaining it was for scalabilty on a large site. You have to think a site like fb or twitter with hundreds of millions of users still has the limitation of only having 50k records in a site map. So if they are running site maps, they have hundreds.
-
I'm not a web developer, so this might may be wrong, but I feel like it might be easier to just add every user to the xml sitemap and then add a noindex robots meta tag ons users pages that don't want to their profiles to be indexed.
-
If it were me and someone were asking me to design a system like that, I would design it in a few parts.
First I would create an application that handled the sitemap minus profiles, just for your tos, sign up pages, terms, and what ever pages like that.
Then I would design a system that handled the actual profiles. It would be pretty complex and resource intensive as the site grew. But the main idea flows like this
Start generation, grab the user record with id 1 in the database, check to see if indexable (move to next if not), see what pages are connected, write to xml file, loop back and start with record #2.
There are a few concessions you have to make, you need to keep up with the number of records in a file before you start another file. You can only have 50k records in one file.
The way I would handle the process in total for a large site would be this, sync the required tables via a weekly or daily cron to another instance (server). Call the php script (because that is what I use) that creates the first sitemap for the normal site wide pages. At the end of that site map, put a location for the user profile sitemap, then at the end of the scrip, execute the user profile site map generating script. At the end of each site map, put the location of the next site map file, because as you grow it might take 2-10000 site map files.
One thing that I would ensure to do is get a list of crawler ip addresses and in your .htaccess have an allow / deny rule. That way you can make the site maps only visible to the search engines.
Got a burning SEO question?
Subscribe to Moz Pro to gain full access to Q&A, answer questions, and ask your own.
Browse Questions
Explore more categories
-
Moz Tools
Chat with the community about the Moz tools.
-
SEO Tactics
Discuss the SEO process with fellow marketers
-
Community
Discuss industry events, jobs, and news!
-
Digital Marketing
Chat about tactics outside of SEO
-
Research & Trends
Dive into research and trends in the search industry.
-
Support
Connect on product support and feature requests.
Related Questions
-
Redirecting an Entire Site to a Page on Another Site?
So I have a site that I want to shut down http://vowrenewalsmaui.com and redirect to a dedicated Vow Renewals page I am making on this site here: https://simplemauiwedding.net. My main question is: I don't want to lose all the authority of the pages and if I just redirect the site using my domain registrar's 301 redirect it will only redirect the main URL not all of the supporting pages, to my knowledge. How do I not lose all the authority of the supporting pages and still shut down the site and close down my site builder? I know if I leave the site up I can redirect all of the individual pages to corresponding pages on the other site, but I want to be done with it. Just trying to figure out if there is a better way than I know of. The domain is hosted through GoDaddy.
Intermediate & Advanced SEO | | photoseo10 -
How is my 301 redirected site stealing rankings from the main site?
Hello, I have a site, drhobelt.com, that 301 redirects to the main site, drhonow.com. Not only is drhobelt.com still indexed, but it recently stole rankings from drhonow.com for "decompression belt" related terms. What could be causing this? How do I reclaim the rankings for drhonow.com? Thanks for reading!!
Intermediate & Advanced SEO | | DA20130 -
XML sitemaps questions
Hi All, My developer has asked me some questions that I do not know the answer to. We have both searched for an answer but can't find one.... So, I was hoping that the clever folk on Moz can help!!! Here is couple questions that would be nice to clarify on. What is the actual address/name of file for news xml. Can xml site maps be generated on request? Consider following scenario: spider requests http://mypage.com/sitemap.xml which permanently redirects to extensionless MVC 4 page http://mypage.com/sitemapxml/ . This page generates xml. Thank you, Amelia
Intermediate & Advanced SEO | | CommT0 -
Mobile Sitemaps
We are planning on creating a mobile site using a different URL. Our current sitemap creator won't dynamically detect mobile pages using the rel="alternate" tag but can can create a Project for that domain in Sitemap Creator and use the "mobile" option when you export it. The Sitemap Creator will then insert the mobile:mobilecontent tag for all the URLs in that sitemap. </mobile:mobile> Is this okay or will it cause problems?
Intermediate & Advanced SEO | | theLotter0 -
Google penalized site--307/302 redirect to new site-- Via intermediate link—New Site Ranking Gone..?
Hi, I have a site that google had placed a manual link penalty on, let’s call this our
Intermediate & Advanced SEO | | Robdob2013
company site. We tried and tried to get the penalty removed, and finally gave up and purchased another name. It was our understanding that we could safely use either a 302 or 307 temporary redirect in order to redirect people from our old domain to our new one.. We put this into place several months and everything seemed to be going along well. Several days ago I noticed that our root domain name had dropped for our selected keyword from position 9 to position 65. Upon looking into our GWT under “Links to Your site” , I have found many, many, many links which were pointed to our old google penalized domain name to our new root domain name each of this links had a sub heading “Via this intermediate link -> Our Old Domain Google Penalized Domain Name” In light of all of this going on, I have removed the 307/302 redirect, have brought the
old penalized site back which now consists of a basic “we’ve moved page” which is linked to our new site using a rel=’nofollow’ I am hoping that -1- Our new domain has probably not received a manual penalty and is most likely now
received some sort of algorithmic penalty, and that as these “intermediate links” will soon disappear because I’m no longer doing the 302/307 from the old sight to the new. Do you think this is the case now or that I now have a new manual penalty place on the new
domain name.. I would very much appreciate any comments and/or suggestions as to what I should or can do to get this fixed. I need to still keep the old domain name as this address has already been printed on business cards many, many years ago.. Also on a side note some of the sub pages of the new root domain are still ranking very
well, it’s only the root domain that is now racking awfully.. Thanks,0 -
Optimal site structure for travel site
Hi there, I am seo-managing a travel website where we are going to make a new site structure next year. We have about 4000 pages on the site at the moment. The structure is only 2-levels at the moment: Level 1: Homepage Level 2: All other pages (4000 individual pages - (all with different urls)) We are adding another 2-3 levels, but we have a challenge: We have potentially 2 roads to the same product (e.g. "phuket diving product") domain.com/thailand/activities/diving/phuket-diving-product.asp domain.com/activities/diving/thailand/phuket-diving-product.asp I would very much appreciate your view on the problem: How do I solve this dilemma/challenge from a SEO standpoint? I want to avoid DC if possible, I also only want one landing page - for many reasons. And usability is of course also very important. Best regards, Chris
Intermediate & Advanced SEO | | sembseo0 -
Is my site being penalized?
I launched http://rumma.ge in February of this year. Because I'm using a domain hack (the Georgian domain), I'd really like to rank for just the word "rummage". After launching, I was steady at around page 4/5 on searches for "rummage". However since then I've tumbled out of the first 100. In fact I can't even find the site in the first 20 pages on Google for that search. Even a search for my exact homepage title text doesn't bring up the site, despite the fact that the site is still in the index. I'm wondering if one of the following could be the root cause: We have a ccTLD (.ge)--not sure about the impacts of this, but seems like it might not be the root cause because we were ranking for "rummage" when we first launched. Tried running an Adwords campaign but the site was flagged as a "bridge page" (working on getting this addressed). I'm wondering if this could have carryover impacts into natural search rankings? We've tried doing some press and built up a decent number of backlinks over the past couple of months, many of which had "rummage" in the anchor text. This was all organic, but happened over the span of a month which may be too fast? Am I being penalized? Beyond checking indexing of the site, is there a way to tell if I've been flagged for some bad behavior? Any help or thoughts would be greatly appreciated. I'm really confused by this since I feel like I've been doing things right and my rankings have been travelling downward. Thanks!! Matt
Intermediate & Advanced SEO | | minouye0 -
Site dancing
Hi guys I have a site which is dancing. I mean one day is on position 20 , if I put more backlinks is falling, after rising again,, I dont know what is going on. The site is 2 years old, pr 2, authority 35. Why this is happening? Usually when he appears again is ranking higher, but today he disappear totally from rankings. Maybe return tomorrow? But anyway why is dancing? Thanks
Intermediate & Advanced SEO | | nyanainc0