Major Reliability Upgrade / Scheduled Downtime
I’m pleased to report that we have completed the roll-out of upgrades to our master database server and front-end servers that will allow fully distributed operation. The primary effect of this is that in the event of a critical failure of our master database, your hosted sites will continue to operate.
Previously, the master database was one of three single points of failure in the hosting of any particular site. (The two remaining single points of failure are the NAS headend and the server hosting a site’s MySQL process, if applicable.) The master database was, however, the only single point of failure able to wipe out everybody at once, so it was the natural place to start. We’ll continue working to eliminate the other two. (We have a plan for member MySQL, but file servers with truly no single point of failure are particularly hard to come by.)
This database upgrade carries a number of other benefits as well. First, it has removed a lot of load from the master database, allowing us to continue to scale up the service without sacrificing performance. Second, it will let us deploy network front ends in other cities, improving network performance and reliability further. Third, it will let us offer a couple of long-awaited features that will be discussed in more detail once they are fully tested and the supporting UI has been developed.
However, there’s only one way to make sure that this upgrade will function as promised if the master database server goes down, and that’s taking the master database server down.
For that reason, we are scheduling a series of downtime windows to allow us to do exactly that. For the coming week, Monday through Friday, we are scheduling maintenance windows between 10pm and 11pm Arizona time (5:00am to 6:00am UTC the following day). At some point during each window, we may (not “we will”) shut down our master database for five minutes to make sure everything works as expected, and that it comes back up properly afterward. Thanks to this upgrade, these five-minute shutdowns should affect only access to ssh, FTP, and our member interface. However, since not everything always goes according to plan, it’s possible that brief site disruptions may occur for the first couple of nights. Even so, we will limit each night to a maximum of one such shutdown.
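For anyone who wants to work out the window in their own timezone, here is a minimal sketch of the conversion in Python. It assumes a fixed UTC-7 offset for Arizona time (Arizona does not observe daylight saving time). The function name `window_in_utc` and the example date are illustrative assumptions, not part of our systems or an exact schedule.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: Arizona time is treated as a fixed UTC-7 offset (no DST).
ARIZONA = timezone(timedelta(hours=-7), "MST")


def window_in_utc(local_day, start_hour=22, length=timedelta(hours=1)):
    """Return (start, end) of the maintenance window in UTC.

    local_day is the Arizona calendar date on which the window opens.
    start_hour=22 corresponds to the 10pm start described above.
    """
    start_local = datetime(
        local_day.year, local_day.month, local_day.day, start_hour, tzinfo=ARIZONA
    )
    start_utc = start_local.astimezone(timezone.utc)
    return start_utc, start_utc + length


# Hypothetical example date, chosen only to illustrate the conversion.
start, end = window_in_utc(date(2008, 4, 7))
print(start.isoformat(), "to", end.isoformat())
```

Note that the UTC window falls on the calendar day after the Arizona date, which is easy to miss when reading the times at a glance.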
Thanks for your patience as we continue to upgrade our network to make it faster, better, and more reliable!
1 Comment
This is very welcome news. However, I have a couple of questions:
First, previous SPoF problems have been attributed to a “master file server”. Is that no longer an issue?
Second, the wording implies that you will still have a master DB server, but that this will no longer be a SPoF. Is that correct?
Each site has a file server, the headend of which is a SPOF, but there is no longer a “master file server” per se. The master database is technically still a SPOF, just not a service-impacting one. -jdw
Comment by Daran, April 6, 2008