Planned downtime for Monday, September 10

On Monday, September 10, 2007 at around noon Arizona time (3pm Eastern, 7pm UTC), we will temporarily shut down our entire network to complete a migration of our equipment to a new datacenter. We anticipate that it will take four to eight hours to complete the move. All our services will be offline during that time.
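
For anyone converting the window into another time zone, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the fixed UTC offsets are assumptions valid for September 10, 2007 (Arizona at UTC-7 with no daylight saving, Eastern daylight time at UTC-4), and the helper names `start_in` and `window_in` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets assumed to be in effect on September 10, 2007.
ARIZONA = timezone(timedelta(hours=-7), "Arizona")  # no daylight saving
EASTERN = timezone(timedelta(hours=-4), "Eastern")  # daylight time in September
UTC = timezone.utc

# Planned start: around noon Arizona time.
START = datetime(2007, 9, 10, 12, 0, tzinfo=ARIZONA)


def start_in(zone):
    """Return the planned start expressed in the given zone."""
    return START.astimezone(zone)


def window_in(zone, min_hours=4, max_hours=8):
    """Return (start, earliest_end, latest_end) for the four-to-eight-hour estimate."""
    start = start_in(zone)
    return (start,
            start + timedelta(hours=min_hours),
            start + timedelta(hours=max_hours))


for zone in (ARIZONA, EASTERN, UTC):
    start, earliest, latest = window_in(zone)
    print(zone.tzname(None), start.strftime("%H:%M"),
          earliest.strftime("%H:%M"), latest.strftime("%H:%M"))
```

To check another location, pass `timezone(timedelta(hours=...))` with that location's offset from UTC.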

Q. Why are you moving your datacenter?

We have been collocated with Limelight Networks since our inception in 2002. While Limelight is an excellent network, and their newest datacenters are works of art, we have run into a number of issues that have conspired to push us in a different direction:

1) Limelight recently completed a successful IPO based on their highly successful and incredibly awesome CDN service, and as such they are now almost exclusively focused on that service. In fact, they no longer offer collocation as a product for companies like us. We do not want to find ourselves in a position where we cannot expand because the facilities are no longer available.

2) At Limelight, they run the datacenter and they are our only Internet carrier. While that’s very convenient for us, it’s putting all our eggs in one basket, which is not good for redundancy. We need (and have obtained) a carrier-neutral facility that will enable us to connect directly to as many carriers as we like, improving our overall reliability and also empowering us to do things like cherry-pick carriers that will improve overseas connectivity.

3) With multiple carriers, we will be able to request our own IP assignments directly from ARIN, which will remove our exposure to certain types of problems caused by routing issues on other networks.

4) We will be behind dedicated hot-failover routers and firewalls under our control, which will improve our reliability and enable us to react much more quickly to DDOS attacks. Proactive filtering will eliminate an entire class of such attacks, and we will mitigate the rest much more quickly because we will be able to do it ourselves without depending on any third party.

5) We will be retiring some legacy equipment that is constraining our network performance but is too critical and too deeply wired in to address any other way.

6) We will be deploying an infrastructure capable of scaling to at least 10Gbps.

As you can see, this move represents us stepping up and taking control over a lot of areas where we have previously been dependent on others.

Q. Why are you moving your datacenter now?

In July and August, we experienced an unacceptable number of reliability issues. While the causes of these issues were almost exclusively factors beyond our control rather than “our fault” (DDOS attacks, routing problems at relatively distant carriers, etc.), at some point it becomes our fault for not taking control of those factors, which is what this move does for us.

Q. Why will the downtime be so long?

While we can’t say for certain how long it will take, here are some of the reasons we estimate it will take so long:

1) It’s a lot of equipment.

2) We are actually consolidating two datacenters into one.

3) Virtually all of our production infrastructure equipment (routers & switches) will have to be completely reconfigured on the fly.

4) We have to make two trips to ensure that we don’t have every copy of our members’ data in the same truck at the same time.

5) It’s much better to set a realistic, conservative timeline than try to rush things and wind up making a critical mistake.

Q. Why didn’t you find a way to do this without downtime?

Anticipating the need to expand, we spent several months trying to plan a way to do a move without affecting service. In fact, we have already moved roughly a third of our equipment with relatively little impact. (Example: our services are supposed to all be N+2 redundant, but right now all the +2’s are at the new building, leaving us not a lot of margin for error.)
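
To make the N+2 margin concrete, here is a small sketch of the capacity arithmetic. The server counts are hypothetical, not our real numbers, and the `survives` and `spare_margin` helpers are illustrative names. The only thing taken from the text is that a service needing N servers normally runs on N+2.

```python
def survives(required, deployed, failed):
    """True if the servers still running can carry the full load."""
    return deployed - failed >= required


def spare_margin(required, deployed, relocated):
    """How many further failures are tolerated once `relocated` servers have moved away."""
    return deployed - relocated - required


# Hypothetical service: needs 6 servers, normally deployed on 6 + 2 = 8.
REQUIRED = 6
DEPLOYED = REQUIRED + 2

print(survives(REQUIRED, DEPLOYED, failed=2))         # two failures are tolerated
print(survives(REQUIRED, DEPLOYED, failed=3))         # a third failure is not
print(spare_margin(REQUIRED, DEPLOYED, relocated=2))  # margin left at the old site
```

With both spares relocated, the margin at the old site drops to zero, which is the "not a lot of margin for error" situation described above.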

In the end, a complete zero-downtime move proved impossible. At some point, your site’s data is on a disk array, and that array can only be in one place at a time. The same for your MySQL process. We tried MySQL replication between facilities on our own master database, and we were disappointed with the results (it blew up at least once a day). Similarly, the amount of data associated with many sites is just too much and changes too fast to keep it synchronized between two locations. We have experimented with all sorts of VPN-type solutions, hoping to facilitate “sneaking” equipment from one location to another, and they just aren’t reliable enough to depend on for the length of time it would take to migrate that much data.

Still, such a move is theoretically possible with enough planning and preparation. However, given the issues we’ve seen in the past few months, and our current vulnerabilities to those same sorts of issues recurring, it becomes a risk assessment question: will the unknown, unplanned downtime caused by external forces during the time it takes us to devise a zero-downtime move be worse for our members than a single, known, scheduled downtime? We just can’t take the chance that the answer is yes.

Q. What are you doing to minimize the effects of the downtime?

We have been migrating services and equipment to the new facility as much as we can without causing disruption, and will continue to do so right up until the last minute. Everything we can move before is something we don’t have to move during.

Also, rather than leave your web site completely unresponsive, we are going to have a “maintenance page” served for all requests indicating to people that your site is not gone or shut down, merely being serviced for a few hours, and asking them to check back later.
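
One common way to serve a page like this is to answer every request with HTTP status 503 (Service Unavailable) plus a `Retry-After` header, so that browsers and crawlers treat the outage as temporary. The sketch below is an assumption about how that could be done, not a description of our actual setup. The `maintenance_app` name, the page wording, and the eight-hour retry hint are all illustrative.

```python
# Illustrative retry hint: the upper end of the four-to-eight-hour estimate.
RETRY_AFTER_SECONDS = 8 * 60 * 60

PAGE = (
    b"<html><body><h1>Site under maintenance</h1>"
    b"<p>This site is not gone or shut down. It is being serviced for a few "
    b"hours. Please check back later.</p></body></html>"
)


def maintenance_app(environ, start_response):
    """WSGI application that answers every request with the maintenance page."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
        ("Retry-After", str(RETRY_AFTER_SECONDS)),
    ]
    # 503 tells clients the condition is temporary, unlike 404 or 410.
    start_response("503 Service Unavailable", headers)
    return [PAGE]
```

To try it locally, the callable can be handed to any WSGI server, for example `wsgiref.simple_server.make_server("127.0.0.1", 8000, maintenance_app)` from the Python standard library.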

Q. Will you be changing your IP addresses during the move?

Yes. This move will entail a complete IP renumbering on our part. Actually, after it is done, we will have to renumber again a month or so later to satisfy some obscure bureaucratic requirements.

We will be leaving a handful of equipment at the old location to preserve the old IP addresses and catch people with hardcoded third-party DNS or domain registrations. Once the second renumber is complete, we will start contacting people using the old addresses to let them know what changes they need to make.

For those people who have their domain registrations and DNS with us, all your IP information will be updated automatically both times with no action required on your part.
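
For anyone who manages DNS elsewhere, a check along the following lines could flag records that still point at the old addresses after a renumber. It is a sketch only: the reserved documentation block `192.0.2.0/24` stands in for the old network (the real ranges are not given here), and the `needs_update` helper and the sample hostnames are hypothetical.

```python
import ipaddress

# Placeholder for the old network; NOT the real address range.
OLD_NETWORK = ipaddress.ip_network("192.0.2.0/24")


def needs_update(records):
    """Return the hostnames whose address still falls inside the old network."""
    return sorted(
        name
        for name, address in records.items()
        if ipaddress.ip_address(address) in OLD_NETWORK
    )


# Hypothetical hardcoded records, as might be found at a third-party DNS provider.
records = {
    "www.example.com": "192.0.2.10",      # still on the old range
    "mail.example.com": "198.51.100.25",  # hosted elsewhere, unaffected
}

print(needs_update(records))
```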

Q. How can I stay up-to-date on the downtime?

As soon as the downtime arrives, we will begin posting updates on our offsite status page with our trademark transparency. We will keep it updated throughout the move, particularly if anything unforeseen happens.

Q. How are you making sure everyone is aware of this?

We know not everyone follows our blog, and that the third-party blog-to-email service we were using recently imploded. For these reasons, and because of the scope of the move, we will be taking the unusual step of sending out a mass-email to all of our members making them aware of the planned downtime.

Q. Are you sorry you have to do this?

Yes, we are really, really sorry we have to do this, and we hope you will agree that the long-term benefits overwhelm the short-term pain of this downtime.



  1. Sounds like a blast, godspeed!

    Comment by Anthony — September 7, 2007 #

  2. It’s unfortunate but I understand the necessity. Thanks for the full disclosure.

    God speed my little disk arrays.

    Comment by Guy — September 7, 2007 #

  3. Oh one more thing- I advise *not* stopping for beers on the way 😉

    Well, that’s one item I can cross off the to-do list! -jdw

    Comment by Guy — September 7, 2007 #

  4. A wise solution. A sensible downtime for a known, static implementation plan. I like it.

    I applaud your bravery in actually putting your trust in logic and common sense; I like how you guys operate!

    Comment by Determination — September 7, 2007 #

  5. Thanks for the good explanations. By the way, I’m currently unable to connect to the status page over https… regular http works fine.

    Comment by Ken Dreyer — September 7, 2007 #

  6. If anyone’s confused by the status page link not working, it should be this: offsite status page (i.e. HTTP, not HTTPS).

    D’oh. It’s fixed. Thanks for pointing it out! We’ll see about getting a secure certificate for the status site. -jdw

    Comment by dsymonds — September 7, 2007 #

  7. I appreciate your efforts. A small downtime for more reliability is a small price to pay.

    Your great service has been appreciated.

    Now, a small request. Could we get some pictures of the moving process (including the beers on the way) posted on NFS?

    For many of us non-network wise folks, this would be an interesting insight to the NFS saga, and we just might learn something.

    Good luck and Godspeed

    Bob Arkow

    Honestly, if we had another pair of hands during the move, I think we’d have something better for them to do than hold a camera. Great thought though! -jdw

    Comment by Robert Arkow — September 7, 2007 #

  8. I’m sorry to hear that the sneaky way of doing things didn’t work out, but I’m far more excited to hear that you’re finally going to have everything in one (better) place, with lots of room for expansion and improvement!

    I wish you the best of luck. Oh, and do have a beer *after* everything’s done. (Have one on me — just deduct it from my main account. /me wonders if you’ll have to add a new “Beer for Sysadmin” category to the accounting system… 😉

    A good night’s sleep without any pagers going off will suit me just fine. -jdw

    Comment by Thomas Tuttle — September 7, 2007 #

  9. Oh, and once everything *is* settled in and working, can you post some pictures of the new datacenter so we can all go “oooh” and “aaah” and “hey… waitaminute — all your servers really *aren’t* the same color!”?

    I wish I could claim that the “no pictures” rule is due to the obsessive security at the new location (which is really, really obsessive — these people list mantraps on their flyers as a feature). But in fact, I think it’s because it’s the ugliest datacenter on Earth. It’s connected, it’s cold, and the power stays on, but it’s some powerful ugly in there. -jdw

    Comment by Thomas Tuttle — September 7, 2007 #

  10. Being the boss means never having to say you’re sorry. No apologies necessary, dudes.

    Comment by KC — September 7, 2007 #

  11. hi jeff & co

    you mention that downtime will begin at noon.

    i am in south africa (UTC+2), and you guys are in arizona (UTC-7).

    so does this mean i can announce downtime to my clients from 9pm until roughly 5am?

    thanks for the great service (i’m beginning to sound like a stuck record). good luck with the move.

    It’s late and I’ve lost the ability to do basic math, but yes, that sounds about right. -jdw

    Comment by kosta kontos — September 7, 2007 #

  12. Thanks for the information. So, just to be sure I understand, will DNS be down during this entire window?

    DNS will be the first service restored, because lots of other stuff won’t work without it. In fact, we’re going to do our best to keep some DNS servers online the whole time so visitors can find the “site under maintenance” pages. -jdw

    Comment by Bryan K. Walton — September 7, 2007 #

  13. Thanks for letting us know. Good luck! I hope it all goes well.

    Comment by Stephie — September 7, 2007 #

  14. Good Luck,

    I recently completed a data center move, we co-located our corporate servers. I still get tired when I remember the 36 hour weekend I put in. 🙁

    Thanks for being proactive, I will continue to recommend your service at every opportunity.

    Comment by pnutjam — September 7, 2007 #

  15. Countdown

    Comment by pnutjam — September 7, 2007 #

  16. What steps should I take? my sites are using CNAME ?

    I’m not sure I understand the question. Assuming you’re referring to renumbering, if we are hosting your DNS and/or domain registration, we will take care of everything. If not, you may have to take some steps manually, and at that time, we will contact you and explain what those steps are. -jdw

    Comment by Bruce — September 7, 2007 #

  17. If you will keep DNS up, does that mean that DNS entries regarding mail servers will remain active, and that my non-NFS hosted e-mail will continue to work?

    That is the goal, but we cannot guarantee it yet. -jdw

    Comment by curt — September 7, 2007 #

  18. Good luck with the move.

    Comment by Andrei — September 7, 2007 #

  19. Best wishes! The service has been great, and I appreciate your business model. Thanks for taking us to the next level!
    I recommend to everyone. -jk

    Comment by Justin Keogh — September 7, 2007 #

  20. I am surprised you claim +2 equipment is already at the new location. It might be, but if it were operational you would have been able to make that active during the time the rest of the equipment moves. That would have meant no downtime. Since we are going to have downtime, it means there is some flaw in the argument.

    That is not a correct assessment of the circumstances. +2 merely means we can have two servers fail and keep going, not that we can run our entire service from 2 servers. It likewise does not magically enable your data to be in more than one place at a time. -jdw

    Comment by Hardeep Singh — September 7, 2007 #

  21. […] […]

    Pingback by Bokashi Blog » My blog host announces planned downtime on 10 September 2007 — September 7, 2007 #

  22. Realistically, I’m expecting this move to take sixteen hours, because of a hard cold law of physics known as “Murphy’s Law”. 🙂

    I’ll echo the sentiment expressed in this thread that this post is very well thought out and constructed. I also entirely agree with the decision to K.I.S.S. the matter off. Why ask for trouble by making it too complicated?

    Good luck with what’s gonna be a huge, sweaty pain in the butt.

    Comment by Bumpy Light — September 7, 2007 #

  23. Awww, man. And just when I registered. *sigh* Well, at least it’ll be an extra day of coding for me.

    Hope everything goes well. And thanks for being transparent about the issues. Looks like I’ll be enjoying my stay here. 🙂

    Comment by Therese — September 7, 2007 #

  24. jdw & team, thank you for your fore-thought, honesty and full-on clarity. I wish we had some folks like you (or even just one!) at my place of work. Informative and transparent notices such as this affirm my decision to switch to NFS even though it is not hosted in my native country was a good one. I pray the transfer is uneventful and that you get a well deserved rest afterwards!

    pnutjam, thanks for the countdown timer.

    Comment by maphew — September 7, 2007 #

  25. Good luck with the move!

    Comment by Rick Umali — September 7, 2007 #

  26. Sorry to be a n00b, but I have only the vaguest of ideas what the answer to my question is so I’m just going to ask it. I use the NFS DNS services to direct emails to my domain name to an email service provider. I haven’t seen any mention of email services, so can you confirm whether any of my email MX records will continue to function and/or whether I should expect any e-mail downtime as a result of this move?

    Thanks for giving us the deets and best of luck with the move.

    If you use third-party email hosting with our DNS, your email will be affected, but only for as long as our DNS is down. We are working hard to make that no time at all, and I will post an update once we know for sure, hopefully later today. -jdw

    Comment by Karen — September 7, 2007 #

  27. Just as an update… I can now confirm that we are planning to keep DNS available during the planned downtime, though there may be a few minutes of disruption as responsibilities migrate between machines.

    Once the downtime begins, however, you will not be able to update your DNS because our interface will be down, so please make any needed changes with plenty of time to spare. -jdw

    Comment by jdw — September 9, 2007 #

  28. Thanks for the heads up good luck 🙂

    Comment by Neale — September 9, 2007 #

  29. […] Oder im O-Ton On Monday, September 10, 2007 at around noon Arizona time (3pm Eastern, 7pm UTC), we will temporarily shut down our entire network to complete a migration of our equipment to a new datacenter. We anticipate that it will take four to eight hours to complete the move. All our services will be offline during that time. […]

    Pingback by – Gedanken eines Jurastudenten » Blog Archive » In eigener Sache: Planned Downtime — September 9, 2007 #

  30. Good luck guys. Awesome communication – as usual.

    Comment by Vinaya HS — September 10, 2007 #

  31. I thought the date had a connection to the anniversary of the terrorist attack on September 11.

    I never felt you lacked in your features and services department. The 1 day planned downtime surely does not affect me much and I only wish you well with a complete and successful migration.

    Best regards. And awesome beer.

    Comment by Loloy D Anonymous — September 10, 2007 #

  32. I have email forwarding active (not a different MX record). Will my emails still be forwarded without delay, or is the forwarding system going to be down for a while too?

    I want to say, thanks for the transparency, and that I wish you the least of unpredictables. Don’t forget to breathe deeply and often, and drink plenty of water! ;-}

    Yes, email forwarding will be impaired for a while, but it will be one of the first services restored after the equipment is physically moved. Emails will queue during this time and be delivered to you once it’s back up. -jdw

    Comment by Ricardo Salta — September 10, 2007 #

  33. I know it’s harder and more expensive
    but aren’t Sunday evenings
    less disruptive ?

    We considered all possible times, but loading docks are only available during a subset of regular business hours, and Phoenix is not somewhere to leave stacks of servers on the street at night because your pallet jack can’t make it over the lip of the “no deliveries” entrance. -jdw

    Comment by louis pollock — September 10, 2007 #

  34. […] A lengthier explanation can be found here (worth a read). […]

    Pingback by Planned Downtime For Tuesday, September 11, 2007 at GEO 12.97°N 77.56°E — September 11, 2007 #

  35. […] There will be some downtime for the websites on the domain tomorrow, starting from 7 PM GMT until about 3 AM GMT (3 PM Eastern to 11 PM Eastern, and these numbers are correct if my in-head quickie addition works as it should), because NFSN is moving to a new data center to improve reliability of their service. This is great news, and I wish them the best in the migration—those things can be hellish. […]

    Pingback by Trausch’s Little Home » Blog Archive » Downtime — September 21, 2007 #

Powered by WordPress. Hosted by NearlyFreeSpeech.NET.