Post-mortem report of Saturday’s file server failure

On Saturday, March 29 at about 4pm US Eastern time, we rebooted one of our file servers that hosts content for member sites. It experienced a critical hardware failure and did not come back online. It took about 28 hours to get things back into service. We’re going to talk briefly about why that happened, and what we’ll be doing differently in the future.

ZFS in one paragraph

This issue has a lot to do with ZFS, so I’ll talk very briefly about what that is and how we use it. ZFS is an advanced filesystem, originally developed by Sun Microsystems back before they got devoured by Oracle. When you upload files to our service, ZFS is what keeps track of them. It performs very, very well on hardware attainable without an IPO, and we’ve been using it for many years because we need stuff that performs very, very well to keep up with you guys. It also has features that we and you are fond of, like snapshots, so if something of yours gets accidentally deleted, we can (almost always) get it back for you. The downside to ZFS is that it is not clusterable. That means that no matter what we do, there will always be at least one single point of failure somewhere in the system. If we do any maintenance, or if it fails, an outage will result.
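
For the curious, here is roughly what snapshot-based recovery looks like in practice. This is a simplified sketch rather than our actual tooling; the dataset name, mountpoint, and file paths are invented for illustration, and it assumes the standard ZFS command-line tools are available.

    #!/usr/bin/env python3
    """Illustration only: recovering a deleted file from a ZFS snapshot."""
    import shutil
    import subprocess
    from pathlib import Path

    DATASET = "tank/members"       # hypothetical dataset name
    MOUNTPOINT = Path("/members")  # hypothetical mountpoint for that dataset

    def take_snapshot(name):
        """Create a read-only, point-in-time snapshot of the dataset."""
        subprocess.run(["zfs", "snapshot", f"{DATASET}@{name}"], check=True)

    def restore_file(snapshot, relative_path):
        """Copy a file back out of a snapshot into the live filesystem.

        Snapshots are browsable read-only under <mountpoint>/.zfs/snapshot/,
        so recovery is just a copy.
        """
        src = MOUNTPOINT / ".zfs" / "snapshot" / snapshot / relative_path
        dst = MOUNTPOINT / relative_path
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)

    if __name__ == "__main__":
        take_snapshot("hourly-2014-03-29-1500")
        # ...later, a member accidentally deletes a file...
        restore_file("hourly-2014-03-29-1500", "example-site/index.html")

Snapshots are cheap because ZFS is copy-on-write; taking one doesn’t duplicate any data, which is why we can afford to keep them around.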

What happened

Prior to Saturday’s issue, that file server (f5) had twice in the past two weeks slowed down enough to cause problems. We’ve seen a very similar problem with ZFS-based file servers in the past: when they accumulate a lot of uptime, they start to slow down until rebooted. Because rebooting involves downtime, member file servers don’t get rebooted very often, not unless they are having a problem. This one was having a problem we believed would be resolved by rebooting, so we rebooted it. However, at that point, it suffered a hardware failure. Although there’s no direct evidence of a connection, it’s hard to believe that’s a coincidence.

We did have two backup servers available to address this situation, one of which was intended for that purpose. It is based on new technology that we will discuss in more detail later, but what we discovered when we attempted to restore to it is that it misreports its available space. It said it had three and a half times more space than we needed, but it really only had a few hundred gigabytes; nowhere near enough. (Fortunately we now understand why it reports what it does and how to determine what’s really available.) The second option had the space, but was always intended only to be a standby copy to guard against data loss, not to serve as production storage. We determined pretty quickly that it could not sustain the activity necessary.
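
To give a rough sense of how that kind of misreporting can happen on clustered, replicated storage, here’s an illustrative bit of arithmetic. The numbers, replica count, and reserve fractions below are all invented and don’t reflect our real configuration; the point is just that the quick-and-easy free-space figure ignores replicas, snapshot space, and slack space.

    """Illustration only: why a naive free-space figure on clustered storage misleads."""

    def usable_bytes(raw_free, replica_count, snapshot_reserve, slack):
        """Turn the naive raw free-space figure into something closer to usable.

        raw_free         -- sum of free bytes reported across all cluster nodes
        replica_count    -- each block is stored this many times for redundancy
        snapshot_reserve -- fraction of capacity held back for snapshots
        slack            -- fraction of capacity held back as working room
        """
        after_replication = raw_free // replica_count
        return int(after_replication * (1 - snapshot_reserve - slack))

    if __name__ == "__main__":
        TiB = 1024 ** 4
        raw = 12 * TiB  # what a naive query across the nodes reports as "free"
        usable = usable_bytes(raw, replica_count=3,
                              snapshot_reserve=0.10, slack=0.15)
        print(f"reported free space: {raw / TiB:.2f} TiB")
        print(f"closer to usable:    {usable / TiB:.2f} TiB")

In our case the gap was much larger than in this toy example, but the shape of the problem is the same: the number that’s easy to query isn’t the number that matters.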

As a result, we were forced to focus on either fixing the existing server or obtaining a suitable replacement. Unfortunately, Saturday evening is not a good time to be looking for high-performance server components. We do have a service for that, and they eventually came through for us, but it did take until Sunday afternoon to obtain and install the replacement parts. Once that was resolved, we were able to get it back online relatively quickly and get everyone back in service.

What will happen next

As mentioned above, the big problem with ZFS is that it cannot be configured without a single point of failure. This basically makes it the core around which the rest of the service revolves. We’ve always done everything possible to get as close to that as we can; the server that failed has multiple controllers, mirrored drives, and redundant power supplies. Pretty much everything but the motherboard was redundant or had a backup. And, of course, the motherboard is the component that failed.

That’s not a small problem. Nor is it a new one. Single points of failure are bad, and we’ve been struggling for a long time to get rid of this one. We’ve tried a lot of different things, some of them pretty exotic. But what we have found over the past several years is that there’s really no magic bullet. The list of storage options that genuinely have no single point of failure is pretty short. (There are several more that claim to, but don’t live up to it when we test them.) We have consistently found that the alternatives are some combination of:

– terrible performance (doesn’t matter how reliably it doesn’t get the job done)
– lack of POSIX compatibility (great for static files, but forget running WordPress on it; see the sketch just after this list)
– track record of instability or data loss (We’re not trusting your files to SuperMagiCluster v0.0.0.1-alpha. Or btrfs.)
– long rebuild (down)times after a crash or failure
– (for commercial hardware solutions) a price tag so high that it is simply incompatible with our business model
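
To make the POSIX point above concrete: dynamic software like WordPress quietly relies on filesystem behaviors that object-style stores generally can’t offer, such as atomically replacing a file and locking it against concurrent writers. Here’s a quick sketch; the filename is made up, and it assumes an ordinary POSIX filesystem.

    """Illustration only: two POSIX behaviors dynamic web apps quietly rely on."""
    import fcntl
    import os
    import tempfile

    CONFIG = "settings.php"  # hypothetical file a web app rewrites in place

    # 1. Atomic replace: write a temp file, then rename() it over the original.
    #    Readers see either the old contents or the new, never a half-written mix.
    fd, tmp_path = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as tmp:
        tmp.write("<?php /* updated configuration */\n")
    os.rename(tmp_path, CONFIG)  # atomic within a single POSIX filesystem

    # 2. Advisory locking: keep two processes from rewriting the file at once.
    with open(CONFIG, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until we hold the exclusive lock
        # ...read, modify, and write back safely...
        fcntl.flock(f, fcntl.LOCK_UN)

Storage that only speaks whole-file GET and PUT can serve static content just fine, but software written against these semantics falls over on it.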

The end result is that for the past few years, we have backed ourselves into something of a ZFS-addicted corner. However, what makes Saturday’s failure particularly frustrating is that we actually solved this problem. We’ve been rolling that solution out over the past couple of months. What’s left to be moved at this point is member site content and member MySQL data. The hardware to do that is already on order; it may even arrive this week. Once it does, there will be a week or two of setup and testing, and then we will start moving content. That will involve a brief downtime for each site and MySQL process while it’s moved, and may require a few sites with hardcoded paths to make some updates. We’ll post more about that when we are ready to proceed.

The new fileserver setup has no single points of failure, is scalable, serviceable, and expandable without downtime, preserves our ability to make snapshots, and performs like we need it to. And (crucially) although it is still cripplingly expensive, we could afford it. This is an area where we’ve been working very hard for a very long time, and it simply wasn’t possible to get all the requirements in one solution until recently.

To be perfectly clear, this doesn’t mean our service will never have any more problems. No one can promise that. File server problems were already incredibly rare, but since our service design makes them so catastrophic for so many people (at many hosts, such failures are a lot more common, but don’t affect nearly as many sites at once), we have to do as much as we can to make them rarer still.

There are also plenty of other things besides file servers that can go wrong at a web host, and we continue to work on improving our service in all those areas. We’ll have more to say on that subject as the year progresses, but really, there’s no such thing as “good enough” for us, so that work will never end.

For now, we’re very sorry this happened. As we said during the downtime, there is nothing we hate more than letting you guys down, and we did that here. It’s no more acceptable to us than to anyone else for something like this to happen. What we can tell you is that before this happened we were executing a plan that, if it had been completed, would have prevented this. Completing that plan as quickly as possible is our next step.

Thanks for your time and your support. Problems like this are physically sickening, and seeing that so many of our members were so supportive really helped carry us through.

29 Comments


  1. jdw, I love info-packed feedback like this. Sorry all that happened, but still, you all are doing a great job, and the infrastructure is getting better. Keep up the good work, and we’re looking forward to the migration! Yours, rkh, Happy NFSN Customer.

    Comment by rkh — April 3, 2014 #

  2. You guys are absolutely amazing. I thought for sure when I read your Twitter feed (and my site on f5 went down) that this would not be resolved for a while, but you were up and running within very close to 24 hours. Kudos and thanks for the great service! Akhan.

    Comment by Akhan Almagambetov — April 3, 2014 #

  3. “It said it had three and a half times more space than we needed, but it really only had a few hundred gigabytes; nowhere near enough. (Fortunately we now understand why it reports what it does and how to determine what’s really available.)”

    Care to elaborate on this?

    Comment by Viljo Viitanen — April 3, 2014 #

  4. Available space is always a bit weird when you start clustering multiple disks across multiple servers, and taking into account replicas, snapshots, and slack space, etc, etc. It seems like the easy-to-obtain calculation is just a bit naive and doesn’t go out and query everything and do all the necessary math. Part of it is also a result of our initial deployment being relatively small. As we scale up, it seems like it will get closer over time.

    -jdw

    Comment by jdw — April 3, 2014 #

  5. Which solution did you end up choosing for your fileserver needs?

    Because we are in an incredibly competitive industry with companies that spend more on advertising than we make in a year, we tend not to discuss how we do things in very much detail, particularly when it comes to solving complex problems. Sorry, I know that’s a frustrating non-answer. -jdw

    Comment by Mads Sülau Jørgensen — April 3, 2014 #

  6. At least you guys are upfront and honest about the situation… I’d rather have that than some shiny expensive thing that can still break anyway. Hope your transition to the new system goes well.

    Comment by Peter B — April 3, 2014 #

  7. When I worked in a computer repair shop, I would sometimes go to a client location. I noticed that machines which stay on all of the time will sometimes have something go bad that doesn’t cause a failure until they’re powered down, and then they fail to boot. It’s fairly common.

    That’s rare with high-quality rack servers, but it does happen. In this case, though, it was just a warm reboot. It died without being powered down or even hard reset. So that’s a little more mysterious. But hardware failures often are. -jdw

    Comment by Klaus Donnert — April 3, 2014 #

  8. I’m with commenter “rkh” — I really appreciate the detailed and candid transparency both during the outage and in the post-mortem. (And in the blog generally, and in the FAQ. Your straightforwardness and absence of BS and whitewashing has always been really refreshing.) I’m even happier to be a NFSN customer now than I was before this incident.

    Comment by Tom McNeely — April 3, 2014 #

  9. I work in a similar industry and understand some of what you guys deal with. Thank you for a clear, reasonable explanation of a complex issue.

    Downtime stinks, but nothing is worse than the blame shifting we get from many other vendors. You guys rock for dealing with this like you did. Thank you for your hard work, expertise and for having a good backup of our data! I honestly don’t know how you guys provide such an amazing service for the price you charge.

    I’ve been a happy member since 2007, even happier after reading this post. Thank you!

    Comment by Geoffrey Phillips — April 3, 2014 #

  10. For several years y’all have given reliable service at an amazing price. The rare times problems arise, you’ve been right out front with updates on the situation. A dead server on a weekend, I’d not wish that on my worst enemy. Let me just add a “me too”. Well done, and thank you.

    Comment by Mark Pritchard — April 3, 2014 #

  11. This is why I’m a NFSN customer. Great service, detailed postmortems. I’m super-curious about this new storage solution, but I understand that you might not want to share information about that until either it’s live and proven to work, or never, since it could be considered a tactical advantage.

    Comment by Nick May — April 3, 2014 #

  12. I can feel for your heroic 24-hr to 36-hr recovery marathons – I’ve been through a few myself. Thanks as always for your hard work and commitment to providing the best service for the most reasonable price. You deserve all the loyalty displayed in this thread.

    Comment by Gordo — April 5, 2014 #

  13. Having, many years ago, seen more than my fair share of post-mortems like this from certain other competitors, NFSN’s uptime is still better than ANY other shared host I’ve used.

    Stuff happens. Thanks for the details. 🙂

    Comment by G M — April 6, 2014 #

  14. Thank you for being so transparent. I have been a loyal customer for years and intend to remain one.

    Comment by Matt — April 6, 2014 #

  15. I work as a field tech for a company that services all major UPS systems. Redundancy is pretty much the only way to mitigate a single point of failure. I believe I read that that is not, or was not, possible with your config.

    MTBF on your new hardware: is it an improvement over the prior (down) hardware? Also, can it run in parallel with an identical unit?

    thanks

    From the post: “The new fileserver setup has no single points of failure, is scalable, serviceable, and expandable without downtime, preserves our ability to make snapshots, and performs like we need it to.” -jdw

    Comment by parallelogram — April 6, 2014 #

  16. Thank you for taking the time to let us know all the nitty-gritty details and actually explaining it to your customers like it is. I’ll continue to stick with NFS as much as I possibly can just because of this. It’s extremely rare that a company opens up about its faults, elaborates on the situation, and addresses the plan of correction. Very commendable! Thanks again!

    Comment by Kyle — April 8, 2014 #

  17. Just out of curiosity, were you using OpenSolaris or OpenIndiana, or were you using something else with ZFS added? I’ve got a file server running EON (nv_130 running in memory) and I’ve never had a slow down problem. Then again I don’t push my file server as hard as you guys would so probably a silly assumption… just curious is all.

    We’ve tried ZFS on virtually everything it’s supported at one point or another, although not EON. The biggest slowdown issues we’ve seen were with some now-older releases of FreeBSD and large numbers of snapshots. I am pretty sure that’s resolved with the latest releases. -jdw

    Comment by Aaron Mason — April 9, 2014 #

  18. Thanks for the detailed post mortem. I know you’re more frustrated than I am when things go down and far better than I would be at getting them back up, which is good enough for me.

    Keep up the good work!

    Comment by Landon Winkler — April 10, 2014 #

  19. Now that I think about it, I’ve had systems fail during a soft reboot in the past. Nothing enterprise though, mostly laptops. It does happen, though the odds against it happening are rather astronomical.

    Comment by Aaron Mason — April 11, 2014 #

  20. Your statement that it was occurring on old versions of FreeBSD is comforting.

    I’m planning an upgrade to my file server involving a Supermicro Avoton-based motherboard, 3TB drives for z2, 1TB drives for hardware RAID10 on a Dell PERC5/i card, and FreeNAS (because EON doesn’t support the PERC, which is needed for iSCSI exports to two Sun V20Z servers in a Hyper-V cluster), so what you said, knowing how extensively you guys use FreeBSD, was concerning.

    Comment by Aaron Mason — April 11, 2014 #

  21. Thanks for the feedback. This outage happened at a really bad time for my projects here; I had to switch servers at the last minute before a live streaming event! It was a bit of hell for me, but a 15-minute fix got me back up again. Compared to your marathon, it was much less hassle.

    You guys evaluated GlusterFS for clustering file servers?

    Yes. -jdw

    Comment by Olivier — April 11, 2014 #

  22. There is only one thing rarer than a no-fail server…

    Honesty.

    🙂

    Thanks lots.

    Comment by Phil — April 12, 2014 #

  23. I don’t drink the flavor aid that suggests that failures can somehow be avoided entirely. As I’m sure jdw is fully aware, failures can happen, and how you plan for them is often the difference between being dead and being inconvenienced. One builds to mitigate the effects of failure, not to avoid it entirely, the latter often resulting in a horrendously complicated solution that fails spectacularly in a fashion you didn’t see coming.

    Comment by Aaron Mason — April 12, 2014 #

  24. Add my voice to the “Well done!” crowd. You guys did an amazing job getting the problem fixed…and an amazing job reporting on it. NFS is truly one of a kind.

    Comment by lostnbronx — April 21, 2014 #

  25. Thank you so much for all of your service. I also really appreciate how open and honest and straightforward you are when things go wrong. I have had nothing but good experiences with NFS, and I regularly recommend it to my students when they are asking for good web hosts. Your integrity is why I keep using this service.

    Comment by Ingbert — April 30, 2014 #

  26. You guys are terrific. I will never leave NFS while you are around.

    Comment by Mark Preston — May 7, 2014 #

  27. I love that when something /does/ go wrong at NFSN, we get a page’s worth of explanation from you guys, rather than the “it looks fine from our end, try it again in… uh… three, maybe three and a half hours” explanation I’ve had from other companies.

    Comment by Jon — May 10, 2014 #

  28. Like many other posters, I’d like to thank you for the explanation. Other companies would ignore the problem, deny it, blame somebody else, and MAYBE eventually say “sorry” and leave it at that. Your blog during downtime, and this explanation afterwards, are very refreshing.

    You’re good to me, and I’m good to you. I’ve asked for your help twice (once as a prospective customer and once while setting up DNS) and you’ve been polite, very quick, and absurdly helpful. Complaining about this downtime would be like complaining that my mother didn’t bring me breakfast in bed when she had the flu!

    Keep up the good work, and thanks for the years of service!

    Comment by Chris — May 13, 2014 #

  29. It’s clear from your account that you did, are doing, and will continue to do everything in your power to bring us the very best of the very most affordable possible web hosting. Thanks for all your hard work, dedication, and especially your savvy and independent-mindedness. I most appreciate your pay-per-use option for static sites, and am exceedingly grateful you offer it.

    Comment by philodygmn — May 15, 2014 #

