On Saturday, March 29 at about 4pm US Eastern time, we rebooted one of our file servers that hosts content for member sites. It experienced a critical hardware failure and did not come back online. It took about 28 hours to get things back into service. We’re going to talk briefly about why that happened, and what we’ll be doing differently in the future.
ZFS in one paragraph
This issue has a lot to do with ZFS, so I’ll talk very briefly about what that is and how we use it. ZFS is an advanced filesystem, originally developed by Sun Microsystems back before they got devoured by Oracle. When you upload files to our service, ZFS is what keeps track of them. It performs very, very well on hardware attainable without an IPO, and we’ve been using it for many years because we need stuff that performs very, very well to keep up with you guys. It also has features that we and you are fond of, like snapshots, so if something of yours gets accidentally deleted, we can (almost always) get it back for you. The downside to ZFS is that it is not cluster-able. That means that no matter what we do, there will always be at least one single point of failure somewhere in the system. If we do any maintenance, or if the server fails, an outage will result.
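(For the curious, here’s a minimal sketch of what that snapshot workflow looks like, driving the standard zfs command-line tools from Python. The tank/members dataset name is hypothetical; this illustrates how ZFS snapshots work in general, not our actual tooling.)

```python
import subprocess
from datetime import datetime, timezone

DATASET = "tank/members"  # hypothetical pool/filesystem name, for illustration only

def take_snapshot(dataset: str = DATASET) -> str:
    """Create a read-only, point-in-time snapshot of a ZFS filesystem."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%MZ")
    name = f"{dataset}@{stamp}"
    subprocess.run(["zfs", "snapshot", name], check=True)
    return name

def list_snapshots(dataset: str = DATASET) -> list[str]:
    """List the existing snapshots of a dataset."""
    result = subprocess.run(
        ["zfs", "list", "-t", "snapshot", "-H", "-o", "name", "-r", dataset],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()

# Recovering an accidentally deleted file doesn't require rolling anything
# back: each snapshot is exposed read-only under the filesystem's hidden
# .zfs/snapshot/<snapshot-name>/ directory, so the old copy of the file
# can simply be copied back out of there.
```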
Prior to Saturday’s issue, that file server (f5) had twice in the past two weeks developed problems that caused slow performance. We’ve seen a very similar pattern with ZFS-based file servers in the past: when they accumulate a lot of uptime, they start to slow down until rebooted. Because rebooting involves downtime, member file servers don’t get rebooted very often, and not unless they are having a problem. This one was having a problem we believed a reboot would resolve, so we rebooted it. At that point, however, it suffered a hardware failure. Although there’s no direct evidence of a connection, it’s hard to believe that’s a coincidence.
We did have two backup servers available to address this situation, one of which was intended for exactly that purpose. It is based on new technology that we will discuss in more detail later, but what we discovered when we attempted to restore to it is that it misreports its available space. It claimed to have three and a half times more space than we needed, but it really only had a few hundred gigabytes: nowhere near enough. (Fortunately, we now understand why it reports what it does and how to determine what’s really available.) The second option had the space, but was always intended only to be a standby copy to guard against data loss, not production storage. We determined pretty quickly that it could not sustain the necessary level of activity.
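(We won’t bore you with the internals of that new system here, but the general pitfall is worth illustrating: a “free space” number only means something if you know who it’s free for. On POSIX systems, for example, statvfs() reports both the blocks free overall and the blocks an ordinary process can actually allocate, and naive tooling often shows the larger figure. A quick Python check, with a hypothetical mount point:)

```python
import os

def space_report(path: str) -> None:
    """Compare the naive free-space figure with what's actually usable."""
    st = os.statvfs(path)
    advertised = st.f_bfree * st.f_frsize  # free blocks overall
    usable = st.f_bavail * st.f_frsize     # free blocks an unprivileged
                                           # process may actually allocate
    gib = 2 ** 30
    print(f"{path}: advertised {advertised / gib:.1f} GiB, "
          f"actually usable {usable / gib:.1f} GiB")

space_report("/storage")  # hypothetical mount point
```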
As a result, we were forced to focus on either fixing the existing server or obtaining a suitable replacement. Unfortunately, Saturday evening is not a good time to be looking for high-performance server components. We do have a service for that, and they eventually came through for us, but it did take until Sunday afternoon to obtain and install the replacement parts. Once that was resolved, we were able to get it back online relatively quickly and get everyone back in service.
What will happen next
As mentioned above, the big problem with ZFS is that it cannot be configured with no single point of failure. This basically makes it the core around which the rest of the service revolves. We’ve always done everything possible to get as close as we can; the server that failed has multiple controllers, mirrored drives, and redundant power supplies. Pretty much everything but the motherboard was redundant or had a backup. And, of course, the motherboard is the component that failed.
That’s not a small problem. Nor is it a new one. Single points of failure are bad, and we’ve been struggling for a long time to get rid of this one. We’ve tried a lot of different things, some of them pretty exotic. But what we have found over the past several years is that there’s really no magic bullet. The list of storage options that genuinely have no single point of failure is pretty short. (There are several more that claim to, but they don’t live up to it when we test them.) We have consistently found that the alternatives are some combination of:
– terrible performance (doesn’t matter how reliably it doesn’t get the job done)
– lack of POSIX compatibility (great for static files, but forget running WordPress on it)
– track record of instability or data loss (We’re not trusting your files to SuperMagiCluster v0.0.0.1-alpha. Or btrfs.)
– long rebuild (down)times after a crash or failure
– (for commercial hardware solutions) a price tag so high that it is simply incompatible with our business model
The end result is that for the past few years, we have backed ourselves into something of a ZFS-addicted corner. However, what makes Saturday’s failure particularly frustrating is that we actually solved this problem. We’ve been rolling that solution out over the past couple of months. What’s left to be moved at this point is member site content and member MySQL data. The hardware to do that is already on order; it may even arrive this week. Once it does, there will be a week or two of setup and testing, and then we will start moving content. That will involve a brief downtime for each site and MySQL process while it’s moved, and may require a few sites with hardcoded paths to make some updates. We’ll post more about that when we are ready to proceed.
The new fileserver setup has no single point of failure; it is scalable, serviceable, and expandable without downtime, preserves our ability to make snapshots, and performs like we need it to. And (crucially) although it is still cripplingly expensive, we could afford it. This is an area where we’ve been working very hard for a very long time, and it simply wasn’t possible to get all the requirements in one solution until recently.
To be perfectly clear, this doesn’t mean our service will never have any more problems. No one can promise that. File server problems were already incredibly rare, but since our service design makes them so catastrophic for so many people (at many hosts, such failures are a lot more common, but don’t affect nearly as many sites at once), we have to do as much as we can to make them rarer still.
There are also plenty of other things besides file servers that can go wrong at a web host, and we continue to work on improving our service in all those areas. We’ll have more to say on that subject as the year progresses, but really, there’s no such thing as “good enough” for us, so that work will never end.
For now, we’re very sorry this happened. As we said during the downtime, there is nothing we hate more than letting you guys down, and we did that here. It’s no more acceptable to us than to anyone else for something like this to happen. What we can tell you is that before this happened we were executing a plan that, if it had been completed, would have prevented this. Completing that plan as quickly as possible is our next step.
Thanks for your time and your support. Problems like this are physically sickening, and seeing that so many of our members were so supportive really helped carry us through.