Facility move post-mortem analysis

It has been a little over two weeks since our facility move. I believe at some point we promised some post-mortem analysis of what went right and what went wrong, and here it is.

The outcome of the move is a success. All of our core hardware is in one place, it is all working, and we are in a better position to grow and expand than we have ever been.

Our network monitoring has been vastly improved, with more improvements on the way, which will allow us to be a lot more proactive about keeping your sites healthy in the future, rather than waiting for you to complain when there’s a problem.

Our network infrastructure has been greatly improved. We have far more bandwidth available on our internal network than ever before, so much that we are still tuning how best to use it all to serve your sites as quickly as possible. Our cluster technology has always had the ability to fail over your site from one web server to another if there’s a problem, but we now have similar (but faster) hot failover technology for our routers, firewalls, and edge proxies.

Our network defenses have also received a sharp boost. We have more and better firewall capability, better inbound packet scrubbing, better backup access to our equipment during attacks, and the ability to selectively blackhole attackers based on complex criteria before they ever reach NearlyFreeSpeech.NET. Not only will this pay off big time in the event of future DDoS attacks, but it will manifest as better protection against everyday nonsense like large spam runs.

You’ll be hearing more about a concrete example of the effect of some of these improvements in a near-future blog post.

While the outcome is successful, the implementation of the move was… not a great success. While nothing specific went horribly wrong, a number of little factors conspired to make the downtime last longer than expected.

First, we estimated the time required to disconnect, transport, and install our equipment very accurately. However, we (I) drastically underestimated the amount of time required to cable the equipment at the new location. We use a partial mesh cluster topology, so it is not a simple matter of plugging each server into the closest switch; servers have, on average, 4-6 connections each, and some of those connections required quite a bit of cable routing. As a result, cabling took several hours more than expected and became the single most time-consuming task.

Second, our “placeholder pages” that were intended to be up during the move did work sometimes, but they were dependent on equipment at the new location that wound up needing to be reconfigured once the equipment from the old location arrived. We did not spot the unresolved dependency that caused this during the planning stages.

Third, a couple of our new servers, which had been very thoroughly tested prior to the move, completely flipped out on us as soon as they were brought up in the live configuration. They were quickly stabilized but their timing absolutely sucked.

After the move was completed and things started working again, we encountered a number of additional problems, most notably errors from our network edge proxies: Connection Refused, Host not responding, Host is Down, and Unknown Site. Some of these were related to the new configuration, but several were related to previously unexercised (or undetected) bugs in our clustering software that have now been fixed.

Unfortunately, it is not really useful to say “We have learned (something) from this experience about what to do next time.” That is because, having attempted what we did, only a complete idiot would do it again.

Should we ever need to move facilities in the future, no matter how long it takes or how much it costs, we will just build out the new facility in its entirety, move all the services between the two live facilities, and then burn down the old one for the insurance money.*

We have also renumbered into new IP addresses. This is temporary, and we will renumber again once our network brings some additional carriers online, with the primary holdup being bureaucracy. (The only reason the Internet hasn’t run out of IP addresses already is that the regional Internet registries make the application process so very tedious and time-consuming that it discourages all but the most determined applicants, like us.)

At this point, although we’re still in the period where most people automatically assume most problems are move-related even though most aren’t, we do still have a few remaining issues we are looking into:

  • We have scattered reports that DNS performance may not be satisfactory. Although there’s nothing specific to back that up, we are developing a way to instrument our DNS performance to measure speed and reliability, rather than blindly making changes we think might help and hoping for the best. In the meantime, if you have any problems like this, please contact us, especially if you can document them.
  • Although it was not directly move-related, some people have encountered an unintended change in the default behavior of our network when Apache issues an internal redirect, causing their sites to fall back on the example.nfshost.com alias if, for example, someone enters a directory name without the trailing slash. This change is actually the result of a bug fix and represents the correct behavior, but we understand that people find it annoying. There is an .htaccess workaround for enforcing canonical site URLs in our Member Wiki, but we are working on allowing you to specify the canonical name for your site from our member interface, which will then be used in all such redirects.
  • The block of IP addresses that we are temporarily using is on a blacklist maintained by a secret cabal of British P2P users that apparently like to trade illegal files and don’t want to get caught. They originally had the range blacklisted “a lot” because of former users of the same IP addresses, but switched it to “just a little bit” (rather than removing us) because they were apparently worried that NearlyFreeSpeech.NET was really just a front for the RIAA, but subsequently blacklisted us “a lot” again for “not showing any gratitude” about being incorrectly blacklisted only a little bit. So we’re not just spies, we’re ungrateful spies. Seriously, I couldn’t make this stuff up if I tried. Fortunately, this has no effect at all unless:
    1. you are a NearlyFreeSpeech.NET member,
    2. you run a P2P blocking application called “Peer Guardian,”
    3. you edit your site with FTP.

    If all those are the case, when you access your site with FTP you will receive a warning that you can bypass by clicking on it. We have specific instructions available if you need them, but I think the handful of affected people have already contacted us.

    This has no effect on web access to your site, and will most likely not be an issue at all after we renumber our IP addresses again. If you have any concerns about it, please feel free to contact us.

  • The blog-to-email service we were using appears to have imploded. We’ve selected a replacement, and we’ll be setting that up shortly, so email notification of NearlyFreeSpeech.NET news & announcements will be possible soon.
  • One thing I noticed during the downtime was how liberating it was to have an open issue and the offsite status page that let us give out real-time updates. I want to find some way for us to do that on an ongoing basis. Something less “significant” than a blog post; almost like an internal members-only Twitter that we could update multiple times a day so you can find out what’s going on right now and what we’ve been working on lately to make your service better. Provided we give you the tools to find what you want, we really can’t give you too much information.
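
As an illustration only (this is not NearlyFreeSpeech.NET's actual tooling), the kind of DNS instrumentation described in the first item above could be sketched in Python roughly as follows. The function name `measure_dns`, its parameters, and the shape of the summary it returns are all assumptions made for this example. It times repeated lookups through the operating system's resolver, so local caching can make results look faster than the authoritative servers really are.

```python
import socket
import statistics
import time
from typing import Callable, Dict, Iterable


def measure_dns(
    hostnames: Iterable[str],
    attempts: int = 3,
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> Dict[str, float]:
    """Time repeated lookups and summarize speed and reliability.

    `resolve` is injectable (a hypothetical design choice for this sketch)
    so the timing logic can be exercised without touching the network.
    """
    durations_ms = []
    failures = 0
    for name in hostnames:
        for _ in range(attempts):
            start = time.perf_counter()
            try:
                resolve(name)
            except OSError:
                # socket.gaierror and socket.herror are OSError subclasses,
                # so failed lookups are counted here rather than timed.
                failures += 1
                continue
            durations_ms.append((time.perf_counter() - start) * 1000.0)

    total = len(durations_ms) + failures
    return {
        "lookups": total,
        "failures": failures,
        "success_rate": len(durations_ms) / total if total else 0.0,
        "median_ms": statistics.median(durations_ms) if durations_ms else float("nan"),
        "max_ms": max(durations_ms) if durations_ms else float("nan"),
    }


# Example usage (example.com is a placeholder hostname):
#     print(measure_dns(["example.com"], attempts=5))
```

Recording a failure count alongside the timings matters because a lookup that never completes says as much about reliability as a slow one says about speed, and both are the sort of documentation the request above asks for.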

Beyond those things, everything seems to be back to normal. We’ve still got a mountain of cleanup work, but give us a couple of weeks and we should be able to turn our focus back to the sort of new feature development that really differentiates NearlyFreeSpeech.NET from the mob of cookie cutter web hosts.

I want to take this opportunity to thank our members for all of their support, patience, and understanding during and after our move. We are so grateful for all of you. With that said, I don’t want to trivialize the downtime we experienced during the move (or the issues that led up to it). We are very sorry to everyone about the downtime caused by our move, and for the issues of the past couple of months that led up to such drastic action.

Naturally we did receive a few fairly snarky messages about the move and downtime, but that’s human nature, and frankly we got fewer than we expected. I’d like to offer a special apology to those few people for putting them in a position where they felt like that was their best recourse.

To close out the subject of the move, I would love to be able to state firmly, right here and now, “We will never do that again!” but the reality is that never is a really long time.

What I can say is that as a result of this move, we’ve gone from having almost no control over our network to having almost complete control and we’re chiseling away at the “almost” even now. We’re also now able to start building our own network, which will mean the ability to spread out to multiple locations. At that point, moving no longer affects everything we own. It just becomes a turn up in one location followed by a turn down in another. All the while providing the high quality of service you expect from NearlyFreeSpeech.NET.

Thanks again!

*Note: That was a joke. Kids! When it comes to arson, environmentally damaging disposal of old computer equipment, and insurance fraud: just say “No!”

(Just a reminder: our blog is not a venue for member support and we can’t discuss issues specific to your service with you in public due to our Privacy Policy. If you’re having a problem with your service, whether you think it’s move-related or not, please submit a secure support request so we can help you out.)

17 Comments

  1. Thanks for keeping us up to date constantly during the move. It gave me something to do while my site was down. 🙂

    Comment by Eric — September 27, 2007 #

  2. You write great essays. I feel that things are in capable hands after reading posts like the above.

    Comment by KC — September 27, 2007 #

  3. Thanks, everyone. Keep up the good work; your efforts are truly appreciated by many of your members.

    Comment by Teknorat — September 27, 2007 #

  4. Thanks for the post-mortem! It means a lot to me.

    Comment by Tim McCormack — September 27, 2007 #

  5. You guys rock. I did have some downtime, but I was expecting it. I trusted you to bring it back up, and guess what, you did! Thanks for keeping us informed.

    Comment by Karen — September 27, 2007 #

  6. I do not depend on my website for any sort of income or support; that aside, I feel that the way you handled this move was classy and appropriate. You kept us updated the entire time, and I appreciate that. Best of luck in your new facility.

    Comment by Josh — September 28, 2007 #

  7. I rarely pop in here except to check on how much money I’m spending, so the first I knew of the move was an email I received a few minutes AFTER it had all kicked off, and my site had gone off air. I was a little peeved at that, because it gave me no opportunity to contact my members and inform them about what was going on (at the time I relied on using the site to keep people informed on issues. Since that time I’ve put alternative email-based solutions in place)

    I have to say that I would have appreciated that email being sent a week before the downtime, rather than a few minutes after it

    Having said that, these things happen, and since the move the service seems to have been rock solid. I’d also like to say how much I appreciate all the time and effort that so obviously went into fixing all those problems that arose because of the move

    P.S. loved that “and then burn down the old one for the insurance money” comment…

    I’m sorry the downtime message took so long to reach you. We sent it as far in advance as we could, given the logistical issues associated with scheduling in a carrier-neutral facility; the messages left our system about five days before the move. I can’t guess why it didn’t reach you sooner, but you’re not alone… we received a bounce back from one member today. That’s over three weeks’ delay, if you’re keeping score. -jdw

    Comment by hieronymous — September 28, 2007 #

  8. Thanks for being honest about all the problems. I think many companies would try and make out that everything went smoothly and to plan. You have openly and honestly admitted the problems and why they happened, and that means a lot to me and, I imagine, others too.

    Comment by Paul — September 28, 2007 #

  9. Thanks for the reply. In that case I take back any grumbles I had on that score

    (Although when I said “since the move the service seems to have been rock solid” I meant: “since the move the service seems to have been rock solid… right up to the point when I wrote that and pressed Submit Comment”)

    Ain’t it always the way?

    Indeed. I knew I was courting disaster by claiming all the fallout was over! -jdw

    Comment by hieronymous — September 28, 2007 #

  10. Congratulations on the move. I’m sure you were all working 100% during the switch so I was not really concerned with the slightly-longer-than-expected downtime.

    Most of my sites were in the minority that had a few lingering issues after the first “We’re back up” message. I left it for a few hours because I knew you’d all be backlogged with support requests and such. The suggestion of disabling and re-enabling my sites didn’t work, so after about 8 hours I submitted a support request. I got an email back within 2 hours and my websites were back up soon after!

    You manage to turn around issues that would create bad publicity for other hosts and convert them into an almost positive experience. There’s nothing better than to notice that my sites are down and to see a huge string of updates at status.nearlyfreespeech that exceeds the first fold.

    Good work. Excellent support, etc etc as usual.

    Comment by Guest — September 28, 2007 #

  11. hey

    I think for what you’re charging, the service is pretty great. the explanations are terrific too.

    That said, I have my teaching info online for my students, and your downtime happened right before my class. So that screwed up about twenty people.

    I love the way you run things; the only thing I don’t like is that there is a lot of downtime. As a consumer, that means that I can’t really rely on you for hosting sites that I need to know are going to be “up”.

    Everything about the way you guys do business, except for the frequent downtime, really rocks. I know this because I have been looking for other hosts and they just don’t offer the same level of open-ness and flexibility. But they do have one thing that is pretty core: uptime and reliability.

    If you can solve the frequent downtime issues ( DOS, moves, whatever) then you will be the best host on the net!

    Just my two cents. I’m still a fan.

    One thing I think you may be overlooking is that the primary purpose of the move was to put control over reliability issues in our hands, since no one cares as much about the reliability of our service as we do. -jdw

    Comment by tc — October 3, 2007 #

  12. Being a technology addict, I was actually eagerly waiting for your next update on the status page! You guys rock.

    I five-star recommend your hosting services to everyone who asks my advice for one.

    Comment by Vinaya HS — October 5, 2007 #

  13. @jdw:

    thanks for your response, it makes sense!
    Like I said, still a fan.
    D

    Comment by tc — October 6, 2007 #

  14. Your account of the move sounds just like the team I work with! Just when everything is supposed to go smoothly, as in less than 4 hour downtime, some issue like more extensive cabling, new instabilities in equipment, power fluctuations, temperature changes, and probably even the moon phase plays havoc with the plan, resulting in a 36+ hour downtime experience, and a “good job” at the end.

    Yours is a very good service, by the way.

    Comment by dk — October 8, 2007 #

  15. Now, if you would just offer dedicated IP’s…

    …then we would be violating our promise not to harm the Internet by wasting IP addresses to make a quick buck. -jdw

    Comment by William Gray — October 21, 2007 #

  16. Around your recent move of physical hosting, you were planning two rounds of IP address changes for sites you’re hosting. It looks to me like both sets of changes have now occurred. Is that correct?

    No worry if not.

    Thanx.


    Nick

    No. The process of IP allocation involves bureaucracy. The wheels are still turning, just so slowly that you can’t actually see it with the naked eye. -jdw

    Comment by Nick — November 3, 2007 #

  17. It is really great that you take the time to inform customers about the infrastructure that the websites are hosted on, this gives great peace of mind.

    Thank you.

    Comment by boxie — November 8, 2007 #
