Network Status – NearlyFreeSpeech.NET Blog
A blog from the staff at NearlyFreeSpeech.NET. (https://blog.nearlyfreespeech.net)

Bigger, better, faster, more (August 22, 2023)

I debated whether to write a humorous intro, but I’ve ultimately decided it’s more important to get succinct information out to everyone, so here’s the TLDR:
Over the next few weeks, we will migrate NearlyFreeSpeech.NET to all-new equipment and greatly upgraded network infrastructure.

  • We’re replacing our Intel Xeon servers with brand-new AMD Epyc servers.
  • All our existing file storage will be migrated from SATA SSDs to NVMe PCIe 4.0 SSDs.
  • Most of our content will be served from New York City rather than Phoenix after the upgrade.
  • Various things may be intermittently weird or slow for the next couple of weeks as we shift them around, but we’re working hard to minimize and avoid disruptions to hosted services.

NearlyFreeSpeech goes Team Red

There’s no question that Intel has been good to us. Xeons are great processors. But these days, AMD Epyc… wow. The processors aren’t cheap, but the compute performance and I/O bandwidth are outstanding. 128 PCIe 4.0 lanes? Per CPU? Except for the speedup, this change should be transparent to most people. By and large, we’ve tried to protect people from building things too specific to exact CPU models by masking certain features, but there is probably some random instruction set supported on the old machines that isn’t present on the new ones. So if you’ve done something super-weird, you may have to recompile.

I don’t want to make any specific promises about performance. After all the speculative branch execution fixes, the security layers needed for our system to protect you properly, and other overhead, these things never quite reach their maximum potential. But, so far, they’re so fast!

Here’s the catch. Some ancient site plans bill based on storage space but not CPU usage. These plans have been gone for about ten years. They were an incredibly bad deal for people who wanted to store lots of data, but they cost basically nothing if your site was tiny and used lots of CPU. That wasn’t sustainable for us. We grandfathered those sites at the time because we’ve always paid a flat rate for a fixed amount of electricity whether we use it or not, and those sites have been running on the same hardware ever since (Intel Xeon X5680s!). Neither of those things will be true going forward, so it’s the end of the road for those plans. We plan to temporarily allocate a bare minimum amount of hardware to those sites for a few months, then let affected people know that they’ll be migrated to current plans around the end of the year.

If you want to check this now:

  1. Go to the Site Information panel for your site.
  2. Find the “Billing Information” box.
  3. If there’s been a red-text message “($10.24/GB/Month – Legacy Billing!)” on the “Storage Class” line for the last ten years, you’re affected.

To change it, find the “Config Information” box and edit the Server Type. Pick the closest option. (If in doubt, “Apache 2.4, PHP, CGI.”)

Quoth the raven, “NVMe more!”

It’s something of a sore point that our file storage performance has always been a bit lackluster. That’s largely because of the tremendous overhead in ensuring your data is incredibly safe. Switching from SATA SSDs to NVMe will give a healthy boost in that area. The drives are much faster, and the electrical path between a site and its data will be shorter and faster. And it’ll give all those Epyc PCIe lanes something to do.

But there’s a little more to the story. To get adequate resiliency, sacrificing some performance is a necessary evil. It just flat-out takes longer to write to multiple SSDs in multiple physical servers and wait for confirmation than to YOLO your data into the write cache of a device plugged into the motherboard and hope for the best. We accept that. And we’ve always accepted that our less-than-stellar filesystem performance was the compromise we had to make to get the level of resiliency we wanted. However, we’ve always suspected we were giving up too much. It’s taken years, but we’ve finally confirmed that some weird firmware issues have created intermittent slowness above and beyond the necessary overhead.

So we expect our filesystem performance to be dramatically better after the upgrade. Don’t get me wrong; it won’t be a miracle. The fastest SAN in the world is still slower than the NVMe M.2 SSD on the average gaming PC (or cheap VPS). But one keeps multiple copies of your data live at all times and does streaming backups, and one doesn’t. And it should be a hell of a lot better than it has been.

Related to this, we’ve made some structural changes to site storage that will make moving them easier and faster. That has some other benefits we care a lot about that you probably don’t, like making storage accounting super fast. It should also make some other neat things possible. But we need to explore that a little more before we announce anything.

New York, New York!

Things have changed quite a bit since we started. As much as I love Phoenix, it’s not the Internet hub it was when I lived there in the 1990s. While some benefits remain, I no longer believe it’s the best place for our service. We see dumb stuff we can’t control, like Internet backbones routing traffic for the US east coast and Europe from Phoenix through Los Angeles because it’s cheaper. New York, on the other hand, is functionally the center of the Internet. (More specifically, the old Western Union building at 60 Hudson Street in Manhattan.)

It will surprise no one that Manhattan real estate is not exactly in our budget, but we got close. And, more importantly, we are parked directly on top of the fiber serving that building. It’d cost about ten times more to shave 0.1 milliseconds off our ping times.

This change will make life demonstrably better for most people visiting hosted sites, most of whom are in the eastern US or Europe. But we’re not leaving the west hanging out to dry. We can finally do what I always wanted: deploy our own CDN. After we’re finished, traffic for customer sites will be able to hit local servers in Phoenix, New York, and Boston. Those servers will transparently call back to the core for interactive stuff but can serve static content directly, much like our front-end servers do today. That’s already tested and working. You might be using it right now.

The new design is completely flexible. It doesn’t matter where your site is located; traffic enters our network at the closest point to the requestor, and then our system does the right thing to handle it with maximum efficiency.

It’s now technically possible for us to run your site’s PHP in New York, store your files in Boston, and have your MySQL database in Phoenix. But “could” doesn’t always mean “should.” We’re still constrained by the speed of light; a two-thousand-mile round trip on every database query would suck pretty hard. (But I’ve done it myself with the staging version of the member site. It works!) So everything’s going to New York for now.
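
To put a rough number on “suck pretty hard” (back-of-the-envelope figures, not measurements from our network): light in fiber covers roughly 200 km per millisecond, so a two-thousand-mile round trip adds something on the order of 16 ms to every query before any real work happens.

    <?php
    // Back-of-the-envelope latency math; every number here is illustrative.
    $round_trip_km   = 2000 * 1.609;   // ~2,000-mile Phoenix <-> New York round trip
    $fiber_km_per_ms = 200;            // light in fiber: roughly 200 km per millisecond
    $latency_ms      = $round_trip_km / $fiber_km_per_ms;   // ~16 ms per query, minimum

    $queries_per_page = 50;            // a fairly ordinary CMS page
    printf("~%.0f ms per query, ~%.1f s added per page\n",
           $latency_ms, $latency_ms * $queries_per_page / 1000);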

Keeping it weird

This change means we have to move all your data across the country. Sometime in the next few weeks, each site and MySQL process will be briefly placed in maintenance and migrated across our network from Phoenix to New York. For most sites, this should take less than a minute. We’ll start with static sites because they don’t have any external dependencies. Then we’ll move each member’s stuff all at once so we don’t put your MySQL processes and site software into a long-distance relationship for more than a few minutes. Once we have a specific schedule, we’ll attempt to make some information and, hopefully, some control available via the member UI to help you further minimize disruption. But our goal is that most people won’t even notice.

There may be some other weirdness during this period, like slowness on the ssh server, and you may actually have to start paying attention to what ssh hostname to use. All that will be sorted out by the time we’re done.

Some longtime members may recall the 2007 move where it took us over a day to move our service a few miles across town. At the time, we wrote, “Should we ever need to move facilities in the future, no matter how long it takes or how much it costs, we will just build out the new facility in its entirety, move all the services between the two live facilities, and then burn down the old one for the insurance money.” Oh my god, it took a long time and cost so much money, but that’s exactly what’s happening. (Sans burning down the facility! We love our Phoenix facility and hope to continue to have equipment there as long as Arizona remains capable of sustaining human life.)

Final thoughts

These changes represent an enormous investment. Thus, much like everyone else these past couple of years, we will have to pass along a huge price increase.

No, just kidding.

Our prices will stay exactly the same, at least for now. (Except for domain registration, where constant pricing fuckery at the registries and registrar remains the status quo. Sadly, there’s nothing we can do about that. Yet.) In fact, they might go down. We bill based on how much CPU time you use, and it’s likely to take less time to do the same amount of work on the new hardware.

The last few years have been pretty weird. COVID aside, NearlyFreeSpeech.NET has been keeping pretty quiet. There’s a reason for that. I’m proud of what NearlyFreeSpeech.NET is. But there’s a gap between what is and what I think should be. There always has been. And that gap is probably bigger than you think.

So I spent some time… OK, nearly three years… more or less preserving the status quo while I did a very deep dive to learn some things I felt I needed to know. And then, I spent a year paying off tech debt, like getting our UI code cleaned up and onto PHP 8 and setting up this move. So four years went by awfully fast with little visible change, all in pursuit of a long-term plan. And in a few weeks, we’ll be finished. With the foundation.

“It’s a bold strategy, Cotton. Let’s see if it pays off for ’em!”

Hey! What happened to 2023Q2? (July 28, 2023)

You may have noticed that production sites with normal updates are being upgraded from 2022Q4 to 2023Q1, and non-production sites are being upgraded from 2023Q1 to 2023Q3. So what happened to 2023Q2?

Wrangling the amount of pre-built software we do is a constant challenge. Something is always changing. And changes frequently break stuff. Several things changed around the same time earlier this year, especially some stuff related to Python, the FreeBSD ports-building process, and other more niche languages that our members care about, like Haskell and Octave. Some of those had nasty interactions. We also have some other changes in the works that have impacted this. (It will be an Epyc change. More details coming soon!)

To make a long story short, we spent so long on the 2023Q2 quarterly software build that it was July, and we still had problems. We finally have a clean build that passes all of our hundreds of internal tests. But we also have the 2023Q3 quarterly build running just as smoothly. Since 2023Q2 won’t get any security updates through the FreeBSD ports team, having our non-production members test it doesn’t seem useful. And we’re sure not going to roll it out to production sites untested.

And so, we are skipping it. The default realm for production sites will be the (now very thoroughly tested) 2023Q1 realm. And the default realm for non-production sites will be the shiny new 2023Q3 realm. As always, we’ll backport security fixes as needed from 2023Q3 to 2023Q1.

No more PHP 7!

For those sites being upgraded from 2022Q4 to 2023Q1, it’s worth reiterating that PHP 7.4 was deprecated in 2021, and security support ended in November 2022. If your site still runs on PHP 7 eight months later, you’re in for a bad time. The PHP developers are ardent adherents of “move fast & break things,” and backward compatibility is the thing they break the most. Back in February, we posted information about this, including some advice for updating, in our forums.
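
To give just one concrete illustration (there are plenty more), PHP 8.0 changed how strings compare to numbers, which silently flips the result of code PHP 7 happily accepted:

    <?php
    // Saner string-to-number comparisons landed in PHP 8.0 and break old assumptions.
    var_dump(0 == "abc");    // PHP 7.4: bool(true)    PHP 8.x: bool(false)
    var_dump("1" == "01");   // bool(true) in both; numeric strings still compare numerically

    // Other common casualties: create_function() and each() were removed outright in 8.0,
    // and passing null to many built-in functions (e.g. strlen(null)) is deprecated in 8.1.

If your code leans on loose comparisons against user input, test it on PHP 8 before the realm upgrade does it for you.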

Maintenance for Christmas (December 24, 2019)

Christmas Eve and Christmas Day are the lowest-usage days of the year (both in terms of member activity and in terms of visits to member sites), so we are going to roll out some core system upgrades over the next 36 hours. These updates relate mostly to file servers.

Despite having no single point of failure from the hardware perspective, each site’s content is still backed by a single system image (necessary for coherency), so these updates may cause some temporary disruptions to affected sites. We will do our best to minimize that.

We do also plan to upgrade our core database servers. These are fully redundant, so we do not anticipate disruption, but the possibility does exist. We hope this upgrade will resolve an issue that mainly manifests as intermittent errors in our member interface early in the morning (UTC) on Sundays.

Upcoming updates, upgrades, and maintenance (March 3, 2015)

We have accumulated some housekeeping tasks that we’ll be taking care of over the next couple of months. They’re all necessary things to make sure our service keeps running at its best, and though we work hard to prevent these types of things from impacting services, occasionally they do intrude. As a result, we want to let everyone know what we’re up to and what the effects will be.

Retiring file server f2

We still have quite a few sites using the file server designated as “f2.” This is the oldest file server still in service, and although it has been a great performer for many years, it is reaching the end of its useful life. It is also one of two remaining file servers (and the only one that holds member site files) that has a single point of failure. Our newer file servers use different technology; they are faster (100% SSD), have no single points of failure, allow hardware maintenance while they are running, and allow us to make major changes (like adding capacity or rebalancing files) behind the scenes without you having to change the configuration of your site.

So, we are quite anxious to get rid of f2. We’ve been offering voluntary upgrades for some time now, but it’s time to move things along. We’ve set an upgrade date and time for every site on f2 in April. If you have a site on this file server, you can see your upgrade time in our member interface and, if it doesn’t suit you, upgrade at any earlier time or postpone it closer to the end of April.

Please note, the file server f2 is distinct from and has no relation to site storage nodes that contain the text fs2. If your site’s file storage tag contains fs2, you are not affected by this.

Migrating a site does entail placing it into maintenance mode briefly, for a period proportional to the size of the site. Beyond that it usually has no ill effects. Some sites do have complications, especially if they have hardcoded paths in their .htaccess files. After our system migrates your site, it will attempt to scan the site for affected files and send you an email listing them if it finds any. This isn’t 100% foolproof, but we previously did it for a lot more sites under considerably greater pressure with the f5 server, and problems were relatively few and far between.
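
If you’d rather check ahead of time than wait for the post-migration email, a rough self-check along these lines (not the actual scan our system runs, and the /f2/ prefix here is just a stand-in for whatever absolute path your site currently uses) will turn up .htaccess files worth a look:

    <?php
    // Rough self-check, not the scan our system runs. Run it from your site root.
    $needle = '/f2/';   // stand-in for the old absolute path prefix to look for
    $it = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator('.', FilesystemIterator::SKIP_DOTS));
    foreach ($it as $file) {
        if ($file->getFilename() === '.htaccess'
            && strpos(file_get_contents($file->getPathname()), $needle) !== false) {
            echo "check " . $file->getPathname() . "\n";
        }
    }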

Discontinuing PHP Flex

As part of our continued (slow) migration away from Apache 2.2, we will be discontinuing PHP Flex. PHP Flex refers to running PHP as a CGI script, which is a terrible way to do things. In the bad old days, it was useful in some cases for compatibility with PHP applications that didn’t work with safe_mode, if you didn’t mind the horrible performance. But, even in the bad old days, it mostly ended up being used not because it was necessary, but because it was easier than dealing with safe_mode.

These days, PHP safe_mode is long gone, so there’s no real reason to have PHP Flex anymore. Our new PHP types are highly compatible with (and much faster than) PHP Flex, and most people have already happily upgraded. However, there are still some stragglers out there and, as time goes by, they are starting to have problems. Those problems often completely go away simply by switching to a currently-supported version of PHP. Thus, we feel it’s time to phase out PHP Flex. In the month of April, we will auto-migrate PHP Flex sites (which mostly run PHP 5.3 and in some cases 5.2) to PHP 5.5.

MySQL software upgrades

We are currently working on both long-term and short-term upgrades for MySQL. In the short term, we need to perform a series of OS and MySQL server updates on existing MySQL processes to keep them up-to-date and secure. This will require either one or two brief downtimes for each MySQL node, typically about 5-10 minutes. We will be performing these updates throughout the month of March, and we will announce them on our network status feed (viewable on our site and Twitter).

In the long term, MariaDB 5.3 is getting a bit long in the tooth, so we are working to jump straight to MariaDB 10 and all its great new functionality, as well as offering better scalability and configuration flexibility. This is likely to be somewhat more resource intensive, and hence more expensive, so it will be optional for people who are perfectly happy with the way things are. (If you like your MySQL plan, you can keep it!) More on this as it gets closer to release.

Physical maintenance

We also need to do some maintenance on the power feeds to one of our server shelves. Ordinarily that isn’t an issue that affects our members, but in this case it’s being converted between 120V and 208V. Hypothetically that can be done while the equipment is running, but doing so entails a nonzero risk of death by electrocution and after careful consideration we’ve decided that none of the current field techs are expendable at this time. Also, it could burn down the datacenter. So, we’re going to go ahead and do it by the book, which means shutting it off.

That’s a few dozen CPU cores and hundreds of gigs of RAM we need to take offline for a little while. In a real disaster, our infrastructure could survive, but there would be a period of degraded service while things balance out on the remaining hardware. That period would be significantly longer and affect significantly more people than the actual maintenance. So, we feel our best course of action is just to shut it off for the few minutes it will take to rewire the power feeds. The service impact should be low, but will probably not be zero.

We want to complete the MySQL maintenance listed above first, so we are likely to do this toward the end of March. We will post updates on our network status feed with more precise timing as we get closer.

Realm upgrade reminder

We have finally finished rolling sites off of the dreaded “legacy” realms (freebsd6, freebsd72, and 2011Q4). Every site is now on a “color” realm. This means that people who have selected late realm upgrades for their site in our UI and who are currently running on the red realm will receive an automatic upgrade to violet in April, after the quarterly realm rotation has occurred. Compatibility between the two is excellent and we anticipate very few problems.

That’s all for now. All in all, the upgrades and maintenance shouldn’t affect too many people, but we regret and apologize in advance for any problems they do cause. These steps are part of a process designed to eliminate some very old infrastructure that makes maintenance like this intrusive. In other words, the goal of this maintenance is, in large part, to make sure that the next time we do it, you’ll be even less likely to notice.

Thanks for reading!

Automatic file server upgrades (August 1, 2014)

As most of our members are aware, one of our older file servers, f5, has been causing intermittent problems. The time has come to move the sites still using it to newer, faster, more reliable equipment. The ability to do that manually has been available in our UI for about a week now, and it has not surprisingly been pretty popular. But after that server caused additional downtime this past week, we’re moving to the next phase: moving sites automatically.

We’ve been testing the replacement file servers for some time now, with hundreds of test sites and various use cases, and they have done very well. Naturally, we’re still paranoid that something will go wrong, but in addition to the testing we have an aggressive backup and replication schedule. So it’s time to move ahead.

Beginning August 4th and continuing through the end of the month, we will start automatically migrating affected sites. If you have any, they are marked with an asterisk on the Sites tab in our UI, with more details on the Site Info Panel for each affected site. The Site Info Panel will also let you adjust the scheduled upgrade, allowing you to migrate a site at any time or (to an extent) postpone an upgrade that is scheduled at a bad time for you.

Most sites don’t need to make any changes as a result of this migration. Based on our testing and the sites that have voluntarily migrated thus far, less than 1% of sites need anything modified to continue working after the upgrade. The changes involve hardcoded absolute paths that won’t be valid after the migration, i.e. anything starting with /f5/sitename/. These fall into two broad categories.

First, .htaccess files. If you’re using HTTP basic authentication or something similar in your .htaccess file that uses absolute pathnames, those will have to be changed after the migration. You’ll be able to get the new path to use from your site info panel after the migration.

Second, if you’re still using PHP 5.3 Fast and you have hardcoded paths in your PHP code, those will also need to be updated. Using hardcoded paths in this situation was never recommended; it’s always preferable to use a preset variable like $_SERVER['DOCUMENT_ROOT'] or $_SERVER['NFSN_SITE_ROOT'] if at all possible. PHP 5.3 has also been obsolete for a long time. So if you find yourself in that situation, this is a great time to upgrade that from our UI as well. You’ll still have to change the paths, but this will be the last time. All the currently-supported versions of PHP (5.4 and later) use /home-based paths, just like CGI and ssh, and those never change.
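
For the avoidance of doubt, the difference looks something like this (the file names and the old /f5/-style path are made up for illustration):

    <?php
    // Fragile: a hardcoded absolute path breaks whenever the site moves storage.
    $log = fopen('/f5/examplesite/protected/app.log', 'a');   // hypothetical old-style path

    // Portable: ask the server where the site lives right now.
    $root = rtrim($_SERVER['NFSN_SITE_ROOT'], '/');           // or $_SERVER['DOCUMENT_ROOT']
    $log  = fopen($root . '/protected/app.log', 'a');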

To help you find out if your site needs to be modified, we’ve developed a scan which is run during the migration. When the migration is finished, it will email you to let you know it’s done and whether or not it found any potential problems. It may not catch every possible issue, but it does a very very good job.

Once f5 is no longer in use, it’ll be tempting to give it the full Office Space treatment due to the problems it has caused, but the truth is that it served us incredibly well for a long time, so giving it a salute as it is ejected into space in a decaying orbit into the sun would better fit the totality of its service. (Although that’s admittedly not in the budget, so recycling is a more likely outcome unless the console prints “Will I dream?” as we shut it down for the last time, in which case we probably won’t have the heart.)

Although only a tiny fraction of our members will have even minor problems with this upgrade, each and every one of our members and each and every one of their sites is important to us. If you do run into any snags related to migrated sites (or, really, anything else), please feel free to post on our forum and we’ll do what we can to help you out. (But please don’t post about them here; blog comments are a terrible venue for providing support, second only to Twitter in sheer awfulness and unnecessary difficulty.)

Post-mortem report of Saturday’s file server failure (April 3, 2014)

On Saturday, March 29 at about 4pm US Eastern time, we rebooted one of our file servers that hosts content for member sites. It experienced a critical hardware failure and did not come back online. It took about 28 hours to get things back into service. We’re going to talk briefly about why that happened, and what we’ll be doing differently in the future.

ZFS in one paragraph

This issue has a lot to do with ZFS, so I’ll talk very briefly about what that is and how we use it. ZFS is an advanced filesystem, originally developed by Sun Microsystems back before they got devoured by Oracle. When you upload files to our service, ZFS is what keeps track of them. It performs very, very well on hardware attainable without an IPO, and we’ve been using it for many years because we need stuff that performs very, very well to keep up with you guys. It also has features that we and you are fond of, like snapshots, so if something of yours gets accidentally deleted, we can (almost always) get it back for you. The downside to ZFS is that it is not clusterable. That means that no matter what we do, there will always be at least one single point of failure somewhere in the system. If we do any maintenance, or if it fails, an outage will result.

What happened

Prior to Saturday’s issue, that file server (f5) had twice in the previous two weeks caused problems with slow performance. We’ve seen a very similar problem with ZFS-based file servers in the past; when they accumulate a lot of uptime they start to slow down until rebooted. Because it involves downtime, member file servers don’t get rebooted very often; not unless they are having a problem. This one was having a problem we believed would be resolved by rebooting, so we rebooted it. However, at that point, it suffered a hardware failure. Although there’s no direct evidence of a connection, it’s hard to believe that’s a coincidence.

We did have two backup servers available to address this situation, one of which was intended for that purpose. It is based on new technology that we will discuss in more detail later, but what we discovered when we attempted to restore to it is that it misreports its available space. It said it had three and a half times more space than we needed, but it really only had a few hundred gigabytes; nowhere near enough. (Fortunately, we now understand why it reports what it does and how to determine what’s really available.) The second option had the space, but was always intended only to be a standby copy to guard against data loss, not as production storage. We determined pretty quickly that it could not sustain the activity necessary.

As a result, we were forced to focus on either fixing the existing server or obtaining a suitable replacement. Unfortunately, Saturday evening is not a good time to be looking for high-performance server components. We do have a service for that, and they eventually came through for us, but it did take until Sunday afternoon to obtain and install the replacement parts. Once that was resolved, we were able to get it back online relatively quickly and get everyone back in service.

What will happen next

As mentioned above, the big problem with ZFS is that it cannot be configured with no single point of failure. This basically makes it the core around which the rest of the service revolves. We’ve always done everything possible to get as close as we can; the server that failed has multiple controllers, mirrored drives, and redundant power supplies. Pretty much everything but the motherboard was redundant or had a backup. And, of course, the motherboard is the component that failed.

That’s not a small problem. Nor is it a new one. Single points of failure are bad, and we’ve been struggling for a long time to get rid of this one. We’ve tried a lot of different things, some of them pretty exotic. But what we have found for the past several years is that there’s really not a magic bullet. The list of storage options that genuinely have no single point of failure is pretty short. (There are several more that claim to, but don’t live up to it when we test them.) We have consistently found that the alternatives are some combination of:

– terrible performance (doesn’t matter how reliably it doesn’t get the job done)
– lack of POSIX compatibility (great for static files, but forget running WordPress on it)
– track record of instability or data loss (We’re not trusting your files to SuperMagiCluster v0.0.0.1-alpha. Or btrfs.)
– long rebuild (down)times after crash or failure
– (for commercial hardware solutions) a price tag so high that it is simply incompatible with our business model

The end result is that for the past few years, we have backed ourselves into something of a ZFS-addicted corner. However, what makes Saturday’s failure particularly frustrating is that we actually solved this problem. We’ve been rolling that solution out over the past couple of months. What’s left to be moved at this point is member site content and member MySQL data. The hardware to do that is already on order; it may even arrive this week. Once it does, there will be a week or two of setup and testing, and then we will start moving content. That will involve a brief downtime for each site and MySQL process while it’s moved, and may require a few sites with hardcoded paths to make some updates. We’ll post more about that when we are ready to proceed.

The new fileserver setup has no single points of failure, is scalable, serviceable, and expandable without downtime, preserves our ability to make snapshots, and performs like we need it to. And (crucially) although it is still cripplingly expensive, we could afford it. This is an area where we’ve been working very hard for a very long time, and it simply wasn’t possible to get all the requirements in one solution until recently.

To be perfectly clear, this doesn’t mean our service will never have any more problems. No one can promise that. File server problems were already incredibly rare, but since our service design makes them so catastrophic for so many people (at many hosts, such failures are a lot more common, but don’t affect nearly as many sites at once), we have to do as much as we can to make them incredibly rarer.

There are also plenty of other things besides file servers that can go wrong at a web host, and we continue to work on improving our service in all those areas. We’ll have more to say on that subject as the year progresses, but really, there’s no such thing as “good enough” for us, so that work will never end.

For now, we’re very sorry this happened. As we said during the downtime, there is nothing we hate more than letting you guys down, and we did that here. It’s no more acceptable to us than to anyone else for something like this to happen. What we can tell you is that before this happened we were executing a plan that, if it had been completed, would have prevented this. Completing that plan as quickly as possible is our next step.

Thanks for your time and your support. Problems like this are physically sickening, and seeing that so many of our members were so supportive really helped carry us through.

IPv6, SSL, scheduled tasks, storage discounts & bulk bandwidth (December 26, 2012)

We have quite a few feature announcements to bring you this holiday season. We’ve added support for several features, some of which have been requested for years, like IPv6, SSL, and scheduled tasks (AKA cron jobs). We’ve also introduced new billing options that make our service pricing fairer and more scalable; these options will help a broad variety of our members save money.

IPv6 support for hosted sites

We’ve added IPv6 support for hosted sites. Just select the “Manage IPv6” action from the site info panel and enable it. Each site that enables IPv6 is currently assigned a unique IPv6 address. There is no extra cost for IPv6.

IPv6 isn’t fully deployed on the Internet yet, and consumer ISPs are the worst laggards, so IPv6 isn’t enabled by default and you may want to consider whether it’s right for you. More info is available in our UI.

SSL support

We always said that we would focus on SSL support as soon as IPv6 was done. And that’s just what we did. SSL support is now available in two forms.

First, you can obtain (or generate) an SSL certificate for any alias on your site, upload the certificate, the key, and the chain file (if applicable) to your site, and then request SSL be enabled for that alias through our UI. This is a great option if you want to secure traffic to your site for all visitors.

Second, generic support for securing the (shortname).nfshost.com alias of each site is available without the need for your own certificate. Use of this option is also requested through our UI, but it’s our domain name, so its use is subject to our pre-approval. (For example, we won’t be approving any sites with names like “securebanklogin.”) This option is good if you just want to administer your site through a web UI securely.

Our SSL implementation depends on the SNI feature of the TLS standard, which is now available in all modern browsers, so we are comfortable deploying it. SSL currently does not have any extra cost associated with it while it retains experimental status. It may get a nominal fee in the future to cover the added CPU cost of encryption once we get a better idea about how tightly we can pack certificates without causing problems.

Scheduled tasks

We’ve added the long-requested ability to run scheduled tasks on specific sites at regular intervals. Great for processing log files or refreshing content, scheduled tasks can be set to run hourly, daily, weekly, or monthly. There’s currently no extra charge for this feature, but we’ll keep an eye on the resources it uses.
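
If you haven’t set one up before, the thing a scheduled task runs is just an ordinary script somewhere in your site. A minimal hypothetical example of the log-pruning variety (the path is a placeholder; point it at wherever your site actually keeps its logs):

    <?php
    // Hypothetical daily scheduled task: delete rotated log files older than 30 days.
    $dir    = '/home/protected/logs';   // placeholder; adjust for your site's layout
    $cutoff = time() - 30 * 86400;

    $files = glob($dir . '/*.log');
    foreach (($files === false ? array() : $files) as $file) {
        $mtime = filemtime($file);
        if ($mtime !== false && $mtime < $cutoff) {
            unlink($file);
            echo "pruned $file\n";
        }
    }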

Storage discounts and resource-based billing

Most people (including us) will acknowledge that for a long time, the greatest flaw of our service has been that it bills a large amount of the cost based on how much disk space a site uses. This charge then pays for all the CPU and RAM it takes to host sites. This works well enough in terms of covering our costs, but forces sites that are very large to subsidize sites that are very small but resource-intensive. That’s not fair, so we’ve made two changes.

First, we’ve cut the storage charge for static sites, which by definition use few resources. Our published rate is $0.01 per megabyte per month, but with this change, static sites are now charged at $0.01 per 5 megabytes per month. That’s an automatic across-the-board 80% cut for all static sites.

Second, we’ve introduced a new option for dynamic sites called “stochastic billing.” If selected, this option cuts the storage costs for dynamic sites by 90% to $0.01 per 10 megabytes per month. In its place, it divides sites into groups and once per minute, selects a web request at random and bills the associated site for the resource usage attributed to that group. The likelihood of a given request being selected is proportional to the resources it uses, so over time the random sample converges to a very accurate representation of which sites are using which resources, and everyone who participates is billed fairly for the share of resources they used with a very high degree of accuracy.
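
For the curious, the selection step is just weighted random sampling: a request’s chance of being the one picked is its share of the group’s resource usage for that minute. A sketch of the idea (illustrative only, not our actual billing code, and the numbers are invented):

    <?php
    // Illustrative sketch of resource-weighted sampling; not our actual billing code.
    // One entry per request seen this minute: the site that served it and a resource score.
    $requests = array(
        array('site' => 'bigstatic',  'resources' => 0.2),
        array('site' => 'busyapp',    'resources' => 9.5),
        array('site' => 'littleblog', 'resources' => 0.3),
    );

    $total = 0.0;
    foreach ($requests as $r) { $total += $r['resources']; }

    // Pick one request with probability proportional to its resource usage...
    $pick = mt_rand() / mt_getrandmax() * $total;
    foreach ($requests as $r) {
        $pick -= $r['resources'];
        if ($pick <= 0) {
            // ...and bill that request's site for the whole group's usage this minute.
            // Over many minutes, each site's expected charge converges to its actual usage.
            printf("bill %s for %.1f resource units\n", $r['site'], $total);
            break;
        }
    }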

We’ve set the pricing for stochastic billing in such a way that if everybody switched tomorrow, our bottom line wouldn’t change at all, so this is not a price increase. Most people will actually pay less. Sites that use above-average resources — the ones subsidized under the current plan by sites that use tons of disk space — will naturally cost more if they switch over. But we don’t plan to force anyone to switch. Instead, we intend to preserve both options and allocate to each the hardware resources it is paying for. Over the long term, we expect resource-heavy sites on the old plan will find fewer and fewer disk-heavy sites willing to subsidize them, which may lead them to resource shortages way down the road if they choose not to migrate and pay their own way. But we don’t anticipate any dramatic changes in the short term.

More information about this is available in our FAQ and our forum, and the option is available by changing your site’s server type in our member UI.

In the coming days, we’ll be adding a $0.01/10MiB/month + stochastic billing option for static sites as well. That’ll be better than the $0.01/5MiB/month plan for most but not all static sites, and we understand some people won’t want anything to do with a billing scheme with a random element, so it will be optional.

Bulk bandwidth option

One of the things we do that’s a little unusual is that we demand very high quality bandwidth for our member sites; there are a number of lower-priced providers commonly used by web hosts for connectivity that we don’t consider good enough. A consequence of this is that the bandwidth costs we pay are relatively expensive compared to some of our competitors of a similar size, and of course we pass that along. We feel it’s well worth it.

At the same time, we have wound up connecting to cheaper providers from time to time. This is not to serve member sites, but rather because cheaper providers — combined with clever routing and network management — can be a good way for us to soak up huge surges of incoming traffic associated with DDOS attacks without affecting the rest of our network. However, even though the bandwidth offered by these providers is relatively cheap, DDOS attacks consume a lot of it, and the more of it we have, the more resilient we are, so the overall bill is not trivial. And at the same time, DDOS attacks generate only inbound traffic, so we’re only using the inbound half of those connections (and then only when we’re being attacked).

So, we’ve got a bunch of unused outbound capacity. We’ve made the decision internally that the price/quality tradeoff is not worth using those connections for our regular traffic, but we respect that not everyone’s site is the same and that in light of the pretty significant cost difference, some people might prefer to make that decision for themselves.

As a result, we’re offering a new class of “bulk” bandwidth that will use our excess outbound capacity on a per-site basis. Instead of being priced per byte transferred like our regular offering, bulk bandwidth is priced per megabit per second (Mbps) per month. You select the amount you want and then pay $5.00/Mbps/mo. (But like most of the rest of our services, it is charged one penny at a time and can be added or removed at any time.) Your actual usage is unmetered. It’s also burst-able, meaning it groups sites together and if another site in the group isn’t using its share at any given moment, your site can borrow it at no extra charge.

Bulk bandwidth is typically best suited to sites that steadily use a lot of bandwidth, for example to distribute large files to the general public. Our regular bandwidth plan will still generally provide higher per-connection speeds, better routing and resiliency, and probably slightly better latency.

To determine if bulk bandwidth is right for a site, first figure out if it’s currently spending less than $5.00/mo on bandwidth under our standard plan. If it is, bulk bandwidth is a bad deal: pay more, get less. But if a site’s bandwidth costs more than $5.00/mo, the answer is maybe. Next, you would look at the nature of the site. If the priority is to deliver the most overall bandwidth per dollar, then bulk bandwidth might be a good choice. If the priority is to provide the fastest individual downloads, or if the site has significant interactive elements — particularly stuff like AJAX — it’s better to stick with the standard plan.

In short, the bulk bandwidth option is the freight truck to our standard plan’s sportscar. Both can move a lot of data very quickly, but in very different ways.
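
If you want to do the freight-truck-or-sportscar math yourself, the comparison comes down to converting steady monthly transfer into an equivalent bandwidth rate. A rough sketch — the per-GiB figure below is a placeholder, not our published rate, so substitute the number from your own billing, and remember that real traffic is bursty, so you’d usually provision somewhat above the average:

    <?php
    // Rough comparison only; the standard per-GiB price below is a placeholder.
    $standard_per_gib = 0.10;   // substitute the rate from your own billing
    $bulk_per_mbps    = 5.00;   // bulk option: $5.00 per Mbps per month, unmetered
    $gib_per_month    = 500;    // example: a site pushing ~500 GiB/month

    // Average throughput needed to move that much data in a 30-day month:
    $mbps_avg = $gib_per_month * 8 * 1024 / (30 * 24 * 3600);

    $standard_cost = $gib_per_month * $standard_per_gib;
    $bulk_cost     = ceil($mbps_avg) * $bulk_per_mbps;

    printf("standard: $%.2f/mo    bulk (~%.1f Mbps average): $%.2f/mo\n",
           $standard_cost, $mbps_avg, $bulk_cost);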

Final thoughts

Whew. Densest. Blog post. Ever.

If you follow our Twitter feed, you already know about these updates. But judging by the follower numbers, most people don’t, so we thought we’d mention it.

We think these changes are huge. They address a lot of the pain points that many of our members have been feeling for a long time, both in terms of features and cost. And they represent a mountain of work, especially these past few weeks to carry them over the finish line in time for Christmas.

Going forward, the biggest question will probably be about PHP 5.4. That’s the big one we weren’t able to make happen in time for this announcement. It remains available in Flex mode if you select the 2011Q4 realm, but 5.4 removes safe_mode support and hence there won’t be a “PHP 5.4 Fast.” Instead, “PHP 5.4 Full” is coming, which combines a lot of the best features of Flex (consistent paths, ability to execute external programs) with performance comparable to existing Fast sites. That’s our top feature development priority, and we’re keeping a close eye on the March 2013 timeframe that the PHP developers have announced for phasing out non-critical updates to PHP 5.3, but we can’t offer an ETA at this time. We also have some internal maintenance to do to keep things running smoothly and fix bugs.

Thanks everyone! We never lose sight that our incredible members make our service not just possible but everything it is. (And I allow myself a bit of a smug grin, secure in the knowledge that we have the hands-down smartest member base of any web host, which is the only reason we have the courage to do something as exotic as stochastic billing.)

(Updated 2012-12-26 to reflect that *.nfshost.com no longer uses a self-signed certificate for SSL.)

Security flaw with login corrected (August 2, 2011)

One of our members informed us a couple of days ago that due to a strange combination of actions and circumstances, he hit a flaw in our login system that enabled him to access the membership of another member with a similar name.

Of course we promptly investigated; the problem has been permanently fixed.

After that, we turned our attention to finding out if that particular flaw had been exploited in any other cases. It does have a very distinctive pattern, part of which is failing to log in as the first person, successfully logging in as the second person, and then “reappearing” as the first person. (That’s sufficient “signature” to detect it in our records, but there’s actually more internally required for it to happen — related to cookies and PHP session handling.) We’ve been over the logs back to the point where the problem was introduced and we’re happy to report that we were not able to find any previous similar incidents. So, if you needed any reassurance that most people are basically good, the first person to find this problem reported it to us within minutes.

Obviously the person who did this is aware of it, and we have already notified the person affected. So if you haven’t already heard from us about this, it doesn’t affect you and you don’t need to take any steps. We are posting this anyway simply because it’s security related. Security is our top priority; it’s the foundation upon which the rest of our service has to be built. So, as transparent and forthright as we try to be when we have service problems and downtime, I feel we need to be twice as forthcoming when we have problems like these, however small.

I also feel it’s appropriate to personally apologize to all of our members because this was a security problem and it was caused by a coding error introduced by me. This is an area where only perfection is acceptable; falling short even a little bit is not. I’m sorry, and I will work hard to keep it from happening again.

(Ironically, we are already developing a new certificate-based backend that is so secure, the goal is to open-source our entire UI when it is complete.)

Scheduled maintenance November 22 and December 15 (November 16, 2010)

We are scheduling two maintenance windows in the next month to move some equipment:

Date: November 22nd, 2010
Window: 9am to noon UTC (4am to 7am US Eastern, 1am to 4am US Pacific)
Affecting: MySQL nodes m2, m3, and m21

Date: December 15th, 2010
Window: 8am to 1pm UTC (3am to 8am US Eastern, midnight to 5am US Pacific)
Affecting: File servers f2 and f5

Each server should be offline for about one hour, not the whole window. This will cause some downtime. While the MySQL nodes are offline, those MySQL processes hosted on them will be unreachable. While the file servers are offline, sites hosted by those file servers will show an official maintenance page.

Due to the nature of our network, we can generally (and frequently do) move stuff around without disrupting services. Unfortunately, we regret that’s not true in this case. These servers will actually be physically moving to a bigger space in a new facility with more (and more reliable) power, so that we will have room to continue to grow for the next several years. We’ve listened to past comments on these types of issues, and we’ve endeavored to give you as much advance notice as possible, especially on the file server moves, so that you’ll have time to make any plans or announcements you feel are appropriate.

Thanks very much, and we apologize for any inconvenience as we (as always) continue to work tirelessly to make our service better for our members.

Update: The first batch of maintenance is long since completed, and we’re on schedule for the second, but we noticed that this was in the wrong blog category. We’ve now moved it.

Removing deprecated IP block (November 16, 2010)

Many years ago, we were assigned the IP address block 64.238.220.0/23 by one of our upstream network providers. We officially deprecated the use of that block way back in 2008, and we will be returning it on December 1st, 2010, so it will not work after that point.

The only possible way this could affect you is:

1) You use third-party DNS to point at a site hosted here.
2) You hardcoded A records in the 64.238.220.0/23 range (against our advice) over two years ago.
3) You haven’t checked on / updated those settings in the past two years.

In other words, literally only a handful of people will be affected by this change. Nonetheless, we wanted to have at least some public warning before moving forward. The few people affected know a lot about DNS and went to the trouble to make a custom setup at another provider; they’ll know exactly what this post means and what to do about it. So, if you have to ask “does this affect me?” the answer is almost certainly no.
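
If you’d rather check than wonder, something like this (illustrative; dig or any other DNS lookup tool works just as well) will tell you whether a hostname’s A records fall inside the deprecated block:

    <?php
    // Quick check: does a hostname resolve into 64.238.220.0/23?
    // That /23 covers 64.238.220.0 through 64.238.221.255.
    $host = 'www.example.com';   // your hostname here
    $ips  = gethostbynamel($host);
    foreach (($ips === false ? array() : $ips) as $ip) {
        if (preg_match('/^64\.238\.22[01]\./', $ip)) {
            echo "$host -> $ip is in the deprecated block; update your DNS records.\n";
        }
    }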
