MariaDB Cluster Incident

Last Saturday we faced our biggest operations incident with Flownative Beach so far: A MariaDB Galera Cluster failed and only came up again in a degraded state.

This caused a loss of data from the last three days for 5 websites (as far as we know) and a downtime of about 60 minutes for 62 Beach instances. Among the affected sites were neos.io and flownative.com.

tl;dr: We will switch to a new database setup tonight. If your project is going to change database content between 22.10.2019 22:00 and 23.10.2019 02:00, please get in touch with us so we can work out a solution.

Some Background about our Setup

We have been running MariaDB within our Kubernetes clusters for about a year. In order to make MariaDB more resilient against failures of the physical machines it runs on, we developed a setup using MariaDB Galera Cluster. That means: instead of using a single server, all data is replicated to a second one. A load-balancing mechanism distributes the workload (your SQL requests) across those two servers (more concretely: we are running two-node Galera clusters with master-master replication and an additional arbitrator node, distributed across different Kubernetes nodes by using anti-affinity rules).
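To illustrate why the arbitrator node matters: Galera only keeps a group of nodes writable while that group holds a strict majority of the cluster members. The following is a minimal sketch of that rule, assuming every member has the same vote weight; the function name `has_quorum` is made up for this example and is not part of Galera.

```python
def has_quorum(reachable_members: int, total_members: int) -> bool:
    """Return True if the reachable members form a strict majority.

    Assumption: every member (data node or arbitrator) has one equal vote.
    """
    if total_members <= 0 or not 0 <= reachable_members <= total_members:
        raise ValueError("invalid member counts")
    return 2 * reachable_members > total_members


# Two data nodes without an arbitrator: losing one leaves 1 of 2 votes.
print(has_quorum(1, 2))  # False

# Two data nodes plus an arbitrator: losing one data node leaves 2 of 3 votes.
print(has_quorum(2, 3))  # True
```

With only two voters, a single failure would leave the surviving node without a majority, which is why the third, data-less arbitrator is part of the setup.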

Running a Galera cluster is not trivial, and doing so in a Kubernetes environment is even more challenging. Therefore we invested about two months of work alone in developing custom MariaDB Galera Docker images and monitoring scripts.

What Happened

The root cause is relatively easy to describe. And it is quite embarrassing: "out of disk space".

Each MariaDB node has its own so-called "persistent volume". Basically it's a virtual SSD attached to the database server's container. When that volume is full, MariaDB can't process any further write operations and eventually crashes.

Generally, there were a couple of things we did in order to prevent this kind of crash. First of all, we were monitoring (albeit manually) the amount of free disk space and were resizing volumes where needed. Secondly, we were giving master nodes different amounts of disk space, so they would not fill up at the same time. And lastly, we had development going on to automate these tasks and let volumes grow automatically.
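An automated version of the free-space check could look roughly like the sketch below. It is an illustration only, not the tooling described above: the function names and the 80 % threshold are assumptions chosen for this example.

```python
import shutil


def used_fraction(total_bytes: int, used_bytes: int) -> float:
    """Return the share of the volume that is in use (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def needs_resize(total_bytes: int, used_bytes: int, threshold: float = 0.8) -> bool:
    """Return True once usage reaches the (assumed) resize threshold."""
    return used_fraction(total_bytes, used_bytes) >= threshold


def check_volume(path: str, threshold: float = 0.8) -> bool:
    """Check the filesystem that contains `path` against the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return needs_resize(usage.total, usage.used, threshold)


if __name__ == "__main__":
    # "." stands in for the mount point of a database volume.
    print("resize needed:", check_volume("."))
```

A fixed threshold only helps if the check runs often enough; when usage grows very quickly, the interval between checks matters as much as the threshold itself.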

These measures failed in this particular case, though: The disk usage grew much, much faster than anticipated and due to a copy-and-paste error, this cluster ended up with both master nodes having the same volume size.
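A simple consistency check can flag nodes that were accidentally configured with identical volume sizes. The sketch below is hypothetical: the node names, the sizes and the function `nodes_sharing_a_size` are invented for illustration and do not reflect the actual cluster configuration.

```python
from collections import defaultdict


def nodes_sharing_a_size(volume_sizes_gib: dict[str, int]) -> list[list[str]]:
    """Group nodes by configured volume size and return every group
    that contains more than one node."""
    nodes_by_size = defaultdict(list)
    for node, size in volume_sizes_gib.items():
        nodes_by_size[size].append(node)
    return [
        sorted(nodes)
        for _size, nodes in sorted(nodes_by_size.items())
        if len(nodes) > 1
    ]


# Hypothetical configurations: the first repeats a size, the second does not.
print(nodes_sharing_a_size({"node-a": 20, "node-b": 20}))  # [['node-a', 'node-b']]
print(nodes_sharing_a_size({"node-a": 20, "node-b": 25}))  # []
```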

When one of the nodes failed, the other node tried to jump in. The first node was restarted automatically and tried to recover – and because the first node then also ran out of space, it crashed, too. That's what you call a full cluster crash. During this time, the second (healthier) node tried to recover from data provided by the first node, which was incomplete due to the crash. And thus, in the end, we lost about three days of data. Fortunately, most of the projects using this database server did not edit any content during that period of time.

How We Will Fix It

We talked with a bunch of folks running enterprisey projects and, well, they faced similar challenges in one way or another. The bottom line for us is that Kubernetes may not (yet) be well suited for running database servers (although Oracle, for example, released a solution for MySQL based on the Operator Pattern). It certainly is doable, but considering that we are talking about a hosting environment starting at 19€ per month, it's just out of scope with the resources we have. And, if that is any indicator, even the big cloud companies, like Google, AWS and DigitalOcean, still don't offer MySQL clusters with master-master replication as a service.

So, what we are doing right now is rolling back to a more traditional setup: We are migrating all databases to Google Cloud SQL instances, which run outside of Kubernetes. They do have auto-growing volumes; however, CPU and memory don't scale automatically. These servers also need to be restarted (which causes a little downtime) when maintenance tasks are due. But a few seconds of downtime a month are certainly preferable to the prospect of losing data with a more progressive setup.

The Migration

During the migration we will create a fresh dump of your databases and import it into the new Cloud SQL server. Once that is done, we'll re-deploy your instances using updated database credentials.

We will switch to the new database setup tonight. Any content changes made between the start of the export and the completed switch to the new server would be lost. So, if you know that your project is going to change database content between 22.10.2019 22:00 and 23.10.2019 01:00, please get in touch with us so we can work out a solution.

We know that this announcement comes on short notice, but we really need to get content off the old cluster, so we don't have to sit next to it and hold hands, you know …

Conclusion

This was not fun, neither for us nor for the customers who were affected by this outage. We are very sorry for this (also because we had to rewrite some content ourselves).

We hope that we were able to give you some insight into what happened and how we are dealing with the problem at hand. If you have any questions or need help communicating the matter to your customers, please do let us know.

And then, enjoy the Beach – the weather will be fine again by the end of this week!

Your Flownative Team