The problems of stateful data in Kubernetes

(Written by Lawrence Krubner; indented passages are often quotes.)


StatefulSet for deploying stateful Pods

StatefulSet is the abstraction that was supposed to solve all these issues. It gives each Pod in the set a unique identifier, and the StatefulSet documentation says the following about its intended use:

StatefulSets are valuable for applications that require one or more of the following.

* Stable, unique network identifiers.

* Stable, persistent storage.

* Ordered, graceful deployment and scaling.

* Ordered, graceful deletion and termination.

* Ordered, automated rolling updates.
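
To make the first and third guarantees concrete, here is a minimal sketch of how a StatefulSet names its Pods and in what order it starts and stops them under the default ordered policy. The names "pg", "pg-headless", and "default" are hypothetical, and the DNS suffix assumes the default cluster domain "cluster.local". This is plain Python that mimics the naming scheme; it does not talk to a real cluster.

```python
def stateful_set_pods(name, service, namespace, replicas):
    """Return (pod_name, dns_name) pairs in the order the Pods are created."""
    pods = []
    for ordinal in range(replicas):
        # Each Pod gets a stable name: <statefulset name>-<ordinal>.
        pod = f"{name}-{ordinal}"
        # Assumes the default cluster domain "cluster.local".
        dns = f"{pod}.{service}.{namespace}.svc.cluster.local"
        pods.append((pod, dns))
    return pods


# Hypothetical names, for illustration only.
pods = stateful_set_pods("pg", "pg-headless", "default", 3)

# Pods are started from ordinal 0 upward and terminated in reverse order.
startup_order = [pod for pod, _ in pods]
shutdown_order = list(reversed(startup_order))

for pod, dns in pods:
    print(pod, dns)
```

Note what this covers: identity and ordering. Nothing in it knows whether "pg-0" is currently a master or a replica, which is the gap the rest of this post is about.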

Now we’re talking! Stable network and storage! Even ordering in how they are manipulated, so smart databases should not fall over when an upgrade is rolled out. That should be all you need, right?

Well, no.

The problem is that StatefulSet does not understand anything about what is going on inside the stateful Pods. It is an abstraction layer, and by definition, abstractions are bad at dealing with details.

Experience running stateful Pods

Sadly, many have experienced that StatefulSet does not fix all problems. It is a useful building block, for sure. But the fact is that many database systems (the primary example of stateful components) cannot work reliably if they can be terminated at will, especially without prior warning.

Consider, for example, a rolling upgrade of the nodes in the Kubernetes cluster. This needs to be done from time to time for security reasons. Say that we have a replicated master-slave database, such as PostgreSQL, running in a StatefulSet. The node hosting the database master is upgraded and reboots. The database master was busy processing some transactions. Those are likely lost. Some may have been replicated correctly to the slaves, some may not have been.

The loss of the master will trigger the election of a new master among the slaves. Note that the only way to offer the stability cited above is to re-schedule the old master Pod to the same node. Because of that, the old master rejoins a cluster that has since moved on and elected some other Pod as the new master. When the upgraded node comes back up, Kubernetes notices this and immediately starts to roll out the upgrade to the next node. Perhaps it hits the new master next. What is the state of the database cluster now? Which transactions have been committed? How happy was the database cluster about being disturbed this way?

This is not a made-up example, by the way. Of course it is not. If your heart started racing and you got a bit sweaty from reading it, you know how true it is.
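
The rolling-upgrade scenario above can be sketched as a toy simulation. Everything in it is an assumption made for illustration: the Pod and node names are hypothetical, there is one Pod per node, and the "election" simply picks the surviving Pod with the lowest name, which is not how any real database or failover manager chooses a master. It does not model lost transactions at all; it only counts how often the master is knocked over.

```python
def rolling_node_upgrade(pod_to_node, master, node_order):
    """Reboot nodes one at a time and record every forced failover.

    pod_to_node: dict mapping Pod name -> node name
    master:      name of the Pod currently acting as master
    node_order:  order in which the nodes are rebooted
    Returns (final_master, [(node, old_master, new_master), ...]).
    """
    failovers = []
    for node in node_order:
        # Every Pod on the rebooting node is terminated.
        down = {pod for pod, n in pod_to_node.items() if n == node}
        if master in down:
            # Toy election rule (an assumption): lowest-named survivor wins.
            survivors = sorted(set(pod_to_node) - down)
            new_master = survivors[0]
            failovers.append((node, master, new_master))
            master = new_master
        # The node comes back up and its Pods rejoin as replicas,
        # then the upgrade moves on to the next node.
    return master, failovers


# Hypothetical three-Pod cluster, one Pod per node.
pod_to_node = {"pg-0": "node-a", "pg-1": "node-b", "pg-2": "node-c"}
final_master, failovers = rolling_node_upgrade(
    pod_to_node, "pg-0", ["node-a", "node-b", "node-c"]
)

for node, old, new in failovers:
    print(f"reboot of {node}: master moved from {old} to {new}")
```

With this ordering the upgrade hits the original master first and the freshly elected master second, so a single routine upgrade forces two failovers in a row. Kubernetes sees only Pods and nodes here; the notion of "master" exists only inside the database.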
