Common Safety Pitfalls found by Jepsen

Learning about Jepsen and doing a survey of published analyses.

When Jepsen appeared in the industry several years ago, our team at PingCAP couldn't have been happier. Finally: an approachable, understandable reliability testing framework, along with an enjoyable and rigorous library of analyses of popular databases.

Jepsen is an umbrella term for the software library, collection of articles, and service of the same name. Together, the project pushes distributed systems to be more accurate in their claims and more thorough in their testing.

It's been indispensable for battle-testing TiDB and TiKV, our distributed databases, as we deploy them in some of the largest technology companies in the world.

This article is, in many ways, a love letter to one of our favorite projects. We'll talk about why Jepsen is important, how it works, the problems it commonly finds, and how we use it, and we'll close with a short summary of the problems Jepsen has found.

In the rest of this article, and in any Jepsen analyses you might read, you're going to encounter a lot of database jargon. Jepsen has our back: the Jepsen glossary defines and explains many of these terms.

We'll refer to serializability and linearizability often below, so you may benefit from quickly (re)familiarizing yourself with them.

Why is Jepsen so important to our industry?

Testing a distributed system like TiDB isn't easy! Because these systems depend on multiple nodes coordinating, there is a lot of room for failure. Networks and hardware aren't reliable, and when multiple nodes are involved, the probability that something fails is much higher. On top of that, deployments (and even individual nodes!) vary widely in scale, OS, hardware, and workload. It's impossible to test all of them.

Is all hope lost? No! In Why Is Random Testing Effective for Partition Tolerance Bugs? Majumdar and Niksic formalized many failure scenarios and discussed how even a small, random subset of possibilities can be sufficient to achieve a specific coverage goal.

It's fairly rare to see projects and articles tackle the same kind of content as Jepsen. Vendors aren't always honest (by mistake or not) about what they do and don't guarantee. An independent, rigorous project like Jepsen that tests these claims is reinvigorating compared to often banal marketing copy.

The Jepsen library helps automate and reproduce test cases, so both testers and readers can reproduce results for themselves.

How does Jepsen work?

Many projects use the Jepsen library to test their systems before requesting a formal analysis, or use the framework without ever undergoing an analysis at all.

This is great. This is the goal:

Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc.

-- The Jepsen website

When a project wishes to improve themselves, a Jepsen analysis places the Jepsen framework in the hands of an expert operator who improves and expands the test suite. Often this involves adding aggressive failure scenarios or entirely new complex tests.

Jepsen (the framework)

The tool itself, also called Jepsen, is a Clojure framework used to set up distributed systems tests, run operations against them, and verify the operation history is sane and possible given the claims of the project. Then, it can generate some nice graphs out of that data.

It's kind of like a really flexible failure-injecting fuzzer.

In many cases, Jepsen tests for linearizability using Knossos, which checks whether a given history from a test is linearizable, producing a counterexample when it is not. Jepsen also frequently tests for timing-related issues, using libfaketime to create clock skew.
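Knossos itself is a sophisticated checker written in Clojure. As a rough illustration of what "checking a history for linearizability" means, here is a toy Python sketch (all names are ours, not Knossos's): it searches for a total order of operations that respects both real-time precedence and single-register semantics.

```python
from itertools import permutations

def linearizable(history, initial=0):
    """history: list of (invoke, complete, kind, value) tuples,
    where kind is 'w' (write) or 'r' (read)."""
    for order in permutations(history):
        pos = {op: i for i, op in enumerate(order)}
        # Real-time constraint: if a completed before b was invoked,
        # a must precede b in the linearization.
        if any(a[1] < b[0] and pos[a] > pos[b]
               for a in history for b in history):
            continue
        # Register semantics: each read must see the latest write.
        value, valid = initial, True
        for (_, _, kind, v) in order:
            if kind == 'w':
                value = v
            elif v != value:
                valid = False
                break
        if valid:
            return True
    return False
```

Real checkers use far smarter search strategies, since brute force over permutations explodes factorially with history length.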

Jepsen itself doesn't mandate any specific tests; it's up to the tester to decide which claims, under which conditions, they wish to attempt to disprove. Once these claims are written down, the framework can search for histories that put the system in invalid states.

While the specifics of tests are up to the author, there is a common suite of tests for certain claims, such as those required for snapshot isolation or sequential consistency.

Common tests include register tests (concurrent reads and writes checked for linearizability), set tests (concurrent adds followed by a final read), and bank tests (concurrent transfers between accounts whose total should remain constant).

In addition to these tests, Jepsen uses nemeses: error-injection processes that introduce failures into a system. A nemesis can create partitions, kill processes, and perform other disruptive actions.

The Jepsen library ships with ready-made nemeses for network partitioning, for example splitting the cluster into random halves or isolating single nodes.
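To make the idea concrete, here is a small Python simulation of what a partitioning nemesis does (this is our own sketch, not Jepsen's Clojure API): it splits the node set into two random halves and drops any message that crosses the cut.

```python
import random

class PartitionNemesis:
    """Toy stand-in for a partitioning nemesis: splits the nodes into
    two random halves and drops messages that cross the cut."""
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.halves = None  # None means the network is healthy

    def start(self):
        # Choose a random bisection of the cluster.
        shuffled = random.sample(self.nodes, len(self.nodes))
        mid = len(shuffled) // 2
        self.halves = (set(shuffled[:mid]), set(shuffled[mid:]))
        return self.halves

    def stop(self):
        # Heal the partition.
        self.halves = None

    def delivers(self, src, dst):
        """Would a message from src to dst get through?"""
        if self.halves is None:
            return True
        a, _ = self.halves
        return (src in a) == (dst in a)
```

A simulated network layer would consult `delivers` before passing each message; Jepsen's real nemeses achieve the same effect on live clusters by manipulating the hosts' firewall rules.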

Jepsen (the analysis)

When a project requests a Jepsen analysis, it is asking Jepsen to verify the claims the system makes, rather than to check the system against some external formal model or set of requirements. For example, if the project claims to withstand single-node failures, Jepsen can test that. In many cases, the project (or a prospective customer) funds some or all of the work.

All of this is quite important: it means Jepsen analyses aren't just someone "dunking" on a project. Rather, Jepsen validates the claims the project makes in its own documentation.

The Cockpit -- Franz Harvin Aceituna; @franzharvin on Unsplash

Commonly found problems

While not all analyses map directly to specific suites, tests, or claims, some problems just kept popping up!

Responding too early

When you write data to disk, it isn't immediately durable. Instead, the data is held in a buffer until it is "flushed" to disk. If a failure occurs first, the buffered data may be lost, corrupted, or partially written. Therefore, systems that promise no data loss must flush writes before acknowledging them.

There were cases, as in Elasticsearch 1.1.0, where writes were not fsynced by default. In these cases, whether for performance reasons or by mistake, a system responded before the fsync completed. This caused lost writes during node failures.

A good way to test this yourself is to kill nodes while writes are in flight, then check on recovery whether the acknowledged data actually survived.

Membership Changes are hard

Over time, a distributed system is likely to grow, shrink, or otherwise change its membership to accommodate maintenance and load. These phases of transition often involve periods of hand-off, where failures may expose previously undiscovered bugs.

In the RethinkDB configuration analysis, Jepsen built a nemesis that would interact with RethinkDB's reconfiguration API to reconfigure the cluster. Then, in conjunction with another nemesis that creates partitions, Jepsen was able to create very aggressive membership change tests.

In several cases, testing more aggressive error modes during these times of transition led to the discovery of subtle bugs. In RethinkDB, a split brain (multiple competing sub-clusters) could occur during a transition, requiring emergency repairs by a human operator.

Split brains from a bridge

Split brains were a regular problem across Jepsen analyses. A split brain occurs when a cluster that should have a single leader somehow promotes multiple leaders.

While many projects test trivial partition scenarios (such as {A,B,C,D,E} becoming {A,B,C},{D, E}), there were cases, such as Elasticsearch testing a "bridge", where nodes on each side of a partition were only partially disconnected, yielding split brains.

An example of this is {A,B,C,D,E} becoming {A,B,C},{C,D,E}, where C is a bridge between the two partitions.

A - C - E
 \ / \ /
  B   D

Intuitively, we can see from the diagram that C should probably be the elected leader. It's totally connected. But this isn't always what happens.

In Zendisco (2014), Elasticsearch's membership system, there was no invariant preventing two leaders from existing, so it was possible for two nodes to each believe themselves to be in a majority and become elected leaders.
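A naive "count my visible peers" quorum check illustrates how a bridge defeats leader election: because C is visible from both sides, both halves of the cut can count it toward a majority. This sketch (the adjacency and function are our illustration, not Elasticsearch's code) encodes the bridge topology from the diagram above:

```python
# Visible peers under the bridge partition {A,B,C} / {C,D,E}.
links = {
    'A': {'A', 'B', 'C'},
    'B': {'A', 'B', 'C'},
    'C': {'A', 'B', 'C', 'D', 'E'},
    'D': {'C', 'D', 'E'},
    'E': {'C', 'D', 'E'},
}

def thinks_majority(node, cluster_size=5):
    # Naive election guard: "can I see more than half the cluster?"
    return len(links[node]) > cluster_size // 2
```

Since every node passes this check, both sides of the bridge can elect a leader. This is why consensus protocols pair quorum counting with invariants like epochs or terms and a one-vote-per-node-per-term rule.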

Poor behavior on failed connections

Some projects suffered from connection behavior that is not ideal for operators. TiKV issue #4500 is an interesting example. On startup, TiKV 2.1.7 would retry a failed PD connection for only up to 60 seconds; if it lost its connection to PD while already online, however, it would retry until recovery.

In this case, if PD failed and a TiKV node then restarted before PD recovered, that node would exhaust its 60-second startup window and give up, leaving TiKV nodes that would never recover.

Here's an example of a problematic case involving 1 TiKV node, and 1 PD node:

kill(pd);
restart(tikv);  // TiKV enters its 60-second startup retry loop.
sleep(61_000);  // Wait just past the retry window.
recover(pd);
// TiKV never recovers its connection.

Dgraph behaved similarly in the interactions between its Alpha and Zero nodes.
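One hypothetical mitigation for this class of problem is to retry the connection indefinitely with capped exponential backoff, rather than giving up after a fixed window. A sketch (function and parameter names are ours, not TiKV's):

```python
import time

def connect_with_backoff(connect, base=0.5, cap=30.0, clock=time.sleep):
    """Retry connect() forever, doubling the delay up to `cap` seconds."""
    delay = base
    while True:
        try:
            return connect()
        except ConnectionError:
            clock(delay)               # wait before the next attempt
            delay = min(delay * 2, cap)  # cap the backoff growth
```

Capping the delay keeps recovery time bounded once the peer comes back, while never abandoning the connection entirely.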

Inappropriate retries

Both CockroachDB and TiDB were recently shown to have issues where they could improperly retry a transaction using a newer timestamp. In CockroachDB, this could result in duplicate writes. In TiDB, it could lead to read skew or lost updates, indicated by a drifting total in the bank test.
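The bank test referenced above models concurrent transfers between accounts: since every transfer debits one account and credits another, the total balance must stay constant, and any drift signals lost or duplicated updates. A toy sketch of the invariant (our own illustration, not Jepsen's implementation):

```python
def transfer(accounts, src, dst, amount):
    # In a real database, this debit and credit must be one atomic
    # transaction; here we just model the intended effect.
    if accounts[src] >= amount:
        accounts[src] -= amount
        accounts[dst] += amount

def total_is_invariant(accounts, expected_total):
    # A drifting total indicates lost updates or duplicated retries.
    return sum(accounts.values()) == expected_total
```

An improper retry that re-applies only half of a transfer is exactly the kind of fault that makes the total drift.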

In TiDB's case, these could occur under the default configuration, which enabled automatic transaction retries.

Misleading or incorrect advertising

Projects like Aerospike suffered from making plainly false claims, such as "100% Uptime", on their marketing pages. Such claims only damage a project's credibility: they suggest either deliberate dishonesty or a lack of understanding of fundamental principles.

Jepsen testing showed that these claims were too good to be true. Problems like these remind us all to evaluate systems properly against our own needs, rather than trusting marketing copy.

Incorrect claims by developers

Often, Jepsen analyses test claims or "sanctioned use cases" put forward by the developers. For example, RabbitMQ's developers claimed it could support distributed semaphores, but it was not able to provide the guarantees commonly associated with those structures. In the case of NuoDB, a developer made bold claims about the database's capabilities in relation to the CAP theorem.

While these claims don't sit on the glossy marketing pages of the product, developers commonly turn to official blogs and mailing lists to learn the capabilities and limits of a system. A project's developers are typically considered trustworthy sources about their product, so claims they make are often taken as true by users.

Interesting notes

During my review of Jepsen's history, some interesting notes came up that may be of interest. Here are a few of them; hopefully they encourage you to read the analyses for yourself.

Serializability vs strict serializability: causal reverse

In the CockroachDB analysis, Jepsen discusses the difference between strict serializability and serializability. Because CockroachDB offers serializability, but not strict serializability, it was possible for reads to fail to observe the most recent transactions.

Strict serializability means that operations appear to take effect in some total order consistent with their real-time ordering. It implies both serializability and linearizability.
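The "causal reverse" anomaly can be phrased as a real-time ordering violation. Suppose T1 commits and completes before T2 even begins, yet the database serializes T2 before T1: the history may still be serializable, but it is not strictly serializable. A toy check for this condition (our own sketch, not CockroachDB's):

```python
def violates_real_time(serial_order, real_time_pairs):
    """serial_order: transaction ids in their apparent serial order.
    real_time_pairs: (a, b) pairs meaning a committed before b began.
    Strict serializability forbids serializing b before a."""
    pos = {t: i for i, t in enumerate(serial_order)}
    return any(pos[a] > pos[b] for a, b in real_time_pairs)
```

Under plain serializability, `[T2, T1]` is a perfectly legal serial order even when T1 finished before T2 started; strict serializability rules it out.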

Error message compatibility

In ElasticSearch - 2014-06-15 - 1.1.0 it was noted that different JVMs serialize classes differently (e.g., InetAddress), meaning clients and servers could be unable to correctly communicate error messages coming from a node running a different JVM.

This is issue #3145:

[2013-06-05 09:35:20,480][WARN ][action.index         	] [es-1] Failed to perform index on replica [rules][4]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream

Using tools such as Protobuf or Cap'n Proto can help prevent these kinds of issues, as they have well-defined backward-compatibility semantics.

How we use Jepsen with TiDB

The team at PingCAP uses Jepsen to test TiDB (which uses TiKV) frequently. This includes tests developed for the official Jepsen Analysis for TiDB.

Each new release of TiDB and TiKV passes a check ensuring that there are no claims we have made which the Jepsen framework disproves.

The Appendix -- Lysander Yuen; @lysanderyuen on Unsplash

Appendix I

This is a summary of the problems discovered in each report.

Note: Jepsen's feature set has expanded over time, so these results are somewhat skewed: some problems detected in newer reports may not have been detectable in earlier ones.

This article was originally published on PingCAP's blog; it has been edited to fit the aesthetic and tone of this blog. Thanks to Kyle (@aphyr) for review.