So, this is a post about consistency in databases. And it comes as a result of a deep dive down a rabbit hole, with hundreds of pages of academic papers printed and countless Chrome tabs eating memory on my MacBook.
A few months ago, I set out to write a post explaining some quirks of local secondary indexes in DynamoDB. As I planned that out, I realized I needed to explain some things on eventual consistency in DynamoDB. As part of that, I realized the notion of consistency is pretty darn confusing and contains a bunch of overlapping concepts.
(Those other posts are coming, but they are not available today.)
In this post, I’ll explain my key takeaways after reviewing previous work in this space. Hopefully it will save you some time getting up to speed.
This post has two main sections. First, I’ll discuss the various definitions of the word “consistency” that are used in the distributed databases space. Then, I’ll discuss some of my issues with conversations about consistency.
If this is your first time reading one of my posts or hearing of me, some quick background: I work a lot with Amazon DynamoDB and wrote a book on it. I have pretty deep experience as a user of databases, and a good understanding of the underlying architectures and tradeoffs of various databases. However, I’m not a distributed systems expert, so think of that as from a layperson.
While first reading about consistency, I went around in circles. The definition of consistency seemed to change based on the article I was reading.
It took me a while to understand that the word “consistency” is used to describe multiple, different concepts in the database and distributed systems world.
There are three types of consistency you might read about:
Let’s examine each of these in order.
The first type of consistency is from the ‘C’ in the CAP Theorem.
In brief, the CAP Theorem states that a distributed system can only provide two of the following three features:
Further, the CAP theorem is about how your system behaves when a network partition happens — do you sacrifice availability in order to maintain consistency, or do you sacrifice consistency to maintain availability?
The key point here: The CAP Theorem is fundamentally about replication, specifically network failures during replication.
You might want to replicate the same piece of data to multiple nodes. This could be to increase resiliency of your system to failure or to reduce latency to users by locating the data closer to them. And if you have the same piece of data represented on multiple nodes, you need some sort of network connection to communicate data updates between nodes.
In the image above, we have three nodes working together as a system. A client sends a request to set the value of ‘x’ to ‘1’. Node A replicates this operation to Nodes B and C.
However, as Werner Vogels reminds us, “everything fails all the time”. And this includes your network.
In the image above, Node C is unable to communicate with Nodes A and B due to a network partition. Accordingly, it’s unable to receive updates to values that originate from Node A.
The CAP Theorem is about making a choice on how to behave when this partition happens — if isolated Node C receives a request from a client, how does your system allow it to respond?
If you choose an AP system, you allow Node C to respond with the data it has locally. Node C is available in that it can return a non-error response even though it may not have the latest version of the data.
Conversely, if you choose a CP system, you do not allow Node C to respond because its data may be stale. You are prioritizing a consistent view of the data, and thus will not allow Node C to respond, which reduces your availability (in the CAP sense).
If you choose a CA system, you’re cheating! Remember that we’re talking about what happens assuming a network partition. You can’t assume that away. :)
When you understand the precise definitions used by the CAP Theorem, you understand that it must be true. You can also see that the CAP Theorem will never be disproven and that attempts by a database vendor’s marketing department to claim otherwise should make you raise an eyebrow.
That said, you should also understand the limitations of the CAP Theorem:
It only discusses system operation during a failure. Failures happen, and it’s good to know how your database will operate when one happens. But your database is likely running normally most of the time, and there are interesting tradeoffs to consider then as well. One of these tradeoffs, between latency and consistency, is discussed below.
It addresses a singular type of failure. Not only is the CAP Theorem discussing operation during a failure, but it’s also only focused on a particular type of failure — a network partition between nodes. There are lots of other potential failures, from out-of-memory errors on a single node to a fire in a datacenter. These failures have real impact on your system’s availability (and not just the CAP definition of availability discussed in the next point), and you should understand those failure modes as well.
It uses a unique notion of ‘availability’. The CAP notion of availability requires a non-error response from every non-failing node. This means that if a client happens to reach our isolated Node C, it needs to return a successful response to be available.
Often, we think of ‘availability’ as ‘the percentage of successful responses over a time period’ (e.g. how many “nines” of availability does the system have?). A CP system could still have very high availability by forcing all clients to use the majority, non-partitioned nodes. For more on availability, check out Availability and availability by Marc Brooker.
Despite these limitations, the CAP Theorem is still useful in helping you think about the tradeoffs you have to make when replicating data in a distributed system.
A second type of consistency is from the concept of ACID transactions.
ACID refers to a set of guarantees about the behavior of a database when combining multiple operations in a single, logical operation (often called a “transaction”).
Whereas the CAP Theorem is a distributed systems concept concerning replication, ACID is purely a database concept. It can apply to single-node databases or multi-node databases. The four elements of ACID transactions are:
Consistency as used here is pretty different than consistency in the CAP sense. Consistency in ACID means that a transaction can’t result in an invalid state.
On a simple level, this means that the values in each column will adhere to their data types and that required values will be provided, though transactions don’t really add much complication here.
A more likely source of consistency issues with transactions is around something like referential integrity. If you have a foreign key set up on a table, consistency ensures that the record it points to will still exist when the transaction completes.
This property is absolutely useful, but I don’t think it’s what many people really think of when they talk about consistency tradeoffs in databases. There’s not really a tradeoff here like there are in other areas of consistency — it’s more just promising that the database will do what it says it’s going to do.
That said, the ‘I’ for ‘isolation’ in ACID is an interesting concept and more in line with consistency-adjacent topics. We’ll discuss that in the consistency models section below.
The third place you’ll see the term “consistency” when thinking about databases is with the concept of database consistency models.
I first heard of consistency models when reading Jepsen analyses. If you’re not familiar with Jepsen, it’s a blog, test suite, and consultancy service from Kyle Kingsbury. Kyle stress-tests a bunch of distributed systems to figure out their behavior under certain failure modes. They’re extremely well done and interesting. They’re also super deep in the weeds of distributed systems, so don’t be ashamed if you have to read an analysis multiple times and still feel like you only understood about 15% of it.
The Jepsen site has a nice breakdown of various consistency models, but even that is tough to parse if you don’t understand the terminology well.
As I understand it, a database’s consistency model has two main elements:
Linearizability is similar to the consistency in the CAP Theorem that we discussed earlier but even stronger. Whereas the CAP Theorem applies only during network partitions, linearizability describes how your system acts at all times. It basically states that once your system accepts and completes a write operation for a piece of data, all subsequent operations on that data should reflect that write.
Note that linearizability necessarily implies a CP system. There are less strict consistency models around linearizability that allow for better availability (in the CAP sense).
Note further that linearizability (and its weaker versions) doesn’t really seem to contemplate eventual consistency, especially in a load-balanced system like DynamoDB that doesn’t allow sticky connections to a specific node. Eventual consistency is discussed in more depth below.
While linearizability is about a single piece of data, serializability is about multiple pieces of data. More specifically, serializability is about how to treat concurrent transactions on the same underlying pieces of data.
The “safest” way to handle this is to line up transactions in the order they were arrived and execute them serially, making sure that one finishes before the next one starts. In reality, this is quite slow, so we often relax this by executing multiple transactions concurrently. However, there are different levels of safety around this concurrent execution, as we’ll discuss below.
Consistency models are super interesting, and the Jepsen breakdown is enlightening. If I had to quibble, it’s that I still don’t quite understand the interplay between the two poles of consistency models. Can I choose a lower level of linearizability (e.g. sequential) along with the highest level of serializability? Or does the existence of any level lower than linearizable mean that I’m out of the serializability game altogether?
If you understand this, hit me up! Or better yet, write up a better explanation than I ever could :). If you do, let me know so I can link it here.
Now that we’ve got the basic terminology figured out, let’s get into a bit of (friendly) criticism.
As I learned more about consistency, I found that most of the discussions I had previously had about consistency were pretty anemic. In thinking about consistency in your application, below are some of the ideas I would use to flesh out your thought process.
Note that, again, this is from a distributed systems and databases layperson. I’m not breaking new ground here, just sharing what I wish I knew before!
The CAP Theorem is useful, and the CAP Theorem is correct. No matter what you think, you cannot choose to build a CA system.
And yet, the CAP Theorem is incomplete. For example, I might hear something like the following in a discussion:
“We can’t choose $DATABASE because it’s an AP system, and we don’t want eventual consistency.”
Eventual consistency is an interesting topic, but it affects you more often than just the network partition scenario that CAP contemplates. While you should know how your system performs in the presence of a network partition, it is also true that your system will spend most of its time without a network partition. In those times, there are still important tradeoffs to consider in how your database operates.
Daniel Abadi created the PACELC Theorem to expand on the CAP Theorem. With PACELC, you distinguish between two operating modes: one during times of a network partition, and one during normal operation.
The first three letters of PACELC are the same three letters from CAP, and the analysis is the same there. Think: “in the event of a network Partition, do I choose Availability or Consistency?”
The next three letters are about how to operate without a network partition. “ELC” means ”Else, do I optimize for Latency or Consistency?”
Let’s think through this tradeoff. Remember that consistency is about replication of data across multiple machines while still showing the most up-to-date version of that data from any of the machines. However, this consistency has a cost. If I want a strong version of consistency for my distributed system, I need to ensure the data is copied to all nodes before acknowledging a write to a client. Depending on the location of my nodes and my network configuration, this could make write requests quite slow.
An alternative option is to update the data item on only a subset of the machines in my system before acknowledging the write to the client. The other nodes in the system could be updated asynchronously after the write is completed.
Note that this results in lower latency to the client at the cost of weaker consistency on that piece of data. If another client requests the data from a node that has not received the latest update, it will get a stale version of the data.
Many systems opt for this approach, often called eventual consistency and popularized by Werner Vogels and some of the Amazon systems he helped design. Both Amazon S3 and Amazon DynamoDB use eventual consistency as a way to reduce latency on requests.
Edit: I was reminded by Ireneusz Pastusiak that S3 is now strongly consistent for all object-level APIs. Thank you, Ireneusz!
I like PACELC because it helps you to think about consistency in the common case, when your database is operating normally. Can you handle the occasional stale read, and in which situations? If not, are you fine with the increased latency and reduced availability that strong consistency requires?
I do a lot of work in the NoSQL world, and a lot of my SQL friends give me a hard time about the lack of ACID transactions in most NoSQL databases (even though DynamoDB, my favorite NoSQL database, does offer ACID transactions).
The conversation might go as follows:
“Yea, NoSQL’s scalability is interesting, but I don’t want to deal with concurrency bugs by giving up ACID transactions.”
I always thought this was a pretty good point in favor of relational databases, but I’ve found that the truth is more nuanced than I’d been led to believe. When talking about consistency models and the “I” (isolation) in ACID, it’s not so straightforward.
Earlier, we discussed that serializability is the portion of consistency models that deals with transactions. Most relational databases, while claiming to support ACID transactions, do not give the full serializability guarantees for transactions. They typically offer a looser set of guarantees, such as repeatable read or snapshot isolation, which can lead to unexpected issues when performing concurrent operations.
My favorite introductory explainer of the different database isolation levels is from Alex Xu. His books on System Design Interviews are excellent as well.
For a deeper exploration of database isolation, another great resource is the Hermitage project and associated blog post from Martin Kleppmann. The Hermitage project helps test popular relational databases for isolation types and potential issues.
In the post, Martin discusses a concurrency bug from a NoSQL database that led people to think a relational database would have saved them, without understanding that the default isolation levels in those databases would have left them subject to the same problem!
None of this is to bash relational databases or to downplay the need for ACID transactions. Rather, I come advising caution in your choice of your database and recommend you spend the time to understand what it truly guarantees.
My final issue with the discussion of consistency is that most of the models and concepts are spoken in terms of absolutes, but there are some more granular tradeoffs to consider. Further, advances in technology or practices can change the calculus over time.
For example, there have been examples of database system CEOs or lead architects that claim they’ve built a “CA” system. This is partly a misunderstanding of CAP, but there’s also a nugget of truth in the fact that network partitions are less common in certain situations. As our datacenters advance and the resiliency around networks improves, the consistency vs. availability tradeoff becomes less salient.
A similar effect happens as distributed systems reduce the recovery time from a failure. While not technically within the CAP Theorem, it’s useful to also consider the effects on consistency and availability in the event of node failures (rather than network partitions).
Losing multiple nodes at the same time can lead to lower availability (in the classic, non-CAP sense) and other issues. However, the elasticity of the cloud combined with practices to reduce recovery time can greatly reduce the likelihood of concurrent node failures.
In the Amazon Aurora paper, the authors note that they can reduce recovery time by splitting data storage volumes into 10GB segments.
“Segments are now our unit of independent background noise failure and repair. We monitor and automatically repair faults as part of our service. A 10GB segment can be repaired in 10 seconds on a 10Gbps network link. We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum.” — Amazon Aurora paper
I assume this pattern was modeled from the learning in DynamoDB, which also uses 10GB partitions to segment its data.
AWS now has deep experience managing enormous fleets of 10GB storage segments and repairing them when they fail. As the likelihood of a true outage decreases, you may be willing to take the (increasingly unlikely) availability hit rather than forge ahead with a “forget consistency, we’ll figured it out later!” approach.
Finally, in thinking about the latency vs. consistency tradeoff in PACELC, the particular strategies used to replicate to the other nodes can make a difference. While we can’t go faster than the speed of light yet, many modern systems are eventually consistent with replicas mere milliseconds behind. Peter Bailis and Adi Ghodsi wrote up some findings on this in 2013, and the situation doubtless has improved since then.
The key takeaway here is to not only think about the absolutes, but the graduations between them. These absolutes help to guide and ground the discussion, but you also need to consider more subtle measures.
In this post, we looked at the different types of consistency within the database and distributed systems worlds. We also reviewed some of the issues I have with the current discussions around consistency.
Thanks to many reviewers for their comments on this post, including Aaron Francis, Rob Sutter, Luigi Berrettini, Allen Helton, Carl Sverre, and Elsie DeBrie.
This blog post is eventually consistent, and I want to converge on the correct answer. If this post has egregious errors, the fault is all mine. If you’re smarter than me about this stuff and see something you want corrected, please leave a note below or email me directly!
I’d like to say I’m standing on the shoulders of giants in writing this post, but that would overstate my contribution here. It’s more like some giants are carrying me in a BABYBJÖRN.
Below is a sampling of the most useful resources I found in learning this stuff. I’m sure there are a million others that I learned from but forgot — I had a ton. It’s a rabbit hole. If you start with these ones, you’ll find links to many other helpful resources as well.