
How do distributed databases handle secondary indexes? A survey

Alex DeBrie · 17 min read

I've done a lot of work with DynamoDB over the years, so my mental model of how distributed databases work is very much based on DynamoDB. But not all distributed databases work like DynamoDB! Sometimes when I venture out into the world of other distributed databases, I'm surprised by how they handle things differently.

In this post, I want to look at a specific aspect of distributed databases: how they handle secondary indexes.

This post was spurred by a post I wrote about DynamoDB secondary indexes. When tweeting about that post, I mentioned that DynamoDB global secondary indexes are unique in that they reshard onto different shards than the primary index. This is different than other distributed databases with which I was familiar (MongoDB, Vitess), and I figured most distributed databases worked like those.

The excellent Franck Pachot of Yugabyte replied on Twitter and noted that other distributed databases, including Yugabyte, reshard their secondary indexes. Not only that, but they reshard synchronously during write operations. Franck wrote up a helpful performance test for this on Yugabyte.

This was all interesting to me, so I did more digging. Hence, this article.

In this post, we'll look at the various strategies that distributed databases use for secondary indexes. There are four main strategies used, and I'll discuss the pros and cons of each.

Background

Before we start, I want to give a bit of background. First, I want to talk about the general sharding mechanism for distributed databases and why secondary indexes can be tricky. Second, I want to talk about what I mean when I'm talking about the benefits or drawbacks of a particular approach.

Feel free to skip this section if you feel like you already understand distributed databases. But, if you get mad when I talk about the benefits and drawbacks of various approaches, make sure you at least read the 'Benefits, Drawbacks, and Comparisons' section!

Sharding in distributed databases

Over the last 15 - 20 years, there's been an increase in the use of distributed databases. In the public eye, it mostly started with the rise of NoSQL (MongoDB, Cassandra, DynamoDB, Elasticsearch) in the 2008 - 2012 time range. We've since seen a rise of SQL-based distributed databases (Cockroach, Vitess, Yugabyte, TiDB, etc.) in the last ten years.

For each of these distributed databases, the key insight is to horizontally scale the database by sharding the data. Rather than holding all of the data on a single machine, the data is split into multiple pieces and stored on multiple machines. To determine which machine holds which data, you'll usually declare some shard key or shard function. This function will take the primary key of the record and determine which shard the record should be stored on.

Note: Replication is another way to turn your single-node database into a distributed system, but replication is different than sharding. While sharding is about splitting the data into multiple pieces, replicas are about copying the data to multiple machines. You can have a sharded database with no replicas, a replicated database with no sharding, or a database that is both sharded and replicated.

The major benefit of sharding is the horizontal scalability it enables -- as your data grows, you can add more machines to store that data. This is in contrast to vertical scaling, where you would need to add more resources to a single machine to handle the increased load. Vertical scaling is limited by the size of the machine, while horizontal scaling is limited by the number of machines you can add.

When a request comes into a sharded database, the database will first determine which shard the request should go to. This is usually done by applying the shard function to the primary key of the record. The database will then route the request to the correct shard, which will handle the request.
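
To make this concrete, here's a minimal sketch of hash-based shard routing in Python. The function name, shard count, and keys are all hypothetical rather than taken from any particular database:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Map a key to a shard by hashing it.

    Real databases use consistent hashing or range-based partition
    maps, but the core idea is the same: the shard is a pure
    function of the key, so any node can compute it locally.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


# A request that includes the primary key can be routed in one hop.
print(shard_for("order-1234"))  # some shard in [0, NUM_SHARDS)
```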

Often (but not always!), this routing will be handled by a coordinator node that is responsible for sending the request to the correct shard, while the shard itself is responsible for the actual data management.

For many (but not all!) of these databases, the goal is to reduce the number of nodes that are required to handle a request. This is for a few reasons. Network hops are slow, so you want to minimize the number of hops a request needs to make. The more hops and nodes required, the greater the chance that one of them will be slow. Further, you'll get better scalability from adding a single node if it will be involved in fewer requests.

Some databases take this to an extreme. DynamoDB, for example, wants to direct every request to a single partition (with some rare caveats). Some distributed databases are more willing to do scatter-gather queries, where the coordinator node will fan out the request to all shards and collect the results. Elasticsearch is basically the complete opposite of DynamoDB: almost every read request will hit all the shards.

Because of this, choosing a good shard key is important. You'll want to choose something with high cardinality (so it spreads across your shards) that is used in almost all of your queries (so you can easily determine which shard(s) to query).

A good example in a multi-tenant system is something like the OrganizationId. If each relevant record belongs to a particular Organization, you can use the OrganizationId as your shard key and include it in all of your queries. This will allow you to easily determine which shard to query and ensure that your queries are directed to a single shard.

The harder question -- and the task this post is about -- is how to handle secondary indexes that don't include the shard key.

In our multi-tenant example above, perhaps an Order belongs to an Organization. But what if you need to locate an Order for processing without knowing its OrganizationId? You'll need a secondary index on the Order table that allows you to query by OrderId. And different distributed databases handle this in different ways.
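
Reusing the hypothetical shard_for function from the sketch above, the problem looks roughly like this:

```python
# Orders are sharded by OrganizationId, so writes are easy to route.
order = {"OrganizationId": "org-42", "OrderId": "order-1234", "Total": 100}
target_shard = shard_for(order["OrganizationId"])  # one known shard

# But a lookup that only has the OrderId can't compute the shard:
# shard_for("order-1234") hashes the wrong attribute entirely.
# The database either needs a secondary index that is itself
# routable by OrderId, or it has to ask every shard.
```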

Benefits, Drawbacks, and Comparisons

In the sections below, I'm going to talk about four different strategies that distributed databases use for secondary indexes. For each strategy, I'll describe the strategy and talk about the various benefits and drawbacks of that strategy.

I know I'm going to get complaints about the benefits and drawbacks, so I'm trying to head that off here.

With distributed databases, as with many things, there are tradeoffs everywhere. There are multiple design decisions that will greatly impact the performance and availability of your system.

In describing the strategies below, I'm only comparing the benefits and drawbacks of a particular strategy to the exact same database using a different strategy. If I say that a particular Strategy A is faster than a different Strategy B, it doesn't mean that any database that uses Strategy A is necessarily faster than any database that uses Strategy B. Rather, it means that the same database using Strategy A will be faster than the same database using Strategy B.

Imagine that you are running a mile. If you run a mile with a 10 lb weight on your back, you'll be slower than if you run a mile without the weight. But this doesn't mean that my five year old son running a mile without a weight on his back will be faster than you. The weight is a drawback, but it's a drawback relative to what you would do without it. It doesn't necessarily mean that you're slower than everyone else without that weight.

My go-to technical example here is in comparing Dynamo (the in-house database built within Amazon in the early 2000s for critical applications) and DynamoDB (the managed database service from AWS created in 2012). While similar in some ways, Dynamo and DynamoDB are quite different in others. In particular, they are different in the CAP theorem sense.

Dynamo was an AP system, while DynamoDB is a CP system. But this doesn't mean that DynamoDB is less available than Dynamo! Someone (I think Marc Brooker? But I can't find the specific tweet) noted that DynamoDB is a consistent, highly available system. Because of the DynamoDB team's operational work on the resiliency of the system, DynamoDB is probably more available than the self-managed Dynamo instances run by various Amazon teams. DynamoDB is less available than it would be if it were an AP system, but it's still pretty dang available, and the benefits of consistency are worth the last trailing nines of availability.

All of this is a long-winded way to say that you shouldn't get mad at me if I say your favorite database uses a comparably slower strategy for secondary indexes. I'm not saying that your favorite database is slow. I'm saying that your favorite database is slower than it would be if it used a different strategy. And there are likely good reasons why your favorite database uses the strategy it does!

The four approaches to secondary indexes

With our background out of the way, let's look at the different ways that distributed databases handle secondary indexes. There are four main approaches that I've seen, and you can figure them out with a bit of a decision tree:

Flowchart of options

  • First, do secondary indexes exist in the same shard as the primary index, or are the items resharded onto different shards?
  • Then, if secondary indexes are resharded, is this done synchronously or asynchronously?
  • Alternatively, if secondary indexes are not resharded, are queries allowed to span multiple shards?

Let's walk through each of these approaches.

Approach 1: Secondary indexes asynchronously resharded onto different shards

  • Implementations: DynamoDB Global Secondary Indexes, MongoDB Materialized Views, Rockset
  • Benefits: Faster + more available writes; fast reads from secondary indexes
  • Drawbacks: No strongly consistent reads from secondary indexes; no unique constraints on secondary index values

Where else could I start but with DynamoDB? I already cover the details of DynamoDB secondary indexes in the post that spurred this one, so I won't cover it too much here. But the basic idea is that DynamoDB as a whole forces you to use your partition key in accessing your data. This goes for not only your main table but for your global secondary indexes as well.

When a write comes in to DynamoDB, it will first ensure that the write is committed to the main table. After the write has been durably committed to two of the three nodes in the partition group, then the write will be acknowledged to the client. At that point, the write will be replicated to the global secondary indexes. This replication is done asynchronously, so there is a chance that a read from the secondary index will not see the latest data.

The big benefit here is that writes will remain fast and available, regardless of the number of secondary indexes you have. You're only waiting on 2 nodes, rather than 2 * (1 + number of secondary indexes) nodes. This is important for DynamoDB, which wants to be predictable in its performance.

You also get predictably fast reads when querying a secondary index. Because it's resharding the data, your request can quickly and reliably be routed to the correct shard.

But tradeoffs abound. The downside of this approach is the inability to do strongly consistent reads from the secondary index. Because the write is propagated asynchronously to the secondary indexes, there is a chance that a read from the secondary index will not see the latest data.
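
To make the tradeoff concrete, here's roughly what querying a global secondary index looks like with boto3. The table and index names are made up; the relevant point is that GSI reads are always eventually consistent, and DynamoDB rejects ConsistentRead=True on them:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Query a GSI keyed on OrderId. The GSI lives on its own partitions,
# so this read is routed to a single partition -- but it may lag the
# base table slightly, and ConsistentRead=True would be rejected here.
resp = dynamodb.query(
    TableName="Orders",
    IndexName="OrderIdIndex",
    KeyConditionExpression="OrderId = :oid",
    ExpressionAttributeValues={":oid": {"S": "order-1234"}},
)
items = resp["Items"]
```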

I've argued before that strongly consistent reads aren't needed as often as you think, but it can be a problem for some use cases.

The excellent Rick Houlihan mentioned that MongoDB's on-demand materialized views are similar. To keep your materialized view up-to-date, you can set up an Atlas trigger that will update the view when the main table is updated. It's basically the same idea as DynamoDB secondary indexes, but you're responsible for the trigger.
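
As I understand it, the view refresh is just an aggregation pipeline that ends in a $merge stage, and the trigger's job is to re-run it. A rough pymongo sketch, with made-up collection and field names:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.shop

# Refresh a materialized view of orders keyed by OrderId. In Atlas,
# a trigger would run this whenever the source collection changes;
# the view is just another collection you can index however you like.
db.orders.aggregate([
    {"$project": {"_id": "$OrderId", "OrganizationId": 1, "Total": 1}},
    {"$merge": {
        "into": "orders_by_order_id",
        "whenMatched": "replace",
        "whenNotMatched": "insert",
    }},
])
```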

Rockset is also similar to this, but for a variety of databases. Rockset subscribes to writes from your database and replicates them to their own storage for querying. This is sort of different since we're not talking about a single system, but the idea is the same: the secondary index is not updated synchronously with the main table.

Approach 2: Secondary indexes synchronously resharded onto different shards

  • Implementations: Yugabyte, TiDB, Cockroach, maybe Vitess secondary Vindexes?
  • Benefits: Fast reads from secondary indexes; strongly consistent reads from secondary indexes; unique constraints on secondary index values
  • Drawbacks: Slower, less available writes

The second approach is the one Franck mentioned to me on Twitter. In this approach, writes are resharded to secondary indexes synchronously with the main table. This means that the write is not acknowledged until it has been durably committed to the main table and all secondary indexes.

Here, we retain the benefit of fast reads from secondary indexes as the data is resharded to the correct shard. But we also get the benefit of strongly consistent reads from the secondary index. Because the write is not acknowledged until it has been durably committed to the secondary index, we can be sure that a read from the secondary index will see the latest data.
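
Since Yugabyte's YSQL layer is PostgreSQL-compatible, a sketch of this with psycopg2 might look like the following. The table and column names are mine; the point is that the index (including a unique one) can be keyed on a column other than the shard key, and it's maintained as part of the write:

```python
import psycopg2

# Default local YSQL connection settings; adjust for your cluster.
conn = psycopg2.connect(
    "host=localhost port=5433 dbname=yugabyte user=yugabyte"
)
cur = conn.cursor()

# The secondary index gets its own shards, keyed by order_id, and is
# updated synchronously as part of each write to the orders table.
cur.execute("CREATE INDEX orders_by_order_id ON orders (order_id)")

# A unique constraint works too, because the index shard can check it
# before the write is acknowledged.
cur.execute(
    "CREATE UNIQUE INDEX orders_tracking_unique ON orders (tracking_number)"
)
conn.commit()

# This query routes straight to the index's shard and reflects the
# latest acknowledged write.
cur.execute("SELECT * FROM orders WHERE order_id = %s", ("order-1234",))
rows = cur.fetchall()
```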

The cost of this is paid on the write side. When you're writing a new record to a table, you need to wait for durable commits to all nodes involved in the operation, which includes not just the main table but all secondary indexes.

This seemed like a no-go to me, but Franck made a compelling case that the performance impact is manageable. In the standard case, adding a secondary index added about a millisecond to write requests. Adding a unique secondary index added about 4-5 milliseconds. He also discusses the availability hit from a down node in the cluster, which should be limited to a few seconds.

It's great to have some data on this. I'd like to see a larger example to get a sense of tail latencies and how it works at scale, but it's still great to see! I really appreciate the work to quantify this rather than just letting me get away with 'it's slower'. :)

Kudos to Franck Pachot for showing me that these exist.

Approach 3: Secondary indexes not resharded, but queries allowed to span multiple shards

  • Implementations: MongoDB, Vitess, Cassandra, Elasticsearch
  • Benefits: Fast writes; flexible reads
  • Drawbacks: Slower reads when not using shard key; harder to scale reads

The third approach is to not reshard the secondary indexes at all. Instead, the secondary index is stored on the same shard as the main table. This means that writes to the main table and the secondary index can be done without a network hop.

This is a big benefit on the write side. Writes can be fast and available, regardless of the number of secondary indexes you have. It's comparable to the DynamoDB secondary index approach in this way.

It gets trickier on the read side. What happens if your secondary index doesn't use the shard key of the table?

Most distributed databases that don't reshard the data for a secondary index will turn this into a scatter-gather query. The initial request will hit some sort of coordinator node that is responsible for fanning out the query to all shards. The coordinator will collect the results and return them to the client.
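
In rough pseudocode, the coordinator's job looks something like this toy sketch (not any particular database's implementation):

```python
from concurrent.futures import ThreadPoolExecutor


def query_shard(shard, filter_expr):
    """Stand-in for a network call to one shard's local index."""
    return shard.local_index_lookup(filter_expr)


def scatter_gather(shards, filter_expr):
    # Fan the query out to every shard in parallel...
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(query_shard, s, filter_expr) for s in shards]
        # ...then wait for all of them and merge the partial results.
        # Overall latency is set by the slowest shard, and every shard
        # does work for every one of these queries.
        results = []
        for future in futures:
            results.extend(future.result())
        return results
```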

Because you're hitting multiple nodes, this necessarily means that reads will be slower than if you were hitting a single node. Further, the more nodes you add to the cluster, the slower the read will be. Each additional node raises the chance that one of the nodes will be slow, which will slow down the entire query.

Further, you don't get the full benefit of horizontal scalability on the read side. As you scale your cluster by adding more nodes, each node still serves a baseline amount of traffic for these scatter-gather queries. You won't get perfect linear scalability as you would with something like DynamoDB.

Finally, any unique indexes will likely need to include your shard key. A system that is making this choice is not going to want to do cross-shard operations, so unique constraints will need to be enforced on a single shard.

Outside of DynamoDB, this is the approach I was most familiar with. MongoDB, Vitess, and Cassandra all work this way.

Elasticsearch also works this way, almost as its default mode. While DynamoDB wants to direct every query to a single partition, Elasticsearch almost assumes that every query will hit every shard.

Approach 4: Secondary indexes not resharded, and queries not allowed to span multiple shards

  • Implementations: DynamoDB Local Secondary Indexes
  • Benefits: Fast writes; fast reads; strongly consistent reads
  • Drawbacks: Limited flexibility in queries

The final approach also does not reshard the secondary indexes. But, in this approach, queries are not allowed to span multiple shards. This means that the secondary index is stored on the same shard as the main table, and queries are still directed to a single shard.

In order to make this work, you need to only allow secondary indexes that contain the shard key. But this severely limits your query flexibility! While a different index with the same shard key as the main table can be helpful, it's often the case that you want a completely different query pattern. This approach is not going to work for you in that case.

The only place I know of that uses this approach is DynamoDB Local Secondary Indexes. In this case, the secondary index is stored on the same partition as the main table, but you must use the same partition key for the local secondary index as you do for the main table.

The benefits here are nice -- you get fast writes that don't include a network hop, and you get fast reads that are directed to a single partition. Further, you get strongly consistent reads from the secondary index, as the data is copied synchronously to the secondary index.
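
For illustration, here's roughly what this looks like with boto3 (the names are made up): the local secondary index must reuse the table's partition key, and in exchange it supports ConsistentRead.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# The LSI shares the table's partition key (OrganizationId) and only
# swaps in a different sort key (CreatedAt), so its entries live on
# the same partition as the base items.
dynamodb.create_table(
    TableName="Orders",
    AttributeDefinitions=[
        {"AttributeName": "OrganizationId", "AttributeType": "S"},
        {"AttributeName": "OrderId", "AttributeType": "S"},
        {"AttributeName": "CreatedAt", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "OrganizationId", "KeyType": "HASH"},
        {"AttributeName": "OrderId", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "ByCreatedAt",
        "KeySchema": [
            {"AttributeName": "OrganizationId", "KeyType": "HASH"},
            {"AttributeName": "CreatedAt", "KeyType": "RANGE"},
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    BillingMode="PAY_PER_REQUEST",
)

# (Table creation is asynchronous; wait for the table to be ACTIVE
# before querying.) Unlike a GSI, an LSI query can request a
# strongly consistent read.
dynamodb.query(
    TableName="Orders",
    IndexName="ByCreatedAt",
    KeyConditionExpression="OrganizationId = :org AND CreatedAt > :ts",
    ExpressionAttributeValues={
        ":org": {"S": "org-42"},
        ":ts": {"S": "2024-01-01"},
    },
    ConsistentRead=True,
)
```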

The main downside is the lack of query flexibility. However, there's another downside in the DynamoDB case -- you are limited in the size of data that shares the same partition key value. DynamoDB wants to keep its partitions small. While it can split items with the same partition key across multiple partitions, it will not do that if there are local secondary indexes. It wants to ensure that any individual write can be completed without a cross-partition operation. Because of that, it strictly limits the size of the data that can be stored with the same partition key (the item collection is capped at 10 GB).

I'm not well versed enough with other databases to know if this is a common approach or not. My guess is that it's a quirk of DynamoDB's unique focus that they chose this option. I'd be interested to hear if you know of other databases that use this approach!

Conclusion

In this article, we looked at the different ways that distributed databases handle secondary indexes. Secondary indexes can cause some trouble for how distributed databases want to shard data, so I thought it was interesting to see the variety of approaches in how they handle it.

I'm sure these categories are incomplete and even a bit wrong. If you notice something I missed, reach out to me and let me know! I'd love to hear about it.