
My DynamoDB Wish List

DynamoDB is a solid, well-loved product, but that hasn’t stopped the DynamoDB team from innovating. At last year’s re:Invent, we saw two huge announcements. DynamoDB Transactions brought transactions to DynamoDB and made it easier to handle complex, multi-item operations in a single request. DynamoDB On-Demand Pricing let you forget about capacity planning and only pay for what you use.

But as AWS customers, we still want more. It’s the reason Jeff Bezos loves us — we are ‘divinely discontent’. In this post, I lay out my two big #awswishlist items for DynamoDB.

The two features (plus 1 bonus request!) I wish DynamoDB would add the most are:

  1. Filtered DynamoDB Streams
  2. More Redis-like operations on item attributes
  3. (Bonus) Increase max item size to 1MB

Let’s explore each of these below.

Filtered DynamoDB Streams

DynamoDB Streams are a powerful feature from DynamoDB. If you enable DynamoDB Streams, you get a changelog stream describing the operations on your table. Whenever you insert, update, or delete an item, a record will be dropped into the stream to indicate the change.

These streams unlock a lot of use cases, such as:

  • Updating marketing systems when a new user has signed up;
  • Providing a queue-like system with a data persistence layer;
  • Offloading analytics into a more OLAP-like system.

With the rise of microservices, event-driven architectures and tools like Apache Kafka, getting data out of a database for use across services has been a big focal point. Retrofitting existing databases to allow for these changelog-like features has been difficult, but DynamoDB provides it out of the box.

For more on databases and streams, check out Turning the Database Inside out by Martin Kleppmann. It’s one of my favorite talks of all time.

While DynamoDB Streams are great, they're not as powerful as I'd like. One problem with DynamoDB Streams is that a stream is a firehose: if I want to read any records from the stream, I have to take them all. But often my use cases don't care about all updates. They only care about certain ones.

Currently, this means a lot of wasted compute in my Lambda function as I perform some filter logic at the beginning of the function and return early if the record doesn’t match my conditions.
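As a sketch of that workaround, here's the shape of the early-return filtering I do in a Lambda handler today (the CouponCode attribute and helper name are hypothetical examples of mine):

```python
def matching_records(event):
    """Keep only stream records whose new image has a CouponCode attribute."""
    return [
        r for r in event['Records']
        if 'CouponCode' in r.get('dynamodb', {}).get('NewImage', {})
    ]

def handler(event, context):
    # Every record on the stream arrives here; we pay to discard most of them.
    for record in matching_records(event):
        pass  # real processing of matching orders would go here
```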

I’d love to be able to filter my DynamoDB Stream and only receive records that match certain conditions. I think this could be handled in one of two ways.

1. Provide an expression to evaluate the stream record

My preferred way to handle this would be to allow me to specify an expression that will be evaluated on each DynamoDB Stream record. If the record matches the expression, the record will be passed to the consumer. If not, it will be ignored.

DynamoDB already has a rich expression syntax for use in Query Expressions, Filter Expressions, and more. This syntax could be re-used by DynamoDB Streams to enable filtered streams.
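To see what I mean, here's that same expression syntax in a Filter Expression on a Query today. The parameters below are a sketch with made-up table and key names; you'd pass them to boto3 with `boto3.client('dynamodb').query(**params)`.

```python
# Query one customer's orders, returning only orders that used a coupon.
# The Filter Expression reuses DynamoDB's existing expression syntax.
params = {
    'TableName': 'CustomerOrders',
    'KeyConditionExpression': 'CustomerId = :id',
    'FilterExpression': 'attribute_exists(CouponCode)',
    'ExpressionAttributeValues': {':id': {'S': 'customer-1234'}},
}
```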

For a quick example, imagine I had a DynamoDB table that was storing customer orders from my e-commerce store. I want to record all orders that include a coupon code so that I can track the efficacy of my marketing efforts. I could use the following mock command in Boto3 to create my filtered stream:

import boto3

client = boto3.client('dynamodb')

client.create_filtered_stream(
  TableName='CustomerOrders',
  StreamFilterExpression='attribute_exists(CouponCode)'
)

Now, only items with the CouponCode property would be sent into my filtered DynamoDB stream.

One final note — the expression syntax would need to be extended to allow for filtering on meta-properties of the DynamoDB event itself, such as whether it's an INSERT, MODIFY, or REMOVE event.

2. Allow users to add streams to indexes

A second option, likely easier for AWS to implement but less beneficial to users, would be to let users create streams from secondary indexes.

Currently, you can only create DynamoDB Streams from your main DynamoDB table. And for most secondary indexes, the stream would be the exact same as your main table. However, using the power of sparse indexes, a stream on a secondary index would give you the power of filtered streams.

Sparse indexes are secondary indexes that don’t contain all items from the underlying table. DynamoDB will only copy items into your index that contain all elements of your key schema.

Using our coupon code example from above, we could create a secondary index that used the CouponCode property in its key schema. Only items that had a coupon code would be replicated into that index. If I could then attach a stream to that index, it would solve the filtering problem in my DynamoDB stream.
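As a sketch, here's what creating that sparse index could look like. The parameters below go to `boto3.client('dynamodb').update_table(**params)`; the table and index names are hypothetical.

```python
# Add a GSI keyed on CouponCode. Because the key attribute is optional,
# only items that actually have a CouponCode are copied into the index.
params = {
    'TableName': 'CustomerOrders',
    'AttributeDefinitions': [
        {'AttributeName': 'CouponCode', 'AttributeType': 'S'},
    ],
    'GlobalSecondaryIndexUpdates': [{
        'Create': {
            'IndexName': 'CouponCodeIndex',
            'KeySchema': [{'AttributeName': 'CouponCode', 'KeyType': 'HASH'}],
            'Projection': {'ProjectionType': 'ALL'},
        },
    }],
}
```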

I don’t like this solution as much as the first, as I’d be forced to pay for an additional, unused index. Further, I may have to alter the structure of my DynamoDB items (such as adding a property that only exists on desired items) to account for non-application use cases. That said, I’d be willing to do this if it gave me my filtered stream.

Redis-like operations on DynamoDB attributes

The second big wishlist item I have for DynamoDB is for it to enable more sophisticated operations on attributes. The main comparison here is Redis, which is a flat-out amazing piece of software.

Many people use Redis as a caching service due to its ridiculous performance — you can get hundreds of thousands of requests per second on a Redis instance. While DynamoDB performance is impressive, Redis is at least an order of magnitude faster.

But Redis performance isn’t what I want. A second reason people love Redis is due to the power of its data types and commands. Redis is more than a simple key-value cache. It’s an object server which allows a few different data types — strings, lists, sets, and more — and allows you to run powerful operations on them.

This is what I want for DynamoDB. Give me more flexible operations on DynamoDB attributes. We’ve already got a powerful set of attribute types with DynamoDB, including lists, maps, and sets. Now let me operate on them more easily.

Let’s check out a few of the core use cases I’d love to see.

1. Capped lists

One of the most common Redis use cases is to use the combination of LPUSH and LTRIM to maintain a list with the most recent X number of elements.

The LPUSH command allows you to insert an element at the beginning of a Redis list. The LTRIM command allows you to truncate a list to a given range. Combining the two allows you to push elements to the beginning of a list and immediately truncate it to ensure the list is no longer than a given length.

To make it real, imagine you had a social application where you wanted to store the ten most recent actions for a given user. If this was implemented in DynamoDB, you could use LPUSH to add a new action to the latestActions attribute for the given user, then use LTRIM 0 9 on the latestActions attribute to store only the ten most recent actions.
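Since DynamoDB doesn't offer these commands today, here's a rough Python sketch of the LPUSH + LTRIM semantics I'm after (the helper name is mine):

```python
def push_capped(actions, new_action, max_len=10):
    """LPUSH + LTRIM 0 (max_len - 1): prepend, then cap the list's length."""
    actions.insert(0, new_action)  # LPUSH: the newest action goes first
    del actions[max_len:]          # LTRIM 0 9: drop anything past ten items
    return actions
```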

2. Popping elements off a set

Another awesome action from Redis is the SPOP command. SPOP allows you to remove one or more random elements from a set. The removed elements will be sent back to the client that removed them.

This operation can be good for a number of use cases. One is for working through a queue of work in a semi-random fashion, as you can enqueue the items in a set and then use SPOP to choose the next item to work on.

Another example given in the Redis documentation is to model a card game. You can use a set to represent all the cards, then use SPOP to deal out cards to the players.
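A plain-Python sketch of those SPOP semantics, using the card game example (the deck encoding is my own):

```python
import random

def spop(members, count=1):
    """Mimic Redis SPOP: remove and return `count` random members of a set."""
    popped = random.sample(list(members), k=min(count, len(members)))
    members.difference_update(popped)
    return popped

# A 52-card deck: rank + suit, e.g. 'AS' is the ace of spades.
deck = {rank + suit for rank in '23456789TJQKA' for suit in 'SHDC'}
hand = spop(deck, 5)  # deal a five-card hand
```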

3. Sorted sets for leaderboards and priority queues

The first two examples are useful, but they're merely convenience operations on existing DynamoDB data types. In this example and the next, we'd be adding new data types to DynamoDB.

The first data type is the Sorted Set. A Sorted Set combines the uniqueness properties of a regular set with ordering characteristics. This is helpful for a number of use cases.

The canonical use case for a Sorted Set is a game leaderboard. As you add users to a Sorted Set with their scores, it's simple to retrieve data like the top ten users, the bottom ten users, or all users with scores within a given range.
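In plain Python terms, the queries a Sorted Set makes cheap look something like this sketch (storing each user's score in a dict; function names are mine):

```python
def top_n(scores, n):
    """Leaderboard query: the n highest-scoring (user, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

def users_in_range(scores, low, high):
    """All users whose scores fall within [low, high]."""
    return sorted(user for user, s in scores.items() if low <= s <= high)
```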

Another place I’ve found Sorted Sets really useful is in implementing priority queues. Imagine you’re adding tasks to a queue that you’d like processed, and you have High, Medium and Low priorities on them. You’d like High priority tasks processed before Medium, and Medium tasks processed before Low. Within each priority level, you’d like to process the tasks in the order they came.

With Sorted Sets, you could represent each of the priority levels with an enum number (e.g. High = 1, Medium = 2, Low = 3). When adding elements to the Sorted Set, the score is ${priority}${unixTimestamp}. Thus, a High priority item on the afternoon of October 12 would be scored as 11570904775 (priority = 1, timestamp = 1570904775), whereas a Low priority item at the exact same time would be 31570904775 (priority = 3, timestamp = 1570904775).

When processing, you could use ZPOPMIN to find the elements with the lowest score. This would prioritize higher priority and older tasks over lower priority and newer tasks.

With Redis Sorted Sets, priority queues are quite a bit easier than other methods.
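Here's a small sketch of that scoring scheme and of ZPOPMIN's behavior in plain Python (function names are mine):

```python
def score(priority, unix_timestamp):
    """Compose a sortable score: the priority digit, then the timestamp.

    Works because contemporary Unix timestamps all have ten digits, so
    concatenation preserves the ordering within a priority level.
    """
    return int(f"{priority}{unix_timestamp}")

def zpopmin(queue):
    """Mimic Redis ZPOPMIN on a dict of member -> score."""
    member = min(queue, key=queue.get)
    return member, queue.pop(member)
```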

4. Getting probabilistic with Hyperloglog

The last example is the wackiest. The HyperLogLog data type is a probabilistic data structure that allows you to track and estimate unique elements in a memory-efficient way.

In the Redis implementation, you can track a virtually unlimited number of unique items in just 12 KB of memory. When you ask how many unique items have been stored in your HyperLogLog item, the response is a probabilistic count with a standard error of 0.81%. This is pretty wild.

Most of the use cases around HyperLogLog are analytics-focused, such as tracking an approximate number of unique visitors to a web page over a given time period.
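To make the idea concrete, here's a toy HyperLogLog in Python. It follows the standard register/rank scheme but simplifies the real algorithm (Redis adds further bias corrections), so treat it as a sketch:

```python
import hashlib
import math

M = 256  # registers; more registers means lower error but more memory

def _hash64(item):
    """A 64-bit hash of the item."""
    return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], 'big')

def add(registers, item):
    h = _hash64(item)
    idx = h % M            # low bits pick a register
    rest = h // M
    rank = 1               # position of the first 1-bit in the remaining bits
    while rest % 2 == 0 and rank < 50:
        rank += 1
        rest //= 2
    registers[idx] = max(registers[idx], rank)

def estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:  # small-range (linear counting) correction
        return M * math.log(M / zeros)
    return raw
```

With 256 registers the standard error is roughly 6.5%; Redis uses 16,384 registers to hit its 0.81% figure.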

Increase max item size to 1MB

While we're here, I'll add one more request. Since we're adding more useful attribute types and operations, it's possible our item sizes will get larger. This is particularly true for heavy use of Sorted Sets.

Let’s increase the max item size from 400KB to 1MB. I would go even higher but am trying to be reasonable here. The response size limits for the Query and Scan operations are 1MB, so this shouldn’t require too much change overall.

Again, for most people, this increase shouldn’t matter. The main point is to provide a little extra leeway for implementing priority queues and leaderboards with Sorted Sets or similar functionality.

Conclusion

DynamoDB has made some incredible progress over the last few years, but I still want more. In this post, I wrote about my #awswishlist items for DynamoDB, including filtered DynamoDB streams, more Redis-like functionality, and bigger item sizes.

What’s on your #awswishlist?

If you have questions or comments on this piece, feel free to leave a note below or email me directly.

Published 15 Oct 2019

Working for Serverless, Inc. Infrastructure & data engineer with expertise in AWS, data processing, and serverless technologies. Learning never stops.
Alex DeBrie on Twitter