CosmosDB Partition Key Lessons

23 November 2022 By Tom Hyde

When using Cosmos DB, it is critically important to choose a good partition key strategy. Partitions in CosmosDB can hold a maximum of 20GB of data - including indexes (this used to be 10GB). If you get it wrong, you might fill up a partition and break things in production.

We got it wrong on a shared service and four years later it broke - of course on the weekend.

Here is what we did, what happened, how we fixed it, and what didn’t work.

TL;DR

Set up an alert so you get a warning before the partition fills up.
Deleting data from a partition might not free up space - at least not immediately.
Consider using the built-in Time-To-Live feature to automatically get rid of old data.

How we chose a “good” partition key strategy

Four years ago, we built a service using Cosmos DB for of some of its persistence. This system was a service responsible for sending emails, with a load of additional features such as being able to search for sent emails by arbitrary tags. At the time, the partition key strategy was considered, and a partition key chosen to use for storing models (the data that varies between emails, such as name and specific information for the recipient). A quick bit of maths showed us that with the average size of the models we were using, this could handle over 50 billion email models. That seemed pretty future-proof.

A nasty surprise

One day (at the weekend, of course), alerts were fired saying that this Cosmos database was responding with an unusual number of 403 (forbidden) status codes. This seemed odd.

Given the status code, the first thing to check was that there was no authentication issue with Cosmos - there wasn’t.

Application Insights came to the rescue. Looking into logs, we found some clear messages saying that a certain partition key had data which exceeded its allotted 20GB. The 403 status made some sense now - we were forbidden from assigning any other documents this partition key.

Impact

Since the alert had been fired, time-sensitive emails had stopped being sent.

How did this happen?

With the benefit of hindsight, it seems pretty obvious what would happen, but given that assumptions were made years ago with the original design, these had all but faded from memory.

Over the course of four years, a number of things had changed with how the service was used.

The number of emails sent through the service had increased by orders of magnitude, and systems using it had vastly increased their user bases.
The sizes of some of the models stored were much larger than when the original assumptions were made. Some included base-64 images, significantly increasing their size.

Getting the service back on the road

The obvious solution to the problem of having too much data assigned to a given partition key was to reduce how much data was assigned to that partition key. So, we put together a simple program which would take older documents from the database (which we could temporarily live with not being able to retrieve), copy them with a new partition key, then delete the original document. We ran this against over 50% of the total number of emails in the service.

However, after running this program, looking at metrics showed us that there had been no reduction in data stored against that partition key, and the 403s were still coming thick and fast. Clearly, removing documents from a database does not immediately reduce how much data it is reported to hold. This is not intuitive, and we’re not yet sure why this is.

Time for plan B. We created a brand new partition key for these documents to be stored in. We pointed our service at this new partition key and moved more recent documents to use it. Thankfully, due to how the system had been designed, pointing at the new partition key was possible purely through config changes. The emails started flowing again.

Since the service had been designed such that emails were placed on a queue before being sent, we were also able to requeue all of the emails which had failed to send while there was no space in the logical partition, and have these successfully sent.

Longer-term, the service is being developed to have a mechanism to elegantly switch between partition keys as logical partitions fill up.

Lessons learnt

Even when we have chosen partition key strategies which seem like there is no way of them being “filled”, we now always set up an alert to let us know when one is getting anywhere near full, so we can react before there is an issue.
Remember when designing partition key strategies to think of how usage can change over time, and if there is any foreseeable change which could mean that what once seemed a good choice, later appears less-so.
Once a partition key is “full of data”, there is not a simple solution to clear it out again.
When designing your database, question whether your data needs to be retained forever. It may make no sense to retain data from over a year ago for example, in which case, you can use Cosmos’s excellent TTL feature to automatically delete documents over a certain age. This in itself may solve any problems with perpetually growing data in logical partitions.