Hello. You're getting this email because you signed up at productengineered.com. I hope you enjoy it.

Data models and API boundaries are some of the most important leverage points in a codebase. Get these things right and programming can feel effortless. Get them wrong and even the simplest tasks can become a frustration.

While there is a seemingly endless supply of content on data modelling itself, I have long been surprised at how little information there is on changing designs once they have been put into production. In other words, no one talks about migrations.

A well-executed migration can clean up knotty data models, fix insurmountable performance issues and evolve features for users. Conversely, poorly-executed migrations cause incidents and company-killing levels of technical debt. This edition of Product Engineered is going to help you achieve the former and avoid the latter. In particular, we will:

  1. Look at when you need migrational thinking (hint: often)

  2. Walk through systematic, error-free and zero-downtime database/API migrations

  3. Consider how to run migrations at scale

What are migrations?

You run a migration when you make changes to interfaces between independently deployed systems. In other words, migrations are refactors across API boundaries.

A database migration is the obvious example: you change your database schema and thus change the contract with any clients. However, the definition of ‘system’ needs to be much broader than just database and server. Some other examples of when migrational thinking is required are:

  • You change an interface for a public API, thus modifying the contract with consumers of that API.

  • You want to make a change to both an internal API and also the web app that consumes it.

  • You modify the type of an event dispatched to a high-throughput event queue. You need to be mindful of what happens if code that enqueues new events is deployed before the code that handles them.

  • You change the public API of an NPM package and want to avoid migration headaches for users.

What connects these examples is a non-atomic change across API boundaries. Each has two or more components, deployed independently, with an implied contract between them. When you need to make a change like this, you need to consider a migration plan.

Why get migrations right?

Browse public post-mortems and you’ll see that many incidents are some form of failed migration. This is the core reason to get migrations right: getting them wrong results in outages.

Migrations can go wrong in many ways. At a high level, the three big failure modes you’ll see are:

  1. API contracts are violated, causing reads or writes to fail. For example, you may remove a column from your schema while a client continues to query it.

  2. Inconsistent, lost or corrupted data.

  3. Unanticipated performance side-effects. For example, a schema migration takes a long-running lock and causes timeouts.

We’ll discuss these failure modes in more detail shortly but, suffice to say, the stakes are high. At best, issues will cause a few seconds of downtime. At worst, you can cause hours-long outages or irrecoverably corrupt customer data.

More optimistically, if your team can reliably execute precise, swift and systematic migrations, your codebase becomes infinitely more malleable. You will be able to scale through orders of magnitude of load. When requirements change, or you rebuild a feature, you upgrade your system without users ever noticing.

How to get migrations right

Before we begin, a warning: migrations are very context-dependent. I started off writing a far more concrete, tutorial-style post before discarding it due to fears that it would not generalise outside of my rather specific circumstances [1]. So, while the following principles are carefully considered, an exercise is inevitably left to the reader to apply them to their own situation. In particular, you will need to consider:

  • Is your scale so large that you cannot possibly migrate everything, or so small that particular techniques are redundant?

  • What database and tooling are you using? What are their properties? For example, is adding a new index effortless and safe, or is it doomed to take down your entire system?

  • What constraints does your deployment process enforce? For example, is your frontend guaranteed to roll out before your server?

Warning complete, let’s begin.

Consider contracts over time

The first skill to master is thinking not in terms of static snapshots of code, but overlapping rollouts of changing API contracts.

Let’s consider a toy example to make this concrete. Imagine we have a task table for a task manager app.

CREATE TABLE task (
  id INTEGER PRIMARY KEY,
  title TEXT,
  is_complete BOOLEAN NOT NULL,
  created_at TIMESTAMP NOT NULL
);

Putting aside the fact that we should have designed our schema like this from the get-go, let's imagine we need to modify our app to introduce a ‘recently completed tasks’ feature. We don't just want complete and incomplete tasks; we want to show when completion happened. We'd like our redesigned table to look like this.

CREATE TABLE task (
  id INTEGER PRIMARY KEY,
  title TEXT,
  completed_at TIMESTAMP, -- replaces is_complete
  created_at TIMESTAMP NOT NULL
);

Furthermore, we want to expose the timestamp variant of our design to the client so that we can show our tasks in the order they were completed.

What not to do

Let’s think through what would happen if all of these changes were implemented in the same PR and deployed ‘simultaneously’. You would write a database migration and backfill to drop is_complete, add completed_at and move data from the old column to the new one. On the server, you would update your SQL to use the new schema and define a new shape for the task API response. Finally, the client would use the new response to sort tasks and to write completions.
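
For illustration, the all-in-one change might look roughly like this (a sketch only; the exact syntax varies by database):

ALTER TABLE task ADD COLUMN completed_at TIMESTAMP;

UPDATE task
SET completed_at = created_at -- move data from the old column to the new one
WHERE is_complete = TRUE;

ALTER TABLE task DROP COLUMN is_complete;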

Exactly what will go wrong depends on your deployment constraints, but here are some possibilities:

  • Your server rolls out before your database migration is complete. A user completes a task and hits an error where a SQL query tries to write to a completed_at column that hasn’t been added yet.

  • Your database migration completes before your server deploys. SELECT * FROM task now returns an unexpected shape of data. Writes to the is_complete column fail.

  • Your client deploys before your server, and tries to sort by the completed_at column on the tasks response. The server isn’t returning data in this shape yet, so you see errors.

  • Your server deploys before your client. We're no longer supporting task completion via setting the is_complete property, so any API requests to complete a task error.

You will notice that errors occur regardless of the deployment order. Server deploys before the client? Error. Client deploys before server? Also error. The only way we can avoid errors is if the rollout is exactly simultaneous. This is very unlikely indeed.

You might get lucky. These particular errors only happen if users make a request during the rollout period. But if you have many active users, or always-running background jobs, you need a better approach. The better approach is to start thinking not about snapshots of code, but about contracts over time.

Implicitly or explicitly, an API describes a contract of permitted behaviour. Update your API and you update the contract, either with a breaking or non-breaking change. Adding a new optional property to the body of a POST request would be an example of a non-breaking change, while a new required property would be an example of a breaking one.

Crucially, independent deployments mean these changes do not happen simultaneously on all services. Instead, you must think not just of changing contracts, but of contracts changing over time. The trick to a successful migration is to ensure that at no point do you end up with a breaking change.

We can visualise changes on a timeline. In our optimistic (but unrealistic) world where all changes happen simultaneously, we only have two states to worry about and there are no breaking changes.


DATABASE │is_complete────│completed_at─▶
SERVER   │is_complete────│completed_at─▶
CLIENT   │is_complete────│completed_at─▶
        ─t0──────────────t1────────────▶ time
         │               │
DEPLOYS  Old contract    New contract

If we represent a more realistic scenario, the actual changes will happen at different points in time and result in breaking changes.


DATABASE │is_complete────────│completed_at─────────────────▶
SERVER   │is_complete───────────────│completed_at──────────▶
CLIENT   │is_complete───────────────────────│completed_at──▶
        ─t0──────────────────t1─────t2──────t3─────────────▶ time
                             │      │       │
DEPLOYS                      DB     Server  Client

                             ├──────┤       
                             BREAKING
                    Server writes to is_complete

                                     ├──────┤
                                     BREAKING
                              Client expects is_complete        

To perform a successful migration, you need to replace these breaking changes with a series of isolated, non-breaking changes. Let’s see how we do this.

Successful database migrations

The first boundary to consider is between the server and the database. This contract is defined by your database schema.

When writing a given property on a database row, there are three possibilities. The property can be required (the write will error when it is absent), it can be optional (the write will always succeed, regardless of whether the property is included or not), or the schema can reject writes which include the property.

In our example above, we hit an error because the server continues to write to the is_complete column even though it has been removed. Additionally, although you won’t hit a validation error for not setting the new completed_at column (this column is nullable), you can also consider this case to be an error of sorts; failure to set completed_at when the user completes a task means your app is losing data.
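
In our example, both failure modes show up in the same old write path once the naive migration has run (the task id is hypothetical):

-- The un-updated server still completes tasks like this:
UPDATE task SET is_complete = TRUE WHERE id = 42;
-- 1. The write is rejected: is_complete no longer exists, so the statement errors.
-- 2. Even if it succeeded, nothing forces the nullable completed_at column to be set,
--    so the completion timestamp would be silently lost.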

The trick we need to pull is to deconstruct our updates to the schema and the server into non-breaking, incremental steps. There is a systematic way to implement this: the database migration manoeuvre.

The database migration manoeuvre

The database migration manoeuvre is as follows.

  1. New schema: Migrate your database schema so that you can write data into its new home. Note: this is just about adding the new parts of the schema, not cleaning up the old parts.

  2. Dual-write: Update write paths to write new and old data simultaneously.

  3. Backfill: Run an idempotent backfill task/script to fill in any data that hasn't been covered by dual-writes.

  4. Swap to new reads: Swap reads over so they read the new data and not the old data.

  5. Remove old writes: When you're confident that nothing is reading the old data, remove any old writes.

  6. Clean up: Drop unused columns, clean up old endpoints and generally modify your code to reflect the fact that the old schema has been removed.

Let's apply this to our tasks migration.

  1. New schema: ALTER TABLE task ADD COLUMN completed_at TIMESTAMP; note that we leave the old schema in place.

  2. Dual-write: Modify the backend so that completing a task still sets is_complete to TRUE and also sets completed_at to CURRENT_TIMESTAMP.

  3. Backfill: Backfill a value for completed_at for any rows with is_complete=TRUE. The created_at column or the current timestamp would both be suitable values.

  4. Swap to new reads: Remove all reads of is_complete and replace them with reads from completed_at.

  5. Remove old writes: We're no longer reading from is_complete; remove all writes to this column.

  6. Clean up: Drop the is_complete column.

If we deploy each of these six steps in order, we achieve the same result as our naïve migration but with zero errors.
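
On the database side, steps 2 and 3 might look roughly like this (an illustrative sketch; the dual-write would really live in your application's write path, and the task id is hypothetical):

-- Step 2, dual-write: completing a task sets both the old and new columns
UPDATE task
SET is_complete = TRUE,
    completed_at = CURRENT_TIMESTAMP
WHERE id = 42;

-- Step 3, backfill: only touches rows the dual-write hasn't covered, so it is idempotent
UPDATE task
SET completed_at = created_at
WHERE is_complete = TRUE
  AND completed_at IS NULL;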

Notice that our HTTP API (the contract between the client and server) has not changed at all. We’ve simplified the problem by only thinking about the server and the database. An even more extreme view on this (and one that I would recommend) is to strictly only think about the data model at this point. You may be tempted to also start refactoring controllers and other parts of the codebase. Avoid this. Focus surgically on the database and get that done first. You can always clean up later.

Client-server contract

The approach required for the client-server boundary is less involved. Here, I don’t have a specific manoeuvre to recommend. It’s really just a matter of paying attention to contracts over time and ensuring they don’t break.

For our task example, you could:

  1. Modify the update task endpoint to support both is_complete and completed_at properties. Return both properties on all task responses.

  2. Update the web app to only write to the new completed_at value. Remove all reads of the is_complete response property.

  3. Clean up any reference to is_complete in API controllers and responses.

In practice, setting completed_at using a server-authoritative timestamp might make more sense anyway. Sometimes, if you’re launching a new feature, you can more or less skip these steps entirely. Performing a single transition to a brand new set of endpoints can be easier.

One word of warning when it comes to mobile and browser clients: they stick around far longer than you expect. For mobile apps this is more obvious: you need to update to the latest version of an app to run the latest code. However, the same principle applies to the web bundles of single-page apps. These can linger for weeks if the user doesn't refresh the page. Your new bundle may have been deployed and will be used by new users, but don't assume that everyone is already running it. You need to account for longer rollout periods or implement mechanisms to force a bundle refresh.

Check your work

The least obvious part of migrations is testing. This was the area I under-indexed on most when starting out. My most painfully learned piece of advice: always test more than you think you need to.

Think in data

When running database migrations, you should be thinking in terms of data. When you start to reason about states and invariants rather than endpoints and features, your testing will become both more comprehensive and simpler.

Think in state transitions

Think in terms of state transitions. If I am backfilling is_complete → completed_at, then I want to ensure that I test both the is_complete=true case and the is_complete=false case. This is a simple example, but this approach is essential when your data becomes more complex.

Here's how I work through dual-writes systematically using this thinking.

  1. Find the next call site and implement the dual-write.

  2. Start with no data on the target side then manually trigger the code path.

  3. Check the database to ensure the target-side state is as expected. Check cell by cell.

  4. Run the code path again.

  5. Check the database again. Now that there's already some target-side data, does it still match the input? Are your logs error-free?

  6. Repeat steps 1-5 until done.

You don't just test the path, you test every variant of it until you have 100% confidence.
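
For our tasks example, the cell-by-cell check is just a targeted query against the row you touched (again, the id is hypothetical):

SELECT id, title, is_complete, completed_at
FROM task
WHERE id = 42;
-- Expect is_complete = TRUE and completed_at set to roughly the moment you completed the task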

There's a similar story for backfills.

  1. Run your backfill on top of your dual-written data.

  2. Check there are no runtime errors in your system.

  3. For each possible variant of data, check the target side is written as expected. For example, find one task row with is_complete=true and check that it has the right completed_at value. Then, find another row with is_complete=false and check that as well.

  4. Wipe all the target-side data.

  5. Run your backfill again.

  6. Check for errors again.

  7. Check your target data again.

  8. Run the backfill one last time to check for idempotency (nothing should change in this run).

  9. Check there are no runtime errors.

  10. Check that no data has changed.

Think in invariants

As well as thinking row-by-row, you can think about invariants that need to hold for the system as a whole. You can then run SQL commands to check that everything is as expected.

For our tasks migration, we’d want to assert that every task with is_complete=TRUE also has a non-null value for completed_at. Express this as a SQL query, run it, and check your data integrity in development and production. If anything is off, dive into the details until you either fix your understanding or fix a bug. Only then should you proceed.
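
Expressed as a query, the check might look like this (any non-zero count means the invariant is violated):

-- Invariant: every completed task has a completion timestamp
SELECT COUNT(*) AS violations
FROM task
WHERE is_complete = TRUE
  AND completed_at IS NULL;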

Seed data with as many states as possible

You can add further confidence by ensuring that you seed your development environment with as many states as possible. If I try to run our task backfill on a fresh workspace with no tasks, I won't be testing much. If I instead test my code on an environment with thousands of seeded tasks from many users, with a mix of done, not-done, deleted tasks and so on, I'll be much more confident in my tests. An error-causing edge case will be more likely to first appear on your machine rather than in production. As an advanced approach to this problem, you can consider automatically seeding development environments from production.
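
A minimal sketch of the idea against our toy schema (a real seed script would generate far more rows, users and edge cases):

-- Seed a spread of task states before exercising the dual-writes and backfill
INSERT INTO task (id, title, is_complete, created_at) VALUES
  (1, 'Completed long ago', TRUE,  '2023-01-01 09:00:00'),
  (2, 'Completed recently', TRUE,  CURRENT_TIMESTAMP),
  (3, 'Still to do',        FALSE, CURRENT_TIMESTAMP),
  (4, NULL,                 FALSE, CURRENT_TIMESTAMP); -- missing title: another edge case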

Observability

Lastly, you should instrument your code so that any errors that do slip through to production are caught rapidly. Set up dashboards to monitor for API errors. Log generously. Stay vigilant for exceptions. As a last resort, keep an eye open for any customer reports. Catching errors quickly allows you to roll back changes and minimise the effect of any mistakes.

How far does this approach get you?

Our tasks example is simple: one table, one column, few moving parts. However, I don’t want to suggest that this approach is only suitable for toy examples. These same manoeuvres and principles will let you migrate whole features, backed by millions of database rows, with no downtime.

That said, purely theoretical, contract-based migration thinking is not limitless. Production considerations ought to feed heavily into your migration planning. As I stated at the beginning of the post, your particular situation (product, scale, database, tooling) matters.

I am lucky to work in an environment where migration tooling is excellent. Adding a new column or index to a live table is not a problem for me. It may well be for you.

I have skipped over the topic of transactionality entirely. If you can’t assume transactional writes, you will need to consider how to handle partial failures, such as a dual-write where only one of the two writes succeeds.

I have been assuming that it’s feasible to backfill and migrate your entire dataset. If you’re at Google scale, this might not be an option. I generally run migrations to 100% completion and clean up any old schemas/code. In practice, it might make more sense to leave old columns in place and instead operate using mapping layers.

Whatever your situation, and however much it differs from these examples, the fundamentals hold: reason about contracts over time, execute non-breaking changes systematically, and verify your changes thoroughly. Get these right and your migrations will be seamless.

If you have a particularly interesting migration setup or ways of thinking about these problems, please email me at [email protected]. I am also always looking to learn more about how people tackle these challenges.

Further reading

  1. In particular, I predominantly work with Google’s Spanner as my database and am subject to the constraints of a fast-growing, high-load, data-heavy startup.