Operational Mastery: What It Is and How To Develop It

I want to talk today about what I consider a core capability of an engineering team: “Operational Mastery”.

This capability means a team can keep their system running and responsive in the face of:

  • Unexpected surges in load

  • Sudden failures of adjacent (or underlying!) systems

  • Nasty bugs that slip through testing

  • Users exhibiting malevolently surprising behavior

  • Zero-day CVEs that require immediate, mass library upgrades

  • The longest-tenured engineer on the team occasionally sleeping or taking a vacation

  • etc, etc,

When such incidents arise, a team with Operational Mastery smoothly deals with them, maybe runs a quick post-mortem, and moves on.

This kind of full, calm ownership over a system isn’t just pleasant – it’s valuable.

Overwhelmingly, software businesses today operate the software they write – and being able to operate it well, with minimal effort, creates a great deal of value in the long term.

Consider the alternative: if the team doesn’t have such mastery, people constantly get sucked into awful, reactive fire drills. Burnout grows throughout the team. Customers start to lose trust, because the system isn’t consistently available to them. Executives get more and more frustrated with their engineering team.

What I want to talk about today is how to move from the unhappy state, towards the better one.

I’ll start by discussing some powerful practices that you’ll see over and over again, on teams that operate their systems well.

But then, oddly, I’m going to suggest that, if you want to increase your team’s operational mastery, you should not directly prioritize the development of those practices.

I’ll propose you do something different – more on that below…

Some Practices of Operational Mastery

First, to note: operational mastery is related to, but distinct from, the other key property of the combo of a team plus a system – the ability to rapidly and safely change the system to meet new needs of the business (aka, the ability to make small, frequent, safe deploys). That property is necessary for operational mastery, but it’s not generally sufficient.

If you study teams that demonstrate operational mastery, you’ll notice:

  • They have excellent visibility into the behavior of the system

If something goes wrong, the team will immediately start reviewing dashboards and diving into logs – they know where to look to understand what is actually happening.

  • The overall team has a shared understanding of the system

Multiple people on the team have a strong mental model of how the system behaves in reality – and they have, at least roughly, the same such mental model.

  • The team possesses solid documentation

E.g. if the team discovers that, say, a node in some search cluster has failed, they’ll open up their runbook that explains how to cleanly terminate a node and replace it with a new one. That runbook covers a few nasty pitfalls to avoid, and also explains how to verify that a new node is fully functioning.

  • There is reliable tooling in place

Various steps are well-supported by tools which allow the team to easily probe for detailed further data, and then quickly adjust or repair the system. Those tools work, even when much else is in crisis (e.g. they don’t depend on the infrastructure they’re supposed to be operating on).

“Tooling” implies that some set of human operators is using the tooling to detect, diagnose and remediate issues.

“Automation” implies that humans don’t need to be involved.

Although that latter situation may sound attractive (“Won’t issues be resolved more quickly if it’s all automated?”), I believe it’s a very dangerous mental frame.

Complex systems fail in interesting enough ways that, for the bad cases, you’ll always end up needing humans involved.

If you’ve tried to “automate them away”, they’ll end up fighting their way through a system designed, at best, without considering their needs, and at worst, to actively prevent them from being involved. And they’ll be doing so precisely when things have gone very bad indeed.

Lisanne Bainbridge wrote beautifully about this in Ironies of Automation (in 1983!).

Thanks For the Roadmap? Um, No

Okay, so: visibility, full team understanding, documentation, battle-tested tooling. Check, check, check, check.

Let’s go add all those things onto our backlog, right?

But, that’s precisely what I’m saying you shouldn’t do.

I have tried, on multiple occasions, to directly instill the above practices – both when I’ve been working on a team and when I’ve had some form of leadership role.

I’ve challenged teams to build dashboards and reduce the false positive error rates of their logging… and mostly seen those efforts peter out after an initial period of focus.

I’ve tried to build a culture of documentation… and found that most documentation became so quickly outdated, no one wanted to look at it in a crisis.

I’ve asked teams to set aside portions of their sprint time to improve their operational tooling… and discovered that the teams don’t know what kind of tools they need.

There’s an underlying reason why all these approaches tend to languish: ultimately, you’re asking the team to prepare for events which aren’t yet happening.

This causes two problems:

First, they lack motivation. It’s hard to prioritize this work against all the urgently-requested features… if it feels only theoretically needed.

Second, if the team hasn’t been living with their system and actually operating it in the face of a hostile reality… they won’t actually know what kind of investments to make.

How can you expect a team to build genuinely valuable visibility, documentation or tooling, if it’s in support of a system they don’t really understand, facing user inputs or external failures they haven’t thought about yet?

Note, even if a team has written a system, they often don’t have a full understanding of how it will behave once reality hits. Will the database slow down as it grows? What happens when requests to that 3rd party API time out? What kind of crazy data do users shove into the system? etc, etc

Just as engineers generally have horrible intuition about where their systems are spending time when it comes to performance issues, pre-mastery teams don’t generally have good intuitions about what will improve their ability to operate.

So, what can you do?

Relentless and Public Failure is a Powerful Teacher

There is this… one… way teams can develop operational mastery:

  • Have their system fail, painfully, in production, over and over

If they’re lucky, they can sometimes learn enough from those failures to achieve mastery… before the issues cause their business to fail.

I want to be clear: this can actually work – in particular, if the team has the discipline to run thorough, blameless post-mortems for every single failure.

But it does have some downsides.

First, living through constant, unplanned failure is playing with fire.

You run real risks that your team will burn out, your executives will lose trust in the engineers, and your customers will give up on your product because it’s not available and responsive when they need it.

But, beyond those fundamental risks, there are cases where the “fail repeatedly due to forces beyond your control” plan just doesn’t help:

  • Your business might have infrequent but very important periods of dramatically increased load

E.g. at an ecommerce company I worked at, we had to be ready for “peak”, which was driven by the incredibly intense online buying in the days immediately after Thanksgiving.

We could post-mortem all we wanted leading up to that, but none of that would prep us for 5x or 10x load during the “Cyber 5”. And, we only had those 5 days to then get it right – we couldn’t really use a sequence of production failures and learnings to gradually adjust the system and our practices.

  • Failures and post-mortems don’t do a great job of transitioning knowledge to new team members

I’ve seen this happen multiple times – a core group of engineers, product and customer support folks, who go through a series of incidents and post-mortems together (you must invite product and customer support to post-mortems, by the way), all build a deeply shared understanding of the system, how to respond to incidents, etc. As a result, the system becomes more stable.

But then, those people leave the company, or move to other teams… and the people who replace them weren’t in all those post-mortems, and thus don’t have the deep mental model of the system, nor the ingrained habits of how to handle an incident. And suddenly, you’re back where you started.

Why Do You Keep Saying “Failure” Like It’s a Good Thing?

If the key to a team developing mastery is living through a series of failures… you can win by deliberately and repeatedly putting your production systems into failure states.

Doing so often requires some real creativity and risk tolerance, but, if you do so, you’ll obtain a massive double benefit:

  • Due to the direct experience of operating their system in the face of challenges, the team will start to understand how it works and what they need to do to more successfully operate it

  • If you practice regularly, the team will be motivated to make steady, incremental investments (which is how you win, in basically all software development)

If you encourage/force your team to practice their operations by dealing with failure states, you’ll find that visibility, documentation, and tooling all start to improve… even if you don’t tell your team to work on those things!

What Does Practicing Look Like?

You may have heard of Chaos Engineering – that certainly fits within this overall frame. But, if you think about “inducing failure” as a general approach to practicing operations, you can see lots of other kinds of work in this light:

  • Deploy all the time, in the middle of the day

Not only does this get small changes out into reality on a regular basis (which creates tremendous economic benefit), it also increases the odds that you’ll trigger a set of small-scale failures.

These small-scale, deploy-triggered failures are an excellent training ground for operational mastery.

  • Take out key underlying systems in production

Aka, run “Game Days” – disable your own database, and see what happens!

No, really!

See this fantastic conversation around Resilience Engineering, or this presentation on Gamedays on the Obama for America Campaign.

  • For request/response systems, saturate one of your servers in production

i.e. tweak your load balancer configs to send all traffic to one host in your fleet, and just completely knock it over.

If you want, you can try to get clever and send copies of requests to some spare host, to minimize customer impact… but honestly, you’re much better off trying to create an honest-to-god failure in production.

If you make this a regular practice, you’ll find your team suddenly writing better dashboards to monitor host health, improving the tooling to replace a host, improving their runbooks, etc.

  • For queue-based systems, pinch production demand and release it all at once, to generate peak load at off-peak times

At the aforementioned ecommerce job, this is how we had our most successful year of preparing for peak.

During normal operations, our customer-facing systems dropped new orders in a queue, which the various order fulfillment systems then consumed. For a “pinch test”, we paused all consumption of new orders for several hours, which let a large backlog of orders build up in the queue. We then released those all at once… and the downstream fulfillment systems got to experience 10 minutes of the kind of intense load we otherwise only saw once per year. (There’s a rough sketch of this pause-and-release mechanism at the end of this list.)

Each team observed their system’s response, identified bottlenecks and/or places of limited visibility, and then made adjustments to prep for the next month’s pinch test.

The PMs on the teams were very happy to see this work prioritized – the direct evidence of the limitations of the systems built tremendous buy-in.

Note: attempting to do this outside of production would have taken years of preparation. Whereas, for the pinch tests, the tech work was pretty straightforward (we did have to do some careful coordination with our business stakeholders, in particular, the warehouse teams, who had to staff to handle the extremely heavy influx of orders partway through the day).
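To make that pause-and-release mechanism a bit more concrete, here’s a minimal sketch of a queue consumer with a pinch toggle. The file-based pause flag and the queue client API are hypothetical stand-ins – the real mechanism depends entirely on your queue technology – but the core idea really is this small.

```python
# Minimal sketch of a "pinch test" toggle for a queue consumer.
# The pause-flag path and the queue_client API are hypothetical stand-ins
# for whatever your system actually uses.

import os
import time

PAUSE_FLAG = "/etc/order-consumer/paused"  # hypothetical: create this file to pause


def consume_forever(queue_client, handle_order):
    """Consume orders from the queue unless the pause flag is present.

    While the flag exists, nothing is consumed and new orders simply pile
    up in the queue. Deleting the flag releases the whole backlog at once,
    giving the downstream systems a short burst of peak-level load.
    """
    while True:
        if os.path.exists(PAUSE_FLAG):
            time.sleep(5)  # paused: let the backlog build
            continue
        order = queue_client.get(timeout=5)  # hypothetical client call
        if order is not None:
            handle_order(order)
```

The interesting part isn’t the code, of course – it’s deciding to run this against real production traffic, watching what the burst does to the downstream systems, and feeding those observations into next month’s test.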

But We’re Not Ready!

You might be thinking: my team isn’t ready for those practices yet, we need to build up our visibility, documentation, tooling and team understanding first.

I know, I know, that sounds like the way to go.

But, think about it: if you don’t yet have such operational maturity… it’s likely because your organization’s overall decision-making systems don’t prioritize it. What is going to change that? Again, from my experience, making the better practices the goal in and of themselves rarely overcomes the organizational desire to instead focus on new features.

But, if you put an upcoming failure event on the calendar, you’ll suddenly find motivation that’s been lacking.

This usually does require investing some political capital – so maybe start with a “small” failure in production. But spend your energy on that, and on getting into a regular habit of creating operational challenges.

“Depend upon it, sir, when a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully.” – Samuel Johnson.

Here’s a recent story of how we did this at Ellevation.

Our Peak is Called Back to School

Ellevation sells software to help school districts across the country manage their English Language Learner (“ELL”) programs.

One of our biggest operational challenges comes around the annual generation of “Student Reports”: school districts are required to generate and store a set of documents for each ELL student, detailing what educational supports will be provided.

Ellevation automates almost all of this work – saving huge amounts of time for the educators in the ELL departments (and thus freeing them up to spend more time with students).

At certain peak periods (primarily in the late summer and fall, as the school year starts), we’ll go from seeing zero requests to generate Student Reports, to suddenly seeing requests for 20,000/day or more.

In previous years, we’d badly struggled – the report-generating system would grow sluggish, or would simply fall over (generally silently).

In 2021, the team spent almost a month and a half doing nothing but supporting report generation (which put a huge divot in their progress on key new features for the product, and also burned them out).

Looking ahead to the 2022 Back to School, the team decided to prepare more proactively.

They did so by running a series of load tests – specifically aiming to overload the system to the point of collapse.

They started with some basic dashboards, and a set of (slightly hacky and poorly documented) JMeter scripts to generate requests for Student Reports.
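If you haven’t built this kind of load generator before, the core really is small. Here’s an illustrative sketch in Python – the team’s actual scripts were JMeter test plans, and the endpoint, payload and volumes below are made up for the example.

```python
# Illustrative load generator: fire many report-generation requests
# concurrently and count the failures. The URL, payload shape, and
# volumes here are hypothetical, not Ellevation's actual values.

from concurrent.futures import ThreadPoolExecutor

import requests

REPORT_URL = "https://reports.example.internal/api/student-reports"  # hypothetical
TOTAL_REQUESTS = 1_000
CONCURRENCY = 20


def request_report(student_id: int) -> int:
    """Request generation of a single student report; return the HTTP status."""
    resp = requests.post(REPORT_URL, json={"student_id": student_id}, timeout=30)
    return resp.status_code


def run_load_test() -> None:
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        statuses = list(pool.map(request_report, range(TOTAL_REQUESTS)))
    failures = sum(1 for status in statuses if status >= 400)
    print(f"sent={len(statuses)} failures={failures}")


if __name__ == "__main__":
    run_load_test()
```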

Over the course of a series of load tests, run in production, they:

  • Improved the dashboards which show the number of reports in various queues, the error rates, etc.

  • Gave multiple people active experience in working with the system: launching load tests, observing the results, adjusting the configuration (when we started, we had exactly one engineer who could do most of those things)

  • (Re)discovered connections in the web of systems that: pull data about students; generate translated versions of legally-required copy; and render and store the results (e.g. the team found a bottleneck in CPU utilization in the database that hosts the translated copy – a DB that half the team didn’t even know existed, when the work started).

  • Wrote detailed runbooks on launching load tests, and also on monitoring the system (I’d love to say that they “improved” the existing runbooks… except we totally didn’t have any, which is why our first ‘small’ test accidentally triggered the generation of 5,000 reports in 1 minute, which then flooded the inbox of an unrelated team with an email per failure! We fixed both the load test script and the email alerting rules)

  • Flushed out a nasty bug where five failures in a row would trip a circuit breaker… which then could never be cleared; see the sketch just after this list. (Which may well have been the cause of some of the worst issues from last year’s Back to School period)
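To illustrate that last failure mode (this is a sketch of the general pattern, not our actual code): a breaker that counts failures but has no recovery path will trip once and then reject every request forever. The fix is some form of timeout or “half-open” state that lets a trial request through.

```python
# Sketch of a circuit breaker with a recovery path. The bug described
# above amounts to leaving out the recovery_timeout check: once the
# breaker opens, nothing ever clears it.

import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None  # timestamp of when the breaker tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open; request rejected")
            # Past the timeout: go "half-open" and let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.time()
            raise
        else:
            # Any success fully resets the breaker.
            self.failure_count = 0
            self.opened_at = None
            return result
```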

As a result of this work (and other improvements), we had the smoothest Back to School we’ve had in years – featuring a grand total of zero major outages… which metric (“0”!) our customer support and success team featured prominently, in their “Looking Back on Back To School” company all hands presentation.

Dan Milstein

Dan is the CTO at Ellevation and leads the engineering team. In his spare time, he likes to play ultimate frisbee, read math textbooks and develop original puppet theater plays.