kelly kaoudis: Oncall dysfunctions

The main idea of having an oncall, as I understand it, is to share the burden of interruptish operations work so no one human on the team is a single point of failure. I have so far encountered several kinds of dysfunction in implementations of "oncall rotation" and would like to provide some ideas for other folks in similar situations on how to get to a better place.

What ownership of code or services means to a team seems to be a general contributing issue to oncall dysfunction. Not everyone needs to have 100% of the context for everything, but enough needs to be written down that someone with low context on the problem can quickly get pointed in the right direction to fix it.

This post is a) not perfect, b) limited by my own experience, and c) loosely inspired by this. Some or all of this may not be new information to the reader; this is not a new topic but I wanted to take a whack at writing about it.

Assumptions in this post:

if you carry the pager for it, you should be able to fix it
nobody on a team is exempt from oncall if the team has an oncall rotation
the team is larger than one person
the team owns things which have existing customers

I felt like I had a lot of thoughts on this topic after tweeting about it, so here ya go.

Everything is on Fire

Possible symptoms: Time cannot be allocated for necessary upkeep, perhaps due to management expectations or a tight release schedule for other work, leading to a situation where multiple aspects of a service or product are broken at once, or at least one feature is continually broken. The team is building on quicksand and not meeting their previous agreements with customers.

Ideas: A relatively reasonable way out of this situation is to push back on product or whoever in management sets the schedule, and make time during normal working hours to clean up the service. Even if the amount of time the engineers can negotiate to allocate to cleaning up technical debt and fixing things that are on fire is minimal, some is better than none.

While spending less than 100% of the team's time on net new work could result in delays to the release schedule, if the codebase is a collapsing house of cards, every new feature just worsens the situation. Adding more code to code that isn't well hardened / fails in mysterious ways just means more possibility for weirdness in production.

The right amount of upkeep work enables the team to meet customer service agreements while still leaving plenty of time for net new work. The balance of upkeep work to new work will vary by team. If everything is really on fire across multiple products, getting to a place where it's, say, 85% new to 15% upkeep may require multiple weeks of 75% or even 100% upkeep, with few or no net new additions during that time.

On the other hand, even if there's some backlog of stuff to fix, that is okay (perhaps even unavoidable) as long as agreed upon objectives are met and it's possible to prioritize fixes for critical bugs. If the team does not have SLAs but customers are Not Pleased, consider setting SLAs as a team and socializing them among your customers (perhaps delivered with context to adjust expectations like "to deliver <feature> by <date>, we need to reduce <product> service for <time period> to <specifics of availability/capacity/latency/etc> in order to focus on improving the foundation so we can deliver <feature> on time and at an <availability/capacity/latency/etc> of <metric>").

When everything is on fire long enough, it becomes normal for everything to be on fire, and so folks don't necessarily fix things until they get paged for whatever it is. One possibility for a team of at least three is to have a secondary oncall for a period of time until things have quieted down. The person "up" on the secondary rotation dedicates at least part of their oncall period to working on the next tech-debt-backlog item(s) and serves as a reserve set of helping hands/second pair of eyes if the primary oncall is overwhelmed.
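For folks who like to see it written down, here is a minimal sketch of what a primary-plus-secondary rotation could look like. The function name `build_rotation`, the example names, and the choice to make each period's secondary the next period's primary are all my own assumptions for illustration; any offset that keeps the two roles on different people would work.

```python
def build_rotation(team, periods):
    """Return one (primary, secondary) pair per oncall period.

    Assumption: the secondary is simply the next person in the list,
    so they become primary in the following period.
    """
    if len(team) < 3:
        # The post suggests a secondary rotation only for teams of three or more.
        raise ValueError("a secondary rotation assumes a team of at least three")

    schedule = []
    for period in range(periods):
        primary = team[period % len(team)]
        secondary = team[(period + 1) % len(team)]
        schedule.append((primary, secondary))
    return schedule


if __name__ == "__main__":
    # Hypothetical three-person team, four oncall periods.
    for primary, secondary in build_rotation(["ana", "bo", "cy"], 4):
        print(f"primary={primary} secondary={secondary}")
```

Making the secondary the upcoming primary means they go into their own shift already knowing what has been breaking, at the cost of two consecutive periods of some oncall duty.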

The next step could be to write runbooks or FAQs to avoid the situation turning into Permanent Oncall over time as individuals with context leave.

Permanent Oncall (Siloing)

Possible symptoms: The team has an oncall rotation, but it doesn't reflect the team's reality. The answer to asking whoever is oncall "what's wrong with <thing on fire>" most of the time is "I need <single point of failure person's name> to take a look", not "I will check the metrics and the docs and let you know by <date and time> if I need help". When the SPOF is out sick or on vacation, the oncall might attempt to fix the issue, but there isn't enough documentation for the issue to be fixed quickly. The team would add to the documentation if much existed, but to start from scratch for all the things is too much of a burden for any one oncall. When the SPOF comes back, they may return to everything on fire.

This can happen when a team has a few experienced folks, hires a lot suddenly, and is not intentional about sharing context and onboarding. This can also happen when context / documentation hasn't been kept up over the life cycle of the team, and suddenly the folks who don't know some domain of the team's work outnumber those who do.

Ideas: A working agreement can help here if it sets direct expectations about who owns what and about how things are to be documented and shared. If everyone on the team shares in oncall, the entire team should own everything the team owns. If the team is larger than two people, at least half the team should have context on each thing the team is responsible for.
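The "at least half the team has context on each thing" expectation is easy to check mechanically if the team keeps even a rough list of who knows what. This is a minimal sketch under my own assumptions: the function name `under_covered`, the service names, and the shape of the context map are hypothetical, and I round "half" up for odd-sized teams.

```python
import math


def under_covered(context_map, team):
    """Return {thing: how many more people need context} for each gap.

    context_map maps each thing the team owns to the people who have
    context on it. The rule only applies to teams larger than two.
    """
    if len(team) <= 2:
        return {}

    needed = math.ceil(len(team) / 2)  # assumption: round half up
    gaps = {}
    for thing, people in context_map.items():
        # Only count people who are actually still on the team.
        have = set(people) & set(team)
        if len(have) < needed:
            gaps[thing] = needed - len(have)
    return gaps


if __name__ == "__main__":
    team = ["ana", "bo", "cy", "di"]
    context = {
        "billing-api": ["ana"],
        "cron-jobs": ["ana", "bo"],
        "auth": ["bo", "cy", "di"],
    }
    print(under_covered(context, team))
```

Anything that shows up in the result is a candidate for the next pairing session or documentation push, before it turns into a single point of failure.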

If the team is to have a working agreement that actually reflects reality, the EM/PM/tech lead cannot dictate it, but can help shape it to meet their expectations and customer expectations. Having one person on the team who felt like their voice didn't get heard in the process of creating a working agreement could lead to that person becoming more dissatisfied with the team and eventually leaving. Working agreements can help with evening out load between team members or at least serve as a place that assumptions get codified so everyone is mostly on the same page about what should happen.

Some teams only work on one project at a time to try to prevent this (but this may be an unrealistic thing to try for teams with many projects or many customers). It can be hard to build context as an engineer on a project you have never gone into the codebase for if you are not the kind of person who learns best from reading things. If the entire team is not going to work on the exact same workstream(s) all the time, it is crucial to have good documentation to get to a place where oncall has some information on common failure modes in the service and common places to look for things contributing to those failures. Mixing and matching tasks across team members is hard at first, but if everyone is oncall for all the things, this is going to happen anyway and it's better to do it earlier so that people have context when it's crucial. Going the other way and limiting the oncall rotation for any service to just the segment of the team who wrote it or know it best is just another, more formal variant on the Permanent Oncall problem and is also best avoided.

Lots of folks do not enjoy writing documentation or may feel like it takes them too long to document stuff for it to be useful, but oncall is not a shared burden if there is not enough sharing of context. When the SPOF leaves because they're burnt out, the team is going to have to hustle to catch up or will transition into No Experts Here.

An alternative to having extensive runbooks, if they aren't something your team is willing or able to dedicate time to keeping up, might be regularly scheduled informal discussions about how the items the team owns work, combined with the oncall documenting in an FAQ what the issue was and how they fixed it whenever something new to them breaks. An FAQ will serve most of the purpose of a runbook, but may not cover the intended functionality of the system.

Firefighters

Sometimes, teams have an oncall rotation, but one person, often someone more senior, swoops in and tries to fix every problem within a domain. This paralyzes the other team members in situations where the swooping firefighter team member is not available and encourages learned helplessness and siloing. This is an edge case of Permanent Oncall where the SPOF actually likes and may actively encourage the situation.

Sometimes, the firefighter doesn't trust the oncall or doesn't feel they are adequately capable of fixing the problem. This can be incredibly frustrating as the oncall, especially when there is adequate documentation for the system and its failure modes and the oncall believes they can come to a working fix in an acceptable amount of time. It is also likely the firefighter hoards their own knowledge and is not writing down enough of what they know.

Sometimes the firefighter is just a little overenthusiastic about a good bug and management doesn't care how the problem gets solved. As I become more experienced in software engineering, I am finding areas within teams I have been part of where, despite having confidence the oncall will eventually be able to solve the problem, I am susceptible to offering a little too much advice when a thing I have been primarily responsible for building or know a lot about breaks and I happen to notice around the same time the oncall does. In cases like this I am working to write down more of what I know for the team.

Actively seeking out the advice of a teammate as the oncall is likely not an example of this pattern nor of Permanent Oncall in general unless every single bug in a system requires the same person or couple of folks to fix it.

No Experts Here

Possible symptoms: The team inherited something (service, set of flaky cron jobs...) secondhand and it is unreliable. Maybe it's a proof of concept that needs to be hardened; maybe it's an old app which has no existing documentation but is critical to some aspect of the business and someone has to be woken up at 3 am to at the very least kick the system so it gets back into a slightly less broken state. The backlog for this project is just bug reports and technical debt. Nobody takes responsibility, and ownership of this secondhand service is not prioritized. Management is prioritizing work on other projects. This is an edge case of Everything is on Fire.

Ideas: Perhaps one or two individuals do a deep dive into the problem service, start a runbook or FAQs based on observations, and then present their findings to the team. As oncalls encounter issues, they will then have a place to add their findings and how they fixed the problem(s). The deep dive doesn't have to result in a perfect understanding of the service, just a starting point that the oncall can use to inform themselves and/or build on.

The team as a whole needs to be on board with the idea that as owners of <thing>, they must know it well enough to fix it, or it will always be broken. If only part of the team is on board, this turns into Permanent Oncall for those folks, which is also not ideal. If nobody has time and mental space for <thing>, it needs to be transferred to a group who do have the time and space to develop proper knowledge of it, or it needs to be deprecated and spun down.

No Oncall

Possible symptoms: Everything that breaks is someone else's problem. The team does not carry a pager, but has agreements with customers about the level of production service that will be provided.

Someone else without good docs or context (perhaps an SRE or a production engineer?) is in charge of fixing things in production in order to keep agreements with customers. Perhaps this person is oncall for many different teams and does not have the time to devote to gaining enough context on each of them.

Ideas: Some things that may help this situation and help the SRE out (if having people who don't carry the pager is for some deeply ingrained cultural reason unavoidable): the team and the SRE come to a working agreement specifying when and how much to share context, when the SRE will pull someone from the team in to help, who they will pull in to help at what points, and so on. If you must have someone not on the team oncall for the product, it may be useful to have the team have a "shadow" oncall so that the situation does not turn into Permanent Oncall for any one or two individuals.

Newbies

When and how folks get added to the rotation are also critical to making oncall a healthier and more equitable experience. Expecting someone to just magically know how to fix something they don't understand and have never worked on is a great way to make that person want to leave the team. Instead, have a newer person shadow a few different folks on the existing rotation before expecting them to respond to production issues ("shadowing" == both the newer person and the experienced person get paged if something breaks, the experienced person explains their process for understanding and fixing the issue to the newer person as they go, and the newer person is able to ask questions before, during, and after the issue).
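As a rough illustration of shadowing as described above, here is a small sketch. Everything in it is hypothetical: the `Shift` class, the `ready_for_rotation` helper, and especially the threshold of three distinct experienced people, which is just one reading of "a few" and something a team would pick for itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shift:
    oncall: str                   # the experienced person carrying the pager
    shadow: Optional[str] = None  # the newer person shadowing, if any

    def page_targets(self):
        # When someone is shadowing, both people get paged.
        targets = [self.oncall]
        if self.shadow is not None:
            targets.append(self.shadow)
        return targets


def ready_for_rotation(newbie, past_shifts, min_mentors=3):
    """True once the newbie has shadowed enough *different* people.

    min_mentors=3 is an assumed value, not a rule from anywhere.
    """
    mentors = {s.oncall for s in past_shifts if s.shadow == newbie}
    return len(mentors) >= min_mentors


if __name__ == "__main__":
    history = [Shift("ana", "nic"), Shift("bo", "nic"), Shift("ana", "nic")]
    print(Shift("ana", "nic").page_targets())
    print(ready_for_rotation("nic", history))
```

Counting distinct mentors rather than total shifts reflects the point of shadowing a few different folks: the newer person gets to see more than one person's process for understanding and fixing an issue.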

Conclusion

It is hopefully not news that software engineering is a collaborative profession that requires communication; being intentional about when and how agreements with customers happen and how customer expectations get adjusted so they can be met is crucial. It may also be the case that the service(s) the team owns are noncritical enough that there doesn't need to be an oncall rotation outside business hours at all and fixing issues as they can be prioritized is fine, but taking care to prioritize bugs and production problems at least some of the time is necessary. It may also be the case that the team is distributed enough that oncall can "follow the sun" and oncall can be arranged to follow business hours for all team members. There are lots of ways to learn of and fix bugs in a timely way to meet customer expectations and no one way is perfect.

kelly kaoudis

04 February, 2021
