Every engineering org has one.
The person who knows where everything lives. The one who built the original service five years ago. The one everyone Slacks at 4:47 PM when something breaks because nobody else knows how it works.
You call them your best engineer. I call them your single load-bearing wall.
And load-bearing walls don’t warn you before they leave.
What Indispensability Actually Looks Like
From the outside, it looks like excellence. This person ships faster than anyone. They have context on every system. They can debug in minutes what takes others days. Leadership loves them. The team relies on them.
From the inside, it looks like a system that runs on one person’s memory.
The architecture diagrams are 18 months out of date. The deployment process has six steps that aren’t written down because the one person who does it has them memorized. The reason that service was built the way it was built? It’s in their head. The edge case that breaks the payment integration if you change the order of operations? Also in their head.
Nobody flags it. Because it doesn’t look like a risk. It looks like having a great engineer.
The knowledge isn’t in the system. It’s in one person. And every question that gets answered verbally instead of documented is another brick in a wall that collapses the moment that person is unavailable.
The Day It Broke
We had one. The only person who understood the full system context across every service we ran. Not just the code. The why behind the code. The decisions, the trade-offs, the undocumented dependencies that nobody else knew existed.
Then they left.
Every project in flight stalled. Not for a day. For over a week. Engineers across the globe were stuck waiting for context that walked out the door.
New engineers who had joined in the months before were the hardest hit. They’d been onboarding by asking this person questions. Four times a day, walking over to their desk, getting verbal answers, going back to work. It looked like mentorship. It looked like collaboration. It was actually a single point of failure disguised as teamwork. Every answer that was given verbally instead of written down evaporated the day that person left.
It wasn’t that the team couldn’t code. They could. It wasn’t that they weren’t talented. They were. They just didn’t know why things were built the way they were built. They didn’t know which decisions were load-bearing and which ones were incidental. They didn’t know what would break if they changed something, because the person who held that map was gone.
We treated their indispensability like a compliment. It was an indictment.
Not of them. Of us. Of leadership. Of me. Because building a system where one person’s absence paralyzes the organization isn’t that person’s failure. It’s a leadership failure. We let it happen because it was convenient. Because asking one person is faster than building documentation. Because it felt efficient right up until the moment it catastrophically wasn’t.
Why Nobody Fixes This
If the problem is this obvious, why does every engineering org have it?
Because fixing it feels like slowing down.
Documentation takes time. Writing context into pull requests takes time. Pair programming on critical systems so more than one person understands them takes time. Cross-training takes time.
And the person who’s the single point of failure? They’re also your fastest shipper. Pulling them off feature work to do knowledge transfer feels like sacrificing velocity. Leadership sees the sprint slow down. They ask why. You say “we’re reducing our bus factor” and they look at you like you’re speaking a different language.
So you don’t do it. You tell yourself you’ll get to it next quarter. You tell yourself the risk is manageable. You tell yourself that person isn’t going anywhere.
Until they do.
There’s also a harder truth underneath the operational one. Some engineers like being indispensable. It feels good to be the person everyone needs. It’s job security. It’s identity. And some of them, not consciously, not maliciously, resist the systems that would distribute their knowledge. Because being the only one who knows is what makes them valuable.
I know this because I’ve been that person. Earlier in my career, I held context tightly because it made me essential. I told myself I was protecting the team from complexity. I was really protecting my position. It took someone telling me directly that projects only worked when I was involved to realize that being essential and being effective are two different things. And that I was choosing the wrong one.
Why It Gets Worse the Longer You Wait
Here’s the math nobody wants to calculate.
The longer a single point of failure stays in that role, the deeper the dependency gets. Every month they’re the sole source of truth, they accumulate more context that nobody else has. The bus factor stays pinned at one, but the cost of losing that one person keeps climbing.
The engineer who’s been the single point of failure for six months? That’s recoverable. A few weeks of knowledge transfer, some documentation sprints, and the team can absorb the gap.
The engineer who’s been the single point of failure for five years? That’s a canyon. The knowledge gap between them and the rest of the team isn’t something you bridge with a few pairing sessions. It’s years of accumulated decisions, edge cases, undocumented behaviors, and institutional memory that can’t be transferred in a reasonable timeframe.
And here’s what makes it compound. The more indispensable that person becomes, the less time they have for knowledge transfer. Because everyone is asking them questions. Because they’re the only one who can fix production issues quickly. Because pulling them off the critical path to document what they know means the critical path stops.
It’s a trap that tightens every quarter you don’t address it. And most leaders don’t address it because the risk is invisible right up until the moment it becomes catastrophic.
What We Built After It Broke
After we lost our single point of failure, we didn’t just feel the pain. We redesigned the system so it couldn’t happen again.
Required context in every pull request. Not just “what changed.” Why it changed. What was considered and rejected. What downstream systems are affected. What breaks if this assumption is wrong. PRs became documentation, not just code delivery mechanisms. It felt like overhead at first. Within two months, new engineers were onboarding in half the time because the codebase was telling its own story.
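In practice this amounted to a required PR description template. A rough sketch of the shape, with illustrative section names rather than our verbatim template:

```markdown
## What changed

## Why it changed

## What was considered and rejected

## Downstream systems affected

## What breaks if the assumptions above are wrong
```

A PR that leaves any section empty is telling you exactly which context currently lives in only one person’s head.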
Runbooks became non-negotiable. Every critical system got a runbook. Not a wiki page that someone wrote once and forgot about. A living document that gets updated every time the system changes. If you touch the system, you update the runbook. No exceptions. The test we used was simple: could someone who’s never seen this system before follow the runbook and operate it successfully? If not, the runbook isn’t done.
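Runbooks that pass that newcomer test tend to share the same bones. A hypothetical skeleton (headings are illustrative, not a prescribed standard):

```markdown
# Runbook: <service name>

## What this system does (one paragraph)
## How to deploy it (every step, including the "obvious" ones)
## How to roll back
## Known failure modes and what to do about each
## Dashboards, alerts, and where the logs live
## Last verified by someone new to the system: <date>
```

The last line is the forcing function: if nobody unfamiliar with the system has walked the runbook recently, you don’t actually know it works.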
Rotation on critical systems. The person who built the service shouldn’t be the only person who deploys it three years later. We started rotating ownership on critical systems deliberately. Not because the original owner was doing a bad job. Because having one owner is a structural risk, not a performance issue.
Architecture decision records. This one paid for itself the fastest. Six months after we started writing ADRs, an engineer wanted to change how we handled session management. Before ADRs, they would have either asked the person who built it (who was gone) or guessed and hoped. Instead, they pulled up the ADR. It laid out exactly what we’d considered, why we chose the approach we chose, and what constraints would need to change before a different approach made sense. The engineer read it, realized the original constraints still held, and saved the team two weeks of work on a migration that didn’t need to happen.
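The format we used was close to the widely used Nygard-style ADR. A hedged sketch with a hypothetical record (the title and number are invented for illustration):

```markdown
# ADR-014: Server-side session storage

## Status
Accepted

## Context
The constraints and forces in play when the decision was made.

## Decision
What we chose, stated as a decision, not a description.

## Alternatives considered
What we rejected, and why.

## Consequences
What gets easier, what gets harder, and which constraints
would have to change before this decision should be revisited.
```

The “Consequences” section is what saved those two weeks: it tells a future engineer not just what was decided, but what would have to be true for the decision to expire.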
None of this was popular at first. Writing things down feels slower than just asking one person.
Until that person is gone.
How I Measure This Now
I measure bus factor the same way I measure any other operational risk. Not as a thought experiment. Not as a fun hypothetical about someone getting hit by a bus. As a number on a risk register.
For every critical system, I ask one question: how many people can operate this independently?
If the answer is one, that system goes on the register. It gets reviewed with the same seriousness as security vulnerabilities and infrastructure costs. Because that’s what a bus factor of one is. It’s a quantifiable risk that could paralyze your engineering organization. Treating it like a cultural nicety instead of an operational threat is how you end up with a week of stalled projects and engineers waiting for context that no longer exists.
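The register itself doesn’t need tooling, but the check is mechanical enough to script. A minimal sketch, assuming you maintain a list of critical systems and the people who can operate each one independently (the system names and operators below are made up for illustration):

```python
# Flag every critical system with a bus factor of one (or zero).
# "Operators" means people who can run the system independently,
# not people who have merely touched the code.

CRITICAL_SYSTEMS = {
    "payments-api": ["dana"],
    "session-service": ["dana", "priya"],
    "billing-batch": ["marco"],
    "auth-gateway": ["priya", "sam", "dana"],
}

def risk_register(systems: dict[str, list[str]]) -> list[str]:
    """Return the systems that belong on the risk register,
    sorted so reviews are deterministic."""
    flagged = [name for name, operators in systems.items()
               if len(operators) <= 1]
    return sorted(flagged)

if __name__ == "__main__":
    for system in risk_register(CRITICAL_SYSTEMS):
        print(f"bus factor <= 1: {system}")
```

Run it in CI or review it quarterly; the point is that the number is tracked somewhere other than your intuition.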
The question I ask every engineering leader I talk to is simple: if your most senior engineer is unavailable for a month, what breaks?
If you don’t know the answer, that’s the first problem.
If you do know the answer and you haven’t fixed it, that’s the real one.
What I Got Wrong
I spent the first section of this essay talking about “your” best engineer like this is someone else’s problem. It wasn’t. It was mine. Multiple times, across multiple companies.
I let it happen at my first leadership role because I didn’t know better. I let it happen at my second because fixing it felt too expensive. I let it happen at my third because the engineer was so good that the risk felt theoretical.
It’s never theoretical. It just hasn’t happened yet.
The best engineering teams I’ve built aren’t the ones with the most talented individuals. They’re the ones that operate the same whether their best person is in the room or on a beach with their phone off.
That’s not a low bar. It’s the highest bar in engineering leadership. And most teams aren’t even trying to clear it.
Your bus factor is a metric. Measure it. And if you don’t like the number, that’s the most important thing you can fix this quarter.
Nothing else on your roadmap matters if one person leaving can stop your team from shipping.