Monday morning, one engineer was out on PTO and three separate workstreams were already blocked before lunch. Slack filled with "quick question" pings nobody else could answer, and standup turned into a scavenger hunt for missing context. By Wednesday, it was obvious we did not have a throughput problem ... we had built a system that stored critical knowledge in one person.
Most leadership teams do not call this out when they see it. They call it excellence: "She is our strongest engineer," "He just knows the platform better than everyone else," and "If anything goes sideways, bring them in." I have said every one of those lines in my career, and I have paid for each one later.
What stalled us was not coding skill. It was missing map data ... why certain decisions existed, where hidden dependencies lived, and which changes looked safe but would actually blow up downstream. The team was talented. The system was fragile.
If you have read The Phoenix Project, you already know this pattern as the Brent Effect. One indispensable person gets attached to every critical path, leadership mistakes heroic saves for system health, and everything looks fine until that person is unavailable.
Different decade, same movie, better dashboards.
The thing that matters is this ... the Brent Effect is not a personality problem. It is an operating model choice. You design for local efficiency, then wake up with global fragility.
The dangerous part is how normal it feels while it is forming. Nothing looks broken at first. Tickets still move. Releases still go out. Leadership still sees green dashboards. The only early signal is behavioral ... everyone knows exactly who to tag when risk shows up, and nobody asks why that pattern keeps repeating.
The Metric Most Leaders Skip
Most teams track cycle time, deployment frequency, and incident counts. Good. Track those. But if you are not tracking skill and context distribution, you are missing the risk that can invalidate all three. I use a simple score now for critical domains, not because scores are sexy, but because hand-wavy "we should cross-train more" conversations never survive quarter-end pressure.
The Skill Distribution Score
Pick one critical service or workflow and score it on five dimensions, 0 to 2 each. The goal is not perfection. The goal is to expose where your roadmap depends on one nervous system.
1) Coverage
How many engineers can safely modify this area without the usual owner?
- 0 = one person
- 1 = two people with supervision
- 2 = three or more people independently
2) Recovery
Can on-call recover core flows without escalating to the same person?
- 0 = frequent single-person escalation
- 1 = partial runbook support
- 2 = rotation can recover consistently
3) Decision Traceability
Can engineers find why decisions were made?
- 0 = tribal memory
- 1 = partial PR context, scattered docs
- 2 = strong PR context + ADRs
4) Onboarding Transfer Speed
Can a new engineer ship meaningful change in 30 days?
- 0 = blocked by verbal dependency
- 1 = progress with heavy handholding
- 2 = can execute from docs and standards
5) Ownership Rotation Health
Has ownership rotated in the last 2 quarters?
- 0 = one long-term owner
- 1 = limited shadowing
- 2 = deliberate rotation with accountability
Total score out of 10.
- 0 to 3 = immediate structural risk
- 4 to 6 = hidden fragility with decent optics
- 7 to 8 = healthy core, risk at the edges
- 9 to 10 = resilient system behavior
One important note ... score by domain, not by team. Teams can look healthy in aggregate while one payment flow, one integration surface, or one deployment path is still hanging by a thread. The goal is not to get a pretty average. The goal is to find the load-bearing edges before they fail in public.
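If you want the arithmetic on a page, here is a minimal sketch in Python. The domain names and sub-scores are illustrative, not real data; the only real logic is the sum, the band lookup, and the per-domain loop that keeps a pretty average from hiding a fragile edge.

```python
# Minimal sketch of the Skill Distribution Score. Domains and sub-scores
# below are invented examples, not real data.
DIMENSIONS = (
    "coverage",
    "recovery",
    "decision_traceability",
    "onboarding_transfer",
    "ownership_rotation",
)

# Upper bound of each band, in ascending order.
BANDS = [
    (3, "immediate structural risk"),
    (6, "hidden fragility with decent optics"),
    (8, "healthy core, risk at the edges"),
    (10, "resilient system behavior"),
]

def score_domain(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the five sub-scores (0-2 each) and map the total to a risk band."""
    total = sum(scores[d] for d in DIMENSIONS)
    band = next(label for ceiling, label in BANDS if total <= ceiling)
    return total, band

# Score by domain, not by team: a healthy average can hide one fragile edge.
domains = {
    "payments-flow":   {"coverage": 0, "recovery": 1, "decision_traceability": 0,
                        "onboarding_transfer": 1, "ownership_rotation": 0},
    "deploy-pipeline": {"coverage": 2, "recovery": 2, "decision_traceability": 1,
                        "onboarding_transfer": 2, "ownership_rotation": 1},
}

for name, scores in domains.items():
    total, band = score_domain(scores)
    print(f"{name}: {total}/10 ... {band}")
```

Notice what the output does: the average of those two domains looks respectable, and the payments flow is still a 2 out of 10.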
If your score is low, the fix is not "find another hero." The fix is "stop architecting hero dependency."
Why Teams Avoid This Work
Because it feels slower in the short term. Pulling your fastest engineer into documentation, decision capture, pairing, and ownership transfer can look like a throughput hit on this sprint's board. It is also one of the few moves that protects next quarter's board from fiction.
This is the leadership tax nobody wants to pay when times are good. Then someone leaves, gets sick, burns out, or just takes a real vacation for the first time in a year, and the tax bill arrives with interest.
I used to think we were being efficient by routing everything through our strongest people. What we were actually doing was borrowing stability from one human and calling it process. I called it leverage in leadership meetings. It was dependency with better branding.
How Leaders Accidentally Build Hero Dependency
Nobody sets out to build this on purpose. It happens through reasonable decisions that compound in the wrong direction.
A high-pressure quarter shows up, and your most experienced engineer handles the hardest path because you need certainty. Next sprint, they do it again because they are already warm in the domain. By month three, they are the default owner for anything risky, and by month six the rest of the team has learned a quiet rule ... if it matters, route it to that person first.
Then leadership reinforces it without noticing.
You praise speed over transfer.
You reward clean incident recovery without asking why the same name keeps showing up.
You celebrate "ownership" while tolerating single-threaded knowledge.
You call this accountability. Your system experiences it as concentration risk.
This is why I treat bus-factor risk like infrastructure risk now. Nobody argues about whether production needs redundancy. But teams will still debate whether people context needs redundancy, even though both failures produce the same operational result ... things stop moving when one node disappears.
What This Looks Like in Practice
In real teams, this is usually less dramatic than people expect. You do not find one big failure and fix it with one big project. You find recurring friction patterns that keep pointing to the same dependency shape.
On-call keeps escalating to the same name. New engineers can ship, but only after multiple verbal handoffs. Major decisions can be explained by people, but not by artifacts. The dashboard says delivery is healthy while the team quietly optimizes around one person's availability.
That is why the score matters. It turns "we should probably cross-train" into a concrete diagnosis you can act on.
And the action sequence is simple. Pick the lowest sub-score, make one transfer move this sprint, and rerun the score next month. Do that repeatedly and the system gets less brittle without pretending you can redesign everything in one quarter.
That is what this work looks like when it is real ... less theater, more compounding.
The Monthly Cadence That Keeps It Alive
Most teams fail here by turning this into a one-time cleanup project.
Do not do that.
Run a 30-minute monthly review with your EM and tech lead on one critical domain:
- Re-score the 5 dimensions.
- Identify the lowest sub-score.
- Pick one transfer action for the next sprint.
- Assign a clear owner and due date.
That is it.
No giant transformation deck. No offsite. No "cross-functional task force" that dies quietly in 3 weeks. If your resilience plan starts with a steering committee, you built calendar theater, not redundancy.
The score is not useful because it is clever. It is useful because it forces a small leadership behavior most teams avoid ... replacing verbal dependency with system memory on purpose.
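If you want the review to leave a trail, make it produce an artifact. A sketch, assuming Python; the field names are illustrative, but the shape is the point ... one domain, one lowest dimension, one transfer action, one owner, one date.

```python
from dataclasses import dataclass
from datetime import date

# Sketch of the artifact a monthly review should produce: one record per
# critical domain, with exactly one transfer action and a named owner.
# Field names are illustrative, not a prescribed schema.
@dataclass
class DomainReview:
    domain: str
    sub_scores: dict       # the five dimensions, 0-2 each
    lowest_dimension: str  # where the next transfer move goes
    transfer_action: str   # one concrete move for next sprint
    owner: str
    due: date

review = DomainReview(
    domain="payments-flow",
    sub_scores={"coverage": 0, "recovery": 1, "decision_traceability": 0,
                "onboarding_transfer": 1, "ownership_rotation": 0},
    lowest_dimension="coverage",
    transfer_action="Pair a second engineer through two risky changes",
    owner="tech lead",
    due=date(2025, 3, 31),
)
print(review)
```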
The Global Team Reality Check
The moment your team spans time zones, this problem gets louder.
Time zones turn knowledge concentration into a latency multiplier. A blocked engineer in Europe can lose half a day waiting for context from North America. A hotfix shipped from North America can hand off badly to Asia when the only person who understands the dependency graph is asleep. What looks like "normal async friction" is often hidden bus-factor cost.
This is where your scorecard becomes operational, not theoretical.
If recovery is low, your follow-the-sun model is performative.
If decision traceability is low, your handoffs are storytelling instead of systems.
If onboarding transfer is low, your global hiring strategy is paying for capacity your operating model cannot absorb.
The fix is not more meetings. The fix is better artifacts and clearer ownership transfer points that survive time zones.
What to Change This Sprint
You do not need a reorg to start fixing this. Run these 4 moves on one critical domain this sprint:
- Require "why" context in PRs, not just "what changed."
- Treat stale runbooks as production risk, not documentation debt.
- Rotate one ownership responsibility this sprint, even if it is slower.
- Review one high-impact decision and backfill the missing ADR.
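For the first move, here is what a sketch of a CI gate could look like. It assumes your CI can hand the PR description to a script through an environment variable (PR_BODY here is an assumption, not a standard) and that your PR template uses a "## Why" heading, which is just a convention you would pick.

```python
import os
import sys

# Minimal CI gate: fail the check when a PR description has no "Why"
# section with real content. PR_BODY is an assumed environment variable;
# adapt to however your CI exposes the PR description.
lines = os.environ.get("PR_BODY", "").splitlines()

why_lines = []
in_why = False
for line in lines:
    stripped = line.strip()
    if stripped.lower().startswith("## why"):
        in_why = True
        continue
    if stripped.startswith("## "):  # any other section heading ends the block
        in_why = False
    if in_why and stripped:
        why_lines.append(stripped)

if not why_lines:
    print("Missing 'why' context: add a '## Why' section to the PR description.")
    sys.exit(1)

print(f"Why context found ({len(why_lines)} line(s)).")
```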
If you want to pressure test whether the moves are working, pick one simple check ... when something breaks this month, does your incident channel still default to tagging one name first? If yes, keep going. Your process changed, but your dependency graph did not yet.
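If you can export the channel, one crude way to run that check is to count who gets tagged first per incident. The mention format and names below are invented; the shape of the answer is what matters.

```python
import re
from collections import Counter

# Crude dependency-graph check: who gets tagged first when things break?
# Assumes Slack-style "@name" mentions and one exported string per
# incident thread; the messages below are invented examples.
def first_mention_counts(incident_messages: list[str]) -> Counter:
    counts: Counter = Counter()
    for msg in incident_messages:
        mentions = re.findall(r"@([\w.\-]+)", msg)
        if mentions:
            counts[mentions[0]] += 1  # only the first tag per incident
    return counts

incidents = [
    "prod checkout failing, @dana can you look?",
    "deploy stuck again, pinging @dana then @ops-rotation",
    "@oncall payments webhook returning 500s",
]
print(first_mention_counts(incidents).most_common())
# One name dominating this list means the dependency graph has not moved.
```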
Your best engineer should be an accelerator, not a life-support machine.
If this feels familiar, read The Leadership Bench Test. It is the same problem through the leadership lens. This post gives you the measurement model.
The Real Leadership Test
Any team can look strong when the same people are always available. Real system quality shows up when they are not.
If one PTO request can put your roadmap at risk, your org does not need a motivational speech. It needs redesign ... and less wishful thinking in leadership meetings.
You do not scale by finding better heroes. You scale by making heroics unnecessary.
