The Metric Nobody’s Tracking

Why Your AI Productivity Gains Are a Mirage

I was sitting with a principal engineer reviewing how AI was improving their team’s output.

Dashboards everywhere. Velocity up. PRs merged faster. Features shipped ahead of schedule. Cycle time down. Big smiles.

I asked one question: “What happened to unplanned work?”

Silence.

Then: “That’s a good question.”

It is. Because the real enemy of engineering productivity was never slow engineers. It's unplanned work. The hotfix that blows up your Tuesday. The production bug that hijacks three people for a day. The "drop everything" that turns a healthy sprint into a scramble.

If AI is doing its job, and your unit tests are doing their job, and your linters and SonarQube and security scans are doing their job, the code going out the door should be cleaner. More consistent. Fewer surprises.

Which means fewer bugs in production. Fewer 2 AM incidents. Fewer sprints derailed by work nobody planned for.

That’s not a velocity gain. That’s a stability gain. And it’s worth ten times more than shipping one extra feature per sprint.

But nobody was tracking it. Because unplanned work is invisible until it eats your roadmap. It doesn’t show up in PR metrics. It doesn’t show up in sprint velocity.

It shows up in the exhaustion on your team’s faces and the commitments you keep missing for reasons you can’t quite explain.

That conversation changed how I think about AI productivity entirely. And it started with what happened on my own team.

The Junk Drawer

Here’s what actually happened when we rolled out AI coding tools without guardrails.

PRs got bigger. Review times didn't change. That math only works one way: hundreds of lines getting rubber-stamped with an approve button.

Nobody noticed at first. Velocity looked great on paper.

Then we started seeing it. Six different HTTP clients in the same codebase. Four error handling approaches. Three state management patterns. Five engineers with AI assistants solving the same problems five different ways.

All technically working. None fitting our system.

We didn’t have a codebase. We had a junk drawer with a CI/CD pipeline.

The problems weren’t in any single PR. They were in the gaps between them. Different patterns, different conventions, different assumptions about how errors should propagate across services. Each PR passed its own tests. The system as a whole was slowly becoming unmaintainable.

This isn't an AI problem. This is what happens with every new developer who doesn't take time to learn the system. AI just does it faster, and at a scale where the inconsistency compounds instead of staying manageable.

Standards Became Load-Bearing

Here’s what most people miss. Your internal standards are the training data for your AI-assisted workflow. Not the thing AI replaced. The thing AI needs most.

Without standards, LLMs generate five variations of the same solution and every one of them compiles. Your lint rules don’t catch pattern inconsistency. Your tests don’t catch architectural drift. Your code review process is the last line of defense and it’s a human who’s been approving PRs for three hours straight and wants to go to lunch.

So we fixed it. Before we let AI write another line of production code, we gave it what you’d give a junior dev on day one. The context. The boundaries. The architectural patterns it needs to follow.

Lint rules that enforce patterns, not just syntax. If there’s one way to make an API call in your codebase, the lint rule should fail when AI generates a second way. Make the wrong approach break the build.
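As a sketch of what that can look like: a small CI check that fails the build when a non-blessed HTTP client gets imported. Everything here is illustrative, not our actual tooling: it assumes httpx is the blessed client and that the banned list is whatever your team migrated off.

```python
# ci/check_http_client.py -- fail the build when a second HTTP client appears.
# Hypothetical setup: httpx is the blessed client; the banned list is illustrative.
import ast
import pathlib
import sys

BANNED = {"requests", "urllib3", "aiohttp"}  # clients we don't want creeping back in

def banned_imports(source: str) -> set[str]:
    """Return any banned top-level modules imported by this source string."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & BANNED

if __name__ == "__main__":
    root = pathlib.Path("src")  # adjust to your source tree
    failures = []
    for path in (root.rglob("*.py") if root.is_dir() else []):
        hits = banned_imports(path.read_text())
        if hits:
            failures.append(f"{path}: imports {', '.join(sorted(hits))}")
    if failures:
        print("Use the blessed HTTP client (httpx) instead:", *failures, sep="\n")
        sys.exit(1)
```

Twenty-odd lines, and the second HTTP client never makes it past CI. That's what "make the wrong approach break the build" means in practice.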

Architectural tests that prevent boundary violations. “Services can’t call the database directly.” “The presentation layer can’t import business logic.” These aren’t suggestions. They’re automated checks that catch AI the same way they’d catch a junior engineer who doesn’t know the rules yet.
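A minimal version of such a check, as a plain pytest test. The layer names and the forbidden-import map are illustrative; the point is that the boundary rule lives in the test suite, where it fails loudly instead of living in a wiki nobody reads.

```python
# tests/test_architecture.py -- boundary rules as executable checks.
# Layer and package names here are illustrative; map them to your own tree.
import ast
import pathlib

# "files under src/<layer>/ must not import these top-level packages"
FORBIDDEN = {
    "presentation": {"business"},  # UI can't import business logic directly
    "services": {"db"},            # services go through a repository, not raw db
}

def imports_of(source: str) -> set[str]:
    """Top-level modules imported by a Python source string."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module.split(".")[0])
    return mods

def test_layer_boundaries():
    violations = []
    for layer, banned in FORBIDDEN.items():
        layer_dir = pathlib.Path("src") / layer
        for path in (layer_dir.rglob("*.py") if layer_dir.is_dir() else []):
            hits = imports_of(path.read_text()) & banned
            if hits:
                violations.append(f"{path} imports {sorted(hits)}")
    assert not violations, "\n".join(violations)
```

In the Java world, ArchUnit does this off the shelf; in Python a few lines of AST walking gets you most of the way.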

Blessed pattern examples for every common operation. One canonical way to handle API calls. One canonical way to handle errors. One canonical way to manage state. When AI has a reference implementation, it generates consistent code. When it doesn’t, it generates whatever worked in its training data, which is probably not what works in your system.
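Here's the shape of one blessed pattern, sketched with the standard library. The timeout, retry count, and single error type are assumptions for illustration, not our actual standard; what matters is that there's exactly one of each.

```python
# src/http_client.py -- the one blessed way to call an external API.
# The timeout, retry count, and ApiError type are illustrative choices.
import json
import time
import urllib.error
import urllib.request

class ApiError(Exception):
    """The single error type callers handle for any upstream failure."""

def get_json(url: str, timeout: float = 5.0, retries: int = 2) -> dict:
    """Canonical GET: bounded timeout, simple backoff retry, one error surface."""
    for attempt in range(retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return json.load(resp)
        except urllib.error.URLError as exc:
            if attempt == retries:
                raise ApiError(f"GET {url} failed after {attempt + 1} tries") from exc
            time.sleep(2 ** attempt)  # back off before retrying
```

Point an AI assistant at a file like this and say "do it like http_client.py," and you get one pattern back instead of five.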

Component templates that make the right approach easier than the wrong one. If doing it correctly takes less effort than doing it from scratch, AI will default to the correct approach. If doing it correctly requires knowing context that isn’t codified anywhere, AI will improvise. And AI’s improvisations are creative in exactly the ways you don’t want.

The junk drawer cleaned up within weeks. Not because we stopped using AI. Because we gave it constraints. The teams scaling with AI aren’t the ones moving fastest. They’re the ones who made the blessed path the easiest path before they gave AI the keys.

The Multiplier Nobody Talks About

Once we had guardrails in place, something else became visible. AI wasn’t just amplifying output. It was amplifying the gap between senior and junior engineers.

AI is a multiplier. Everyone says that. They just don’t finish the sentence.

A million times zero is still zero.

Give a senior engineer AI and they’re a fighter pilot with autopilot. Flying higher, seeing further, landing smoother. They know what to ask for. They can evaluate the output. They use AI for boilerplate, migrations, test generation. They know exactly what output they need and they know immediately if what they got back is right or wrong.

Give someone who skipped the fundamentals the same tools and they’re pressing every button that lights up. The output looks like engineering. It compiles. It passes the basic tests.

Then production breaks at 2 AM and they’re googling their own code like it’s someone else’s crime scene.

A recent study tracked AI credit usage across engineering teams. The expectation was that juniors would use it more since they’re the ones who need help. The data told a different story. Senior engineers were using AI 4-5x more than juniors.

The ones who need it least are using it most. Because they have the foundation that makes the multiplier work. They know what good code looks like, so they can prompt effectively. They have pattern recognition, so they can spot hallucinations. They understand system context, so they know when AI’s suggestion fits and when it doesn’t.

Junior engineers don’t have that foundation yet. They can’t tell if AI’s output is correct because they don’t have the mental model to evaluate it against. They’re not multiplying their skills. They’re outsourcing judgment they haven’t developed yet.

AI didn’t create this gap. It just made it impossible to ignore.

The Thing the AI Critics Won’t Admit

There’s a loud contingent of engineers arguing that AI-generated code is dangerous. And they’re telling on themselves.

They’re not describing an AI problem. They’re describing an infrastructure problem. Their infrastructure problem.

Complaining about AI code quality without engineering guardrails is like blaming the new hire for burning dinner when your kitchen doesn’t have a smoke detector, a timer, or a functioning stove.

Our engineering pipeline doesn't care who wrote the code. Human or AI, it runs through the same gauntlet. SonarQube catching code smells and duplication. Linters enforcing standards before anything gets near a PR. Unit tests that break if the logic is wrong. SAST, DAST, SCA, and secrets scanning catching security vulnerabilities automatically.

The code doesn’t get a free pass because a person typed it. And it doesn’t get extra scrutiny because AI wrote it. The system is the system.

If you have that infrastructure, you’re crazy not to use AI. If you don’t have it, the critics are right. But here’s the uncomfortable part. You probably shouldn’t be shipping human code without it either.

The engineers afraid of AI aren’t afraid of AI. They’re afraid of finding out their process was never as solid as they thought it was.

What You Should Actually Be Measuring

This is the part that matters. Everything above is context. This is the action.

If you’re measuring AI’s impact by velocity, PRs merged, or features shipped, you’re looking at the gas pedal. Nobody’s looking at the brakes.

Here are the five metrics I track now. Not because they’re the only ones that matter. Because they’re the ones that tell you if AI is genuinely improving your engineering org or just making the dashboards look good while the foundation erodes.

Unplanned work ratio. What percentage of your team’s capacity goes to things that weren’t on the sprint plan? Bugs, hotfixes, production incidents, “drop everything” requests. If AI is working, this number should be going down. If it’s going up, your AI-generated code is creating more problems than it’s solving.
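The computation is trivial once you tag tickets as planned or not. A minimal sketch; the field names are illustrative, but the same shape falls out of Jira or Linear however you track reactive work.

```python
# Unplanned work ratio from sprint data. Field names are illustrative.
def unplanned_ratio(tickets: list[dict]) -> float:
    """Fraction of completed points that were never on the sprint plan."""
    total = sum(t["points"] for t in tickets)
    unplanned = sum(t["points"] for t in tickets if not t["planned"])
    return unplanned / total if total else 0.0

sprint = [
    {"key": "FEAT-101", "points": 5, "planned": True},
    {"key": "FEAT-102", "points": 3, "planned": True},
    {"key": "HOTFIX-7", "points": 2, "planned": False},  # the Tuesday blow-up
]
# unplanned_ratio(sprint) -> 0.2: a fifth of capacity went to firefighting
```

The hard part isn't the math. It's tagging the work honestly and watching the trend across sprints instead of one number in isolation.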

Pattern consistency. How many different ways does your codebase solve the same problem? This isn’t a standard metric, but it should be. Every new pattern is cognitive load for every engineer who touches the code later. If AI is introducing new patterns faster than your team is consolidating them, your codebase is getting harder to work in, not easier.
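You can get a rough proxy for this by counting how many distinct libraries the codebase imports for the same job. A sketch, with a hypothetical capability map; extend the groupings to whatever "same problem" means in your stack.

```python
# Rough pattern-consistency probe: for each job, how many distinct libraries
# does the codebase actually use? The capability map is illustrative.
import ast
from collections import defaultdict

CAPABILITIES = {
    "http": {"requests", "httpx", "aiohttp", "urllib3"},
    "json": {"json", "ujson", "orjson", "simplejson"},
}

def pattern_spread(sources: list[str]) -> dict[str, set[str]]:
    """Map each capability to the distinct libraries the sources import for it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for src in sources:
        names = set()
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.Import):
                names |= {alias.name.split(".")[0] for alias in node.names}
            elif isinstance(node, ast.ImportFrom) and node.module:
                names.add(node.module.split(".")[0])
        for cap, libs in CAPABILITIES.items():
            seen[cap] |= names & libs
    return dict(seen)
```

Any capability with a spread above one is the six-HTTP-clients problem from earlier, caught while it's still two.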

Review rejection rate. What percentage of PRs require significant rework after review? If AI is generating code that passes the build but fails human review, you’re not saving time. You’re shifting the cost from writing to reviewing.

Incident frequency post-deployment. Are deployments getting cleaner or messier? Are you seeing more rollbacks or fewer? This is the stability metric that velocity dashboards don’t show you.

Time to productive for new engineers. How long does it take a new hire to ship their first meaningful contribution? If your codebase is becoming a junk drawer of inconsistent patterns, this number goes up even as your velocity numbers look good. The speed you’re gaining from AI is being spent onboarding engineers into a system that’s harder to understand.

None of these metrics are flashy. None of them will make a great slide in a quarterly review. But they’re the difference between a team that’s actually faster and a team that’s just busier.

The Real Question

Engineers who define themselves by how fast they ship are learning the difference between velocity and speed.

One has direction. The other just has volume.

AI is the most powerful tool I’ve used in 15 years of engineering. My teams use it daily. We’ve cut cycle times dramatically. I’m not arguing against AI. I’m arguing against measuring it wrong.

The question isn’t “is AI making us faster?” The question is “is AI making us better?” Faster with more incidents isn’t better. Faster with more inconsistency isn’t better. Faster with a codebase that’s harder to maintain every month isn’t better.

Stop counting features shipped. Start counting problems prevented.

That’s the metric that tells you if AI is working. And almost nobody is tracking it.