Issue #005

The AI Scoreboard That Actually Matters

April 28, 2026 · 3 min read

This is part 4 of a series on setting up AI the right way. Part 1: Before the AI Setup. Part 2: From Standards to Structure. Part 3: The Loop Makes it Yours.

If your AI dashboard only tracks usage, you're not measuring improvement. You're measuring attendance.

Most teams I talk to can show me adoption numbers in five minutes.

Credits up. Active users up. PRs touched by AI up.

Those are not useless numbers. They're just not the decision-making numbers.

Usage tells you the tool is present in the workflow.

It does not tell you whether your engineering system is getting better.

That's the gap.

After the setup work and decomposition work, this is where teams usually stall. They built rules, hooks, skills, and agents. They started the loop. But they kept using the same dashboard they had before. The old dashboard is built to report activity. The loop needs a dashboard that reports outcomes.

The Split That Clarifies Everything

I separate AI metrics into two buckets.

1) Output metrics (helpful, but easy to game)

  • AI credit consumption
  • AI active users
  • PR throughput
  • Cycle time

These can all move in the right direction while system health quietly degrades.

2) System metrics (slower, harder to fake, actually useful)

  • Unplanned work ratio
  • Repeat review corrections
  • Post-deploy incident frequency
  • Pattern divergence
  • Time-to-productive for new engineers
  • Explanation quality in reviews

These are the numbers that tell you whether your standards are actually being expressed in shipped code.

The Weekly Scoreboard I Use

You do not need a six-month analytics project for this.

Run a 30-minute weekly review with your EM, principal engineer, or tech lead.

Track six signals:

  1. Unplanned work ratio: Of all engineering time this week, what percent went to hotfixes, incidents, and surprise work?
  2. Repeat review corrections: Which review comment appeared more than twice this sprint?
  3. Post-deploy incident frequency: Are incidents per deploy trending down, flat, or up?
  4. Pattern divergence count: How many new ways did the codebase solve an already-solved problem?
  5. Onboarding drag: Is time-to-first-meaningful-PR for newer engineers improving or slipping?
  6. Explanation quality: In review or design discussion, can engineers explain why the chosen approach fits your architecture without leaning on the tool?

One useful rule: if output metrics are up while three or more system metrics are flat or negative, your setup is drifting and your dashboard is hiding it.
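If it helps to see that rule as something you could actually compute, here is a minimal sketch in Python. The names and the Trend scoring are hypothetical, not a prescribed tool; the point is only that once you score each system signal for the week, the drift check is a dozen lines.

  # Minimal sketch of the weekly scoreboard and the drift rule.
  # All names here are hypothetical; adapt them to whatever you already track.
  from dataclasses import dataclass
  from enum import Enum

  class Trend(Enum):
      IMPROVING = 1   # moved in the direction you want this week
      FLAT = 0
      WORSENING = -1

  @dataclass
  class WeeklyScoreboard:
      output_metrics_up: bool           # credits, active users, throughput trending up
      unplanned_work_ratio: Trend       # signal 1
      repeat_review_corrections: Trend  # signal 2
      post_deploy_incidents: Trend      # signal 3
      pattern_divergence: Trend         # signal 4
      onboarding_drag: Trend            # signal 5
      explanation_quality: Trend        # signal 6

      def is_drifting(self) -> bool:
          # The rule: output metrics up while three or more system signals
          # are flat or negative.
          system_signals = [
              self.unplanned_work_ratio,
              self.repeat_review_corrections,
              self.post_deploy_incidents,
              self.pattern_divergence,
              self.onboarding_drag,
              self.explanation_quality,
          ]
          not_improving = sum(1 for s in system_signals if s is not Trend.IMPROVING)
          return self.output_metrics_up and not_improving >= 3

  week = WeeklyScoreboard(
      output_metrics_up=True,
      unplanned_work_ratio=Trend.WORSENING,
      repeat_review_corrections=Trend.FLAT,
      post_deploy_incidents=Trend.IMPROVING,
      pattern_divergence=Trend.WORSENING,
      onboarding_drag=Trend.FLAT,
      explanation_quality=Trend.IMPROVING,
  )
  print(week.is_drifting())  # True: output is up, but four of six system signals are flat or worse

A spreadsheet works just as well. The value is in forcing the weekly scoring, not in the code.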

How This Connects to the Loop

Part 3 was about treating repeated review comments and overrides as configuration feedback.

This scoreboard is how you prioritize that feedback.

When one correction repeats, that is not a coaching issue first. It is usually a missing or mis-scoped rule, hook, or skill.

When pattern divergence rises, your standards are no longer ambient in the parts of the codebase growing fastest.

When explanation quality drops, you are not seeing a communication problem. You are seeing judgment atrophy and evaluation debt.

This is why the scoreboard matters. It tells you where to update the system before quality debt compounds.

Keep the Cadence Small

Teams overcomplicate this and then abandon it.

Don't run a giant monthly report.

Run one short weekly review. Make one update to your setup. Repeat.

One gap closed per week compounds faster than one "AI strategy offsite" per quarter.

The AI Leadership Audit maps this operating cadence and scoring model across all six dimensions: jonoherrington.com/leadership-audit.