
How to Measure Team Performance (Beyond Velocity and Deadlines)

Output metrics are easy to measure and easy to game. Behavioral metrics are harder to measure and harder to game. The teams that actually perform run both layers together. Three output metrics, three behavioral metrics, and a plan for using them without triggering Goodhart's Law.

By Asa Goldstein, QuestWorks

TL;DR

Team performance measurement has a shallow answer and a complete one. The shallow answer is to pick velocity or cycle time and make it the number everyone optimizes. Teams game the metric, the data goes bad, and leaders conclude "you can't measure engineering." The complete answer is to use three output metrics (velocity, cycle time, PR throughput) alongside three behavioral metrics (coordination speed, handoff friction, conflict resolution time). Output metrics tell you what shipped. Behavioral metrics tell you how it shipped. Together they resist Goodhart's Law because the team can't game both layers at once. Each metric is explained below with its limitations.

Every engineering leader has the same problem. The team feels productive. The team says it's productive. The output feels slow. Something is off, and the metrics you have do not tell you what.

The standard answer is to pick a number and push on it. Velocity. Cycle time. PR throughput. Deployments per day. Pick the metric, set the target, run the playbook. Six months later the number moves and the actual work does not feel any better. Maybe it feels worse. Congratulations, you have encountered Goodhart's Law.

British anthropologist Marilyn Strathern gave the most quoted formulation of the principle in a 1997 paper on accountability in education: "When a measure becomes a target, it ceases to be a good measure" (Strathern, 1997). The minute the team knows the number is being watched, the team starts optimizing the number instead of the work. The metric decouples from the thing it was supposed to measure.

So how do you measure team performance without breaking the measurement the moment you use it? Six metrics, three from the output layer and three from the behavioral layer underneath. The behavioral layer is the one most engineering management content skips, and it is the layer Goodhart's Law cannot easily corrupt because the team does not see it the way they see cycle time dashboards.

The Three Output Metrics (Useful, Gameable)

Output metrics measure what the team shipped. They are the metrics everyone already knows. Use them. Just understand their limits.

1. Velocity (Story Points Per Sprint)

What it measures: Story points completed per sprint. Designed as a planning tool for the team to forecast how much it can commit to in the next sprint.

What it catches: Trend changes in team throughput. If velocity drops 30 percent for three sprints in a row, something structural changed. New person onboarding, a hard technical migration, team morale, or a shift in work type.

What it misses: Everything about quality, value, and sustainability. A team that hits velocity targets by shipping bugs is "performing" by this metric. A team that refactors a tangled service and lowers velocity temporarily is "underperforming." Both readings are wrong.

How it breaks: Velocity as a target is the canonical Goodhart's Law case study. The team inflates point estimates until the velocity number looks acceptable. Eventually the points mean nothing and the planning purpose the metric was designed for is destroyed (Jellyfish).

Best used as: A team-level planning tool. Never a cross-team comparison. Never tied to compensation.
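The trend check described above, a sustained drop across consecutive sprints, is easy to script against sprint history. A minimal sketch; the list-of-points input, the 30 percent threshold, and the three-sprint window are illustrative assumptions, not any tool's API:

```python
def velocity_drop_alert(velocities, window=3, drop_threshold=0.30):
    """Flag when the last `window` sprints all sit `drop_threshold`
    or more below the average of the sprints before them."""
    if len(velocities) < window + 1:
        return False  # not enough history to establish a baseline
    baseline = sum(velocities[:-window]) / len(velocities[:-window])
    # Every recent sprint must be below the threshold, so a single
    # noisy sprint does not trigger the alert.
    return all(v <= baseline * (1 - drop_threshold)
               for v in velocities[-window:])

# Sustained drop: baseline around 40 points, last three sprints near 25.
print(velocity_drop_alert([38, 42, 40, 26, 25, 24]))  # → True
# One bad sprint followed by recovery does not trigger it.
print(velocity_drop_alert([38, 42, 40, 26, 41, 39]))  # → False
```

The all-sprints condition is the point: this is a structural-change detector, not a per-sprint scoreboard.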

2. Cycle Time (Commit to Deploy)

What it measures: The time between a commit and a deploy to production. Breaks down into coding time, pickup time, review time, and deploy time.

What it catches: Friction in the delivery pipeline. The pickup phase (time a PR waits for someone to start reviewing) is almost entirely about team coordination. Research on knowledge work shows items typically spend 70 to 85 percent of cycle time waiting rather than being actively worked on (Osborn, Medium, 2026). That waiting time is the team coordination signal hiding inside the cycle time number.

What it misses: Whether the work that moved through the pipeline was the right work. A team can drop cycle time by half by shipping trivial PRs and deferring the hard ones.

How it breaks: Teams split work into smaller PRs to look fast. Nothing wrong with small PRs, but the metric is now measuring PR size more than actual delivery speed.

Best used as: A pipeline health signal at the team level. Watch the trend. Investigate when it moves. Do not benchmark it across teams with different work profiles.
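If your delivery tooling exposes timestamps for first commit, PR open, first review, approval, and deploy, the phase breakdown above can be computed directly. A sketch under the assumption of a simple milestone-to-timestamp mapping, not any specific vendor's schema:

```python
from datetime import datetime

def cycle_time_phases(events):
    """Break one change's cycle time into the four phases.
    `events` maps milestone name -> ISO timestamp; the milestone
    names here are illustrative, not a real tool's field names."""
    t = {k: datetime.fromisoformat(v) for k, v in events.items()}
    hours = lambda a, b: (t[b] - t[a]).total_seconds() / 3600
    return {
        "coding": hours("first_commit", "pr_opened"),
        "pickup": hours("pr_opened", "first_review"),  # PR waiting for a reviewer
        "review": hours("first_review", "approved"),
        "deploy": hours("approved", "deployed"),
    }

phases = cycle_time_phases({
    "first_commit": "2026-01-05T09:00",
    "pr_opened":    "2026-01-05T17:00",
    "first_review": "2026-01-07T11:00",  # 42 hours of pickup wait
    "approved":     "2026-01-07T15:00",
    "deployed":     "2026-01-07T16:00",
})
print(phases)  # → {'coding': 8.0, 'pickup': 42.0, 'review': 4.0, 'deploy': 1.0}
```

In this example pickup dwarfs everything else, which is exactly the coordination signal hiding inside the aggregate cycle time number.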

3. PR Throughput (Merged PRs Per Developer Per Week)

What it measures: How many pull requests each developer merged, averaged over a period.

What it catches: Individual and team activity levels. Useful for spotting dramatic changes ("this person merged zero PRs this month, what's going on?").

What it misses: Almost everything that matters. A developer refactoring a core subsystem might merge one PR in a month that removes 10,000 lines of tech debt. A developer patching typos might merge 50 PRs in the same month. The throughput number rates the typo-patcher higher.

How it breaks: Same as velocity. Target it and people split work to inflate counts. Or game it by reviewing each other's PRs perfunctorily to keep the numbers flowing.

Best used as: A diagnostic signal to trigger a conversation, not a performance target. If a senior engineer's PR count drops to zero for two months, that is worth understanding, not punishing.
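Used diagnostically, the metric reduces to a per-developer weekly count plus a flag for long quiet stretches. A sketch assuming merge events arrive as (developer, week) pairs; that input shape and the four-week quiet window are illustrative:

```python
from collections import defaultdict

def throughput_flags(merges, weeks, quiet_weeks=4):
    """Count merged PRs per developer per week and flag anyone with a
    run of zero-merge weeks -- a prompt for a conversation, never a
    score. `merges` is a list of (developer, week_index) pairs."""
    counts = defaultdict(lambda: [0] * weeks)
    for dev, week in merges:
        counts[dev][week] += 1
    flags = {}
    for dev, per_week in counts.items():
        # Flag only a fully quiet trailing window, not a slow week.
        flags[dev] = sum(per_week[-quiet_weeks:]) == 0
    return dict(counts), flags

counts, flags = throughput_flags(
    [("ana", 0), ("ana", 5), ("ana", 6), ("ben", 0)], weeks=8)
print(flags)  # → {'ana': False, 'ben': True}
```

The output is a conversation starter for ben's manager, not an input to anyone's review.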

The Three Behavioral Metrics (Harder to Measure, Harder to Game)

Output metrics cover what shipped. Behavioral metrics cover how it shipped. These are the layer most engineering management content skips, usually because they are harder to instrument. The payoff for instrumenting them is that they are Goodhart's Law resistant: the team cannot easily game them without changing the actual behavior, which is the behavior you wanted to improve in the first place.

4. Coordination Speed (Time from Decision Needed to Decision Made)

What it measures: The interval between a decision becoming blocking and the team actually making the decision. Can be measured at the ticket level (how long was this blocked on a decision?), the meeting level (how long between raising the question and committing to an answer?), or the release level (how long between identifying a trade-off and resolving it?).

Why it matters: Teams with slow coordination speed look productive in individual work and grind to a halt when something requires cross-functional agreement. Coordination speed is where distributed teams with broken communication norms fall apart. It is also where highly functional teams pull ahead of average ones.

What it catches: Decision latency, which is usually invisible in output metrics. A team that shipped 10 features last quarter looks faster than one that shipped 8, but if the 8-feature team resolves decisions in 2 hours and the 10-feature team takes 2 weeks, the 8-feature team is actually faster on the work they chose to do.

What it misses: Decisions that never got raised because the team was avoiding the conflict. Hidden decision debt is a separate problem and requires direct conversation to surface.

How to measure it: Tag blocked tickets with the reason they are blocked. Time-stamp the block and the unblock. Review monthly. Alternatively, track how often "needs decision" items sit in Slack threads without resolution.
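Once tickets carry a block reason and two timestamps, the monthly review above reduces to a small script. A sketch; the field names (`reason`, `blocked_at`, `unblocked_at`) are illustrative assumptions, not a particular tracker's schema:

```python
from datetime import datetime
from statistics import median

def decision_latency_hours(tickets):
    """Hours each ticket spent blocked on a decision, plus the median.
    Only resolved decision blocks count; other block reasons and
    still-open blocks are skipped."""
    waits = []
    for t in tickets:
        if t.get("reason") != "needs-decision" or "unblocked_at" not in t:
            continue
        start = datetime.fromisoformat(t["blocked_at"])
        end = datetime.fromisoformat(t["unblocked_at"])
        waits.append((end - start).total_seconds() / 3600)
    return waits, (median(waits) if waits else None)

waits, med = decision_latency_hours([
    {"reason": "needs-decision", "blocked_at": "2026-02-02T09:00",
     "unblocked_at": "2026-02-02T11:00"},                        # 2 hours
    {"reason": "needs-decision", "blocked_at": "2026-02-03T09:00",
     "unblocked_at": "2026-02-05T09:00"},                        # 48 hours
    {"reason": "waiting-on-vendor", "blocked_at": "2026-02-03T09:00",
     "unblocked_at": "2026-02-04T09:00"},                        # ignored
])
print(waits, med)  # → [2.0, 48.0] 25.0
```

The median matters more than the mean here, because one two-week standoff should read as an outlier to investigate, not a shift in the baseline.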

5. Handoff Friction (Information Loss Between Team Members)

What it measures: How much context gets lost when work transfers between team members or between teams. Measured through rework rates, cross-functional defects, and the number of clarification questions asked after a handoff.

Why it matters: Handoffs are expensive and invisible. Most organizations underestimate handoff costs because the delays appear as waiting time rather than active work (Osborn, Medium, 2026). Research from McKinsey documented a 45 percent decrease in code defects and 20 percent faster time to market after companies restructured to reduce coordination dependencies between teams (LinearB).

What it catches: The cost of information loss. A clean handoff means the receiving person knows what they are picking up, why it matters, and what done looks like. A messy handoff means they reverse-engineer all three and either ship the wrong thing or waste days asking clarifying questions.

What it misses: Handoffs that happened informally and were not tracked. Much of the friction lives in hallway conversations that never hit a ticket.

How to measure it: Track rework rate (percentage of tickets that come back after being marked done). Measure clarification question volume on Slack threads following handoffs. Run post-handoff retros quarterly to surface the patterns.
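Rework rate, the first of those signals, falls straight out of ticket status histories. A minimal sketch, assuming each ticket exposes an ordered list of statuses; the status names are illustrative:

```python
def rework_rate(ticket_events):
    """Share of finished tickets that came back after being marked
    done. `ticket_events` maps ticket id -> ordered status history."""
    done, reworked = 0, 0
    for statuses in ticket_events.values():
        if "done" not in statuses:
            continue  # never finished; excluded from the denominator
        done += 1
        # Any status change after the first "done" counts as rework.
        if len(statuses) > statuses.index("done") + 1:
            reworked += 1
    return reworked / done if done else 0.0

rate = rework_rate({
    "T-1": ["todo", "in-progress", "done"],
    "T-2": ["todo", "in-progress", "done", "reopened", "done"],
    "T-3": ["todo", "in-progress"],           # still open; excluded
    "T-4": ["todo", "done", "reopened", "done"],
})
print(round(rate, 2))  # → 0.67
```

Two of the three finished tickets bounced back, a 67 percent rework rate, which in a real team would be the loudest handoff-friction alarm on the dashboard.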

6. Conflict Resolution Time (Disagreement to Alignment)

What it measures: How long it takes the team to resolve a substantive disagreement into a clear decision. Resolution means someone commits to a direction and the team moves. Not "we tabled it" and not "we agreed to disagree."

Why it matters: Teams that cannot resolve conflict freeze at every decision. Teams that resolve conflict badly leave unresolved tension that metastasizes into politics. The right middle is teams that can hold real disagreement in the room, resolve on the merits, and move.

What it catches: The health of productive conflict. Amy Edmondson's psychological safety research is the academic anchor here. Teams high in psychological safety surface disagreement openly and resolve it faster. Teams low in psychological safety either suppress the disagreement or let it fester (HBR).

What it misses: Conflicts that never surfaced. Safety problems sometimes hide as silence rather than visible disagreement.

How to measure it: Track the interval between a technical disagreement being raised and a commitment being made. For high-stakes disagreements, this is easy to instrument via architecture decision records (ADRs) with a "raised" and "resolved" timestamp.
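With "raised" and "resolved" timestamps on each record, resolution time is a subtraction per ADR. A sketch assuming a simple dict shape for each ADR rather than any particular ADR tooling; the titles are made up for the example:

```python
from datetime import datetime

def adr_resolution_days(adrs):
    """Days between raised and resolved for each ADR. Still-open
    ADRs are omitted rather than reported as zero, so an unresolved
    disagreement never looks instantly settled."""
    out = {}
    for adr in adrs:
        if "resolved" not in adr:
            continue
        raised = datetime.fromisoformat(adr["raised"])
        resolved = datetime.fromisoformat(adr["resolved"])
        out[adr["title"]] = (resolved - raised).days
    return out

days = adr_resolution_days([
    {"title": "ADR-014 event bus vs direct calls",
     "raised": "2026-03-02", "resolved": "2026-03-04"},
    {"title": "ADR-015 retire legacy auth", "raised": "2026-03-05"},
])
print(days)  # → {'ADR-014 event bus vs direct calls': 2}
```

Watch the trend over quarters: the number creeping up is the early warning that disagreements are being tabled instead of resolved.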

DORA and SPACE: How the Standards Handle This

Two frameworks dominate the engineering productivity conversation in 2026. Both are useful. Both have specific blind spots.

DORA (DevOps Research and Assessment) consists of four metrics: deployment frequency, lead time for changes, time to restore service, and change failure rate. The original DORA research is strong. The limits are also well documented: DORA measures delivery, and a team can hit elite DORA numbers while engineers are burning out, the codebase is degrading, or collaborative practices are breaking down, which DORA will not surface (Swarmia). The DORA authors themselves acknowledge the framework is incomplete and needs to be paired with other signals.

SPACE, published by Microsoft Research and GitHub, was designed as a more complete counterweight to DORA. It covers five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow (GetDX). SPACE covers coordination and collaboration, which DORA does not. The SPACE authors explicitly noted that the metrics within the framework were intended as illustrative examples, not prescriptive standards, and that organizations should avoid one-size-fits-all approaches.

The 2025 consensus from Swarmia and others is that DORA and SPACE work better together than apart. DORA tells you how efficiently code moves from commit to deploy. SPACE tells you how sustainably and collaboratively that code gets written. The six metrics above fit inside this framing: three output metrics borrowed from the DORA tradition, three behavioral metrics borrowed from the SPACE tradition.

How to Use These Six Metrics Without Breaking Them

Measurement works if the team trusts the measurement. It fails the moment the measurement feels like surveillance. Four rules for using these six metrics without triggering Goodhart's Law.

Do not tie individual metric scores to compensation. The moment PR throughput or cycle time shows up in a performance review, the team starts optimizing for the number instead of the work. Use the metrics diagnostically, not punitively.

Cross-check output metrics against behavioral metrics. If velocity is up and handoff friction is up, something is wrong. The team is shipping more but losing more context in the process. That is not performance improvement; that is debt accumulation.

Rotate which metrics are public and which are diagnostic. Public metrics get gamed. Diagnostic metrics, the ones the team does not stare at every day, tend to stay useful. Rotate which ones are visible so no single metric becomes the gaming target.

Ask the team what is broken before looking at the numbers. The single most reliable team performance diagnostic is a good retro. Numbers confirm what the team already knows but is not saying out loud. Retros surface the thing the numbers do not capture.

Where the Behavioral Signal Actually Comes From

Here is the problem with behavioral metrics. They are hard to instrument from production traffic. Coordination speed, handoff friction, and conflict resolution time all require either self-reporting from the team or direct observation of the work. Self-report is slow and distorts the data. Observation is expensive and changes the behavior the moment the team knows it is being observed.

Pilots, surgeons, and elite military units solved this problem the same way half a century ago: they moved the measurement into a simulator. The team practices the real work in a designed environment where the system observes the behavior without distorting the live operation. Commercial aviation's Crew Resource Management training is the canonical example. Flight crews practice coordination scenarios in simulators. The simulator sessions produce behavioral data on how the crew coordinates, defers, escalates, and recovers. The data feeds back into training. The crew gets better. The live flight operates without anyone's job being on the line for the measurement.

QuestWorks is the flight simulator for team dynamics, applied to knowledge work teams. Voice-controlled scenario challenges on QuestWorks' own cinematic platform generate behavioral data on coordination speed, handoff friction, and conflict resolution, measured in designed scenarios rather than live work. QuestDash surfaces the patterns to leaders as aggregate trends and to players as strengths-based callouts. HeroGPT, the private AI coach in the Slack integration, helps individual players work on the behaviors the simulator surfaced. Coaching conversations never share upstream. Participation is voluntary and never tied to performance reviews.

The advantage of the simulator layer is that the behavioral metrics get measured where Goodhart's Law cannot reach them. Players are not incentivized to game cycle time numbers because cycle time is not what the simulator is measuring. The simulator is measuring coordination behaviors in a designed scenario, and those scenarios are not the live work being graded.

The Stack That Actually Works

The team performance measurement stack that works in 2026 has three output metrics and three behavioral metrics running together. Velocity, cycle time, and PR throughput give you the delivery signal from your existing tooling (LinearB, Jellyfish, Swarmia, Uplevel, or similar). Coordination speed, handoff friction, and conflict resolution come from either a disciplined retro practice or a behavioral measurement layer like QuestWorks.

Neither layer is complete by itself. Output metrics without behavioral context turn into Goodhart's Law gaming exercises. Behavioral metrics without output context turn into soft diagnostics nobody takes seriously. Together they tell you both what the team shipped and how the team shipped it, which is the full team performance picture.

And then the hard part: act on what the metrics tell you. Measuring team dynamics more broadly than performance alone is the next layer up, and it deserves a separate piece. For now, the six metrics above are the complete answer to "how do you measure team performance beyond velocity and deadlines."

QuestWorks starts at $20 per user per month with a 14-day free trial. It integrates with Slack for install, invites, and HeroGPT coaching, then runs on its own platform for the actual practice. The behavioral signal surfaces from play. Your existing DORA and output metrics keep telling you what shipped. Together they tell you whether the team is actually getting better or just getting better at the dashboard.

Frequently Asked Questions

How do you measure team performance effectively?

Effective team performance measurement combines output metrics (velocity, cycle time, PR throughput) with behavioral metrics (coordination speed, handoff friction, conflict resolution time). Output metrics tell you what the team shipped. Behavioral metrics tell you how the team shipped it. Using only output metrics creates Goodhart's Law problems, where the metric becomes a target and the team games the number instead of improving the work.

Is velocity a bad metric?

Velocity itself is fine. The problem is misuse. Velocity measures story points completed per sprint, which was designed as a planning tool for the team itself, never as a performance metric for comparison across teams or over time. When managers use velocity as a target, teams inflate point estimates until the numbers look right. The data becomes useless for the planning purpose it was created for.

What is the SPACE framework?

SPACE is a developer productivity framework published by Microsoft Research and GitHub that measures performance across five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. SPACE was designed as a counterweight to DORA, which focuses narrowly on deployment speed and stability. The framework's creators noted metrics within SPACE were intended as illustrative examples, not prescriptive standards.

What are the DORA metrics, and what do they leave out?

DORA consists of four metrics: deployment frequency, lead time for changes, time to restore service, and change failure rate. DORA measures delivery. It does not measure developer experience, collaboration quality, cognitive load, or team workload. A team can hit elite DORA numbers while engineers burn out, tech debt accumulates, and the codebase degrades. DORA is useful as one input among several, not as the complete performance picture.

How do you keep performance metrics from being gamed?

Use multiple metrics that are hard to game simultaneously. Cross-check output metrics against behavioral metrics. Measure the inputs (how the team coordinates, hands off, resolves conflict) alongside the outputs (what the team shipped). Rotate which metrics are public and which are diagnostic. And never tie a single metric to compensation or performance reviews, which is the fastest way to turn a useful measurement into a gaming target.

Ready to Level Up Your Team?

14-day free trial. Install in under a minute.

Try QuestWorks free.