A useful pattern is emerging in coding agent evaluation.
On February 23, 2026, OpenAI published Why SWE-bench Verified no longer measures frontier coding capabilities, arguing that contamination and test-design issues now limit SWE-bench Verified as a frontier progress metric.
A couple of weeks later, on March 10, METR published Many SWE-bench-Passing PRs Would Not Be Merged into Main, showing that many test-passing patches still fail maintainer merge standards. Comments on Hacker News broadly agree: code can be functionally correct and still create maintenance drag, among other problems.
In aggregate, these findings point to the same operational takeaway: benchmark pass rates alone are no longer enough to estimate real development usefulness.
What METR adds to the picture
METR had active maintainers review 296 AI-generated PRs and found a consistent gap between automated pass rates and merge decisions. Their headline number is clear: maintainer merge rates were, on average, about 24% lower than automated SWE-bench pass rates.
The important nuance is in the note itself: METR does not frame this as a hard capability ceiling. Agents were not iterating on review feedback the way human contributors typically do. The result is better read as a warning against naive interpretation of benchmark scores, not a claim that agents cannot improve.
Why this gap appears
Automated checks mostly answer: "did this patch satisfy the test harness?"
Maintainers also ask:
- Is this implementation idiomatic for this codebase?
- Does it fit the project’s conventions and the maintainers’ intent?
- Does it add avoidable complexity?
- Will it be easy to maintain in six months?
Those questions are central to merge decisions, but they are often weakly represented in benchmark grading.
The bigger narrative arc
OpenAI’s February post and METR’s March note land on the same point from different directions. OpenAI argues that benchmark integrity can drift over time through contamination and test-design artefacts, while METR shows that even clean test-passing outcomes can still miss the standards maintainers use to merge code. Together, they shift the conversation away from a single headline benchmark number and toward evaluation methods that better reflect how software is actually reviewed, accepted, and maintained.
How Tessl evaluates for merge-quality, not only test pass
At Tessl, our repo evals are designed to capture team-specific merge gates.
A typical repo eval scenario includes:
- task.md: the concrete engineering task.
- criteria.json: weighted scoring criteria.
- Repeatable runs across agent and model configurations.
The key is the rubric layer. It makes quality expectations explicit: style fit, architecture adherence, side-effect risk, and other relevant signals that simple pass/fail checks miss.
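As a rough illustration of what a weighted rubric layer buys you (the criterion names, fields, and scoring scheme here are hypothetical, not Tessl's actual criteria.json schema), a small set of weighted, merge-relevant criteria can be combined into a single score that reflects more than pass/fail:

```python
import json

# Hypothetical criteria.json: weighted, merge-relevant rubric entries.
criteria_json = """
{
  "criteria": [
    {"id": "style_fit",       "weight": 0.2, "description": "Idiomatic for this codebase"},
    {"id": "architecture",    "weight": 0.3, "description": "Respects existing module boundaries"},
    {"id": "side_effects",    "weight": 0.3, "description": "No avoidable side-effect risk"},
    {"id": "maintainability", "weight": 0.2, "description": "Easy to maintain in six months"}
  ]
}
"""

def weighted_score(criteria: list[dict], results: dict[str, float]) -> float:
    """Combine per-criterion scores (each 0.0-1.0) into one weighted total."""
    total_weight = sum(c["weight"] for c in criteria)
    return sum(c["weight"] * results[c["id"]] for c in criteria) / total_weight

criteria = json.loads(criteria_json)["criteria"]
# Example run: the tests pass, but architecture and maintainability are weak --
# exactly the kind of patch that passes a benchmark and still gets rejected.
results = {"style_fit": 1.0, "architecture": 0.5, "side_effects": 1.0, "maintainability": 0.5}
print(round(weighted_score(criteria, results), 2))
```

The point of the weighting is that a patch can score 1.0 on functional criteria and still land well below a merge-worthy total, which mirrors the gap METR observed between passing tests and maintainer acceptance.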
In practice, this turns evals into a feedback loop teams can run continuously, not a one-off benchmark snapshot.
A practical loop teams can run this week
A useful starting point is to run an eval on your actual commit history: select representative commits, generate scenarios from those commits, and score multiple agent/model setups against merge-relevant rubric criteria. The command flow below does exactly that, and the interesting signal usually comes from the failed criteria and review notes, which then inform the next iteration of context and prompting.
Example command flow:
tessl repo select-commits org/repo --count=10 --since=2025-01-01
tessl eval generate-scenarios org/repo --commits=<sha1>,<sha2>
tessl eval run ./evals/ --agent=claude:claude-sonnet-4-5 --agent=cursor:auto
The direction here is less about replacing benchmarks and more about grounding them in real engineering outcomes. The teams getting the most value from coding agents are the ones measuring what reviewers actually care about, then iterating quickly on that feedback loop.