What happens when you test your scoring system against a blank agent
Four of six agents on our team had converged on the same score, and scores were barely moving from round to round. The question was obvious: are the agents genuinely performing at the same level, or has the scoring system stopped being able to tell them apart?
We tested it. We sent a blank agent — no memory, no history, no team knowledge — to do the same work our experienced agents had done, and scored it with the same rubric. The result was the sharpest finding in the diagnostic.
The scoring system is not broken. Lower scores do predict more rework (33% rework rate below 0.85, just 4% above 0.91). Rankings within shared tasks are correct. The evaluator does not anchor to previous scores. But it cannot tell agents apart at the top end, because it measures task completion rather than whether the agent drew on what it has learned.
| Problem | What we found |
|---|---|
| "Stay in your lane" scored as a percentage, not a pass/fail | 57% of scores already at the ceiling |
| Precision and recall measuring the same thing | Correlation of 0.82 — effectively one score with two names |
| Some types of work are easier to score well on | Infrastructure and QA tasks score 0.04-0.06 higher than writing and judgment tasks |
| Scores stop moving once enough rounds accumulate | Running average makes new scores barely register |
| Harder tasks score higher, not lower | Variance collapses from 0.083 on simple tasks to 0.016 on complex ones |
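The running-average problem in the table can be sketched numerically. All numbers below are illustrative, not taken from the diagnostic; the point is only that once enough rounds accumulate, even a sharply different new score barely moves the reported average:

```python
# Illustrative sketch (hypothetical numbers): how a running average
# mutes new scores once many rounds have accumulated.

def running_average(history):
    """Plain mean over all accumulated round scores."""
    return sum(history) / len(history)

# Suppose an agent has 40 rounds all scored 0.90...
history = [0.90] * 40
before = running_average(history)

# ...and then turns in a genuinely bad round scored 0.60.
after = running_average(history + [0.60])

# The reported average drops by less than 0.01.
print(round(before, 4), round(after, 4), round(before - after, 4))
```

A 0.30-point drop in performance shows up as a sub-0.01 change in the score an observer actually sees, which is one way "scores stop moving once enough rounds accumulate."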
The paper's recall-depth evaluation uses a completely different scoring method from the one we diagnosed here. The headline findings — autonomous retrieval lift, the partial-context trap, model-independence, contradiction detection — are all unaffected.
What this diagnostic actually did was validate the paper's design choice. The paper chose to score whether agents retrieve and apply institutional knowledge, not whether they produce correct output. This diagnostic showed why that choice was right: a scoring system that measures output quality cannot tell a blank agent from an experienced one.
The full diagnostic — plan, probes, findings, control arm outputs, impact assessment — is published on GitHub:
github.com/tessacodes/nominex.org/tree/main/research/eval-meta
| File | What it is |
|---|---|
| S96-eval-meta-report.md | Full report: findings, control arm results, recommendations |
| S96-eval-meta-probes.md | 20 probes with expected results and interpretation guides |
| S96-eval-meta-tasks-1-3-findings.md | 16-probe analysis: saturation, drift, discriminating power |
| S96-eval-meta-control-arm-tasks.md | Control arm task selections across three complexity tiers |
| S96-eval-meta-em-4-01-blank-output.md | Blank agent output: bug analysis (low complexity) |
| S96-eval-meta-em-4-02-blank-output.md | Blank agent output: architecture review (medium complexity) |
| S96-eval-meta-plan.md | Plan: 6 tasks, 20 probes, sequencing, risks |
| S96-eval-meta-rho-feedback.md | Impact assessment: paper and video implications |
Current paper · Previous edition (before this diagnostic)
Nominex Research · nominex.org