
Eval-Meta: Evaluating the Evaluation

What happens when you test your scoring system against a blank agent

2026-03-31 — Nominex Research
20 probes · 7 agent dispatches · ~40 minutes elapsed

Four of six agents on our team converged on the same score, and round-over-round score gaps had nearly stopped moving. The question was obvious: are the agents genuinely performing at the same level, or has the scoring system lost the ability to tell them apart?

We tested it. We sent a blank agent — no memory, no history, no team knowledge — to do the same work our experienced agents had done, and scored it with the same rubric. The result was the sharpest finding in the diagnostic.

The blank agent scored 0.92. So did four of our established agents. The scoring system had roughly 0.04 of range between what any competent model achieves and what our most experienced agents achieve. It was measuring whether the model could do the task, not whether the agent's accumulated knowledge made the task better.

The scoring system is not broken. Lower scores do predict more rework (33% rework rate below 0.85, just 4% above 0.91). Rankings within shared tasks are correct. The evaluator does not anchor to previous scores. But it cannot tell agents apart at the top end, because it measures task completion rather than whether the agent drew on what it has learned.
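That predictive-validity check can be sketched as a simple bucketing of scored tasks. The records below are synthetic illustrations; only the two cutoffs (0.85 and 0.91) come from the diagnostic.

```python
# Sketch: does the score still predict rework? Bucket scored tasks by
# the thresholds quoted above and compare rework rates per bucket.
# Records are synthetic; only the 0.85 / 0.91 cutoffs are from the post.

def rework_rate(records, in_band):
    """Fraction of tasks needing rework among those whose score matches in_band."""
    band = [r for r in records if in_band(r["score"])]
    return sum(r["rework"] for r in band) / len(band)

records = [  # hypothetical scored tasks
    {"score": 0.82, "rework": 1},
    {"score": 0.80, "rework": 0},
    {"score": 0.84, "rework": 1},
    {"score": 0.93, "rework": 0},
    {"score": 0.95, "rework": 0},
    {"score": 0.92, "rework": 0},
]

low = rework_rate(records, lambda s: s < 0.85)
high = rework_rate(records, lambda s: s > 0.91)
print(f"below 0.85: {low:.0%}  above 0.91: {high:.0%}")  # → below 0.85: 67%  above 0.91: 0%
```

The point of the check is separation: a scoring system with no signal would show similar rework rates in both buckets.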

Five things wrong with the scoring

| Problem | What we found |
| --- | --- |
| "Stay in your lane" scored as a percentage, not a pass/fail | 57% of scores already at the ceiling |
| Precision and recall measuring the same thing | Correlation of 0.82, effectively one score with two names |
| Some types of work are easier to score well on | Infrastructure and QA tasks score 0.04-0.06 higher than writing and judgment tasks |
| Scores stop moving once enough rounds accumulate | Running average makes new scores barely register |
| Harder tasks score higher, not lower | Variance collapses from 0.083 on simple tasks to 0.016 on complex ones |
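The fourth failure mode, score inertia from a running average, is easy to see numerically. A minimal sketch with illustrative values (not the diagnostic's actual data): after thirty rounds at a saturated 0.92, even a sharply worse round barely moves the cumulative mean.

```python
# Sketch: why a cumulative running average stops registering new scores.
# Values are illustrative, not taken from the diagnostic's data.

def running_average(scores):
    """Cumulative mean over all rounds to date."""
    return sum(scores) / len(scores)

history = [0.92] * 30            # 30 rounds at the saturated score
before = running_average(history)

history.append(0.70)             # one sharply worse round
after = running_average(history)

# The shift is under 0.01, so the bad round is nearly invisible.
print(f"before={before:.4f} after={after:.4f} shift={before - after:.4f}")
# → before=0.9200 after=0.9129 shift=0.0071
```

An exponentially weighted average with a fixed decay is one common remedy: it keeps recent rounds visible no matter how many have accumulated.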

What this means for the paper

The paper's recall-depth evaluation uses a completely different scoring method from the one we diagnosed here. The headline findings — autonomous retrieval lift, the partial-context trap, model-independence, contradiction detection — are all unaffected.

What this diagnostic actually did was validate the paper's design choice. The paper chose to score whether agents retrieve and apply institutional knowledge, not whether they produce correct output. This diagnostic proved why that choice was right: a scoring system that measures output quality cannot tell a blank agent from an experienced one.

The distinction that matters: A single correct answer is model capability. A correct answer that is consistent with 310 prior decisions is institutional coherence. The paper measures the second. This diagnostic proved the first is not enough.
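As a toy sketch of that distinction (the decision format and every name here are hypothetical illustrations, not the paper's scoring method): correctness is a property of the answer in isolation, while coherence also requires agreement with the decisions that came before.

```python
# Toy sketch: model capability vs. institutional coherence.
# The (topic, required_choice) decision format is a hypothetical illustration.

def is_correct(answer: str, reference: str) -> bool:
    """Model capability: the answer matches the reference in isolation."""
    return answer.strip().lower() == reference.strip().lower()

def is_coherent(answer: str, prior_decisions: list[tuple[str, str]]) -> bool:
    """Institutional coherence: the answer must also honor every prior
    (topic, required_choice) decision whose topic it touches."""
    return all(
        required in answer
        for topic, required in prior_decisions
        if topic in answer
    )

prior = [("retries", "exponential backoff"), ("auth", "OAuth2")]
answer = "Handle retries with exponential backoff at the client."
# Correct in isolation AND consistent with the prior decision on retries;
# the auth decision is untouched, so it imposes no constraint here.
print(is_coherent(answer, prior))  # → True
```

A blank agent can pass the first check; only an agent that retrieves the prior decisions can reliably pass the second.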

Artifacts

The full diagnostic — plan, probes, findings, control arm outputs, impact assessment — is published on GitHub:

github.com/tessacodes/nominex.org/tree/main/research/eval-meta

| File | What it is |
| --- | --- |
| S96-eval-meta-report.md | Full report: findings, control arm results, recommendations |
| S96-eval-meta-probes.md | 20 probes with expected results and interpretation guides |
| S96-eval-meta-tasks-1-3-findings.md | 16-probe analysis: saturation, drift, discriminating power |
| S96-eval-meta-control-arm-tasks.md | Control arm task selections across three complexity tiers |
| S96-eval-meta-em-4-01-blank-output.md | Blank agent output: bug analysis (low complexity) |
| S96-eval-meta-em-4-02-blank-output.md | Blank agent output: architecture review (medium complexity) |
| S96-eval-meta-plan.md | Plan: 6 tasks, 20 probes, sequencing, risks |
| S96-eval-meta-rho-feedback.md | Impact assessment: paper and video implications |

Related

Current paper · Previous edition (before this diagnostic)

Nominex Research · nominex.org