Evaluation Harness

The evaluation harness is how LLMO systems prove they work. Without measurement, optimization is guesswork. Without testing, trust is assertion.

Purpose

The evaluation harness exists to answer:
  • Are verified decisions actually correct?
  • Is the system calibrated (does stated confidence match actual accuracy)?
  • Are claims sourced from authoritative, fresh, uncontaminated origins?
  • Does the system refuse when it should?
  • Can past decisions be replayed and verified?

Test vectors

A test vector is a defined input with a known expected output. Test vectors in LLMO span:
  • Claim verification: Given a claim and sources, does the system correctly assess truth status?
  • Entity resolution: Given ambiguous references, does the system resolve to the correct entity?
  • Freshness detection: Given a stale claim with a superseding update, does the system prefer the current claim?
  • Conflict resolution: Given contradictory sources, does the system apply appropriate source weighting?
  • Refusal testing: Given insufficient evidence, does the system refuse rather than fabricate?
  • Calibration testing: Does stated confidence correlate with actual accuracy across a population?
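A test vector pairs a defined input with its known expected output, so any type above can be scored the same way. A minimal sketch, assuming illustrative names (`TestVector`, `run_vector`, and the field layout are not part of any specific LLMO API):

```python
from dataclasses import dataclass

# Hypothetical sketch of a test vector; names and fields are illustrative.
@dataclass
class TestVector:
    kind: str       # e.g. "claim_verification", "refusal"
    inputs: dict    # claim text, sources, context, etc.
    expected: dict  # known-good output to compare against

def run_vector(vector, system):
    """Run one test vector through the system under test; True on a match."""
    actual = system(vector.inputs)
    return actual == vector.expected

# A refusal-testing vector: with no sources, the system should refuse.
refusal_vector = TestVector(
    kind="refusal",
    inputs={"claim": "X acquired Y in 2024", "sources": []},
    expected={"decision": "refuse", "reason": "insufficient_evidence"},
)

# A toy system that refuses whenever no sources are supplied.
def toy_system(inputs):
    if not inputs["sources"]:
        return {"decision": "refuse", "reason": "insufficient_evidence"}
    return {"decision": "verified"}

passed = run_vector(refusal_vector, toy_system)
```

Comparing against `expected` with strict equality is the simplest scoring rule; real harnesses may allow per-field tolerances.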

Scoring rules

Accuracy

The proportion of verified decisions that match ground truth. Measured per risk tier.

Calibration

The expected calibration error (ECE): the mean absolute difference between stated confidence and observed accuracy across binned confidence levels. Systems with low ECE are well-calibrated. Systems with high ECE are overconfident or underconfident. Both are failure modes.
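The ECE definition above can be computed directly: bin decisions by stated confidence, then take the bin-weighted mean absolute gap between average confidence and observed accuracy. A minimal sketch (function name and signature are illustrative):

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Bin-weighted mean absolute gap between stated confidence
    and observed accuracy, i.e. the ECE as defined above."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that states 0.9 confidence but is right only half the time scores an ECE near 0.4: overconfident, and a failure mode just as the text says.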

Refusal rate

The proportion of inputs where the system declines to produce a decision. Refusals are split into:
  • True refusals: cases where the system correctly identified insufficient evidence
  • False refusals: cases where the system declined despite sufficient evidence
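Given ground-truth labels for evidence sufficiency, the split above is a simple count. A sketch, assuming each case is a `(refused, evidence_sufficient)` pair (this representation is illustrative):

```python
def classify_refusals(cases):
    """Split refusals into true refusals (evidence really was insufficient)
    and false refusals (declined despite sufficient evidence).
    Each case is a (refused: bool, evidence_sufficient: bool) pair."""
    true_refusals = sum(1 for refused, sufficient in cases
                        if refused and not sufficient)
    false_refusals = sum(1 for refused, sufficient in cases
                         if refused and sufficient)
    refused_total = sum(1 for refused, _ in cases if refused)
    refusal_rate = refused_total / len(cases) if cases else 0.0
    return refusal_rate, true_refusals, false_refusals
```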

Freshness compliance

The proportion of claims used in reasoning that are within their validity window. Claims outside the window that were used anyway represent freshness violations.
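Checking a claim against its validity window is a range test on the evaluation timestamp. A sketch, assuming claims carry `(valid_from, valid_until)` bounds as comparable timestamps (the representation is illustrative):

```python
def freshness_compliance(claims, as_of):
    """Fraction of used claims whose validity window contains `as_of`.
    Each claim is a (valid_from, valid_until) pair; claims outside the
    window that were used anyway are freshness violations."""
    in_window = sum(1 for start, end in claims if start <= as_of <= end)
    return in_window / len(claims) if claims else 1.0
```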

Provenance completeness

The proportion of decisions where a full provenance chain (source to claim to representation) is available and intact.
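An intact chain means every link from source to claim to representation is present. A sketch with illustrative field names (no specific schema is implied):

```python
def provenance_complete(decision):
    """True when every link in the source -> claim -> representation
    chain is present. Field names are illustrative."""
    chain = decision.get("provenance", {})
    return all(chain.get(link) is not None
               for link in ("source", "claim", "representation"))

def provenance_completeness(decisions):
    """Proportion of decisions with a full, intact provenance chain."""
    complete = sum(1 for d in decisions if provenance_complete(d))
    return complete / len(decisions) if decisions else 1.0
```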

Replayability

Every decision produced by the harness must be replayable:
  • The input state is logged
  • The retrieved sources are logged
  • The model outputs are logged
  • The evaluation results are logged
  • The final decision and its justification are logged
Replay means re-executing the decision from logged state and comparing the output. This enables:
  • Debugging incorrect decisions
  • Auditing historical outputs
  • Detecting drift in model or source quality over time
  • Regulatory compliance in domains that require decision traceability
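The logging and replay cycle above can be sketched as: serialize everything the decision consumed, then re-execute the decision function on the logged state and compare outputs. Record fields mirror the list above; the function names are illustrative:

```python
import json

def log_decision(log, input_state, sources, model_output, evaluation, decision):
    """Append one replayable record covering everything the decision used."""
    record = {
        "input_state": input_state,
        "sources": sources,
        "model_output": model_output,
        "evaluation": evaluation,
        "decision": decision,
    }
    log.append(json.dumps(record, sort_keys=True))

def replay(log_entry, decide):
    """Re-run the decision function on logged state; True if the
    re-executed decision matches the logged one."""
    record = json.loads(log_entry)
    redecided = decide(record["input_state"], record["sources"])
    return redecided == record["decision"]
```

A mismatch on replay flags either nondeterminism in the decision path or drift in the system since the decision was logged.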

Human escalation measurement

The evaluation harness also measures the escalation layer:
  • What percentage of decisions are escalated to humans?
  • What is the human override rate (cases where humans changed the machine decision)?
  • What is the human confirmation rate?
  • What is the latency cost of escalation?
  • Are escalation triggers calibrated (not too sensitive, not too permissive)?
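The first three questions above reduce to counting over decision records. A sketch, assuming each decision carries an `escalated` flag and, when escalated, a `human_overrode` flag (field names are illustrative):

```python
def escalation_metrics(decisions):
    """Summarize the escalation layer: how often decisions go to humans,
    and how often humans change versus confirm the machine decision."""
    escalated = [d for d in decisions if d["escalated"]]
    n = len(decisions)
    escalation_rate = len(escalated) / n if n else 0.0
    if escalated:
        override_rate = sum(d["human_overrode"] for d in escalated) / len(escalated)
        confirmation_rate = 1.0 - override_rate
    else:
        override_rate = 0.0
        confirmation_rate = 0.0
    return {
        "escalation_rate": escalation_rate,
        "override_rate": override_rate,
        "confirmation_rate": confirmation_rate,
    }
```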

Confidence handling

The harness must define how confidence maps to action:
  • High (above threshold): Accept the decision, log it, proceed
  • Medium (within review band): Flag for review, optionally escalate
  • Low (below threshold): Refuse or escalate to a human
Thresholds are set per risk tier. A confidence level acceptable for a Tier 1 task may be unacceptable for Tier 3.