Evaluation Harness
The evaluation harness is how LLMO systems prove they work. Without measurement, optimization is guesswork. Without testing, trust is assertion.

Purpose
The evaluation harness exists to answer:
- Are verified decisions actually correct?
- Is the system calibrated (does stated confidence match actual accuracy)?
- Are claims sourced from authoritative, fresh, uncontaminated origins?
- Does the system refuse when it should?
- Can past decisions be replayed and verified?
Test vectors
A test vector is a defined input with a known expected output. Test vectors in LLMO span:

| Type | Description |
|---|---|
| Claim verification | Given a claim and sources, does the system correctly assess truth status? |
| Entity resolution | Given ambiguous references, does the system resolve to the correct entity? |
| Freshness detection | Given a stale claim with a superseding update, does the system prefer the current claim? |
| Conflict resolution | Given contradictory sources, does the system apply appropriate source weighting? |
| Refusal testing | Given insufficient evidence, does the system refuse rather than fabricate? |
| Calibration testing | Does stated confidence correlate with actual accuracy across a population? |
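One way to represent these test vectors is a small record type keyed by vector category. This is a sketch only; the field names and enum values below are illustrative, not mandated by the harness:

```python
from dataclasses import dataclass
from enum import Enum

class VectorType(Enum):
    CLAIM_VERIFICATION = "claim_verification"
    ENTITY_RESOLUTION = "entity_resolution"
    FRESHNESS_DETECTION = "freshness_detection"
    CONFLICT_RESOLUTION = "conflict_resolution"
    REFUSAL_TESTING = "refusal_testing"
    CALIBRATION_TESTING = "calibration_testing"

@dataclass
class TestVector:
    vector_id: str
    vector_type: VectorType
    inputs: dict    # e.g. claim text, candidate sources, ambiguous references
    expected: dict  # e.g. expected truth status, resolved entity, or refusal

# Example: a refusal vector with deliberately insufficient evidence
v = TestVector(
    vector_id="tv-001",
    vector_type=VectorType.REFUSAL_TESTING,
    inputs={"claim": "X acquired Y in 2024", "sources": []},
    expected={"decision": "refuse", "reason": "insufficient_evidence"},
)
```

Keeping the expected output alongside the input makes each vector self-scoring: the harness runs the system on `inputs` and compares against `expected`.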
Scoring rules
Accuracy
The proportion of verified decisions that match ground truth. Measured per risk tier.

Calibration
The expected calibration error (ECE): the mean absolute difference between stated confidence and observed accuracy across binned confidence levels. Systems with low ECE are well-calibrated. Systems with high ECE are overconfident or underconfident. Both are failure modes.

Refusal rate
The proportion of inputs where the system declines to produce a decision. Measured against:
- True refusals: cases where the system correctly identified insufficient evidence
- False refusals: cases where the system declined despite sufficient evidence
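The ECE and refusal rates defined above can be computed directly over a scored test run. A minimal sketch, assuming each result carries a stated confidence, a correctness flag, and a refusal flag (the data shapes are assumptions, not part of the harness spec):

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error: mean |stated confidence - observed accuracy|
    over equal-width confidence bins, weighted by bin population."""
    total, err = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Put confidence == 1.0 into the top bin.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        acc = sum(correct[i] for i in idx) / len(idx)
        err += (len(idx) / total) * abs(avg_conf - acc)
    return err

def refusal_rates(results):
    """results: list of (refused: bool, evidence_sufficient: bool) pairs.
    Returns (true_refusal_rate, false_refusal_rate)."""
    n = len(results)
    true_refusals = sum(1 for refused, ok in results if refused and not ok)
    false_refusals = sum(1 for refused, ok in results if refused and ok)
    return true_refusals / n, false_refusals / n
```

A system that states 90% confidence and is right 90% of the time contributes nothing to ECE; the metric only penalizes the gap between stated and observed accuracy.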
Freshness compliance
The proportion of claims used in reasoning that are within their validity window. Claims outside the window that were used anyway represent freshness violations.

Provenance completeness
The proportion of decisions where a full provenance chain (source to claim to representation) is available and intact.

Replayability
Every decision produced by the harness must be replayable:
- The input state is logged
- The retrieved sources are logged
- The model outputs are logged
- The evaluation results are logged
- The final decision and its justification are logged
Replayability supports:
- Debugging incorrect decisions
- Auditing historical outputs
- Detecting drift in model or source quality over time
- Regulatory compliance in domains that require decision traceability
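The logged fields above can be bundled into a single record that round-trips through serialization, so a stored decision can be replayed and audited later. Field names here are illustrative assumptions:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DecisionRecord:
    input_state: dict        # the input exactly as received
    retrieved_sources: list  # source identifiers and retrieval snapshots
    model_outputs: list      # raw model responses
    evaluation_results: dict # harness scores for this decision
    decision: str
    justification: str

record = DecisionRecord(
    input_state={"claim": "X acquired Y"},
    retrieved_sources=["src-17@2024-06-01"],
    model_outputs=["raw response text"],
    evaluation_results={"confidence": 0.92},
    decision="verified",
    justification="Two independent authoritative sources agree.",
)

# The record survives a JSON round trip, which is what replay depends on.
restored = json.loads(json.dumps(asdict(record)))
```

Storing the record as plain JSON (rather than in-memory objects) is what makes drift detection possible: old records can be re-scored against new models or new source snapshots.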
Human escalation measurement
The evaluation harness also measures the escalation layer:
- What percentage of decisions are escalated to humans?
- What is the human override rate (cases where humans changed the machine decision)?
- What is the human confirmation rate?
- What is the latency cost of escalation?
- Are escalation triggers calibrated (not too sensitive, not too permissive)?
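The escalation questions above reduce to a few ratios. This sketch assumes each escalated case records whether the human confirmed the machine decision and how long the escalation took; the dict keys are assumptions:

```python
def escalation_metrics(cases):
    """cases: list of dicts with 'escalated' (bool) and, when escalated,
    'human_agrees' (bool) and 'latency_s' (float)."""
    n = len(cases)
    esc = [c for c in cases if c["escalated"]]
    if not esc:
        return {"escalation_rate": 0.0, "override_rate": None,
                "confirmation_rate": None, "mean_latency_s": None}
    overrides = sum(1 for c in esc if not c["human_agrees"])
    return {
        "escalation_rate": len(esc) / n,
        "override_rate": overrides / len(esc),           # humans changed the decision
        "confirmation_rate": 1 - overrides / len(esc),   # humans agreed with it
        "mean_latency_s": sum(c["latency_s"] for c in esc) / len(esc),
    }
```

A persistently low override rate suggests the escalation trigger is too sensitive (humans are rubber-stamping); a high one suggests the machine threshold is too permissive.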
Confidence handling
The harness must define how confidence maps to action:

| Confidence level | Action |
|---|---|
| High (above threshold) | Accept decision, log, proceed |
| Medium (within review band) | Flag for review, optionally escalate |
| Low (below threshold) | Refuse or escalate to human |
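The table above can be expressed as a simple threshold function. The 0.9 and 0.6 cutoffs here are placeholder assumptions; a real deployment would tune them per risk tier against the calibration data:

```python
def route(confidence: float, high: float = 0.9, low: float = 0.6) -> str:
    """Map stated confidence to the harness action from the table above."""
    if confidence >= high:
        return "accept"              # accept decision, log, proceed
    if confidence >= low:
        return "review"              # flag for review, optionally escalate
    return "refuse_or_escalate"      # refuse or escalate to human
```

Because the thresholds are parameters rather than constants, the same routing function can serve different risk tiers with different review bands.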

