Evaluation Harness

The evaluation harness is how LLMO systems prove they work. Without measurement, optimization is guesswork. Without testing, trust is assertion.

Purpose

The evaluation harness exists to answer:
  • Are verified decisions actually correct?
  • Is the system calibrated (does stated confidence match actual accuracy)?
  • Are claims sourced from authoritative, fresh, uncontaminated origins?
  • Does the system refuse when it should?
  • Can past decisions be replayed and verified?

Test vectors

A test vector is a defined input with a known expected output. Test vectors in LLMO span:
  • Claim verification: Given a claim and sources, does the system correctly assess truth status?
  • Entity resolution: Given ambiguous references, does the system resolve to the correct entity?
  • Freshness detection: Given a stale claim with a superseding update, does the system prefer the current claim?
  • Conflict resolution: Given contradictory sources, does the system apply appropriate source weighting?
  • Refusal testing: Given insufficient evidence, does the system refuse rather than fabricate?
  • Calibration testing: Does stated confidence correlate with actual accuracy across a population?
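A test vector pairs a defined input with its known expected output, so any type above can be scored the same way. A minimal sketch, assuming illustrative names (`TestVector`, `run_vector`, and the field layout are not part of any specific LLMO API):

```python
from dataclasses import dataclass

# Hypothetical sketch of a test vector; names and fields are illustrative.
@dataclass
class TestVector:
    kind: str       # e.g. "claim_verification", "refusal"
    inputs: dict    # claim text, sources, context, etc.
    expected: dict  # known-good output to compare against

def run_vector(vector, system):
    """Run one test vector through the system under test; True on a match."""
    actual = system(vector.inputs)
    return actual == vector.expected

# A refusal-testing vector: with no sources, the system should refuse.
refusal_vector = TestVector(
    kind="refusal",
    inputs={"claim": "X acquired Y in 2024", "sources": []},
    expected={"decision": "refuse", "reason": "insufficient_evidence"},
)

# A toy system that refuses whenever no sources are supplied.
def toy_system(inputs):
    if not inputs["sources"]:
        return {"decision": "refuse", "reason": "insufficient_evidence"}
    return {"decision": "verified"}

passed = run_vector(refusal_vector, toy_system)
```

Comparing against `expected` with strict equality is the simplest scoring rule; real harnesses may allow per-field tolerances.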

Scoring rules

Accuracy

The proportion of verified decisions that match ground truth. Measured per risk tier.

Calibration

The expected calibration error (ECE): the mean absolute difference between stated confidence and observed accuracy across binned confidence levels. Systems with low ECE are well-calibrated. Systems with high ECE are overconfident or underconfident. Both are failure modes.
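The ECE definition above can be computed directly: bin decisions by stated confidence, then take the bin-weighted mean absolute gap between average confidence and observed accuracy. A minimal sketch (function name and signature are illustrative):

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Bin-weighted mean absolute gap between stated confidence
    and observed accuracy, i.e. the ECE as defined above."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A system that states 0.9 confidence but is right only half the time scores an ECE near 0.4: overconfident, and a failure mode just as the text says.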

Refusal rate

The proportion of inputs where the system declines to produce a decision. Refusals are split into:
  • True refusals: cases where the system correctly identified insufficient evidence
  • False refusals: cases where the system declined despite sufficient evidence
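Given ground-truth labels for evidence sufficiency, the split above is a simple count. A sketch, assuming each case is a `(refused, evidence_sufficient)` pair (this representation is illustrative):

```python
def classify_refusals(cases):
    """Split refusals into true refusals (evidence really was insufficient)
    and false refusals (declined despite sufficient evidence).
    Each case is a (refused: bool, evidence_sufficient: bool) pair."""
    true_refusals = sum(1 for refused, sufficient in cases
                        if refused and not sufficient)
    false_refusals = sum(1 for refused, sufficient in cases
                         if refused and sufficient)
    refused_total = sum(1 for refused, _ in cases if refused)
    refusal_rate = refused_total / len(cases) if cases else 0.0
    return refusal_rate, true_refusals, false_refusals
```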

Freshness compliance

The proportion of claims used in reasoning that are within their validity window. Claims outside the window that were used anyway represent freshness violations.
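Checking a claim against its validity window is a range test on the evaluation timestamp. A sketch, assuming claims carry `(valid_from, valid_until)` bounds as comparable timestamps (the representation is illustrative):

```python
def freshness_compliance(claims, as_of):
    """Fraction of used claims whose validity window contains `as_of`.
    Each claim is a (valid_from, valid_until) pair; claims outside the
    window that were used anyway are freshness violations."""
    in_window = sum(1 for start, end in claims if start <= as_of <= end)
    return in_window / len(claims) if claims else 1.0
```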

Provenance completeness

The proportion of decisions where a full provenance chain (source to claim to representation) is available and intact.
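An intact chain means every link from source to claim to representation is present. A sketch with illustrative field names (no specific schema is implied):

```python
def provenance_complete(decision):
    """True when every link in the source -> claim -> representation
    chain is present. Field names are illustrative."""
    chain = decision.get("provenance", {})
    return all(chain.get(link) is not None
               for link in ("source", "claim", "representation"))

def provenance_completeness(decisions):
    """Proportion of decisions with a full, intact provenance chain."""
    complete = sum(1 for d in decisions if provenance_complete(d))
    return complete / len(decisions) if decisions else 1.0
```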

Replayability

Every decision produced by the harness must be replayable:
  • The input state is logged
  • The retrieved sources are logged
  • The model outputs are logged
  • The evaluation results are logged
  • The final decision and its justification are logged
Replay means re-executing the decision from logged state and comparing the output. This enables:
  • Debugging incorrect decisions
  • Auditing historical outputs
  • Detecting drift in model or source quality over time
  • Regulatory compliance in domains that require decision traceability
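The logging and replay cycle above can be sketched as: serialize everything the decision consumed, then re-execute the decision function on the logged state and compare outputs. Record fields mirror the list above; the function names are illustrative:

```python
import json

def log_decision(log, input_state, sources, model_output, evaluation, decision):
    """Append one replayable record covering everything the decision used."""
    record = {
        "input_state": input_state,
        "sources": sources,
        "model_output": model_output,
        "evaluation": evaluation,
        "decision": decision,
    }
    log.append(json.dumps(record, sort_keys=True))

def replay(log_entry, decide):
    """Re-run the decision function on logged state; True if the
    re-executed decision matches the logged one."""
    record = json.loads(log_entry)
    redecided = decide(record["input_state"], record["sources"])
    return redecided == record["decision"]
```

A mismatch on replay flags either nondeterminism in the decision path or drift in the system since the decision was logged.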

Human escalation measurement

The evaluation harness also measures the escalation layer:
  • What percentage of decisions are escalated to humans?
  • What is the human override rate (cases where humans changed the machine decision)?
  • What is the human confirmation rate?
  • What is the latency cost of escalation?
  • Are escalation triggers calibrated (not too sensitive, not too permissive)?
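The first three questions above reduce to counting over decision records. A sketch, assuming each decision carries an `escalated` flag and, when escalated, a `human_overrode` flag (field names are illustrative):

```python
def escalation_metrics(decisions):
    """Summarize the escalation layer: how often decisions go to humans,
    and how often humans change versus confirm the machine decision."""
    escalated = [d for d in decisions if d["escalated"]]
    n = len(decisions)
    escalation_rate = len(escalated) / n if n else 0.0
    if escalated:
        override_rate = sum(d["human_overrode"] for d in escalated) / len(escalated)
        confirmation_rate = 1.0 - override_rate
    else:
        override_rate = 0.0
        confirmation_rate = 0.0
    return {
        "escalation_rate": escalation_rate,
        "override_rate": override_rate,
        "confirmation_rate": confirmation_rate,
    }
```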

Confidence handling

The harness must define how confidence maps to action:
  • High (above threshold): Accept the decision, log it, proceed
  • Medium (within review band): Flag for review, optionally escalate
  • Low (below threshold): Refuse or escalate to a human
Thresholds are set per risk tier. A confidence level acceptable for a Tier 1 task may be unacceptable for Tier 3.