🎓 AIIQM-WELL University · Architect · Module A2 of 5

Evaluating AI Output
— Knowing When to Trust It

75 min
📋 Prerequisites: A1 complete
📊 Architect Level
Lesson Outcome: You can design an evaluation framework for any AI output — scoring for accuracy, consistency, and format compliance — and catch drift before it causes damage.
AIMY Opening

Before We Begin

Raw AI output is hypothesis, not fact. The Blackwell Mandate: we do not trust — we measure. This module is where you build the measuring instrument.

The difference between a brittle system and a reliable one is not that the reliable one never gets bad output. It's that the reliable one notices when output quality drops and alerts you before it costs money.

By the end of this module, you'll have a systematic way to score any AI output and catch drift before it reaches your users.

Analyze Phase

What Can Go Wrong

AI output can fail in predictable ways. Once you know the failure modes, you can measure them.

01
The Four Failure Modes

Hallucination: Confident wrong answers. The model invents facts that sound plausible and asserts them without any signal of doubt. "The capital of France is London" is the caricature; real hallucinations are usually subtler, like a citation, statistic, or parameter that looks credible but does not exist.

Format Drift: Output structure changes over time. You asked for JSON, sometimes you get JSON, sometimes you get markdown. Breaks downstream parsing.

Context Loss: Model ignores key instructions. You specified "keep it under 100 words" — it returns 500 words. You specified "answer only in Spanish" — it answers in English.

Tone Shift: Responses become unpredictable in voice or style. One response is professional, the next is casual. One is verbose, the next is terse. Users notice and trust drops.

02
Why Manual Review Does Not Scale

Checking every output by hand is not a system — it is a job. You cannot hire humans fast enough to keep pace with volume. Architect-level thinking: build the quality gate once, run it automatically.

This is where evaluation becomes infrastructure. You write the rules once. Then they run on every output, every time, forever.

03
The Evaluation Spectrum

Binary checks: Did the output include X? Does it contain valid JSON? Is the length under 100 words? Pass/fail — no nuance.

Rubric scoring: Rate each dimension on a scale (1–5). Example: Accuracy 4/5, Completeness 3/5, Format 5/5. Gives you granular insight into what broke.

Reference comparison: Compare output to a known-good gold standard. "Does this summary match the structure of this reference summary?" Uses similarity scoring.

Most systems use binary checks for speed, rubric scoring for precision, reference comparison for consistency. You combine all three.
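Binary checks are cheap enough to run on every output, so they usually come first. A minimal sketch in Python, assuming the spec is "valid JSON, under 100 words" (both rules illustrative):

import json

def binary_checks(output: str) -> dict:
    checks = {}
    # Rule 1: does the output parse as JSON?
    try:
        json.loads(output)
        checks["valid_json"] = True
    except json.JSONDecodeError:
        checks["valid_json"] = False
    # Rule 2: is it under 100 words?
    checks["under_100_words"] = len(output.split()) < 100
    return checks

print(binary_checks('{"summary": "Q3 revenue grew 12 percent."}'))
# -> {'valid_json': True, 'under_100_words': True}

Any failed check is a hard fail: no nuance, no judgment call, just a gate.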

Hallucination is not the model lying. It is the model not knowing it doesn't know. Your job as an architect is to catch that gap before it reaches your user.

Integrate Phase

Building an Eval Framework

Now you design the measurement system. It lives in code and runs every time.

04
Define Your Quality Dimensions

For each output type, define 3–5 dimensions that matter. Here's an example for a document summary:

  1. Accuracy: Did it capture the key points? No hallucinations?
  2. Completeness: Nothing critical missing?
  3. Format: Is it the right structure (JSON, markdown, plain text)?
  4. Length: Within spec (under 500 words)?
  5. Actionability: Can the reader act on it?

Choose dimensions that your users actually care about. If users never check length, don't score it. If they always verify accuracy, make accuracy your first dimension.
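One way to keep the rubric honest is to treat it as data, not prose. A sketch of the summary rubric above as a plain config dict (dimension names from the list; the constant name is just illustrative):

SUMMARY_RUBRIC = {
    "accuracy": "Did it capture the key points, with no hallucinations?",
    "completeness": "Is anything critical missing?",
    "format": "Is it the right structure (JSON, markdown, plain text)?",
    "length": "Is it within spec (under 500 words)?",
    "actionability": "Can the reader act on it?",
}

Keeping dimensions as data means the eval prompt in the next step can be generated from one source of truth instead of hand-edited in two places.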

05
Write the Eval Prompt

You're going to use Claude to evaluate Claude output. This sounds circular — it's not. The evaluator Claude gets a different prompt with a different role. Same model, different context, different behavior.

Here's the pattern:

You are a quality assurance expert. Your job is to score AI-generated output.

Original task: [USER'S ORIGINAL REQUEST]
Original input: [USER'S INPUT DATA]
AI's output: [THE OUTPUT TO EVALUATE]

Score this output on these dimensions. Return JSON.

Dimensions:
- accuracy (1-5): Does it contain correct information?
- completeness (1-5): Is anything critical missing?
- format (1-5): Does it match the required format?
- actionability (1-5): Can someone act on this?

Respond with only valid JSON:
{"accuracy": N, "completeness": N, "format": N, "actionability": N, "reason": "explanation"}

The key: Claude-as-evaluator has no stake in defending the original output. It's just scoring. You get honest feedback.
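A minimal sketch of this pattern in Python, assuming the official anthropic SDK with ANTHROPIC_API_KEY set in the environment; the model name and function names here are assumptions, not fixed choices:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EVAL_PROMPT = """You are a quality assurance expert. Your job is to score AI-generated output.

Original task: {task}
Original input: {input_data}
AI's output: {output}

Score this output on these dimensions. Return JSON.

Dimensions:
- accuracy (1-5): Does it contain correct information?
- completeness (1-5): Is anything critical missing?
- format (1-5): Does it match the required format?
- actionability (1-5): Can someone act on this?

Respond with only valid JSON:
{{"accuracy": N, "completeness": N, "format": N, "actionability": N, "reason": "explanation"}}"""

DIMS = ["accuracy", "completeness", "format", "actionability"]

def evaluate_output(task: str, input_data: str, output: str) -> dict:
    message = client.messages.create(
        model="claude-opus-4-20250514",  # assumption: use your deployed model
        max_tokens=300,
        messages=[{"role": "user", "content": EVAL_PROMPT.format(
            task=task, input_data=input_data, output=output)}],
    )
    # Production code should handle the case where the reply is not valid JSON.
    scores = json.loads(message.content[0].text)
    scores["average"] = sum(scores[d] for d in DIMS) / len(DIMS)
    return scores

This is also the skeleton of the PTR exercise at the end of the module: add the PASS/FAIL print and run it on real outputs.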

06
Build the Baseline

Run 10 known-good examples through your eval. Record scores. This is your baseline.

Why? So you can detect drift. If your baseline shows "average accuracy is 4.2/5," and tomorrow you see "average accuracy is 3.1/5," you know something broke. You didn't guess. You measured.

Keep baselines per model version, per system version. When you upgrade Claude from Opus 3 to Opus 4, rebuild the baseline. When you change your prompt, rebuild it. The baseline is your truth anchor.
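A sketch of the baseline build, assuming the evaluate_output() helper from step 05 and a list of known-good (task, input, output) triples; the file name and version tags are illustrative:

import json
import statistics

DIMS = ["accuracy", "completeness", "format", "actionability"]

def build_baseline(examples, path="baseline.json"):
    # Score every known-good example with the evaluator.
    runs = [evaluate_output(task, inp, out) for task, inp, out in examples]
    baseline = {dim: statistics.mean(r[dim] for r in runs) for dim in DIMS}
    # Tag the baseline so it is rebuilt per model version and prompt version.
    baseline["model"] = "claude-opus-4-20250514"  # assumption
    baseline["prompt_version"] = "v1"             # assumption
    with open(path, "w") as f:
        json.dump(baseline, f, indent=2)
    return baseline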

Manage Phase

Running Evals in Production

Now you deploy the eval system. It runs automatically, logs results, alerts on drift.

07
The Sample Rate Decision

You cannot eval every output. API costs will kill you. Decide on a sample rate:

  1. High-stakes outputs (100%): Medical diagnoses, legal documents, financial advice. Every output gets scored.
  2. Normal volume (10% random sample): Most systems. Score 1 in 10. At 100 outputs/day that is 10 evals/day, roughly 70 a week: enough to track weekly averages, though single-day numbers will be noisy.
  3. Any flagged output (100%): If a format check fails or downstream parsing fails, always eval that output. Something is clearly wrong.

Document your sample rate strategy. It's part of your architecture.
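A sketch of this routing decision as one function; high_stakes and flagged are hypothetical flags your pipeline would set upstream:

import random

def should_evaluate(high_stakes: bool, flagged: bool, sample_rate: float = 0.10) -> bool:
    # High-stakes and flagged outputs are always scored (100%).
    if high_stakes or flagged:
        return True
    # Everything else falls into the random sample (10% by default).
    return random.random() < sample_rate

The function is trivial on purpose: the sample rate strategy should be readable in ten seconds by whoever audits the system.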

08
Logging Eval Results

Store: timestamp, input hash, output hash, eval scores, pass/fail, model version. This is your quality ledger.

Example log entry:

timestamp: 2026-04-29 14:23:45
request_id: abc123
input_hash: sha256_of_input
output_hash: sha256_of_output
model: claude-opus-4
eval_scores: {accuracy: 4, completeness: 3, format: 5, actionability: 4}
average_score: 4.0
status: PASS (threshold: >= 3.5)
evaluator_reason: "Minor completeness gap but output is actionable"

Store this in a database you can query. Then you can ask "show me all failures for model version X" or "what was average accuracy last week?"
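A sketch of that ledger using SQLite from the Python standard library; the schema mirrors the example entry above, and the table name is hypothetical:

import datetime
import hashlib
import json
import sqlite3

conn = sqlite3.connect("eval_log.db")
conn.execute("""CREATE TABLE IF NOT EXISTS eval_log (
    timestamp TEXT, request_id TEXT, input_hash TEXT, output_hash TEXT,
    model TEXT, eval_scores TEXT, average_score REAL, status TEXT, reason TEXT)""")

def log_eval(request_id, input_text, output_text, model, scores, threshold=3.5):
    avg = scores["average"]
    conn.execute(
        "INSERT INTO eval_log VALUES (?,?,?,?,?,?,?,?,?)",
        (datetime.datetime.now().isoformat(), request_id,
         hashlib.sha256(input_text.encode()).hexdigest(),
         hashlib.sha256(output_text.encode()).hexdigest(),
         model, json.dumps(scores), avg,
         "PASS" if avg >= threshold else "FAIL", scores.get("reason", "")))
    conn.commit()

# "Average accuracy last week" then becomes one query (needs SQLite's
# built-in JSON functions):
#   SELECT AVG(json_extract(eval_scores, '$.accuracy')) FROM eval_log
#   WHERE timestamp >= date('now', '-7 days');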

09
Acting on Drift

Your baseline says "accuracy average is 4.2/5." Tomorrow it's 3.1/5. That is a drop of more than 20% from baseline: investigate immediately.

Common causes:

  1. Prompt change: Someone modified the system prompt and broke the instruction.
  2. Model update: Claude was updated and behaves slightly differently on your specific task.
  3. Input distribution shift: Your users are now asking different questions. The model works fine, but on different problems.
  4. Context window overflow: Your input is now too long and getting truncated.

For each, the fix is different. But you only find the problem if you're measuring.
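A sketch of the drift check against the stored baseline, using the 20% rule above; recent_scores is a list of score dicts pulled from your log, and the print stands in for real alerting:

import json
import statistics

DIMS = ["accuracy", "completeness", "format", "actionability"]

def check_drift(recent_scores, baseline_path="baseline.json", max_drop=0.20):
    with open(baseline_path) as f:
        baseline = json.load(f)
    alerts = []
    for dim in DIMS:
        current = statistics.mean(s[dim] for s in recent_scores)
        # Alert when the current average falls 20%+ below the baseline.
        if current < baseline[dim] * (1 - max_drop):
            alerts.append(f"{dim}: baseline {baseline[dim]:.1f}, current {current:.1f}")
    for alert in alerts:
        print("DRIFT ALERT:", alert)
    return alerts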

PTR — Prove The Result

Write a Python function evaluate_output(original_input, ai_output) that:

  1. Sends both to Claude with a rubric (score accuracy, format, actionability each 1–5).
  2. Returns a dict with scores and an average.
  3. Prints PASS if average >= 3.5, FAIL otherwise.
  4. Tests on 3 outputs from your B4 or B5 tools (the ones you built in Builder level).

You're done when you can run this function on real outputs and get consistent, reproducible scores. That is proof that your eval system works.

Module Checkpoint

Before You Move On

✓ Verify These Four Things
  1. You can name all four AI output failure modes (Hallucination, Format Drift, Context Loss, Tone Shift).
  2. You have written at least one eval prompt with a rubric and tested it on sample outputs.
  3. You understand the difference between binary checks (pass/fail) and rubric scoring (1-5 per dimension).
  4. You know what 'baseline' means and how to use it to detect drift.
AIM Commitment

What You Proved Today

You moved from trusting output to measuring it. Quality is no longer a feeling — it's a number.

  1. Analyze: You mapped the four failure modes that break AI systems in production.
  2. Integrate: You designed a rubric-based evaluation framework for AI output with baseline tracking.
  3. Manage: You built a logging and drift detection protocol for production quality control.