🎓 AIIQM-WELL University · Architect · Module A4 of 5

IQ Management
Measuring and Improving AI Output

75 min
📋 Prerequisites: A3 complete
📊 Architect Level
Lesson Outcome: You can apply the Blackwell Standard to AI output quality — measure it with a ledger, identify variance sources, and run systematic improvement cycles.

Dr. David Harold Blackwell proved that conditioning an estimator on the right information can only make it better, never worse. IQ Management applies the same principle to AI: look at the right data, measure the right variables, and you will always find a path to better performance.

Analyze Phase

What Intelligence Quotient Means in AI Systems

01
The AIIQM Definition of IQ

IQ in the AIIQM framework = Output Quality ÷ Input Cost

A system with high IQ produces great outputs cheaply. A system with low IQ burns tokens for inconsistent results. If you can improve quality without increasing cost, IQ goes up. If you can reduce cost without sacrificing quality, IQ goes up. Both directions matter.
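The ratio above can be computed directly. A minimal sketch — the 0-100 score scale and token-denominated cost are assumptions; use whatever units your eval produces:

```python
def iq(quality_score: float, token_cost: float) -> float:
    """IQ = Output Quality / Input Cost.

    quality_score: eval score, assumed 0-100 here.
    token_cost: tokens consumed by the run (must be > 0).
    """
    if token_cost <= 0:
        raise ValueError("token_cost must be positive")
    return quality_score / token_cost

# Both improvement directions raise IQ:
base = iq(80, 2000)            # 0.04
better_quality = iq(90, 2000)  # 0.045 -- higher quality, same cost
cheaper = iq(80, 1500)         # higher -- same quality, lower cost
```

Because both the numerator and the denominator are levers, logging both per run (as the IQ Ledger below does) is what makes the ratio actionable.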

02
The Three Sources of AI Variance

Prompt variance: Slight wording changes cause big output changes. Reorder a sentence, change a comma, and the output shifts significantly.

Context variance: What else is in the context window matters enormously. The same prompt with different examples produces different results.

Model variance: Same prompt behaves differently across model versions. Claude 3.0 vs. Claude 3.5 may answer the same question differently.

Understanding these three sources explains most AI quality issues.
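One way to see how much a source contributes is to log repeated eval scores per condition and compare the spread. A sketch with hypothetical scores — in practice the numbers come from your ledger, and isolating one variance source means holding the other two fixed:

```python
from statistics import mean, stdev

# Hypothetical eval scores (0-100); real numbers come from your IQ Ledger.
# Same model and context in both conditions, so the gap isolates prompt variance.
runs = {
    "prompt_v1":           [82, 79, 84, 71, 80],  # same prompt, re-run
    "prompt_v1_reworded":  [65, 70, 62, 68, 66],  # tiny wording change
}

for condition, scores in runs.items():
    print(f"{condition}: mean={mean(scores):.1f} stdev={stdev(scores):.1f}")
```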

03
Why "It Usually Works" Is Not Enough

A system that works 80% of the time is a system that fails 1 in 5 times. In production with 1,000 calls/day, that's 200 failures.

The Blackwell Mandate: we do not hope — we measure.

If you cannot measure it, you cannot improve it. If you cannot improve it, you should not ship it.

The Rao-Blackwell Theorem says: take an estimator and condition it on a sufficient statistic, and the result is at least as good — never worse. For AI: if you have a good prompt, conditioning it on the right context — the right examples, the right constraints — can only improve it. Never stop at "good enough."

Integrate Phase

The IQ Ledger

04
Building Your IQ Ledger

A spreadsheet (or database) with columns:

  • date — when was this run?
  • prompt_version — v1, v2, v3?
  • model_version — Claude 3.0, 3.5, etc.?
  • input_type — what kind of data was this?
  • eval_score — how well did it perform (0-100)?
  • token_cost — how many tokens did it use?
  • latency_ms — how long did it take?
  • notes — what did you observe?

Every eval run adds a row. Your ledger is your instrument panel. Stare at it. Look for patterns. This is how you find improvement.
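A plain CSV is enough to start. A minimal sketch of appending one run — the filename and the row values are placeholders, but the column order mirrors the list above:

```python
import csv
import os
from datetime import date

LEDGER = "iq_ledger.csv"  # placeholder filename
COLUMNS = ["date", "prompt_version", "model_version", "input_type",
           "eval_score", "token_cost", "latency_ms", "notes"]

def log_run(row: dict, path: str = LEDGER) -> None:
    """Append one eval run; write the header if the file is new."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_run({
    "date": date.today().isoformat(),
    "prompt_version": "v1",
    "model_version": "claude-3-5",
    "input_type": "support_email",   # hypothetical input type
    "eval_score": 84,
    "token_cost": 1930,
    "latency_ms": 2100,
    "notes": "baseline run",
})
```

Append-only writes mean old rows are never rewritten, so six-month-old results stay comparable.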

05
The Improvement Cycle

Baseline: Run your test battery once. Log all results. This is your starting point.

Hypothesis: What change might improve score? More context? Different phrasing? A new example?

Experiment: Run changed version against same 10 test cases.

Compare: Did score improve? Did cost change? Did latency shift?

Commit or Revert: If improved, keep it. If worse, revert and try something else.

This cycle is your engine for improvement. Run it weekly.
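The Compare and Commit-or-Revert steps can be reduced to a small decision rule over two logged runs. A sketch, assuming each run is summarized as a mean score and a token cost; the 20% cost tolerance is an assumption to tune, not part of the framework:

```python
def compare(baseline: dict, experiment: dict) -> str:
    """Commit-or-revert decision for one improvement cycle.

    Each run dict holds 'score' (mean eval score) and 'cost' (tokens).
    Policy (an assumption): commit only if the score improves and
    cost does not rise by more than 20%.
    """
    better_score = experiment["score"] > baseline["score"]
    acceptable_cost = experiment["cost"] <= baseline["cost"] * 1.2
    return "commit" if better_score and acceptable_cost else "revert"

print(compare({"score": 72, "cost": 1800}, {"score": 81, "cost": 1900}))  # commit
print(compare({"score": 72, "cost": 1800}, {"score": 70, "cost": 1200}))  # revert
```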

06
Prompt Versioning

Treat prompts like code. Every change gets a version number. You need to be able to roll back if an "improvement" actually makes things worse.

Store prompts in a prompts/ folder with v1.txt, v2.txt, v3.txt naming. In your IQ Ledger, reference the version. When you look back at a test run six months from now, you need to know exactly what prompt generated that result.

The most common mistake at Architect level: changing the prompt AND the eval in the same cycle. Never change both at once. If scores change, you will not know which change caused it. The Blackwell way: one variable at a time.
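A minimal sketch of the versioned layout described above — the prompt texts are placeholders, and a ledger row would reference the version string it was run with:

```python
from pathlib import Path

PROMPT_DIR = Path("prompts")

def save_prompt(version: str, text: str) -> None:
    """Write prompts/v1.txt, prompts/v2.txt, ... Never overwrite in place."""
    PROMPT_DIR.mkdir(exist_ok=True)
    (PROMPT_DIR / f"{version}.txt").write_text(text)

def load_prompt(version: str) -> str:
    """Load exactly the prompt a ledger row references, e.g. 'v2'."""
    return (PROMPT_DIR / f"{version}.txt").read_text()

# Hypothetical prompt contents:
save_prompt("v1", "Summarize the report in three bullet points.")
save_prompt("v2", "Summarize the report in three bullet points. "
                  "Cite one number per bullet.")
```

Because v1 stays on disk untouched, reverting a bad change is just pointing the system back at the old version.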

Manage Phase

Systematic Improvement at Scale

07
The 10-Case Test Battery

For every system you build, create a test battery of 10 cases that cover:

  • 3 easy cases (low complexity, straightforward input)
  • 4 medium cases (realistic, moderately complex)
  • 2 hard cases (edge cases, ambiguous input)
  • 1 adversarial case (intentionally malformed input)

Run this battery before every prompt change. Never ship without it. The 10-case battery is your minimum gate.
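The 3/4/2/1 split can be encoded as data and checked mechanically before any run. A sketch with placeholder cases — real entries would carry actual inputs and expected outputs:

```python
from collections import Counter

# Placeholder battery; real cases hold actual inputs and expected outputs.
BATTERY = (
    [{"tier": "easy", "input": f"easy-case-{i}"} for i in range(3)]
    + [{"tier": "medium", "input": f"medium-case-{i}"} for i in range(4)]
    + [{"tier": "hard", "input": f"hard-case-{i}"} for i in range(2)]
    + [{"tier": "adversarial", "input": "\x00malformed\x00"}]
)

def gate(battery: list) -> None:
    """Refuse to proceed unless the battery matches the 3/4/2/1 shape."""
    counts = Counter(case["tier"] for case in battery)
    expected = {"easy": 3, "medium": 4, "hard": 2, "adversarial": 1}
    if counts != expected:
        raise ValueError(f"battery shape {dict(counts)} != {expected}")

gate(BATTERY)  # passes: 10 cases in the required mix
```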

08
Cost Optimization Without Quality Loss

Track cost-per-quality-point. If model A costs twice as much but scores only 10% higher, the IQ ratio may still favor model B.

Formula: IQ = quality_score ÷ token_cost

Calculate this for every model and prompt combo. Use the cheapest model that meets your quality floor. Sometimes that's Claude 3 Opus. Sometimes it's Claude 3.5 Sonnet with a better prompt. Always run the math.
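"Cheapest model that meets your quality floor" can be run as a two-step selection: drop anything below the floor, then take the best quality-per-token ratio. A sketch with hypothetical scores and costs:

```python
def pick_model(candidates: dict, quality_floor: float) -> str:
    """Cheapest-adequate choice: filter by the quality floor,
    then maximize IQ = score / cost among survivors."""
    adequate = {name: m for name, m in candidates.items()
                if m["score"] >= quality_floor}
    if not adequate:
        raise ValueError("no model meets the quality floor")
    return max(adequate, key=lambda n: adequate[n]["score"] / adequate[n]["cost"])

# Hypothetical numbers: A costs 2x B for only 10% more quality.
models = {
    "model_A": {"score": 88, "cost": 4000},
    "model_B": {"score": 80, "cost": 2000},
}
print(pick_model(models, quality_floor=75))  # model_B wins on IQ
```

Raise the floor past 80 and model B is filtered out, so the same function picks A — the floor, not the ratio, decides first.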

09
The Sovereign Quality Standard

Your personal quality standard is: Would I stake my name on this output?

Not every output needs to be perfect — but every output that leaves your system should clear your bar, not the model's defaults.

If you run a report and the model gets 3 out of 10 facts slightly wrong, would you send it to your boss with your name on it? If no, it's not done. If yes, it's done.

PTR — Prove The Result

Build an IQ Ledger for your A2 eval framework. Run your 3 test cases from A2 and log them in a CSV: prompt_version, eval_score, token_cost, pass/fail. Then change one element of your eval prompt and run again. Compare. Did the score improve? Did the cost change? Write a 2-sentence conclusion.

  1. Create a spreadsheet with columns: date, prompt_version, model_version, eval_score, token_cost, notes.
  2. Run your 3 test cases with prompt v1. Log the results.
  3. Modify one part of the prompt (wording, example, constraint). Create prompt v2.
  4. Run the same 3 test cases with v2. Log the results.
  5. Compare: Is v2 better? Worse? Same cost? Cheaper? Conclusions?
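Step 5's comparison can be scripted once the per-case results are logged. A sketch, assuming per-case scores and token counts are already collected for v1 and v2 — the numbers below are placeholders for your real runs:

```python
from statistics import mean

# Placeholder results for the 3 test cases; replace with your logged runs.
runs = {
    "v1": [{"score": 70, "tokens": 1500}, {"score": 75, "tokens": 1400},
           {"score": 68, "tokens": 1600}],
    "v2": [{"score": 78, "tokens": 1550}, {"score": 80, "tokens": 1450},
           {"score": 74, "tokens": 1500}],
}

summary = {v: {"mean_score": mean(r["score"] for r in rows),
               "mean_tokens": mean(r["tokens"] for r in rows)}
           for v, rows in runs.items()}

delta_score = summary["v2"]["mean_score"] - summary["v1"]["mean_score"]
delta_cost = summary["v2"]["mean_tokens"] - summary["v1"]["mean_tokens"]
print(f"score change: {delta_score:+.1f}, token change: {delta_cost:+.1f}")
```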

Common Mistakes

⚠ "My IQ Ledger has too many rows with no clear pattern"
You're running random experiments without hypotheses. Before each run, write down: "I expect changing X will improve Y by Z%." Then measure. This gives you a target and makes pattern-spotting easier.
⚠ "I can't compare scores because my eval changed"
You changed the prompt and the test cases in the same cycle. This violates the one-variable rule. Revert the test case change. Re-run with only the prompt change. Then you have a valid comparison.
⚠ "The cheapest model fails too often"
Cost optimization only works if quality stays above your threshold. If cheap model A scores 40/100 and expensive model B scores 90/100, B is the right choice — full stop. IQ is a ratio, not a race to zero cost.
Module Checkpoint

Before You Move On

✓ Verify These Four Things
  1. Can define IQ as Output Quality ÷ Input Cost and explain what each means.
  2. Know the three sources of AI variance (prompt, context, model).
  3. Have built an IQ Ledger with at least 3 entries from different runs.
  4. Can run a one-variable improvement cycle and interpret the results.
AIM Commitment

What You Proved Today

  1. Analyze: Applied the Blackwell Standard — identified three variance sources and why 80% is unacceptable in production.
  2. Integrate: Built an IQ Ledger and ran one controlled improvement cycle with hypothesis and comparison.
  3. Manage: Established a 10-case test battery and personal quality standard for all future systems.