PaperSt.AI
The Machine Room.
How AI actually works — and how we build with it. From "what is a token" to "how do I design a self-improving machine." Plain English first, then the proper word for it. Written for a smart non-engineer; no maths or code assumed.
Part 00
Start here — how to use this
This is not a coding course. You will not write code by hand. It's a course about how the machine actually works: the AI you already use, the machine learning that powers a prediction, the "loops" that let an AI do work while you sleep, and the blueprint of the system we're building. The goal: when you sit down to design the next piece, you reason from how it truly works, not from analogy and vibes.
The colour key — read this once
The whole guide is colour-coded so you can scan it fast. Five box types, and you'll see them everywhere:
◆ Why this matters
The reasoning behind a thing. The most important box — rules you don't understand get misapplied.
✓ Do this
A good habit or the correct move.
✕ Never do this
A hard line. Breaking it costs money, a client, or trust.
▲ Watch out
A common trap or a place people get confused.
▸ Try it
A small exercise. Doing beats reading.
You'll also see term chips for vocabulary (all collected in the Glossary), monospace for things you'd type, and the inverted highlight for the single sharpest idea on a page.
The map of the journey
| Part | What you'll understand | The one-liner |
|---|---|---|
| 1 · How AI works | LLMs, tokens, training vs using, why it lies, context | the ground floor |
| 2 · Building with AI | prompts → loops → agents, the L0→L5 ladder | how you direct it |
| 3 · Machine learning | predict → decide → learn; the maths, made simple | how the prediction works |
| 4 · The Paper St system | the thesis, the three loops, the moat, the blueprint | what we're building |
| 5 · Put it together | how the pieces compose, and how to keep designing | the whole machine |
◆ Why this order
You can't design loops until you know why an AI needs a leash (Part 1's "it predicts plausible text, not true text"). You can't design the prediction engine until you can direct AI to build it (Part 2). And you can't design the company until you've seen both halves it's made of (Parts 2 and 3). Each part is the floor the next one stands on.
Part 1
How AI actually works
The ground floor. Five ideas, in order. Get these and the rest of the course has somewhere to stand.
1.1 The nesting dolls: AI ⊃ ML ⊃ Deep Learning ⊃ LLM
These five words get used like synonyms. They're actually nested boxes, like Russian dolls.
+- ARTIFICIAL INTELLIGENCE -- any machine doing something "smart" ---+
| +- MACHINE LEARNING -- learns patterns from examples --------+     |
| | +- DEEP LEARNING -- ML with many-layered networks -------+ |     |
| | | +- LLM -- trained on huge text, predicts language ---+ | |     |
| | | |   Claude . ChatGPT live here                       | | |     |
| | | +----------------------------------------------------+ | |     |
| | +--------------------------------------------------------+ |     |
| +------------------------------------------------------------+     |
+--------------------------------------------------------------------+
GENERATIVE AI = a sticker across these, for the ones that CREATE
Generative AI is not a smaller doll — it's a label that cuts across the stack, meaning any AI whose job is to produce new content (text, images, code). An LLM writing you a paragraph is generative; a model labelling an email "spam / not spam" is AI/ML but not generative.
▸ Try it — say it in one breath
"AI contains machine learning, which contains deep learning, which contains LLMs; generative AI is a label for the ones that create." If you can say that, §1.1 is done.
1.2 What a "model" is — training vs inference
A model is just a big maths function that's been tuned so that, given an input, it produces a useful output. Not a database of answers, not a person — a giant pile of numbers (the weights, a.k.a. parameters) wired so that running an input through them yields a prediction. "A 70-billion-parameter model" means 70 billion of those numbers.
There are two completely different moments in a model's life, and beginners conflate them constantly:
| Phase | What it is | Cost & when |
|---|---|---|
| Training | reads enormous data, slowly nudges its numbers until predictions get good | millions of $, weeks — once per version, before you touch it |
| Inference | actually using the finished model: send a prompt, get an answer | cheap, fast, every time — the numbers do not change |
◆ Why this matters more than it looks
Because the weights are frozen during inference, an LLM has no memory of yesterday unless you re-feed it. That one fact explains "memory systems," "context," and half the design decisions later in this course. The model doesn't remember you; the system around it does.
The structure those weights live in is a neural network — layers of simple maths units, each multiplying its inputs by its weights and passing the result on. Stack many ("deep") and it represents complex patterns. Analogy: training is baking a recipe into a chef's muscle memory over years; inference is the chef cooking one dish to order. They don't re-learn cooking each time you order.
1.3 How an LLM works: next-token prediction
This is the single most clarifying fact in the course. Underneath all the polish, an LLM does one thing: it predicts the next chunk of text given all the text so far. Then it appends that chunk and predicts the next, one piece at a time, until it stops.
"The capital of France is"        --> [ model ] --> "Paris"
"The capital of France is Paris"  --> [ model ] --> "."
predict the next token --> append it --> do it again, until done
The chunk is a token — a word-fragment. Rule of thumb: ~4 characters ≈ 1 token, ~100 tokens ≈ 75 words. Common words are one token; rarer ones split ("tokenization" → "token" + "ization").
Why it's fluent: trained on a colossal amount of human writing, its "what comes next" instinct mirrors how people write. It's an extraordinarily good autocomplete.
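Optional, for readers who like to see it run: a toy sketch of the predict, append, repeat cycle. The lookup table `NEXT_TOKEN` is a made-up stand-in for the model (a real LLM scores every possible token using its weights and the whole text so far), so only the shape of the loop carries over.

```python
# Hypothetical stand-in for a model: maps the last token to the next one.
# A real LLM computes this from its weights over all the text so far.
NEXT_TOKEN = {
    "The": " capital",
    " capital": " of",
    " of": " France",
    " France": " is",
    " is": " Paris",
    " Paris": ".",
    ".": None,  # None means "stop"
}


def generate(first_token: str, max_tokens: int = 20) -> str:
    """Predict the next token, append it, and repeat until done."""
    tokens = [first_token]
    for _ in range(max_tokens):           # hard cap so the loop always ends
        next_token = NEXT_TOKEN.get(tokens[-1])
        if next_token is None:            # the "model" signals it is finished
            break
        tokens.append(next_token)
    return "".join(tokens)


print(generate("The"))
```

Notice that nothing in the loop checks whether the sentence is true. It only asks "what usually comes next?", which is the root of the hallucination problem.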
✕ Why it HALLUCINATES — the most important calibration here
The model optimises for plausible-sounding next tokens, not for truth. It has no fact-checker and no real sense of "I don't know this." Ask for a citation it never saw and it produces one that looks perfect and is fabricated — because that string is statistically plausible. Fluency and hallucination come from the same mechanism. You can't keep one and delete the other by asking nicely.
◆ Why the whole back half of this course exists
Because the model is fluent-but-unreliable, you never trust its confidence — you check its output with something outside it. Loops, verifiers, and grounding (Parts 2 and 3) are all answers to this one problem.
Temperature is the randomness dial. Low (near 0) = grab the single most-likely token → consistent, good for facts/code. High = sometimes pick less-likely tokens → varied, creative, more drift.
Analogy: an LLM is your phone's autocomplete, if autocomplete had read the entire internet. A great-sounding suggestion can still be flat wrong.
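The temperature dial can be sketched with a standard softmax, which is one common way to turn raw token scores into probabilities. The three scores below are invented for illustration; real models score tens of thousands of tokens at once.

```python
import math


def softmax_with_temperature(scores: list[float], temperature: float) -> list[float]:
    """Turn raw token scores into probabilities; temperature must be > 0."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three candidate next tokens.
scores = [2.0, 1.0, 0.5]

cold = softmax_with_temperature(scores, temperature=0.1)
hot = softmax_with_temperature(scores, temperature=2.0)
print([round(p, 3) for p in cold])   # nearly all probability on the top token
print([round(p, 3) for p in hot])    # probability spread across all three
```

At low temperature the top token wins almost every time; at high temperature the runners-up get a real chance of being picked, which is where variety and drift come from.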
1.4 The context window — its working memory
The model stores nothing about you between sessions. Everything it can "see" when it answers must fit in one bucket: the context window — the total text (in tokens) it can read at once: your instructions, the chat history, pasted documents, and the answer it's writing, all sharing that one bucket. Bucket full → something drops. That's why a long chat eventually "forgets" the start.
It's finite because cost grows steeply with length: the model compares every token with every other token, so doubling the context can roughly quadruple that work. As of mid-2026 the big Claude models (Opus 4.8, Sonnet 4.6, Fable 5) hold ~1,000,000 tokens (~750k words, ~10 novels); the small fast one (Haiku 4.5) holds 200,000. (These numbers move, so re-check per model.)
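A minimal sketch of "bucket full, something drops", assuming the rough 4-characters-per-token rule from §1.3 and a simple drop-the-oldest policy. Both are simplifications: real systems count tokens with the model's own tokenizer, and the function names here are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough rule of thumb from this guide: ~4 characters per token."""
    return max(1, len(text) // 4)


def fit_to_window(instructions: str, history: list[str], window_tokens: int) -> list[str]:
    """Drop the oldest chat turns until instructions + history fit the window."""
    kept = list(history)
    used = estimate_tokens(instructions) + sum(estimate_tokens(t) for t in kept)
    while kept and used > window_tokens:
        dropped = kept.pop(0)                # the start of the chat goes first
        used -= estimate_tokens(dropped)
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]   # three turns of ~100 tokens each
kept = fit_to_window("x" * 40, history, window_tokens=250)
print(len(kept))   # the oldest turn no longer fits
```

This is why a long chat "forgets" its beginning: nothing was erased from a memory, the early turns simply stopped being sent.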
▲ Watch out — bigger is NOT automatically better
Research ("Lost in the Middle," 2023) showed models use the start and end of a long context well but sag in the middle — a U-shaped curve. Newer work ("context rot," 2025) found accuracy degrades as length grows even on trivial tasks. So cramming 800K tokens "just in case" can make the model worse: the number you needed gets lost in the middle and the noise distracts it. This region is the "dumb zone."
◆ Why this becomes a design rule
Keep each job's context small and relevant, put load-bearing instructions at the top and bottom, and reset context with file handoffs instead of one sprawling mega-session. Many short fresh-context jobs beat one long smart one — the cheapest reliability upgrade there is. (This hands straight off to Part 2.)
1.5 The leverage shift: prompt → context → systems
Everyone's journey has three stages, and the leverage climbs at each:
- Prompting. Write the perfect instruction, hope for a great answer. Works for one-offs; it's where you start. But it hits a model that hallucinates, forgets when the window fills, and has no memory tomorrow.
- Context engineering. Stop obsessing over magic wording; manage what goes into the window — which docs, in what order, how much history. You're curating the model's working memory. The leverage moves from the sentence you write to the information environment you build.
- System design. Stop relying on the model being right; build a machine that keeps it correct — small jobs, clean context, an external verifier, file handoffs, loops that catch errors. Reliability comes from the harness, not the model's brilliance.
Analogy: prompting is giving a brilliant-but-forgetful intern clearer instructions. Context engineering is controlling which files are on their desk. System design is building the whole office around them — a checklist, a second reviewer, an inbox that feeds the right doc at the right moment — so the team is reliable even though the intern alone never is.
1.6 Honest calibration: good at vs bad at
| Genuinely good at (lean in) | Genuinely risky (build guardrails) |
|---|---|
| Fluent language: draft, rewrite, summarise, translate, tone | Facts from memory — confidently wrong; needs grounding |
| Transformation over a source you give it (truth is in the input) | Exact maths / counting — it predicts tokens, doesn't calculate |
| Code generation & explanation, with a human reviewing | Knowing what it doesn't know — confidence ≠ correctness |
| Breadth — a strong first-draft brainstorm partner | Self-correction with no external check — can get worse |
| Pattern tasks: classify, sort, extract — fast & cheap at scale | Current events past its cutoff; long cluttered context |
Part 2
Building with AI — loops & agents
The orchestration skill — how you direct AI to do work instead of doing every step by hand. The most immediately useful part of the course.
2.1 A prompt vs a loop
Most people use AI the slow way: type a request, wait, judge it, fix it, ask again — all by hand. You are the engine; the AI is a tool in your hand, and a tool does nothing on its own.
A loop is the faster way: you define the goal once, and the system finds the work, does it, checks its own result against a test it cannot argue with, writes down what happened, and repeats until the goal is met or a hard limit stops it. The skill shifts from writing the perfect prompt (authorship) to designing the cycle that keeps the AI correct (orchestration).
2.2 The ladder: L0 → L5
An honest progression from lowest leverage to highest. Each level is the right tool somewhere — climbing is not always correct.
| Level | What it is | Real outside check? | Where it's right |
|---|---|---|---|
| L0 | you prompt by hand, every step | none — you judge | one-offs, taste calls, exploring |
| L1 | the prompt grades itself vs written criteria | weak (self-scored) | one doc held to a bar |
| L2 | a single agent loops until a real test passes | yes — objective | one verifiable target |
| L3 | a maker + a separate checker (the heart) | yes — independent | "done" must mean something |
| L4 | a planner fans work to executors + verifiers | yes — a gate per item | a batch; a multi-phase build |
| L5 | a heartbeat fires it; it finds its own work; improves itself | yes + audit + budget guard | recurring, machine-checkable work |
◆ Why it's a thinking tool, not a scoreboard
The rule of motion: climb a level for leverage, drop a level for reliability. The instant a loop gets flaky, step down until it's solid, then climb again. Most real value lives at L3 and L4. L5 is only for work that genuinely repeats and a machine can check.
✓ The build order — do NOT skip ahead
Scheduling something you haven't made reliable by hand is how loops blow up while you sleep:
- Get one manual run reliable (L0/L1 — prove it end to end, by hand).
- Turn it into a skill — a saved, reusable instruction file.
- Wrap it in a loop: add the gate it can't argue past + a hard cap (L2/L3).
- THEN put it on a schedule (L5). Prove it once, harden it, then automate it.
2.3 What a loop actually IS — four parts
A loop is not "an agent that runs a few times." It's four specific things, and three are where people go wrong:
| Part | What it is |
|---|---|
| Goal | a checkable condition, not a vibe. "every test in /auth passes, lint clean," not "improve it" |
| Verifier (the gate) | THE HEART. the check it can't talk past. Without it, a loop is an agent agreeing with itself |
| State | a small record (done / failed / next) so the next pass resumes instead of repeating the mistake |
| Stop | success, OR a hard limit ("after 8 tries, stop and report"). No exit = it runs till it drains the account |
DISCOVER -- find what needs doing
|
PLAN -- break the goal into checkable tasks
|
EXECUTE -- the agent calls tools ( matters least )
|
VERIFY -- a SEPARATE gate, not self-judgment ( the heart )
|
ITERATE -- not done? carry state, loop back up to PLAN
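The four parts can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular tool implements it: `run_loop`, `attempt`, and `verify` are hypothetical names, and the "task" is a toy so the example can run on its own.

```python
def run_loop(attempt, verify, max_tries: int = 8) -> dict:
    """Goal = verify() passes. Stop = success or the hard cap."""
    state = {"tries": 0, "failures": [], "done": False}   # the small record
    while state["tries"] < max_tries:                     # hard limit
        state["tries"] += 1
        result = attempt(state)                           # EXECUTE
        ok, reason = verify(result)                       # VERIFY: a separate gate
        if ok:
            state["done"] = True
            state["result"] = result
            break
        state["failures"].append(reason)                  # carry state forward
    return state


# Hypothetical task: find a number whose square is 49.
def attempt(state):
    return 5 + len(state["failures"])   # past failures steer the next try


def verify(candidate):
    return (candidate * candidate == 49, f"{candidate} squared is not 49")


report = run_loop(attempt, verify)
print(report["done"], report["tries"], report["result"])
```

The maker (`attempt`) never decides it is finished; only the gate (`verify`) can, and the cap guarantees an exit even if the gate never passes.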
◆ Why "state" is secretly the multiplier
Microsoft's Magentic-One system degrades 31% if you remove its written ledgers. The orchestrator doesn't think better — it writes down what it knows, tried, and plans. Boring bookkeeping is the lever, not a smarter model. (At Paper St: the LEDGER, trackers, verdicts-to-disk, the lesson you write into the project memory.)
2.4 When NOT to loop — the gate & four failure modes
A loop pays off only when all of these hold — miss one and a single good prompt wins:
- It repeats (roughly weekly+) — so the setup cost amortises.
- A machine can auto-reject bad output — a test, build, linter, or a rubric a second model scores.
- The agent can do it end to end — not hand half back to you each pass.
- "Done" is objective, not a taste call.
- (On our machine) RAM can absorb it: too many parallel agents run the laptop out of memory and get it OOM-killed.
The metric nobody tracks is cost per accepted change — not tokens spent or loops run. If a loop hands you ten results and you toss six, you're doing the review it was meant to save. And the four failure modes get worse as the loop gets smoother:
✕ The four silent failures
- Ralph Wiggum loop: the agent decides it's done too early, exits half-finished, and the loop keeps spending while producing nothing. Loops don't crash this way — they bill you in silence.
- Grading your own homework: no separate check = an agent agreeing with itself.
- Comprehension debt: the gap between what the repo contains and what you understand; it grows the faster a loop ships code you didn't read.
- Cognitive surrender: accepting whatever comes back. Build the loop like someone who intends to stay the engineer.
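The "cost per accepted change" metric named earlier in this section is simple arithmetic. A sketch with invented numbers (the function name and the dollar figures are hypothetical):

```python
def cost_per_accepted_change(total_cost: float, accepted: int) -> float:
    """Total spend divided by the changes you actually kept."""
    if accepted == 0:
        return float("inf")       # spent money, kept nothing
    return total_cost / accepted


# Hypothetical run: 10 results for $5.00 total, 6 thrown away.
produced, rejected, spend = 10, 6, 5.00
print(cost_per_accepted_change(spend, produced - rejected))   # 1.25
```

The loop looks like it costs $0.50 per result; it really costs $1.25 per result you kept, before counting the time you spent reviewing the six you tossed.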
2.5 Where the vocabulary comes from
The hype is new; the mechanism is old.
- ReAct (the seed): short for Reason + Act, i.e. think → act → observe → think again. That single cycle is what most people mean by "an agent."
- Augmented LLM: the substrate everything's built on, a model extended with retrieval (writes its own queries), tools (picks & calls them), and memory.
- Workflow vs Agent: a workflow runs model calls through control flow you wrote in code (predictable, cheaper). An agent lets the model direct its own process (flexible, less predictable). The whole distinction is who holds control flow: your code, or the model. Pick deliberately.
- The five workflow patterns: prompt chaining · routing · parallelisation · orchestrator-workers · evaluator-optimizer (generate → a second model evaluates → loop; the direct ancestor of maker/checker).
- "Ralph": the plainest loop. A coding agent in a bare while loop, same prompt against a spec, fresh instance each time, filesystem as memory. Proof the leverage lives in the loop, not a clever prompt.
◆ Why start simple
Anthropic's central discipline: begin with the simplest viable approach; add agentic complexity only when simpler solutions fall short. Every extra loop costs latency, tokens, and a new way to fail.
2.6 The Claude Code primitives — tool ↔ loop block
| Loop block | The tool | What it does |
|---|---|---|
| Heartbeat (in-session) | /loop | re-runs a prompt on an interval, or self-paces |
| Stop-condition runner | /goal | runs until a verifiable condition is met; a separate fast model checks "done" |
| Heartbeat (out-of-session) | cron / GitHub Actions | fires on a schedule or repo event, no session open |
| Act on the world | MCP connectors | read issues/CRM/DB, open PRs, post — not just suggest |
| Orchestration (L4) | Workflow, subagents | deterministic multi-agent scripts; delegated helpers |
| Make a gate blocking | hooks | a PreToolUse hook can refuse a tool call outright |
▲ The WSL override
The Workflow tool allows up to 16 concurrent agents; on this machine we cap it DOWN to ≤4 (default 3) — too many parallel agents crash the laptop. Chunk every batch to ≤4; never trust the tool's internal cap.
2.7 Maker / checker — the single most important pattern
The agent that wrote the work is a poor judge of it — not a model limitation, a structural one: the maker is too generous grading its own homework. So split the roles.
+---------+   builds    +----------+   refutes   +-----------+
|  MAKER  | ----------> | artifact | <---------- |  CHECKER  |
|  fast,  |             +----------+             |  slow,    |
|  cheap  |   a DIFFERENT model, DIFFERENT       |  strict   |
+---------+   instructions, fresh frame          +-----------+
the separation IS most of the quality.
>> fix the artifact -- never weaken the gate.
Three levers make a checker independent: different instructions ("find what's wrong, fail on uncertainty"), a different/stronger model (substitute up), and less context / a fresh frame. For high stakes, use adversarial verification — N skeptics each told to refute, each with a different lens (correctness, security, does-it-reproduce); kill the finding unless a majority fail to refute it.
2.8 The named agent patterns — a menu
The single-agent reasoning patterns, one line each — and the rule is match the pattern to whether a real verifier exists:
- ReAct — reason→act→observe→repeat; the baseline (1× cost).
- Reflexion — turn a pass/fail into a failure narrative, feed it to the next try. For retryable tasks with a verifier.
- Self-Refine — generate → self-critique "top 3 problems" → regenerate. ~+20% on prose with no objective answer.
- Self-Consistency — sample N answers, take the majority. For extraction with a unique right answer.
- Plan-and-Execute — plan the whole flow upfront, then execute. ~−30% tokens; for well-defined pipelines.
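Of the patterns above, Self-Consistency is the easiest to show in code: ask several times, keep the majority. A minimal sketch; the sample answers are invented, and in practice they would come from N separate model calls.

```python
from collections import Counter


def self_consistency(answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of samples that agreed."""
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical outputs from asking the same extraction question five times.
samples = ["2024-03-01", "2024-03-01", "2024-01-03", "2024-03-01 ", "2024-03-01"]
answer, agreement = self_consistency(samples)
print(answer, agreement)
```

The agreement share is a useful by-product: a low value is a signal to escalate to a human or a stronger check rather than trust the winner.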
◆ The rule almost everyone converges on
Decouple inference from execution: the LLM proposes a structured action {tool, args, reason}; a separate validated layer checks it against an allowlist and runs it. Never pass credentials, DB handles, or write-permissions into the model's context. (This returns as a safety law in §2.11 and as the Governor in Part 4.)
2.9 Evals + LLM-as-judge — "evals are the moat"
"Done" must be measured, cheapest-and-hardest-to-fool first:
1. Deterministic checks: grade the side effects / execution trace, not the narrative. "Grade the trace, not the vibe." "The site looks good" is not an eval; "eye-gate ran AND returned green before deploy" is.
2. Reference-guided grading: compare to a known-good answer.
3. LLM-as-judge: an LLM scoring a rubric. Only when 1 & 2 can't capture the signal (tone, helpfulness). Biased in known ways.
| Judge bias | What it does | Fix |
|---|---|---|
| Position | prefers the answer in a given slot | run both orders, accept only consistent verdicts |
| Verbosity | scores longer answers higher | penalise length; score conciseness separately |
| Self-preference | rates its own family higher | judge with a different model family |
✕ The calibration gate (a hard rule)
Before you trust any LLM judge: hand-grade ~30 examples, run the judge on the same 30, measure agreement, target >75%. Below that, fix the rubric or swap the model before deploying. A judge you haven't calibrated is not a gate.
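The calibration gate is a plain agreement count. A sketch with invented verdicts; the function names are hypothetical and the 0.75 threshold is the target stated above.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge's verdict matches the human's."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


def judge_is_trusted(human: list[str], judge: list[str], threshold: float = 0.75) -> bool:
    """The gate: trust the judge only above the agreement threshold."""
    return agreement_rate(human, judge) > threshold


# Hypothetical verdicts on the same 30 hand-graded examples.
human = ["pass"] * 20 + ["fail"] * 10
judge = ["pass"] * 18 + ["fail"] * 12     # disagrees with the human on 2 examples
print(round(agreement_rate(human, judge), 3), judge_is_trusted(human, judge))
```

Raw agreement is the simplest possible measure. When nearly every example is a "pass", a judge that always says "pass" would also score high, so it is worth checking agreement on the failures separately.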
2.10 Governed state — the agent proposes, a gate executes
The most transferable thing Palantir does: never let an agent reason over raw data and change it directly. Force every change through one governed action that validates, checks permissions, logs, and can stage before commit.
the agent PROPOSES a change -- then it must pass, in order:
1 AUTHZ permitted? (deny by default)
2 VALIDATE does it fit the schema + the rules?
3 CLASSIFY auto-run / stage for a human / forbidden
4 EXECUTE run it now, OR stage for approval, OR reject
5 LOG who . what . why . when (immutable)
the prompt NEVER holds credentials or DB handles.
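The five gates above can be sketched as one function that every proposed change must pass through. Everything named here (the agents, the actions, the `lead_id` rule) is hypothetical, and "executed" only records the decision; a real system would call the actual handler at that point and write to storage the agent cannot edit.

```python
import datetime

PERMISSIONS = {"crm-agent": {"update_lead_status", "delete_lead"}}   # hypothetical
AUTO_RUN = {"update_lead_status"}                                    # hypothetical
NEEDS_HUMAN = {"delete_lead"}                                        # hypothetical
AUDIT_LOG: list[dict] = []   # append-only here; a real log would be immutable storage


def submit(agent: str, action: str, args: dict, reason: str) -> str:
    """Run a proposed change through the gates and return the outcome."""
    if action not in PERMISSIONS.get(agent, set()):        # 1 AUTHZ (deny by default)
        outcome = "rejected: not permitted"
    elif not isinstance(args.get("lead_id"), int):         # 2 VALIDATE
        outcome = "rejected: invalid args"
    elif action in AUTO_RUN:                               # 3 CLASSIFY + 4 EXECUTE
        outcome = "executed"
    elif action in NEEDS_HUMAN:
        outcome = "staged for approval"
    else:
        outcome = "rejected: forbidden"
    AUDIT_LOG.append({                                     # 5 LOG (every path)
        "who": agent, "what": action, "why": reason, "outcome": outcome,
        "when": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })
    return outcome


print(submit("crm-agent", "update_lead_status", {"lead_id": 7}, "replied to email"))
print(submit("crm-agent", "delete_lead", {"lead_id": 7}, "duplicate"))
print(submit("web-agent", "delete_lead", {"lead_id": 7}, "cleanup"))
```

The agent only ever hands over a description of what it wants. The code that holds the permissions and the database connection lives on the other side of `submit`.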
▲ The bottleneck is retrieval quality, not window size
Under ~25% of naively-injected "memory" is relevant to a query, so retrieve, don't preload. And at our scale, do not build a knowledge graph or vector store — grep + file structure is the index. The fanciest tool is usually the wrong one.
2.11 Agent security — the lethal trifecta
PRIVATE DATA UNTRUSTED CONTENT
(the CRM) (an inbound email, a scraped page)
\ /
\ /
v DANGER v
+----------------------+
| all three in ONE | <-- a prompt injection in the
| context | untrusted content reads the
+----------------------+ private data and ships it out
^
|
ABILITY TO SEND OUTWARD
You can't reliably detect every prompt injection, so you make the dangerous combination structurally impossible: split the capabilities across agents so no single context has the full trifecta. Around that, defence in depth — input rail (screen content), execution rail (gate every tool call), output rail (scan for secrets) — and a red-team gate: if any attack succeeds, the design fails; fix the topology, don't add a filter.
2.12 Reliability & governance — surviving production
Load-bearing principles: treat all LLM output as schema-validated data before execution; own your context window (a budget, not a backpack — the "dumb zone" is the least-attended middle, so retrieve, don't preload); small focused agents (<~25 steps); durable execution (append-only event history + memoization, so a crash replays without re-paying for completed LLM calls); wrap every external call with idempotency, backoff+jitter, and a circuit breaker.
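One of those principles, backoff plus jitter, in a minimal sketch. It assumes the wrapped call is safe to repeat (idempotent) and treats `ConnectionError` as the retryable failure; the circuit breaker is left out to keep it short, and `flaky` is a made-up service.

```python
import random
import time


def call_with_retries(call, max_attempts: int = 4, base_delay: float = 0.5,
                      sleep=time.sleep, rng=random.random):
    """Retry a flaky call with exponential backoff plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                                # out of attempts: surface the error
            delay = base_delay * (2 ** attempt)      # 0.5s, 1s, 2s, ...
            sleep(delay + rng() * delay)             # jitter spreads retries apart


# Hypothetical flaky service: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []   # record the waits instead of really sleeping
print(call_with_retries(flaky, sleep=waits.append))
```

The jitter matters when many agents fail at once: without it they all retry at the same instant and knock the service over again.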
◆ Why human-in-the-loop must be architectural
For high-consequence actions, the agent is built incapable of executing — it can only recommend; a human holds the execution token. A capability boundary enforced in code, not a prompt reminder. (And approval is only real if it's not a rubber-stamp: the human must see the reasoning and impact, logged with attribution.)
autoresearch (a ~630-line loop) ran ~50 ML experiments overnight on one GPU; on a longer run it found 20 genuine improvements, an ~11% speedup on already-optimised code. A real verifier turned overnight repetition into progress with the human asleep. The leverage lived in the loop, not a clever prompt.
Part 3
Machine learning, plain
Part 2 was how you build with AI. This is how the prediction underneath a machine works — the maths that turns data into a number you can bet on. The heaviest part; take it slowly.
▲ Two different tools — don't conflate them
An LLM (Part 1) predicts text. A machine-learning model here predicts an outcome (will this lead buy?) from columns of data. A real machine often uses both.
3.1 The one sentence
MEASURE --> PREDICT --> DECIDE --> ACT --> LEARN
...then it loops: each LEARN result sharpens the next MEASURE.
3.2 What ML is — and the three kinds
Machine learning is learning a rule from examples instead of being told the rule. You don't write "if the lead opened 3 emails and lives in zip X, call them." You show it thousands of past leads plus what happened, and it finds the pattern that best predicts the outcome — then applies it to a new lead it's never seen.
Under the hood, almost every model here is one optimisation problem: pick the settings (the parameters) that make predictions least wrong, with a small penalty for being too complicated. The three kinds:
| Kind | What it does | Example here |
|---|---|---|
| Supervised | you have the answers (labels) | "here are leads, here's who bought" — most of prediction |
| Unsupervised | no labels; finds structure itself | grouping customers; compressing many columns to a few |
| Reinforcement | acts, sees a reward, improves by trial | the decide/act layer (§3.8) |
3.3 The core maths, made simple
Six load-bearing ideas, each with an analogy.
- Loss: the score for how wrong a prediction is; training = making it as small as possible. Analogy: loss is your golf score. Lower is better; every model just tries to shoot a lower round on the examples it's shown.
- Gradient descent / SGD: with no neat formula, the machine feels its way downhill on the error surface: steepest-down, small step, repeat. SGD peeks at one random example (or a small random batch) per step, which is what makes huge-data training possible. Analogy: blindfolded on a hillside, feeling for the valley one foot at a time.
- Cross-validation: split into ~5 chunks, train on 4, test on the 5th, rotate, average. The result is a stable estimate of new-data performance without wasting data. Trap: anything that learns from the data must happen inside each fold, or the test leaks in.
- Regularisation · ridge · lasso · λ: the overfitting dial. Penalise complexity, push parameters toward zero. Ridge shrinks all; lasso pushes some to exactly zero (auto feature-selection). The strength is λ, tuned by CV. Analogy: lasso drops the weakest team members entirely; ridge trims everyone's expenses but keeps the team.
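Loss and gradient descent, from the list above, fit in a dozen lines for the simplest possible model: one parameter `w` predicting `y` as `w * x`. The data is invented, and this is plain (full-batch) gradient descent; SGD would compute the slope from a random sample instead of every example.

```python
def mean_squared_loss(w: float, xs: list[float], ys: list[float]) -> float:
    """Loss: average squared gap between the prediction w*x and the true y."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w: float, xs: list[float], ys: list[float]) -> float:
    """Slope of the loss with respect to w (which way is uphill)."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data generated by the rule y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w = 0.0                       # start with a bad guess
learning_rate = 0.05          # size of each downhill step
for step in range(100):
    w -= learning_rate * gradient(w, xs, ys)    # step opposite the slope

print(round(w, 4), round(mean_squared_loss(w, xs, ys), 6))
```

Nobody told the machine the rule was "multiply by 2". It started at 0, measured how wrong it was, and walked downhill until the loss was near zero.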
Overfitting + bias-variance — the silent killer
A model is graded on test error (fresh data it didn't train on), never training error. Overfitting is memorising the training examples including their noise — brilliant on training, falls apart on new data. You cannot see overfitting by looking at the training set. This is the mechanism behind "the lead score that looked amazing offline and collapsed live."
TEST ERROR
high |\ /
| \ /
| \__ __/
| \__ __/
| \___ ___/
| \____ ____/
low | \______/ <-- sweet spot
+-----------------------------------> model flexibility
too simple too flexible
(high BIAS / underfit) (high VARIANCE / overfit)
Analogy: a student who memorises last year's exact exam aces the practice and fails the real test. Overfitting is memorising instead of understanding.
3.4 Features — where the work & the bugs live
A feature is not raw data — it's the concept of a signal, computed per entity. "Clicked at 3:02pm" is raw; "product pages viewed in the last 30 days" is a feature. ~60% of real ML work is here, and so are the two deadly bugs — both invisible to every offline metric:
✕ The two silent killers
- Target leakage — a feature that secretly encodes the answer: a value that only exists because the outcome already happened (a bank's "call duration" — you don't know it until after you've called). Near-perfect offline, collapses live. The test: "Would this value exist, unchanged, the instant a new lead arrives, before any human worked it?"
- Train/serve skew — the same feature computed one way offline and a different way live (a rounding diff, a timezone). "The most expensive feature bug because it's silent." Fix: define each feature once (a feature store in principle — one shared function both train and serve call; you do not need the platform at this scale).
◆ Why this is the asset, not the model
A model swap is cheap; a leakage-safe feature library built from a client's own years of events is what a competitor can't reproduce. The mechanism that structurally prevents leakage: point-in-time correctness — every feature frozen as of the decision timestamp.
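Point-in-time correctness in a minimal sketch: one shared function that both training and live scoring would call, computing the feature strictly as of the decision timestamp. The event format and the function name are hypothetical.

```python
from datetime import datetime, timedelta


def product_pages_viewed_30d(events: list[dict], as_of: datetime) -> int:
    """Count product-page views in the 30 days up to the decision time.

    Events after `as_of` are ignored, so the feature cannot see the future.
    """
    window_start = as_of - timedelta(days=30)
    return sum(
        1
        for e in events
        if e["type"] == "product_page_view" and window_start <= e["at"] <= as_of
    )


# Hypothetical event log for one lead.
events = [
    {"type": "product_page_view", "at": datetime(2024, 1, 5)},    # too old
    {"type": "product_page_view", "at": datetime(2024, 2, 20)},
    {"type": "email_open", "at": datetime(2024, 2, 21)},          # wrong type
    {"type": "product_page_view", "at": datetime(2024, 3, 2)},    # after the decision
]
decision_time = datetime(2024, 3, 1)
print(product_pages_viewed_30d(events, as_of=decision_time))
```

Because there is exactly one definition, train/serve skew has nowhere to hide, and because `as_of` is a required argument, a training row cannot quietly include events from after its decision.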
3.5 The three tests a score must pass
A single number ("AUC 0.82") hides which property you have. A score must pass three genuinely different tests before you bet a dollar:
| Test | The question | Catches |
|---|---|---|
| Rank | do good leads sort above bad? (lift / AUC) | a great ranker whose probabilities are nonsense |
| Calibrate | does "0.8" actually mean 80%? | a well-ranked score you can't do money-maths on |
| Validate forward | will it survive going live? (no time-leak) | a number inflated by leaking the future |
Accuracy is the wrong metric under imbalance: if 5% of leads buy, "say no to everyone" is 95% accurate and useless — so rank with AUC, not accuracy. Calibration is the one that matters for money: among leads scored ~0.8, ~80% really buy — fix miscalibration by fitting a calibrator (Platt / isotonic) on held-out data. And random cross-validation leaks the future on time-ordered data, so validate forward in time.
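The accuracy trap can be checked with arithmetic. The sketch below uses the pairwise definition of AUC (the chance that a random buyer is scored above a random non-buyer); the data is invented to match the 5% example.

```python
def accuracy(labels: list[int], predictions: list[int]) -> float:
    """Share of predictions that match the true label."""
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)


def auc(labels: list[int], scores: list[float]) -> float:
    """Chance a random buyer is scored above a random non-buyer (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: 100 leads, 5 of whom buy.
labels = [1] * 5 + [0] * 95

say_no = [0] * 100                        # "nobody buys"
print(accuracy(labels, say_no))           # 0.95: looks great
print(auc(labels, [0.0] * 100))           # 0.5: no ranking ability at all

useful_scores = [0.9] * 5 + [0.1] * 95    # a score that sorts buyers to the top
print(auc(labels, useful_scores))         # 1.0
```

Note that the last result is a made-up perfect score. On real data a number like that is a reason to hunt for leakage, not to celebrate.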
▲ Twyman's Law
"Any figure that looks too good is probably wrong." A near-perfect AUC is a leakage alarm to investigate, not a win to report. First reaction to a 0.99 should be suspicion, not celebration.
3.6 Lead scoring is triage
Lead scoring is risk stratification: split leads by probability of buying so you spend human effort on the high group. The deliverable isn't a yes/no — it's an ordering (a ranked call list) plus an action cutoff. The backbone: logistic regression first (readable, defensible, debuggable), trees only when a held-out test shows the fancier model actually wins — "not because it's fancier."
3.7 Uplift — don't waste a touch on a sure thing
Propensity answers "who is likely to convert on their own." Uplift answers a better question: "whose conversion does our touch actually cause?" These are different people. Analogy: propensity asks "was he going to buy anyway?"; uplift asks "did our touch change his mind?" — only the second tells you where the budget actually worked.
| Segment | Touched → | Not touched → | Verdict |
|---|---|---|---|
| Persuadables | converts | doesn't | the only profitable target |
| Sure Things | converts | converts anyway | wasted (falsely booked as a win) |
| Lost Causes | doesn't | doesn't | wasted (no effect) |
| Sleeping Dogs | doesn't | would have | negative — the touch suppresses a sale |
▲ The honest ceiling
Uplift needs a randomised (or quasi-experimental) split — a random control that got no touch. On plain observational data the model confuses effect with selection and returns confident, wrong scores. "A propensity model in a causal costume is worse than an honest propensity model."
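With a randomised split, the simplest uplift estimate is a subtraction: conversion rate with the touch minus conversion rate without it. A sketch with invented outcomes; the function names are hypothetical.

```python
def conversion_rate(outcomes: list[int]) -> float:
    """Share of leads that converted (1 = converted, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


def uplift(treated: list[int], control: list[int]) -> float:
    """Extra conversions caused by the touch, per lead, from a randomised split."""
    return conversion_rate(treated) - conversion_rate(control)


# Hypothetical segment, randomly split.
treated = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]   # got the touch: 7 of 10 converted
control = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]   # random holdout: 6 of 10 converted

print(conversion_rate(treated), round(uplift(treated, control), 2))
```

A propensity model would love this segment (70% convert), but the touch only added 10 points. Most of these leads are Sure Things. The subtraction is only valid because the holdout was random, and ten leads per group is far too few for a real decision.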
3.8 Bandits vs full RL
To act under uncertainty and learn, the whole field is one trade-off: explore (try something worse, to learn) vs exploit (use the current best). Climb the ladder — don't start at the top:
| Rung | Use when | Answers |
|---|---|---|
| A/B test | few fixed options, you can wait | "which one is best on average?" |
| Multi-armed bandit | stop wasting traffic on losers while learning | "how do I earn while I learn?" |
| Contextual bandit | best action depends on this lead's features | "which action for this lead?" |
| Full RL (MDP) | today's action changes the future state | "which sequence maximises long-run reward?" |
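The second rung, a multi-armed bandit, in its simplest form: epsilon-greedy, one common strategy among several. The three "arms" and their fixed payoffs are invented so the example is repeatable; real rewards would be noisy.

```python
import random


def epsilon_greedy(reward_of, n_arms: int, steps: int, epsilon: float = 0.1, seed: int = 0):
    """Mostly exploit the best-looking arm; explore a random arm epsilon of the time."""
    rng = random.Random(seed)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms                   # running average reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                              # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])     # exploit
        reward = reward_of(arm)
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


# Hypothetical arms with fixed payoffs.
payoffs = [0.2, 0.5, 0.8]
pulls, estimates = epsilon_greedy(lambda arm: payoffs[arm], n_arms=3, steps=1000)
print(pulls, [round(e, 2) for e in estimates])
```

Compared with an A/B test that splits traffic evenly for the whole run, the bandit shifts most pulls to the best arm as soon as it has evidence, while the small exploration rate keeps checking the others.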
3.9 Self-iterating machines — a loop is only as good as its frozen scorer
A self-iterating machine improves itself: propose a change → run it → score it against a frozen metric → keep if it wins, discard if not → repeat.
✕ The single most important lesson
The scorer is the heart, frozen and walled off from the proposer. A loop whose proposer can edit or see through its scorer is grading-your-own-homework on autopilot — it climbs the number while the real objective rots. The proposer is the easy, commoditised part; the honest scorer is the whole job.
◆ Why our engine runs slow on purpose
Open-source research loops have a cheap, fast verifier (~100 experiments a night). Ours is the opposite — a booked sale under a randomised holdout is slow, noisy, confounded, and costly. So our cadence is set by how fast we can honestly measure: "a few honest experiments a quarter, not a hundred a night, and that is the correct, not the broken, speed." We are verifier-bound. Swap the slow honest scorer for a fast proxy (clicks instead of revenue) and you guarantee Goodhart drift — perfect optimisation of the wrong thing. The operating model is build-to-HOLD: the winner is held for a human read in the morning; nothing self-promotes to a client.
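The propose, run, score, keep-or-discard cycle in this section can be sketched as below. It is a toy under stated assumptions: the scorer is a stand-in formula (the real one is a slow business outcome), the "change" is a single number, and all names are hypothetical. What it tries to show is the shape: the proposer is never handed the scorer, and winners go into a held list instead of shipping.

```python
import random

def frozen_scorer(candidate):
    """Stand-in for the slow, honest metric. Fixed for the whole run."""
    return -(candidate["x"] - 3.0) ** 2

def propose(current, rng):
    """Proposer: receives only the current candidate, never the scorer."""
    return {"x": current["x"] + rng.uniform(-1.0, 1.0)}

def run_loop(start, steps, seed=0):
    rng = random.Random(seed)
    best, best_score = dict(start), frozen_scorer(start)
    held = []  # winners wait here for a human read; nothing self-promotes
    for _ in range(steps):
        candidate = propose(best, rng)
        score = frozen_scorer(candidate)
        if score > best_score:      # keep only if it beats the frozen metric
            best, best_score = candidate, score
            held.append((candidate, score))
        # otherwise: discard and try again
    return best, best_score, held

best, best_score, held = run_loop({"x": 0.0}, steps=200)
```

If `propose` were allowed to edit `frozen_scorer`, the loop would still report rising scores, which is exactly the failure the box above describes.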
3.10 Fairness — dropping protected attributes does NOT make a model fair
"We don't collect race or sex, so we're fine" is exactly the move that fails — and fails worse the smarter the model. An AI denied the protected attribute is structurally driven to rebuild it from proxies (a zip code, the shows you watch). You often can't tell by reading the feature list. So ask: (1) can the model reconstruct the protected attribute from what's left? (2) does each feature earn its place for a reason other than the disparate impact it produces? — answerable only by measuring against the protected attribute.
▲ The audit paradox — resolved by topology
To audit for proxy discrimination you need the protected attribute; minimisation says hold as little as possible. Resolve it: collect protected-class data only into a walled-off, access-controlled fairness-audit pipeline — the audit sees it; the scoring model never does. (Doctrine, not legal advice — counsel reviews before any live regulated model.)
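Question (1) above can be roughly checked inside the walled-off audit pipeline. The sketch below asks how much better than a blind guess a single feature predicts the protected attribute. It is a crude screen under stated assumptions, not a fairness test: it scores on the same records it learns from (which flatters features with many distinct values), and the field names and data are hypothetical.

```python
from collections import Counter, defaultdict

def proxy_strength(records, feature, protected):
    """Returns (baseline_accuracy, accuracy_using_feature).
    baseline: always guess the most common protected value.
    using feature: guess the most common protected value per feature value."""
    overall = Counter(r[protected] for r in records)
    baseline = overall.most_common(1)[0][1] / len(records)
    by_value = defaultdict(Counter)
    for r in records:
        by_value[r[feature]][r[protected]] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_value.values())
    return baseline, correct / len(records)

# Hypothetical audit records: only the audit pipeline ever holds "group".
records = (
    [{"zip_band": "A", "group": "g1"}] * 45 + [{"zip_band": "A", "group": "g2"}] * 5
    + [{"zip_band": "B", "group": "g1"}] * 5 + [{"zip_band": "B", "group": "g2"}] * 45
)
baseline, with_feature = proxy_strength(records, "zip_band", "group")
```

A large gap between the two numbers means the feature carries the protected attribute in disguise, even though the attribute itself never appears in the scoring model's inputs.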
3.11 The honest ceilings — memorise these
Part 4
The Paper St system
The payoff: the whole system, drawn so you can keep designing it. Parts 2 and 3 were the two halves — building with AI (loops) and the prediction engine (machine learning). Here's how they compose into a company.
4.1 The one-line thesis
You use Claude Code to build little "machines" that make a business money — a website, a lead system, an email flow. Every time you build one, you write down what worked into a private library. The more machines, the bigger and smarter that library. You group the machines by industry under a parent company, and eventually the whole thing becomes a product anyone can use.
The pattern has a precedent: the consulting-firm-to-product move — engineers run the playbook by hand for each client, then the firm turns the playbook into a product trained on everything those engagements taught it.
4.2 The three nested closed loops
A closed loop captures its own results and uses them to improve itself — like a thermostat reading the room. The system stacks three, at three sizes:
+- LOOP 3 . HOLDCO ----------------------------------------+
| pools which PATTERNS make money across ALL industries |
| +- LOOP 2 . NICHE-CO ----------------------------------+ |
| | pools what worked across ALL clients in one industry | |
| | +- LOOP 1 . CLIENT ------------------------------+ | |
| | | one machine learns from one client's customers | | |
| | +------------------------------------------------+ | |
| +------------------------------------------------------+ |
+----------------------------------------------------------+
"More systems = more real-world data = better systems."
Analogy: three Russian dolls. The smallest (one client) sits inside the middle (the industry brand), inside the biggest (the HoldCo). Each outer doll learns from everything every inner doll discovered.
4.3 The moat = Loop 3
A moat is what keeps competitors out. The moat is not any one website — those get copied in a weekend. It's Loop 3: the library of what actually made real businesses money, across many industries, that no competitor can reproduce because no competitor ran those engagements.
◆ Why "learning flywheel," not "network effect," for now
At a handful of clients, what exists is a learning flywheel (a loop that keeps improving), not yet a true data network effect — most "data moats" are weak scale-effects that plateau fast. The honest rule: call it a flywheel until three conditions hold — capture is automatic, the loop visibly closes per client, and opt-in reciprocity is in the contract. This honesty is good positioning — show it, don't hide it.
4.4 The holding company + the three-tier offer
A holding company is a parent that owns other companies but doesn't itself sell to customers. It owns the niche companies (one per industry), the proprietary systems database (the shared moat), and the self-serve product. The same core offer sells at three altitudes:
| Tier | What it is | Touch / price |
|---|---|---|
| Bespoke | we build your machine for you | highest touch, highest price |
| Niche product | a productised machine for one industry | mid |
| Self-serve | Claude + the database + governance; you build your own | lowest touch, lowest price |
Analogy: same coffee three ways — a barista makes it (bespoke), a branded pod machine for your office (niche product), or a bag of beans + instructions (self-serve). Same beans (the corpus) under all three.
4.5 The deal, in plain language
For the moat to work, every engagement makes two things true at once:
✓ The honest split
- The client owns all their raw customer data, accounts, domains, leads, and the machine built in their name — from day one, never held hostage.
- Paper St learns only from anonymised, aggregated patterns (what worked, with identifying info stripped) — never sells your data, never shares your customers.
The line: the client owns everything they touch; Paper St owns the meta — the pattern of what makes systems win.
4.6 What paperst.ai actually is
4.7 The two halves that build each machine
Two distinct systems do the work, mapping exactly onto Parts 2 and 3:
The Build Loop (the Part 2 half). A fixed recipe every system goes through so quality is repeatable, not luck:
SCOPE -> revise -> PLAN -> revise -> BLUEPRINT -> stress-test -> BUILD -> deep audit
Each stage saves the assumptions it rested on, so if something breaks later you trace it to the exact thinking error and fix the recipe, not just the symptom. (Use the strongest model on the hardest judgment — "frontier-bound" work — and design routine "cadence-bound" work so a cheaper model can do it later.)
The Refinery (the Part 3 half). The system that makes Loop 3 automatic: every running machine sends its anonymised results up → the Refinery ingests, validates, de-identifies, sorts, and digests them → it proposes improvements back down.
✕ The critical rule — "proposals, never pushes"
The Refinery never changes a live system on its own. It only proposes; a separate, gated, attributed step applies it. No exception, including trivial. This is build-to-HOLD, encoded into the architecture.
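One way to picture "proposals, never pushes" in code, as a sketch only. The class and field names are hypothetical and the real design is not specified here. The idea shown: the proposing side returns a plain data object and holds no handle on the live system, while the apply step refuses to run without a named approver and a cited result.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Proposal:
    machine_key: str
    change: dict
    evidence_rows: tuple  # ledger row ids the proposal cites

class LiveSystem:
    def __init__(self, config):
        self._config = dict(config)
        self.audit_log = []  # who applied what (the "attributed" part)

    @property
    def config(self):
        return dict(self._config)  # a copy: readers cannot mutate live state

    def apply(self, proposal, approved_by):
        if not approved_by:
            raise PermissionError("no approver: proposals are never pushed")
        if not proposal.evidence_rows:
            raise ValueError("proposal cites no verified outcome")
        self._config.update(proposal.change)
        self.audit_log.append((approved_by, proposal))

def refinery_propose(machine_key):
    # Returns data only; this function has no reference to any LiveSystem.
    return Proposal(machine_key, {"headline": "variant_b"}, ("row_0042",))
```

The gate here is structural: there is no code path from `refinery_propose` to a changed config that skips `apply`.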
4.8 The end-to-end machine
CORPUS --> BRIEF --> RENDER --> GATE --> PUBLISH --> MEASURE
...then it loops: every result becomes a tagged ledger row that feeds back into the CORPUS.
( capture -> predict -> decide -> act -> learn )
A flywheel is a loop that, once spinning, takes less and less push and gets faster as it compounds. Predict here is the corpus + Claude's judgment (no billion-dollar model needed — the tagged ledger is the substitute); act is always the gated apply step, never automatic.
◆ Why "copy the schema, skip the corpus"
The smart move isn't to out-spend the giants — it's to copy their tagging schema and shape (generate wide → human-curate → bandit-allocate) while skipping what needs billions in spend (in-platform ad testing — the big platforms already do it free and better, so own the landing side instead). Knowing what not to build is half the design.
4.9 The learning boundary — corpus vs registry
The single most important safety design in the whole system. The database is split in two, with a structural wall between:
+- database/corpus/ ----+ +- database/registry/ -------+
| systems + outcomes | | machine-key -> real client |
| the MOAT | <-- WALL | the identity map |
| the ONLY thing a | (nothing | NEVER enters a learning |
| learning job may read | crosses) | job (a lint check fails |
+-----------------------+ | if a config includes it) |
+----------------------------+
◆ Why a wall instead of a promise
Even with names stripped, data where the mapping back to a person is held nearby is still "personal data" under privacy law. So you put a structural wall between the learnable patterns and the identity map, instead of trusting a model to "just not look." Keys are opaque (mch_<hex>, no meaning), and distinctive facts are banded (audience "10K–100K", geography "US West") so a record can't be re-identified by its details. Analogy: a hospital research dataset — researchers see "Patient 4471, age band 40–50, condition X" and learn; the file mapping 4471 to a real name lives in a room they never get the key to.
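The three mechanisms named here (opaque keys, banding, and a lint check that fails on registry paths) can each be sketched in a few lines. The `mch_` prefix, the "10K-100K" band, and the two directory names come from the text above; the other band edges, the key length, the config shape, and the function names are assumptions for illustration.

```python
import posixpath
import secrets

REGISTRY_ROOT = "database/registry"

def new_machine_key():
    """Opaque key: random hex, carries no meaning about the client."""
    return "mch_" + secrets.token_hex(8)

# Upper edge (exclusive) and label; edges other than 10K-100K are assumed.
AUDIENCE_BANDS = [(1_000, "<1K"), (10_000, "1K-10K"), (100_000, "10K-100K")]

def band_audience(size):
    """Replace an exact audience size with a coarse band."""
    for upper, label in AUDIENCE_BANDS:
        if size < upper:
            return label
    return "100K+"

def lint_learning_config(config):
    """Fail if any input of a learning job resolves into the registry."""
    for path in config.get("inputs", []):
        normalised = posixpath.normpath(path)  # also catches "../registry" tricks
        if normalised == REGISTRY_ROOT or normalised.startswith(REGISTRY_ROOT + "/"):
            raise ValueError(f"learning job may not read the registry: {path}")
    return True
```

The lint runs before the job does, so a misconfigured learning job fails at check time instead of relying on the job to "just not look".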
4.10 The governance layer
The governance layer is the leash — the rules and gates that hold Claude inside set parameters so it can't do something illegal, off-brand, or destructive even while working on its own.
Model routing: not every job needs the biggest brain. The first rule is "the cheapest model is no model" — anything a plain script can check never burns AI tokens (Tier 0). Cheap models sort/tag; the strongest are reserved for deep architecture and audits. When money moves, a stronger model reviews — "higher blast radius = higher-tier review."
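The routing rules in that paragraph reduce to a small lookup. The sketch below is one possible reading; the task fields (`scriptable`, `kind`, `moves_money`) and the three-tier split are hypothetical names, not the system's actual schema.

```python
from enum import IntEnum

class Tier(IntEnum):
    SCRIPT = 0  # Tier 0: a plain script, no model, no tokens
    CHEAP = 1   # sorting and tagging
    STRONG = 2  # deep architecture and audits

def route(task):
    """Pick the cheapest tier that can do the work."""
    if task.get("scriptable"):
        return Tier.SCRIPT  # "the cheapest model is no model"
    if task.get("kind") in {"architecture", "audit"}:
        return Tier.STRONG
    return Tier.CHEAP

def review_tier(task):
    """Higher blast radius = higher-tier review."""
    return Tier.STRONG if task.get("moves_money") else route(task)
```

For example, a tagging job that triggers ad spend would be done by a cheap model but reviewed by a strong one.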
The Governor — the rule engine every action passes through. A few worth knowing:
- Rule-of-Two — any job holding all three of {untrusted input, private data, outbound ability} is dangerous and must be redesigned or human-gated. (§2.11's lethal trifecta, as policy.)
- Deterministic policy only — a guardrail that "blocks 95%" never counts; only a hard architectural cut does.
- Proposals, never pushes — §4.7, encoded as governance.
- Learn only from verified outcomes — "the AI reflected and learned" with no attached verified result is banned.
- Kill switch outside the reasoning path — a stop button that doesn't depend on the AI agreeing to stop.
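The Rule-of-Two from the list above is a good example of "deterministic policy only", because it can be a plain set check with no model involved. A minimal sketch, with hypothetical capability labels:

```python
RISKY = frozenset({"untrusted_input", "private_data", "outbound_ability"})

def rule_of_two(capabilities):
    """Allow a job only if it holds at most two of the three risky capabilities."""
    held = RISKY & set(capabilities)
    return len(held) < 3

# A job that reads inbound email (untrusted), sees the CRM (private),
# and can send messages (outbound) holds all three.
email_agent_ok = rule_of_two(["untrusted_input", "private_data", "outbound_ability"])
summariser_ok = rule_of_two(["untrusted_input", "private_data"])
```

A job that fails the check is redesigned (remove one capability) or put behind a human gate; the check itself never "blocks 95%".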
A consent envelope is the explicit leash: a signed, bounded permission slip that lets an agent act on its own within hard limits (spend caps, banned claim-classes, pre-approved templates, a kill switch). Analogy: a nuclear plant's control room — cheap sensors handle the routine, humans sign off on anything that moves real fuel, and there's a physical scram button wired outside the computer.
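A consent envelope can be pictured as a small object that every outbound action must ask first. This is a sketch under assumptions: the field names are hypothetical, and the "signed" part (proving who granted the envelope) is left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class ConsentEnvelope:
    spend_cap: float
    approved_templates: frozenset
    banned_claim_classes: frozenset
    killed: bool = False  # flipped by a human, outside the agent's own code path
    spent: float = 0.0

    def authorise(self, template, claim_classes, cost):
        """Return True and record the spend only if every hard limit holds."""
        if self.killed:
            return False
        if template not in self.approved_templates:
            return False
        if self.banned_claim_classes & set(claim_classes):
            return False
        if self.spent + cost > self.spend_cap:
            return False
        self.spent += cost
        return True

envelope = ConsentEnvelope(
    spend_cap=10.0,
    approved_templates=frozenset({"intro_v1"}),
    banned_claim_classes=frozenset({"medical", "guaranteed_results"}),
)
```

Every check is a hard yes or no. The agent can act freely inside the limits and cannot negotiate with them.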
4.11 The whole stack on one page
| Level | What it is | Status |
|---|---|---|
| L0 · Truth | claims & assumptions ledgers; every row human-verifiable | now |
| L1 · Build Loop | the standard recipe for making anything | now |
| L2 · Corpus + papertrail | the moat: systems + outcomes records | now |
| L3 · Refinery | the brain that ingests, digests, and proposes | next (research-first) |
| L4 · Autonomy | governed agents inside consent envelopes | later (gated) |
| L5 · Product | self-serve: the packaged corpus + Governor for outside tenants | future |
Cross-cutting all six: the Governor (every arrow passes a gate), the Console (every job emits a heartbeat, so silent failures surface), the Fleet (all deployed machines). The papertrail doctrine — every level keeps an append-only ledger and every recommendation cites the row it came from — is the edge over vanilla Claude: the captured failures + traced reasoning are what a fresh model can't reproduce.
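The papertrail doctrine has two mechanical parts: rows can be added but not edited, and a recommendation without a citation is rejected. A minimal in-memory sketch follows; a real ledger would live in durable storage, and the class and field names here are hypothetical.

```python
class Ledger:
    """Append-only: there is no update or delete method."""

    def __init__(self):
        self._rows = []

    def append(self, claim, source):
        row_id = len(self._rows)
        self._rows.append({"id": row_id, "claim": claim, "source": source})
        return row_id

    def get(self, row_id):
        if not 0 <= row_id < len(self._rows):
            raise KeyError(f"no such ledger row: {row_id}")
        return dict(self._rows[row_id])  # a copy, so history can't be edited

def recommend(ledger, text, cites):
    """Build a recommendation only if every cited row exists."""
    if not cites:
        raise ValueError("a recommendation must cite at least one ledger row")
    for row_id in cites:
        ledger.get(row_id)  # raises KeyError for a missing row
    return {"text": text, "cites": tuple(cites)}
```

Because each recommendation carries row ids, anyone reviewing it can walk back to the exact outcome it rests on.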
▲ Honest build status
L0–L2 are live work; L3 is next; L4–L5 are later and gated. This is the designed plan, not a finished system — and honesty about what's built is itself part of the design.
4.12 How to keep designing it
When you sit down to design the next piece, route the decision through what this course gave you:
- Part 2 problem or Part 3 problem? "Get the AI to build this reliably?" → loops. "Predict / decide well from data?" → engine.
- Find the rung first (on Part 2's ladder, not the §4.11 stack, which reuses the L0–L5 labels). A one-off is L0/L1. Repeats and machine-checkable → L3/L4. Nothing goes to L5 until one manual run is reliable.
- Name the verifier before you build. Can't say what "done" objectively is? You have an agent agreeing with itself, not a loop.
- Decouple proposal from execution, always. The agent proposes; a separate validated layer executes and logs.
- Keep the learning boundary sacred. Anything that learns reads the corpus, never the registry.
- Be honest about what's built. Flywheel not network-effect; estimate not quote; designed not deployed.
▸ A worked example — designing one new piece, out loud
Say you want a subject-line learner for the outreach machine — something that figures out which cold-email subject lines get replies. Run it through the questions instead of guessing:
- Part 2 or 3? Both. Deciding which line to send is a Part-3 prediction; the machinery that tries, measures, updates is a Part-2 loop.
- Which rung? It repeats, a machine can check it, options are a fixed menu — that's a bandit (§3.8), not full RL. Most marketing "AI" is a bandit nobody set up. Start there.
- The frozen verifier? The honest metric is reply / booked-call rate, not open rate — open rate is a proxy, and a self-modifying loop optimises exactly the number you give it (Goodhart). Freeze the scorer on the real outcome.
- Enough data? Below ~100 booked calls/segment it's a ranking aid, not a bet (§3.11).
- Proposal vs execution? It proposes a line; a gated step (or a consent envelope with a spend cap) sends — never the model directly.
- Cross the wall? Reads aggregated patterns from the corpus; never the registry. If a feature needs a lead's identity, redesign the topology — don't add a filter.
The first concrete decision falls right out: a Thompson-sampling bandit over a small set of human-approved subject lines, scored on booked calls, proposing-not-sending, reading only the corpus. You didn't need a new idea — you needed the course's questions, in order. That is the motion that keeps the system designable.
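For the curious, the decision that falls out of the worked example can be sketched as follows. It is a minimal Thompson-sampling bandit under stated assumptions: each subject line gets a Beta(1, 1) starting belief, the only outcome recorded is whether a call was booked, and the class and method names are hypothetical. It proposes a line and never sends anything.

```python
import random

class SubjectLineBandit:
    def __init__(self, approved_lines, seed=None):
        self.rng = random.Random(seed)
        # Per line: [booked calls + 1, non-bookings + 1], i.e. a Beta(1, 1) prior.
        self.stats = {line: [1, 1] for line in approved_lines}

    def propose(self):
        """Draw one plausible booking rate per line; propose the highest draw."""
        draws = {
            line: self.rng.betavariate(a, b) for line, (a, b) in self.stats.items()
        }
        return max(draws, key=draws.get)

    def record(self, line, booked_call):
        """Score on the real outcome. Raises KeyError for an unapproved line."""
        self.stats[line][0 if booked_call else 1] += 1

bandit = SubjectLineBandit(["Quick question", "Idea for your team"], seed=0)
proposal = bandit.propose()  # a gated step, not this object, does the sending
```

Lines with little data produce wide, varied draws, so they still get tried; lines with a long losing record rarely win the draw. That is the explore-vs-exploit trade-off handled without a hand-set exploration rate.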
Part 5
Put it together
5.1 The one mental model for the whole course
+---------------------------------------------------------------+
| PART 4 . THE COMPANY   paperst.ai: 3 nested loops, the moat   |
| +- PART 2 . LOOPS + PART 3 . THE ENGINE -------+              |
| | direct AI to build    predict -> decide ->   |              |
| | maker/checker, gates  learn (the ML)         |              |
| +----------------------------------------------+              |
| PART 1 . HOW AI WORKS  it predicts plausible text, not truth  |
+---------------------------------------------------------------+
model unreliable -> build a harness that checks it -> it builds machines
-> the machines feed one library -> the library is the moat
The single thread: the model is brilliant but unreliable, so you never trust it — you build a harness that checks it. That harness builds the machines. The machines feed one shared library of what-made-money. That library, owned by you, is the thing no competitor can copy. Everything else hangs off that thread.
5.2 The ideas worth tattooing on the inside of your eyelids
- A prompt makes you the engine; a loop makes the system the engine.
- It predicts plausible text, not true text — fluency and hallucination are the same mechanism.
- Harness beats model (~70/30). Invest in the system around the model before waiting for a smarter one.
- The verifier is the heart. No separate check = an agent agreeing with itself. Fix the artifact, never weaken the gate.
- Climb for leverage, drop for reliability. Prove it once, harden it, then automate it.
- The schema is the moat; the model is secondary.
- Offline metrics are not live improvement. Calibrate before you act. Twyman's Law: too-good is a bug.
- The agent proposes; a separate layer executes. Proposals, never pushes.
- Honesty is the moat's foundation. Flywheel not network-effect; estimate not quote; designed not built.
5.3 The cheat sheet — which tool for which job
| You want to… | Reach for |
|---|---|
| Get one good answer, once | a clear prompt (L0) |
| Make an AI build one verifiable thing | a single-agent loop with a real test (L2) |
| Make "done" actually mean something | maker/checker — separate the doer from the checker (L3) |
| Build a batch / multi-phase thing | a Workflow — planner → executors + per-item gates (L4) |
| Predict who's worth a human's time | lead scoring = triage → a ranked call list |
| Spend a scarce touch wisely | uplift — find persuadables, avoid sleeping dogs |
| Choose + learn under uncertainty | a bandit (most cases), not full RL |
| Trust a probability for money-maths | prove it ranks, calibrates, and validates forward |
| Keep an agent from leaking data | break the lethal trifecta by topology |
| Compound everything into a business | the three nested loops + the corpus/registry wall |
5.4 The honest limits
This course is reasoning and guidance, not gospel. The heuristic numbers (the ~50% accept-rate, the ~70/30 harness split, the volume gates) are rules of thumb, not laws. Model sizes and prices move — re-check at use-time. The legal and fairness material is doctrine, not legal advice. And most of the system above L2 is designed, not built. None of that weakens the spine; it's what keeps it honest. Build the loop like someone who intends to stay the engineer — not just the person who presses go.
Reference
Glossary — the words, grouped
The vocabulary in plain lines, grouped by where it shows up. Skim now; come back when a word trips you.
A · How AI works
- Model: a trained maths function (a file of numbers) that maps an input to a predicted output.
- Parameters / weights: the tunable internal numbers; "70B parameters" = 70 billion of them.
- Training vs Inference: the expensive one-time tuning of the weights vs running the finished model (weights don't change). Every prompt is inference.
- Token: the word-fragment a model reads/writes in; ~4 characters / ~0.75 word; what you're billed by.
- Next-token prediction: the core LLM mechanism (predict the most likely next token, append, repeat).
- Hallucination: confident, fluent output that's fabricated; the same mechanism as the fluency.
- Temperature: the randomness dial; low = consistent, high = varied/creative.
- Context window: the total tokens a model can "see" at once; finite, capped per model. The "dumb zone" is its least-attended middle.
- Prompt / context engineering: crafting the wording vs designing what information fills the window.
- Harness: the machinery around the model (verifiers, loops, file handoffs) that keeps output correct.
- RAG: retrieval-augmented generation (look up real docs and paste them in so the model answers from sources, not memory).
- Grounding: tying an answer to verifiable real sources; the antidote to hallucination.
- Agent / tool use / MCP: an LLM that acts in a loop / a menu of real actions it can invoke / the standard plug for connecting tools to an AI.
B · Building with AI (loops)
- Loop: a repeating cycle: the AI acts, gets feedback, picks the next action, until a stop condition.
- The ladder (L0–L5): the rungs from manual prompting (L0) to scheduled/self-improving (L5).
- Goal / Verifier / State / Stop: the four parts every real loop needs; the verifier (the gate it can't argue past) is the heart.
- Workflow vs Agent: control flow held by your code (predictable) vs by the model (flexible). The whole distinction.
- Maker/checker: split the agent that does from the agent that checks; most of the quality.
- Adversarial verification: N independent skeptics, each told to refute, each with a different lens; majority rules.
- Cost per accepted change: the real measure of a loop's worth, not tokens spent or loops run.
- Ralph Wiggum loop: the agent exits a half-finished job and the loop keeps spending in silence.
- LLM-as-judge / calibration gate: an LLM scoring a rubric / proving it agrees with humans >75% before you trust it.
- Lethal trifecta: private data + untrusted content + outbound ability in one context = dangerous.
- Prompt injection: malicious instructions hidden in content the agent reads.
- Durable execution: append-only history + memoization so a crash replays without re-paying for completed work.
C · The engine (machine learning)
- Supervised / unsupervised / reinforcement: learning with answer-keys / without / by trial-and-reward.
- Loss / gradient descent / SGD: how-wrong-a-prediction-is / stepping downhill to minimise it / using one random sample per step.
- Overfitting / bias-variance: memorising the noise (great on training, bad on new data) / the too-simple vs too-flexible U-curve.
- Regularisation / ridge / lasso: penalising complexity / shrink-all-keep-all / shrink-some-to-zero (auto feature-selection).
- Feature: an input signal as a concept (an aggregate per entity), not a raw value.
- Target leakage / train-serve skew: a feature that encodes the outcome / the same feature computed differently offline vs live. Both invisible offline.
- AUC / calibration: a ranking score (good lead outranks bad) / "0.8 means ~80%".
- Twyman's Law: "any figure that looks too good is probably wrong"; a near-perfect AUC is a leakage alarm.
- Propensity vs uplift (CATE): the level of outcome (likely to convert) vs the change the touch causes.
- Persuadables / sure things / sleeping dogs: the response segments; only persuadables are worth the budget; sleeping dogs are suppressed by the touch.
- Bandit (multi-armed / contextual): choose-and-learn among options / when the best choice depends on features. Most decisions are a bandit, not full RL.
- Frozen scorer / Goodhart's law: the unchangeable grading metric / a loop pointed at a proxy optimises the proxy and drifts.
- Proxy discrimination: a neutral feature useful because it reproduces a protected class's disparate impact.
D · The Paper St system
- Thesis: the single locked sentence every build is judged against.
- Closed loop / nested loops: a self-improving system / loops inside loops (client · niche-co · HoldCo).
- Moat: the durable advantage competitors can't copy; here, the cross-niche corpus of what made businesses money.
- Learning flywheel vs network effect: an improving loop (true now) vs a self-reinforcing data advantage (not yet; say "flywheel").
- HoldCo / three-tier offer: the parent that owns subsidiaries / bespoke · niche product · self-serve.
- Build Loop / Refinery: the recipe that builds machines (loops/) / the system that learns from them (engine/).
- Corpus / registry / learning boundary: the learnable library (safe) / the key↔client map (never learned) / the structural wall between them.
- Governor / Rule-of-Two / consent envelope: the rule engine every action passes / the trifecta budget / a bounded permission slip with a kill switch.
- Build-to-HOLD: do every safe, reversible step, then hold the irreversible one for a human; nothing ships itself.
- Papertrail doctrine: every level keeps an append-only sourced ledger; every recommendation cites the row it came from.