PaperSt.AI

The Machine Room.

How AI actually works — and how we build with it. From "what is a token" to "how do I design a self-improving machine." Plain English first, then the proper word for it. Written for a smart non-engineer; no maths or code assumed.

~75–100 min read · 6 parts + glossary · No code assumed · You'll be able to design the next piece

Part 00

Start here — how to use this

This is not a coding course. You will not write code by hand. It's a course about how the machine actually works — the AI you already use, the machine-learning that powers a prediction, the "loops" that let an AI do work while you sleep, and the blueprint of the system we're building. The goal: when you sit down to design the next piece, you reason from how it truly works, not from analogy and vibes.

The one reframe the whole course turns on

A prompt makes you the engine. A loop makes the system the engine. Most people use AI by hand, one request at a time. The leverage comes from building a system that keeps the AI correct and pointed at the goal — so the work continues when you step away.

The colour key — read this once

The whole guide is colour-coded so you can scan it fast. Five box types, and you'll see them everywhere:

Why this matters

The reasoning behind a thing. The most important box — rules you don't understand get misapplied.

Do this

A good habit or the correct move.

Never do this

A hard line. Breaking it costs money, a client, or trust.

Watch out

A common trap or a place people get confused.

Try it

A small exercise. Doing beats reading.

You'll also see term chips for vocabulary (all collected in the Glossary), monospace for things you'd type, and the inverted highlight for the single sharpest idea on a page.

The map of the journey

| Part | What you'll understand | The one-liner |
|---|---|---|
| 1 · How AI works | LLMs, tokens, training vs using, why it lies, context | the ground floor |
| 2 · Building with AI | prompts → loops → agents, the L0→L5 ladder | how you direct it |
| 3 · Machine learning | predict → decide → learn; the maths, made simple | how the prediction works |
| 4 · The Paper St system | the thesis, the three loops, the moat, the blueprint | what we're building |
| 5 · Put it together | how the pieces compose, and how to keep designing | the whole machine |

Why this order

You can't design loops until you know why an AI needs a leash (Part 1's "it predicts plausible text, not true text"). You can't design the prediction engine until you can direct AI to build it (Part 2). And you can't design the company until you've seen both halves it's made of (Parts 2 and 3). Each part is the floor the next one stands on.

Part 1

How AI actually works

The ground floor. Five ideas, in order. Get these and the rest of the course has somewhere to stand.

1.1 The nesting dolls: AI ⊃ ML ⊃ Deep Learning ⊃ LLM

These five words get used like synonyms. They're actually nested boxes, like Russian dolls.

the words, nested
+- ARTIFICIAL INTELLIGENCE -- any machine doing something "smart" ---+
| +- MACHINE LEARNING -- learns patterns from examples --------+     |
| | +- DEEP LEARNING -- ML with many-layered networks -------+ |     |
| | | +- LLM -- trained on huge text, predicts language ---+ | |     |
| | | | Claude . ChatGPT live here                         | | |     |
| | | +----------------------------------------------------+ | |     |
| | +--------------------------------------------------------+ |     |
| +------------------------------------------------------------+     |
+--------------------------------------------------------------------+

   GENERATIVE AI = a sticker across these, for the ones that CREATE

Generative AI is not a smaller doll — it's a label that cuts across the stack, meaning any AI whose job is to produce new content (text, images, code). An LLM writing you a paragraph is generative; a model labelling an email "spam / not spam" is AI/ML but not generative.

Try it — say it in one breath

"AI contains machine learning, which contains deep learning, which contains LLMs; generative AI is a label for the ones that create." If you can say that, §1.1 is done.

1.2 What a "model" is — training vs inference

A model is just a big maths function that's been tuned so that, given an input, it produces a useful output. Not a database of answers, not a person — a giant pile of numbers (the weights, a.k.a. parameters) wired so that running an input through them yields a prediction. "A 70-billion-parameter model" means 70 billion of those numbers.

There are two completely different moments in a model's life, and beginners conflate them constantly:

| Phase | What it is | Cost & when |
|---|---|---|
| Training | reads enormous data, slowly nudges its numbers until predictions get good | millions of $, weeks — once per version, before you touch it |
| Inference | actually using the finished model: send a prompt, get an answer | cheap, fast, every time — the numbers do not change |

Why this matters more than it looks

Because the weights are frozen during inference, an LLM has no memory of yesterday unless you re-feed it. That one fact explains "memory systems," "context," and half the design decisions later in this course. The model doesn't remember you; the system around it does.

The structure those weights live in is a neural network — layers of simple maths units, each multiplying its inputs by its weights and passing the result on. Stack many ("deep") and it represents complex patterns. Analogy: training is baking a recipe into a chef's muscle memory over years; inference is the chef cooking one dish to order. They don't re-learn cooking each time you order.

1.3 How an LLM works: next-token prediction

This is the single most clarifying fact in the course. Underneath all the polish, an LLM does one thing: it predicts the next chunk of text given all the text so far. Then it appends that chunk and predicts the next, one piece at a time, until it stops.

one step at a time
 "The capital of France is"          -->  [ model ]  -->  "Paris"
 "The capital of France is Paris"    -->  [ model ]  -->  "."

 predict the next token  -->  append it  -->  do it again, until done

The chunk is a token — a word-fragment. Rule of thumb: ~4 characters ≈ 1 token, ~100 tokens ≈ 75 words. Common words are one token; rarer ones split ("tokenization" → "token" + "ization").
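For readers who want the rule of thumb as arithmetic, here is a minimal sketch. The function names are made up for illustration, and the ratios are the rough averages quoted above, not what a real tokenizer would report for any particular text.

```python
import math

CHARS_PER_TOKEN = 4       # rule of thumb from the text, not an exact tokenizer
WORDS_PER_TOKEN = 0.75    # about 100 tokens per 75 words

def estimate_tokens_from_chars(text: str) -> int:
    """Rough token count: about one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_tokens_from_words(word_count: int) -> int:
    """Rough token count from a word count: about 100 tokens per 75 words."""
    return math.ceil(word_count / WORDS_PER_TOKEN)

# A 750,000-word pile of documents is roughly a million tokens.
print(estimate_tokens_from_words(750_000))
```

Use it for budgeting only; when the exact count matters, ask the model provider's own token counter.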

Why it's fluent: trained on a colossal amount of human writing, its "what comes next" instinct mirrors how people write. It's an extraordinarily good autocomplete.
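The predict, append, repeat cycle can be sketched in a few lines. This is a toy, under loud assumptions: the "model" is a hand-written lookup table (`NEXT_TOKEN`, invented here) that looks only at the last token, whereas a real LLM scores every possible next token using all the text so far.

```python
# Toy stand-in for a model: which token follows which. Invented for illustration.
NEXT_TOKEN = {
    "The": "capital", "capital": "of", "of": "France",
    "France": "is", "is": "Paris", "Paris": ".",
}

def generate(prompt_tokens, max_new_tokens=10):
    """Predict the next token, append it, and repeat until a stop or the cap."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1])   # "predict" from the text so far
        if next_token is None:
            break                                 # nothing plausible follows
        tokens.append(next_token)                 # append it
        if next_token == ".":
            break                                 # a natural stopping point
    return tokens

print(" ".join(generate(["The", "capital", "of", "France", "is"])))
```

Notice what is missing: nothing in the loop checks whether "Paris" is true. The table only says it is what usually comes next.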

Why it HALLUCINATES — the most important calibration here

The model optimises for plausible-sounding next tokens, not for truth. It has no fact-checker and no real sense of "I don't know this." Ask for a citation it never saw and it produces one that looks perfect and is fabricated — because that string is statistically plausible. Fluency and hallucination come from the same mechanism. You can't keep one and delete the other by asking nicely.

Why the whole back half of this course exists

Because the model is fluent-but-unreliable, you never trust its confidence — you check its output with something outside it. Loops, verifiers, and grounding (Parts 2 and 3) are all answers to this one problem.

Temperature is the randomness dial. Low (near 0) = grab the single most-likely token → consistent, good for facts/code. High = sometimes pick less-likely tokens → varied, creative, more drift. Analogy: it's your phone's autocomplete, if autocomplete had read the entire internet — and a great-sounding suggestion can still be flat wrong.
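A minimal sketch of what the dial does to the odds, assuming the common approach of dividing each token's raw score by the temperature before converting scores to probabilities. The scores below are made up for illustration.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw token scores into probabilities; temperature sharpens or flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as 'always take the top token'")
    scaled = [s / temperature for s in scores]
    top = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

candidates = ["Paris", "Lyon", "Berlin"]
scores = [5.0, 3.0, 1.0]                           # hypothetical raw scores

low = softmax_with_temperature(scores, 0.2)        # nearly all the weight on "Paris"
high = softmax_with_temperature(scores, 2.0)       # weight spreads to the others
for name, p_low, p_high in zip(candidates, low, high):
    print(f"{name}: low={p_low:.3f} high={p_high:.3f}")
```

At low temperature the top token wins almost every time; at high temperature the runners-up get picked often enough to matter.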

1.4 The context window — its working memory

The model stores nothing about you between sessions. Everything it can "see" when it answers must fit in one bucket: the context window — the total text (in tokens) it can read at once: your instructions, the chat history, pasted documents, and the answer it's writing, all sharing that one bucket. Bucket full → something drops. That's why a long chat eventually "forgets" the start.

It's finite because cost grows steeply with length (roughly: doubling the context can quadruple the cost). As of mid-2026 the big Claude models (Opus 4.8, Sonnet 4.6, Fable 5) hold ~1,000,000 tokens (~750k words, ~10 novels); the small fast one (Haiku 4.5) holds 200,000. (These numbers move — re-check per model.)
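The "bucket full, something drops" behaviour can be sketched as a budget. This is one simple policy (keep the instructions, keep the newest messages that fit), not how any particular product trims its history; `fit_to_window` and `rough_count` are names invented here, and the sketch does not reserve room for the answer.

```python
def rough_count(text):
    """Rule-of-thumb token count: about four characters per token."""
    return max(1, len(text) // 4)

def fit_to_window(instructions, history, max_tokens, count_tokens=rough_count):
    """Keep the instructions plus the most recent messages that fit the budget."""
    budget = max_tokens - count_tokens(instructions)
    kept = []
    for message in reversed(history):      # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break                          # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return [instructions] + list(reversed(kept))

history = ["a" * 40, "b" * 40, "c" * 40]   # three 10-token messages, oldest first
window = fit_to_window("s" * 8, history, max_tokens=25)
print(len(window) - 1, "of", len(history), "messages survive")
```

The oldest message is the one that falls out, which is the mechanical reason a long chat "forgets" its start.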

Watch out — bigger is NOT automatically better

Research ("Lost in the Middle," 2023) showed models use the start and end of a long context well but sag in the middle — a U-shaped curve. Newer work ("context rot," 2025) found accuracy degrades as length grows even on trivial tasks. So cramming 800K tokens "just in case" can make the model worse: the number you needed gets lost in the middle and the noise distracts it. This region is the "dumb zone."

Why this becomes a design rule

Keep each job's context small and relevant, put load-bearing instructions at the top and bottom, and reset context with file handoffs instead of one sprawling mega-session. Many short fresh-context jobs beat one long smart one — the cheapest reliability upgrade there is. (This hands straight off to Part 2.)

1.5 The leverage shift: prompt → context → systems

Everyone's journey has three stages, and the leverage climbs at each:

  1. Prompting. Write the perfect instruction, hope for a great answer. Works for one-offs; it's where you start. But it hits a model that hallucinates, forgets when the window fills, and has no memory tomorrow.
  2. Context engineering. Stop obsessing over magic wording; manage what goes into the window — which docs, in what order, how much history. You're curating the model's working memory. The leverage moves from the sentence you write to the information environment you build.
  3. System design. Stop relying on the model being right; build a machine that keeps it correct — small jobs, clean context, an external verifier, file handoffs, loops that catch errors. Reliability comes from the harness, not the model's brilliance.

The whole game in one line

Most AI failures are harness design, not model quality. A decent model with a great harness beats a great model with a bad one. Prompting is the floor, not the leverage.

Analogy: prompting is giving a brilliant-but-forgetful intern clearer instructions. Context engineering is controlling which files are on their desk. System design is building the whole office around them — a checklist, a second reviewer, an inbox that feeds the right doc at the right moment — so the team is reliable even though the intern alone never is.

1.6 Honest calibration: good at vs bad at

| Genuinely good at (lean in) | Genuinely risky (build guardrails) |
|---|---|
| Fluent language: draft, rewrite, summarise, translate, tone | Facts from memory — confidently wrong; needs grounding |
| Transformation over a source you give it (truth is in the input) | Exact maths / counting — it predicts tokens, doesn't calculate |
| Code generation & explanation, with a human reviewing | Knowing what it doesn't know — confidence ≠ correctness |
| Breadth — a strong first-draft brainstorm partner | Self-correction with no external check — can get worse |
| Pattern tasks: classify, sort, extract — fast & cheap at scale | Current events past its cutoff; long cluttered context |

The one-sentence calibration

Treat the model as a brilliant, fast, unreliable writer — magic when the truth is in front of it or a checker is behind it, dangerous when you trust its unaided memory or its confidence. That asymmetry is the entire case for the system-design work the rest of this course teaches.

Part 2

Building with AI — loops & agents

The orchestration skill — how you direct AI to do work instead of doing every step by hand. The most immediately useful part of the course.

2.1 A prompt vs a loop

Most people use AI the slow way: type a request, wait, judge it, fix it, ask again — all by hand. You are the engine; the AI is a tool in your hand, and a tool does nothing on its own.

A loop is the faster way: you define the goal once, and the system finds the work, does it, checks its own result against a test it cannot argue with, writes down what happened, and repeats until the goal is met or a hard limit stops it. The skill shifts from writing the perfect prompt (authorship) to designing the cycle that keeps the AI correct (orchestration).

The one line to keep

A prompt hands the AI an instruction and waits for you. A loop hands the AI a job, a way to know when it's done, and a rule for when to give up. The unit of work is no longer the prompt — it's the cycle.

2.2 The ladder: L0 → L5

An honest progression from lowest leverage to highest. Each level is the right tool somewhere — climbing is not always correct.

| Level | What it is | Real outside check? | Where it's right |
|---|---|---|---|
| L0 | you prompt by hand, every step | none — you judge | one-offs, taste calls, exploring |
| L1 | the prompt grades itself vs written criteria | weak (self-scored) | one doc held to a bar |
| L2 | a single agent loops until a real test passes | yes — objective | one verifiable target |
| L3 | a maker + a separate checker (the heart) | yes — independent | "done" must mean something |
| L4 | a planner fans work to executors + verifiers | yes — a gate per item | a batch; a multi-phase build |
| L5 | a heartbeat fires it; it finds its own work; improves itself | yes + audit + budget guard | recurring, machine-checkable work |

Why it's a thinking tool, not a scoreboard

The rule of motion: climb a level for leverage, drop a level for reliability. The instant a loop gets flaky, step down until it's solid, then climb again. Most real value lives at L3 and L4. L5 is only for work that genuinely repeats and a machine can check.

The build order — do NOT skip ahead

Scheduling something you haven't made reliable by hand is how loops blow up while you sleep:

  1. Get one manual run reliable (L0/L1 — prove it end to end, by hand).
  2. Turn it into a skill — a saved, reusable instruction file.
  3. Wrap it in a loop: add the gate it can't argue past + a hard cap (L2/L3).
  4. THEN put it on a schedule (L5). Prove it once, harden it, then automate it.

2.3 What a loop actually IS — four parts

A loop is not "an agent that runs a few times." It's four specific things, and three are where people go wrong:

| Part | What it is |
|---|---|
| Goal | a checkable condition, not a vibe. "every test in /auth passes, lint clean," not "improve it" |
| Verifier (the gate) | THE HEART. the check it can't talk past. Without it, a loop is an agent agreeing with itself |
| State | a small record (done / failed / next) so the next pass resumes instead of repeating the mistake |
| Stop | success, OR a hard limit ("after 8 tries, stop and report"). No exit = it runs till it drains the account |

each pass runs this cycle
   DISCOVER  --  find what needs doing
      |
   PLAN      --  break the goal into checkable tasks
      |
   EXECUTE   --  the agent calls tools             ( matters least )
      |
   VERIFY    --  a SEPARATE gate, not self-judgment    ( the heart )
      |
   ITERATE   --  not done? carry state, loop back up to PLAN
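The four parts fit in a dozen lines. This is a minimal sketch, not the Paper St implementation: `do_work` stands in for the agent, `goal_met` stands in for the external gate, and both names are invented here.

```python
def run_loop(goal_met, do_work, max_tries=8):
    """Run passes until a separate check passes or the hard cap is hit."""
    state = {"tries": 0, "failures": [], "status": "running"}
    while state["tries"] < max_tries:            # STOP: the hard limit
        state["tries"] += 1
        artifact = do_work(state)                # EXECUTE: the maker's attempt
        passed, reason = goal_met(artifact)      # VERIFY: a gate the maker cannot argue with
        if passed:                               # GOAL: a checkable condition was met
            state["status"] = "done"
            return artifact, state
        state["failures"].append(reason)         # STATE: the next pass sees what failed
    state["status"] = "gave_up"
    return None, state

# Stand-ins: the "agent" improves a number each pass; the gate demands at least 6.
artifact, state = run_loop(
    goal_met=lambda a: (a >= 6, f"{a} is too small"),
    do_work=lambda s: s["tries"] * 2,
)
print(artifact, state["tries"], state["failures"])
```

The design choice worth noticing is that `goal_met` is passed in from outside. The maker never gets to decide that it is finished.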

Why "state" is secretly the multiplier

Microsoft's Magentic-One system degrades 31% if you remove its written ledgers. The orchestrator doesn't think better — it writes down what it knows, tried, and plans. Boring bookkeeping is the lever, not a smarter model. (At Paper St: the LEDGER, trackers, verdicts-to-disk, the lesson you write into the project memory.)

2.4 When NOT to loop — the gate & four failure modes

A loop pays off only when all of these hold — miss one and a single good prompt wins:

  • It repeats (roughly weekly+) — so the setup cost amortises.
  • A machine can auto-reject bad output — a test, build, linter, or a rubric a second model scores.
  • The agent can do it end to end — not hand half back to you each pass.
  • "Done" is objective, not a taste call.
  • (On our machine) RAM can absorb it — too many parallel agents run the laptop out of memory, and the operating system kills processes (an "OOM kill").

The metric nobody tracks is cost per accepted change — not tokens spent or loops run. If a loop hands you ten results and you toss six, you're doing the review it was meant to save. And the four failure modes get worse as the loop gets smoother:

The four silent failures

  • Ralph Wiggum loop: the agent decides it's done too early, exits half-finished, and the loop keeps spending while producing nothing. Loops don't crash this way — they bill you in silence.
  • Grading your own homework: no separate check = an agent agreeing with itself.
  • Comprehension debt: the gap between what the repo contains and what you understand; it grows the faster a loop ships code you didn't read.
  • Cognitive surrender: accepting whatever comes back. Build the loop like someone who intends to stay the engineer.

2.5 Where the vocabulary comes from

The hype is new; the mechanism is old.

ReAct (the seed)
Reason + Act: think → act → observe → think again. That single cycle is what most people mean by "an agent."
Augmented LLM
the substrate everything's built on — a model extended with retrieval (writes its own queries), tools (picks & calls them), and memory.
Workflow vs Agent
a workflow runs model calls through control flow you wrote in code (predictable, cheaper). An agent lets the model direct its own process (flexible, less predictable). The whole distinction: who holds control flow — your code, or the model. Pick deliberately.
The five workflow patterns
prompt chaining · routing · parallelisation · orchestrator-workers · evaluator-optimizer (generate → a second model evaluates → loop — the direct ancestor of maker/checker).
"Ralph"
the plainest loop: a coding agent in a bare while loop, same prompt against a spec, fresh instance each time, filesystem as memory. Proof the leverage lives in the loop, not a clever prompt.

Why start simple

Anthropic's central discipline: begin with the simplest viable approach; add agentic complexity only when simpler solutions fall short. Every extra loop costs latency, tokens, and a new way to fail.

2.6 The Claude Code primitives — tool ↔ loop block

| Loop block | The tool | What it does |
|---|---|---|
| Heartbeat (in-session) | /loop | re-runs a prompt on an interval, or self-paces |
| Stop-condition runner | /goal | runs until a verifiable condition is met; a separate fast model checks "done" |
| Heartbeat (out-of-session) | cron / GitHub Actions | fires on a schedule or repo event, no session open |
| Act on the world | MCP connectors | read issues/CRM/DB, open PRs, post — not just suggest |
| Orchestration (L4) | Workflow, subagents | deterministic multi-agent scripts; delegated helpers |
| Make a gate blocking | hooks | a PreToolUse hook can refuse a tool call outright |

The WSL override

The Workflow tool allows up to 16 concurrent agents; on this machine we cap it DOWN to ≤4 (default 3) — too many parallel agents crash the laptop. Chunk every batch to ≤4; never trust the tool's internal cap.

2.7 Maker / checker — the single most important pattern

The agent that wrote the work is a poor judge of it — not a model limitation, a structural one: the maker is too generous grading its own homework. So split the roles.

  +---------+   builds    +----------+   refutes   +-----------+
  |  MAKER  | ----------> | artifact | <---------- |  CHECKER  |
  | fast,   |             +----------+             | slow,     |
  | cheap   |   a DIFFERENT model, DIFFERENT       | strict    |
  +---------+   instructions, fresh frame          +-----------+

   the separation IS most of the quality.
   >> fix the artifact -- never weaken the gate.

Three levers make a checker independent: different instructions ("find what's wrong, fail on uncertainty"), a different/stronger model (substitute up), and less context / a fresh frame. For high stakes, use adversarial verification — N skeptics each told to refute, each with a different lens (correctness, security, does-it-reproduce); kill the finding unless a majority fail to refute it.
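The majority rule for adversarial verification can be sketched as below. Each skeptic here is a plain function that returns True when it manages to refute the finding; in practice each would be a separate model call with its own lens, which this sketch does not attempt. All names are invented for illustration.

```python
def survives_review(finding, skeptics):
    """Keep a finding only if a strict majority of skeptics fail to refute it."""
    failed_to_refute = sum(1 for refutes in skeptics if not refutes(finding))
    return failed_to_refute > len(skeptics) / 2

# Hypothetical skeptics, one per lens. Each returns True if it refutes the finding.
def correctness_lens(finding):
    return not finding["tests_pass"]

def security_lens(finding):
    return finding["touches_secrets"]

def reproduce_lens(finding):
    return not finding["reproduced"]

skeptics = [correctness_lens, security_lens, reproduce_lens]
finding = {"tests_pass": True, "touches_secrets": False, "reproduced": False}
print(survives_review(finding, skeptics))   # one refutation out of three
```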

The line to remember

"Done" is a claim, not a proof. The verifier raises the floor; it does not remove the human gate. Human review of shipped work stays in the loop no matter how good the verifier gets.

2.8 The named agent patterns — a menu

Read this and you can skip the rest of 2.8

The meta-rules beat any single pattern: simplest-first · harness beats model (~70/30) · single-agent is the default (multi-agent costs ~15×) · the real multiplier is explicit state, not model size · tight generate→verify coupling.

The single-agent reasoning patterns, one line each — and the rule is match the pattern to whether a real verifier exists:

  • ReAct — reason→act→observe→repeat; the baseline (1× cost).
  • Reflexion — turn a pass/fail into a failure narrative, feed it to the next try. For retryable tasks with a verifier.
  • Self-Refine — generate → self-critique "top 3 problems" → regenerate. ~+20% on prose with no objective answer.
  • Self-Consistency — sample N answers, take the majority. For extraction with a unique right answer.
  • Plan-and-Execute — plan the whole flow upfront, then execute. ~−30% tokens; for well-defined pipelines.

The rule almost everyone converges on

Decouple inference from execution: the LLM proposes a structured action {tool, args, reason}; a separate validated layer checks it against an allowlist and runs it. Never pass credentials, DB handles, or write-permissions into the model's context. (This returns as a safety law in §2.11 and as the Governor in Part 4.)
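A minimal sketch of that validated layer, assuming the model's proposal arrives as a dictionary with the three fields named above. The tool names and the shape of `ALLOWED_TOOLS` are hypothetical.

```python
# Hypothetical allowlist: tool name -> the argument names that tool accepts.
ALLOWED_TOOLS = {
    "search_docs": {"query"},
    "read_file": {"path"},
}

def validate_action(action):
    """Check a model-proposed action before any code is allowed to run it."""
    if set(action) != {"tool", "args", "reason"}:
        raise ValueError("action must have exactly: tool, args, reason")
    tool = action["tool"]
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not on the allowlist: {tool}")
    unexpected = set(action["args"]) - ALLOWED_TOOLS[tool]
    if unexpected:
        raise ValueError(f"unexpected arguments for {tool}: {sorted(unexpected)}")
    return tool, action["args"]        # only now does the execution layer take over

proposal = {"tool": "read_file", "args": {"path": "notes.txt"}, "reason": "need the spec"}
print(validate_action(proposal))
```

The model only ever produces the dictionary. The credentials live with whatever code runs after `validate_action` returns, never in the prompt.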

2.9 Evals + LLM-as-judge — "evals are the moat"

"Done" must be measured, cheapest-and-hardest-to-fool first:

  1. Deterministic checks — grade the side effects / execution trace, not the narrative. "Grade the trace, not the vibe." "The site looks good" is not an eval; "eye-gate ran AND returned green before deploy" is.
  2. Reference-guided grading — compare to a known-good answer.
  3. LLM-as-judge — an LLM scoring a rubric. Only when 1 & 2 can't capture the signal (tone, helpfulness). Biased in known ways.

| Judge bias | What it does | Fix |
|---|---|---|
| Position | prefers the answer in a given slot | run both orders, accept only consistent verdicts |
| Verbosity | scores longer answers higher | penalise length; score conciseness separately |
| Self-preference | rates its own family higher | judge with a different model family |

The calibration gate (a hard rule)

Before you trust any LLM judge: hand-grade ~30 examples, run the judge on the same 30, measure agreement, target >75%. Below that, fix the rubric or swap the model before deploying. A judge you haven't calibrated is not a gate.
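The calibration gate is simple arithmetic. A minimal sketch, with made-up labels; the function names are invented here, and raw agreement is the measure the rule above names (a stricter statistic is possible but not assumed).

```python
def agreement_rate(human_labels, judge_labels):
    """Share of examples where the judge's verdict matches the hand grade."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human_labels, judge_labels))
    return matches / len(human_labels)

def judge_is_calibrated(human_labels, judge_labels, threshold=0.75):
    """Apply the gate: trust the judge only above the agreement threshold."""
    return agreement_rate(human_labels, judge_labels) > threshold

# Hypothetical run: 30 hand-graded examples, the judge disagrees on 6 of them.
human = ["pass"] * 30
judge = ["pass"] * 24 + ["fail"] * 6
print(agreement_rate(human, judge), judge_is_calibrated(human, judge))
```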

2.10 Governed state — the agent proposes, a gate executes

The most transferable thing Palantir does: never let an agent reason over raw data and change it directly. Force every change through one governed action that validates, permissions, logs, and can stage before commit.

the mutation-gatekeeper — one endpoint, every write
   the agent PROPOSES a change -- then it must pass, in order:

     1  AUTHZ      permitted?   (deny by default)
     2  VALIDATE   does it fit the schema + the rules?
     3  CLASSIFY   auto-run  /  stage for a human  /  forbidden
     4  EXECUTE    run it now, OR stage for approval, OR reject
     5  LOG        who . what . why . when   (immutable)

   the prompt NEVER holds credentials or DB handles.
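The five steps above can be sketched as one function that every write must pass through. Everything named here is hypothetical: the policy tables, the `lead_id` schema rule, and the in-memory list standing in for an immutable log. It is a sketch of the shape, not of Palantir's or anyone else's implementation.

```python
from datetime import datetime, timezone

# Hypothetical policy tables; real ones would live in config, never in the prompt.
PERMISSIONS = {"agent-7": {"update_lead_status", "send_invoice", "delete_customer"}}
AUTO_RUN = {"update_lead_status"}      # low-risk: run immediately
FORBIDDEN = {"delete_customer"}        # never allowed, whoever asks
AUDIT_LOG = []                         # stand-in for an append-only store

def gatekeeper(actor, action, payload, why):
    """Pass one proposed change through authz, validate, classify, execute, log."""
    if action not in PERMISSIONS.get(actor, set()):
        outcome = "rejected: not permitted"          # 1 AUTHZ, deny by default
    elif not isinstance(payload.get("lead_id"), int):
        outcome = "rejected: invalid payload"        # 2 VALIDATE against the schema
    elif action in FORBIDDEN:
        outcome = "rejected: forbidden"              # 3 CLASSIFY: forbidden
    elif action in AUTO_RUN:
        outcome = "executed"                         # 4 EXECUTE (a real gate calls the handler here)
    else:
        outcome = "staged for human approval"        # 4 stage instead of running
    AUDIT_LOG.append({                               # 5 LOG: who, what, why, when
        "who": actor, "what": action, "why": why,
        "when": datetime.now(timezone.utc).isoformat(), "outcome": outcome,
    })
    return outcome

print(gatekeeper("agent-7", "update_lead_status", {"lead_id": 12}, "deal closed"))
print(gatekeeper("agent-7", "send_invoice", {"lead_id": 12}, "deal closed"))
```

Rejections are logged too. A log that only records successes cannot tell you what the agent tried to do.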

The bottleneck is retrieval quality, not window size

Under ~25% of naively-injected "memory" is relevant to a query, so retrieve, don't preload. And at our scale, do not build a knowledge graph or vector store — grep + file structure is the index. The fanciest tool is usually the wrong one.

2.11 Agent security — the lethal trifecta

The whole section in one line

The fix is topology, not detection — "95% caught is a failing grade in security." Never let one agent hold all three of: private data + untrusted content + the ability to send outward.
   PRIVATE DATA            UNTRUSTED CONTENT
    (the CRM)              (an inbound email, a scraped page)
         \                       /
          \                     /
           v        DANGER      v
          +----------------------+
          |  all three in ONE    |  <-- a prompt injection in the
          |  context             |      untrusted content reads the
          +----------------------+      private data and ships it out
                     ^
                     |
            ABILITY TO SEND OUTWARD

You can't reliably detect every prompt injection, so you make the dangerous combination structurally impossible: split the capabilities across agents so no single context has the full trifecta. Around that, defence in depth — input rail (screen content), execution rail (gate every tool call), output rail (scan for secrets) — and a red-team gate: if any attack succeeds, the design fails; fix the topology, don't add a filter.
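Because the fix is structural, it can be checked before anything runs. A minimal sketch of such a design check, assuming each agent's capabilities are declared as a set of labels; the labels and agent names are invented for illustration.

```python
TRIFECTA = {"private_data", "untrusted_content", "send_outward"}

def trifecta_violations(agents):
    """Return the names of agents whose capabilities include all three legs."""
    return sorted(name for name, caps in agents.items() if TRIFECTA <= set(caps))

# Hypothetical design: each leg lives in a different agent.
split_design = {
    "inbox_reader": {"untrusted_content"},
    "crm_analyst": {"private_data"},
    "sender": {"send_outward"},
}
# The same capabilities collapsed into one context.
risky_design = {"do_everything": {"private_data", "untrusted_content", "send_outward"}}

print(trifecta_violations(split_design), trifecta_violations(risky_design))
```

A check like this belongs in the build, so a design that reunites the three legs fails before it ships.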

2.12 Reliability & governance — surviving production

The decision to make first

Can you draw the flowchart before the LLM runs? YES → build a workflow (fixed, cheap, auditable). NO → an agent, constrained inside a bounded role. A pure agent at 99%/step fails ~10% over 10 steps; a hybrid compounds gracefully.
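The 99%-per-step figure is plain compounding, assuming each step succeeds or fails independently of the others. A minimal sketch of the arithmetic:

```python
def chain_success(per_step_success, steps):
    """Chance that every step in a chain succeeds, if steps are independent."""
    return per_step_success ** steps

failure_over_10 = 1 - chain_success(0.99, 10)
failure_over_50 = 1 - chain_success(0.99, 50)
print(f"{failure_over_10:.1%} over 10 steps, {failure_over_50:.1%} over 50 steps")
```

The failure rate keeps climbing with chain length, which is the case for keeping agents small and putting fixed workflow steps wherever the flowchart is known.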

Load-bearing principles: treat all LLM output as schema-validated data before execution; own your context window (a budget, not a backpack — the "dumb zone" is the least-attended middle, so retrieve, don't preload); small focused agents (<~25 steps); durable execution (append-only event history + memoization, so a crash replays without re-paying for completed LLM calls); wrap every external call with idempotency, backoff+jitter, and a circuit breaker.
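Of the wrappers named above, backoff with jitter is the easiest to show. A minimal sketch, assuming the wrapped call is safe to repeat (that is the idempotency requirement, which this code does not enforce) and leaving the circuit breaker out; `call_with_retries` is a name invented here.

```python
import random
import time

def call_with_retries(fn, max_attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Retry a failing call, waiting a random time that grows with each attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                                    # out of tries: surface the error
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))   # exponential backoff
            sleep(random.uniform(0, ceiling))            # jitter: spread retries out

# Stand-in for a flaky external call: fails twice, then succeeds.
attempts = {"count": 0}
def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(call_with_retries(flaky_call, sleep=lambda seconds: None))   # skip real waiting in the demo
```

The random wait matters when many agents hit the same service: without it they all retry at the same instant and fail together again.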

Why human-in-the-loop must be architectural

For high-consequence actions, the agent is built incapable of executing — it can only recommend; a human holds the execution token. A capability boundary enforced in code, not a prompt reminder. (And approval is only real if it's not a rubber-stamp: the human must see the reasoning and impact, logged with attribution.)

The flagship proof it's worth it

Karpathy's autoresearch — a ~630-line loop — ran ~50 ML experiments overnight on one GPU; on a longer run it found 20 genuine improvements, an ~11% speedup on already-optimised code. A real verifier turned overnight repetition into progress with the human asleep. The leverage lived in the loop, not a clever prompt.

Part 3

Machine learning, plain

Part 2 was how you build with AI. This is how the prediction underneath a machine works — the maths that turns data into a number you can bet on. The heaviest part; take it slowly.

Two different tools — don't conflate them

An LLM (Part 1) predicts text. A machine-learning model here predicts an outcome (will this lead buy?) from columns of data. A real machine often uses both.

3.1 The one sentence

   MEASURE  -->  PREDICT  -->  DECIDE  -->  ACT  -->  LEARN
   ...then it loops: each LEARN result sharpens the next MEASURE.

The idea people don't expect

The schema is the moat; the model is secondary. A plain model on data you own — every lead tied to whether it won/lost, why, over years — beats a clever model on data you don't. A competitor copies your algorithm in an afternoon; they cannot copy two years of outcome-linked history.

3.2 What ML is — and the three kinds

Machine learning is learning a rule from examples instead of being told the rule. You don't write "if the lead opened 3 emails and lives in zip X, call them." You show it thousands of past leads plus what happened, and it finds the pattern that best predicts the outcome — then applies it to a new lead it's never seen.

Under the hood, almost every model here is one optimisation problem: pick the settings (the parameters) that make predictions least wrong, with a small penalty for being too complicated. The three kinds:

| Kind | What it does | Example here |
|---|---|---|
| Supervised | you have the answers (labels) | "here are leads, here's who bought" — most of prediction |
| Unsupervised | no labels; finds structure itself | grouping customers; compressing many columns to a few |
| Reinforcement | acts, sees a reward, improves by trial | the decide/act layer (§3.8) |

3.3 The core maths, made simple

Six load-bearing ideas, each with an analogy.

Loss
the score for how wrong a prediction is; training = making it as small as possible. Analogy: loss is your golf score. Lower is better; every model just tries to shoot a lower round on the examples it's shown.
Gradient descent / SGD
with no neat formula, the machine feels its way downhill on the error surface — steepest-down, small step, repeat. SGD peeks at one random example (or a small random batch) per step, which is what makes huge-data training possible. Analogy: blindfolded on a hillside, feeling for the valley one foot at a time.
Cross-validation
split into ~5 chunks, train on 4, test on the 5th, rotate, average — a stable estimate of new-data performance without wasting data. Trap: anything that learns from the data must happen inside each fold, or the test leaks in.
Regularisation · ridge · lasso · λ
the overfitting dial — penalise complexity, push parameters toward zero. Ridge shrinks all; lasso pushes some to exactly zero (auto feature-selection). The strength is λ, tuned by CV. Analogy: lasso drops the weakest team members entirely; ridge trims everyone's expenses but keeps the team.
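Loss, gradient descent and the ridge penalty fit in one small example. This is a minimal sketch on made-up data: it fits a single slope `w` in `y ≈ w * x`, uses every example on each step (plain gradient descent, not SGD), and fixes `lam` by hand where real work would tune it by cross-validation.

```python
def fit_slope(xs, ys, lam=0.0, learning_rate=0.01, steps=2000):
    """Find w minimising mean((w*x - y)^2) + lam * w^2 by walking downhill."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Slope of the loss at the current w: the error term plus the ridge penalty term.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * lam * w
        w -= learning_rate * gradient          # small step in the downhill direction
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                      # made-up data where y is exactly 2 * x

plain = fit_slope(xs, ys)                      # no penalty
ridge = fit_slope(xs, ys, lam=1.0)             # penalty pulls w toward zero
print(round(plain, 4), round(ridge, 4))
```

With no penalty the slope lands on the value that fits the data best. With the penalty it settles lower, trading a little fit for a simpler model.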

Overfitting + bias-variance — the silent killer

A model is graded on test error (fresh data it didn't train on), never training error. Overfitting is memorising the training examples including their noise — brilliant on training, falls apart on new data. You cannot see overfitting by looking at the training set. This is the mechanism behind "the lead score that looked amazing offline and collapsed live."

test error vs model flexibility
   TEST ERROR
     high  |\                                    /
           | \                                  /
           |  \__                            __/
           |     \__                      __/
           |        \___              ___/
           |           \____      ____/
     low   |               \______/    <-- sweet spot
           +----------------------------------->  model flexibility
            too simple                 too flexible
            (high BIAS / underfit)     (high VARIANCE / overfit)

Analogy: a student who memorises last year's exact exam aces the practice and fails the real test. Overfitting is memorising instead of understanding.

3.4 Features — where the work & the bugs live

A feature is not raw data — it's the concept of a signal, computed per entity. "Clicked at 3:02pm" is raw; "product pages viewed in the last 30 days" is a feature. ~60% of real ML work is here, and so are the two deadly bugs — both invisible to every offline metric:

The two silent killers

  • Target leakage — a feature that secretly encodes the answer: a value that only exists because the outcome already happened (a bank's "call duration" — you don't know it until after you've called). Near-perfect offline, collapses live. The test: "Would this value exist, unchanged, the instant a new lead arrives, before any human worked it?"
  • Train/serve skew — the same feature computed one way offline and a different way live (a rounding diff, a timezone). "The most expensive feature bug because it's silent." Fix: define each feature once (a feature store in principle — one shared function both train and serve call; you do not need the platform at this scale).

Why this is the asset, not the model

A model swap is cheap; a leakage-safe feature library built from a client's own years of events is what a competitor can't reproduce. The mechanism that structurally prevents leakage: point-in-time correctness — every feature frozen as of the decision timestamp.
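Point-in-time correctness and "define each feature once" can be shown together. A minimal sketch, assuming events are dictionaries with `lead_id`, `type` and `at` fields (a made-up shape): one function, called by both training and live scoring, that refuses to look past the decision timestamp.

```python
from datetime import datetime, timedelta

def pages_viewed_last_30_days(events, lead_id, as_of):
    """Count product-page views in the 30 days before `as_of`, never after it."""
    window_start = as_of - timedelta(days=30)
    return sum(
        1
        for event in events
        if event["lead_id"] == lead_id
        and event["type"] == "product_page_view"
        and window_start <= event["at"] < as_of      # nothing from the future leaks in
    )

decision_time = datetime(2025, 6, 1)
events = [
    {"lead_id": 1, "type": "product_page_view", "at": datetime(2025, 5, 20)},  # counts
    {"lead_id": 1, "type": "product_page_view", "at": datetime(2025, 6, 5)},   # after the decision
    {"lead_id": 1, "type": "product_page_view", "at": datetime(2025, 4, 1)},   # too old
    {"lead_id": 2, "type": "product_page_view", "at": datetime(2025, 5, 25)},  # another lead
]
print(pages_viewed_last_30_days(events, 1, decision_time))
```

Training passes the historical decision time as `as_of`; live scoring passes the current time. Same function, so the two cannot drift apart.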

3.5 The three tests a score must pass

A single number ("AUC 0.82") hides which property you have. A score must pass three genuinely different tests before you bet a dollar:

| Test | The question | Catches |
|---|---|---|
| Rank | do good leads sort above bad? (lift / AUC) | a great ranker whose probabilities are nonsense |
| Calibrate | does "0.8" actually mean 80%? | a well-ranked score you can't do money-maths on |
| Validate forward | will it survive going live? (no time-leak) | a number inflated by leaking the future |

Accuracy is the wrong metric under imbalance: if 5% of leads buy, "say no to everyone" is 95% accurate and useless — so rank with AUC, not accuracy. Calibration is the one that matters for money: among leads scored ~0.8, ~80% really buy — fix miscalibration by fitting a calibrator (Platt / isotonic) on held-out data. And random cross-validation leaks the future on time-ordered data, so validate forward in time.
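Two of those claims can be checked by hand. A minimal sketch on made-up data: the first part reproduces the "say no to everyone" accuracy, the second asks what share of leads scored near 0.8 actually bought. The function names are invented here.

```python
def accuracy(actual, predicted):
    """Share of predictions that match what actually happened."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def observed_rate_near(scores, bought, target, width=0.05):
    """Among leads scored within `width` of `target`, the share that really bought."""
    picked = [b for s, b in zip(scores, bought) if abs(s - target) <= width]
    return sum(picked) / len(picked) if picked else None

# 100 hypothetical leads, 5 of whom buy. Predict "no" for everyone.
actual = [1] * 5 + [0] * 95
say_no_to_everyone = [0] * 100
print(accuracy(actual, say_no_to_everyone))     # high accuracy, zero buyers found

# Ten hypothetical leads all scored 0.8, of whom eight bought.
scores = [0.8] * 10
bought = [1] * 8 + [0] * 2
print(observed_rate_near(scores, bought, target=0.8))
```

If the second number came back far from 0.8, the ranking might still be fine, but the score could not be used as a probability in a revenue calculation.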

Twyman's Law

"Any figure that looks too good is probably wrong." A near-perfect AUC is a leakage alarm to investigate, not a win to report. First reaction to a 0.99 should be suspicion, not celebration.

3.6 Lead scoring is triage

Lead scoring is risk stratification: split leads by probability of buying so you spend human effort on the high group. The deliverable isn't a yes/no — it's an ordering (a ranked call list) plus an action cutoff. The backbone: logistic regression first (readable, defensible, debuggable), trees only when a held-out test shows the fancier model actually wins — "not because it's fancier."
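
A small Python sketch of that deliverable: a logistic-regression score turned into a ranked call list with an action cutoff. The weights, feature names, and leads below are invented for illustration; in practice the weights would come from fitting on held-out-validated data.

```python
import math

# Hypothetical fitted weights. Each one reads directly: a quote request
# moves the score far more than one extra page view.
INTERCEPT = -3.0
WEIGHTS = {"pages_viewed_30d": 0.4, "requested_quote": 2.0}

def score(lead):
    """Logistic regression: a weighted sum pushed through a sigmoid."""
    z = INTERCEPT + sum(w * lead[name] for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))

def call_list(leads, cutoff):
    """Order leads by score (highest first) and keep those at or above the cutoff."""
    ranked = sorted(leads, key=score, reverse=True)
    return [(lead["id"], round(score(lead), 2)) for lead in ranked if score(lead) >= cutoff]

LEADS = [
    {"id": "A", "pages_viewed_30d": 1, "requested_quote": 0},
    {"id": "B", "pages_viewed_30d": 5, "requested_quote": 1},
    {"id": "C", "pages_viewed_30d": 3, "requested_quote": 1},
    {"id": "D", "pages_viewed_30d": 6, "requested_quote": 0},
]

todays_calls = call_list(LEADS, cutoff=0.5)   # B first, then C; D and A fall below
```

The cutoff is a business decision (how many calls the team can make), not something the model decides.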

3.7 Uplift — don't waste a touch on a sure thing

Propensity answers "who is likely to convert?" Uplift answers a better question: "whose conversion does our touch actually cause?" These can be different people: a lead with a high propensity may buy with or without the touch. Analogy: propensity asks "will he buy?"; uplift asks "did our touch change his mind?" Only the second tells you where the budget actually worked.

Segment | Touched → | Not touched → | Verdict
Persuadables | converts | doesn't | the only profitable target
Sure Things | converts | converts anyway | wasted (falsely booked as a win)
Lost Causes | doesn't | doesn't | wasted (no effect)
Sleeping Dogs | doesn't | would have | negative: the touch suppresses a sale

The honest ceiling

Uplift needs a randomised (or quasi-experimental) split — a random control that got no touch. On plain observational data the model confuses effect with selection and returns confident, wrong scores. "A propensity model in a causal costume is worse than an honest propensity model."
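
With a random control in hand, the simplest uplift estimate is a subtraction: the conversion rate of touched leads minus the rate of untouched leads, per segment. The Python sketch below uses made-up segments and counts, and it assumes the touched/untouched split was assigned at random and that every segment has both groups. It is a segment-level illustration, not a full uplift model.

```python
def conversion_rate(rows):
    """Share of rows that converted."""
    return sum(r["converted"] for r in rows) / len(rows)

def uplift_by_segment(rows):
    """Uplift = conversion rate when touched minus rate in the random control."""
    result = {}
    for segment in sorted({r["segment"] for r in rows}):
        touched = [r for r in rows if r["segment"] == segment and r["touched"]]
        control = [r for r in rows if r["segment"] == segment and not r["touched"]]
        result[segment] = conversion_rate(touched) - conversion_rate(control)
    return result

def make_rows(segment, touched, total, conversions):
    """Helper that builds fake leads: `conversions` of `total` converted."""
    return [
        {"segment": segment, "touched": touched, "converted": int(i < conversions)}
        for i in range(total)
    ]

# Made-up experiment results, 100 leads per cell.
ROWS = (
    make_rows("new_visitors", True, 100, 30) + make_rows("new_visitors", False, 100, 10)
    + make_rows("loyal", True, 100, 80) + make_rows("loyal", False, 100, 80)
    + make_rows("recently_churned", True, 100, 5) + make_rows("recently_churned", False, 100, 15)
)

uplift = uplift_by_segment(ROWS)
# new_visitors: about +0.20 (persuadables)
# loyal: 0.0 (sure things: highest conversion, zero effect)
# recently_churned: about -0.10 (sleeping dogs)
```

Note that a propensity model would rank the "loyal" segment first, since 80% convert, even though the touch changes nothing there.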

3.8 Bandits vs full RL

To act under uncertainty and learn, the whole field is one trade-off: explore (try something worse, to learn) vs exploit (use the current best). Climb the ladder — don't start at the top:

Rung | Use when | Answers
A/B test | few fixed options, you can wait | "which one is best on average?"
Multi-armed bandit | stop wasting traffic on losers while learning | "how do I earn while I learn?"
Contextual bandit | best action depends on this lead's features | "which action for this lead?"
Full RL (MDP) | today's action changes the future state | "which sequence maximises long-run reward?"

The honest line

"Most marketing decisions are a bandit, not full RL, and many are not even a bandit yet, just an A/B test nobody ran." Reach for full RL only when you can name the state transition your action causes.
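
To make the multi-armed bandit rung concrete, here is a minimal Thompson-sampling sketch in Python. Thompson sampling is one common way to run a bandit, picked here for illustration. The class name, the two options, and their "true" conversion rates are invented; the simulated world stands in for real traffic.

```python
import random

class ThompsonBandit:
    """Beta-Bernoulli Thompson sampling over a fixed menu of options."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior per arm, stored as [successes + 1, failures + 1].
        self.counts = {arm: [1, 1] for arm in arms}

    def choose(self):
        # Draw a plausible conversion rate for each arm; play the highest draw.
        # Uncertain arms sometimes draw high (explore); proven arms usually win (exploit).
        draws = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.counts.items()}
        return max(draws, key=draws.get)

    def update(self, arm, converted):
        self.counts[arm][0 if converted else 1] += 1

# Simulated world with made-up conversion rates; "B" is truly better.
TRUE_RATES = {"A": 0.05, "B": 0.20}
world = random.Random(1)
bandit = ThompsonBandit(TRUE_RATES, seed=0)
pulls = {arm: 0 for arm in TRUE_RATES}

for _ in range(3000):
    arm = bandit.choose()
    converted = world.random() < TRUE_RATES[arm]
    bandit.update(arm, converted)
    pulls[arm] += 1
```

After the run, most traffic has gone to "B". An A/B test would have kept sending half the traffic to "A" until the test ended, which is the "earn while I learn" difference.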

3.9 Self-iterating machines — a loop is only as good as its frozen scorer

A self-iterating machine improves itself: propose a change → run it → score it against a frozen metric → keep if it wins, discard if not → repeat.

The single most important lesson

The scorer is the heart, frozen and walled off from the proposer. A loop whose proposer can edit or see through its scorer is grading-your-own-homework on autopilot — it climbs the number while the real objective rots. The proposer is the easy, commoditised part; the honest scorer is the whole job.
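
The shape of such a loop can be shown as a toy in Python. Everything here is hypothetical: the "change" is just a number, and the frozen scorer is a stand-in for a real, slow, honest measurement. The structural point is that `propose` never receives the scorer, so it cannot edit it or see through it.

```python
import random

def make_frozen_scorer(target):
    """Build the scorer once; the proposer is never handed its internals."""
    def scorer(candidate):
        return -abs(candidate - target)   # higher is better, 0 is perfect
    return scorer

def propose(current, rng):
    """The proposer sees only the current candidate, never the scorer."""
    return current + rng.choice([-1, 1])

def run_loop(start, scorer, steps, seed=0):
    rng = random.Random(seed)
    best, best_score = start, scorer(start)
    history = [best_score]
    for _ in range(steps):
        candidate = propose(best, rng)        # propose a change
        candidate_score = scorer(candidate)   # run it and score it
        if candidate_score > best_score:      # keep only if it wins
            best, best_score = candidate, candidate_score
        history.append(best_score)            # otherwise discard and repeat
    return best, history

scorer = make_frozen_scorer(target=7)
best, history = run_loop(start=0, scorer=scorer, steps=200)
```

Because a change is kept only when the frozen score improves, `history` never goes down. If the proposer could rewrite `scorer`, the same loop would report progress while measuring nothing.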

Why our engine runs slow on purpose

Open-source research loops have a cheap, fast verifier (~100 experiments a night). Ours is the opposite — a booked sale under a randomised holdout is slow, noisy, confounded, and costly. So our cadence is set by how fast we can honestly measure: "a few honest experiments a quarter, not a hundred a night, and that is the correct, not the broken, speed." We are verifier-bound. Swap the slow honest scorer for a fast proxy (clicks instead of revenue) and you guarantee Goodhart drift — perfect optimisation of the wrong thing. The operating model is build-to-HOLD: the winner is held for a human read in the morning; nothing self-promotes to a client.

3.10 Fairness — dropping protected attributes does NOT make a model fair

"We don't collect race or sex, so we're fine" is exactly the move that fails — and fails worse the smarter the model. An AI denied the protected attribute is structurally driven to rebuild it from proxies (a zip code, the shows you watch). You often can't tell by reading the feature list. So ask: (1) can the model reconstruct the protected attribute from what's left? (2) does each feature earn its place for a reason other than the disparate impact it produces? — answerable only by measuring against the protected attribute.

The audit paradox — resolved by topology

To audit for proxy discrimination you need the protected attribute; minimisation says hold as little as possible. Resolve it: collect protected-class data only into a walled-off, access-controlled fairness-audit pipeline — the audit sees it; the scoring model never does. (Doctrine, not legal advice — counsel reviews before any live regulated model.)
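
One way to picture that topology in Python: the scoring side produces decisions keyed by opaque IDs, and only the audit function is ever given the protected attribute. The data, key names, and the `impact_ratio` measure are illustrative assumptions; what threshold counts as a problem is a policy and legal question this sketch does not answer.

```python
def selection_rates(decisions, audit_attributes):
    """Share of each group that the scoring model selected.

    `decisions` comes from the scoring side and holds only opaque keys.
    `audit_attributes` lives in the walled-off audit store; the scoring
    model never receives it.
    """
    totals, selected = {}, {}
    for key, was_selected in decisions.items():
        group = audit_attributes[key]
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + int(was_selected)
    return {group: selected[group] / totals[group] for group in totals}

def impact_ratio(rates):
    """Lowest group selection rate divided by the highest (1.0 means parity)."""
    return min(rates.values()) / max(rates.values())

# Made-up output of the scoring model: opaque key -> selected for a call?
DECISIONS = {
    "k1": True, "k2": True, "k3": True, "k4": True, "k5": False,
    "k6": True, "k7": True, "k8": False, "k9": False, "k10": False,
}

# Made-up audit store, held separately under access control.
AUDIT_ATTRIBUTES = {
    "k1": "x", "k2": "x", "k3": "x", "k4": "x", "k5": "x",
    "k6": "y", "k7": "y", "k8": "y", "k9": "y", "k10": "y",
}

rates = selection_rates(DECISIONS, AUDIT_ATTRIBUTES)   # x: 0.8, y: 0.4
ratio = impact_ratio(rates)                            # 0.5
```

The model never saw the attribute, yet group "y" is selected half as often. That gap is only visible because the audit pipeline can measure against the attribute.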

3.11 The honest ceilings — memorise these

What separates a product from a demo

  1. Offline metrics are not live improvement: only a randomised holdout proves the live policy helps.
  2. Volume gates are real: below ~100 conversions/segment you ship a ranking aid, not a trusted probability (rules of thumb, not law).
  3. Calibrate before you act.
  4. Dropping protected attributes doesn't make a model fair.

Part 4

The Paper St system

The payoff: the whole system, drawn so you can keep designing it. Parts 2 and 3 were the two halves — building with AI (loops) and the prediction engine (machine learning). Here's how they compose into a company.

4.1 The one-line thesis

You use Claude Code to build little "machines" that make a business money — a website, a lead system, an email flow. Every time you build one, you write down what worked into a private library. The more machines, the bigger and smarter that library. You group the machines by industry under a parent company, and eventually the whole thing becomes a product anyone can use.

The thesis in one line

Build money-making machines with Claude Code · learn from every one into a proprietary database of systems · compound that database across a holding company of niche brands · into an AI product that eventually lets any business build its own machine.

The pattern has a precedent: the consulting-firm-to-product move — engineers run the playbook by hand for each client, then the firm turns the playbook into a product trained on everything those engagements taught it.

4.2 The three nested closed loops

A closed loop captures its own results and uses them to improve itself — like a thermostat reading the room. The system stacks three, at three sizes:

why it compounds
+- LOOP 3 . HOLDCO ----------------------------------------+
| pools which PATTERNS make money across ALL industries    |
| +- LOOP 2 . NICHE-CO ----------------------------------+ |
| | pools what worked across ALL clients in one industry | |
| | +- LOOP 1 . CLIENT ------------------------------+   | |
| | | one machine learns from one client's customers |   | |
| | +------------------------------------------------+   | |
| +------------------------------------------------------+ |
+----------------------------------------------------------+

        "More systems = more real-world data = better systems."

Analogy: three Russian dolls. The smallest (one client) sits inside the middle (the industry brand), inside the biggest (the HoldCo). Each outer doll learns from everything every inner doll discovered.

4.3 The moat = Loop 3

A moat is what keeps competitors out. The moat is not any one website — those get copied in a weekend. It's Loop 3: the library of what actually made real businesses money, across many industries, that no competitor can reproduce because no competitor ran those engagements.

Why "learning flywheel," not "network effect," for now

At a handful of clients, what exists is a learning flywheel (a loop that keeps improving), not yet a true data network effect — most "data moats" are weak scale-effects that plateau fast. The honest rule: call it a flywheel until three conditions hold — capture is automatic, the loop visibly closes per client, and opt-in reciprocity is in the contract. This honesty is good positioning — show it, don't hide it.

4.4 The holding company + the three-tier offer

A holding company is a parent that owns other companies but doesn't itself sell to customers. It owns the niche companies (one per industry), the proprietary systems database (the shared moat), and the self-serve product. The same core offer sells at three altitudes:

Tier | What it is | Touch / price
Bespoke | we build your machine for you | highest touch, highest price
Niche product | a productised machine for one industry | mid
Self-serve | Claude + the database + governance; you build your own | lowest touch, lowest price

Analogy: same coffee three ways — a barista makes it (bespoke), a branded pod machine for your office (niche product), or a bag of beans + instructions (self-serve). Same beans (the corpus) under all three.

4.5 The deal, in plain language

For the moat to work, every engagement makes two things true at once:

The honest split

  • The client owns all their raw customer data, accounts, domains, leads, and the machine built in their name — from day one, never held hostage.
  • Paper St learns only from anonymised, aggregated patterns (what worked, with identifying info stripped) — never sells your data, never shares your customers.

The line: the client owns everything they touch; Paper St owns the meta — the pattern of what makes systems win.

4.6 What paperst.ai actually is

The honest framing

paperst.ai is a product built on Claude (via the API), wrapped in a governance layer that holds the model to set parameters, and powered by the proprietary database. The engine is Claude; the brand and the moat are the database and the governance, not the model: a competitor can rent the same model, but can't copy the corpus.

4.7 The two halves that build each machine

Two distinct systems do the work, mapping exactly onto Parts 2 and 3:

loops/ The Build Loop — how anything gets made (Part 2)

A fixed recipe every system goes through so quality is repeatable, not luck:

SCOPE -> revise -> PLAN -> revise -> BLUEPRINT -> stress-test -> BUILD -> deep audit

Each stage saves the assumptions it rested on, so if something breaks later you trace it to the exact thinking error and fix the recipe, not just the symptom. (Use the strongest model on the hardest judgment — "frontier-bound" work — and design routine "cadence-bound" work so a cheaper model can do it later.)

engine/ The Refinery — how machines teach each other (Part 3)

The system that makes Loop 3 automatic: every running machine sends its anonymised results up → the Refinery ingests, validates, de-identifies, sorts, and digests them → it proposes improvements back down.

The critical rule — "proposals, never pushes"

The Refinery never changes a live system on its own. It only proposes; a separate, gated, attributed step applies it. No exception, including trivial. This is build-to-HOLD, encoded into the architecture.
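
A rough Python sketch of how "proposals, never pushes" could be encoded structurally. The class names, the `row_0412` ledger reference, and the config fields are hypothetical, not the actual Refinery design. The idea it shows: the proposing code only returns a data object, and the one function that can change a live system refuses anything without a named approver.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Proposal:
    machine_key: str
    change: dict
    source_rows: list                   # ledger rows the recommendation cites
    approved_by: Optional[str] = None   # set only by the human gate

class LiveSystem:
    def __init__(self):
        self.config = {"headline": "v1"}
        self.audit_log = []

    def apply(self, proposal):
        """The only path that changes live config; unapproved proposals are held."""
        if proposal.approved_by is None:
            raise PermissionError("proposal not approved; holding")
        self.config.update(proposal.change)
        self.audit_log.append((proposal.approved_by, proposal.change))  # attributed

def refinery_propose(machine_key):
    """The Refinery returns a Proposal; it holds no reference to any LiveSystem."""
    return Proposal(machine_key, {"headline": "v2"}, source_rows=["row_0412"])

live = LiveSystem()
proposal = refinery_propose("mch_3fa9")

held = False
try:
    live.apply(proposal)            # no approver yet, so this is refused
except PermissionError:
    held = True

config_while_held = dict(live.config)

proposal.approved_by = "reviewer@example.com"   # the separate, gated step
live.apply(proposal)
```

The rule holds even for a trivial change, because there is no second code path that skips the approval check.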

4.8 The end-to-end machine

one flywheel, spinning per client
   CORPUS  -->  BRIEF  -->  RENDER  -->  GATE  -->  PUBLISH  -->  MEASURE
   ...then it loops: every result becomes a tagged ledger row that feeds
   back into the CORPUS.    ( capture -> predict -> decide -> act -> learn )

A flywheel is a loop that, once spinning, takes less and less push and gets faster as it compounds. Predict here is the corpus + Claude's judgment (no billion-dollar model needed — the tagged ledger is the substitute); act is always the gated apply step, never automatic.

Why "copy the schema, skip the corpus"

The smart move isn't to out-spend the giants — it's to copy their tagging schema and shape (generate wide → human-curate → bandit-allocate) while skipping what needs billions in spend (in-platform ad testing — the big platforms already do it free and better, so own the landing side instead). Knowing what not to build is half the design.

4.9 The learning boundary — corpus vs registry

The single most important safety design in the whole system. The database is split in two, with a structural wall between:

+- database/corpus/ ----+              +- database/registry/ -------+
| systems + outcomes    |              | machine-key -> real client |
| the MOAT              |   <-- WALL   | the identity map           |
| the ONLY thing a      |   (nothing   | NEVER enters a learning    |
| learning job may read |   crosses)   | job (a lint check fails    |
+-----------------------+              | if a config includes it)   |
                                       +----------------------------+

Why a wall instead of a promise

Even with names stripped, data where the mapping back to a person is held nearby is still "personal data" under privacy law. So you put a structural wall between the learnable patterns and the identity map, instead of trusting a model to "just not look." Keys are opaque (mch_<hex>, no meaning), and distinctive facts are banded (audience "10K–100K", geography "US West") so a record can't be re-identified by its details. Analogy: a hospital research dataset — researchers see "Patient 4471, age band 40–50, condition X" and learn; the file mapping 4471 to a real name lives in a room they never get the key to.
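
The three mechanisms in that paragraph (opaque keys, banding, and a lint check on learning-job configs) can be sketched in Python as follows. The config shape, the function names, and the bands other than "10K–100K" are assumptions made for illustration; the `database/corpus/` and `database/registry/` paths come from the diagram above.

```python
import secrets

def new_machine_key():
    """Opaque key in the mch_<hex> shape: random, so it carries no meaning."""
    return "mch_" + secrets.token_hex(8)

def band_audience(size):
    """Replace an exact audience size with a coarse band (band edges assumed)."""
    if size < 10_000:
        return "<10K"
    if size < 100_000:
        return "10K–100K"
    return "100K+"

def lint_learning_job(config):
    """Fail the check if any input of a learning job reaches into the registry."""
    bad = [path for path in config["inputs"] if path.startswith("database/registry/")]
    if bad:
        raise ValueError(f"learning job reads the registry: {bad}")
    return True

key = new_machine_key()
record = {"machine": key, "audience": band_audience(42_500), "geography": "US West"}

safe_job = {"name": "pattern_digest", "inputs": ["database/corpus/outcomes"]}
unsafe_job = {"name": "pattern_digest", "inputs": ["database/corpus/outcomes",
                                                   "database/registry/clients"]}

safe_ok = lint_learning_job(safe_job)

unsafe_blocked = False
try:
    lint_learning_job(unsafe_job)
except ValueError:
    unsafe_blocked = True
```

The lint check runs before any learning job starts, so the wall is enforced by the build, not by asking a model to behave.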

4.10 The governance layer

The governance layer is the leash — the rules and gates that hold Claude inside set parameters so it can't do something illegal, off-brand, or destructive even while working on its own.

Model routing: not every job needs the biggest brain. The first rule is "the cheapest model is no model" — anything a plain script can check never burns AI tokens (Tier 0). Cheap models sort/tag; the strongest are reserved for deep architecture and audits. When money moves, a stronger model reviews — "higher blast radius = higher-tier review."
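
As a rough illustration, the routing rule could be written as a small Python function. The tier labels other than "Tier 0" and the job fields are hypothetical; the ordering (script first, strongest model for money and deep work, cheap model otherwise) follows the paragraph above.

```python
def route(job):
    """Pick the cheapest tier that can safely do the job."""
    if job.get("script_checkable"):
        return "tier_0_script"        # a plain script; no AI tokens spent
    if job.get("moves_money") or job.get("kind") in {"architecture", "audit"}:
        return "strongest_model"      # higher blast radius, higher-tier review
    return "cheap_model"              # sorting and tagging

routes = [
    route({"kind": "format_check", "script_checkable": True}),
    route({"kind": "tagging"}),
    route({"kind": "ad_spend_change", "moves_money": True}),
    route({"kind": "audit"}),
]
```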

The Governor — the rule engine every action passes through. A few worth knowing:

  • Rule-of-Two — any job holding all three of {untrusted input, private data, outbound ability} is dangerous and must be redesigned or human-gated. (§2.11's lethal trifecta, as policy.)
  • Deterministic policy only — a guardrail that "blocks 95%" never counts; only a hard architectural cut does.
  • Proposals, never pushes — §4.7, encoded as governance.
  • Learn only from verified outcomes — "the AI reflected and learned" with no attached verified result is banned.
  • Kill switch outside the reasoning path — a stop button that doesn't depend on the AI agreeing to stop.

A consent envelope is the explicit leash: a signed, bounded permission slip that lets an agent act on its own within hard limits (spend caps, banned claim-classes, pre-approved templates, a kill switch). Analogy: a nuclear plant's control room — cheap sensors handle the routine, humans sign off on anything that moves real fuel, and there's a physical scram button wired outside the computer.
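
Two of these ideas, the Rule-of-Two and the consent envelope, reduce to deterministic checks, which a short Python sketch can show. The capability labels, class name, and fields are hypothetical. Every check returns a hard yes or no, in line with "deterministic policy only".

```python
from dataclasses import dataclass

TRIFECTA = {"untrusted_input", "private_data", "outbound_ability"}

def rule_of_two(capabilities):
    """Allow a job only if it holds at most two of the three risky capabilities."""
    return len(TRIFECTA & set(capabilities)) <= 2

@dataclass
class ConsentEnvelope:
    spend_cap: float
    approved_templates: frozenset
    spent: float = 0.0
    killed: bool = False   # flipped from outside the agent's reasoning path

    def authorise(self, template, cost):
        """Approve an action only if every hard limit holds."""
        if self.killed:
            return False
        if template not in self.approved_templates:
            return False
        if self.spent + cost > self.spend_cap:
            return False
        self.spent += cost
        return True

ok_job = rule_of_two({"untrusted_input", "outbound_ability"})
bad_job = rule_of_two({"untrusted_input", "private_data", "outbound_ability"})

envelope = ConsentEnvelope(spend_cap=100.0, approved_templates=frozenset({"intro_v1"}))
first = envelope.authorise("intro_v1", 60.0)      # inside the cap
second = envelope.authorise("intro_v1", 60.0)     # would exceed the cap
off_menu = envelope.authorise("custom_claim", 1.0)

envelope.killed = True                            # a human hits the switch
after_kill = envelope.authorise("intro_v1", 1.0)
```

The kill switch is a plain flag checked before anything else, so stopping never depends on the agent agreeing to stop.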

4.11 The whole stack on one page

Level | What it is | Status
L0 · Truth | claims & assumptions ledgers; every row human-verifiable | now
L1 · Build Loop | the standard recipe for making anything | now
L2 · Corpus + papertrail | the moat: systems + outcomes records | now
L3 · Refinery | the brain that ingests, digests, and proposes | next (research-first)
L4 · Autonomy | governed agents inside consent envelopes | later (gated)
L5 · Product | self-serve: the packaged corpus + Governor for outside tenants | future

Cross-cutting all six: the Governor (every arrow passes a gate), the Console (every job emits a heartbeat, so silent failures surface), the Fleet (all deployed machines). The papertrail doctrine — every level keeps an append-only ledger and every recommendation cites the row it came from — is the edge over vanilla Claude: the captured failures + traced reasoning are what a fresh model can't reproduce.

Honest build status

L0–L2 are live work; L3 is next; L4–L5 are later and gated. This is the designed plan, not a finished system — and honesty about what's built is itself part of the design.

4.12 How to keep designing it

When you sit down to design the next piece, route the decision through what this course gave you:

  • Part 2 problem or Part 3 problem? "Get the AI to build this reliably?" → loops. "Predict / decide well from data?" → engine.
  • Find the rung first. A one-off is L0/L1. Repeats and machine-checkable → L3/L4. Nothing goes to L5 until one manual run is reliable.
  • Name the verifier before you build. Can't say what "done" objectively is? You have an agent agreeing with itself, not a loop.
  • Decouple proposal from execution, always. The agent proposes; a separate validated layer executes and logs.
  • Keep the learning boundary sacred. Anything that learns reads the corpus, never the registry.
  • Be honest about what's built. Flywheel not network-effect; estimate not quote; designed not deployed.

A worked example — designing one new piece, out loud

Say you want a subject-line learner for the outreach machine — something that figures out which cold-email subject lines get replies. Run it through the questions instead of guessing:

  1. Part 2 or 3? Both. Deciding which line to send is a Part-3 prediction; the machinery that tries, measures, updates is a Part-2 loop.
  2. Which rung? It repeats, a machine can check it, options are a fixed menu — that's a bandit (§3.8), not full RL. Most marketing "AI" is a bandit nobody set up. Start there.
  3. The frozen verifier? The honest metric is reply / booked-call rate, not open rate — open rate is a proxy, and a self-modifying loop optimises exactly the number you give it (Goodhart). Freeze the scorer on the real outcome.
  4. Enough data? Below ~100 booked calls/segment it's a ranking aid, not a bet (§3.11).
  5. Proposal vs execution? It proposes a line; a gated step (or a consent envelope with a spend cap) sends — never the model directly.
  6. Cross the wall? Reads aggregated patterns from the corpus; never the registry. If a feature needs a lead's identity, redesign the topology — don't add a filter.

The first concrete decision falls right out: a Thompson-sampling bandit over a small set of human-approved subject lines, scored on booked calls, proposing-not-sending, reading only the corpus. You didn't need a new idea — you needed the course's questions, in order. That is the motion that keeps the system designable.
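
The questions from the worked example can also be written down as a checklist a script could run over a design spec. This Python sketch is one possible encoding, with hypothetical field names and messages; the 100-conversion figure is the rule of thumb from §3.11, not a law.

```python
PROXY_METRICS = {"open_rate", "click_rate"}
MIN_CONVERSIONS_PER_SEGMENT = 100   # rule of thumb from the text

def review_design(spec):
    """Return the list of design problems; an empty list means it passes."""
    problems = []
    if spec["scored_on"] in PROXY_METRICS:
        problems.append("scorer is a proxy; freeze it on the real outcome")
    if spec["sends_directly"]:
        problems.append("model sends directly; route through a gated step")
    if any(path.startswith("database/registry/") for path in spec["reads"]):
        problems.append("reads the registry; redesign the topology")
    if not spec["options_human_approved"]:
        problems.append("options are not a human-approved menu")
    return problems

def trust_level(booked_calls_in_segment):
    """Below the volume gate the output is a ranking aid, not a bet."""
    if booked_calls_in_segment < MIN_CONVERSIONS_PER_SEGMENT:
        return "ranking aid"
    return "candidate for a trusted probability"

good_spec = {
    "scored_on": "booked_calls",
    "sends_directly": False,
    "reads": ["database/corpus/subject_lines"],
    "options_human_approved": True,
}
tempting_spec = {
    "scored_on": "open_rate",
    "sends_directly": True,
    "reads": ["database/corpus/subject_lines", "database/registry/clients"],
    "options_human_approved": True,
}

good_problems = review_design(good_spec)          # []
tempting_problems = review_design(tempting_spec)  # three problems
early_days = trust_level(40)                      # "ranking aid"
```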

Part 5

Put it together

5.1 The one mental model for the whole course

+---------------------------------------------------------------+
| PART 4 . THE COMPANY    paperst.ai: 3 nested loops, the moat  |
| +- PART 2 . LOOPS   +   PART 3 . THE ENGINE ---+              |
| | direct AI to build      predict -> decide -> |              |
| | maker/checker, gates    learn  (the ML)      |              |
| +----------------------------------------------+              |
| PART 1 . HOW AI WORKS   it predicts plausible text, not truth |
+---------------------------------------------------------------+

   model unreliable -> build a harness that checks it -> it builds
   machines -> the machines feed one library -> the library is the moat

The single thread: the model is brilliant but unreliable, so you never trust it — you build a harness that checks it. That harness builds the machines. The machines feed one shared library of what-made-money. That library, owned by you, is the thing no competitor can copy. Everything else hangs off that thread.

5.2 The ideas worth tattooing on the inside of your eyelids

  • A prompt makes you the engine; a loop makes the system the engine.
  • It predicts plausible text, not true text — fluency and hallucination are the same mechanism.
  • Harness beats model (~70/30). Invest in the system around the model before waiting for a smarter one.
  • The verifier is the heart. No separate check = an agent agreeing with itself. Fix the artifact, never weaken the gate.
  • Climb for leverage, drop for reliability. Prove it once, harden it, then automate it.
  • The schema is the moat; the model is secondary.
  • Offline metrics are not live improvement. Calibrate before you act. Twyman's Law: too-good is a bug.
  • The agent proposes; a separate layer executes. Proposals, never pushes.
  • Honesty is the moat's foundation. Flywheel not network-effect; estimate not quote; designed not built.

5.3 The cheat sheet — which tool for which job

You want to… | Reach for
Get one good answer, once | a clear prompt (L0)
Make an AI build one verifiable thing | a single-agent loop with a real test (L2)
Make "done" actually mean something | maker/checker: separate the doer from the checker (L3)
Build a batch / multi-phase thing | a Workflow: planner → executors + per-item gates (L4)
Predict who's worth a human's time | lead scoring = triage → a ranked call list
Spend a scarce touch wisely | uplift: find persuadables, avoid sleeping dogs
Choose + learn under uncertainty | a bandit (most cases), not full RL
Trust a probability for money-maths | prove it ranks, calibrates, and validates forward
Keep an agent from leaking data | break the lethal trifecta by topology
Compound everything into a business | the three nested loops + the corpus/registry wall

5.4 The honest limits

This course is reasoning and guidance, not gospel. The heuristic numbers (the ~50% accept-rate, the ~70/30 harness split, the volume gates) are rules of thumb, not laws. Model sizes and prices move — re-check at use-time. The legal and fairness material is doctrine, not legal advice. And most of the system above L2 is designed, not built. None of that weakens the spine; it's what keeps it honest. Build the loop like someone who intends to stay the engineer — not just the person who presses go.

Reference

Glossary — the words, grouped

The vocabulary in plain lines, grouped by where it shows up. Skim now; come back when a word trips you.

A · How AI works

Model
a trained maths function (a file of numbers) that maps an input to a predicted output.
Parameters / weights
the tunable internal numbers; "70B parameters" = 70 billion of them.
Training vs Inference
the expensive one-time tuning of the weights vs running the finished model (weights don't change). Every prompt is inference.
Token
the word-fragment a model reads/writes in; ~4 characters / ~0.75 word; what you're billed by.
Next-token prediction
the core LLM mechanism: score every possible next token, pick a likely one, append, repeat.
Hallucination
confident, fluent output that's fabricated; the same mechanism as the fluency.
Temperature
the randomness dial; low = consistent, high = varied/creative.
Context window
the total tokens a model can "see" at once; finite, capped per model. The "dumb zone" is its least-attended middle.
Prompt / context engineering
crafting the wording vs designing what information fills the window.
Harness
the machinery around the model (verifiers, loops, file handoffs) that keeps output correct.
RAG
retrieval-augmented generation — look up real docs and paste them in so the model answers from sources, not memory.
Grounding
tying an answer to verifiable real sources; the antidote to hallucination.
Agent / tool use / MCP
an LLM that acts in a loop / a menu of real actions it can invoke / the standard plug for connecting tools to an AI.

B · Building with AI (loops)

Loop
a repeating cycle: the AI acts, gets feedback, picks the next action, until a stop condition.
The ladder (L0–L5)
the rungs from manual prompting (L0) to scheduled/self-improving (L5).
Goal / Verifier / State / Stop
the four parts every real loop needs; the verifier (the gate it can't argue past) is the heart.
Workflow vs Agent
control flow held by your code (predictable) vs by the model (flexible). The whole distinction.
Maker/checker
split the agent that does from the agent that checks; most of the quality.
Adversarial verification
N independent skeptics, each told to refute, each with a different lens; majority rules.
Cost per accepted change
the real measure of a loop's worth — not tokens spent or loops run.
Ralph Wiggum loop
the agent exits a half-finished job and the loop keeps spending in silence.
LLM-as-judge / calibration gate
an LLM scoring a rubric / proving it agrees with humans >75% before you trust it.
Lethal trifecta
private data + untrusted content + outbound ability in one context = dangerous.
Prompt injection
malicious instructions hidden in content the agent reads.
Durable execution
append-only history + memoization so a crash replays without re-paying for completed work.

C · The engine (machine learning)

Supervised / unsupervised / reinforcement
learning with answer-keys / without / by trial-and-reward.
Loss / gradient descent / SGD
how-wrong-a-prediction-is / stepping downhill to minimise it / using one random sample (or a small batch) per step.
Overfitting / bias-variance
memorising the noise (great on training, bad on new data) / the too-simple vs too-flexible U-curve.
Regularisation / ridge / lasso
penalising complexity / shrink-all-keep-all / shrink-some-to-zero (auto feature-selection).
Feature
an input signal as a concept (an aggregate per entity), not a raw value.
Target leakage / train-serve skew
a feature that encodes the outcome / the same feature computed differently offline vs live. Both invisible offline.
AUC / calibration
a ranking score (good lead outranks bad) / "0.8 means ~80%".
Twyman's Law
"any figure that looks too good is probably wrong"; a near-perfect AUC is a leakage alarm.
Propensity vs uplift (CATE)
the level of outcome (likely to convert) vs the change the touch causes.
Persuadables / sure things / sleeping dogs
the response segments; only persuadables are worth the budget; sleeping dogs are suppressed by the touch.
Bandit (multi-armed / contextual)
choose-and-learn among options / when the best choice depends on features. Most decisions are a bandit, not full RL.
Frozen scorer / Goodhart's law
the unchangeable grading metric / a loop pointed at a proxy optimises the proxy and drifts.
Proxy discrimination
a neutral feature useful because it reproduces a protected class's disparate impact.

D · The Paper St system

Thesis
the single locked sentence every build is judged against.
Closed loop / nested loops
a self-improving system / loops inside loops (client · niche-co · HoldCo).
Moat
the durable advantage competitors can't copy — here, the cross-niche corpus of what made businesses money.
Learning flywheel vs network effect
an improving loop (true now) vs a self-reinforcing data advantage (not yet — say "flywheel").
HoldCo / three-tier offer
the parent that owns subsidiaries / bespoke · niche product · self-serve.
Build Loop / Refinery
the recipe that builds machines (loops/) / the system that learns from them (engine/).
Corpus / registry / learning boundary
the learnable library (safe) / the key↔client map (never learned) / the structural wall between them.
Governor / Rule-of-Two / consent envelope
the rule engine every action passes / the trifecta budget / a bounded permission slip with a kill switch.
Build-to-HOLD
do every safe, reversible step, then hold the irreversible one for a human; nothing ships itself.
Papertrail doctrine
every level keeps an append-only sourced ledger; every recommendation cites the row it came from.