A field guide for business minds

How AI actually works.

Agents, skills, MCP, RAG, evals — every concept your engineering team mentions in standup, translated into plain English. No math. No hype. Sixteen short chapters, readable over two coffees.

Part OneThe Raw Material

The Model · Tokens · The Context Window · Prompts · Hallucination

Part TwoMaking It Useful

RAG — Retrieval · Tools · Agents · Skills · MCP · Multi-Agent Systems

Part ThreeShipping It

Prompt vs. RAG vs. Fine-Tune · Evals · Guardrails · The Cheat Sheet · Glossary

Part One

The Raw Material

01

The Model

A prediction engine that read the internet — not a database, not a mind.

A large language model (LLM) is a piece of software trained on a colossal amount of text — books, articles, code, the public web. Training taught it one skill, performed at superhuman scale: given everything so far, predict what comes next. That one skill, it turns out, is enough to draft strategy memos, write code, summarize earnings calls, and negotiate calendars.

Two properties follow from how it was built, and they explain most of the surprises you’ll hit in practice. First, the model is frozen in time: it knows nothing after its training ended — not yesterday’s launch, not this morning’s metrics, not your reorg. Second, it has no memory of you: every conversation starts from a blank slate unless you (or your tooling) hand it the relevant history.

Everything else in this guide — retrieval, tools, agents, MCP — exists to work around those two limitations. Once you see that, the whole stack stops looking like alphabet soup and starts looking like an org design problem.

02

Tokens

The unit of cost, speed, and memory. Everything is metered in tokens.

Models don’t read words; they read tokens — chunks of roughly three-quarters of a word. “Sponsorship” is two tokens. A page of text is about 500. Why should a business person care about a unit this nerdy? Because tokens are the meter that everything runs on.

API pricing is per token, in and out. Latency grows with tokens generated. And the model’s working memory — the next chapter — is a fixed token budget. When an engineer says a feature is “token-heavy,” they are telling you it is slow and expensive in the same breath. When finance asks why the AI bill doubled, the answer is denominated in tokens.

One sentence, as the model sees it

Sponsorship revenue for mid-tier creators grew 23% quarter over quarter.

You read words. The model reads tokens — click the button to see the difference.

Fig. 02 — Tokenization (illustrative)

03

The Context Window

The model's working memory — a desk, not a library.

The context window is everything the model can see at once: your instructions, the documents you attached, the conversation so far. Modern windows hold a few hundred thousand tokens — several novels’ worth. That sounds infinite. It is not, and treating it as infinite is the most common beginner mistake in AI product design.

Two things break when the window fills. The obvious one: old material falls out, and the model genuinely forgets it — this is why a long chat session “loses the plot.” The subtler one: models pay imperfect attention to enormous contexts, the way you skim page 200 of a briefing book. More context is not always better context. The craft of choosing what goes in — and what stays out — is called context engineering, and it is quietly becoming the core skill of building with AI.

What fills the window on a single request

  • Instructions (system prompt) 12%
  • Your documents & data 38%
  • Conversation so far 26%
  • Your latest question 6%
  • Room left to think & answer 18%

The window is finite. Stuff it with everything “just in case” and there is no room left to reason — and you pay for every token you stuffed. Curating what goes in is a real discipline; engineers call it context engineering.

Fig. 03 — The context window as a budget

04

Prompts

Instructions are the product spec. Most 'AI quality' problems are brief quality problems.

A prompt is just the input you give the model. The interesting one is the system prompt: standing instructions, written by your team, that the model reads before every single user message. Tone, rules, format, what to refuse, who it works for — the system prompt is the job description, and the model follows it with the literal-mindedness of a very fast new hire on day one.

This is why “prompt engineering” isn’t a parlor trick — it’s spec-writing. Vague brief in, vague work out, exactly like a vague creative brief to an agency. The difference: with AI you can revise the brief fifty times a day and measure the result each time. The teams that win treat prompts as managed, versioned product assets — reviewed like code, tested like features — not as magic words someone typed once and forgot.

Most “the AI is bad at this” complaints are really “we gave it a bad brief.”
05

Hallucination

Fluent, confident, and occasionally wrong — by design.

A model’s job is to produce the most plausible next token — not the most true one. Usually plausible and true coincide. When they don’t, the model produces a hallucination: a confident, fluent, well-formatted answer that is simply false. A citation that doesn’t exist. A revenue figure that was never reported. The danger isn’t that it happens — it’s that it happens in perfect corporate prose, with no stammer to tip you off.

The fix is not “a smarter model” (though newer models hallucinate less). The fix is grounding: forcing answers to come from real sources — your documents via retrieval, live data via tools — and asking for citations you can check. Well-grounded systems can also be instructed to say “I don’t know,” which is the single cheapest reliability upgrade available.

Part Two

Making It Useful

06

RAG — Retrieval

Give the model an open-book exam on your company's knowledge.

The model ships knowing nothing about your business. Retrieval-augmented generation — RAG, rhymes with “bag” — is the standard fix. When a question arrives, the system first searches your documents (policies, wikis, contracts, past decks) for the most relevant passages, then hands those passages to the model and says: answer from this.

The business consequences are bigger than the acronym suggests. Answers stay current the moment a document is updated — no retraining, no waiting on a model release. Answers come with citations, so a human can verify in seconds. And your data stays your data: it is fetched at question time, not baked into anyone’s model. When a vendor says “we train on your knowledge base,” the follow-up question is whether they mean RAG (fine, standard) or actual training (a much bigger conversation — see chapter 12).

One honest caveat: RAG quality lives and dies on the retrieval step. If the search surfaces the wrong passages, the model fluently summarizes the wrong thing. Garbage retrieved, garbage generated.

Retrieval-augmented generation, in three moves

1

Question arrives

“What’s our refund policy for annual plans?”

2

Retrieve

Search your docs, wiki, and policies for the relevant passages

3

Read & answer

The model answers from those passages — and cites them

The model never “memorizes” your documents. It does an open-book exam on every single question — so when the policy doc changes, the answers change with it. Nothing to retrain.

Fig. 06 — RAG: look it up, then answer

07

Tools

Giving the model hands: from writing about work to doing it.

Out of the box, a model can only emit text. Tools (the API feature is called function calling) change that: your engineers define a menu of actions — search the warehouse, query the CRM, send the email, file the ticket — and the model, mid-conversation, can choose one, fill in the parameters, and ask your software to run it. The result comes back, and the model continues with real data in hand.

This is the line between an AI that describes work and an AI that does work — and it’s also where risk enters the building. A model that can only talk can embarrass you; a model that can act can spend money, email customers, and delete things. Which is why the tool menu, and the permissions on each item, is a business decision, not just a technical one.

08

Agents

A model in a loop with tools and a goal. The word of the year, demystified.

Strip the hype and an agent is exactly three things: a model, a set of tools, and a goal — run in a loop. Think about the next step, take an action, look at what happened, repeat until the goal is met. A chatbot answers your question and stops. An agent keeps working: it notices the first search came up empty, reformulates, tries a different tool, checks its own output, and comes back when the job is done — minutes or hours later.

The loop is what changes the economics. You stop paying per answer and start delegating per outcome: “reconcile these two spreadsheets,” “triage today’s support queue,” “draft the quarterly review from these five sources.” The loop is also what changes the risk: an agent that can take twenty actions unsupervised can take twenty wrong actions unsupervised. Autonomy is a dial you set deliberately, not a feature you switch on.

The loop that makes it an agent

Think

Read the goal, decide the next move

Act

Use a tool — search, query, draft, book

Observe

Read the result. Done? If not…

…go again, until the goal is met

A chatbot runs this loop once and hands you text. An agent keeps looping — dozens or hundreds of turns — checking its own work against the goal each time around.

Fig. 08 — Think, act, observe, repeat

A chatbot answers questions. An agent finishes jobs.
09

Skills

Standard operating procedures that turn a smart generalist into a trained operator.

An agent is a brilliant generalist, but it doesn’t know your way of doing things — how this company writes a launch brief, which template legal insists on, what “done” means for a quarterly review. A skill is the fix: a packaged playbook — instructions, examples, reference files, sometimes scripts — that the agent loads when, and only when, the relevant task shows up.

The detail that makes skills elegant: the agent reads a one-line description of each skill and pulls in the full playbook only when needed — so you can maintain a large library without clogging the context window (chapter 3’s desk stays clean). Skills are files, so they’re versioned, reviewable, and shareable. When someone improves the “competitive teardown” skill, every agent that uses it gets better the same day. That is institutional knowledge with a deployment pipeline.

10

MCP

The USB-C of AI: one standard plug between assistants and your systems.

Tools are powerful, but until recently every team wired them up bespoke — every assistant needed a custom integration to every system. The Model Context Protocol (MCP) is the open standard that ends that. A system exposes one MCP server describing what it offers — “search these docs,” “query this database,” “create this ticket” — and any MCP-compatible assistant can connect. Build the connector once; every AI tool that speaks the protocol can use it.

Why a business person should care about a protocol: it converts integrations from custom builds into a check-box (“does it support MCP?”), it reduces vendor lock-in because connectors outlive any one assistant, and it’s where the ecosystem has converged — major AI vendors and a fast-growing catalog of servers for common business software. When your team says “there’s an MCP server for that,” they mean the integration is mostly already done.

One plug standard, many systems

AI assistantspeaks MCPCRMAnalyticsDocs & DriveCalendarTicketingData warehouse

Before MCP: every AI tool needed a custom, one-off integration with every system — an N×M mess of bespoke connectors. After: each system exposes one MCP server, and any MCP-speaking assistant can plug in. Like USB-C, but for AI and software.

Fig. 10 — The Model Context Protocol

11

Multi-Agent Systems

When one agent isn't enough: specialists, orchestrators, and the org chart pattern.

Big jobs strain a single agent for the same reason they strain a single employee: too many contexts, too many hats. A multi-agent system splits the work — an orchestrator agent breaks the goal into assignments and farms each one out to a subagent with a clean context window and a narrow job: one researches, one analyzes, one drafts, one critiques. The orchestrator reviews the pieces and assembles the result.

Two reasons this works, both familiar from management. Focus: a specialist with one job and only the relevant materials on its desk outperforms a generalist juggling everything (the context window strikes again). Parallelism: ten subagents can read ten reports simultaneously. And one familiar cost: coordination overhead is real — more agents means more token spend and more ways to fail. The honest heuristic your team should follow: use one agent until it demonstrably breaks, then split.

An org chart, but for agents

Orchestrator

breaks the goal into assignments, reviews the work

Researcher

pulls market data

Analyst

crunches the numbers

Writer

drafts the brief

Each specialist gets a clean context window and one job — the same reason you don’t staff one person on strategy, data, and copywriting simultaneously. The orchestrator owns the goal and the quality bar.

Fig. 11 — Multi-agent delegation

Part Three

Shipping It

12

Prompt vs. RAG vs. Fine-Tune

Three ways to make a general model yours — and the order to try them in.

Sooner or later every team asks: “should we train our own model?” Almost always, the answer is no — and the reason is that there are three ways to customize, with wildly different price tags. Prompting changes the instructions. RAG changes what the model can look up. Fine-tuning changes the model itself, by training it further on hundreds or thousands of your examples.

The mental model that prevents expensive mistakes: fine-tuning teaches behavior, not facts. It is the right tool when you need a very specific style or a narrow task done at huge volume — and the wrong tool for “make it know our products,” because baked-in knowledge goes stale the day your catalog changes. RAG keeps knowledge live. Prompting fixes most things for the price of an afternoon. Climb the ladder only when the cheaper rung demonstrably fails.

1.Prompting
Hours

Always start here. Tone, format, rules, examples — most customization is just a better brief.

2.RAG
Days–weeks

The model needs your knowledge: policies, docs, data that changes. Current, citable, no retraining.

3.Fine-tuning
Weeks + upkeep

A narrow, high-volume task where examples beat instructions. Teaches behavior, not facts. Last resort.

13

Evals

QA for software that never gives the same answer twice.

Traditional software is deterministic: same input, same output, write a test. AI is probabilistic: same input, slightly different output every time — and “better” is often a judgment call. Evals are how serious teams replace vibes with measurement: a curated set of test cases (real questions, graded answers) that every change — new prompt, new model, new retrieval setup — must pass before it ships.

This is the chapter to care about most, because evals are where AI products quietly succeed or fail. Without them, every model upgrade is a gamble and every “it feels worse lately” debate is unresolvable. With them, you have a scoreboard: ship the change if the numbers improve, roll back if they don’t. If your team can’t show you their evals, what they have is a demo, not a product. And the highest-leverage thing a business owner can contribute to an AI project is exactly here: defining what a good answer looks like, with real examples. That’s domain judgment, not engineering.

If there are no evals, it’s not a product — it’s a demo.
14

Guardrails

Autonomy is a dial, not a switch — and someone has to own where it's set.

Once an AI can act (chapter 7) and loop (chapter 8), the governing question stops being “how smart is it?” and becomes “what is it allowed to do?” Guardrails are the controls around the model: permissions on every tool, checks on inputs and outputs, spending and rate limits, and — most importantly — human-in-the-loop checkpoints, where the system pauses and asks a person before acting.

The practical framework is blast radius. Drafting an internal summary? Let it run. Replying to a customer, touching production data, moving money? A human approves, at least until the eval scores earn more rope. Mature teams widen autonomy the way managers widen a new hire’s: gradually, based on a track record, with an audit log of every action so trust can be verified rather than assumed. “The AI did it” is not an accountability structure — every agent should have an owner whose name you know.

15

The Cheat Sheet

The whole guide on one screen: what you want, what it's called, where to read.

If you want… → you need…

Ch.

Answers from our documents, kept current RAG

06

Our voice, format, and rules in every output A better system prompt

04

AI that takes actions in our systems Tools + guardrails

07 · 14

Multi-step jobs finished end-to-end An agent

08

Our playbooks followed every time Skills

09

Many integrations without custom builds MCP

10

One narrow task at massive volume, in a precise style Fine-tuning (maybe)

12

Confidence it actually works Evals

13

The standup phrasebook

It's hallucinating
Confidently making things up. Ask: are answers grounded in our data, with citations?
We blew through the context window
It ran out of working memory and forgot earlier material.
That feature is token-heavy
Slow and expensive, in the same breath.
We need better evals
We can't currently measure whether changes make it better or worse.
There's an MCP server for that
The integration mostly already exists. Good news.
Let's keep a human in the loop
The AI pauses for sign-off before acting. Ask: on which actions?
16

Glossary

Twenty-four terms that cover ninety percent of the meetings you'll sit in.

Agent
A model running in a loop with tools and a goal, working until the job is done.
Context window
Everything the model can see at once — its finite working memory, measured in tokens.
Context engineering
The craft of choosing what goes into the context window, and what stays out.
Embedding
A list of numbers representing a text's meaning, so 'refund policy' can match 'money-back terms' in search.
Eval
A curated test set that scores AI quality, so changes ship on evidence instead of vibes.
Fine-tuning
Further training a model on your examples to change its behavior. Teaches style, not facts.
Function calling
The API mechanism behind tools: the model picks an action and fills in the parameters; your software executes it.
Grounding
Forcing answers to come from real, checkable sources rather than the model's memory.
Hallucination
A confident, fluent answer that is false — a citation or number the model invented.
Human-in-the-loop
A checkpoint where the system pauses for a person's approval before acting.
Inference
Running the model to get an output (as opposed to training it). What you pay for, per token.
LLM
Large language model. Software trained on vast text to predict the next token — the engine under everything here.
MCP
Model Context Protocol — the open standard that lets any AI assistant plug into any system. USB-C for AI.
Multi-agent system
An orchestrator agent delegating to specialist subagents, each with a clean context and one job.
Multimodal
A model that handles more than text: images, audio, video in or out.
Orchestrator
The agent that breaks a goal into assignments, farms them out, and reviews the results.
Prompt
The input you give a model. Quality of brief in, quality of work out.
RAG
Retrieval-augmented generation: search your documents first, then answer from what was found, with citations.
Skill
A packaged playbook (instructions, examples, files) an agent loads on demand for a specific kind of task.
Subagent
A specialist agent spawned for one assignment, with its own fresh context window.
System prompt
Standing instructions the model reads before every user message. The job description.
Token
The chunk (~¾ of a word) models read, write, and bill by. The meter everything runs on.
Tool
An action the model is allowed to take — search, query, send, file — defined and permissioned by your team.
Training
Building a model's abilities from massive data upfront. Frozen at a cutoff date thereafter.