In class · frontal Part 2 of 2 · ~90 minutes

Building AI agents with PydanticAI

You've used AI to write code. Now we build AI that acts: an agent that can call tools, loop, and get real work done — not just answer once. We'll cover the what and the why, how to run it for free, and a small live example.

What is an agent?

A plain LLM call is a single question and a single answer. An agent wraps the model in a loop and gives it tools — functions it can decide to call. The model reasons, calls a tool, reads the result, and keeps going until the task is done.

Promptyour task
LLMdecide next step
Call a toolsearch, run code…
Read resultobserve
Final answerstructured

The loop (reason → act → observe) is what makes it an agent. Remove the loop and the tools, and you're back to a single chat message.

What is a tool?

A tool is just a normal Python function you register with the agent. The model sees its name, arguments, and docstring, and decides when to call it. You control what it can actually do.

tool example
from pydantic_ai import Agent

agent = Agent("google-gla:gemini-2.0-flash")

@agent.tool_plain
def word_count(text: str) -> int:
    """Return the number of words in the given text."""
    return len(text.split())

The docstring is not a comment — it's the tool's instruction manual for the model. Clear name + clear docstring + typed arguments = the model uses it correctly. Tools are how the model touches the real world: search the web, query a database, run code, hit an API, read a file.

Why an agent beats a single LLM request

🗨️ One LLM request

  • Answers only from training data (may be stale or wrong)
  • No access to your data, the web, or live systems
  • Can't verify its own work
  • Silently makes things up when it doesn't know
  • One shot — no recovery from a mistake
vs

🤖 An agent with tools

  • Fetches fresh, real information via tools
  • Grounds answers in your sources
  • Checks results and retries when a tool fails
  • Breaks a big task into steps and loops until done
  • Returns validated, structured output you can trust

The motivation, in one line

An LLM alone is a brilliant intern with no phone, no internet, and no notebook. Tools give it the phone and the notebook; the loop lets it actually finish the job.

What the benchmarks show

The industry measures agents on tasks that require doing, not just talking. A few worth naming in class:

BenchmarkWhat it measuresWhy it matters
SWE-benchResolving real GitHub issues in real reposTool-using agents massively outperform single prompts — they read files, run tests, and iterate
GAIAMulti-step reasoning with web + file toolsQuestions humans find easy but a lone LLM can't answer without tools
τ-benchTool use in realistic customer-service flowsTests whether the agent follows rules and calls the right tool at the right time

The consistent finding: giving a model tools + a loop lifts task-completion rates far beyond what better prompting alone can do. The gap is the whole reason agents exist.

Your stack: framework, model, harness

Three independent choices. The model and framework are easy to swap — the harness is where the real engineering lives, and it's what actually makes an agent good.

🧱

Framework

The library that runs the loop and wires up tools:

PydanticAILangGraphLangChainLlamaIndexOpenAI Agents SDK

We use PydanticAI: typed, small, validated outputs for free.

🧠

Model

The brain doing the reasoning — swappable in one line:

GeminiClaudeGPTLlamaQwen

You reach it via a gateway: direct API, OpenRouter (one key, many models), or local Ollama. That's plumbing — not the harness.

🛠️

Harness the point

Everything around the model that turns it into a dependable agent:

ToolsObservability / tracingTests & evalsRetries & guardrailsMemory

Same model + same framework, but a strong harness is the difference between a demo and something you'd ship.

Why the harness is the agent

Two teams can use the identical model and framework and get wildly different results. The gap is the harness — the scaffolding you build around the loop:

Piece of the harnessWhat it doesWhy it matters
ToolsThe actions the agent can take — search, run code, query a DB, call an APIDefines what the agent is even capable of doing
ObservabilityTracing every step: prompts, tool calls, tokens, latency, errors (e.g. Logfire)You can't fix what you can't see — this is how you debug and improve
Tests & evalsAutomatically checking the agent still behaves as inputs and prompts changeTurns “seemed fine” into measurable, repeatable quality
Retries & guardrailsRecovering from bad tool output, validating results, capping usageKeeps the agent safe, bounded, and reliable in the real world
Memory & contextWhat the agent remembers within and across runsLets it handle long, multi-step tasks instead of forgetting
🧩 The mindset

Picking a model is a one-line decision. Building the harness — good tools, tracing you trust, tests that catch regressions, sane guardrails — is the actual work, and it's what makes one agent great and another useless.

Agent architectures

Once you have one agent, the next question is how to combine them. There's a ladder of patterns from a single loop up to teams of specialists — you pick the simplest one that solves your problem.

First distinction: workflow vs. agent

🧭 Workflow

You wire the steps in code. The path is predefined — the LLM fills in each step, but you decide the flow.

Predictable, cheaper, easier to debug.

vs

🤖 Agent

The model decides what to do next and which tools to call, looping until done. The path emerges at runtime.

Flexible, handles the unknown — but slower and pricier.

Both are built from the same brick: the augmented LLM — a model with tools, memory, and retrieval (that's your harness). Everything below is just different ways of arranging those bricks.

The core patterns

1 · Single agent start here

One augmented LLM in a loop with its tools. Reasons, calls a tool, observes, repeats until the task is done.

Prompt
Agent⟳ loop + tools
Answer

Use when: the task fits one context and one skill set. 90% of the time, this is enough.

2 · Prompt chain sub-agents in a flow

Break the task into fixed, ordered steps. Each step's output feeds the next — a pipeline you define in code (a workflow).

Draftstep 1
Checkstep 2
Polishstep 3
Result

Use when: the steps are known and stable. Trades a little latency for a lot of accuracy. Add a gate between steps to catch failures early.

3 · Routing classify → handoff

A router classifies the input and hands it to the right specialist. Separates concerns so each handler stays simple (and you can use a cheaper model for easy cases).

Routerclassify
Billing
Technical
Sales
one input → the one right specialist

Use when: inputs fall into distinct categories that need different handling.

4 · Parallelization fan-out → fan-in

Run several LLM calls at once, then combine. Two flavors: sectioning (split a task into independent parts) and voting (run the same task several times and aggregate for confidence).

Split / dispatch
↓   in parallel   ↓
Worker A
Worker B
Worker C
Aggregate

Use when: subtasks are independent (speed) or you want multiple perspectives / a vote (reliability). This is exactly the workshop's Level 4.

5 · Orchestrator + specialists the workhorse

A central orchestrator breaks the task into subtasks at runtime, delegates each to a specialist worker, then synthesizes the results. Unlike a chain, the subtasks aren't known in advance — the orchestrator decides.

Orchestratorplan · delegate · synthesize
↕   delegates & collects   ↕
Researcher
Coder
Writer

Use when: the task is complex and the needed steps can't be predicted up front. This is the most common production pattern for multi-agent systems.

6 · Evaluator–optimizer generate ⇄ critique

One agent produces a result; a second agent critiques it against criteria; the first revises. Loop until it's good enough — like a writer and an editor.

Generatorproduce
Evaluatorcritique
Final
loop until the critic is satisfied

Use when: you have clear quality criteria and revision measurably helps (code, translations, structured writing).

Scaling up: multi-agent topologies

When one team of specialists isn't enough, these describe how many agents relate to each other. Rule of thumb: default to orchestrator/hierarchical; reach for peer patterns only when you truly need them.

TopologyShapeControlBest for
HierarchicalSupervisors of supervisors — a tree of managers → workersHighLarge, multi-domain tasks (20+ agents) that blow past one context window
SwarmAutonomous peers coordinating through shared state, no bossLowOpen-ended exploration; many branches searched in parallel
MeshA few peers (3–8) with direct, persistent connectionsMediumTight collaborative iteration on a shared artifact (e.g. review loops)
🧭 The one rule that matters: start simple

Every layer of agents adds latency, cost, and places to fail. Begin with a single agent. Move to a workflow (chain/route/parallel) only when the path is clear, and to an orchestrator only when it isn't. These patterns are composable Lego bricks — reach for the fewest that solve the problem.

Connecting tools at scale: MCP

Writing every tool by hand doesn't scale. The Model Context Protocol (MCP) is an open standard — think “USB-C for AI tools” — that lets your agent plug into ready-made servers that expose tools and data: GitHub, Slack, databases, the filesystem, a browser, and hundreds more.

🔌

Write once, reuse everywhere

An MCP server built for one app works with any MCP-aware agent — regardless of framework or model. No more re-implementing the same integration.

🧰

Instant capabilities

Point your agent at a server and it gains all of that server's tools at once — search, file access, API calls — without you coding each one.

agent_with_mcp.py — illustrative
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Connect to an MCP server (here: a filesystem server started over stdio).
fs = MCPServerStdio("npx", args=["-y", "@modelcontextprotocol/server-filesystem", "."])

agent = Agent("google-gla:gemini-2.0-flash", toolsets=[fs])

async def main():
    async with agent:            # opens & closes the MCP connections
        result = await agent.run("List the Python files here and summarize each.")
        print(result.output)
🔗 Two sides of MCP

Your agent can be an MCP client (consuming other people's tools, above) and you can expose your own tools as an MCP server for others. The exact PydanticAI class names evolve quickly — check ai.pydantic.dev/mcp for your version.

Getting free tokens for testing

You don't need a credit card to learn this. Two free paths — and a way to cap your usage so an agent loop can't run away.

🟢 Google Gemini — free tier

Get a free key from Google AI Studio. The free tier gives you a generous number of requests per minute and per day on models like gemini-2.0-flash — perfect for testing agents.

  1. Sign in with a Google account
  2. Click Create API key
  3. Set it as an environment variable (never paste it into code you'll share):
terminal
export GEMINI_API_KEY="your-key-here"

🔵 OpenRouter — free models

One key at openrouter.ai/keys reaches many models. Models tagged :free cost nothing (with rate limits). Great for comparing brains without changing your code.

Point PydanticAI at OpenRouter's OpenAI-compatible endpoint https://openrouter.ai/api/v1 and pick a free model such as meta-llama/llama-3.3-70b-instruct:free.

🚦 Cap the loop: “a specific number of requests”

Because an agent loops, a bug can make it call the model over and over and burn your free quota. PydanticAI lets you set a hard ceiling with UsageLimits. When it's hit, it raises UsageLimitExceeded — which you catch, exactly like in the example below.

Simple live example

A tiny agent with one tool, a free model, a structured result, and a request cap. This is the full shape of everything we'll build in the workshop.

first_agent.py
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits
from pydantic_ai.exceptions import UsageLimitExceeded

# 1. Describe the structured answer we want back (validated automatically).
class Answer(BaseModel):
    reasoning: str = Field(description="Short explanation of how we got here")
    result: int = Field(description="The final numeric answer")

# 2. Build the agent. Model id is 'provider:model' — swap it in one line.
#    Reads GEMINI_API_KEY from the environment; nothing secret in the code.
agent = Agent(
    "google-gla:gemini-2.0-flash",
    output_type=Answer,
    system_prompt="You are a precise assistant. Use the tools; do not do math yourself.",
)

# 3. A tool: a normal typed function. The docstring tells the model what it does.
@agent.tool_plain
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

# 4. Run it — with a hard ceiling of 5 model requests so the loop can't run away.
try:
    out = agent.run_sync(
        "What is 21 + 21? Use the add tool.",
        usage_limits=UsageLimits(request_limit=5),
    )
    print(out.output.result)      # -> 42
    print(out.output.reasoning)
except UsageLimitExceeded as e:
    print("Hit the request cap:", e)
🔍 Read the four moves

1. A BaseModel defines the output — you get typed, validated data, not a blob of text.  2. The model id is just a string; change providers without touching logic.  3. A tool is a plain function; the docstring is its API.  4. UsageLimits caps the loop and UsageLimitExceeded is your safety valve.

Same agent, but through OpenRouter (free model)
openrouter_agent.py
import os
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

client = AsyncOpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],   # from env, not hard-coded
    base_url="https://openrouter.ai/api/v1",
)
model = OpenAIModel(
    "meta-llama/llama-3.3-70b-instruct:free",     # a free model
    provider=OpenAIProvider(openai_client=client),
)
agent = Agent(model, system_prompt="You are a helpful assistant.")
print(agent.run_sync("Say hello in one word.").output)

Notice: only the model wiring changed. Your tools, output types, and logic stay identical.

▶️ Run it live in class

Save as first_agent.py, pip install pydantic-ai, set GEMINI_API_KEY, then python first_agent.py. Watch it call the tool and return 42 — then break the tool on purpose and watch the request cap catch the runaway loop.

What goes wrong — and how the harness guards it

Agents fail in ways a single function never does. Knowing the failure modes is half of building a good harness.

Failure modeWhat it looks likeGuard
Runaway loopKeeps calling tools / the model forever, burning tokensUsageLimits(request_limit=…)
Bad tool callWrong arguments, or the tool errors on the model's inputValidate inputs; raise ModelRetry to let it try again
Hallucinated outputConfident answer that's simply wrong or made-upStructured output + validation; an evaluator step
Prompt injectionText fetched by a tool contains “ignore your instructions…”Treat tool results as untrusted data, not commands
Unsafe actionAbout to delete / send / spend something irreversibleHuman-in-the-loop approval for high-stakes tools
Silent regressionsWorked yesterday, worse today after a prompt tweakObservability (tracing) + evals that run on every change
retry a bad tool call
from pydantic_ai import ModelRetry

@agent.tool_plain
def get_user(user_id: int) -> dict:
    """Look up a user by id."""
    user = db.get(user_id)
    if user is None:
        # Don't crash — tell the model to fix its input and try again.
        raise ModelRetry(f"No user with id {user_id}. Ask for a valid id.")
    return user
🧠 The mindset

Assume the model will misbehave, and design so that when it does, the harness catches it cheaply. That's the difference between a fun demo and something you'd let touch real data.