In class · frontal Part 2 of 2 · ~90 minutes
Building AI agents with PydanticAI
You've used AI to write code. Now we build AI that acts: an agent that can call tools, loop, and get real work done — not just answer once. We'll cover the what and the why, how to run it for free, and a small live example.
What is an agent?
A plain LLM call is a single question and a single answer. An agent wraps the model in a loop and gives it tools — functions it can decide to call. The model reasons, calls a tool, reads the result, and keeps going until the task is done.
The loop (reason → act → observe) is what makes it an agent. Remove the loop and the tools, and you're back to a single chat message.
What is a tool?
A tool is just a normal Python function you register with the agent. The model sees its name, arguments, and docstring, and decides when to call it. You control what it can actually do.
from pydantic_ai import Agent
agent = Agent("google-gla:gemini-2.0-flash")
@agent.tool_plain
def word_count(text: str) -> int:
"""Return the number of words in the given text."""
return len(text.split())
The docstring is not a comment — it's the tool's instruction manual for the model. Clear name + clear docstring + typed arguments = the model uses it correctly. Tools are how the model touches the real world: search the web, query a database, run code, hit an API, read a file.
Why an agent beats a single LLM request
🗨️ One LLM request
- Answers only from training data (may be stale or wrong)
- No access to your data, the web, or live systems
- Can't verify its own work
- Silently makes things up when it doesn't know
- One shot — no recovery from a mistake
🤖 An agent with tools
- Fetches fresh, real information via tools
- Grounds answers in your sources
- Checks results and retries when a tool fails
- Breaks a big task into steps and loops until done
- Returns validated, structured output you can trust
The motivation, in one line
An LLM alone is a brilliant intern with no phone, no internet, and no notebook. Tools give it the phone and the notebook; the loop lets it actually finish the job.
What the benchmarks show
The industry measures agents on tasks that require doing, not just talking. A few worth naming in class:
| Benchmark | What it measures | Why it matters |
|---|---|---|
| SWE-bench | Resolving real GitHub issues in real repos | Tool-using agents massively outperform single prompts — they read files, run tests, and iterate |
| GAIA | Multi-step reasoning with web + file tools | Questions humans find easy but a lone LLM can't answer without tools |
| τ-bench | Tool use in realistic customer-service flows | Tests whether the agent follows rules and calls the right tool at the right time |
The consistent finding: giving a model tools + a loop lifts task-completion rates far beyond what better prompting alone can do. The gap is the whole reason agents exist.
Your stack: framework, model, harness
Three independent choices. The model and framework are easy to swap — the harness is where the real engineering lives, and it's what actually makes an agent good.
Framework
The library that runs the loop and wires up tools:
We use PydanticAI: typed, small, validated outputs for free.
Model
The brain doing the reasoning — swappable in one line:
You reach it via a gateway: direct API, OpenRouter (one key, many models), or local Ollama. That's plumbing — not the harness.
Harness the point
Everything around the model that turns it into a dependable agent:
Same model + same framework, but a strong harness is the difference between a demo and something you'd ship.
Why the harness is the agent
Two teams can use the identical model and framework and get wildly different results. The gap is the harness — the scaffolding you build around the loop:
| Piece of the harness | What it does | Why it matters |
|---|---|---|
| Tools | The actions the agent can take — search, run code, query a DB, call an API | Defines what the agent is even capable of doing |
| Observability | Tracing every step: prompts, tool calls, tokens, latency, errors (e.g. Logfire) | You can't fix what you can't see — this is how you debug and improve |
| Tests & evals | Automatically checking the agent still behaves as inputs and prompts change | Turns “seemed fine” into measurable, repeatable quality |
| Retries & guardrails | Recovering from bad tool output, validating results, capping usage | Keeps the agent safe, bounded, and reliable in the real world |
| Memory & context | What the agent remembers within and across runs | Lets it handle long, multi-step tasks instead of forgetting |
Picking a model is a one-line decision. Building the harness — good tools, tracing you trust, tests that catch regressions, sane guardrails — is the actual work, and it's what makes one agent great and another useless.
Agent architectures
Once you have one agent, the next question is how to combine them. There's a ladder of patterns from a single loop up to teams of specialists — you pick the simplest one that solves your problem.
First distinction: workflow vs. agent
🧭 Workflow
You wire the steps in code. The path is predefined — the LLM fills in each step, but you decide the flow.
Predictable, cheaper, easier to debug.
🤖 Agent
The model decides what to do next and which tools to call, looping until done. The path emerges at runtime.
Flexible, handles the unknown — but slower and pricier.
Both are built from the same brick: the augmented LLM — a model with tools, memory, and retrieval (that's your harness). Everything below is just different ways of arranging those bricks.
The core patterns
1 · Single agent start here
One augmented LLM in a loop with its tools. Reasons, calls a tool, observes, repeats until the task is done.
Use when: the task fits one context and one skill set. 90% of the time, this is enough.
2 · Prompt chain sub-agents in a flow
Break the task into fixed, ordered steps. Each step's output feeds the next — a pipeline you define in code (a workflow).
Use when: the steps are known and stable. Trades a little latency for a lot of accuracy. Add a gate between steps to catch failures early.
3 · Routing classify → handoff
A router classifies the input and hands it to the right specialist. Separates concerns so each handler stays simple (and you can use a cheaper model for easy cases).
Use when: inputs fall into distinct categories that need different handling.
4 · Parallelization fan-out → fan-in
Run several LLM calls at once, then combine. Two flavors: sectioning (split a task into independent parts) and voting (run the same task several times and aggregate for confidence).
Use when: subtasks are independent (speed) or you want multiple perspectives / a vote (reliability). This is exactly the workshop's Level 4.
5 · Orchestrator + specialists the workhorse
A central orchestrator breaks the task into subtasks at runtime, delegates each to a specialist worker, then synthesizes the results. Unlike a chain, the subtasks aren't known in advance — the orchestrator decides.
Use when: the task is complex and the needed steps can't be predicted up front. This is the most common production pattern for multi-agent systems.
6 · Evaluator–optimizer generate ⇄ critique
One agent produces a result; a second agent critiques it against criteria; the first revises. Loop until it's good enough — like a writer and an editor.
Use when: you have clear quality criteria and revision measurably helps (code, translations, structured writing).
Scaling up: multi-agent topologies
When one team of specialists isn't enough, these describe how many agents relate to each other. Rule of thumb: default to orchestrator/hierarchical; reach for peer patterns only when you truly need them.
| Topology | Shape | Control | Best for |
|---|---|---|---|
| Hierarchical | Supervisors of supervisors — a tree of managers → workers | High | Large, multi-domain tasks (20+ agents) that blow past one context window |
| Swarm | Autonomous peers coordinating through shared state, no boss | Low | Open-ended exploration; many branches searched in parallel |
| Mesh | A few peers (3–8) with direct, persistent connections | Medium | Tight collaborative iteration on a shared artifact (e.g. review loops) |
Every layer of agents adds latency, cost, and places to fail. Begin with a single agent. Move to a workflow (chain/route/parallel) only when the path is clear, and to an orchestrator only when it isn't. These patterns are composable Lego bricks — reach for the fewest that solve the problem.
Connecting tools at scale: MCP
Writing every tool by hand doesn't scale. The Model Context Protocol (MCP) is an open standard — think “USB-C for AI tools” — that lets your agent plug into ready-made servers that expose tools and data: GitHub, Slack, databases, the filesystem, a browser, and hundreds more.
Write once, reuse everywhere
An MCP server built for one app works with any MCP-aware agent — regardless of framework or model. No more re-implementing the same integration.
Instant capabilities
Point your agent at a server and it gains all of that server's tools at once — search, file access, API calls — without you coding each one.
from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio
# Connect to an MCP server (here: a filesystem server started over stdio).
fs = MCPServerStdio("npx", args=["-y", "@modelcontextprotocol/server-filesystem", "."])
agent = Agent("google-gla:gemini-2.0-flash", toolsets=[fs])
async def main():
async with agent: # opens & closes the MCP connections
result = await agent.run("List the Python files here and summarize each.")
print(result.output)
Your agent can be an MCP client (consuming other people's tools, above) and you can expose your own tools as an MCP server for others. The exact PydanticAI class names evolve quickly — check ai.pydantic.dev/mcp for your version.
Getting free tokens for testing
You don't need a credit card to learn this. Two free paths — and a way to cap your usage so an agent loop can't run away.
🟢 Google Gemini — free tier
Get a free key from Google AI Studio. The free tier gives you a generous number of requests per minute and per day on models like gemini-2.0-flash — perfect for testing agents.
- Sign in with a Google account
- Click Create API key
- Set it as an environment variable (never paste it into code you'll share):
export GEMINI_API_KEY="your-key-here"
🔵 OpenRouter — free models
One key at openrouter.ai/keys reaches many models. Models tagged :free cost nothing (with rate limits). Great for comparing brains without changing your code.
Point PydanticAI at OpenRouter's OpenAI-compatible endpoint https://openrouter.ai/api/v1 and pick a free model such as meta-llama/llama-3.3-70b-instruct:free.
Because an agent loops, a bug can make it call the model over and over and burn your free quota. PydanticAI lets you set a hard ceiling with UsageLimits. When it's hit, it raises UsageLimitExceeded — which you catch, exactly like in the example below.
Simple live example
A tiny agent with one tool, a free model, a structured result, and a request cap. This is the full shape of everything we'll build in the workshop.
from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits
from pydantic_ai.exceptions import UsageLimitExceeded
# 1. Describe the structured answer we want back (validated automatically).
class Answer(BaseModel):
reasoning: str = Field(description="Short explanation of how we got here")
result: int = Field(description="The final numeric answer")
# 2. Build the agent. Model id is 'provider:model' — swap it in one line.
# Reads GEMINI_API_KEY from the environment; nothing secret in the code.
agent = Agent(
"google-gla:gemini-2.0-flash",
output_type=Answer,
system_prompt="You are a precise assistant. Use the tools; do not do math yourself.",
)
# 3. A tool: a normal typed function. The docstring tells the model what it does.
@agent.tool_plain
def add(a: int, b: int) -> int:
"""Add two integers and return the sum."""
return a + b
# 4. Run it — with a hard ceiling of 5 model requests so the loop can't run away.
try:
out = agent.run_sync(
"What is 21 + 21? Use the add tool.",
usage_limits=UsageLimits(request_limit=5),
)
print(out.output.result) # -> 42
print(out.output.reasoning)
except UsageLimitExceeded as e:
print("Hit the request cap:", e)
1. A BaseModel defines the output — you get typed, validated data, not a blob of text. 2. The model id is just a string; change providers without touching logic. 3. A tool is a plain function; the docstring is its API. 4. UsageLimits caps the loop and UsageLimitExceeded is your safety valve.
Same agent, but through OpenRouter (free model)
import os
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider
client = AsyncOpenAI(
api_key=os.environ["OPENROUTER_API_KEY"], # from env, not hard-coded
base_url="https://openrouter.ai/api/v1",
)
model = OpenAIModel(
"meta-llama/llama-3.3-70b-instruct:free", # a free model
provider=OpenAIProvider(openai_client=client),
)
agent = Agent(model, system_prompt="You are a helpful assistant.")
print(agent.run_sync("Say hello in one word.").output)
Notice: only the model wiring changed. Your tools, output types, and logic stay identical.
Save as first_agent.py, pip install pydantic-ai, set GEMINI_API_KEY, then python first_agent.py. Watch it call the tool and return 42 — then break the tool on purpose and watch the request cap catch the runaway loop.
What goes wrong — and how the harness guards it
Agents fail in ways a single function never does. Knowing the failure modes is half of building a good harness.
| Failure mode | What it looks like | Guard |
|---|---|---|
| Runaway loop | Keeps calling tools / the model forever, burning tokens | UsageLimits(request_limit=…) |
| Bad tool call | Wrong arguments, or the tool errors on the model's input | Validate inputs; raise ModelRetry to let it try again |
| Hallucinated output | Confident answer that's simply wrong or made-up | Structured output + validation; an evaluator step |
| Prompt injection | Text fetched by a tool contains “ignore your instructions…” | Treat tool results as untrusted data, not commands |
| Unsafe action | About to delete / send / spend something irreversible | Human-in-the-loop approval for high-stakes tools |
| Silent regressions | Worked yesterday, worse today after a prompt tweak | Observability (tracing) + evals that run on every change |
from pydantic_ai import ModelRetry
@agent.tool_plain
def get_user(user_id: int) -> dict:
"""Look up a user by id."""
user = db.get(user_id)
if user is None:
# Don't crash — tell the model to fix its input and try again.
raise ModelRetry(f"No user with id {user_id}. Ask for a valid id.")
return user
Assume the model will misbehave, and design so that when it does, the harness catches it cheaply. That's the difference between a fun demo and something you'd let touch real data.