In class · frontal Part 2 of 2 · ~90 minutes

Building AI agents with PydanticAI

You've used AI to write code. Now we build AI that acts: an agent that can call tools, loop, and get real work done — not just answer once. We'll cover the what and the why, how to run it for free, and a small live example.

What is an agent?

A plain LLM call is a single question and a single answer. An agent wraps the model in a loop and gives it tools — functions it can decide to call. The model reasons, calls a tool, reads the result, and keeps going until the task is done.

Promptyour task

→

LLMdecide next step

→

Call a toolsearch, run code…

→

Read resultobserve

↺

Final answerstructured

The loop (reason → act → observe) is what makes it an agent. Remove the loop and the tools, and you're back to a single chat message.

What is a tool?

A tool is just a normal Python function you register with the agent. The model sees its name, arguments, and docstring, and decides when to call it. You control what it can actually do.

tool example

from pydantic_ai import Agent

agent = Agent("google-gla:gemini-2.0-flash")

@agent.tool_plain
def word_count(text: str) -> int:
    """Return the number of words in the given text."""
    return len(text.split())

The docstring is not a comment — it's the tool's instruction manual for the model. Clear name + clear docstring + typed arguments = the model uses it correctly. Tools are how the model touches the real world: search the web, query a database, run code, hit an API, read a file.

Why an agent beats a single LLM request

🗨️ One LLM request

Answers only from training data (may be stale or wrong)
No access to your data, the web, or live systems
Can't verify its own work
Silently makes things up when it doesn't know
One shot — no recovery from a mistake

🤖 An agent with tools

Fetches fresh, real information via tools
Grounds answers in your sources
Checks results and retries when a tool fails
Breaks a big task into steps and loops until done
Returns validated, structured output you can trust

The motivation, in one line

An LLM alone is a brilliant intern with no phone, no internet, and no notebook. Tools give it the phone and the notebook; the loop lets it actually finish the job.

What the benchmarks show

The industry measures agents on tasks that require doing, not just talking. A few worth naming in class:

Benchmark	What it measures	Why it matters
SWE-bench	Resolving real GitHub issues in real repos	Tool-using agents massively outperform single prompts — they read files, run tests, and iterate
GAIA	Multi-step reasoning with web + file tools	Questions humans find easy but a lone LLM can't answer without tools
τ-bench	Tool use in realistic customer-service flows	Tests whether the agent follows rules and calls the right tool at the right time

The consistent finding: giving a model tools + a loop lifts task-completion rates far beyond what better prompting alone can do. The gap is the whole reason agents exist.

Your stack: framework, model, harness

Three independent choices. The model and framework are easy to swap — the harness is where the real engineering lives, and it's what actually makes an agent good.

🧱

Framework

The library that runs the loop and wires up tools:

We use PydanticAI: typed, small, validated outputs for free.

🧠

Model

The brain doing the reasoning — swappable in one line:

You reach it via a gateway: direct API, OpenRouter (one key, many models), or local Ollama. That's plumbing — not the harness.

🛠️

Harness the point

Everything around the model that turns it into a dependable agent:

Same model + same framework, but a strong harness is the difference between a demo and something you'd ship.

Why the harness is the agent

Two teams can use the identical model and framework and get wildly different results. The gap is the harness — the scaffolding you build around the loop:

Piece of the harness	What it does	Why it matters
Tools	The actions the agent can take — search, run code, query a DB, call an API	Defines what the agent is even capable of doing
Observability	Tracing every step: prompts, tool calls, tokens, latency, errors (e.g. Logfire)	You can't fix what you can't see — this is how you debug and improve
Tests & evals	Automatically checking the agent still behaves as inputs and prompts change	Turns “seemed fine” into measurable, repeatable quality
Retries & guardrails	Recovering from bad tool output, validating results, capping usage	Keeps the agent safe, bounded, and reliable in the real world
Memory & context	What the agent remembers within and across runs	Lets it handle long, multi-step tasks instead of forgetting

🧩 The mindset

Picking a model is a one-line decision. Building the harness — good tools, tracing you trust, tests that catch regressions, sane guardrails — is the actual work, and it's what makes one agent great and another useless.

Agent architectures

Once you have one agent, the next question is how to combine them. There's a ladder of patterns from a single loop up to teams of specialists — you pick the simplest one that solves your problem.

First distinction: workflow vs. agent

🧭 Workflow

You wire the steps in code. The path is predefined — the LLM fills in each step, but you decide the flow.

Predictable, cheaper, easier to debug.

🤖 Agent

The model decides what to do next and which tools to call, looping until done. The path emerges at runtime.

Flexible, handles the unknown — but slower and pricier.

Both are built from the same brick: the augmented LLM — a model with tools, memory, and retrieval (that's your harness). Everything below is just different ways of arranging those bricks.

The core patterns

1 · Single agent start here

One augmented LLM in a loop with its tools. Reasons, calls a tool, observes, repeats until the task is done.

Prompt

→

Agent⟳ loop + tools

→

Answer

Use when: the task fits one context and one skill set. 90% of the time, this is enough.

2 · Prompt chain sub-agents in a flow

Break the task into fixed, ordered steps. Each step's output feeds the next — a pipeline you define in code (a workflow).

Draftstep 1

→

Checkstep 2

→

Polishstep 3

→

Result

Use when: the steps are known and stable. Trades a little latency for a lot of accuracy. Add a gate between steps to catch failures early.

3 · Routing classify → handoff

A router classifies the input and hands it to the right specialist. Separates concerns so each handler stays simple (and you can use a cheaper model for easy cases).

Routerclassify

↓

Billing

Technical

Sales

one input → the one right specialist

Use when: inputs fall into distinct categories that need different handling.

4 · Parallelization fan-out → fan-in

Run several LLM calls at once, then combine. Two flavors: sectioning (split a task into independent parts) and voting (run the same task several times and aggregate for confidence).

Split / dispatch

↓ in parallel ↓

Worker A

Worker B

Worker C

↓

Aggregate

Use when: subtasks are independent (speed) or you want multiple perspectives / a vote (reliability). This is exactly the workshop's Level 4.

5 · Orchestrator + specialists the workhorse

A central orchestrator breaks the task into subtasks at runtime, delegates each to a specialist worker, then synthesizes the results. Unlike a chain, the subtasks aren't known in advance — the orchestrator decides.

Orchestratorplan · delegate · synthesize

↕ delegates & collects ↕

Researcher

Coder

Writer

Use when: the task is complex and the needed steps can't be predicted up front. This is the most common production pattern for multi-agent systems.

6 · Evaluator–optimizer generate ⇄ critique

One agent produces a result; a second agent critiques it against criteria; the first revises. Loop until it's good enough — like a writer and an editor.

Generatorproduce

⇄

Evaluatorcritique

→

Final

loop until the critic is satisfied

Use when: you have clear quality criteria and revision measurably helps (code, translations, structured writing).

Scaling up: multi-agent topologies

When one team of specialists isn't enough, these describe how many agents relate to each other. Rule of thumb: default to orchestrator/hierarchical; reach for peer patterns only when you truly need them.

Topology	Shape	Control	Best for
Hierarchical	Supervisors of supervisors — a tree of managers → workers	High	Large, multi-domain tasks (20+ agents) that blow past one context window
Swarm	Autonomous peers coordinating through shared state, no boss	Low	Open-ended exploration; many branches searched in parallel
Mesh	A few peers (3–8) with direct, persistent connections	Medium	Tight collaborative iteration on a shared artifact (e.g. review loops)

🧭 The one rule that matters: start simple

Every layer of agents adds latency, cost, and places to fail. Begin with a single agent. Move to a workflow (chain/route/parallel) only when the path is clear, and to an orchestrator only when it isn't. These patterns are composable Lego bricks — reach for the fewest that solve the problem.

Connecting tools at scale: MCP

Writing every tool by hand doesn't scale. The Model Context Protocol (MCP) is an open standard — think “USB-C for AI tools” — that lets your agent plug into ready-made servers that expose tools and data: GitHub, Slack, databases, the filesystem, a browser, and hundreds more.

🔌

Write once, reuse everywhere

An MCP server built for one app works with any MCP-aware agent — regardless of framework or model. No more re-implementing the same integration.

🧰

Instant capabilities

Point your agent at a server and it gains all of that server's tools at once — search, file access, API calls — without you coding each one.

agent_with_mcp.py — illustrative

from pydantic_ai import Agent
from pydantic_ai.mcp import MCPServerStdio

# Connect to an MCP server (here: a filesystem server started over stdio).
fs = MCPServerStdio("npx", args=["-y", "@modelcontextprotocol/server-filesystem", "."])

agent = Agent("google-gla:gemini-2.0-flash", toolsets=[fs])

async def main():
    async with agent:            # opens & closes the MCP connections
        result = await agent.run("List the Python files here and summarize each.")
        print(result.output)

🔗 Two sides of MCP

Your agent can be an MCP client (consuming other people's tools, above) and you can expose your own tools as an MCP server for others. The exact PydanticAI class names evolve quickly — check ai.pydantic.dev/mcp for your version.

Getting free tokens for testing

You don't need a credit card to learn this. Two free paths — and a way to cap your usage so an agent loop can't run away.

🟢 Google Gemini — free tier

Get a free key from Google AI Studio. The free tier gives you a generous number of requests per minute and per day on models like gemini-2.0-flash — perfect for testing agents.

Sign in with a Google account
Click Create API key
Set it as an environment variable (never paste it into code you'll share):

terminal

export GEMINI_API_KEY="your-key-here"

🔵 OpenRouter — free models

One key at openrouter.ai/keys reaches many models. Models tagged :free cost nothing (with rate limits). Great for comparing brains without changing your code.

Point PydanticAI at OpenRouter's OpenAI-compatible endpoint https://openrouter.ai/api/v1 and pick a free model such as meta-llama/llama-3.3-70b-instruct:free.

🚦 Cap the loop: “a specific number of requests”

Because an agent loops, a bug can make it call the model over and over and burn your free quota. PydanticAI lets you set a hard ceiling with UsageLimits. When it's hit, it raises UsageLimitExceeded — which you catch, exactly like in the example below.

Simple live example

A tiny agent with one tool, a free model, a structured result, and a request cap. This is the full shape of everything we'll build in the workshop.

first_agent.py

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.usage import UsageLimits
from pydantic_ai.exceptions import UsageLimitExceeded

# 1. Describe the structured answer we want back (validated automatically).
class Answer(BaseModel):
    reasoning: str = Field(description="Short explanation of how we got here")
    result: int = Field(description="The final numeric answer")

# 2. Build the agent. Model id is 'provider:model' — swap it in one line.
#    Reads GEMINI_API_KEY from the environment; nothing secret in the code.
agent = Agent(
    "google-gla:gemini-2.0-flash",
    output_type=Answer,
    system_prompt="You are a precise assistant. Use the tools; do not do math yourself.",
)

# 3. A tool: a normal typed function. The docstring tells the model what it does.
@agent.tool_plain
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

# 4. Run it — with a hard ceiling of 5 model requests so the loop can't run away.
try:
    out = agent.run_sync(
        "What is 21 + 21? Use the add tool.",
        usage_limits=UsageLimits(request_limit=5),
    )
    print(out.output.result)      # -> 42
    print(out.output.reasoning)
except UsageLimitExceeded as e:
    print("Hit the request cap:", e)

🔍 Read the four moves

1. A BaseModel defines the output — you get typed, validated data, not a blob of text. 2. The model id is just a string; change providers without touching logic. 3. A tool is a plain function; the docstring is its API. 4. UsageLimits caps the loop and UsageLimitExceeded is your safety valve.

Same agent, but through OpenRouter (free model)

openrouter_agent.py

import os
from openai import AsyncOpenAI
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

client = AsyncOpenAI(
    api_key=os.environ["OPENROUTER_API_KEY"],   # from env, not hard-coded
    base_url="https://openrouter.ai/api/v1",
)
model = OpenAIModel(
    "meta-llama/llama-3.3-70b-instruct:free",     # a free model
    provider=OpenAIProvider(openai_client=client),
)
agent = Agent(model, system_prompt="You are a helpful assistant.")
print(agent.run_sync("Say hello in one word.").output)

Notice: only the model wiring changed. Your tools, output types, and logic stay identical.

▶️ Run it live in class

Save as first_agent.py, pip install pydantic-ai, set GEMINI_API_KEY, then python first_agent.py. Watch it call the tool and return 42 — then break the tool on purpose and watch the request cap catch the runaway loop.

What goes wrong — and how the harness guards it

Agents fail in ways a single function never does. Knowing the failure modes is half of building a good harness.

Failure mode	What it looks like	Guard
Runaway loop	Keeps calling tools / the model forever, burning tokens	`UsageLimits(request_limit=…)`
Bad tool call	Wrong arguments, or the tool errors on the model's input	Validate inputs; raise `ModelRetry` to let it try again
Hallucinated output	Confident answer that's simply wrong or made-up	Structured output + validation; an evaluator step
Prompt injection	Text fetched by a tool contains “ignore your instructions…”	Treat tool results as untrusted data, not commands
Unsafe action	About to delete / send / spend something irreversible	Human-in-the-loop approval for high-stakes tools
Silent regressions	Worked yesterday, worse today after a prompt tweak	Observability (tracing) + evals that run on every change

retry a bad tool call

from pydantic_ai import ModelRetry

@agent.tool_plain
def get_user(user_id: int) -> dict:
    """Look up a user by id."""
    user = db.get(user_id)
    if user is None:
        # Don't crash — tell the model to fix its input and try again.
        raise ModelRetry(f"No user with id {user_id}. Ask for a valid id.")
    return user

🧠 The mindset

Assume the model will misbehave, and design so that when it does, the harness catches it cheaply. That's the difference between a fun demo and something you'd let touch real data.