Back to posts

Tutorial: Build an AI Agent from Scratch with Ollama and Python

Tutorial: Build an AI Agent from Scratch with Ollama and Python

This tutorial builds a small AI agent from scratch in plain Python. It runs against a local Ollama model, stores everything in SQLite, and uses no frameworks. We start with a single LLM call and layer on the four patterns that make it an agent: tools, memory, planning, and reflection.

By the end you will have:

  • A local personal-planner agent you talk to from the terminal. It can add, list, and complete tasks; it remembers facts about you across sessions; it plans before acting and reflects on what it did.
  • A working mental model of what an agent framework actually does for you, so you can decide when to reach for one and when not to.
  • Roughly 300 lines of Python, all stdlib plus the ollama client.

All the code lives at github.com/devdaviddr/personal-planner-agent if you want to clone-and-run before reading.

A taste of what the finished agent looks like in use — note that it remembers a preference you told it weeks earlier and uses it to answer a question that has nothing to do with the original turn:

you> i have wednesday afternoons free for meetings
bot> Noted.

# ...some time later, new session...

you> when should i schedule the dentist?

bot> Plan:
     - recall any free-time preferences
     - resolve "next Wednesday" via get_today
     - propose a date
[recall("when is the user free for appointments") → "Wednesday afternoons"]
[get_today() → 2026-05-24]
     You mentioned Wednesday afternoons are free. Next Wednesday is
     2026-05-27 — want me to add it?

What is an agent, really?

Strip the term down and an agent is four things in a loop:

  1. An LLM that picks the next action.
  2. A set of tools the LLM can call (functions, basically).
  3. Memory so it carries state across turns and sessions.
  4. A control loop that keeps calling the LLM until it's done.

Everything else — planning, reflection, multi-agent orchestration, retrieval — is a refinement of one of those four. The shape of the loop itself (think → act → observe → think again) is often called ReAct, and it's the load-bearing structure of every agent system, from one-file scripts to multi-agent orchestration platforms. The rest of this article introduces each refinement only after showing what visibly breaks without it.


What you will build

A single Python program. It opens a terminal REPL, persists everything to a local planner.db SQLite file, and talks to Ollama for both chat completions and embeddings.

flowchart LR user(["you<br/>(terminal REPL)"]) agent["agent loop<br/>(Python)"] ollama[("Ollama<br/>qwen3:8b<br/>nomic-embed-text")] db[("SQLite<br/>planner.db")] user <--> agent agent <-->|chat + embeddings| ollama agent <-->|tasks · messages · memories| db classDef external fill:#eef2f7,stroke:#6b7280,color:#15171a classDef internal fill:#dbeafe,stroke:#2563eb,color:#15171a class user,ollama external class agent,db internal

SQLite holds three tables:

  • tasks — the planner's domain data (title, due date, status).
  • messages — conversation history, one row per turn, keyed by session.
  • memories — long-term facts about you, stored with an embedding for retrieval.

That's the whole system. We will build it up one layer at a time.


Prerequisites

Requirement Notes
Python 3.11+ Type hints and match are used throughout.
Ollama Install from https://ollama.com. CPU works but is slow; a GPU with 8+ GB VRAM makes the experience interactive.
The ollama Python client pip install 'ollama>=0.4'. The only third-party dependency. Older clients expose embeddings() instead of embed() and will KeyError on the code below.
A terminal We will run a REPL. Any shell.

Pull the two models we will use:

ollama pull qwen3:8b
ollama pull nomic-embed-text
ollama pull qwen3:8b
ollama pull nomic-embed-text

qwen3:8b is a small, reliable tool-calling model. If you have less VRAM, llama3.2:3b works but trips on tool schemas more often. nomic-embed-text is a 768-dim embedding model used for long-term memory retrieval.


The 30-line naive agent

Start with the smallest possible thing that calls an LLM:

# v1_naive.py
import ollama

MODEL = "qwen3:8b"

def chat(prompt: str) -> str:
    res = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return res["message"]["content"]

if __name__ == "__main__":
    while (user := input("you> ").strip()):
        print(f"bot> {chat(user)}\n")
# v1_naive.py
import ollama

MODEL = "qwen3:8b"

def chat(prompt: str) -> str:
    res = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return res["message"]["content"]

if __name__ == "__main__":
    while (user := input("you> ").strip()):
        print(f"bot> {chat(user)}\n")

Run it:

you> what's the capital of France?
bot> Paris.

you> add a task to buy milk tomorrow
bot> Sure! I've added "buy milk" to your task list for tomorrow.

The second reply is a lie. There is no task list. There is no tomorrow — the model has no way to do anything beyond producing text. It also forgets the previous turn the moment the next one starts, because we send a fresh single-message history every time.

This is the baseline. Everything from here on is fixing a specific failure of this version.


Tools — letting the agent do things

A tool is a Python function the LLM can decide to call. The agent loop is responsible for advertising those functions to the model (as JSON Schema), watching for tool_calls in the response, executing them, and feeding the results back.

The loop never decides what to do — the model does. The loop is purely mechanical: it dispatches whatever the model asks for, feeds the result back, and asks again. The model stops by producing a turn with no tool calls. This is the single most important idea in this section; everything below is plumbing for it.

Define a small registry:

# tools.py
import json, sqlite3
from datetime import date

TOOLS: dict[str, dict] = {}

def tool(name: str, description: str, schema: dict):
    def decorator(fn):
        TOOLS[name] = {"description": description, "schema": schema, "fn": fn}
        return fn
    return decorator

def tool_specs() -> list[dict]:
    return [
        {
            "type": "function",
            "function": {
                "name": name,
                "description": t["description"],
                "parameters": t["schema"],
            },
        }
        for name, t in TOOLS.items()
    ]
# tools.py
import json, sqlite3
from datetime import date

TOOLS: dict[str, dict] = {}

def tool(name: str, description: str, schema: dict):
    def decorator(fn):
        TOOLS[name] = {"description": description, "schema": schema, "fn": fn}
        return fn
    return decorator

def tool_specs() -> list[dict]:
    return [
        {
            "type": "function",
            "function": {
                "name": name,
                "description": t["description"],
                "parameters": t["schema"],
            },
        }
        for name, t in TOOLS.items()
    ]

Three task tools and a get_today (LLMs don't know the date — making it a tool is the lesson):

db = sqlite3.connect("planner.db")
db.row_factory = sqlite3.Row
db.executescript("""
CREATE TABLE IF NOT EXISTS tasks (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title TEXT NOT NULL,
  due_date TEXT,
  status TEXT NOT NULL DEFAULT 'open',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")

@tool("add_task", "Add a task to the planner.", {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "due_date": {"type": "string", "description": "ISO 8601 date, e.g. 2026-05-28"},
    },
    "required": ["title"],
})
def add_task(title: str, due_date: str | None = None) -> dict:
    cur = db.execute(
        "INSERT INTO tasks (title, due_date) VALUES (?, ?)", (title, due_date),
    )
    db.commit()
    return {"id": cur.lastrowid, "title": title, "due_date": due_date}

@tool("list_tasks", "List open tasks.", {"type": "object", "properties": {}})
def list_tasks() -> list[dict]:
    rows = db.execute(
        "SELECT id, title, due_date FROM tasks WHERE status = 'open' "
        "ORDER BY due_date IS NULL, due_date",  # push null-due tasks to the bottom
    ).fetchall()
    return [dict(r) for r in rows]

@tool("complete_task", "Mark a task complete.", {
    "type": "object",
    "properties": {"id": {"type": "integer"}},
    "required": ["id"],
})
def complete_task(id: int) -> dict:
    db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (id,))
    db.commit()
    return {"id": id, "status": "done"}

@tool("get_today", "Get today's date in ISO 8601.", {"type": "object", "properties": {}})
def get_today() -> dict:
    return {"date": date.today().isoformat()}
db = sqlite3.connect("planner.db")
db.row_factory = sqlite3.Row
db.executescript("""
CREATE TABLE IF NOT EXISTS tasks (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  title TEXT NOT NULL,
  due_date TEXT,
  status TEXT NOT NULL DEFAULT 'open',
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
""")

@tool("add_task", "Add a task to the planner.", {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "due_date": {"type": "string", "description": "ISO 8601 date, e.g. 2026-05-28"},
    },
    "required": ["title"],
})
def add_task(title: str, due_date: str | None = None) -> dict:
    cur = db.execute(
        "INSERT INTO tasks (title, due_date) VALUES (?, ?)", (title, due_date),
    )
    db.commit()
    return {"id": cur.lastrowid, "title": title, "due_date": due_date}

@tool("list_tasks", "List open tasks.", {"type": "object", "properties": {}})
def list_tasks() -> list[dict]:
    rows = db.execute(
        "SELECT id, title, due_date FROM tasks WHERE status = 'open' "
        "ORDER BY due_date IS NULL, due_date",  # push null-due tasks to the bottom
    ).fetchall()
    return [dict(r) for r in rows]

@tool("complete_task", "Mark a task complete.", {
    "type": "object",
    "properties": {"id": {"type": "integer"}},
    "required": ["id"],
})
def complete_task(id: int) -> dict:
    db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (id,))
    db.commit()
    return {"id": id, "status": "done"}

@tool("get_today", "Get today's date in ISO 8601.", {"type": "object", "properties": {}})
def get_today() -> dict:
    return {"date": date.today().isoformat()}

Now the agent loop:

# agent.py
import json, ollama
from tools import TOOLS, tool_specs

MODEL = "qwen3:8b"
MAX_TURNS = 8

def to_dict(msg) -> dict:
    # Ollama returns Pydantic models. Convert to plain dicts so json.dumps
    # (and later, SQLite storage) work without a custom encoder.
    return msg.model_dump(exclude_none=True) if hasattr(msg, "model_dump") else msg

def run(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)

        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", "")

        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            messages.append({
                "role": "tool",
                "content": json.dumps(result),
                "tool_name": name,
            })
    return "I hit my tool-call limit."
# agent.py
import json, ollama
from tools import TOOLS, tool_specs

MODEL = "qwen3:8b"
MAX_TURNS = 8

def to_dict(msg) -> dict:
    # Ollama returns Pydantic models. Convert to plain dicts so json.dumps
    # (and later, SQLite storage) work without a custom encoder.
    return msg.model_dump(exclude_none=True) if hasattr(msg, "model_dump") else msg

def run(user_input: str) -> str:
    messages = [{"role": "user", "content": user_input}]
    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)

        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", "")

        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            messages.append({
                "role": "tool",
                "content": json.dumps(result),
                "tool_name": name,
            })
    return "I hit my tool-call limit."

The shape of one turn:

flowchart TD start([user message]) chat["ollama.chat<br/>(model + tools advertised)"] decide{tool_calls<br/>in response?} finalize([return reply]) dispatch[execute tool fn] push[append result as<br/>role: tool message] start --> chat chat --> decide decide -->|no| finalize decide -->|yes| dispatch dispatch --> push push --> chat classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class chat,dispatch,push step class start,finalize terminal class decide step

In code, "the model decides" looks like the conditional on calls: if the model returns no tool_calls, the loop returns its content. Otherwise it dispatches every call the model asked for, in order, and feeds each result back as a role: "tool" message before going around again.

A turn now looks like:

you> add a task to buy milk by friday
bot> Done — added "buy milk" with due date 2026-05-29.

The lie is gone. Real row in tasks, real due_date.

What's still broken: start a new run of the program. Ask "what tasks do I have?" The model has to call list_tasks from a cold start every time because we discard messages at the end of run. Worse, even within a single REPL session, the next user message gets none of the previous turn's context. The agent has no memory.


Short-term memory — conversation state

Persist messages to SQLite, keyed by session id. Load them on each turn. Trim to a budget.

CREATE TABLE IF NOT EXISTS messages (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  session TEXT NOT NULL,
  role TEXT NOT NULL,
  content TEXT,
  tool_calls TEXT,
  tool_name TEXT,
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
# memory.py
import json, sqlite3
from typing import Iterable

HISTORY_LIMIT = 40  # rough turn budget

def save_message(db: sqlite3.Connection, session: str, msg: dict) -> None:
    db.execute(
        "INSERT INTO messages (session, role, content, tool_calls, tool_name) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            session,
            msg["role"],
            msg.get("content"),
            json.dumps(msg["tool_calls"]) if msg.get("tool_calls") else None,
            msg.get("tool_name"),
        ),
    )
    db.commit()

def load_history(db: sqlite3.Connection, session: str) -> list[dict]:
    rows = db.execute(
        "SELECT role, content, tool_calls, tool_name FROM messages "
        "WHERE session = ? ORDER BY id DESC LIMIT ?",
        (session, HISTORY_LIMIT),
    ).fetchall()
    msgs = []
    for r in reversed(rows):
        m = {"role": r["role"]}
        if r["content"] is not None:
            m["content"] = r["content"]
        if r["tool_calls"]:
            m["tool_calls"] = json.loads(r["tool_calls"])
        if r["tool_name"]:
            m["tool_name"] = r["tool_name"]
        msgs.append(m)
    return trim_to_user_boundary(msgs)

def trim_to_user_boundary(msgs: list[dict]) -> list[dict]:
    # Tool-calling APIs require an assistant message with tool_calls to be
    # immediately followed by role: tool messages for each call. After a
    # window slice we have to guard both ends:
    #   (1) start on a user message — drop any orphan tool/assistant prefix
    #   (2) drop trailing orphans: a role:tool with no live assistant before
    #       it, or an assistant whose tool_calls were never answered.
    start = next((i for i, m in enumerate(msgs) if m["role"] == "user"), None)
    if start is None:
        return []
    msgs = msgs[start:]
    while msgs and (
        msgs[-1].get("role") == "tool"
        or (msgs[-1].get("role") == "assistant" and msgs[-1].get("tool_calls"))
    ):
        msgs.pop()
    return msgs
# memory.py
import json, sqlite3
from typing import Iterable

HISTORY_LIMIT = 40  # rough turn budget

def save_message(db: sqlite3.Connection, session: str, msg: dict) -> None:
    db.execute(
        "INSERT INTO messages (session, role, content, tool_calls, tool_name) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            session,
            msg["role"],
            msg.get("content"),
            json.dumps(msg["tool_calls"]) if msg.get("tool_calls") else None,
            msg.get("tool_name"),
        ),
    )
    db.commit()

def load_history(db: sqlite3.Connection, session: str) -> list[dict]:
    rows = db.execute(
        "SELECT role, content, tool_calls, tool_name FROM messages "
        "WHERE session = ? ORDER BY id DESC LIMIT ?",
        (session, HISTORY_LIMIT),
    ).fetchall()
    msgs = []
    for r in reversed(rows):
        m = {"role": r["role"]}
        if r["content"] is not None:
            m["content"] = r["content"]
        if r["tool_calls"]:
            m["tool_calls"] = json.loads(r["tool_calls"])
        if r["tool_name"]:
            m["tool_name"] = r["tool_name"]
        msgs.append(m)
    return trim_to_user_boundary(msgs)

def trim_to_user_boundary(msgs: list[dict]) -> list[dict]:
    # Tool-calling APIs require an assistant message with tool_calls to be
    # immediately followed by role: tool messages for each call. After a
    # window slice we have to guard both ends:
    #   (1) start on a user message — drop any orphan tool/assistant prefix
    #   (2) drop trailing orphans: a role:tool with no live assistant before
    #       it, or an assistant whose tool_calls were never answered.
    start = next((i for i, m in enumerate(msgs) if m["role"] == "user"), None)
    if start is None:
        return []
    msgs = msgs[start:]
    while msgs and (
        msgs[-1].get("role") == "tool"
        or (msgs[-1].get("role") == "assistant" and msgs[-1].get("tool_calls"))
    ):
        msgs.pop()
    return msgs

The trim function is the one piece of subtlety. Naïve sliding-window truncation will sometimes cut between an assistant message that contains tool_calls and the role: tool messages that satisfy those calls. The next API request will reject that history with a confusing error. The fix is to guard both ends: skip any leading orphan tool or assistant fragments until we land on a user message, and drop any trailing assistant whose tool_calls were never answered.

The loop now uses the DB as the single source of truth — save the user message first, then load history, then run the turn:

# agent.py
from tools import TOOLS, tool_specs, db
from memory import save_message, load_history

SYSTEM = {
    "role": "system",
    "content": (
        "You are a personal planner. The user manages tasks through you. "
        "Use get_today before reasoning about relative dates like 'tomorrow' or 'next Tuesday'. "
        "Keep replies short."
    ),
}

def run(session: str, user_input: str) -> str:
    save_message(db, session, {"role": "user", "content": user_input})
    messages = [SYSTEM, *load_history(db, session)]

    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)
        save_message(db, session, msg)

        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", "")

        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            tool_msg = {
                "role": "tool", "content": json.dumps(result), "tool_name": name,
            }
            messages.append(tool_msg)
            save_message(db, session, tool_msg)

    return "I hit my tool-call limit."
# agent.py
from tools import TOOLS, tool_specs, db
from memory import save_message, load_history

SYSTEM = {
    "role": "system",
    "content": (
        "You are a personal planner. The user manages tasks through you. "
        "Use get_today before reasoning about relative dates like 'tomorrow' or 'next Tuesday'. "
        "Keep replies short."
    ),
}

def run(session: str, user_input: str) -> str:
    save_message(db, session, {"role": "user", "content": user_input})
    messages = [SYSTEM, *load_history(db, session)]

    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)
        save_message(db, session, msg)

        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", "")

        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            tool_msg = {
                "role": "tool", "content": json.dumps(result), "tool_name": name,
            }
            messages.append(tool_msg)
            save_message(db, session, tool_msg)

    return "I hit my tool-call limit."

Restart the REPL. Ask "what was the last task I added?" The agent answers from history. It remembers within and across sessions, up to HISTORY_LIMIT turns.

What's still broken: the agent remembers the conversation but it has no notion of facts about you. Tell it "I have Wednesday afternoons free" today and ask "when should I schedule a meeting?" three weeks from now and it has no idea — that turn was trimmed out of the window long ago. Some things need to outlive the conversation buffer.


Long-term memory — facts that outlive the window

Short-term memory is indexed by recency; long-term memory is indexed by meaning. You need both — conversations end but facts shouldn't. The data structure for "indexed by meaning" is an embedding: a fixed-length list of floats produced by a dedicated embedding model (here, nomic-embed-text), trained so that two pieces of text with similar meaning land near each other in vector space. Search becomes a similarity comparison instead of a substring match.

Two new tools (remember and recall) give the agent control over what to save and when to retrieve.

CREATE TABLE IF NOT EXISTS memories (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  text TEXT NOT NULL,
  embedding BLOB NOT NULL,
  created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Embeddings via Ollama, packed into the BLOB column as raw little-endian floats. Cosine similarity is a one-liner in pure Python — slow at a million rows, fine for a personal planner.

# embeddings.py
import struct, ollama

EMBED_MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    res = ollama.embed(model=EMBED_MODEL, input=text)
    return res["embeddings"][0]

def pack(vec: list[float]) -> bytes:
    return struct.pack(f"<{len(vec)}f", *vec)  # explicit little-endian float32

def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0
# embeddings.py
import struct, ollama

EMBED_MODEL = "nomic-embed-text"

def embed(text: str) -> list[float]:
    res = ollama.embed(model=EMBED_MODEL, input=text)
    return res["embeddings"][0]

def pack(vec: list[float]) -> bytes:
    return struct.pack(f"<{len(vec)}f", *vec)  # explicit little-endian float32

def unpack(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

The tools:

@tool("remember", "Store a durable fact about the user.", {
    "type": "object",
    "properties": {"fact": {"type": "string"}},
    "required": ["fact"],
})
def remember(fact: str) -> dict:
    vec = embed(fact)
    db.execute(
        "INSERT INTO memories (text, embedding) VALUES (?, ?)", (fact, pack(vec)),
    )
    db.commit()
    return {"ok": True, "fact": fact}

@tool("recall", "Search long-term memory by meaning.", {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "k": {"type": "integer", "default": 3},
    },
    "required": ["query"],
})
def recall(query: str, k: int = 3) -> list[dict]:
    qv = embed(query)
    rows = db.execute("SELECT text, embedding FROM memories").fetchall()
    scored = sorted(
        ((cosine(qv, unpack(r["embedding"])), r["text"]) for r in rows),
        reverse=True,
    )
    return [{"text": t, "score": round(s, 3)} for s, t in scored[:k]]
@tool("remember", "Store a durable fact about the user.", {
    "type": "object",
    "properties": {"fact": {"type": "string"}},
    "required": ["fact"],
})
def remember(fact: str) -> dict:
    vec = embed(fact)
    db.execute(
        "INSERT INTO memories (text, embedding) VALUES (?, ?)", (fact, pack(vec)),
    )
    db.commit()
    return {"ok": True, "fact": fact}

@tool("recall", "Search long-term memory by meaning.", {
    "type": "object",
    "properties": {
        "query": {"type": "string"},
        "k": {"type": "integer", "default": 3},
    },
    "required": ["query"],
})
def recall(query: str, k: int = 3) -> list[dict]:
    qv = embed(query)
    rows = db.execute("SELECT text, embedding FROM memories").fetchall()
    scored = sorted(
        ((cosine(qv, unpack(r["embedding"])), r["text"]) for r in rows),
        reverse=True,
    )
    return [{"text": t, "score": round(s, 3)} for s, t in scored[:k]]

The data path:

flowchart LR subgraph short[Short-term memory] msgs[(messages<br/>table)] trim[trim on user<br/>boundary] msgs --> trim end subgraph long[Long-term memory] mem[(memories<br/>text + embedding)] emb[nomic-embed-text] cos[cosine similarity<br/>in Python] end turn([user turn]) --> trim trim --> ctx[message context<br/>sent to LLM] turn -->|"remember(fact)"| emb1[nomic-embed-text] emb1 --> mem turn -->|"recall(query)"| emb emb --> cos mem --> cos cos --> ctx classDef store fill:#fef3c7,stroke:#d97706,color:#15171a classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class msgs,mem store class trim,emb,emb1,cos,ctx step class turn terminal

Two design choices worth flagging:

  • The model controls both writes and reads. It calls remember when a fact seems worth keeping and recall when it suspects relevant context exists. The system prompt is updated to nudge it: "If the user states a durable preference or fact about themselves, call remember. If a question would benefit from past context, call recall." The alternative — extracting memories automatically in a background pass — is cleaner architecturally but takes more code; the tool-driven version is the from-scratch lesson.
  • Cosine similarity is computed in Python, not in SQL. SQLite has no native vector type, and that's the point — seeing the loop makes clear what a vector DB is actually doing for you. At a few thousand rows this is fine. Past ~100k rows you want a real vector index.

Now the planner can carry facts forward indefinitely:

you> i have wednesday afternoons free for meetings
bot> Noted.
[remember(fact="user has Wednesday afternoons free for meetings")]

# ...some weeks later, new session...

you> when should i schedule the dentist?
bot> You mentioned Wednesday afternoons are free — want me to add it for next Wednesday?
[recall(query="when is the user free for appointments")]

What's still broken: complex requests fall apart. Ask "prep for my doctor visit next Tuesday and pick up a gift before then" and the agent often does one thing, forgets the other, or gets the dates wrong. It is reacting one tool at a time without ever looking at the whole request first.


Planning — think before acting

Without planning, the model reacts to each tool result in isolation and loses the thread on multi-step requests — does one thing, forgets the other. Planning separates deciding what to do from doing it. Even when both happen inside one LLM call, forcing the model to commit to a structure before it starts dispatching changes its behavior.

The smallest planning intervention that helps: a system-prompt instruction that says "For requests that involve more than one step, first write a one-line plan, then execute it."

SYSTEM = {
    "role": "system",
    "content": (
        "You are a personal planner.\n"
        "- Use get_today before reasoning about relative dates.\n"
        "- For multi-step requests, write a short plan first (1–3 bullets), "
        "  then call the tools to execute it.\n"
        "- If the user states a durable preference or fact about themselves, call remember.\n"
        "- If a question would benefit from past context, call recall before answering.\n"
        "- After executing, summarize what you did in one line."
    ),
}
SYSTEM = {
    "role": "system",
    "content": (
        "You are a personal planner.\n"
        "- Use get_today before reasoning about relative dates.\n"
        "- For multi-step requests, write a short plan first (1–3 bullets), "
        "  then call the tools to execute it.\n"
        "- If the user states a durable preference or fact about themselves, call remember.\n"
        "- If a question would benefit from past context, call recall before answering.\n"
        "- After executing, summarize what you did in one line."
    ),
}

That's it. No new code, no separate "planner" LLM call. The plan lives in the assistant message that precedes the tool calls. It costs a few extra tokens and visibly improves multi-step accuracy because the model commits to a structure before it starts dispatching.

A turn now looks like:

you> prep for my doctor visit next tuesday and pick up a gift before then

bot> Plan:
     - get today's date to resolve "next Tuesday"
     - add "prep for doctor visit" due that date
     - add "pick up gift" due the day before
[get_today() → 2026-05-24]
[add_task(title="prep for doctor visit", due_date="2026-06-02")]
[add_task(title="pick up gift", due_date="2026-06-01")]

bot> Added both — doctor prep on Jun 2, gift on Jun 1.

For more complex domains you'd promote planning to a dedicated LLM call that produces structured JSON, then iterate over the steps. For a personal planner, the prompt-only version is enough — and resisting the urge to over-engineer is part of the lesson.

What's still broken: the agent sometimes silently fails — calls a tool, gets an error, ignores it, and tells you everything went fine. It needs to check its own work.


Reflection — checking its own work

Reflection adds a second pair of eyes — a fresh LLM call with no investment in the previous answer, which makes it willing to say "that's wrong." The acting model has an implicit bias toward declaring success because it just spent tokens on the attempt; a separate critic call doesn't. Concretely: a second ollama.chat invocation with a different system prompt and the prior transcript as input.

Mechanically: after the main loop finishes, a second LLM pass looks at the transcript and decides did we actually accomplish what the user asked? If not, the critique is fed back in as a new user message and the loop runs again, up to a small retry budget.

The previous run becomes _act — same body, new name — and a new run wraps it with the reflection loop. The trick: a failed attempt's assistant/tool turns are kept in memory but only the final successful attempt is persisted, so a critique-driven retry never leaks into durable history.

REFLECT_PROMPT = (
    "You are reviewing an agent transcript. Given the user's original request "
    "and the actions taken, answer in JSON: "
    '{"done": true|false, "critique": "..."}. '
    "Set done=true if the request was fully satisfied. "
    "Set done=false and provide a concrete critique if anything is missing or wrong."
)

MAX_REFLECTIONS = 2

def _act(messages: list[dict]) -> tuple[str, list[dict]]:
    """One pass of the tool-calling loop. We both *mutate* `messages` (so the
    caller and reflection see the full transcript) and *return* the slice
    of new turns this call added, so `run` can decide what to persist."""
    start = len(messages)
    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", ""), messages[start:]
        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            if isinstance(args, str):  # some models return arguments as JSON text
                args = json.loads(args)
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            messages.append({
                "role": "tool", "content": json.dumps(result), "tool_name": name,
            })
    return "I hit my tool-call limit.", messages[start:]

def _reflect(original: str, reply: str, messages: list[dict]) -> dict:
    transcript = "\n".join(
        f"{m['role']}: {m.get('content', '') or m.get('tool_calls', '')}"
        for m in messages[-8:]
    )
    res = ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": REFLECT_PROMPT},
            {"role": "user", "content":
                f"Original request: {original}\n\nTranscript:\n{transcript}\n\nFinal reply: {reply}"},
        ],
        format="json",
    )
    content = to_dict(res["message"]).get("content") or "{}"
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return {"done": True, "critique": ""}  # fail open on malformed output

def run(session: str, user_input: str) -> str:
    save_message(db, session, {"role": "user", "content": user_input})
    original = user_input
    messages = [SYSTEM, *load_history(db, session)]
    reply, added = "", []
    succeeded = False

    for attempt in range(MAX_REFLECTIONS + 1):
        reply, added = _act(messages)
        try:
            verdict = _reflect(original, reply, messages)
        except Exception:
            verdict = {"done": True, "critique": ""}  # fail open on any error
        if verdict["done"]:
            succeeded = True
            break
        if attempt == MAX_REFLECTIONS:
            break
        # Critique is appended in-memory only — never persisted.
        messages.append({
            "role": "user",
            "content": f"Your previous attempt was incomplete: {verdict['critique']}",
        })

    # Persist only the winning attempt's assistant/tool turns. On exhaustion
    # without success, the original user message remains as a dangling row;
    # next turn's load + trim handles that gracefully.
    if succeeded:
        for m in added:
            save_message(db, session, m)
    return reply
REFLECT_PROMPT = (
    "You are reviewing an agent transcript. Given the user's original request "
    "and the actions taken, answer in JSON: "
    '{"done": true|false, "critique": "..."}. '
    "Set done=true if the request was fully satisfied. "
    "Set done=false and provide a concrete critique if anything is missing or wrong."
)

MAX_REFLECTIONS = 2

def _act(messages: list[dict]) -> tuple[str, list[dict]]:
    """One pass of the tool-calling loop. We both *mutate* `messages` (so the
    caller and reflection see the full transcript) and *return* the slice
    of new turns this call added, so `run` can decide what to persist."""
    start = len(messages)
    for _ in range(MAX_TURNS):
        res = ollama.chat(model=MODEL, messages=messages, tools=tool_specs())
        msg = to_dict(res["message"])
        messages.append(msg)
        calls = msg.get("tool_calls") or []
        if not calls:
            return msg.get("content", ""), messages[start:]
        for call in calls:
            name = call["function"]["name"]
            args = call["function"]["arguments"] or {}
            if isinstance(args, str):  # some models return arguments as JSON text
                args = json.loads(args)
            try:
                result = TOOLS[name]["fn"](**args)
            except Exception as e:
                result = {"error": str(e)}
            messages.append({
                "role": "tool", "content": json.dumps(result), "tool_name": name,
            })
    return "I hit my tool-call limit.", messages[start:]

def _reflect(original: str, reply: str, messages: list[dict]) -> dict:
    transcript = "\n".join(
        f"{m['role']}: {m.get('content', '') or m.get('tool_calls', '')}"
        for m in messages[-8:]
    )
    res = ollama.chat(
        model=MODEL,
        messages=[
            {"role": "system", "content": REFLECT_PROMPT},
            {"role": "user", "content":
                f"Original request: {original}\n\nTranscript:\n{transcript}\n\nFinal reply: {reply}"},
        ],
        format="json",
    )
    content = to_dict(res["message"]).get("content") or "{}"
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return {"done": True, "critique": ""}  # fail open on malformed output

def run(session: str, user_input: str) -> str:
    save_message(db, session, {"role": "user", "content": user_input})
    original = user_input
    messages = [SYSTEM, *load_history(db, session)]
    reply, added = "", []
    succeeded = False

    for attempt in range(MAX_REFLECTIONS + 1):
        reply, added = _act(messages)
        try:
            verdict = _reflect(original, reply, messages)
        except Exception:
            verdict = {"done": True, "critique": ""}  # fail open on any error
        if verdict["done"]:
            succeeded = True
            break
        if attempt == MAX_REFLECTIONS:
            break
        # Critique is appended in-memory only — never persisted.
        messages.append({
            "role": "user",
            "content": f"Your previous attempt was incomplete: {verdict['critique']}",
        })

    # Persist only the winning attempt's assistant/tool turns. On exhaustion
    # without success, the original user message remains as a dangling row;
    # next turn's load + trim handles that gracefully.
    if succeeded:
        for m in added:
            save_message(db, session, m)
    return reply

The full cycle:

flowchart LR user([user request]) plan["1. Plan<br/>(in-prompt)"] act["2. Act<br/>tool-calling loop"] reflect["3. Reflect<br/>(2nd LLM pass)"] done([reply]) user --> plan --> act --> reflect reflect -->|done=true| done reflect -->|done=false<br/>critique fed back| plan classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class plan,act,reflect step class user,done terminal

Two things worth knowing:

  • Reflection is expensive. It doubles your LLM calls on every turn. Gate it behind a complexity heuristic ("did the agent call more than one tool?") if you care about latency or token cost. For a local Ollama planner where calls are free and slow, it's fine to always run.
  • Fail open on bad reflection output. If the reflector returns malformed JSON, return done=true rather than retrying forever. Trusting the critic blindly is how you get infinite loops.

Putting it together

All the pieces, in one file:

agent.py           # the loop, plan, reflect, REPL entry
tools.py           # tool registry, task tools, get_today, remember, recall
memory.py          # save/load messages, trim
embeddings.py      # embed, pack/unpack, cosine
planner.db         # SQLite, created on first run

~300 lines of Python, no third-party dependencies except ollama. Clone the repo if you don't want to type the files out yourself:

git clone https://github.com/devdaviddr/personal-planner-agent.git
cd personal-planner-agent
git clone https://github.com/devdaviddr/personal-planner-agent.git
cd personal-planner-agent

The REPL entry point at the bottom of agent.py:

if __name__ == "__main__":
    import uuid
    session = uuid.uuid4().hex  # one session per process; swap for a stable id to span runs
    while (text := input("you> ").strip()):
        print(f"bot> {run(session, text)}\n")
if __name__ == "__main__":
    import uuid
    session = uuid.uuid4().hex  # one session per process; swap for a stable id to span runs
    while (text := input("you> ").strip()):
        print(f"bot> {run(session, text)}\n")

Then:

pip install 'ollama>=0.4'
python agent.py
pip install 'ollama>=0.4'
python agent.py

A real transcript from the finished agent:

you> i have wednesday afternoons free
bot> Noted.

you> prep for my doctor visit next tuesday and pick up a gift before then
bot> Plan:
     - check today's date to resolve "next Tuesday"
     - add a prep task for that day
     - add a gift task for the day before
     Added: "prep for doctor visit" due 2026-06-02, "pick up gift" due 2026-06-01.

you> when should i schedule the follow-up?
bot> You mentioned Wednesday afternoons are free — next Wednesday is 2026-05-27.
     Want me to add it?

you> yes
bot> Added: "doctor follow-up" due 2026-05-27.

you> what's on my list?
bot> 3 open tasks:
     1. pick up gift — 2026-06-01
     2. prep for doctor visit — 2026-06-02
     3. doctor follow-up — 2026-05-27

The remember/recall calls, the get_today resolution, the plan, and the silent reflection pass are all happening in the background. From the user's seat it just feels like the agent is thinking.


What frameworks add (and what they take)

Now that you've built one from scratch, here's what an agent framework actually gives you, so you can decide when it's worth reaching for one:

Feature you wrote What a framework adds
TOOLS dict + JSON-schema decorator Auto-generation from Python type hints, async dispatch, parallel tool execution.
messages table + manual trim Pluggable memory backends (Redis, Postgres+pgvector, managed services), automatic summarization, token-aware truncation.
recall() over SQLite A real vector DB (Chroma, LanceDB, Pinecone) with proper indexing for >100k vectors.
Plan-then-act in the system prompt Multi-step planners that emit structured DAGs, with per-step retries.
Single-shot reflection Critic agents, self-consistency voting, debate loops.
One agent, one loop Multi-agent orchestration, message passing, handoff protocols.

Rule of thumb: ~10k+ memories (you need a real vector index), multiple coordinated agents (you need orchestration), async or parallel tool execution, or multi-tenant SLAs. Any single one of these is a yellow flag worth thinking about; two or more and you should reach for a framework. Below that, the from-scratch version is faster to debug and ships sooner.

For a personal planner with a few thousand tasks and one user, you don't need any of it. For a customer-facing system with multiple specialized agents, ten million memories, and SLA-bound latency, you do. The point of writing the from-scratch version is that you now know exactly what you're trading away when you adopt a framework — and what you'd have to rebuild if you ever ripped one out.


Troubleshooting

Symptom Cause and fix
ollama._types.ResponseError: model not found You haven't pulled the model. Run ollama pull qwen3:8b (and nomic-embed-text).
Model never calls tools, just answers in prose Either the model is too small or the tool descriptions are too vague. Try qwen3:8b; if you must use a 3B model, write longer, more imperative descriptions ("Use this to...").
role: tool rejected by Ollama Your history was sliced mid-tool-call. Confirm trim_to_user_boundary is running. The same bug occurs if you forget to persist the assistant tool_calls message.
recall returns garbage matches Your nomic-embed-text pull is incomplete or you're packing/unpacking with mismatched dtypes. Re-run ollama pull nomic-embed-text and verify embedding length is 768 (len(embed("test"))).
Agent gets dates wrong The system prompt instruction to call get_today first is missing or the model is ignoring it. Make the instruction more emphatic, or compute today's date in Python and inject it into the system prompt every turn.
Reflection loops forever MAX_REFLECTIONS is too high or your reflector is overly strict. Cap at 2 and fail open on malformed JSON output.
Slow first reply Ollama loads the model into VRAM on the first request. Subsequent calls are fast. Pre-warm with curl http://localhost:11434/api/generate -d '{"model":"qwen3:8b","prompt":"hi"}' on startup if you care.

Result

A local, fully-private personal-planner agent in roughly 300 lines of Python. It runs on your laptop, persists everything to one SQLite file, and demonstrates each of the agent fundamentals — tools, short-term memory, long-term memory, planning, reflection — as a discrete, removable layer rather than as framework magic.

The same skeleton generalizes. Swap the task tools for GitHub Issues, Linear, your calendar, or any REST API and you have a domain-specific agent on the same footing. Swap the SQLite memory tables for Postgres and you have something multi-user. The reflection pass shown here is closest to self-refine — within-turn critique-and-retry. Persist the critiques across episodes and you have the start of a true Reflexion system (Shinn et al., 2023), where the agent learns from its own past mistakes over time.

The point isn't the planner. The point is that you've seen each fundamental in isolation and can now build, debug, or replace any of them without the framework that usually hides them.

Source

Full source for this tutorial: github.com/devdaviddr/personal-planner-agent.