[
  {
    "slug": "2026-05-24-building-an-ai-agent-from-scratch-ollama-python",
    "title": "Tutorial: Build an AI Agent from Scratch with Ollama and Python",
    "description": "A from-scratch tutorial that builds a local personal-planner agent in plain Python, backed by Ollama and SQLite. No frameworks. Each section breaks the previous version to motivate the next fundamental: tools, short-term memory, long-term memory, planning, and reflection.",
    "tags": [
      "ai",
      "ollama",
      "python",
      "agents",
      "tutorial",
      "from-scratch"
    ],
    "excerpt": "This tutorial builds a small AI agent from scratch in plain Python. It runs against a local Ollama model, stores everything in SQLite, and uses no frameworks. We start with a single LLM call and layer on the four patterns that make it an agent: tools",
    "content": "This tutorial builds a small AI agent from scratch in plain Python. It runs against a local Ollama model, stores everything in SQLite, and uses no frameworks. We start with a single LLM call and layer on the four patterns that make it an agent: tools, memory, planning, and reflection. By the end you will have: A local personal-planner agent you talk to from the terminal. It can add, list, and complete tasks; it remembers facts about you across sessions; it plans before acting and reflects on what it did. A working mental model of what an agent framework actually does for you, so you can decide when to reach for one and when not to. Roughly 300 lines of Python, all stdlib plus the ollama client. All the code lives at github.com/devdaviddr/personal-planner-agent if you want to clone-and-run before reading. A taste of what the finished agent looks like in use — note that it remembers a preference you told it weeks earlier and uses it to answer a question that has nothing to do with the original turn: you&gt; i have wednesday afternoons free for meetings bot&gt; Noted. # ...some time later, new session... you&gt; when should i schedule the dentist? bot&gt; Plan: - recall any free-time preferences - resolve &quot;next Wednesday&quot; via get_today - propose a date [recall(&quot;when is the user free for appointments&quot;) → &quot;Wednesday afternoons&quot;] [get_today() → 2026-05-24] You mentioned Wednesday afternoons are free. Next Wednesday is 2026-05-27 — want me to add it? What is an agent, really? Strip the term down and an agent is four things in a loop: An LLM that picks the next action. A set of tools the LLM can call (functions, basically). Memory so it carries state across turns and sessions. A control loop that keeps calling the LLM until it's done. Everything else — planning, reflection, multi-agent orchestration, retrieval — is a refinement of one of those four. 
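The four parts listed above can be sketched with a scripted stand-in for the LLM, so the control loop is visible without a model running. This is a minimal illustration, not the tutorial's code: fake_llm, run_loop and the simplified message shapes are hypothetical, and the real version built in later sections calls ollama.chat instead.

```python
# Hypothetical sketch: a scripted stand-in replaces the LLM so the loop runs offline.
from datetime import date

# Tools: plain functions the model may ask for by name. The date is fixed for the demo.
TOOLS = {'get_today': lambda: {'date': date(2026, 5, 24).isoformat()}}


def fake_llm(messages):
    # Stand-in policy: request a tool once, then answer from its result.
    last = messages[-1]
    if last['role'] == 'user':
        return {'role': 'assistant',
                'tool_calls': [{'name': 'get_today', 'arguments': {}}]}
    return {'role': 'assistant', 'content': 'Today is ' + last['content']['date']}


def run_loop(user_input, max_turns=4):
    messages = [{'role': 'user', 'content': user_input}]  # memory for this turn
    for _ in range(max_turns):                             # control loop
        msg = fake_llm(messages)                           # the LLM picks the next action
        messages.append(msg)
        calls = msg.get('tool_calls') or []
        if not calls:
            return msg['content']                          # no tool calls means done
        for call in calls:                                 # run each tool, feed it back
            result = TOOLS[call['name']](**call['arguments'])
            messages.append({'role': 'tool', 'content': result})
    return 'turn limit reached'
```

Calling run_loop('what day is it?') goes around the loop twice: once to dispatch get_today, once to produce the final reply from the tool result.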
The shape of the loop itself (think → act → observe → think again) is often called ReAct, and it's the load-bearing structure of every agent system, from one-file scripts to multi-agent orchestration platforms. The rest of this article introduces each refinement only after showing what visibly breaks without it. What you will build A single Python program. It opens a terminal REPL, persists everything to a local planner.db SQLite file, and talks to Ollama for both chat completions and embeddings. flowchart LR user([&quot;you&lt;br/&gt;(terminal REPL)&quot;]) agent[&quot;agent loop&lt;br/&gt;(Python)&quot;] ollama[(&quot;Ollama&lt;br/&gt;qwen3:8b&lt;br/&gt;nomic-embed-text&quot;)] db[(&quot;SQLite&lt;br/&gt;planner.db&quot;)] user &lt;--&gt; agent agent &lt;--&gt;|chat + embeddings| ollama agent &lt;--&gt;|tasks · messages · memories| db classDef external fill:#eef2f7,stroke:#6b7280,color:#15171a classDef internal fill:#dbeafe,stroke:#2563eb,color:#15171a class user,ollama external class agent,db internal SQLite holds three tables: tasks — the planner's domain data (title, due date, status). messages — conversation history, one row per turn, keyed by session. memories — long-term facts about you, stored with an embedding for retrieval. That's the whole system. We will build it up one layer at a time. Prerequisites Requirement Notes Python 3.11+ Type hints and match are used throughout. Ollama Install from https://ollama.com. CPU works but is slow; a GPU with 8+ GB VRAM makes the experience interactive. The ollama Python client pip install 'ollama&gt;=0.4'. The only third-party dependency. Older clients expose embeddings() instead of embed() and will KeyError on the code below. A terminal We will run a REPL. Any shell. Pull the two models we will use: ollama pull qwen3:8b ollama pull nomic-embed-text qwen3:8b is a small, reliable tool-calling model. 
If you have less VRAM, llama3.2:3b works but trips on tool schemas more often. nomic-embed-text is a 768-dim embedding model used for long-term memory retrieval. The 30-line naive agent Start with the smallest possible thing that calls an LLM: # v1_naive.py import ollama MODEL = \"qwen3:8b\" def chat (prompt: str ) -> str : res = ollama.chat( model = MODEL , messages = [{ \"role\" : \"user\" , \"content\" : prompt}], ) return res[ \"message\" ][ \"content\" ] if __name__ == \"__main__\" : while (user := input ( \"you> \" ).strip()): print ( f \"bot> { chat(user) }\\n \" ) Run it: you&gt; what's the capital of France? bot&gt; Paris. you&gt; add a task to buy milk tomorrow bot&gt; Sure! I've added &quot;buy milk&quot; to your task list for tomorrow. The second reply is a lie. There is no task list. There is no tomorrow — the model has no way to do anything beyond producing text. It also forgets the previous turn the moment the next one starts, because we send a fresh single-message history every time. This is the baseline. Everything from here on is fixing a specific failure of this version. Tools — letting the agent do things A tool is a Python function the LLM can decide to call. The agent loop is responsible for advertising those functions to the model (as JSON Schema), watching for tool_calls in the response, executing them, and feeding the results back. The loop never decides what to do — the model does. The loop is purely mechanical: it dispatches whatever the model asks for, feeds the result back, and asks again. The model stops by producing a turn with no tool calls. 
This is the single most important idea in this section; everything below is plumbing for it. Define a small registry: # tools.py import json, sqlite3 from datetime import date TOOLS : dict[ str , dict ] = {} def tool (name: str , description: str , schema: dict ): def decorator (fn): TOOLS [name] = { \"description\" : description, \"schema\" : schema, \"fn\" : fn} return fn return decorator def tool_specs () -> list[ dict ]: return [ { \"type\" : \"function\" , \"function\" : { \"name\" : name, \"description\" : t[ \"description\" ], \"parameters\" : t[ \"schema\" ], }, } for name, t in TOOLS .items() ] Three task tools and a get_today (LLMs don't know the date — making it a tool is the lesson): db = sqlite3.connect( \"planner.db\" ) db.row_factory = sqlite3.Row db.executescript( \"\"\" CREATE TABLE IF NOT EXISTS tasks ( id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT NOT NULL, due_date TEXT, status TEXT NOT NULL DEFAULT 'open', created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); \"\"\" ) @tool ( \"add_task\" , \"Add a task to the planner.\" , { \"type\" : \"object\" , \"properties\" : { \"title\" : { \"type\" : \"string\" }, \"due_date\" : { \"type\" : \"string\" , \"description\" : \"ISO 8601 date, e.g. 
2026-05-28\" }, }, \"required\" : [ \"title\" ], }) def add_task (title: str , due_date: str | None = None ) -> dict : cur = db.execute( \"INSERT INTO tasks (title, due_date) VALUES (?, ?)\" , (title, due_date), ) db.commit() return { \"id\" : cur.lastrowid, \"title\" : title, \"due_date\" : due_date} @tool ( \"list_tasks\" , \"List open tasks.\" , { \"type\" : \"object\" , \"properties\" : {}}) def list_tasks () -> list[ dict ]: rows = db.execute( \"SELECT id, title, due_date FROM tasks WHERE status = 'open' \" \"ORDER BY due_date IS NULL, due_date\" , # push null-due tasks to the bottom ).fetchall() return [ dict (r) for r in rows] @tool ( \"complete_task\" , \"Mark a task complete.\" , { \"type\" : \"object\" , \"properties\" : { \"id\" : { \"type\" : \"integer\" }}, \"required\" : [ \"id\" ], }) def complete_task (id: int ) -> dict : db.execute( \"UPDATE tasks SET status = 'done' WHERE id = ?\" , ( id ,)) db.commit() return { \"id\" : id , \"status\" : \"done\" } @tool ( \"get_today\" , \"Get today's date in ISO 8601.\" , { \"type\" : \"object\" , \"properties\" : {}}) def get_today () -> dict : return { \"date\" : date.today().isoformat()} Now the agent loop: # agent.py import json, ollama from tools import TOOLS , tool_specs MODEL = \"qwen3:8b\" MAX_TURNS = 8 def to_dict (msg) -> dict : # Ollama returns Pydantic models. Convert to plain dicts so json.dumps # (and later, SQLite storage) work without a custom encoder. 
return msg.model_dump( exclude_none = True ) if hasattr (msg, \"model_dump\" ) else msg def run (user_input: str ) -> str : messages = [{ \"role\" : \"user\" , \"content\" : user_input}] for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs()) msg = to_dict(res[ \"message\" ]) messages.append(msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ) for call in calls: name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: result = { \"error\" : str (e)} messages.append({ \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name, }) return \"I hit my tool-call limit.\" The shape of one turn: flowchart TD start([user message]) chat[&quot;ollama.chat&lt;br/&gt;(model + tools advertised)&quot;] decide{tool_calls&lt;br/&gt;in response?} finalize([return reply]) dispatch[execute tool fn] push[append result as&lt;br/&gt;role: tool message] start --&gt; chat chat --&gt; decide decide --&gt;|no| finalize decide --&gt;|yes| dispatch dispatch --&gt; push push --&gt; chat classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class chat,dispatch,push step class start,finalize terminal class decide step In code, &quot;the model decides&quot; looks like the conditional on calls: if the model returns no tool_calls, the loop returns its content. Otherwise it dispatches every call the model asked for, in order, and feeds each result back as a role: &quot;tool&quot; message before going around again. A turn now looks like: you&gt; add a task to buy milk by friday bot&gt; Done — added &quot;buy milk&quot; with due date 2026-05-29. The lie is gone. Real row in tasks, real due_date. What's still broken: start a new run of the program. 
Ask &quot;what tasks do I have?&quot; The model has to call list_tasks from a cold start every time because we discard messages at the end of run . Worse, even within a single REPL session, the next user message gets none of the previous turn's context. The agent has no memory. Short-term memory — conversation state Persist messages to SQLite, keyed by session id. Load them on each turn. Trim to a budget. CREATE TABLE IF NOT EXISTS messages ( id INTEGER PRIMARY KEY AUTOINCREMENT, session TEXT NOT NULL, role TEXT NOT NULL, content TEXT, tool_calls TEXT, tool_name TEXT, created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); # memory.py import json, sqlite3 from typing import Iterable HISTORY_LIMIT = 40 # rough turn budget def save_message (db: sqlite3.Connection, session: str , msg: dict ) -> None : db.execute( \"INSERT INTO messages (session, role, content, tool_calls, tool_name) \" \"VALUES (?, ?, ?, ?, ?)\" , ( session, msg[ \"role\" ], msg.get( \"content\" ), json.dumps(msg[ \"tool_calls\" ]) if msg.get( \"tool_calls\" ) else None , msg.get( \"tool_name\" ), ), ) db.commit() def load_history (db: sqlite3.Connection, session: str ) -> list[ dict ]: rows = db.execute( \"SELECT role, content, tool_calls, tool_name FROM messages \" \"WHERE session = ? ORDER BY id DESC LIMIT ?\" , (session, HISTORY_LIMIT ), ).fetchall() msgs = [] for r in reversed (rows): m = { \"role\" : r[ \"role\" ]} if r[ \"content\" ] is not None : m[ \"content\" ] = r[ \"content\" ] if r[ \"tool_calls\" ]: m[ \"tool_calls\" ] = json.loads(r[ \"tool_calls\" ]) if r[ \"tool_name\" ]: m[ \"tool_name\" ] = r[ \"tool_name\" ] msgs.append(m) return trim_to_user_boundary(msgs) def trim_to_user_boundary (msgs: list[ dict ]) -> list[ dict ]: # Tool-calling APIs require an assistant message with tool_calls to be # immediately followed by role: tool messages for each call. 
After a # window slice we have to guard both ends: # (1) start on a user message — drop any orphan tool/assistant prefix # (2) drop trailing orphans: a role:tool with no live assistant before # it, or an assistant whose tool_calls were never answered. start = next ((i for i, m in enumerate (msgs) if m[ \"role\" ] == \"user\" ), None ) if start is None : return [] msgs = msgs[start:] while msgs and ( msgs[ - 1 ].get( \"role\" ) == \"tool\" or (msgs[ - 1 ].get( \"role\" ) == \"assistant\" and msgs[ - 1 ].get( \"tool_calls\" )) ): msgs.pop() return msgs The trim function is the one piece of subtlety. Naïve sliding-window truncation will sometimes cut between an assistant message that contains tool_calls and the role: tool messages that satisfy those calls. The next API request will reject that history with a confusing error. The fix is to guard both ends: skip any leading orphan tool or assistant fragments until we land on a user message, and drop any trailing assistant whose tool_calls were never answered. The loop now uses the DB as the single source of truth — save the user message first, then load history, then run the turn: # agent.py from tools import TOOLS , tool_specs, db from memory import save_message, load_history SYSTEM = { \"role\" : \"system\" , \"content\" : ( \"You are a personal planner. The user manages tasks through you. \" \"Use get_today before reasoning about relative dates like 'tomorrow' or 'next Tuesday'. \" \"Keep replies short.\" ), } def run (session: str , user_input: str ) -> str : save_message(db, session, { \"role\" : \"user\" , \"content\" : user_input}) messages = [ SYSTEM , * load_history(db, session)] for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs()) msg = to_dict(res[ \"message\" ]) messages.append(msg) save_message(db, session, msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ) for call in calls: name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: result = { \"error\" : str (e)} tool_msg = { \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name, } messages.append(tool_msg) save_message(db, session, tool_msg) return \"I hit my tool-call limit.\" Restart the REPL. Ask &quot;what was the last task I added?&quot; The agent answers from history. It remembers within and across sessions, up to HISTORY_LIMIT turns. What's still broken: the agent remembers the conversation but it has no notion of facts about you. Tell it &quot;I have Wednesday afternoons free&quot; today and ask &quot;when should I schedule a meeting?&quot; three weeks from now and it has no idea — that turn was trimmed out of the window long ago. Some things need to outlive the conversation buffer. Long-term memory — facts that outlive the window Short-term memory is indexed by recency; long-term memory is indexed by meaning. You need both — conversations end but facts shouldn't. The data structure for &quot;indexed by meaning&quot; is an embedding: a fixed-length list of floats produced by a dedicated embedding model (here, nomic-embed-text), trained so that two pieces of text with similar meaning land near each other in vector space. Search becomes a similarity comparison instead of a substring match. 
Two new tools (remember and recall) give the agent control over what to save and when to retrieve. CREATE TABLE IF NOT EXISTS memories ( id INTEGER PRIMARY KEY AUTOINCREMENT, text TEXT NOT NULL, embedding BLOB NOT NULL, created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP ); Embeddings via Ollama, packed into the BLOB column as raw little-endian floats. Cosine similarity is a one-liner in pure Python — slow at a million rows, fine for a personal planner. # embeddings.py import struct, ollama EMBED_MODEL = \"nomic-embed-text\" def embed (text: str ) -> list[ float ]: res = ollama.embed( model = EMBED_MODEL , input = text) return res[ \"embeddings\" ][ 0 ] def pack (vec: list[ float ]) -> bytes : return struct.pack( f\"<{len(vec)}f\" , * vec) # explicit little-endian float32 def unpack (blob: bytes ) -> list[ float ]: return list (struct.unpack( f\"<{len(blob) // 4}f\" , blob)) def cosine (a: list[ float ], b: list[ float ]) -> float : dot = sum (x * y for x, y in zip (a, b)) na = sum (x * x for x in a) ** 0.5 nb = sum (y * y for y in b) ** 0.5 return dot / (na * nb) if na and nb else 0.0 The tools: @tool ( \"remember\" , \"Store a durable fact about the user.\" , { \"type\" : \"object\" , \"properties\" : { \"fact\" : { \"type\" : \"string\" }}, \"required\" : [ \"fact\" ], }) def remember (fact: 
str ) -> dict : vec = embed(fact) db.execute( \"INSERT INTO memories (text, embedding) VALUES (?, ?)\" , (fact, pack(vec)), ) db.commit() return { \"ok\" : True , \"fact\" : fact} @tool ( \"recall\" , \"Search long-term memory by meaning.\" , { \"type\" : \"object\" , \"properties\" : { \"query\" : { \"type\" : \"string\" }, \"k\" : { \"type\" : \"integer\" , \"default\" : 3 }, }, \"required\" : [ \"query\" ], }) def recall (query: str , k: int = 3 ) -> list[ dict ]: qv = embed(query) rows = db.execute( \"SELECT text, embedding FROM memories\" ).fetchall() scored = sorted ( ((cosine(qv, unpack(r[ \"embedding\" ])), r[ \"text\" ]) for r in rows), reverse = True , ) return [{ \"text\" : t, \"score\" : round (s, 3 )} for s, t in scored[:k]] The data path: flowchart LR subgraph short[Short-term memory] msgs[(messages&lt;br/&gt;table)] trim[trim on user&lt;br/&gt;boundary] msgs --&gt; trim end subgraph long[Long-term memory] mem[(memories&lt;br/&gt;text + embedding)] emb[nomic-embed-text] cos[cosine similarity&lt;br/&gt;in Python] end 
turn([user turn]) --&gt; trim trim --&gt; ctx[message context&lt;br/&gt;sent to LLM] turn --&gt;|&quot;remember(fact)&quot;| emb1[nomic-embed-text] emb1 --&gt; mem turn --&gt;|&quot;recall(query)&quot;| emb emb --&gt; cos mem --&gt; cos cos --&gt; ctx classDef store fill:#fef3c7,stroke:#d97706,color:#15171a classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class msgs,mem store class trim,emb,emb1,cos,ctx step class turn terminal Two design choices worth flagging: The model controls both writes and reads. It calls remember when a fact seems worth keeping and recall when it suspects relevant context exists. The system prompt is updated to nudge it: &quot;If the user states a durable preference or fact about themselves, call remember. If a question would benefit from past context, call recall.&quot; The alternative — extracting memories automatically in a background pass — is cleaner architecturally but takes more code; the tool-driven version is the from-scratch lesson. Cosine similarity is computed in Python, not in SQL. SQLite has no native vector type, and that's the point — seeing the loop makes clear what a vector DB is actually doing for you. At a few thousand rows this is fine. Past ~100k rows you want a real vector index. Now the planner can carry facts forward indefinitely: you&gt; i have wednesday afternoons free for meetings bot&gt; Noted. [remember(fact=&quot;user has Wednesday afternoons free for meetings&quot;)] # ...some weeks later, new session... you&gt; when should i schedule the dentist? bot&gt; You mentioned Wednesday afternoons are free — want me to add it for next Wednesday? [recall(query=&quot;when is the user free for appointments&quot;)] What's still broken: complex requests fall apart. Ask &quot;prep for my doctor visit next Tuesday and pick up a gift before then&quot; and the agent often does one thing, forgets the other, or gets the dates wrong. 
It is reacting one tool at a time without ever looking at the whole request first. Planning — think before acting Without planning, the model reacts to each tool result in isolation and loses the thread on multi-step requests — does one thing, forgets the other. Planning separates deciding what to do from doing it. Even when both happen inside one LLM call, forcing the model to commit to a structure before it starts dispatching changes its behavior. The smallest planning intervention that helps: a system-prompt instruction that says &quot;For requests that involve more than one step, first write a one-line plan, then execute it.&quot; SYSTEM = { \"role\" : \"system\" , \"content\" : ( \"You are a personal planner. \\n \" \"- Use get_today before reasoning about relative dates. \\n \" \"- For multi-step requests, write a short plan first (1–3 bullets), \" \" then call the tools to execute it. \\n \" \"- If the user states a durable preference or fact about themselves, call remember. \\n \" \"- If a question would benefit from past context, call recall before answering. \\n \" \"- After executing, summarize what you did in one line.\" ), } That's it. No new code, no separate &quot;planner&quot; LLM call. The plan lives in the assistant message that precedes the tool calls. It costs a few extra tokens and visibly improves multi-step accuracy because the model commits to a structure before it starts dispatching. 
A turn now looks like: you&gt; prep for my doctor visit next tuesday and pick up a gift before then bot&gt; Plan: - get today's date to resolve &quot;next Tuesday&quot; - add &quot;prep for doctor visit&quot; due that date - add &quot;pick up gift&quot; due the day before [get_today() → 2026-05-24] [add_task(title=&quot;prep for doctor visit&quot;, due_date=&quot;2026-06-02&quot;)] [add_task(title=&quot;pick up gift&quot;, due_date=&quot;2026-06-01&quot;)] bot&gt; Added both — doctor prep on Jun 2, gift on Jun 1. For more complex domains you'd promote planning to a dedicated LLM call that produces structured JSON, then iterate over the steps. For a personal planner, the prompt-only version is enough — and resisting the urge to over-engineer is part of the lesson. What's still broken: the agent sometimes silently fails — calls a tool, gets an error, ignores it, and tells you everything went fine. It needs to check its own work. Reflection — checking its own work Reflection adds a second pair of eyes — a fresh LLM call with no investment in the previous answer, which makes it willing to say &quot;that's wrong.&quot; The acting model has an implicit bias toward declaring success because it just spent tokens on the attempt; a separate critic call doesn't. Concretely: a second ollama.chat invocation with a different system prompt and the prior transcript as input. Mechanically: after the main loop finishes, a second LLM pass looks at the transcript and decides did we actually accomplish what the user asked? If not, the critique is fed back in as a new user message and the loop runs again, up to a small retry budget. The previous run becomes _act — same body, new name — and a new run wraps it with the reflection loop. The trick: a failed attempt's assistant/tool turns are kept in memory but only the final successful attempt is persisted , so a critique-driven retry never leaks into durable history. REFLECT_PROMPT = ( \"You are reviewing an agent transcript. 
Given the user's original request \" \"and the actions taken, answer in JSON: \" '{\"done\": true|false, \"critique\": \"...\"}. ' \"Set done=true if the request was fully satisfied. \" \"Set done=false and provide a concrete critique if anything is missing or wrong.\" ) MAX_REFLECTIONS = 2 def _act (messages: list[ dict ]) -> tuple[ str , list[ dict ]]: \"\"\"One pass of the tool-calling loop. We both *mutate* `messages` (so the caller and reflection see the full transcript) and *return* the slice of new turns this call added, so `run` can decide what to persist.\"\"\" start = len (messages) for _ in range ( MAX_TURNS ): res = ollama.chat( model = MODEL , messages = messages, tools = tool_specs()) msg = to_dict(res[ \"message\" ]) messages.append(msg) calls = msg.get( \"tool_calls\" ) or [] if not calls: return msg.get( \"content\" , \"\" ), messages[start:] for call in calls: name = call[ \"function\" ][ \"name\" ] args = call[ \"function\" ][ \"arguments\" ] or {} if isinstance (args, str ): # some models return arguments as JSON text args = json.loads(args) try : result = TOOLS [name][ \"fn\" ]( ** args) except Exception as e: result = { \"error\" : str (e)} messages.append({ \"role\" : \"tool\" , \"content\" : json.dumps(result), \"tool_name\" : name, }) return \"I hit my tool-call limit.\" , messages[start:] def _reflect (original: str , reply: str , messages: list[ dict ]) -> dict : transcript = \" \\n \" .join( f \" { m[ 'role' ] } : { m.get( 'content' , '' ) or m.get( 'tool_calls' , '' ) } \" for m in messages[ - 8 :] ) res = ollama.chat( model = MODEL , messages = [ { \"role\" : \"system\" , \"content\" : REFLECT_PROMPT }, { \"role\" : \"user\" , \"content\" : f \"Original request: { original }\\n\\n Transcript: \\n{ transcript }\\n\\n Final reply: { reply } \" }, ], format = \"json\" , ) content = to_dict(res[ \"message\" ]).get( \"content\" ) or \" {} \" try : return json.loads(content) except (json.JSONDecodeError, TypeError ): return { \"done\" : True 
, \"critique\" : \"\" } # fail open on malformed output def run (session: str , user_input: str ) -> str : save_message(db, session, { \"role\" : \"user\" , \"content\" : user_input}) original = user_input messages = [ SYSTEM , * load_history(db, session)] reply, added = \"\" , [] succeeded = False for attempt in range ( MAX_REFLECTIONS + 1 ): reply, added = _act(messages) try : verdict = _reflect(original, reply, messages) except Exception : verdict = { \"done\" : True , \"critique\" : \"\" } # fail open on any error if verdict.get( \"done\" , True ): # a verdict missing the key also fails open succeeded = True break if attempt == MAX_REFLECTIONS : break # Critique is appended in-memory only — never persisted. messages.append({ \"role\" : \"user\" , \"content\" : f \"Your previous attempt was incomplete: { verdict.get( 'critique' , '' ) } \" , }) # Persist only the winning attempt's assistant/tool turns. On exhaustion # without success, the original user message remains as a dangling row; # next turn's load + trim handles that gracefully. if succeeded: for m in added: save_message(db, session, m) return reply The full cycle: flowchart LR user([user request]) plan[&quot;1. Plan&lt;br/&gt;(in-prompt)&quot;] act[&quot;2. Act&lt;br/&gt;tool-calling loop&quot;] reflect[&quot;3. Reflect&lt;br/&gt;(2nd LLM pass)&quot;] done([reply]) user --&gt; plan --&gt; act --&gt; reflect reflect --&gt;|done=true| done reflect --&gt;|done=false&lt;br/&gt;critique fed back| plan classDef step fill:#dbeafe,stroke:#2563eb,color:#15171a classDef terminal fill:#eef2f7,stroke:#6b7280,color:#15171a class plan,act,reflect step class user,done terminal Two things worth knowing: Reflection is expensive. It adds at least one extra LLM call to every turn, and a rejected attempt reruns the whole act loop. Gate it behind a complexity heuristic (&quot;did the agent call more than one tool?&quot;) if you care about latency or token cost. For a local Ollama planner where calls are free and slow, it's fine to always run. Fail open on bad reflection output. If the reflector returns malformed JSON, return done=true rather than retrying forever. Trusting the critic blindly is how you get infinite loops. 
Putting it together All the pieces, in one directory: agent.py # the loop, plan, reflect, REPL entry tools.py # tool registry, task tools, get_today, remember, recall memory.py # save/load messages, trim embeddings.py # embed, pack/unpack, cosine planner.db # SQLite, created on first run ~300 lines of Python, no third-party dependencies except ollama. Clone the repo if you don't want to type the files out yourself: git clone https://github.com/devdaviddr/personal-planner-agent.git cd personal-planner-agent The REPL entry point at the bottom of agent.py: if __name__ == \"__main__\" : import uuid session = uuid.uuid4().hex # one session per process; swap for a stable id to span runs while (text := input ( \"you> \" ).strip()): print ( f \"bot> { run(session, text) }\\n \" ) Then: pip install 'ollama>=0.4' python agent.py A real transcript from the finished agent: you&gt; i have wednesday afternoons free bot&gt; Noted. you&gt; prep for my doctor visit next tuesday and pick up a gift before then bot&gt; Plan: - check today's date to resolve &quot;next Tuesday&quot; - add a prep task for that day - add a gift task for the day before Added: &quot;prep for doctor visit&quot; due 2026-06-02, &quot;pick up gift&quot; due 2026-06-01. you&gt; when should i schedule the follow-up? bot&gt; You mentioned Wednesday afternoons are free — next Wednesday is 2026-05-27. Want me to add it? you&gt; yes bot&gt; Added: &quot;doctor follow-up&quot; due 2026-05-27. you&gt; what's on my list? bot&gt; 3 open tasks: 1. pick up gift — 2026-06-01 2. prep for doctor visit — 2026-06-02 3. 
doctor follow-up — 2026-05-27 The remember / recall calls, the get_today resolution, the plan, and the silent reflection pass are all happening in the background. From the user's seat it just feels like the agent is thinking. What frameworks add (and what they take) Now that you've built one from scratch, here's what an agent framework actually gives you, so you can decide when it's worth reaching for one: Feature you wrote What a framework adds TOOLS dict + JSON-schema decorator Auto-generation from Python type hints, async dispatch, parallel tool execution. messages table + manual trim Pluggable memory backends (Redis, Postgres+pgvector, managed services), automatic summarization, token-aware truncation. recall() over SQLite A real vector DB (Chroma, LanceDB, Pinecone) with proper indexing for &gt;100k vectors. Plan-then-act in the system prompt Multi-step planners that emit structured DAGs, with per-step retries. Single-shot reflection Critic agents, self-consistency voting, debate loops. One agent, one loop Multi-agent orchestration, message passing, handoff protocols. Rule of thumb: watch for four signals: ~10k+ memories (you need a real vector index), multiple coordinated agents (you need orchestration), async or parallel tool execution, or multi-tenant SLAs. Any single one of these is a yellow flag worth thinking about; two or more and you should reach for a framework. Below that, the from-scratch version is faster to debug and ships sooner. For a personal planner with a few thousand tasks and one user, you don't need any of it. For a customer-facing system with multiple specialized agents, ten million memories, and SLA-bound latency, you do. The point of writing the from-scratch version is that you now know exactly what you're trading away when you adopt a framework — and what you'd have to rebuild if you ever ripped one out. Troubleshooting Symptom Cause and fix ollama._types.ResponseError: model not found You haven't pulled the model. 
Run ollama pull qwen3:8b (and nomic-embed-text ). Model never calls tools, just answers in prose Either the model is too small or the tool descriptions are too vague. Try qwen3:8b ; if you must use a 3B model, write longer, more imperative descriptions (&quot;Use this to...&quot;). role: tool rejected by Ollama Your history was sliced mid-tool-call. Confirm trim_to_user_boundary is running. The same bug occurs if you forget to persist the assistant tool_calls message. recall returns garbage matches Your nomic-embed-text pull is incomplete or you're packing/unpacking with mismatched dtypes. Re-run ollama pull nomic-embed-text and verify embedding length is 768 ( len(embed(&quot;test&quot;)) ). Agent gets dates wrong The system prompt instruction to call get_today first is missing or the model is ignoring it. Make the instruction more emphatic, or compute today's date in Python and inject it into the system prompt every turn. Reflection loops forever MAX_REFLECTIONS is too high or your reflector is overly strict. Cap at 2 and fail open on malformed JSON output. Slow first reply Ollama loads the model into VRAM on the first request. Subsequent calls are fast. Pre-warm with curl http://localhost:11434/api/generate -d '{&quot;model&quot;:&quot;qwen3:8b&quot;,&quot;prompt&quot;:&quot;hi&quot;}' on startup if you care. Result A local, fully-private personal-planner agent in roughly 300 lines of Python. It runs on your laptop, persists everything to one SQLite file, and demonstrates each of the agent fundamentals — tools, short-term memory, long-term memory, planning, reflection — as a discrete, removable layer rather than as framework magic. The same skeleton generalizes. Swap the task tools for GitHub Issues, Linear, your calendar, or any REST API and you have a domain-specific agent on the same footing. Swap the SQLite memory tables for Postgres and you have something multi-user. The reflection pass shown here is closest to self-refine — within-turn critique-and-retry. 
Persist the critiques across episodes and you have the start of a true Reflexion system (Shinn et al., 2023), where the agent learns from its own past mistakes over time. The point isn't the planner. The point is that you've seen each fundamental in isolation and can now build, debug, or replace any of them without the framework that usually hides them. Source Full source for this tutorial: github.com/devdaviddr/personal-planner-agent ."
  },
  {
    "slug": "2026-05-11-building-a-gp-doctor-scribe",
    "title": "Building a GP Doctor Scribe: The Problem Worth Solving",
    "description": "Build log for an AI-powered, fully-local clinical scribe that turns doctor-patient consultations into structured SOAP notes — using Whisper, Ollama, and a TypeScript/React stack with no cloud dependencies.",
    "tags": [
      "ai",
      "healthcare",
      "whisper",
      "ollama",
      "typescript",
      "build log"
    ],
    "excerpt": "I'm a software engineer, and I wanted to see if I could build something real with generative AI. Not a demo, not a toy project — something deployable. I chose to build a GP scribe: an AI-powered clinical documentation tool that listens to doctor-pati",
    "content": "I'm a software engineer, and I wanted to see if I could build something real with generative AI. Not a demo, not a toy project — something deployable. I chose to build a GP scribe: an AI-powered clinical documentation tool that listens to doctor-patient consultations and produces structured medical notes. The challenge is personal. I'm not trying to compete with existing vendors or disrupt healthcare. I want to prove to myself that I can take several years of software experience, combine it with modern AI tools, and ship something that works. It might be imperfect, the architecture might not be textbook, but if it's functional and deployable, I'll consider it a success. This is the build log for GP Doctor Scribe — a tool designed to run locally, work with any practice management system, and turn recorded consultations into properly formatted SOAP notes. Why This Problem? Clinical documentation seemed like the right scope for this challenge. It's a real problem — doctors spend 50-70% of their time on administrative tasks rather than patient care — but it's also technically achievable with current AI tools. The constraints are clear: privacy matters (no cloud APIs for patient data), the output needs structure (SOAP format, not raw transcription), and it needs to fit existing workflows (copy-paste into whatever system the practice uses). Here's what makes existing solutions unsatisfying, and what I wanted to build differently: Real-time documentation doesn't exist. Doctors either take notes during the consultation (interrupting the flow) or reconstruct from memory afterwards (losing accuracy). Neither is good. Existing tools are over-engineered. Enterprise medical transcription software costs thousands per seat, requires dedicated IT setup, and still produces output that needs heavy editing. Privacy is an afterthought. Most AI dictation tools send audio to third-party servers. 
Patient consultations are not something you send to a cloud API without serious governance overhead. The output doesn't match the workflow. A blob of transcribed text is not a SOAP note. Converting one to the other still requires a doctor's time. The Vision GP Doctor Scribe should feel like having a competent medical secretary in the room. It listens, understands clinical context, and produces structured documentation — ready to paste into whatever system the practice uses. What Doctors Actually Need in a Note A GP consultation note isn't a transcript. It's a structured clinical record with a specific purpose: to communicate what happened to any clinician who reads it later — including the doctor themselves at the next appointment. It needs to be accurate, concise, and formatted in a way that fits the existing record system. Most practices use some variant of the SOAP format, which divides a consultation into four sections: Subjective — What the patient reports. Their symptoms, how long they've had them, what makes them better or worse, relevant history they mention. This is the patient's story in clinical language. Objective — What the clinician observes or measures. Examination findings, vital signs, test results reviewed during the appointment. This is what the doctor sees and records, not what the patient says. Assessment — The clinical interpretation. A working diagnosis, a differential, or a summary of the clinical picture. This is the doctor's professional judgement. Plan — What happens next. Prescriptions issued, referrals made, safety-netting advice given, follow-up scheduled. Clear and actionable. A well-written SOAP note from a 10-minute consultation might be 150-250 words. It should be readable in 30 seconds and give any doctor a complete picture of the encounter. 
What GP Doctor Scribe needs to produce is exactly that — not a wall of transcribed speech, but a properly structured clinical record that the doctor can review, lightly edit if needed, and sign off. The challenge is that extracting a SOAP note from a natural conversation requires real clinical understanding. The patient says &quot;it's been sore for a few days and paracetamol isn't really touching it&quot; — that belongs in Subjective. The doctor says &quot;chest is clear bilaterally, no wheeze&quot; — that's Objective. A language model that doesn't understand the difference will produce something that looks like a note but reads badly to anyone who actually uses these systems. Getting the section boundaries right is most of the work. Core Features for Version 1 Ambient listening — The recording runs in the background from the moment the doctor starts the consultation. There's no scribing workflow to follow, no prompts to respond to mid-appointment. The doctor sees their patient normally; the tool handles the rest. Structured SOAP output — The raw transcript is processed by a local LLM with a carefully designed prompt that enforces section structure. The output isn't a summary or a transcript — it's a four-section clinical note ready to go into the record. Local processing — Audio never leaves the machine. Whisper runs on-device for transcription, Ollama serves the language model locally. There are no API keys, no data agreements with third parties, no governance questions about where patient data goes. It runs offline if it needs to. Editable drafts — The doctor always has final control. The generated note is fully editable before it goes anywhere. The tool is an assistant, not an autonomous scribe — the clinician reviews and approves every word that ends up in the record. Export-ready output — Copy-paste into EMIS, SystmOne, Vision, or a plain text field. No integrations required, no configuration per practice. 
It works with whatever system the practice already uses. Non-Goals for Version 1 These are explicitly out of scope to keep the first version shippable: Direct integration with specific EMR/PMS systems Real-time streaming transcription during the consultation Multi-language support Multi-patient session management Any form of cloud sync or account system Those come later, once the core flow is solid and tested with real users. Technical Architecture This is where it gets interesting. The privacy constraint — no patient audio to external APIs — shapes every technical decision. Transcription: Whisper Running Locally OpenAI's Whisper model, run locally via the Whisper CLI, handles transcription. The small or medium model runs comfortably on modern hardware and produces accuracy that rivals cloud services for clear speech. Consultation Audio → Whisper (local) → Raw Transcript No API keys. No network calls. No data leaving the machine. AI Structuring: Local LLM via Ollama The raw transcript then goes to a local language model — Mistral 7B via Ollama — with a carefully crafted prompt that instructs it to extract and structure the clinical content into SOAP format. Raw Transcript → Local LLM (Ollama) → Structured SOAP Note This is the part that requires the most prompt engineering. Medical context is specific. The model needs to distinguish symptoms from history, differentiate patient-reported from clinician-observed, and produce output in a format that feels familiar to a doctor. User Interface Design The app has three screens. That's it. Screen 1: Record The home screen. One job: start and stop the recording. The doctor taps record at the start of the consultation, taps stop at the end. A running timer gives reassurance that it's working. Nothing else competes for attention. Screen 2: Processing Shown automatically while transcription and note generation run. Communicates progress clearly so the doctor knows to wait rather than assume it's broken. 
Step labels (transcribing, structuring) help set expectations on timing. Screen 3: Review &amp; Export The generated SOAP note broken into its four sections, each independently editable. The doctor reviews, makes any corrections, and copies either individual sections or the full note into their existing system. A one-tap option to discard and re-record handles anything that went badly wrong. Three screens. One job per screen. The entire interaction from stop recording to copied note should take under a minute. Technical Implementation Details The architecture flows in one direction: audio in, structured note out. Every component runs locally. Microphone → Web Audio API (browser capture) → Node.js Backend (TypeScript + Express) → Whisper CLI (child_process) → Raw Transcript → Ollama REST API (Mistral 7B) → Structured SOAP Note → React Frontend (display + edit) The React frontend handles recording, state, and display. A lightweight Node.js backend written in TypeScript handles audio processing — receiving the file, shelling out to Whisper for transcription, and calling the Ollama API for note structuring. Nothing leaves the machine. Why TypeScript for the backend? Keeping the entire stack in TypeScript means shared types between frontend and backend, one language context to hold in your head, and no context switching. The backend doesn't need specialized libraries — it calls whisper as a CLI process and hits the Ollama REST API. TypeScript handles both cleanly. Why Electron? Eventually. For Part 1 we'll run it as a local dev server (React on Vite, Express on a separate port). Electron packaging comes once the core flow is solid — no point wrapping something that isn't working yet. 
Final Tech Stack Layer Technology Reason Transcription Whisper CLI (called from Node) Best accuracy, runs fully local LLM Mistral 7B via Ollama REST API Fast, medically capable, local Backend Node.js + Express + TypeScript Shared language with frontend Frontend React + TypeScript + Vite Familiar, fast dev loop Styling Tailwind CSS Rapid UI iteration Audio Web Audio API + MediaRecorder Browser-native, no extra deps Desktop (later) Electron Package the web app as native Known Risks Planning honestly means naming the risks upfront: LLM hallucination — The model may add clinical details not present in the transcript. The review step is not optional; the UI must make this clear. Whisper accuracy on medical terms — Drug names and anatomical terms are transcription minefields. We'll need a post-processing correction layer. Latency — Transcription + LLM on CPU takes 20-40 seconds for a 10-minute consultation. Acceptable for post-consultation use; we'll be upfront about this. Adoption friction — Doctors are busy and skeptical. If setup takes more than 10 minutes or the output needs heavy editing, it won't get used. What's Next The next step is implementation. We'll build the Node.js backend, wire up Whisper, get the Ollama integration running, and build out the three-screen React frontend. The goal is a working local app that goes from recorded audio to a SOAP note you can copy into your notes system. If you're a GP, practice manager, or healthcare developer who wants to shape what this looks like in practice — get in touch. The design decisions above are all still open to feedback, and the right input now saves a lot of rebuilding later."
  },
  {
    "slug": "2026-05-11-local-ai-trello-bot-mcp-ollama-telegram",
    "title": "Tutorial: Build a Local-AI Trello Bot with MCP, Ollama, and Telegram",
    "description": "A step-by-step tutorial for setting up and understanding a fully-local Trello bot. The stack: a Telegram chat surface, an Ollama-hosted LLM, and an MCP server exposing 67 Trello tools. Nothing leaves your network.",
    "tags": [
      "ai",
      "ollama",
      "mcp",
      "telegram",
      "typescript",
      "self-hosting",
      "tutorial"
    ],
    "excerpt": "This tutorial walks you through setting up a Telegram bot that lets you manage your Trello boards in plain English, backed by a local Ollama instance and a 67-tool MCP server. By the end, you will have: A Telegram bot you can DM with requests like &q",
    "content": "This tutorial walks you through setting up a Telegram bot that lets you manage your Trello boards in plain English, backed by a local Ollama instance and a 67-tool MCP server. By the end, you will have: A Telegram bot you can DM with requests like &quot;what's overdue?&quot; or &quot;add a card to Roadmap called 'investigate flaky CI'&quot; . An MCP server exposing 67 Trello tools, reusable from any MCP host (Claude Desktop, the MCP Inspector, etc.). A Docker Compose deployment that runs the whole thing in a single container. A working understanding of how the pieces fit together so you can extend it for other SaaS APIs. Part 1: What you will build The system has three moving parts: Telegram , the user-facing chat surface. A bot process , which receives messages, drives an agent loop against Ollama, and dispatches tool calls. An MCP server , a subprocess of the bot that exposes Trello operations as typed tools. A separate Ollama host on your LAN runs the LLM. Trello's REST API is the only off-network dependency. flowchart LR user([&quot;Telegram user&quot;]) tg[&quot;Telegram API&quot;] bot[&quot;bot process&quot;] ollama[(&quot;Ollama&lt;br/&gt;LAN GPU box&quot;)] mcp[&quot;MCP server&quot;] trello[(&quot;Trello REST API&quot;)] user --&gt; tg tg --&gt; bot bot &lt;--&gt;|tool-calling| ollama bot -. stdio .-&gt; mcp mcp --&gt; trello classDef external fill:#eef2f7,stroke:#6b7280,color:#15171a classDef internal fill:#dbeafe,stroke:#2563eb,color:#15171a class user,tg,trello external class bot,ollama,mcp internal What is MCP? Model Context Protocol is a small standard for connecting LLMs to tools. The shape: An MCP server exposes a set of tools. Each tool has a name, a description, a JSON schema for its arguments, and a handler that performs the work. An MCP client (an LLM host like Claude Desktop, or a bot you write yourself) connects to the server, asks for the tool catalog, and dispatches the tools the model decides to call. 
The transport is either stdio (parent-child process) or HTTP/SSE (for remote servers). The benefit: the same MCP server you build for your bot is reusable from Claude Desktop, the MCP Inspector , or any future MCP host. Write the integration once, use it from anywhere. Part 2: Prerequisites Before you start, make sure you have the following installed and accessible: Requirement Notes Docker + Docker Compose Tested on Docker Desktop (macOS) and Docker Engine (Linux). An Ollama instance Reachable from the container. Default model qwen3-coder:latest needs ~16 GB VRAM. A Trello account Free tier works. You will create an API key and a token. A Telegram account Free. You will create a bot and find your numeric user id. A code editor Any. You will edit one .env file. Hardware note: Ollama can run on CPU but is too slow for an interactive chat experience. A GPU with at least 16 GB VRAM is recommended for the default model. If you only have 8 GB, swap to a smaller tool-calling model such as llama3.1:8b . Part 3: Get your Trello credentials You need two strings from Trello: an API key and a token . 3.1 Create a Power-Up to get an API key Open https://trello.com/power-ups/admin in a browser (logged in to Trello). Click New to create a Power-Up. Fill in any name and workspace. You are not actually shipping a Power-Up; you only need the credentials it generates. After creation, click the Power-Up, then open the API key tab. Click Generate a new API key . Copy the value and save it as TRELLO_API_KEY . 3.2 Generate a token On the same API key tab, find the description text on the right that contains a blue Token link. Click it. Trello will ask you to authorize the Power-Up against your account. Click Allow . Trello returns a long string. Copy it and save it as TRELLO_API_TOKEN . Common mistake: the Secret on the API key tab is not the token. The token is what you get after clicking the blue Token link and authorizing. 
Using the Secret instead of the Token is the most common cause of 401 Unauthorized errors later. Part 4: Create your Telegram bot 4.1 Talk to BotFather In Telegram, search for @BotFather and open a chat. Send /newbot . Answer the prompts: a display name (anything) and a username ending in bot (must be globally unique). BotFather replies with a token that looks like 123456:ABC-DEF... . Save it as TELEGRAM_BOT_TOKEN . The exchange looks roughly like this: You /newbot BotFather Alright, a new bot. How are we going to call it? Please choose a name for your bot. You My Trello Bot BotFather Good. Now let's choose a username for your bot. It must end in `bot`. You my_trello_bot BotFather Done! Congratulations on your new bot. Use this token to access the HTTP API: 123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11 Keep your token secure... Keep this token private. Anyone with it can impersonate your bot. 4.2 Find your numeric Telegram user id The bot uses your numeric Telegram id (not your @handle ) for authorization. In Telegram, search for @userinfobot . Send any message. It replies with your numeric id (something like 987654321 ). Save it as TELEGRAM_ALLOWED_USER_IDS . If you want to allow multiple users, list ids comma-separated: 123,456,789 . Part 5: Set up Ollama Ollama runs the LLM. You can host it on the same machine as the bot, or on a separate GPU box on your LAN. 5.1 Install Ollama Follow the install instructions at https://ollama.com . On macOS: brew install ollama ollama serve On Linux: curl -fsSL https://ollama.com/install.sh | sh 5.2 Pull a tool-calling model The default model used by this bot is qwen3-coder:latest . Pull it: ollama pull qwen3-coder:latest You should see something like this once it finishes: pulling manifest pulling 0b8c4f5e7e9a... 100% ▕████████████████▏ 18 GB pulling 9f2c8a... 100% ▕████████████████▏ 12 KB pulling 7d6f1a... 
100% ▕████████████████▏ 1.4 KB verifying sha256 digest writing manifest success Tested models that work: qwen3-coder:latest (~16 GB VRAM, recommended); qwen-pro:latest; llama3.1:8b (works on smaller GPUs). Avoid: Gemma family models. Tool-calling reliability across a 67-tool surface is too low for an agent loop. 5.3 Confirm it is reachable If Ollama runs on the same machine as the bot, the default http://localhost:11434 works. If it runs on a different machine on your LAN, find its IP and confirm: curl http://&#x3C;ollama-ip>:11434/api/tags You should see a JSON list of installed models. Save the URL as OLLAMA_HOST for later. Part 6: Clone, configure, and run You now have all four secrets and a working Ollama. Time to start the bot. 6.1 Clone the repository git clone https://github.com/devdaviddr/trello-mcp-service.git cd trello-mcp-service 6.2 Configure your environment Copy the example file and fill in your values: cp .env.example .env $EDITOR .env The minimum you must set: TRELLO_API_KEY=... TRELLO_API_TOKEN=... TELEGRAM_BOT_TOKEN=... TELEGRAM_ALLOWED_USER_IDS=... OLLAMA_HOST=http://host.docker.internal:11434 # if Ollama is on the host OLLAMA_MODEL=qwen3-coder:latest OLLAMA_HOST from inside Docker: Same machine, macOS/Windows: http://host.docker.internal:11434; Same machine, Linux: http://host.docker.internal:11434 (the included extra_hosts config makes this work); Different machine on LAN: http://&lt;lan-ip&gt;:11434 6.3 Start the container docker compose up --build The first build takes a minute or two. 
Once running you should see logs like: trello-bot | [mcp-server] connecting trello client trello-bot | [mcp-server] registered 67 tools trello-bot | [bot] ollama host: http://host.docker.internal:11434 trello-bot | [bot] model: qwen3-coder:latest trello-bot | [bot] starting long-poll... 6.4 Test it Open Telegram, find your bot by the username you gave BotFather, and send /start. The bot will greet you back. Now try a real query: what boards do I have? The first reply will take 20–60 seconds while Ollama loads model weights into VRAM. Subsequent replies should land in 1–3 seconds. Built-in commands: /start: greeting; /reset: clear this chat's conversation history; /whoami: show your Telegram numeric id and whether you are authorized (use this if the bot replies &quot;Not authorized&quot;). Part 7: How it works under the hood Now that the bot is running, this section explains the implementation so you can extend or fork it. 7.1 Defining a tool Each Trello operation is registered as one MCP tool. The project uses zod for schemas. One definition gives compile-time types and runtime validation, and converts cleanly to JSON Schema for the LLM. def(\"create_card\", \"Create a new card in a list.\", z.object({ list_id: z.string(), name: z.string(), description: z.string().optional(), due: z.string().optional().describe(\"ISO 8601 due date\"), }), async (args, trello) => trello.cards.create(args.list_id, args.name, args.description, args.due), ); The handler delegates to a thin Trello REST client. 
zod parses the LLM's arguments at runtime, so if the model hallucinates a field type or omits a required arg, the call is rejected with a readable error string. That error becomes the next role: &quot;tool&quot; message, and the model uses it to fix its mistake on the next turn. This pattern is repeated 67 times, one tool per Trello capability. flowchart LR zod[&quot;zod schema&lt;br/&gt;z.object({ list_id, name, ... })&quot;] ts[&quot;TypeScript types&lt;br/&gt;(compile time)&quot;] json[&quot;JSON Schema&lt;br/&gt;(advertised to the LLM)&quot;] parser[&quot;zod.parse(args)&lt;br/&gt;(catches LLM hallucinations)&quot;] handler[&quot;handler(args, trello)&quot;] trello[(&quot;Trello REST API&quot;)] err[&quot;error string&lt;br/&gt;→ role: tool message&lt;br/&gt;→ model retries next turn&quot;] zod --&gt; ts zod --&gt; json zod --&gt; parser parser --&gt;|ok| handler parser --&gt;|fail| err handler --&gt; trello classDef source fill:#dbeafe,stroke:#2563eb,color:#15171a classDef artifact fill:#eef2f7,stroke:#6b7280,color:#15171a classDef external fill:#fef3c7,stroke:#d97706,color:#15171a class zod source class ts,json,parser,handler,err artifact class trello external 7.2 Running the MCP server over stdio The MCP server is a small glue file: import { Server } from \"@modelcontextprotocol/sdk/server/index.js\"; import { StdioServerTransport } from \"@modelcontextprotocol/sdk/server/stdio.js\"; import { CallToolRequestSchema, ListToolsRequestSchema } from \"@modelcontextprotocol/sdk/types.js\"; const server = new Server({ name: \"trello-mcp\", version: \"0.1.0\" }, { capabilities: { tools: {} } }); server.setRequestHandler(ListToolsRequestSchema, async () => ({ tools: toolSchemas })); server.setRequestHandler(CallToolRequestSchema, async (req) => { const tool = toolsByName.get(req.params.name); if (!tool) { return { isError: true, content: [{ type: \"text\", text: `Unknown tool: ${req.params.name}` }] }; } try { const result = await tool.handler(req.params.arguments ?? {}, trello); return { content: [{ type: \"text\", text: JSON.stringify(result ?? { ok: true }) }] }; } catch (err) { const message = err instanceof Error ? err.message : String(err); return { isError: true, content: [{ type: \"text\", text: message }] }; } }); await server.connect(new StdioServerTransport()); 
stdio means the server runs as a subprocess of whoever launches it. No port to expose, no auth layer to manage, zero network latency on each tool call. The same binary works standalone with Claude Desktop pointed at it, covered in Part 8. 7.3 The Trello REST client with retries Trello rate-limits at 100 requests per 10 seconds per token. A naïve fetch will fail on the first 429. The request layer in this project retries with jittered exponential backoff and honors Retry-After when Trello provides it. async request&#x3C;T>(method: string, path: string, params: QueryParams = {}): Promise&#x3C;T> { const url = `${BASE}${path}?${this.auth(params)}`; let lastBody = \"\"; let lastStatus = 0; for (let attempt = 1; attempt &#x3C;= MAX_ATTEMPTS; attempt++) { const res = await fetch(url, { method }); if (res.ok) { const text = await res.text(); return text ? (JSON.parse(text) as T) : (undefined as T); } lastStatus = res.status; lastBody = await res.text(); if (!RETRY_STATUSES.has(res.status) || attempt === MAX_ATTEMPTS) break; const retryAfter = Number(res.headers.get(\"retry-after\")); const backoff = Number.isFinite(retryAfter) &#x26;&#x26; retryAfter > 0 ? retryAfter * 1000 : Math.min(8000, 500 * 2 ** (attempt - 1)) + Math.random() * 250; await sleep(backoff); } throw new Error(`Trello ${method} ${path} failed: ${lastStatus} ${lastBody}`); } 
Settings: RETRY_STATUSES is {429, 502, 503, 504}. Up to 4 attempts. The final error includes the status and response body, so failures are debuggable from logs. This single function carries every Trello call in the codebase. 
7.4 The agent loop The Ollama npm package speaks the tool-calling API directly, so the loop is short: for (let turn = 0; turn &#x3C; MAX_TURNS; turn++) { const res = await ollama.chat({ model, messages, tools, stream: false }); const msg = res.message; messages.push(msg); const calls = msg.tool_calls ?? []; if (calls.length === 0) return { reply: msg.content ?? \"\" }; for (const call of calls) { let toolResult: string; try { toolResult = await mcp.callTool(call.function.name, normalizeArgs(call.function.arguments)); } catch (err) { toolResult = `ERROR: ${err instanceof Error ? err.message : String(err)}`; } messages.push({ role: \"tool\", content: truncate(toolResult), tool_name: call.function.name }); } } What it does: If tool_calls is empty, the model has produced its final answer and the loop returns. Otherwise it dispatches each call to the MCP server and pushes the result back as a role: &quot;tool&quot; message. Errors are included; that is how the model recovers. MAX_TURNS defaults to 16 so a confused model cannot spin forever. Tool output is truncated to a 16 KB budget before entering history, so a large list_boards does not blow past the context window. 7.5 Telegram wiring The Telegram side, using grammy: const bot = new Bot(token); bot.on(\"message:text\", async (ctx) => { if (!isAuthorized(ctx.from?.id)) return ctx.reply(\"Not authorized\"); await chatQueue.run(ctx.chat.id, async () => { const history = historyStore.get(ctx.chat.id); const { reply, history: next } = await agent.chat(history, ctx.message.text); historyStore.set(ctx.chat.id, next); await ctx.reply(reply); }); }); await bot.start(); 
Two non-obvious details, learned the hard way: chatQueue serializes messages per chat. If two messages arrive in the same chat before the first finishes, both handlers would read the same starting history, and the second one's set() would clobber the first. A small Promise-queue keyed by chat id prevents this. History trim must land on a user-message boundary. Tool-calling APIs require an assistant message with tool_calls to be immediately followed by role: &quot;tool&quot; messages for each call. A naïve slice(-40) can leave an orphan tool result, and the next API call rejects it. The project's trim walks the cut point forward until it lands on role: &quot;user&quot;. Part 8: Use the MCP server standalone The MCP server is independent of the bot. You can plug it into any MCP host. 
8.1 With Claude Desktop Add this to claude_desktop_config.json: { \"mcpServers\": { \"trello\": { \"command\": \"node\", \"args\": [\"/absolute/path/to/trello-mcp-service/dist/mcp-server/index.js\"], \"env\": { \"TRELLO_API_KEY\": \"...\", \"TRELLO_API_TOKEN\": \"...\" } } } } Restart Claude Desktop. All 67 Trello tools become available in any conversation. Ask &quot;create a card on Roadmap called Buy milk&quot; and Claude will discover create_card, fill the arguments, and return the result inline as a tool-use turn. The same goes for the read-side tools: &quot;what's on my board?&quot; produces a list_boards + list_cards_on_board chain without any extra prompting. 8.2 With the MCP Inspector npx @modelcontextprotocol/inspector node dist/mcp-server/index.js The Inspector opens a browser UI where you can browse the tool catalog, read schemas, and call tools manually. It is the fastest way to verify tool behavior without involving an LLM, and the right place to debug a failing tool before you suspect the model. Part 9: Customizing and extending 9.1 Add a new Trello tool Open src/mcp-server/tools/ and pick the file matching the resource (e.g. cards.ts). Add a new def(...) registration with a name, description, zod schema, and async handler. Rebuild the container: docker compose up --build. The new tool is picked up automatically. There is no separate registration step. 9.2 Swap to a different SaaS API The project is a clean reference for any REST-backed SaaS. To fork it: Replace src/mcp-server/trello/ with a client for your target API (Linear, GitHub Issues, Notion, etc.). Replace the tool registrations under src/mcp-server/tools/ with your new operations. 
Everything else stays the same: the agent loop, Telegram wiring, history management, and Docker setup. The whole codebase is roughly 1,900 lines of TypeScript across 35 files. 9.3 Tunable knobs All behavior is env-var driven. Useful ones: Var Default Purpose MAX_TURNS 16 Max chained tool calls per user message. TOOL_OUTPUT_CHAR_BUDGET 16000 Tool output truncation before entering history. OLLAMA_TIMEOUT_MS 120000 Per-call abort timeout for Ollama. OLLAMA_MODEL qwen3-coder:latest Any tool-calling-capable model. Part 10: Troubleshooting Symptom Cause and fix Bot starts but never replies, no errors in logs On Apple Silicon, node:20-alpine runs under Rosetta and Node's TLS hangs on api.telegram.org . The project uses node:20-slim to avoid this. If you forked back to alpine, switch back. Not authorized reply in Telegram TELEGRAM_ALLOWED_USER_IDS must contain your numeric id, not your @handle . Send /whoami to the bot to see what id Telegram reports for you. 401 Unauthorized from Trello The Secret on the Power-Up API key page is not the Token. Click the blue Token link, authorize, and use that string. I hit my tool-call limit A multi-step request exceeded MAX_TURNS=16 . Bump it via env or break the request into smaller asks. Frequent hits often mean the model is looping; try a stronger model. First reply takes 20–60 seconds Ollama cold-loads the model into VRAM on the first request. Subsequent calls are normal-speed. Pre-warm with a curl to /api/generate if you want the first user-facing reply to be fast. Bot can reach internet but not Ollama If Ollama runs on the host, set OLLAMA_HOST=http://host.docker.internal:11434 . The included docker-compose.yml has the extra_hosts mapping needed for Linux. Result End-to-end on a warm model: roughly 1.5 seconds per reply. Cold start: 20–60 seconds for the first turn while Ollama loads weights into VRAM. The MCP server exposes 67 tools, from create_card to list_cards_due_soon to set_card_cover . 
Because it speaks plain MCP, plugging it into Claude Desktop is a four-line config addition. Forking it for a different SaaS is roughly two evenings of work for a comparable surface. Source Full source, README, architecture diagram, and the complete 67-tool inventory: github.com/devdaviddr/trello-mcp-service ."
  },
  {
    "slug": "2026-05-10-multiple-apps-docker-networks-cloudflare-tunnels",
    "title": "Running Multiple Apps with Traefik, Docker, and Cloudflare Tunnels",
    "description": "How to host five apps on one Mac Mini without cramming them into a single compose file. One shared network, Traefik for routing and TLS, one Cloudflare tunnel.",
    "tags": [
      "self-hosting",
      "docker",
      "traefik",
      "cloudflare",
      "infrastructure"
    ],
    "excerpt": "After the Mac Mini setup from the last post, I got greedy. The M4 was idling at 15% CPU with 8 GB of RAM untouched, the Cloudflare tunnel was already wired up, and the marginal cost of another app was effectively zero. So I started adding more. Five ",
    "content": "After the Mac Mini setup from the last post, I got greedy. The M4 was idling at 15% CPU with 8 GB of RAM untouched, the Cloudflare tunnel was already wired up, and the marginal cost of another app was effectively zero. So I started adding more. Five apps later, the lesson is straightforward: you don't need to cram them into one giant compose file. Each app gets its own directory, its own compose file, its own database. They share one Docker network, one reverse proxy (Traefik), and one Cloudflare tunnel. Routing and TLS are declared with Docker labels on each app, so there's no central config file to keep in sync. That's the whole pattern. Don't do this The temptation, when you start, is to put everything in a single docker-compose.yml : services : app1-frontend : ... app1-backend : ... app1-db : ... app2-frontend : ... app2-backend : ... app2-db : ... services : app1-frontend : ... app1-backend : ... app1-db : ... app2-frontend : ... app2-backend : ... app2-db : ... It works until it doesn't. One service crashes and you restart the whole stack. You update one app and risk breaking the other three. Ports collide, dependencies tangle, the file balloons. The blast radius of any change is the entire system. The pattern Mac Mini M4 | v Cloudflare Tunnel | v Traefik (proxy + TLS) | ┌───────┬──────┼──────┬───────┐ v v v v v App1 App2 App3 App4 ... (own (own (own (own stack) stack) stack) stack) One shared web Docker network connects every app to Traefik. Traefik watches the Docker socket and discovers routes from labels on each container, so there's no central routing file. Cloudflare hands inbound traffic to Traefik, which decides which app gets it. Each app's database stays on its own internal network, unreachable from anywhere else. 
File layout: ~/apps/ ├── portfolio/ │ ├── docker-compose.yml │ ├── frontend/ │ ├── backend/ │ └── .env ├── api-project/ │ ├── docker-compose.yml │ ├── api/ │ └── .env └── shared/ └── traefik/ ├── docker-compose.yml └── .env Step 1: the shared network docker network create web That's the entire setup step. The network persists across container restarts and reboots. Every app's compose file references it as external. Step 2: an app A typical app compose file. Note the labels on each service Traefik should expose; the database has none, so Traefik never sees it. services: frontend: build: ./frontend container_name: portfolio-frontend networks: [web, portfolio-internal] labels: - \"traefik.enable=true\" - \"traefik.http.routers.portfolio.rule=Host(`portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio.entrypoints=web\" - \"traefik.http.services.portfolio.loadbalancer.server.port=80\" restart: unless-stopped backend: build: ./backend container_name: portfolio-backend env_file: .env networks: [web, portfolio-internal] labels: - \"traefik.enable=true\" - \"traefik.http.routers.portfolio-api.rule=Host(`api.portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio-api.entrypoints=web\" - \"traefik.http.services.portfolio-api.loadbalancer.server.port=4000\" restart: unless-stopped db: image: postgres:16-alpine container_name: portfolio-db env_file: .env volumes: - portfolio-db:/var/lib/postgresql/data networks: [portfolio-internal] restart: unless-stopped networks: web: external: true portfolio-internal: volumes: portfolio-db: 
The two-networks-per-service trick is what makes this clean. The frontend and backend join web so Traefik can reach them. The database stays on portfolio-internal only, so nothing outside the app can talk to it. Each app is an island with one bridge to the proxy. Bring it up: cd ~/apps/portfolio docker compose up -d A second app is the same pattern. New directory, its own compose file, its own internal network name, its own router labels. Repeat as needed. Step 3: Traefik ~/apps/shared/traefik/docker-compose.yml: services: traefik: image: traefik:v3.1 container_name: traefik command: - \"--providers.docker=true\" - \"--providers.docker.exposedbydefault=false\" - \"--providers.docker.network=web\" - \"--entrypoints.web.address=:80\" ports: - \"80:80\" volumes: - /var/run/docker.sock:/var/run/docker.sock:ro networks: [web] restart: unless-stopped networks: web: external: true A few things to notice. 
exposedbydefault=false means Traefik ignores any container that doesn't explicitly set traefik.enable=true, so a stray container can't accidentally appear on the public internet. The Docker socket is mounted read-only because Traefik only needs to watch it. And there's no central routing config file at all; Traefik builds the router table from labels on running containers and updates it live as you start and stop services. Bring it up: cd ~/apps/shared/traefik docker compose up -d At this point, the apps are reachable on localhost:80 with the right Host header. Time to give them real TLS. Step 4: real SSL via Cloudflare DNS Cloudflare's edge already terminates TLS for users with their own certificate, and the tunnel from edge to your Mac is encrypted. So strictly, Traefik doesn't need its own certs to be safe. There's still a good case for issuing real Let's Encrypt certs at the origin: it gives you defense in depth (the cloudflared-to-Traefik hop is also TLS), it lets you flip Cloudflare into &quot;Full (strict)&quot; SSL mode, and it future-proofs the setup if you ever expose a service outside the tunnel. The wrinkle when you're behind a tunnel is that the HTTP-01 challenge can't reach you. Cloudflare DNS points at the tunnel, not at your home IP, and there's no public port for Let's Encrypt to hit. The DNS-01 challenge works fine, though: it just writes a TXT record. Traefik supports DNS-01 against the Cloudflare API natively. First, create a scoped API token. In the Cloudflare dashboard: My Profile → API Tokens → Create Token → &quot;Edit zone DNS&quot; template. Scope it to the specific zone(s) you're using. 
Save the token in ~/apps/shared/traefik/.env: CF_DNS_API_TOKEN=&lt;the token&gt; Then update the Traefik compose file to add the HTTPS entrypoint, the ACME resolver, and a place to persist certs: services: traefik: image: traefik:v3.1 container_name: traefik command: - \"--providers.docker=true\" - \"--providers.docker.exposedbydefault=false\" - \"--providers.docker.network=web\" - \"--entrypoints.web.address=:80\" - \"--entrypoints.websecure.address=:443\" - \"--certificatesresolvers.cloudflare.acme.dnschallenge=true\" - \"--certificatesresolvers.cloudflare.acme.dnschallenge.provider=cloudflare\" - \"--certificatesresolvers.cloudflare.acme.email=you@yourdomain.com\" - \"--certificatesresolvers.cloudflare.acme.storage=/letsencrypt/acme.json\" env_file: .env ports: - \"80:80\" - \"443:443\" volumes: - /var/run/docker.sock:/var/run/docker.sock:ro - letsencrypt:/letsencrypt networks: [web] restart: unless-stopped networks: web: external: true volumes: letsencrypt: Then update each exposed service in the app compose files to use the websecure entrypoint and the Cloudflare resolver: labels: - \"traefik.enable=true\" - \"traefik.http.routers.portfolio.rule=Host(`portfolio.yourdomain.com`)\" - \"traefik.http.routers.portfolio.entrypoints=websecure\" - \"traefik.http.routers.portfolio.tls.certresolver=cloudflare\" - \"traefik.http.services.portfolio.loadbalancer.server.port=80\" 
The first time Traefik sees a router with tls.certresolver=cloudflare, it asks Let's Encrypt for a cert, Let's Encrypt asks for a TXT record under _acme-challenge.portfolio.yourdomain.com, Traefik writes it via the Cloudflare API, Let's Encrypt verifies, and the cert lands in /letsencrypt/acme.json. Renewals happen on their own. You don't think about it again. Restart the proxy so it picks up the new args: cd ~/apps/shared/traefik docker compose up -d Watch the logs the first time; cert issuance takes a few seconds and you'll see Traefik report success. Step 5: the tunnel Now point the tunnel at HTTPS on Traefik. 
~/.cloudflared/config.yml: tunnel: &#x3C;your-tunnel-id> credentials-file: /Users/&#x3C;username>/.cloudflared/&#x3C;tunnel-id>.json ingress: - hostname: portfolio.yourdomain.com service: https://localhost:443 - hostname: api.portfolio.yourdomain.com service: https://localhost:443 - hostname: api.yourdomain.com service: https://localhost:443 - service: http_status:404 Every hostname goes to https://localhost:443. That's Traefik. Traefik reads the Host header and forwards to the right container on the web network. Cloudflare doesn't need to know about your app topology; Traefik owns that. Add the DNS records and reload the tunnel: cloudflared tunnel route dns &#x3C;tunnel-id> portfolio.yourdomain.com cloudflared tunnel route dns &#x3C;tunnel-id> api.portfolio.yourdomain.com cloudflared tunnel route dns &#x3C;tunnel-id> api.yourdomain.com sudo launchctl kickstart -k system/com.cloudflare.cloudflared In the Cloudflare dashboard, you can now switch SSL/TLS mode for the zone to Full (strict). Edge-to-origin traffic is verified end to end. 
How a request flows A request to portfolio.yourdomain.com lands at Cloudflare's edge, gets TLS terminated against Cloudflare's cert, travels down the encrypted outbound tunnel to https://localhost:443 on the Mac, hits Traefik (which presents the Let's Encrypt cert it issued for that hostname), and Traefik forwards the decrypted request to the portfolio-frontend container on the web network. The response retraces the path. From the app's point of view, it's just receiving a plain HTTP request from a sibling container. Day-to-day Each app is independent: # update portfolio cd ~/apps/portfolio &#x26;&#x26; docker compose pull &#x26;&#x26; docker compose up -d # tail one app's logs cd ~/apps/api-project &#x26;&#x26; docker compose logs -f # restart one service cd ~/apps/portfolio &#x26;&#x26; docker compose restart backend # see what Traefik thinks is routable docker logs traefik | grep -i router Adding a new app: create the directory, write a compose file that joins web and declares its router labels, docker compose up -d. Traefik picks it up within a second or two and (if you set tls.certresolver=cloudflare) issues a cert on the spot. Add the tunnel route. Five minutes start to finish. Resource usage What's currently running on the base Mac Mini M4: Portfolio site (React + Express + Postgres); Side project API (Node + Redis + Postgres); Personal dashboard (Vue + SQLite); Internal tool (Python FastAPI, no database); A staging environment that comes and goes. Steady state: 15 to 20% CPU, around 8 of the 16 GB of RAM in use, 45 of the 256 GB SSD. Plenty of headroom. 
When one big compose file is still right Use a single compose file when the services are one logical app and ship together (a frontend pinned to a specific backend version, for example). Use multiple compose files when the apps are independent projects with their own update schedules. For self-hosting a portfolio of side projects, multiple compose files win every time. A few things worth knowing Router names must be unique across the whole proxy. Traefik discovers routers by label, not by container, so two services labeled traefik.http.routers.api... will fight. Prefix with the app name ( portfolio-api , dashboard-api , etc.). Lock down exposedbydefault=false . Without it, every container on the web network gets auto-published. With it, only services that explicitly set traefik.enable=true are exposed. Treat this as non-negotiable. One .env per app. No shared secrets across projects. If one leaks, the blast radius is exactly one app. The Cloudflare API token lives in the Traefik directory only. Watch disk usage. Multiple databases and a growing collection of images add up faster than you expect. docker system df once a week, docker system prune -a once a month. Stagger backups. Five Postgres dumps at midnight is a real IO spike. Spread them across the early-morning hours. Persist acme.json . It holds your certs and account key. The letsencrypt named volume in the compose file above does this; if you blow it away, Traefik re-issues everything from scratch and you can hit Let's Encrypt rate limits. Closing The point of this setup isn't density. It's that the second app costs nothing in operational complexity, and the third costs even less. Traefik handles routing and TLS without a config file you have to remember to update. You stop thinking about whether a project is &quot;worth&quot; hosting. You build it, drop it in ~/apps/ , label the router, and move on. The tunnel doesn't care, the proxy doesn't care, and the Mac Mini definitely doesn't care."
  },
  {
    "slug": "2026-05-10-self-hosting-mac-mini-cloudflare-tunnels",
    "title": "Self-Hosting a Full-Stack App on a Mac Mini M4 with Cloudflare Tunnels",
    "description": "How I moved a React + Express + Postgres app off an $11/mo VPS onto a $599 Mac Mini, with Cloudflare Tunnels handling the public-facing parts.",
    "tags": [
      "self-hosting",
      "infrastructure",
      "cloudflare",
      "docker"
    ],
    "excerpt": "I was paying $11 a month for a VPS to host what amounted to a small React frontend, an Express API, and a Postgres database. $132 a year isn't ruinous, but the value side was thin: shared CPU cores that throttle under load, 1 GB of RAM that fills up ",
    "content": "I was paying $11 a month for a VPS to host what amounted to a small React frontend, an Express API, and a Postgres database. $132 a year isn't ruinous, but the value side was thin: shared CPU cores that throttle under load, 1 GB of RAM that fills up fast, 25 GB of disk that fills up faster. A side project pulling 100 visitors a day shouldn't need any of that babysitting. Last month I bought a base Mac Mini M4 for $599. I moved the whole stack onto it and put it on the public internet through Cloudflare Tunnels. No port forwarding, no exposed home IP, no firewall rules to maintain. It runs me about $2 a month in electricity. Here's how it's wired up. Why the Mac Mini holds up The base Mac Mini M4 is a real server. 10-core CPU, 16 GB of unified memory, 256 GB SSD. It idles around 10 watts and barely touches 30 under load. Compare that to an $11 VPS, where your &quot;1 vCPU&quot; is a fraction of someone else's processor and gets throttled the moment a noisy neighbor needs it. Apple Silicon does the heavy lifting. The M4 runs Docker containers cool and quiet. With my Vite build watcher, the Express API, Postgres, and Redis all running, Activity Monitor barely registers a blip. No fan ramps. No thermal throttling. And you own it. No surprise tier increases, no deprecation emails, no terms-of-service rewrites pointed at your project. The pricing path VPS providers solve real problems. The pricing path is the issue. You start at $5, bump to $11 when your app needs more RAM, add $2 for backups, then $11 for a staging environment, then more disk. Every step costs more for resources that are still shared and still constrained. For a side project, the math rarely works in your favor. You don't need someone else's slice of a server. You need your app to run reliably and cheaply. A Mac Mini does that. 
The architecture Internet Users | v ┌────────────────┐ │ Cloudflare │ │ Edge Network │ │ (SSL, DDoS) │ └────────┬───────┘ | Encrypted Tunnel (outbound) | v ┌────────────────┐ │ Mac Mini M4 │ │ (Your Home) │ └────────┬───────┘ | ┌───────────┴───────────┐ | | ┌────v─────┐ ┌─────v────┐ │ nginx │ │ Express │ │ (React) │ │ API │ │ :3000 │ │ :4000 │ └──────────┘ └─────┬────┘ | ┌─────v─────┐ │PostgreSQL │ │ Database │ └───────────┘ All in Docker containers Cloudflare Tunnels is what makes this safe. The Mac Mini opens an outbound connection to Cloudflare's edge. All inbound traffic flows through Cloudflare, picks up SSL and DDoS protection on the way, and is forwarded down the tunnel to the box. The Mac never accepts an inbound connection. There's nothing to port-forward, nothing to expose, no firewall hole to leave open. Docker setup Install Docker via Homebrew: brew install --cask docker Project layout: ~/my-app/ ├── docker-compose.yml ├── frontend/ # Vite-built React app (static dist/) ├── backend/ # Express API └── nginx.conf # static-file config for the frontend container The compose file ties it together: services: frontend: image: nginx:alpine volumes: - ./frontend/dist:/usr/share/nginx/html - ./nginx.conf:/etc/nginx/nginx.conf:ro ports: - \"3000:80\" restart: unless-stopped backend: build: ./backend env_file: .env ports: - \"4000:4000\" depends_on: - db restart: unless-stopped db: image: postgres:16-alpine env_file: .env volumes: - postgres_data:/var/lib/postgresql/data restart: unless-stopped volumes: postgres_data: 
A couple of notes on this. The old version: '3.8' field is gone; Compose v2 ignores it. Secrets live in a .env file, not in the compose file itself. Three containers, around 500 MB of memory between them. The 16 GB Mini handles this with plenty of headroom for whatever you want to run alongside it. A minimal Express API: const express = require('express'); const { Pool } = require('pg'); const app = express(); const pool = new Pool({ connectionString: process.env.DATABASE_URL }); app.use(express.json()); app.get('/api/health', (_req, res) => res.json({ status: 'ok' })); app.get('/api/data', async (_req, res) => { const { rows } = await pool.query('SELECT * FROM items'); res.json(rows); }); app.listen(4000, () => console.log('API on :4000')); The React frontend is a standard Vite build. Nothing exotic. Bring it up: docker compose up -d Frontend on :3000, API on :4000. Local only, for now. Wiring up Cloudflare Tunnels cloudflared is a small daemon that runs on the Mac and holds an outbound connection to Cloudflare. No DNS gymnastics. No certificate renewal. No firewall rules. brew install cloudflare/cloudflare/cloudflared cloudflared tunnel login cloudflared tunnel create my-app The login step opens a browser to pick a domain you've added to Cloudflare. No domain? 
Cloudflare will hand out a free *.trycloudflare.com subdomain for testing: cloudflared tunnel --url http://localhost:3000 starts a quick tunnel without the login step. Create ~/.cloudflared/config.yml : tunnel: <your-tunnel-id> credentials-file: /Users/<username>/.cloudflared/<tunnel-id>.json ingress: - hostname: myapp.yourdomain.com service: http://localhost:3000 - hostname: api.myapp.yourdomain.com service: http://localhost:4000 - service: http_status:404 Route DNS through the tunnel: cloudflared tunnel route dns my-app myapp.yourdomain.com cloudflared tunnel route dns my-app api.myapp.yourdomain.com Run it: cloudflared tunnel run my-app The app is now on the public internet behind Cloudflare's edge: HTTPS, DDoS protection, no exposed IP. To make it survive reboots: sudo cloudflared service install sudo launchctl start com.cloudflare.cloudflared The cost picture Mac Mini M4 (one-time): $599. Electricity at typical residential rates: roughly $2 a month. A domain is optional ($12 a year if you want one; the trycloudflare.com subdomain is free). A comparable VPS, once you add backups and a staging tier, lands around $200 to $300 a year. Net of electricity, that is about $176 to $276 a year saved, so the Mini pays itself off in roughly two to three and a half years. After that you're paying for power. And what you get for the same money is 10 CPU cores and 16 GB of memory instead of fractions of a shared core. Cloudflare Tunnels itself is on the free tier. Edge SSL, DDoS protection, and a global CDN at zero marginal cost. What you're trading off The Mac has to stay powered and online. 
A power blip or ISP outage takes you down. For side-project traffic that's usually fine, and a small UPS handles the brownouts. If you need four nines, this isn't the play. Back up Postgres. A nightly cron that dumps to an external drive (and optionally uploads to cheap object storage) is enough for most setups. Watch the disk. 256 GB is plenty until Docker images quietly accumulate. docker system prune -a once a month keeps it honest. Patch the system. brew upgrade cloudflared , pull fresh base images, restart the stack. Should be a 10-minute job, not a quarterly project. When this isn't the right call If your traffic is high, globally distributed, or genuinely needs HA, stay with the cloud. The Mac Mini is for the long tail: side projects, small businesses, internal tools, personal apps that have no business paying $20 a month for managed infrastructure. That's a lot of projects. Probably most of yours. What you get back, beyond the savings, is visibility. The box is on your desk. The logs are in your terminal. There are three containers, and you can see all of them. The whole stack fits in your head, which means you actually know what's running. Closing Self-hosting in 2026 isn't the chore it was a decade ago. The hardware is small, quiet, and cheap. The tools, Docker and Cloudflare Tunnels, hide the parts that used to be painful. You don't need a data center to run a real app. You need a Mac Mini and a tunnel."
  },
  {
    "slug": "2026-01-19-building-a-scalable-pdf-ai-analysis-pipeline",
    "title": "Building a Scalable PDF AI Analysis Pipeline with Python Microservices, Docker, Groq, and RabbitMQ",
    "description": "PDF analysis pipeline built with Python microservices, Docker, RabbitMQ, and Groq AI for scalable document processing and analysis.",
    "tags": [
      "ai",
      "python",
      "microservices",
      "docker",
      "rabbitmq",
      "groq",
      "pdf processing"
    ],
    "excerpt": "For developers and engineering teams, PDF documents represent a massive repository of critical information — technical specifications, research papers, financial reports, legal contracts, and customer submissions. However, PDFs are essentially locked",
    "content": "For developers and engineering teams, PDF documents represent a massive repository of critical information — technical specifications, research papers, financial reports, legal contracts, and customer submissions. However, PDFs are essentially locked boxes of data. Unlike structured databases or searchable codebases, you cannot query, aggregate, or analyze hundreds of PDFs simultaneously without manual effort. We face a fundamental bottleneck: document quantity versus extraction capacity. As organizations accumulate thousands of PDFs, the gap between having information and actually leveraging it grows exponentially. The Problem: The Document Processing Bottleneck Traditional approaches to PDF processing create immediate friction. The challenges are architectural: Synchronous Blocking — Users upload a document and wait while a single-threaded process extracts text, calls an AI API, and returns results. One slow PDF blocks everything behind it. Resource Mismatch — Text extraction is CPU-intensive. AI inference is network-bound. Storage operations are I/O-heavy. Running these on a single server wastes resources during each stage. Poor User Experience — Without async processing, users stare at loading spinners for minutes, unsure if the system crashed or is still working. The Solution: An Event-Driven Microservices Pipeline We are going to build a production-grade pipeline that decouples each processing stage into independent, scalable services. By leveraging Docker, RabbitMQ, Groq's inference API, and Streamlit, we will create a system that handles concurrent PDF uploads, processes them asynchronously, and delivers results through a polished web interface. The core innovation here is RabbitMQ message queuing. Rather than chaining services together synchronously, each service publishes events that downstream services consume. This pattern enables horizontal scaling, fault tolerance, and independent deployment cycles. 
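A minimal in-process sketch of that publish/consume pattern, using Python's standard-library queue module as a stand-in for RabbitMQ. The names run_demo and extractor_worker are hypothetical and are not part of the pipeline built below, and the text extraction step is faked. The point is only that the producer hands off an event and moves on while a separate worker picks it up.

```python
import queue
import threading

STOP = object()  # sentinel that tells a worker to shut down


def extractor_worker(inbox, outbox):
    '''Consume upload events and publish one text event for each.'''
    while True:
        event = inbox.get()
        if event is STOP:
            outbox.put(STOP)  # pass the shutdown signal downstream
            break
        # A real extractor would parse the PDF; this sketch fakes the text.
        outbox.put({'job_id': event['job_id'],
                    'text': 'text of ' + event['filename']})


def run_demo(filenames):
    '''Publish one event per file and collect what the next stage emits.'''
    uploaded = queue.Queue()    # plays the role of the pdf.uploaded queue
    text_ready = queue.Queue()  # plays the role of the text.ready queue
    worker = threading.Thread(target=extractor_worker,
                              args=(uploaded, text_ready))
    worker.start()
    for job_id, name in enumerate(filenames):
        # put() returns immediately; the producer never waits on extraction.
        uploaded.put({'job_id': job_id, 'filename': name})
    uploaded.put(STOP)
    results = []
    while True:
        event = text_ready.get()
        if event is STOP:
            break
        results.append(event)
    worker.join()
    return results
```

Calling run_demo(['a.pdf', 'b.pdf']) returns one text event per file. Scaling a stage means starting more workers on the same inbox, which is what the replica counts in the compose file later express for the real services.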
Core Architecture The pipeline orchestrates six specialized microservices through an event-driven workflow: Streamlit UI — Users upload PDFs, select analysis types, and view real-time progress without page refreshes. API Gateway (FastAPI) — Accepts HTTP uploads, generates job IDs, and returns immediately while processing happens asynchronously. PDF Ingestion — Validates files, extracts metadata, stores PDFs in MinIO object storage, and publishes to the pdf.uploaded queue. Text Extractor — Consumes upload events, extracts text with PyPDF2/pdfplumber, handles OCR fallbacks, and publishes to the text.ready queue. AI Analyzer (Groq) — Consumes text events, sends content to Groq's Llama 3.1 or Mixtral models for summarization/classification/Q&amp;A generation, and publishes to the analysis.done queue. Results Handler — Consumes analysis events, persists results to PostgreSQL, caches in Redis, and triggers webhooks for external integrations. Here is the complete architecture: ┌─────────────────────────────────────────────────────────────────────────────┐ │ PDF AI ANALYSIS PIPELINE ARCHITECTURE │ │ (Python Microservices + Docker + RabbitMQ + Groq) │ │ WITH STREAMLIT FRONTEND │ └─────────────────────────────────────────────────────────────────────────────┘ ┌──────────────────┐ │ STREAMLIT UI │ (Web Frontend) │ [Docker:8501] │ - File upload interface └────────┬─────────┘ - Real-time status dashboard │ - Results visualization │ HTTP POST/GET │ ▼ ┌─────────────────┐ │ API Gateway │ (FastAPI) │ [Docker:8000] │ - REST endpoints └────────┬────────┘ - Job management │ - SSE for real-time updates │ │ HTTP POST /analyze │ ▼ ┌────────────────────┐ │ PDF Ingestion │ (Python Service) │ Microservice │ - Validates PDF │ [Docker:8001] │ - Extracts metadata └─────────┬──────────┘ - Stores in MinIO │ │ Publish: pdf.uploaded │ ▼ ┌─────────────────────────────────────────────────────────────────────────┐ │ RABBITMQ MESSAGE BROKER │ │ [Docker:5672] │ │ ┌──────────────┐ ┌──────────────┐ 
┌──────────────┐ │ │ │ pdf.uploaded │ │ text.ready │ │analysis.done │ │ │ │ Queue │ │ Queue │ │ Queue │ │ │ └──────────────┘ └──────────────┘ └──────────────┘ │ └─────────────────────────────────────────────────────────────────────────┘ │ │ │ │ │ │ ┌──────▼──────┐ ┌───────▼────────┐ ┌─────▼──────┐ │ PDF Text │ │ AI Analysis │ │ Results │ │ Extractor │ │ Microservice │ │ Handler │ │ Service │ │ (Groq) │ │ Service │ │[Docker:8002]│ │ [Docker:8003] │ │[Docker:8004]│ └──────┬──────┘ └────────┬───────┘ └─────┬──────┘ │ │ │ │ - PyPDF2 │ - Groq API │ - Store results │ - pdfplumber │ - Llama 3 │ - PostgreSQL │ - OCR │ - Summarization │ - Redis cache │ │ - Classification │ - Webhooks │ │ │ │ Publish: │ Publish: │ │ text.ready │ analysis.done │ │ │ │ └────────────────────┴──────────────────┘ │ ▼ ┌──────────────────┐ │ Data Storage │ │ │ │ - PostgreSQL │ [Docker:5432] │ - MinIO/S3 │ [Docker:9000] │ - Redis Cache │ [Docker:6379] └──────────────────┘ Message Flow User uploads PDF via Streamlit UI → API Gateway Ingestion service validates → publishes to pdf.uploaded queue Text Extractor consumes → extracts text → publishes to text.ready queue AI Analyzer consumes → calls Groq API → publishes to analysis.done queue Results Handler consumes → stores results → notifies user Streamlit polls API Gateway → displays real-time progress → shows results Part 1: The Infrastructure We will start with the foundational layer: orchestrating services with Docker Compose and designing a database schema that supports job tracking, result storage, and caching. Folder Structure Treat this as a monorepo. 
Create the following directory tree: mkdir pdf-ai-pipeline cd pdf-ai-pipeline mkdir services database mkdir services/streamlit-ui services/api-gateway services/pdf-ingestion mkdir services/text-extractor services/ai-analyzer services/results-handler touch docker-compose.yml .env database/init.sql The Docker Compose File We need to orchestrate ten services: Streamlit UI — the frontend users interact with. API Gateway (FastAPI) — the HTTP entry point for uploads and queries. PDF Ingestion, Text Extractor, AI Analyzer, Results Handler — the processing pipeline. RabbitMQ — message broker for event-driven communication. PostgreSQL — persistent storage for jobs and results. Redis — fast caching layer for frequently accessed results. MinIO — S3-compatible object storage for raw PDFs. 
Open docker-compose.yml and add this configuration: version : '3.8' services : # Frontend streamlit-ui : build : ./services/streamlit-ui container_name : pdf_ui ports : - \"8501:8501\" environment : - API_URL=http://api-gateway:8000 networks : - pdf-net depends_on : - api-gateway # API Gateway api-gateway : build : ./services/api-gateway container_name : pdf_api ports : - \"8000:8000\" environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - REDIS_URL=redis://redis:6379 - POSTGRES_URL=postgresql://admin:secretpassword@postgres:5432/pdf_analysis - MINIO_URL=minio:9000 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy postgres : condition : service_healthy redis : condition : service_started # PDF Ingestion (2 replicas for load balancing) pdf-ingestion : build : ./services/pdf-ingestion environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - MINIO_URL=minio:9000 - MINIO_ACCESS_KEY=minioadmin - MINIO_SECRET_KEY=minioadmin networks : - pdf-net depends_on : rabbitmq : condition : service_healthy minio : condition : service_started deploy : replicas : 2 # Text Extractor (3 replicas - CPU intensive) text-extractor : build : ./services/text-extractor environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy deploy : replicas : 3 # AI Analyzer (2 replicas) ai-analyzer : build : ./services/ai-analyzer environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - GROQ_API_KEY=${GROQ_API_KEY} networks : - pdf-net depends_on : rabbitmq : condition : service_healthy deploy : replicas : 2 # Results Handler results-handler : build : ./services/results-handler environment : - RABBITMQ_URL=amqp://guest:guest@rabbitmq:5672 - POSTGRES_URL=postgresql://admin:secretpassword@postgres:5432/pdf_analysis - REDIS_URL=redis://redis:6379 networks : - pdf-net depends_on : rabbitmq : condition : service_healthy postgres : condition : service_healthy redis : condition : 
service_started # RabbitMQ with Management UI rabbitmq : image : rabbitmq:3.12-management container_name : pdf_queue ports : - \"5672:5672\" - \"15672:15672\" environment : - RABBITMQ_DEFAULT_USER=guest - RABBITMQ_DEFAULT_PASS=guest networks : - pdf-net healthcheck : test : [ \"CMD\" , \"rabbitmq-diagnostics\" , \"-q\" , \"ping\" ] interval : 10s timeout : 5s retries : 5 # PostgreSQL for results storage postgres : image : postgres:15 container_name : pdf_db environment : - POSTGRES_DB=pdf_analysis - POSTGRES_USER=admin - POSTGRES_PASSWORD=secretpassword ports : - \"5432:5432\" volumes : - postgres_data:/var/lib/postgresql/data - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql networks : - pdf-net healthcheck : test : [ \"CMD-SHELL\" , \"pg_isready -U admin -d pdf_analysis\" ] interval : 10s timeout : 5s retries : 5 # Redis for caching redis : image : redis:7-alpine container_name : pdf_cache ports : - \"6379:6379\" networks : - pdf-net # MinIO (S3-compatible storage) minio : image : minio/minio container_name : pdf_storage ports : - \"9000:9000\" - \"9001:9001\" environment : - MINIO_ROOT_USER=minioadmin - MINIO_ROOT_PASSWORD=minioadmin command : server /data --console-address \":9001\" volumes : - minio_data:/data networks : - pdf-net volumes : postgres_data : minio_data : networks : pdf-net : driver : bridge Environment Variables Create a .env file to manage secrets: # Groq API Key (get one free at console.groq.com) GROQ_API_KEY=your_groq_api_key_here Designing the Schema We need to track job lifecycle, store analysis results, and cache frequently accessed data. Open database/init.sql : -- Jobs Table: Track processing lifecycle CREATE TABLE IF NOT EXISTS jobs ( id UUID PRIMARY KEY DEFAULT gen_random_uuid(), filename TEXT NOT NULL, status TEXT DEFAULT 'pending', -- pending, extracting, analyzing, completed, failed analysis_type TEXT, -- summary, classification, qa_generation, full model TEXT, -- llama-3.1-70b, mixtral-8x7b created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); -- Results Table: Store analysis output CREATE TABLE IF NOT EXISTS results ( id SERIAL PRIMARY KEY, job_id UUID REFERENCES jobs(id) ON DELETE CASCADE, result_data JSONB NOT NULL, -- Flexible storage for any AI output confidence_score FLOAT, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ); -- Create indexes for fast queries CREATE INDEX idx_jobs_status ON jobs(status); CREATE INDEX idx_jobs_created_at ON jobs(created_at DESC); CREATE INDEX idx_results_job_id ON results(job_id); -- GIN index for JSONB containment queries on results CREATE INDEX idx_results_data_gin ON 
results USING gin(result_data jsonb_path_ops); Booting Up Launch the infrastructure: # Start all services docker-compose up -d # Verify services are healthy docker-compose ps # View RabbitMQ Management UI at http://localhost:15672 (guest/guest) # View MinIO Console at http://localhost:9001 (minioadmin/minioadmin) Part 2: The Backend Services With infrastructure running, we will now build the four processing microservices that power the pipeline. Service 1: API Gateway (FastAPI) This service accepts HTTP uploads and immediately returns a job ID, enabling asynchronous processing. Create services/api-gateway/requirements.txt : fastapi uvicorn pika psycopg2-binary redis python-multipart python-dotenv The code ( services/api-gateway/main.py ): from fastapi import FastAPI, UploadFile, File, HTTPException from fastapi.responses import StreamingResponse import pika import psycopg2 import redis import json import os from uuid import uuid4 app = FastAPI(title=\"PDF Analysis API\") # Connect to Infrastructure RABBITMQ_URL = os.environ.get(\"RABBITMQ_URL\") POSTGRES_URL = os.environ.get(\"POSTGRES_URL\") REDIS_URL = os.environ.get(\"REDIS_URL\") redis_client = redis.from_url(REDIS_URL) def get_db(): return psycopg2.connect(POSTGRES_URL) def publish_event(queue_name, message): connection = pika.BlockingConnection(pika.URLParameters(RABBITMQ_URL)) channel = connection.channel() channel.queue_declare(queue=queue_name, durable=True) channel.basic_publish(exchange='', routing_key=queue_name, body=json.dumps(message), properties=pika.BasicProperties(delivery_mode=2)) connection.close() @app.post(\"/analyze\") async def analyze_pdf(file: UploadFile = File(...
), analysis_type: str = \"summary\", model: str = \"llama-3.1-70b\"): job_id = str(uuid4()) # Store job in database conn = get_db() cur = conn.cursor() cur.execute(\"INSERT INTO jobs (id, filename, status, analysis_type, model) VALUES (%s, %s, 'pending', %s, %s)\", (job_id, file.filename, analysis_type, model)) conn.commit() conn.close() # Save file temporarily and publish to queue file_path = f\"/tmp/{job_id}.pdf\" with open(file_path, \"wb\") as f: f.write(await file.read()) publish_event(\"pdf.uploaded\", {\"job_id\": job_id, \"file_path\": file_path, \"filename\": file.filename, \"analysis_type\": analysis_type, \"model\": model}) return {\"job_id\": job_id, \"status\": \"processing\"} @app.get(\"/status/{job_id}\") def get_status(job_id: str): # Check cache first cached = redis_client.get(f\"job:{job_id}\") if cached: return json.loads(cached) # Query database conn = get_db() cur = conn.cursor() cur.execute(\"SELECT status, filename FROM jobs WHERE id = %s\", (job_id,)) result = cur.fetchone() conn.close() if not result: raise HTTPException(status_code=404, detail=\"Job not found\") status_data = {\"job_id\": job_id, \"status\": result[0], \"filename\": result[1]} redis_client.setex(f\"job:{job_id}\", 60, json.dumps(status_data)) return status_data @app.get(\"/results/{job_id}\") def get_results(job_id: str): conn = get_db() cur = conn.cursor() cur.execute(\"SELECT result_data FROM results WHERE job_id = %s\", (job_id,)) result = cur.fetchone() conn.close() if not result: raise HTTPException(status_code=404, detail=\"Results not found\") return result[0] @app.get(\"/metrics\") def get_metrics(): conn = get_db() cur = conn.cursor() cur.execute(\"SELECT status, COUNT(*) FROM jobs GROUP BY status\") metrics = dict(cur.fetchall()) conn.close() return metrics if __name__ == \"__main__\": import uvicorn uvicorn.run(app, host=\"0.0.0.0\", port=8000) The Dockerfile ( services/api-gateway/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"main.py\"] Service 2: PDF Ingestion This service validates PDFs, extracts metadata, and stores files in MinIO. 
Create services/pdf-ingestion/requirements.txt : pika PyPDF2 minio python-dotenv The code ( services/pdf-ingestion/worker.py ): import pika import json import os from minio import Minio from PyPDF2 import PdfReader RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) MINIO_URL = os.environ.get( \"MINIO_URL\" ) MINIO_ACCESS = os.environ.get( \"MINIO_ACCESS_KEY\" ) MINIO_SECRET = os.environ.get( \"MINIO_SECRET_KEY\" ) minio_client = Minio( MINIO_URL , access_key = MINIO_ACCESS , secret_key = MINIO_SECRET , secure = False ) # Ensure bucket exists if not minio_client.bucket_exists( \"pdfs\" ): minio_client.make_bucket( \"pdfs\" ) def process_upload (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] file_path = data[ 'file_path' ] try : # Validate PDF reader = PdfReader(file_path) page_count = len (reader.pages) # Store in MinIO minio_client.fput_object( \"pdfs\" , f \" { job_id } .pdf\" , file_path) # Publish to next queue connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.extraction\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"text.extraction\" , body = json.dumps({ ** data, \"page_count\" : page_count, \"minio_path\" : f \"pdfs/ { job_id } .pdf\" }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() os.remove(file_path) ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Processed: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"pdf.uploaded\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"pdf.uploaded\" , on_message_callback = process_upload) print ( \"PDF Ingestion Service Started...\" ) 
channel.start_consuming() The Dockerfile ( services/pdf-ingestion/Dockerfile ): FROM python:3.9-slim 
WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 3: Text Extractor This service extracts text from PDFs with PyPDF2 and falls back to pdfplumber when PyPDF2 recovers little or no text. The OCR dependencies (pytesseract, pdf2image) are listed in requirements.txt, but the worker below does not call them. Create services/text-extractor/requirements.txt : pika PyPDF2 pdfplumber pytesseract pdf2image python-dotenv The code ( services/text-extractor/worker.py ): import pika import json import os from PyPDF2 import PdfReader import pdfplumber RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) def extract_text (file_path): text = \"\" # Try PyPDF2 first try : reader = PdfReader(file_path) for page in reader.pages: text += page.extract_text() or \"\" except Exception : pass # Fallback to pdfplumber if PyPDF2 recovers fewer than 100 characters if len (text.strip()) < 100 : with pdfplumber.open(file_path) as pdf: for page in pdf.pages: text += page.extract_text() or \"\" return text.strip() def process_extraction (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : # Download from MinIO (simplified - assume local for demo) file_path = f \"/tmp/ { job_id } .pdf\" text = extract_text(file_path) # Publish to analysis queue connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.ready\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"text.ready\" , body = json.dumps({ ** data, \"extracted_text\" : text }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Extracted text from: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.extraction\" , durable = True ) channel.basic_qos( 
prefetch_count = 1 ) channel.basic_consume( queue = \"text.extraction\" , on_message_callback = process_extraction) print ( \"Text Extractor Service Started...\" ) channel.start_consuming() The Dockerfile ( 
services/text-extractor/Dockerfile ): FROM python:3.9-slim RUN apt-get update && apt-get install -y tesseract-ocr && rm -rf /var/lib/apt/lists/* WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 4: AI Analyzer (Groq) This is where Groq delivers high-speed inference for document analysis. Create services/ai-analyzer/requirements.txt : pika groq python-dotenv The code ( services/ai-analyzer/worker.py ): import pika import json import os from groq import Groq RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) GROQ_API_KEY = os.environ.get( \"GROQ_API_KEY\" ) client = Groq( api_key = GROQ_API_KEY ) def analyze_text (text, analysis_type, model): prompts = { \"summary\" : \"Provide a concise summary of this document in 3-5 bullet points.\" , \"classification\" : \"Classify this document by type and main topics.\" , \"qa_generation\" : \"Generate 5 question-answer pairs from this document.\" , \"full\" : \"Provide a comprehensive analysis including summary, key entities, and main themes.\" } prompt = prompts.get(analysis_type, prompts[ \"summary\" ]) response = client.chat.completions.create( model = model, messages = [ { \"role\" : \"system\" , \"content\" : \"You are a helpful document analysis assistant.\" }, { \"role\" : \"user\" , \"content\" : f \" { prompt }\\n\\n Document: \\n{ text[: 8000 ] } \" } ], temperature = 0.3 ) return response.choices[ 0 ].message.content def process_analysis (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : result = analyze_text( data[ 'extracted_text' ], data[ 'analysis_type' ], data[ 'model' ] ) # Publish results connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"analysis.done\" , durable = True ) channel.basic_publish( exchange = '' , routing_key = \"analysis.done\" , body = json.dumps({ 
\"job_id\" : job_id, \"result\" : result }), properties = pika.BasicProperties( delivery_mode = 2 ) ) connection.close() ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Analyzed: { data[ 'filename' ] } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"text.ready\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"text.ready\" , on_message_callback = process_analysis) print ( \"AI Analyzer Service Started...\" ) channel.start_consuming() The Dockerfile ( services/ai-analyzer/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Service 5: Results Handler The final service persists results to PostgreSQL and invalidates the cached job status in Redis so the next status request reads fresh data. 
Create services/results-handler/requirements.txt : pika psycopg2-binary redis python-dotenv The code ( services/results-handler/worker.py ): import pika import json import os import psycopg2 import redis RABBITMQ_URL = os.environ.get( \"RABBITMQ_URL\" ) POSTGRES_URL = os.environ.get( \"POSTGRES_URL\" ) REDIS_URL = os.environ.get( \"REDIS_URL\" ) redis_client = redis.from_url( REDIS_URL ) def get_db (): return psycopg2.connect( POSTGRES_URL ) def store_results (ch, method, properties, body): data = json.loads(body) job_id = data[ 'job_id' ] try : conn = get_db() cur = conn.cursor() # Store result cur.execute( \"INSERT INTO results (job_id, result_data) VALUES ( %s , %s )\" , (job_id, json.dumps({ \"analysis\" : data[ 'result' ]})) ) # Update job status cur.execute( \"UPDATE jobs SET status = 'completed', updated_at = CURRENT_TIMESTAMP WHERE id = %s \" , (job_id,) ) conn.commit() conn.close() # Invalidate cache redis_client.delete( f \"job: { job_id } \" ) ch.basic_ack( delivery_tag = method.delivery_tag) print ( f \"Stored results for: { job_id } \" ) except Exception as e: print ( f \"Error: { e } \" ) ch.basic_nack( delivery_tag = method.delivery_tag, requeue = False ) # Start Consumer connection = pika.BlockingConnection(pika.URLParameters( RABBITMQ_URL )) channel = connection.channel() channel.queue_declare( queue = \"analysis.done\" , durable = True ) channel.basic_qos( prefetch_count = 1 ) channel.basic_consume( queue = \"analysis.done\" , on_message_callback = store_results) print ( \"Results Handler Service Started...\" ) channel.start_consuming() 
The Dockerfile ( services/results-handler/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"worker.py\"] Part 3: The Streamlit Frontend The UI provides an intuitive interface for uploads, real-time monitoring, and result visualization. 
Create services/streamlit-ui/requirements.txt : streamlit requests python-dotenv The code ( services/streamlit-ui/app.py ): import streamlit as st import requests import time import json API_URL = \"http://api-gateway:8000\" st.set_page_config( page_title = \"PDF AI Analysis\" , layout = \"wide\" ) st.title( \"PDF AI Analysis Pipeline\" ) tab1, tab2, tab3 = st.tabs([ \"Upload & Analyze\" , \"Dashboard\" , \"History\" ]) # TAB 1: Upload with tab1: uploaded_file = st.file_uploader( \"Upload PDF\" , type = [ \"pdf\" ]) col1, col2 = st.columns( 2 ) with col1: analysis_type = st.selectbox( \"Analysis Type\" , [ \"summary\" , \"classification\" , \"qa_generation\" , \"full\" ] ) with col2: model = st.selectbox( \"Model\" , [ \"llama-3.1-70b\" , \"mixtral-8x7b\" ]) if st.button( \"Start Analysis\" ) and uploaded_file: with st.spinner( \"Uploading...\" ): files = { \"file\" : uploaded_file} # The gateway declares analysis_type and model as query parameters params = { \"analysis_type\" : analysis_type, \"model\" : model} response = requests.post( f \" {API_URL} /analyze\" , files = files, params = params) job_id = response.json()[ \"job_id\" ] st.success( f \"Job started: { job_id } \" ) # Poll for status progress_bar = st.progress( 0 ) status_text = st.empty() while True : status_response = requests.get( f \" {API_URL} /status/ { job_id } \" ) status = status_response.json()[ \"status\" ] status_text.text( f \"Status: { status } \" ) if status == \"completed\" : progress_bar.progress( 100 ) results = requests.get( f \" {API_URL} /results/ { job_id } \" ).json() st.json(results) break elif status == \"failed\" : st.error( \"Analysis failed\" ) break progress_bar.progress( 50 ) time.sleep( 2 ) # TAB 2: Dashboard with tab2: metrics = requests.get( f \" {API_URL} /metrics\" ).json() col1, col2, col3 = st.columns( 3 ) col1.metric( \"Total Jobs\" , sum (metrics.values())) col2.metric( \"Completed\" , metrics.get( \"completed\" , 0 )) col3.metric( \"Failed\" , metrics.get( \"failed\" , 0 )) # TAB 3: History with tab3: st.write( \"Coming soon: Job 
history and search\" ) The Dockerfile ( services/streamlit-ui/Dockerfile ): FROM python:3.9-slim 
WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 8501 CMD [\"streamlit\", \"run\", \"app.py\", \"--server.port=8501\", \"--server.address=0.0.0.0\"] Final Integration: Launch Day Running the Stack Build and run: docker-compose up -d --build Access the UI: open http://localhost:8501 Upload a PDF: select analysis type and model, then watch real-time processing Monitor queues: visit RabbitMQ Management at http://localhost:15672 You have just built an end-to-end PDF analysis pipeline. The system scales horizontally, acknowledges each RabbitMQ message only after it has been processed successfully so that a worker crash does not lose a job, and relies on Groq's hosted inference API for fast document analysis. Scalability: add replicas to any service independently based on bottlenecks. Speed: Groq's low-latency inference keeps per-document processing time short. User experience: Streamlit provides immediate feedback while processing happens asynchronously in the background. This architecture demonstrates the power of event-driven microservices combined with hosted AI inference. Each service owns a single responsibility, communicates through well-defined message contracts, and can be developed and deployed independently by different teams. Happy coding!"
  },
  {
    "slug": "2026-01-17-building-an-ai-driven-youtube-index",
    "title": "Building a Private, AI-Driven YouTube Knowledge Base",
    "description": "Turn your YouTube subscriptions into a searchable, private RAG engine — autonomous ingestion with yt-dlp, transcription with faster-whisper, embeddings via Ollama and pgvector, and a Streamlit/LangChain chat UI.",
    "tags": [
      "ai",
      "youtube",
      "streamlit",
      "langchain",
      "rag",
      "ollama"
    ],
    "excerpt": "For IT professionals and developers, YouTube has evolved from an entertainment platform into a primary source of continuous education. We rely on it for everything from architectural patterns and cloud infrastructure tutorials to debugging sessions a",
    "content": "For IT professionals and developers, YouTube has evolved from an entertainment platform into a primary source of continuous education. We rely on it for everything from architectural patterns and cloud infrastructure tutorials to debugging sessions and conference talks. However, video is inherently opaque data. Unlike documentation or code repositories, you cannot &quot;Ctrl+F&quot; your way through thousands of hours of video history to find that one specific explanation of a concept you watched six months ago. We face a significant gap between content consumption and knowledge retention. The Problem: The Unsearchable Archive As we accumulate subscriptions, we build a massive library of potential knowledge that remains largely inaccessible. The challenges are structural: The &quot;Black Box&quot; of Video — Valuable technical insights are often buried deep within long-form content, invisible to standard metadata searches. Fragmentation — Knowledge is siloed across hundreds of channels with no unified way to cross-reference topics (e.g., comparing how three different channels handle Kubernetes networking). Ephemeral Recall — We watch a solution once, but without a text-based index, retrieving that solution during a future incident is nearly impossible. The Solution: A Private RAG Engine In this guide, we are going to build a solution to shift from passive consumption to active conversation. We will build a Retrieval Augmented Generation (RAG) system that treats your YouTube subscriptions as a private dataset. By leveraging LangChain and Ollama locally, we can create a system that lets you chat with your video history. You can ask, &quot;How does NetworkChuck explain VLANs?&quot; and the system will not only find the video but synthesize an answer based on the transcript. Core Architecture To turn this concept into reality, we will adopt a microservices approach using Docker. 
At a high level, the pipeline involves five stages: Ingestion: A service autonomously monitors your subscriptions for new content. Transcription: Using Whisper, the system converts unstructured audio into timestamped text. Indexing: The system chunks transcripts and processes them through an embedding model ( nomic-embed-text ), storing the vectors in PostgreSQL. Retrieval: Your questions are converted into vectors to find relevant transcript segments. Synthesis: Llama 3 reads the retrieved context and generates a precise answer, citing specific video timestamps. Here is the architecture we will build: [Architecture diagram. Components inside the Docker network yt-net: Streamlit UI (LangChain client, ports 8501:8501), Ingestion Service (The Watcher; yt-dlp/RSS), RabbitMQ (message broker; management UI on ports 15672:15672), Processing Worker (The Brain; faster-whisper, ffmpeg, yt-dlp), Ollama (nomic-embed-text, llama3), and PostgreSQL (pgvector; videos and subscriptions). Numbered flows: (1) user visits the UI in a browser on host port 8501; (2) search vectors; (3) download audio from YouTube with yt-dlp; (4) push job (AMQP); (5) generate query embedding (HTTP); (6) pull job (AMQP); (7) generate embeddings (HTTP); (8) store data (SQL); (9) chat with data.] 
Part 1: The Infrastructure We will start by defining the \"plumbing\" of our system using Docker Compose and designing our PostgreSQL database schema to handle vector embeddings. Folder Structure Treat this project as a monorepo. Open your terminal and create the following structure: mkdir yt-rag-engine cd yt-rag-engine mkdir database touch docker-compose.yml .env database/init.sql The Docker Compose File We need to orchestrate three core services: PostgreSQL (with pgvector): to store our data and embeddings. RabbitMQ: to manage our background processing queues. Ollama: to run our local LLMs (Llama 3 and Nomic Embed). Open docker-compose.yml and add the following configuration: version : '3.8' services : # 1. The Database (Postgres + pgvector) postgres : image : pgvector/pgvector:pg16 container_name : yt_db environment : POSTGRES_USER : ${DB_USER} POSTGRES_PASSWORD : ${DB_PASS} POSTGRES_DB : ${DB_NAME} ports : - \"5432:5432\" volumes : - postgres_data:/var/lib/postgresql/data - ./database/init.sql:/docker-entrypoint-initdb.d/init.sql networks : - yt-net healthcheck : test : [ \"CMD-SHELL\" , \"pg_isready -U ${DB_USER} -d ${DB_NAME}\" ] interval : 10s timeout : 5s retries : 5 # 2. The Message Broker (RabbitMQ) rabbitmq : image : rabbitmq:3-management container_name : yt_queue ports : - \"5672:5672\" # AMQP protocol - \"15672:15672\" # Management UI environment : RABBITMQ_DEFAULT_USER : ${RABBIT_USER} RABBITMQ_DEFAULT_PASS : ${RABBIT_PASS} networks : - yt-net healthcheck : test : [ \"CMD\" , \"rabbitmq-diagnostics\" , \"-q\" , \"ping\" ] interval : 10s timeout : 5s retries : 5 # 3. 
The AI Server (Ollama) ollama : image : ollama/ollama:latest container_name : yt_ai ports : - \"11434:11434\" volumes : - ollama_models:/root/.ollama networks : - yt-net # Uncomment below to enable GPU support (Nvidia) # deploy: # resources: # reservations: # devices: # - driver: nvidia # count: 1 # capabilities: [gpu] volumes : postgres_data : ollama_models : networks : yt-net : driver : bridge 
Environment Variables Create a .env file to keep secrets safe: # Database Credentials DB_USER=admin DB_PASS=secretpassword DB_NAME=yt_knowledge_base # RabbitMQ Credentials RABBIT_USER=guest RABBIT_PASS=guest Designing the Schema (pgvector) We need to tell PostgreSQL how to structure our data. The most critical part is enabling the vector extension and defining the embedding column. We are using nomic-embed-text via Ollama, which outputs vectors with 768 dimensions. Open database/init.sql and add this SQL script: -- 1. Enable the pgvector extension CREATE EXTENSION IF NOT EXISTS vector; -- 2. Channels Table: Who are we watching? CREATE TABLE IF NOT EXISTS channels ( id TEXT PRIMARY KEY, -- YouTube Channel ID (e.g., UC123...) name TEXT NOT NULL, url TEXT NOT NULL, last_checked_at TIMESTAMP DEFAULT '1970-01-01' ); -- 3. Videos Table: Metadata for individual videos CREATE TABLE IF NOT EXISTS videos ( id TEXT PRIMARY KEY, -- YouTube Video ID (e.g., dQw4w9WgXcQ) channel_id TEXT REFERENCES channels(id), title TEXT NOT NULL, url TEXT NOT NULL, published_at TIMESTAMP, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP, status TEXT DEFAULT 'pending' -- pending, processing, completed, error ); -- 4. 
Transcripts Table: The searchable content CREATE TABLE IF NOT EXISTS transcript_chunks ( id SERIAL PRIMARY KEY, video_id TEXT REFERENCES videos(id) ON DELETE CASCADE, -- The actual text content (for RAG context) chunk_text TEXT NOT NULL, -- Timestamps for deep-linking start_time DOUBLE PRECISION, end_time DOUBLE PRECISION, -- The AI \"Brain\" Part -- 768 dimensions matches nomic-embed-text embedding vector(768) ); -- 5. Create a search index for speed (HNSW algorithm) CREATE INDEX ON transcript_chunks USING hnsw (embedding vector_cosine_ops); Booting Up and Priming Models Before writing code, let's bring up the infrastructure and download the AI models. Start Docker: docker-compose up -d Pull models. Ollama starts empty. Execute these commands to pull the models into the persistent volume: # Pull the Chat Model (for RAG synthesis) docker exec -it yt_ai ollama pull llama3 # Pull the Embedding Model (for Vectorizing) docker exec -it yt_ai ollama pull nomic-embed-text Part 2: The Backend Engine With the infrastructure running, we will now build the two Python services that power the system: the Ingestion Service (Discovery) and the Processing Worker (Analysis). Service 1: The Ingestion Service This service checks RSS feeds and creates \"Job Tickets\" in RabbitMQ. 
Create a folder ingestion_service with a requirements.txt : pika psycopg2-binary feedparser python-dotenv The code ( ingestion_service/main.py ): import time import feedparser import pika import json import psycopg2 import os from datetime import datetime # Connect to Infrastructure DB_HOST = \"postgres\" RABBIT_HOST = \"rabbitmq\" QUEUE_NAME = \"transcription_queue\" def get_db_connection(): return psycopg2.connect(host=DB_HOST, database=os.environ.get(\"DB_NAME\"), user=os.environ.get(\"DB_USER\"), password=os.environ.get(\"DB_PASS\")) def publish_to_queue(video_data): connection = pika.BlockingConnection(pika.ConnectionParameters(host=RABBIT_HOST)) channel = connection.channel() channel.queue_declare(queue=QUEUE_NAME, durable=True) channel.basic_publish(exchange='', routing_key=QUEUE_NAME, body=json.dumps(video_data), properties=pika.BasicProperties(delivery_mode=2) # Make message persistent ) connection.close() def check_feeds(): conn = get_db_connection() cur = conn.cursor() # 1. Get all monitored channels cur.execute(\"SELECT id, url FROM channels\") channels = cur.fetchall() for channel_id, channel_url in channels: # YouTube RSS URL format rss_url = f\"https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}\" feed = feedparser.parse(rss_url) for entry in feed.entries: video_id = entry.yt_videoid # 2. Check if we already have this video cur.execute(\"SELECT 1 FROM videos WHERE id = %s\", (video_id,)) if cur.fetchone() is None: print(f\"Found new video: {entry.title}\") # 3. Add to DB as 'pending', storing the feed's publish date (UTC) rather than the discovery time published_at = datetime(*entry.published_parsed[:6]) cur.execute(\"INSERT INTO videos (id, channel_id, title, url, published_at, status) VALUES (%s, %s, %s, %s, %s, 'pending')\", (video_id, channel_id, entry.title, entry.link, published_at)) conn.commit() # 4. Push to RabbitMQ publish_to_queue({\"video_id\": video_id, \"url\": entry.link, \"title\": entry.title}) conn.close() if __name__ == \"__main__\": print(\"Ingestion Service Started...\") while True: try: check_feeds() except Exception as e: print(f\"Error: {e}\") time.sleep(3600) # Sleep for 1 hour The Dockerfile ( ingestion_service/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . CMD [\"python\", \"main.py\"] Service 2: The Processing Worker This worker downloads each video's audio, transcribes it, and embeds the transcript. Create a folder processing_worker with a requirements.txt : pika psycopg2-binary yt-dlp faster-whisper requests python-dotenv The code ( processing_worker/worker.py ): import pika import json import os import psycopg2 import requests import yt_dlp from faster_whisper import WhisperModel # Config OLLAMA_API = \"http://ollama:11434/api/embeddings\" MODEL_NAME = \"nomic-embed-text\" TEMP_DIR = \"/app/temp\" # Initialize Whisper on CPU; change device to \"cuda\" if a GPU is passed to Docker model = WhisperModel(\"tiny\", device=\"cpu\", compute_type=\"int8\") def download_audio(video_url, video_id): \"\"\"Downloads audio using yt-dlp to a temp file\"\"\" output_path = f\"{TEMP_DIR}/{video_id}\" ydl_opts = {'format': 'bestaudio/best', 'outtmpl': output_path, 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'}], 'quiet': True} with yt_dlp.YoutubeDL(ydl_opts) as ydl: ydl.download([video_url]) return f\"{output_path}.mp3\" def get_embedding(text): \"\"\"Calls Ollama to get vector embedding\"\"\" response = requests.post(OLLAMA_API, json={\"model\": MODEL_NAME, \"prompt\": text}) return response.json()['embedding'] def process_video(ch, method, properties, body): data = json.loads(body) video_id = data['video_id'] print(f\"Processing: {data['title']}\") try: # 1. 
Download Audio audio_path = download_audio(data['url'], video_id) # 2. Transcribe segments, _ = model.transcribe(audio_path) conn = psycopg2.connect(host=\"postgres\", database=os.environ.get(\"DB_NAME\"), user=os.environ.get(\"DB_USER\"), password=os.environ.get(\"DB_PASS\")) cur = conn.cursor() # 3. Chunk & Embed chunk_buffer = \"\" start_time = 0.0 end_time = 0.0 for segment in segments: chunk_buffer += segment.text + \" \" end_time = segment.end # Create a chunk roughly every 500 characters if len(chunk_buffer) > 500: vector = get_embedding(chunk_buffer) cur.execute(\"\"\"INSERT INTO transcript_chunks (video_id, chunk_text, start_time, end_time, embedding) VALUES (%s, %s, %s, %s, %s)\"\"\", (video_id, chunk_buffer, start_time, end_time, vector)) chunk_buffer = \"\" start_time = end_time # After the loop: flush the final partial chunk so the end of the transcript is not dropped if chunk_buffer.strip(): vector = get_embedding(chunk_buffer) cur.execute(\"\"\"INSERT INTO transcript_chunks (video_id, chunk_text, start_time, end_time, embedding) VALUES (%s, %s, %s, %s, %s)\"\"\", (video_id, chunk_buffer, start_time, end_time, vector)) # 4. Mark Complete cur.execute(\"UPDATE videos SET status = 'completed' WHERE id = %s\", (video_id,)) conn.commit() conn.close() os.remove(audio_path) print(f\"Done: {data['title']}\") ch.basic_ack(delivery_tag=method.delivery_tag) except Exception as e: print(f\"Error processing {video_id}: {e}\") ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False) # Start Consumer connection = pika.BlockingConnection(pika.ConnectionParameters(\"rabbitmq\")) channel = connection.channel() channel.queue_declare(queue=\"transcription_queue\", durable=True) channel.basic_qos(prefetch_count=1) channel.basic_consume(queue=\"transcription_queue\", on_message_callback=process_video) print(\"Processing Worker Started...\") channel.start_consuming() The Dockerfile ( processing_worker/Dockerfile ). It installs ffmpeg, which the FFmpegExtractAudio postprocessor in yt-dlp needs for audio extraction: FROM python:3.9-slim # Install ffmpeg RUN apt-get update && apt-get install -y ffmpeg && rm -rf /var/lib/apt/lists/* WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt # Create temp directory RUN mkdir -p /app/temp COPY . . CMD [\"python\", \"worker.py\"] Part 3: The Control Center (Streamlit) Finally, we need a UI to manage subscriptions and, most importantly, chat with the data. 
Create a folder streamlit_app with a requirements.txt : streamlit langchain-community langchain-core langchain-ollama psycopg2-binary yt-dlp python-dotenv The code ( streamlit_app/app.py ): import streamlit as st import psycopg2 import os import yt_dlp from langchain_ollama import ChatOllama from langchain_core.prompts import ChatPromptTemplate from langchain_core.output_parsers import StrOutputParser # Config DB_HOST = \"postgres\" DB_NAME = os.environ.get(\"DB_NAME\") DB_USER = os.environ.get(\"DB_USER\") DB_PASS = os.environ.get(\"DB_PASS\") OLLAMA_URL = \"http://ollama:11434\" st.set_page_config(page_title=\"YouTube Knowledge Base\", layout=\"wide\") st.title(\"AI YouTube Knowledge Base\") # --- DATABASE FUNCTIONS --- def get_db_connection(): return psycopg2.connect(host=DB_HOST, database=DB_NAME, user=DB_USER, password=DB_PASS) def add_channel(url): ydl_opts = {'quiet': True, 'extract_flat': True, 'playlist_end': 0} with yt_dlp.YoutubeDL(ydl_opts) as ydl: try: info = ydl.extract_info(url, download=False) channel_id = info.get('channel_id') name = info.get('uploader') or info.get('channel') conn = get_db_connection() cur = conn.cursor() cur.execute(\"INSERT INTO channels (id, name, url) VALUES (%s, %s, %s) ON CONFLICT (id) DO NOTHING\", (channel_id, name, url)) conn.commit() conn.close() return f\"Success: Added {name}\" except Exception as e: return f\"Error: {str(e)}\" def get_context(query_text): \"\"\"Semantic Search: Vector -> SQL Cosine Similarity\"\"\" from langchain_ollama import OllamaEmbeddings embeddings = OllamaEmbeddings(base_url=OLLAMA_URL, model=\"nomic-embed-text\") query_vector = embeddings.embed_query(query_text) conn = get_db_connection() cur = conn.cursor() cur.execute(\"\"\" SELECT t.chunk_text, v.title, v.url, t.start_time FROM transcript_chunks t JOIN videos v ON t.video_id = v.id ORDER BY t.embedding <=> %s::vector LIMIT 5 \"\"\", (str(query_vector),)) results = cur.fetchall() conn.close() return results # --- UI LAYOUT --- tab1, tab2 = st.tabs([\"Chat with Knowledge\", \"Manage Subscriptions\"]) # TAB 1: RAG CHAT with tab1: user_query = st.text_input(\"Ask a question about your videos:\") if st.button(\"Ask AI\") and user_query: with st.spinner(\"Thinking...\"): results = get_context(user_query) if not results: st.warning(\"No relevant info found in database.\") else: context_text = \"\" for i, (text, title, url, start) in enumerate(results): context_text += f\"\\n[Source {i + 1}]: {text} (From '{title}')\\n\" # RAG Synthesis llm = ChatOllama(base_url=OLLAMA_URL, model=\"llama3\") prompt = ChatPromptTemplate.from_template(\"\"\" You are a helpful AI assistant. Answer the user's question based ONLY on the following context. If the answer is not in the context, say \"I don't know\". Context: {context} Question: {question} \"\"\") chain = prompt | llm | StrOutputParser() response = chain.invoke({\"context\": context_text, \"question\": user_query}) st.markdown(\"### AI Answer\") st.write(response) st.markdown(\"---\") st.subheader(\"Reference Clips\") for text, title, url, start in results: video_link = f\"{url}&t={int(start)}s\" st.markdown(f\"**[{title}]({video_link})**\") st.caption(f\"...{text}...\") # TAB 2: MANAGE with tab2: st.header(\"Add New Channel\") new_url = st.text_input(\"Paste Channel URL\") if st.button(\"Subscribe\"): with st.spinner(\"Resolving Channel...\"): msg = add_channel(new_url) st.write(msg) st.header(\"Active Subscriptions\") conn = get_db_connection() cur = conn.cursor() cur.execute(\"SELECT name, url, last_checked_at FROM channels\") rows = cur.fetchall() for row in rows: st.write(f\"**[{row[0]}]({row[1]})** - Last Checked: {row[2]}\") conn.close() The Dockerfile ( streamlit_app/Dockerfile ): FROM python:3.9-slim WORKDIR /app COPY requirements.txt . RUN pip install --no-cache-dir -r requirements.txt COPY . . EXPOSE 8501 CMD [\"streamlit\", \"run\", \"app.py\", \"--server.port=8501\", \"--server.address=0.0.0.0\"] Final Integration: Launch Day We need to update our docker-compose.yml to include our new Python services. Add the following to the services: block: ingestion : build : ./ingestion_service container_name : yt_ingestion environment : - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks : - yt-net depends_on : postgres : condition : service_healthy rabbitmq : condition : service_healthy worker : build : ./processing_worker container_name : yt_worker environment : - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks : - yt-net depends_on : postgres : condition : service_healthy rabbitmq : condition : service_healthy ollama : condition : service_started streamlit : build : ./streamlit_app container_name : yt_ui ports : - \"8501:8501\" environment : - DB_HOST=postgres - DB_NAME=${DB_NAME} - DB_USER=${DB_USER} - DB_PASS=${DB_PASS} networks : - yt-net depends_on : postgres : condition : service_healthy ollama : condition : service_started Running the Stack Build and run: docker-compose up -d --build Access the app: open your browser to http://localhost:8501 . Add a channel: go to \"Manage Subscriptions\" and add a URL like https://www.youtube.com/@Fireship . Watch it work: the ingestion service polls feeds once an hour, so a newly added channel is picked up on the next cycle (run docker restart yt_ingestion to trigger a check immediately). It then queues the latest videos, and the worker begins transcribing them (view logs with docker logs -f yt_worker ). Chat: once processing is complete, go to the \"Chat with Knowledge\" tab and ask: \"What is the latest JavaScript framework mentioned?\" You have just built a private, AI-powered knowledge engine. Privacy: transcription, embedding, and chat all run on your machine; only the RSS checks and audio downloads contact YouTube. Cost: no OpenAI API keys and no SaaS subscriptions, just local compute. Utility: you have turned a passive stream of entertainment into a searchable database of answers. This project shows how far local LLMs can take you. You didn't just write a script; you built a pipeline that discovers videos, transcribes them, and answers questions about them. The database is yours, so build what you need. Happy coding!"
  }
]