Capstone: Build a Chat-With-Your-Notes AI App
Capstone project: build a 'chat with your notes' AI app in Python with RAG, retrieval, and citations end to end, then where to go next.

Eleven lessons of parts. Time to build the machine. By the end of this you'll have one runnable Python script, a command-line assistant that reads a folder of your own notes and answers questions about them, with the source file printed next to every answer. Drop your meeting notes, study cards, or a project README into a folder, ask "what did we decide about pricing?", and get back an answer grounded in your text instead of the model's best guess.
This is the whole series snapping together. We load and chunk text like a file project, embed it with the embedding model, rank by cosine the way the embeddings lesson showed, stuff the winners into a grounded prompt the way RAG showed, and wrap it in a loop. No new concepts. Just assembly.
What we're building
Five moving parts, in order. The first four run once at startup; the last one runs on every question.
Everything lives in memory in a single script. No database, no web framework, no vector store. That's deliberate. You can read the whole thing top to bottom and see exactly where the magic is (there isn't any, it's cosine similarity and a careful prompt). When you outgrow in-memory, the last section points you at the tools that scale it.
The setup
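Same provider-agnostic client from your first API call, plus the embedding model from the embeddings lesson. The settings live as env vars in a git-ignored .env, nothing hard-coded: LLM_BASE_URL and LLM_API_KEY are required, while LLM_MODEL and LLM_EMBED_MODEL fall back to defaults.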

import os, math, glob

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

client = OpenAI(
    base_url=os.environ["LLM_BASE_URL"],
    api_key=os.environ["LLM_API_KEY"],
)

MODEL = os.environ.get("LLM_MODEL", "gpt-4o-mini")
EMBED_MODEL = os.environ.get("LLM_EMBED_MODEL", "text-embedding-3-small")
NOTES_DIR = "notes"  # put your .md / .txt files in here
TOP_K = 4            # how many chunks to feed the model

If pip install openai python-dotenv and the .env file are unfamiliar, the first API call lesson walks through every line. Make a folder called notes next to the script and toss in a few .md or .txt files. That's your knowledge base.
Load and chunk the notes
Two jobs: read every text file in the folder, and cut each one into short passages. We chunk because a whole document is a blunt instrument. If one paragraph answers the question, you want to retrieve that paragraph, not the entire 4,000-word file. Smaller chunks mean sharper retrieval and a tighter, cheaper prompt.
The chunker is pure Python with no API call, so you can run it on its own and watch a wall of text turn into clean passages. The file loader then pairs each chunk with the filename it came from. That filename is what lets us cite sources later:

def chunk_text(text, size=80):
    chunks, current = [], []
    for para in text.split("\n\n"):
        for word in para.split():
            current.append(word)
            if len(current) >= size:
                chunks.append(" ".join(current))
                current = []
        # Flush at the paragraph boundary so a chunk never spans two paragraphs.
        if current:
            chunks.append(" ".join(current))
            current = []
    return [c for c in chunks if c.strip()]

def load_notes(folder):
    passages = []  # list of {"text": ..., "source": ...}
    paths = glob.glob(os.path.join(folder, "*.md")) + \
            glob.glob(os.path.join(folder, "*.txt"))
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for chunk in chunk_text(f.read()):
                passages.append({"text": chunk, "source": os.path.basename(path)})
    return passages

Each passage is a tiny dict, the text plus where it lives. If dicts feel shaky, the dictionaries lesson is the refresher. Reading files with with open(...) is straight out of file handling.
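To watch the chunker work on its own, here is a small sketch that runs without any API call. It repeats chunk_text from above and lowers size to 6 so a short text splits visibly. The sample text is made up for illustration.

```python
def chunk_text(text, size=80):
    """Split text into chunks of at most `size` words, never crossing a blank-line paragraph break."""
    chunks, current = [], []
    for para in text.split("\n\n"):
        for word in para.split():
            current.append(word)
            if len(current) >= size:
                chunks.append(" ".join(current))
                current = []
        # Flush whatever is left when the paragraph ends.
        if current:
            chunks.append(" ".join(current))
            current = []
    return [c for c in chunks if c.strip()]

# Made-up sample: two paragraphs separated by a blank line.
sample = (
    "Pricing meeting. We agreed to keep the free tier and raise Pro to twelve dollars.\n\n"
    "Launch. Beta ships Friday."
)

demo_chunks = chunk_text(sample, size=6)
for i, chunk in enumerate(demo_chunks, start=1):
    print(i, repr(chunk))
```

The first paragraph splits into two full six-word chunks plus a short leftover, and the second paragraph starts a fresh chunk instead of being glued onto that leftover.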
Embed every chunk, once
This is the step people get wrong by re-running it on every question. Embed the whole corpus a single time at startup, keep the vectors in memory, and reuse them. Re-embedding on each query is slow and wastes money for zero benefit. We also need cosine similarity (the exact function from the embeddings lesson) and a tiny embed helper.

def embed(texts):
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

print(f"Loading notes from {NOTES_DIR}/ ...")
passages = load_notes(NOTES_DIR)
if not passages:
    raise SystemExit(f"No .md/.txt files found in {NOTES_DIR}/")

# Embed all chunks in one batched call, then stash the vectors alongside them.
vectors = embed([p["text"] for p in passages])
for p, v in zip(passages, vectors):
    p["vector"] = v
print(f"Indexed {len(passages)} chunks. Ask me anything (Ctrl-C to quit).")

Note the single batched embed([...]) call, one request for all chunks, not one per chunk. The API hands back one vector per input in the same order, so a plain zip lines each vector up with its passage. The whole index now sits in passages, each item carrying its text, its source filename, and its vector.
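One caveat on the single batched call: providers generally cap how many inputs one embeddings request may carry, so a big notes folder can overshoot it. Here is a minimal sketch of splitting the work into fixed-size batches. The helper name embed_in_batches and the batch size of 100 are assumptions picked for illustration, not values from any provider's documentation, and embed_fn stands for any function that maps a list of texts to a list of vectors, such as the embed helper above.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in slices of batch_size, preserving input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return vectors

# Stand-in embedder so the sketch runs offline: one fake 1-d vector per text.
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

texts = [f"chunk {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(len(vectors))  # 250
```

Because each batch comes back in input order and the batches are appended in order, the zip that pairs vectors with passages keeps working unchanged.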
Embed queries and chunks with the same model
Vectors from different embedding models live in incompatible coordinate spaces. Mix them and the cosine scores are noise. Index and query with the exact same EMBED_MODEL, and if you ever switch models, re-embed everything.
Retrieve: rank by cosine, take the top-k
The ask loop's first job is retrieval. Embed the question, score it against every chunk, return the best few. We also keep the score, because a low best-score is our signal that the notes simply don't cover the question. That's how we'll answer "I don't know" instead of making something up.

def retrieve(question, top_k=TOP_K):
    q = embed([question])[0]
    scored = []
    for p in passages:
        scored.append((cosine(q, p["vector"]), p))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

A linear scan over every chunk is perfectly fine for a few thousand passages. Past that you'd reach for a vector database, but the ranking logic wouldn't change. It's the same cosine sort, just done faster.
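To see the ranking without spending an API call, here is a toy version with hand-made two-dimensional vectors. The filenames and numbers are invented for illustration, and real embeddings have hundreds of dimensions or more, but the sort is the same one retrieve performs.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

# Hand-made stand-ins for embedded passages (hypothetical files).
toy_passages = [
    {"source": "pricing.md", "vector": [1.0, 0.0]},
    {"source": "launch.md", "vector": [0.0, 1.0]},
    {"source": "mixed.md", "vector": [1.0, 1.0]},
]
query = [0.9, 0.1]  # points almost the same way as pricing.md

scored = [(cosine(query, p["vector"]), p) for p in toy_passages]
scored.sort(key=lambda pair: pair[0], reverse=True)
ranking = [p["source"] for _, p in scored]
print(ranking)  # ['pricing.md', 'mixed.md', 'launch.md']
```

Cosine only cares about direction, so pricing.md wins because its vector points almost exactly where the query does, mixed.md lands in the middle, and launch.md trails far behind.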
Build a grounded prompt and answer
Here's the heart of RAG. We don't ask the model to answer from memory. We hand it the retrieved chunks and tell it to answer only from them, and to say it doesn't know when they don't cover the question. The system message sets the rules. The user message carries the question plus the context.

SYSTEM = (
    "You answer questions using ONLY the provided notes. "
    "If the notes don't contain the answer, say you don't know — "
    "do not use outside knowledge or guess. Be concise."
)

def answer(question):
    hits = retrieve(question)
    best_score = hits[0][0] if hits else 0.0
    # Nothing relevant retrieved? Don't even call the model.
    if best_score < 0.30:
        return "I don't know — I couldn't find anything about that in your notes.", []
    context = "\n\n".join(
        f"[{p['source']}]\n{p['text']}" for _, p in hits
    )
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": f"Notes:\n{context}\n\nQuestion: {question}"},
    ]
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    reply = resp.choices[0].message.content
    sources = sorted({p["source"] for _, p in hits})
    return reply, sources

Two safety nets are doing real work here. The best_score < 0.30 check is a cheap circuit breaker: if even the closest chunk is a poor match, we skip the API call entirely and admit we don't know. (Tune that threshold to your data. Start around 0.3 and watch what gets let through.) And the system prompt tells the model to ground its answer in the notes and decline otherwise, so a relevant-ish chunk that still doesn't contain the answer gets an honest "I don't know" rather than a confident fabrication. Belt and suspenders, because hallucination is the failure mode RAG exists to fight.
Each chunk in the context is tagged with [source.md], and we collect the unique filenames to print. That's the citation. The reader can go check the actual file.
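It helps to look at the exact text the model receives. The sketch below pulls the prompt assembly into a hypothetical helper, build_messages, and feeds it made-up hits in the (score, passage) shape that retrieve returns, so it runs without any API call.

```python
def build_messages(question, hits, system_prompt):
    """Assemble chat messages from (score, passage) pairs."""
    context = "\n\n".join(f"[{p['source']}]\n{p['text']}" for _, p in hits)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Notes:\n{context}\n\nQuestion: {question}"},
    ]

# Made-up hits standing in for what retrieve() would return.
sample_hits = [
    (0.71, {"source": "standup.md", "text": "Beta ships Friday."}),
    (0.42, {"source": "roadmap.md", "text": "Pricing review is next month."}),
]

messages = build_messages("When does the beta ship?", sample_hits, "Use ONLY the notes.")
print(messages[1]["content"])
```

Printing the user message shows each chunk sitting under its bracketed filename, followed by the question. When an answer looks wrong, this is the first thing worth inspecting: if the right passage isn't in that text, the problem is retrieval, not the model.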
Quick check
Why does the app embed all the note chunks once at startup instead of inside the question loop?
The ask loop
Last piece: a REPL that reads a question, answers it, and prints the sources. This is the part you actually talk to.

while True:
    try:
        question = input("\n> ").strip()
    except (EOFError, KeyboardInterrupt):
        print("\nBye.")
        break
    if not question:
        continue
    reply, sources = answer(question)
    print(f"\n{reply}")
    if sources:
        print(f"\nSources: {', '.join(sources)}")

That try/except around input() catches Ctrl-C and Ctrl-D so quitting is clean instead of a traceback, the kind of touch covered in error handling. Run it:

python notes_chat.py

Loading notes from notes/ ...
Indexed 37 chunks. Ask me anything (Ctrl-C to quit).
> what did we decide about the beta launch?
We agreed to ship the beta on Friday, with Maya owning the landing page.
Sources: standup-2026-07.md
> what's the capital of France?
I don't know — I couldn't find anything about that in your notes.

That's the whole app (load, chunk, embed once, retrieve, ground, answer, cite) in one file you can read in a sitting. It answers from your notes and admits when it can't. Paste the snippets together top to bottom and it runs.
Ship a v2: add streaming or a search tool
Two upgrades worth an afternoon. Streaming: pass stream=True to chat.completions.create and print tokens as they arrive, so long answers feel instant instead of frozen. A search tool: wrap retrieve as a tool the model can call, so instead of always retrieving up front, the model decides when it needs to look something up, and can search again with a better query if the first results miss. That's the bridge from RAG to a notes agent.
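A rough sketch of the streaming upgrade, assuming the OpenAI-style client from the setup. The helper name stream_reply is hypothetical, and the fake client at the bottom only mimics the shape of streamed chunks so the sketch runs offline. With the real client you would pass client and MODEL instead.

```python
from types import SimpleNamespace

def stream_reply(client, model, messages):
    """Print the reply piece by piece and return the full text."""
    stream = client.chat.completions.create(
        model=model, messages=messages, stream=True
    )
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks may carry no choices at all
        delta = chunk.choices[0].delta.content
        if delta:  # content can be None, e.g. on the final chunk
            print(delta, end="", flush=True)
            parts.append(delta)
    print()
    return "".join(parts)

# Offline stand-in that imitates the streamed chunk shape (illustration only).
def _fake_chunk(text):
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

class FakeCompletions:
    def create(self, model, messages, stream=False):
        pieces = ["Beta ", "ships ", "Friday.", None]
        return iter(_fake_chunk(p) for p in pieces)

fake_client = SimpleNamespace(chat=SimpleNamespace(completions=FakeCompletions()))
streamed = stream_reply(fake_client, "any-model", [{"role": "user", "content": "hi"}])
```

Returning the joined text as well as printing it means the caller can still collect sources or log the reply after the stream finishes.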
Where to go next
You built this with the standard library and one SDK, which is the right way to learn it. You saw every step. To run it for real, here's the honest map.
Frameworks that do this at scale. LangChain and LlamaIndex give you loaders for PDFs, HTML, and Notion, smarter chunkers, retrievers, and chains, so you wire components instead of writing the plumbing. The catch: they hide the steps you just learned by hand. Knowing what's underneath means you'll debug them instead of cargo-culting them.
Run the model locally. Ollama serves open models on your own machine behind the same OpenAI-compatible API. Point LLM_BASE_URL at http://localhost:11434/v1, change LLM_MODEL, and this exact script runs offline. Private notes never leave your laptop. Same code, different .env.
Vector databases. When the linear scan gets slow (tens of thousands of chunks and up), move the vectors into something like Chroma, Qdrant, or pgvector. They do approximate nearest-neighbour search (same cosine ranking, far faster) and they persist, so you don't re-embed on every restart.
Evaluation and testing. "It felt right in two questions" isn't testing. Build a small set of question/expected-answer pairs and check the app against them whenever you change a prompt or a chunk size. LLM apps drift quietly. A tiny eval set catches regressions a vibe never will.
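A tiny eval set can be as small as the sketch below. The cases, the helper name run_evals, and the substring check are all assumptions chosen for illustration. Substring matching is crude, but it is enough to notice when a prompt or chunk-size change breaks an answer that used to work. The fake answer function stands in for answer so the sketch runs offline.

```python
def run_evals(answer_fn, cases):
    """Return (passed, failures) for a list of (question, expected_phrase) cases."""
    failures = []
    for question, expected in cases:
        reply, _sources = answer_fn(question)
        if expected.lower() not in reply.lower():
            failures.append((question, expected, reply))
    return len(cases) - len(failures), failures

# Hypothetical cases; write yours from facts you know are in your notes,
# plus a few questions the notes should NOT be able to answer.
CASES = [
    ("When does the beta ship?", "friday"),
    ("What's the capital of France?", "don't know"),
]

# Stand-in for answer() so the sketch runs without an API call.
def fake_answer(question):
    if "beta" in question.lower():
        return "We agreed to ship the beta on Friday.", ["standup.md"]
    return "I don't know.", []

passed, failures = run_evals(fake_answer, CASES)
print(f"{passed}/{len(CASES)} passed")
for question, expected, reply in failures:
    print(f"FAIL: {question!r} expected {expected!r}, got {reply!r}")
```

Swap fake_answer for the real answer function and rerun after every change to the prompt, the threshold, or the chunk size.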
Fine-tuning vs RAG, in one line: RAG gives the model knowledge it didn't have (your notes), while fine-tuning changes its behaviour and style. Reach for RAG when the answer lives in documents, which is most of the time.
The real next step isn't a framework, though. It's shipping. Point this at notes you actually have — your course material, your work wiki, that folder of half-finished ideas — and use it this week. A small thing that runs beats a perfect thing that doesn't.
The whole journey
Look at the distance. You started with what an LLM actually is, a next-token predictor, not a mind. You made your first API call, then learned to shape the conversation with prompts and the message format and prompt engineering. You took control of the output (temperature, tokens, and streaming) and forced it into structured JSON you can trust. You let the model run your code with tool calling, turned text into meaning with embeddings, grounded answers in your own data with RAG, put it on a loop to build an agent, and learned to keep it cheap and safe with tokens, cost, and safety. This capstone tied them into one app.
None of it was magic. It was a model, a message list, some vectors, and a careful prompt, all in plain Python. The same loop you ran across the Python for Beginners series: try something, watch it break, read the error, fix it, run it again. You just pointed it at LLMs.
So point it somewhere real. You can build with this now. Go build something, and ship the small ugly first version this week. That's how every app you admire started.

Written by
Rhythm Bhiwani
Engineer and relentless builder, happiest reverse-engineering hard problems until they click.


