Week 3: Core RAG Architecture - 1. The RAG Pipeline

A RAG pipeline retrieves sources, assembles context, generates an answer, and cites evidence.

Before You Read

Trace a RAG request from user question to cited answer.
Separate ingestion-time work from query-time work.
Explain why grounding quality depends on retrieval, prompt assembly, and generation together.

Working Model

RAG turns a closed-book model into an open-book assistant. The model still writes the answer, but the application controls which pages are opened, how they are ordered, and what evidence must be cited.

In Week 1, we learned that Large Language Models (LLMs) suffer from knowledge cutoffs, hallucinations, and a lack of access to private data. In Week 2, we learned how to represent text mathematically using Embeddings and store them in Vector Databases.

Now, we combine these concepts to build Retrieval-Augmented Generation (RAG).

Introduced in 2020 by Lewis et al., RAG is an architecture that bridges the gap between a static LLM's internal knowledge and your dynamic, private data. Instead of relying on the model to memorize facts during training, a RAG application looks up relevant facts at runtime and hands them to the model alongside the question.

The Two Phases of RAG

A production RAG system is divided into two distinct phases:

1. The Indexing Phase (Data Preparation)

This happens "offline" (before the user ever asks a question).

Load: Ingest documents from your knowledge base (PDFs, Notion, Confluence, databases).
Parse & Clean: Extract the raw text and remove noise (HTML tags, headers, footers).
Chunk: Split the long documents into smaller, manageable pieces (chunks).
Embed: Pass each chunk through an embedding model to get a dense vector.
Store: Save the vectors, the original text chunks, and their source metadata (such as document name or URL, which is needed later for citations) into a Vector Database.
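
The indexing steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a production design: the `embed` function is a toy hashed bag-of-words stand-in for a real embedding model, the "Vector Database" is a plain Python list, and the helper names (`clean`, `chunk`, `build_index`) and the sample documents are invented for this example.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size (assumption); real embedding models define their own


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def clean(raw: str) -> str:
    """Parse & Clean: drop HTML tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.sub(r"\s+", " ", no_tags).strip()


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Chunk: split into windows of `size` words that share `overlap` words."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]


def build_index(docs: dict[str, str]) -> list[dict]:
    """Load -> Parse & Clean -> Chunk -> Embed -> Store (in a plain list)."""
    index = []
    for doc_id, raw in docs.items():
        for n, piece in enumerate(chunk(clean(raw))):
            index.append({
                "id": f"{doc_id}#{n}",    # source metadata, used later for citations
                "text": piece,            # original chunk text
                "vector": embed(piece),   # dense vector for similarity search
            })
    return index


# Made-up sample documents, for illustration only.
docs = {
    "budget.html": "<p>The Q3 marketing budget is 1.2 million dollars.</p>",
    "handbook.html": "<p>Employees accrue vacation days monthly.</p>",
}
index = build_index(docs)
for entry in index:
    print(entry["id"], "->", entry["text"])
```

Each stored entry keeps the chunk text and its source id next to the vector, so the runtime phase can both show the text to the LLM and point back to where it came from.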

2. The Retrieval & Generation Phase (Runtime)

This happens "online" (when the user asks a question).

Query: The user submits a question (e.g., "What is our Q3 marketing budget?").
Embed Query: The user's question is passed through the exact same embedding model used in the Indexing phase.
Retrieve: The Vector Database performs a similarity search (commonly cosine similarity, though dot product and Euclidean distance are also used) to find the top-$K$ chunks that are most semantically related to the query.
Generate: The retrieved chunks are injected into the LLM's prompt as "context", and the LLM is instructed to answer the user's question using only that context and to cite the chunks it relied on.
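
The runtime steps above can be sketched the same way. Again, this is a hedged illustration rather than a reference implementation: `embed` is a toy hashed bag-of-words stand-in for a real embedding model, the index is a plain list searched by brute force, the helper names (`cosine`, `retrieve`, `build_prompt`) and the sample chunks are invented, and the final LLM call is left out, so the sketch stops at the assembled prompt.

```python
import hashlib
import math
import re

DIM = 256  # toy vector size (assumption); must match the indexing side


def embed(text: str) -> list[float]:
    """Toy stand-in for the embedding model used at indexing time."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|), or 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, index: list[dict], k: int = 2) -> list[dict]:
    """Embed Query + Retrieve: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Generate (prompt assembly only): number each chunk so it can be cited."""
    context = "\n".join(
        f"[{i}] ({c['id']}) {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite sources like [1]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Made-up chunks standing in for the output of the indexing phase.
texts = {
    "budget#0": "The Q3 marketing budget is 1.2 million dollars.",
    "handbook#0": "Employees accrue vacation days monthly.",
    "office#0": "The office is closed on public holidays.",
}
index = [{"id": i, "text": t, "vector": embed(t)} for i, t in texts.items()]

question = "What is our Q3 marketing budget?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the LLM
```

Numbering the chunks in the prompt is one simple way to make citations checkable: the application can map each `[n]` in the answer back to a stored chunk id.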

Visualizing the Pipeline

[ Document ] -> [ Parse ] -> [ Chunk ] -> [ Embed ] -> [ Vector DB ]
                                                             ^
                                                             | (Retrieve)
[ User Query ] -> [ Embed Query ] ---------------------------+
                                                             | (Top-K Chunks)
[ LLM ] <----------------------------------------------------+
   |
   v
[ Answer ]

Over the next few lessons, we will break down each of these steps in detail.