Week 8

Production & Evaluation

Step 1 of 5

1. Evaluating RAG (RAGAS)

Building a prototype RAG application takes a weekend. Getting it ready for production takes months.

The hardest part of moving to production is answering a simple question: "Is my app actually getting better?" When you change your chunk size from 500 to 1000, or switch from Cosine Similarity to Hybrid Search, how do you know if the answers improved? You cannot manually read 1,000 test queries every time you change a line of code.

You need automated evaluation.

The RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is a widely adopted open-source framework for evaluating RAG pipelines. Instead of relying on human graders, RAGAS uses a "Judge LLM" (such as GPT-4) to grade your RAG pipeline's outputs.

RAGAS breaks evaluation down into two main categories: Retrieval Metrics (did the database find the right documents?) and Generation Metrics (did the LLM write a good answer?).

1. Context Precision (Retrieval)

  • What it measures: Did the Vector Database put the most relevant chunks at the top of the list?
  • Why it matters: If the answer is in chunk #10, the LLM might suffer from the "Lost in the Middle" phenomenon and ignore it. The best context must be ranked #1.
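
The rank-weighting intuition can be shown with a toy calculation. This is a minimal sketch, not the RAGAS implementation (which uses the Judge LLM to decide relevance); it computes a rank-weighted mean of precision@k, given 0/1 relevance flags for each retrieved chunk in rank order:

```python
def context_precision(relevance):
    """relevance: list of 0/1 flags, one per retrieved chunk, in rank order.

    Averages precision@k over the positions that hold a relevant chunk,
    so relevant chunks ranked near the top score higher.
    """
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # precision@k at this relevant position
    return score / max(hits, 1)

# The same relevant chunk scores higher at rank 1 than at rank 3.
print(context_precision([1, 0, 0]))  # 1.0
print(context_precision([0, 0, 1]))  # 0.333...
```

Note how the metric rewards ordering, not just retrieval: both lists contain exactly one relevant chunk, but only the first gets a perfect score.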

2. Context Recall (Retrieval)

  • What it measures: Did the Vector Database retrieve all the information needed to answer the question?
  • Why it matters: If a user asks for the pros and cons of a policy, but the database only retrieves the pros, the LLM cannot fully answer the question.
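
The arithmetic behind recall is the share of ground-truth claims that the retrieved context can support. A minimal sketch, substituting naive substring matching for the Judge-LLM attribution step RAGAS actually performs:

```python
def context_recall(ground_truth_claims, contexts):
    """Fraction of ground-truth claims supported by the retrieved contexts.

    Naive substring matching stands in for the Judge LLM's attribution check.
    """
    joined = " ".join(contexts).lower()
    supported = sum(1 for claim in ground_truth_claims if claim.lower() in joined)
    return supported / len(ground_truth_claims)

claims = ["the policy cuts costs", "the policy slows hiring"]
contexts = ["Analysts note the policy cuts costs for small firms."]
print(context_recall(claims, contexts))  # 0.5 -- the "cons" claim was never retrieved
```

This mirrors the pros-and-cons example above: retrieving only the pros caps the recall score at 0.5.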

3. Faithfulness (Generation)

  • What it measures: Is the generated answer derived entirely from the retrieved context? Or did the LLM hallucinate?
  • Why it matters: In enterprise RAG, the answer should be grounded in the retrieved documents, not the LLM's internal training data. Every claim in the answer that is not supported by the retrieved chunks lowers the Faithfulness score; an answer built entirely on unsupported claims scores 0.
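
Faithfulness is the mirror image of Context Recall: it checks the answer's claims against the context, rather than the ground truth's claims. A minimal sketch, again using substring matching in place of the Judge-LLM verification RAGAS performs:

```python
def faithfulness(answer_claims, contexts):
    """Share of claims in the generated answer backed by the retrieved context.

    Substring matching stands in for the Judge LLM's claim verification.
    """
    joined = " ".join(contexts).lower()
    supported = sum(1 for claim in answer_claims if claim.lower() in joined)
    return supported / len(answer_claims)

contexts = ["Paris is the capital of France."]
print(faithfulness(["paris is the capital of france"], contexts))  # 1.0
# Adding a claim the context never mentions drags the score down:
print(faithfulness(["paris is the capital of france",
                    "paris has 2.1 million residents"], contexts))  # 0.5
```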

4. Answer Relevance (Generation)

  • What it measures: Does the generated answer actually address the user's question?
  • Why it matters: An LLM might write a perfectly faithful summary of the retrieved documents, but if those documents don't answer the user's specific question, the response is useless.
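
RAGAS scores this by asking an LLM to re-generate questions from the answer, then measuring how similar those questions are (via embeddings) to the user's original question. A minimal sketch with hand-made toy vectors standing in for real embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the user's question and questions
    an LLM re-generated from the answer (toy vectors, not real embeddings)."""
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)

# One re-generated question matches the original exactly; one drifts off-topic.
print(answer_relevancy([1, 0], [[1, 0], [0.6, 0.8]]))  # 0.8
```

An answer that only addresses the retrieved documents, not the question, yields re-generated questions that drift away from the original, pulling the score down.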

Example: Evaluating with RAGAS (Python)

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Your test dataset: questions, your RAG system's outputs, and references
data = {
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital."],
    "contexts": [["France is a country in Europe. Its capital is Paris."]],
    "ground_truth": ["Paris"],
}
dataset = Dataset.from_dict(data)

# Run the evaluation. By default RAGAS calls an OpenAI Judge LLM,
# so OPENAI_API_KEY must be set in your environment.
# (Note: newer ragas releases rename these columns to user_input,
# response, retrieved_contexts, and reference.)
result = evaluate(
    dataset=dataset,
    metrics=[context_precision, faithfulness, answer_relevancy],
)

print(result)
# Example output (exact scores vary with the judge model):
# {'context_precision': 1.0, 'faithfulness': 1.0, 'answer_relevancy': 0.98}

By running a test suite of 100 queries through RAGAS every time you update your code, you get a quantitative dashboard of your pipeline's health, allowing you to iterate with confidence.