Chapter 10: RAG Evaluation & Observability (RAGAs)
In traditional software engineering, you write unit tests. If a function is supposed to return True but returns False, the test fails, and you do not merge your code.
In the world of Generative AI, that kind of exact-match test breaks down. If the ground truth answer is "The sky is blue", and your RAG pipeline outputs "The atmosphere appears azure", a standard string-matching unit test will fail, even though the AI's answer means the same thing and is factually correct.
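The failure is easy to reproduce. The following is a minimal sketch using the two sentences above; `exact_match` is a hypothetical helper written for this illustration, not part of any library.

```python
def exact_match(expected: str, actual: str) -> bool:
    """Return True only if the strings are identical after trimming and lowercasing."""
    return expected.strip().lower() == actual.strip().lower()


ground_truth = "The sky is blue"
rag_output = "The atmosphere appears azure"

# Same meaning, different wording: a string comparison cannot tell.
print(exact_match(ground_truth, rag_output))  # prints False
```

Normalizing case and whitespace does not help here, because the two answers share almost no words. This is the gap that an LLM judge is meant to fill.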
To deploy RAG into a production environment, you need a robust, automated evaluation framework. You cannot rely on "eyeballing" the answers. In this final chapter, we explore LLM-as-a-Judge and the RAGAs framework.
10.1 The RAGAs Framework
RAGAs (Retrieval Augmented Generation Assessment) is a widely used open-source framework for evaluating RAG pipelines. Instead of using Python assertions, RAGAs uses an LLM as a judge (an OpenAI model by default, though the judge is configurable) to score your pipeline's outputs on a scale from 0.0 to 1.0.
This chapter uses four core RAGAs metrics, split into two groups, allowing you to isolate exactly where your pipeline is failing:
Generation Metrics (The LLM's Performance)
- Faithfulness: Measures hallucinations. Does the answer contain facts that are not present in the retrieved context? If yes, the score drops.
- Answer Relevance: Does the answer actually address the user's question, or did it go on an unrelated tangent?
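To make the faithfulness score concrete, it helps to think of the answer as a list of individual claims, each checked against the retrieved context. The sketch below assumes the judge's verdicts are already available as hand-written booleans; `faithfulness_score` is a hypothetical helper that illustrates the ratio, not the RAGAs implementation itself.

```python
def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """Fraction of the answer's claims that are supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0  # assumption for this sketch: no extractable claims scores 0
    return sum(claim_verdicts) / len(claim_verdicts)


# Hypothetical verdicts for an answer split into four claims:
# three are backed by the context, one is not (a hallucination).
verdicts = [True, True, True, False]
print(faithfulness_score(verdicts))  # prints 0.75
```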
Retrieval Metrics (The Vector Database's Performance)
- Context Precision: Did the database rank the relevant chunks above the irrelevant ones? The score is highest when every relevant chunk sits at the top of the retrieved list, and it drops when noise is ranked ahead of useful chunks.
- Context Recall: Did the retrieved chunks contain all the information needed to answer the question completely, as measured against the ground truth answer?
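The two retrieval metrics can be illustrated with plain arithmetic. The sketch below is a simplified version of the idea, assuming the relevance of each chunk and the coverage of each ground truth statement have already been judged and are supplied as hand-written booleans. Both function names are hypothetical, and the library's actual computation may differ in its details.

```python
def context_precision(relevance: list[bool]) -> float:
    """Rank-aware precision: average of precision@k over the ranks holding a relevant chunk."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(attributed: list[bool]) -> float:
    """Fraction of ground truth statements that can be found in the retrieved chunks."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# Same two relevant chunks, ranked first versus ranked last.
relevant_first = [True, True, False, False]
relevant_last = [False, False, True, True]
print(context_precision(relevant_first))           # prints 1.0
print(round(context_precision(relevant_last), 2))  # prints 0.42

# One of two ground truth statements is covered by the context.
print(context_recall([True, False]))               # prints 0.5
```

Both rankings retrieve the same chunks, yet the second scores far lower. That is the point of a rank-aware metric: an LLM tends to make better use of context that appears early in the prompt.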
10.2 Implementing Automated Evaluation
To run an evaluation, you need a golden dataset: a list of predefined questions and their expected ground_truth answers.
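A golden dataset is easier to maintain when it lives in a version-controlled file rather than in the script. The sketch below assumes a JSON Lines format with one record per line; the field names `question` and `ground_truth` and the `load_golden_dataset` helper are illustrative choices, not a format that RAGAs requires.

```python
import json


def load_golden_dataset(jsonl_text: str) -> tuple[list[str], list[str]]:
    """Parse JSON Lines text into parallel lists of questions and reference answers."""
    questions, references = [], []
    for line_no, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if not record.get("question") or not record.get("ground_truth"):
            raise ValueError(f"line {line_no}: 'question' and 'ground_truth' are required")
        questions.append(record["question"])
        references.append(record["ground_truth"])
    return questions, references


# In practice this text would be read from a file such as a hypothetical golden.jsonl.
sample = (
    '{"question": "What is our refund policy?", '
    '"ground_truth": "Customers get a full refund within 30 days of purchase."}\n'
)
golden_questions, golden_answers = load_golden_dataset(sample)
print(golden_questions)  # prints ['What is our refund policy?']
```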
You run those questions through your RAG pipeline to generate the contexts and the answers. Then, you pass all four variables to the RAGAs library.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

# 1. Your Golden Dataset
questions = ["What is our refund policy?"]
# One reference answer (a plain string) per question. Older RAGAs releases (0.0.x)
# expected a "ground_truths" column holding a list of strings per question instead.
ground_truths = ["Customers get a full refund within 30 days of purchase."]

# 2. Run your pipeline to get the actual results
answers = []
contexts = []
for q in questions:
    result = rag_chain_with_sources.invoke(q)  # From Chapter 6!
    answers.append(result["answer"])
    # RAGAs expects the retrieved chunks as a list of strings per question
    contexts.append([doc.page_content for doc in result["context"]])

# 3. Format the data for RAGAs (all four lists must have the same length)
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
}
dataset = Dataset.from_dict(data)

# 4. Run the LLM-as-a-Judge Evaluation!
# This calls the judge LLM, so its API credentials must be configured.
metrics = [faithfulness, answer_relevancy, context_precision, context_recall]
score = evaluate(dataset, metrics=metrics)
print(score)
# Example output (values are illustrative):
# {'faithfulness': 0.95, 'answer_relevancy': 0.88, 'context_precision': 0.92, 'context_recall': 1.0}
By adding this script to your CI/CD pipeline (e.g., GitHub Actions) and making it exit with a non-zero status when a score falls below a threshold, you can automatically block a pull request if a change to the chunk size or embedding model causes the faithfulness score to drop below 0.90.
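A CI gate needs one more piece: a check that turns scores into a pass or fail exit code. The sketch below assumes the scores are available as a plain dict of metric names to floats, such as one built from the evaluation result. The threshold values and the names `THRESHOLDS`, `find_regressions`, and `gate` are hypothetical and should be tuned to your own baseline.

```python
import math

# Hypothetical minimum acceptable scores; tune these to your own baseline.
THRESHOLDS = {"faithfulness": 0.90, "context_recall": 0.80}


def find_regressions(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one human-readable message per metric that is missing, NaN, or too low."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from results")
        elif math.isnan(value):
            failures.append(f"{metric}: score is NaN")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return failures


def gate(scores: dict[str, float]) -> int:
    """Print any failures and return a process exit code (0 = pass, 1 = fail)."""
    failures = find_regressions(scores, THRESHOLDS)
    for failure in failures:
        print(f"FAIL {failure}")
    return 1 if failures else 0


# Hard-coded scores so the sketch runs on its own; in CI they would come from the evaluation.
example_scores = {"faithfulness": 0.95, "context_recall": 1.0}
exit_code = gate(example_scores)
print(exit_code)  # prints 0; in CI, pass this value to sys.exit()
```

The NaN check matters because a comparison such as `nan < 0.90` evaluates to False in Python, so a failed judge call could otherwise slip through the gate. Because LLM-judged scores can vary between runs, it is also worth leaving some margin between your thresholds and your typical scores.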
10.3 Observability with LangSmith
Metrics tell you that a problem occurred, but they don't tell you why.
If your RAG chain starts failing, debugging it in the terminal is a nightmare. Was the LLM prompt bad? Did the Multi-Query retriever hallucinate a bad search term? Did the Vector Database return empty results?
LangSmith is the hosted observability platform built by the LangChain team. With just two environment variables, it traces every step of your LCEL pipeline and visualizes it in a web dashboard.
# In your terminal or .env file
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="ls__your_api_key_here"
Once enabled, you don't need to change a single line of your Python code. Every invoke() call is logged. You can log into LangSmith and see a visual waterfall diagram of your execution:
- How long the retrieval took (in milliseconds).
- Exactly what the LLM prompt looked like after the variables were injected.
- The raw token usage and cost for that specific interaction.
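In a notebook or a test script it can be more convenient to set the same variables from Python before the chain runs. The sketch below assumes that approach; `enable_langsmith_tracing` is a hypothetical helper, the project name is a placeholder, and `LANGCHAIN_PROJECT` is an optional variable that groups traces under a named project.

```python
import os
from typing import Optional


def enable_langsmith_tracing(api_key: str, project: Optional[str] = None) -> None:
    """Set the LangSmith tracing variables for the current process."""
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key
    if project:
        # Optional: group this run's traces under a named project
        os.environ["LANGCHAIN_PROJECT"] = project


# Placeholder values; load the real key from a secret store, never from source code.
enable_langsmith_tracing("ls__your_api_key_here", project="rag-eval-demo")
print(os.environ["LANGCHAIN_TRACING_V2"])  # prints true
```

Call the helper before the first invoke() so that the earliest runs are traced as well.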
Conclusion
Building a RAG pipeline is a journey from simple text scripts to complex, distributed cloud architectures.
You have mastered Document Loading, Semantic Chunking, Dense/Sparse Embeddings, Vector Indexing, Pure LCEL pipelines, Re-ranking, and Automated Evaluation. You now possess the skills to build, deploy, and maintain Enterprise-Grade Generative AI systems.
Congratulations, and happy building!