Chapter 9: Conversational RAG & State Management — Complete RAG In Python

A single-turn RAG pipeline (like the one we built in Chapter 6) assumes the user asks perfectly self-contained questions.

But users expect to chat with your AI. Consider this exchange:

User: "What is the parental leave policy?"
AI: "The policy grants 12 weeks of paid leave."
User: "Does that apply to contractors?"

If you pass the string "Does that apply to contractors?" directly to your vector database, retrieval goes wrong. The query carries no information about what "that" refers to, so the search will likely return documents about contractors and miss the documents about parental leave.

To build Conversational RAG, we must intercept the user's follow-up question and rewrite it using the chat history before we search the database.
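
The failure mode is easy to reproduce without a real vector database. The sketch below is a toy illustration only: the two documents are hypothetical, and a simple word-overlap score stands in for embedding similarity. It compares what the raw follow-up retrieves with what a rewritten, standalone question retrieves.

```python
import re

# Toy corpus standing in for the vector database (hypothetical documents).
DOCS = {
    "parental_leave": "The parental leave policy grants 12 weeks of paid leave to employees.",
    "contractor_billing": "Contractors submit invoices monthly and apply for access badges.",
}

def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query: str) -> str:
    """Return the id of the document sharing the most words with the query."""
    query_words = tokens(query)
    return max(DOCS, key=lambda doc_id: len(query_words & tokens(DOCS[doc_id])))

raw = "Does that apply to contractors?"
rewritten = "Does the 12-week parental leave policy apply to contractors?"

print(search(raw))        # contractor_billing
print(search(rewritten))  # parental_leave
```

Real embeddings are far more forgiving than word overlap, but the underlying problem is the same: the raw follow-up contains nothing that points at parental leave.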

9.1 Contextualizing the Query

The modern LangChain solution is a two-step LCEL pipeline:

1. The History-Aware Retriever: An LLM reads the chat history and the user's latest follow-up question. It rewrites the follow-up question into a standalone, searchable query.
2. The Document Chain: The standard RAG generation step using the newly retrieved documents.

Step 1: The History-Aware Retriever

from langchain.chains import create_history_aware_retriever
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# This prompt asks the LLM to rewrite the question
contextualize_q_system_prompt = """Given a chat history and the latest user question \
which might reference context in the chat history, formulate a standalone question \
which can be understood without the chat history. Do NOT answer the question, \
just reformulate it if needed and otherwise return it as is."""

contextualize_q_prompt = ChatPromptTemplate.from_messages([
    ("system", contextualize_q_system_prompt),
    MessagesPlaceholder("chat_history"), # This injects the message history list
    ("human", "{input}"),
])

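# llm and vector_db are assumed to be the chat model and vector store
# set up in earlier chapters.
# This creates a special retriever that runs the prompt BEFORE querying the DB
history_aware_retriever = create_history_aware_retriever(
    llm, vector_db.as_retriever(), contextualize_q_prompt
)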

If the user asks "Does that apply to contractors?", the LLM rewrites it behind the scenes into something like "Does the 12-week parental leave policy apply to contractors?", and the retriever searches the database with the rewritten string.
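
One detail worth knowing: according to the LangChain documentation, create_history_aware_retriever passes the input straight to the retriever when chat_history is empty, and only calls the LLM when there is history to resolve. The plain-Python sketch below mirrors that branching so you can see the control flow in isolation. It is not LangChain's implementation; standalone_query and fake_rewrite are hypothetical names, and fake_rewrite returns a canned string where a real system would call the model.

```python
from typing import Callable, Sequence

Message = tuple[str, str]  # (role, text), a stand-in for LangChain message objects

def standalone_query(
    question: str,
    chat_history: Sequence[Message],
    rewrite: Callable[[str, Sequence[Message]], str],
) -> str:
    """Return the string that should be sent to the retriever."""
    if not chat_history:
        # First turn: nothing to resolve, so skip the extra LLM call.
        return question
    return rewrite(question, chat_history)

def fake_rewrite(question: str, chat_history: Sequence[Message]) -> str:
    """Hypothetical stand-in for the LLM call; a real rewrite is model-generated."""
    return "Does the 12-week parental leave policy apply to contractors?"

history = [
    ("human", "What is the parental leave policy?"),
    ("ai", "The policy grants 12 weeks of paid leave."),
]

first_turn = standalone_query("What is the parental leave policy?", [], fake_rewrite)
follow_up = standalone_query("Does that apply to contractors?", history, fake_rewrite)

print(first_turn)  # unchanged, because there is no history yet
print(follow_up)   # the rewritten, standalone question
```

The practical consequence is cost: every turn after the first pays for one extra LLM call before retrieval even starts.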

Step 2: The Generation Chain

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

qa_system_prompt = """You are an assistant for question-answering tasks. \
Use the following pieces of retrieved context to answer the question. \
If you don't know the answer, just say that you don't know.

{context}"""

qa_prompt = ChatPromptTemplate.from_messages([
    ("system", qa_system_prompt),
    MessagesPlaceholder("chat_history"),
    ("human", "{input}"),
])

# Assemble the final pipeline
question_answer_chain = create_stuff_documents_chain(llm, qa_prompt)
rag_chain = create_retrieval_chain(history_aware_retriever, question_answer_chain)
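
Without a persistence layer, the caller has to thread chat_history through every call by hand. The sketch below shows that bookkeeping. To keep it runnable without an API key, fake_rag_chain is a hypothetical stand-in for rag_chain.invoke: it accepts the same "input" and "chat_history" keys and returns a dict with "context" and "answer" keys like the real chain's result, but the answer is canned. Messages are plain (role, text) tuples standing in for LangChain message objects.

```python
def fake_rag_chain(inputs: dict) -> dict:
    """Hypothetical stand-in for rag_chain.invoke(); reports how much history it saw."""
    seen = len(inputs["chat_history"])
    return {**inputs, "context": [], "answer": f"(answer given {seen} prior messages)"}

chat_history: list[tuple[str, str]] = []  # (role, text) pairs, oldest first

def ask(question: str) -> str:
    """Run one turn, then record both sides of the exchange."""
    result = fake_rag_chain({"input": question, "chat_history": list(chat_history)})
    chat_history.append(("human", question))
    chat_history.append(("ai", result["answer"]))
    return result["answer"]

first = ask("What is the parental leave policy?")
second = ask("Does that apply to contractors?")

print(first)   # (answer given 0 prior messages)
print(second)  # (answer given 2 prior messages)
```

This works in a long-running process, but the list lives only in that process's memory.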

9.2 Persistent State Management

If you deploy your app to a serverless environment (like AWS Lambda or Vercel), instances are frozen or discarded between requests, and the next request may land on a fresh one. You cannot rely on a chat_history stored in an in-memory Python list: it can vanish at any time.

You must persist memory to an external database. LangChain handles this seamlessly with RunnableWithMessageHistory.

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import SQLChatMessageHistory

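# Define a function that returns the message history object for a specific session
def get_session_history(session_id: str) -> SQLChatMessageHistory:
    return SQLChatMessageHistory(
        session_id=session_id,
        # Any SQLAlchemy URL works here (e.g. Postgres). Redis is not a SQL
        # backend; it has its own class, RedisChatMessageHistory.
        # Note: recent langchain_community releases deprecate
        # connection_string in favor of `connection`.
        connection_string="sqlite:///chat_history.db",
    )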

# Wrap the RAG chain in the History Manager
conversational_rag_chain = RunnableWithMessageHistory(
    rag_chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="chat_history",
    output_messages_key="answer",
)

# Execution: You must pass the session_id so it knows which chat to load!
result = conversational_rag_chain.invoke(
    {"input": "Does that apply to contractors?"},
    config={"configurable": {"session_id": "user_123_chat_456"}}
)

print(result["answer"])

With RunnableWithMessageHistory, you never have to manually append messages to a list or write SQL INSERT statements. On each call, LangChain loads the session's history from the database, feeds it to the pipeline, and, once the chain finishes, writes the new user message and AI response back to the database.
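
To make the load, feed, and write-back cycle concrete, here is a rough sketch of the same idea using only the standard library's sqlite3 module. It is an illustration of the pattern, not LangChain's implementation: the messages table, the helper names, and fake_chain are all hypothetical, and an in-memory database is used only so the example runs anywhere (a deployed app would point at a file or a database server).

```python
import sqlite3

def open_store(path: str) -> sqlite3.Connection:
    """Open (or create) a hypothetical message table keyed by session_id."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "session_id TEXT, role TEXT, content TEXT)"
    )
    return conn

def load_history(conn: sqlite3.Connection, session_id: str) -> list[tuple[str, str]]:
    """Return one session's messages as (role, content) pairs, oldest first."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    )
    return [(role, content) for role, content in rows]

def run_turn(conn: sqlite3.Connection, session_id: str, question: str, chain) -> str:
    """Load the history, invoke the chain, then save both new messages."""
    history = load_history(conn, session_id)
    answer = chain({"input": question, "chat_history": history})["answer"]
    conn.executemany(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        [(session_id, "human", question), (session_id, "ai", answer)],
    )
    conn.commit()
    return answer

def fake_chain(inputs: dict) -> dict:
    """Hypothetical stand-in for the RAG chain; reports how much history it saw."""
    return {"answer": f"(answer given {len(inputs['chat_history'])} prior messages)"}

conn = open_store(":memory:")
run_turn(conn, "user_123_chat_456", "What is the parental leave policy?", fake_chain)
reply = run_turn(conn, "user_123_chat_456", "Does that apply to contractors?", fake_chain)
other = run_turn(conn, "another_session", "Hello?", fake_chain)

print(reply)  # (answer given 2 prior messages)
print(other)  # (answer given 0 prior messages)
```

Because every read and write is filtered by session_id, two conversations never see each other's messages, which is the same role the session_id in the config dictionary plays above.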

9.3 What's Next?

Our pipeline is functionally complete. It ingests data, chunks it semantically, embeds it, stores it in a secure vector DB, routes queries intelligently, re-ranks the results, handles follow-up questions, and saves state to a SQL database.

But how do you know if it's actually good? How do you prove to your boss that updating the embedding model didn't break the system?

In our final chapter, Chapter 10, we will explore RAG Evaluation & Observability (RAGAs), replacing manual testing with automated, metric-driven pipelines.