Chapter 7: Conversational Memory: State Management with RunnableWithMessageHistory
By default, Large Language Models are completely stateless. Every time you invoke a model API, it evaluates the request as if it has never spoken to you before. It has no memory of the question you asked five seconds ago.
To create a conversational chat experience, you must manually feed the past conversation history (both human messages and AI responses) back into the LLM as part of every new prompt.
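The replay pattern described above can be sketched in plain Python. This is only an illustration: `fake_llm` is a hypothetical stand-in for a stateless model API, and the `role`/`content` dictionaries are an assumed message format, not a LangChain type.

```python
from typing import Dict, List

Message = Dict[str, str]  # assumed shape: {"role": ..., "content": ...}


def fake_llm(messages: List[Message]) -> str:
    """Hypothetical stateless model: it only knows what is inside `messages`."""
    human_turns = [m["content"] for m in messages if m["role"] == "human"]
    return f"You have sent {len(human_turns)} message(s); the first was {human_turns[0]!r}."


history: List[Message] = []


def chat(user_text: str) -> str:
    """Append the new human turn, send the whole transcript, then store the reply."""
    history.append({"role": "human", "content": user_text})
    reply = fake_llm(history)  # the full history travels with every request
    history.append({"role": "ai", "content": reply})
    return reply


first_reply = chat("My name is Ada.")
second_reply = chat("What is my name?")
print(second_reply)
```

The second reply can refer to the first message only because `chat` resent it. Calling `fake_llm` with the second question alone would give it nothing earlier to draw on.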
In older versions of LangChain, memory was handled by black-box helpers like ConversationBufferMemory attached to LLMChain. In modern LangChain, we manage conversation logs using database adapters and wrap our chains in the RunnableWithMessageHistory wrapper.
In this chapter, we will learn how state management operates, explore SQL database-backed message storage, and implement an advanced token-trimming strategy to keep our conversations within the model's context limits.
7.1 The MessagesPlaceholder & Core Mechanics
To support conversation history, your prompt template must have a placeholder slot where the list of past messages can be injected dynamically. We use MessagesPlaceholder (from langchain_core.prompts) to represent this dynamic array.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a friendly customer helper."),
    # The variable 'chat_history' will hold the array of message objects
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input_text}")
])
When you wrap an LCEL chain with RunnableWithMessageHistory, it performs three steps automatically during an .invoke() call:
- Reads Session ID: Looks at your configuration parameters (e.g., session_id = "user_abc_456").
- Loads History: Queries the storage backend to fetch all past messages for that session ID and injects them into the MessagesPlaceholder.
- Saves Interaction: Executes the prompt-model-parser chain, captures the human input and AI output, and writes them back to the database.
7.2 Storage Engines: In-Memory vs. Database Storage
LangChain supports a variety of storage engines.
- InMemoryChatMessageHistory: Keeps messages in RAM as a plain Python list. Fast and simple for unit testing, but all history is lost when the Python script exits.
- SQLChatMessageHistory: Saves messages to a SQL database (such as SQLite, PostgreSQL, or MySQL) using SQLAlchemy, so history persists across restarts.
Let's look at how to initialize a persistent SQLite database connection for history:
from langchain_community.chat_message_histories import SQLChatMessageHistory

def get_session_history(session_id: str):
    """
    Returns a chat history backed by a local SQLite database.
    Messages are stored and looked up by the given session_id.
    """
    return SQLChatMessageHistory(
        session_id=session_id,
        connection="sqlite:///chat_history.db"
    )
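To make the idea of session-keyed persistence concrete, here is a simplified model built on the standard-library sqlite3 module. It is not how SQLChatMessageHistory is implemented; the `TinySQLHistory` class, the `messages` table, and its columns are all hypothetical names chosen for this sketch.

```python
import sqlite3
from typing import List, Tuple


class TinySQLHistory:
    """Illustrative stand-in for a SQL-backed chat history (hypothetical class)."""

    def __init__(self, conn: sqlite3.Connection, session_id: str) -> None:
        self.conn = conn
        self.session_id = session_id
        # One shared table; rows are separated by session_id.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "session_id TEXT, role TEXT, content TEXT)"
        )

    def add(self, role: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (self.session_id, role, content),
        )
        self.conn.commit()

    def messages(self) -> List[Tuple[str, str]]:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
            (self.session_id,),
        )
        return rows.fetchall()


# ":memory:" keeps the demo self-contained; a file path would survive restarts.
conn = sqlite3.connect(":memory:")
alice = TinySQLHistory(conn, "alice")
bob = TinySQLHistory(conn, "bob")
alice.add("human", "Hi")
alice.add("ai", "Hello!")
bob.add("human", "Ping")

# A fresh object with the same session_id sees the earlier messages.
reloaded = TinySQLHistory(conn, "alice")
print(reloaded.messages())
```

Because every row carries its session ID, many conversations can share one database without seeing each other's messages.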
7.3 Advanced Topic: Dynamic Message Trimming
In production, user conversations can stretch into dozens of messages. Sending the entire history with every request has severe drawbacks:
- Costs: Hosted LLM APIs typically bill per token, and every resent message counts as input tokens. Large histories raise your costs on every request.
- Latency: Larger contexts take the model longer to process.
- Context Exhaustion: Eventually, the history will exceed the maximum context length of the model (e.g., 8K or 128K tokens), and the request will fail with an error.
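A little arithmetic shows how quickly the cost grows. The sketch below assumes, purely for illustration, that every message is 50 tokens long and that each request carries all earlier messages plus the new human message. `total_input_tokens` is a hypothetical helper, not a LangChain function.

```python
from typing import Optional


def total_input_tokens(turns: int, tokens_per_message: int = 50,
                       cap: Optional[int] = None) -> int:
    """Sum the input tokens sent over `turns` requests.

    Before turn t (counting from 0) the history holds 2 * t messages:
    one human and one AI message per completed turn.
    `cap` optionally limits how many history tokens are resent.
    """
    total = 0
    for turn in range(turns):
        history_tokens = 2 * turn * tokens_per_message
        if cap is not None:
            history_tokens = min(history_tokens, cap)
        total += history_tokens + tokens_per_message  # history + new question
    return total


full = total_input_tokens(20)
capped = total_input_tokens(20, cap=500)
print(full, capped)  # 20000 9500
```

With the full history the total grows roughly with the square of the number of turns. With a 500-token cap each additional turn costs a fixed amount once the cap is reached.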
To solve this, we add a preprocessing step to our chain that drops older messages, retaining only the most recent messages that fit within a token budget. Modern LangChain provides a helper function trim_messages in langchain_core.messages to handle this logic cleanly.
Here is how we configure message trimming:
from langchain_core.messages import trim_messages
from langchain_ollama import ChatOllama

model = ChatOllama(model="llama3.2")

# Define a trimmer that retains up to 1000 tokens of history
trimmer = trim_messages(
    max_tokens=1000,
    strategy="last",       # Retain the most recent messages
    # Count tokens via the model's get_num_tokens_from_messages();
    # for some models this may fall back to a generic tokenizer
    token_counter=model,
    include_system=True,   # Keep a leading system message, if the list has one
    allow_partial=False,   # Do not split a single message in half
    start_on="human"       # Start history sequence on a human message
)
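Because no messages were passed in, trim_messages returns a runnable rather than a trimmed list, so we can pipe this trimmer directly into our prompt formatting flow.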
7.4 Hands-on Example: Conversational Engine with Memory and Trimming
Let's build a complete, runnable chat application that saves its messages to a local SQLite database and uses token trimming.
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_core.messages import trim_messages
from langchain_community.chat_message_histories import SQLChatMessageHistory
from langchain_ollama import ChatOllama

# 1. Initialize local Chat Model
model = ChatOllama(model="llama3.2", temperature=0.5)

# 2. Formulate Prompt Template
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant. Answer questions concisely. Remember details discussed in the history."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{question}")
])

# 3. Create the Message Trimmer
trimmer = trim_messages(
    max_tokens=500,  # Set small for demo purposes
    strategy="last",
    token_counter=model,
    include_system=True,  # No effect here: the stored history has no system message
    allow_partial=False,
    start_on="human"
)

# 4. Construct the Base Chain with Trimming
# We inject the trimmer before the prompt formatting.
# The trimmer takes the chat history list, trims it, and passes it to the prompt.
base_chain = (
    {
        "history": lambda x: trimmer.invoke(x["history"]),
        "question": lambda x: x["question"]
    }
    | prompt
    | model
    | StrOutputParser()
)

# 5. Define the Database Loading Connection
def get_session_history(session_id: str):
    return SQLChatMessageHistory(
        session_id=session_id,
        connection="sqlite:///chat_history.db"
    )

# 6. Wrap base_chain in RunnableWithMessageHistory
# This dynamically maps 'question' to the input and 'history' to the chat history variable.
chain_with_history = RunnableWithMessageHistory(
    base_chain,
    get_session_history,
    input_messages_key="question",
    history_messages_key="history"
)

# 7. Chat loop
# Synchronous on purpose: the history object above is created in its default
# (sync) mode, so we call invoke() rather than ainvoke(); input() blocks anyway.
def main():
    session_id = "user_krishna_session_1"
    config = {"configurable": {"session_id": session_id}}
    print("========================================")
    print(f"Chat Started (Session Database ID: '{session_id}')")
    print("Type 'exit' to quit.")
    print("========================================")
    while True:
        try:
            user_input = input("\nYou: ")
            if user_input.strip().lower() in ["exit", "quit"]:
                print("Goodbye!")
                break
            if not user_input.strip():
                continue
            print("AI is thinking...")
            # invoke handles retrieving history, running trimmer, running model, and saving history
            response = chain_with_history.invoke(
                {"question": user_input},
                config=config
            )
            print(f"AI: {response}")
        except KeyboardInterrupt:
            print("\nGoodbye!")
            break
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    main()
Why this structure is clean:
- Separation of Concerns: The DB logic is completely isolated from prompt formatting.
- Robust Persistence: You can stop the script, restart it, use the same session_id, and the program will automatically reload the chat logs from chat_history.db.
- Token Safety: The message trimmer runs right before the prompt is formatted, so the history injected into the prompt never exceeds the configured token budget. The system message and the new question are added on top of that budget, so leave some headroom below the model's context limit.
7.5 Summary
You now understand:
- How LLM APIs are stateless and require historical message logs to simulate memory.
- How to map conversational lists using MessagesPlaceholder.
- How RunnableWithMessageHistory manages DB reading and writing automatically.
- How to implement trim_messages to manage tokens and avoid context limit crashes.
In the next chapter, we will build a complete RAG (Retrieval-Augmented Generation) system from scratch using LCEL, learning how to ingest files, embed them, store them in a vector DB, and retrieve them during model execution.