Chapter 8: Re-ranking and Context Compression — Complete RAG In Python

Advanced retrieval systems (using Multi-Query or Hybrid Search from Chapter 7) are excellent at maximizing recall—they successfully find the needle in the haystack. However, to achieve this, they often return 20, 30, or 50 document chunks.

If you paste 50 chunks of text into an LLM's prompt, two terrible things happen:

Massive Latency and Cost: You are paying for tens of thousands of input tokens on every request, and waiting while the model processes them.
The "Lost in the Middle" Phenomenon: Liu et al. (2023), a team that included Stanford researchers, showed that LLM accuracy tends to follow a U-shaped curve over the position of the relevant passage. Models use information best when it sits near the beginning or the end of a prompt. If the answer to the user's question is buried in document #14 (right in the middle of the massive prompt), the LLM is far more likely to overlook it and produce a wrong or hallucinated answer.

We must aggressively filter and compress the retrieved documents before sending them to the LLM.
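To make the cost argument concrete, here is a rough back-of-the-envelope sketch. Every number in it is an assumption for illustration: the 4-characters-per-token rule of thumb, the 1,000-character chunk size, and especially the price, which is a made-up placeholder rather than any provider's real rate.

```python
# Rough prompt-size estimate. Every constant here is an illustrative assumption.
CHARS_PER_TOKEN = 4                # common rule of thumb for English text
PRICE_PER_1M_INPUT_TOKENS = 1.00   # hypothetical USD price; check your provider


def estimate_prompt(num_chunks: int, chars_per_chunk: int = 1000) -> tuple[int, float]:
    """Return (approximate input tokens, approximate USD cost) for one request."""
    tokens = num_chunks * chars_per_chunk // CHARS_PER_TOKEN
    cost = tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    return tokens, cost


# Compare sending all 50 retrieved chunks against sending only the best 5.
for k in (50, 5):
    tokens, cost = estimate_prompt(k)
    print(f"{k:>2} chunks -> ~{tokens:,} tokens, ~${cost:.4f} per request")
```

Under these assumptions, cutting 50 chunks down to 5 shrinks the retrieved context by a factor of ten on every single request, before counting the rest of the prompt.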

8.1 Bi-Encoders vs. Cross-Encoders

To understand Re-ranking, you must understand how embedding models work.

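Standard vector search uses Bi-Encoders (like OpenAI Embeddings). They map the user query to a vector, map each document to a vector independently, and compare the two by distance. Because the document vectors are computed ahead of time and searched through a vector index, this is fast enough to search 10 million documents in milliseconds. The trade-off is that each text is squeezed into a single vector without ever seeing the other, so fine-grained interactions between the query's wording and the document's wording are lost.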
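The bi-encoder pattern can be sketched in a few lines. The `toy_encode` function below is a hypothetical stand-in that uses word counts instead of a neural embedding; only the shape of the computation (encode each text separately, then compare vectors) reflects how a real bi-encoder is used.

```python
import math
from collections import Counter


def toy_encode(text: str) -> Counter:
    """Stand-in for a bi-encoder: maps ONE text to a vector (here, word counts).

    A real bi-encoder would return a dense embedding from a neural network.
    """
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


docs = [
    "refunds are issued within 30 days of purchase",
    "our office is closed on public holidays",
]

# Document vectors are computed once, offline, without seeing any query.
doc_vectors = [toy_encode(d) for d in docs]

# At query time only the query is encoded; scoring is a cheap vector comparison.
query_vector = toy_encode("when are refunds issued")
scores = [cosine(query_vector, dv) for dv in doc_vectors]
best = docs[scores.index(max(scores))]
print(best)
```

Note that the query and the documents never meet inside the encoder. That independence is what makes precomputation possible, and it is also why subtle query-document interactions get lost.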

Re-ranking uses Cross-Encoders. A Cross-Encoder takes the user's query and the document simultaneously and processes them together through the neural network. This allows the model to see exactly how the words in the query interact with the words in the document.

Drawback: Cross-Encoders are slow. Nothing can be precomputed, because every query-document pair needs its own full pass through the network. You cannot run a Cross-Encoder over 10 million documents per query.
The Solution: We use the fast Bi-Encoder to retrieve the top 30 candidates. Then, we use the slow, highly accurate Cross-Encoder to Re-rank those 30 candidates, pushing the best matches to the very top. We then discard the bottom 25 and only send the top 5 to the LLM.
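The two-stage idea can be sketched independently of any library. In the sketch below, `retrieve_then_rerank`, `word_overlap`, and `bigram_overlap` are hypothetical names invented for illustration. The two scorers are toy stand-ins: word overlap plays the role of the cheap bi-encoder, and bigram overlap plays the role of the expensive cross-encoder that looks at query and document together.

```python
from typing import Callable, List

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    fast_score: Scorer,
    slow_score: Scorer,
    k_candidates: int = 30,
    top_n: int = 5,
) -> List[str]:
    """Two-stage retrieval: cheap scorer on everything, costly scorer on a shortlist."""
    # Stage 1: the fast scorer ranks the whole corpus; keep only k_candidates.
    candidates = sorted(corpus, key=lambda d: fast_score(query, d), reverse=True)[:k_candidates]
    # Stage 2: the slow scorer runs only on the shortlist; keep the best top_n.
    reranked = sorted(candidates, key=lambda d: slow_score(query, d), reverse=True)
    return reranked[:top_n]


def word_overlap(query: str, doc: str) -> float:
    # Toy "bi-encoder": ignores word order entirely.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def bigram_overlap(query: str, doc: str) -> float:
    # Toy "cross-encoder": compares adjacent word pairs, so word order matters.
    def bigrams(text: str) -> set:
        words = text.lower().split()
        return set(zip(words, words[1:]))

    return float(len(bigrams(query) & bigrams(doc)))


corpus = [
    "man bites dog",
    "dog bites man in the park",
    "the weather is nice today",
]
top = retrieve_then_rerank(
    "dog bites man", corpus, word_overlap, bigram_overlap, k_candidates=2, top_n=1
)
print(top)
```

The fast scorer cannot tell the first two documents apart, since both contain all three query words. The slower, order-aware scorer separates them, but it only ever runs on the two shortlisted candidates.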

8.2 Implementing a Re-ranker (Cohere)

A widely used hosted option for re-ranking is Cohere's Rerank API. LangChain provides a built-in wrapper called ContextualCompressionRetriever, which pairs a base retriever with a document compressor (here, the re-ranker).

from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# 1. Initialize the Base Retriever (Fast Bi-Encoder)
# vector_db is assumed to be the vector store built in the earlier chapters.
# We fetch 20 documents because we know we will filter them down later.
base_retriever = vector_db.as_retriever(search_kwargs={"k": 20})

# 2. Initialize the Re-ranker (Slow, Accurate Cross-Encoder)
# We tell it to keep only the best 3 documents.
# The API key is read from the COHERE_API_KEY environment variable.
compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)

# 3. Combine them into a Compression Retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, 
    base_retriever=base_retriever
)

# When invoked, it retrieves 20 docs, scores each one against the query
# with the cross-encoder, sorts them, and returns the top 3.
compressed_docs = compression_retriever.invoke("What is the refund policy?")

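Note: If you are building a fully offline, local pipeline, you can run an open-source cross-encoder such as BAAI/bge-reranker-large locally. One way is to wrap HuggingFaceCrossEncoder (from the langchain-community package) in LangChain's CrossEncoderReranker and pass that as the base_compressor instead of CohereRerank.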

8.3 Document Contextual Compression

Re-ranking sorts whole chunks. But what if a chunk is 1,000 characters long, and only a single sentence inside that chunk is relevant to the user's query?

We can use an LLM-based Extractor to read each retrieved chunk and keep only the passages that are relevant to the query, dropping everything else before the final text reaches the main generation prompt.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# The extractor uses an LLM to read each document and extract only the relevant parts
extractor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=extractor, 
    base_retriever=base_retriever
)

# The returned documents are shorter: only the extracted, query-relevant text remains.
tight_docs = compression_retriever.invoke("What is the refund policy?")
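To show what extraction does to a chunk, here is a minimal sketch with no LLM involved. The names `compress_chunk` and `keyword_judge` are hypothetical, and the keyword check is a crude stand-in for the relevance judgment that the LLM makes in the real extractor. It is not how LLMChainExtractor is implemented.

```python
import re
from typing import Callable, Optional


def compress_chunk(
    query: str, chunk: str, is_relevant: Callable[[str, str], bool]
) -> Optional[str]:
    """Keep only the sentences judged relevant; return None if nothing survives."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if is_relevant(query, s)]
    return " ".join(kept) if kept else None


def keyword_judge(query: str, sentence: str) -> bool:
    # Toy stand-in for the LLM call: relevant if any longer query word appears.
    keywords = {w for w in re.findall(r"\w+", query.lower()) if len(w) > 4}
    return any(w in sentence.lower() for w in keywords)


chunks = [
    "Our store opened in 1998. Refunds are available within 30 days. We love our customers!",
    "Parking is free on weekends.",
]
query = "What is the refund policy?"

# Compress each chunk, and drop chunks where no sentence was judged relevant.
compressed = [
    text
    for text in (compress_chunk(query, chunk, keyword_judge) for chunk in chunks)
    if text is not None
]
print(compressed)
```

In this sketch the first chunk shrinks to its one relevant sentence and the second chunk disappears entirely, which is the effect contextual compression aims for.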

Warning: Contextual Compression can deliver very high precision, but it requires an LLM call for every single retrieved document. If you retrieve 10 documents, you are making 10 separate LLM calls just to compress them, which adds significant latency and cost before generation even starts. Reserve it for high-stakes environments where accuracy is prioritized over speed.
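The overhead is easy to estimate. The sketch below is plain arithmetic under stated assumptions: `extra_latency_seconds` is a hypothetical helper, the 1.5 seconds per call is an invented figure, and the `concurrency` parameter models an idealized case that ignores rate limits and network overhead.

```python
def extra_latency_seconds(num_docs: int, seconds_per_call: float, concurrency: int = 1) -> float:
    """Added latency from one compression call per retrieved document.

    concurrency=1 models calls issued one after another; higher values model
    issuing that many calls at once (ignoring rate limits and overhead).
    """
    batches = -(-num_docs // concurrency)  # ceiling division
    return batches * seconds_per_call


# Assumed figure: 1.5 seconds per compression call.
print(extra_latency_seconds(10, 1.5))                  # calls issued one by one
print(extra_latency_seconds(10, 1.5, concurrency=10))  # all calls issued at once
```

Even in the idealized concurrent case, the user waits for at least one extra LLM round trip, and the token cost of all 10 calls is paid either way.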

8.4 What's Next?

Our retrieval pipeline is now far more robust. We expand the query (Chapter 7), search the database, use Cross-Encoders to re-rank the results, and can optionally compress the surviving chunks.

But RAG is rarely a single question-and-answer interaction. Users want to chat. They want to ask follow-up questions like "Does that policy apply to me?", which relies on conversational memory.

In Chapter 9, we will explore Conversational RAG & State Management, wiring our advanced pipeline into an asynchronous memory backend.