Chapter 7: Advanced Retrieval Algorithms — Complete RAG In Python

Naive RAG assumes the user's initial query is perfectly formatted for a vector search. In reality, human users are terrible at writing search queries. They ask vague questions, use strange acronyms, or combine three questions into one sentence.

If the initial vector lookup fails to find the right documents, the LLM has zero chance of generating the correct answer.

To build an advanced RAG system, we must implement Query Transformation and Intelligent Routing. This chapter covers three widely used techniques that can substantially improve retrieval recall: Multi-Query Retrieval, HyDE, and the Self-Querying Retriever.

7.1 Multi-Query Retrieval

If a user asks, "How does the auth system handle expired tokens?", a single dense vector search might miss documents that use the terminology "session timeout authentication".

The Multi-Query Retriever uses an LLM to automatically generate 3-5 different variations of the user's query before searching the database. It then executes a vector search for every variation, pools all the retrieved documents together, removes duplicates, and passes the combined set of documents to the final LLM as context.
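Before looking at the library version, the pool-and-deduplicate logic described above can be sketched in plain Python. This is a minimal illustration, not how any particular library implements it: `fake_variations`, `fake_search`, and `FAKE_INDEX` are hypothetical stand-ins for the LLM call and the vector database, and including the original question alongside the variations is a choice made for this sketch.

```python
from typing import Callable, Dict, List


def multi_query_retrieve(
    question: str,
    generate_variations: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Search with the question and each variation, then pool the results,
    keeping only the first occurrence of every document."""
    queries = [question] + generate_variations(question)
    seen = set()
    pooled: List[str] = []
    for query in queries:
        for doc in search(query):
            if doc not in seen:  # skip documents already returned by another query
                seen.add(doc)
                pooled.append(doc)
    return pooled


# Hypothetical toy index so the sketch runs without an LLM or a database.
FAKE_INDEX: Dict[str, List[str]] = {
    "How does the auth system handle expired tokens?": ["doc_tokens"],
    "What is the token expiration mechanism in the authentication system?": [
        "doc_tokens",
        "doc_expiry",
    ],
    "Authentication session timeout handling procedures": [
        "doc_session_timeout",
        "doc_expiry",
    ],
}


def fake_variations(question: str) -> List[str]:
    # Stand-in for the LLM that rewrites the question.
    return [
        "What is the token expiration mechanism in the authentication system?",
        "Authentication session timeout handling procedures",
    ]


def fake_search(query: str) -> List[str]:
    # Stand-in for a vector search; returns document IDs.
    return FAKE_INDEX.get(query, [])


pooled_docs = multi_query_retrieve(
    "How does the auth system handle expired tokens?", fake_variations, fake_search
)
print(pooled_docs)
```

In this toy run the original question alone would only find `doc_tokens`; the variations add the session-timeout and expiry documents, and the overlap between queries is removed.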

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# Wrap your standard retriever in the MultiQueryRetriever.
# vector_db is assumed to be an existing, already-populated vector store.
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vector_db.as_retriever(),
    llm=llm
)

# Under the hood, this will query the LLM to generate variations like:
# 1. "What is the token expiration mechanism in the authentication system?"
# 2. "Authentication session timeout handling procedures"
# Then it fetches documents for all variations!
unique_docs = retriever_from_llm.invoke("How does the auth system handle expired tokens?")

Note: This can significantly increase recall (finding the needle), but it adds latency and cost: every request now makes an extra LLM call to generate the variations, followed by one vector search per variation.

7.2 HyDE: Hypothetical Document Embeddings

Semantic search matches the meaning of a query to the meaning of a document. But a short question ("What is X?") is fundamentally different in structure from a long informational paragraph ("X is a framework that...").

HyDE (Hypothetical Document Embeddings) is a counter-intuitive algorithm that works in three steps:

1. It takes the user's short query.
2. It asks an LLM to hallucinate a detailed answer to the query without looking at any real data.
3. It runs that hallucinated, fake document through the embedding model and uses the resulting vector, instead of the query's own vector, to search the database.

Because the hallucinated document resembles a real informative document in length, structure, and vocabulary, its embedding tends to land close to the actual, factual documents in the system, even when specific details in the hallucinated text are wrong.
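The three steps can be sketched without any external service. This is a toy illustration under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `fake_llm` returns a hard-coded hypothetical answer instead of calling an LLM, and `CORPUS` is invented for the example. A word-count embedding exaggerates the effect, so treat the result as a picture of the idea rather than a benchmark.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query_text: str, corpus: List[str], k: int = 1) -> List[str]:
    """Rank the corpus by cosine similarity to the embedded query text."""
    query_vec = embed(query_text)
    ranked = sorted(corpus, key=lambda doc: cosine(query_vec, embed(doc)), reverse=True)
    return ranked[:k]


def hyde_search(
    question: str,
    write_hypothetical_answer: Callable[[str], str],
    corpus: List[str],
    k: int = 1,
) -> List[str]:
    """Step 2: draft a hypothetical answer. Step 3: search with it instead of the question."""
    hypothetical_doc = write_hypothetical_answer(question)
    return search(hypothetical_doc, corpus, k)


def fake_llm(question: str) -> str:
    # Hypothetical stand-in for the LLM; the text does not need to be factually correct.
    return "Aspirin can cause stomach irritation, bleeding, and nausea in some patients."


CORPUS = [
    "Common adverse reactions to aspirin include stomach bleeding and nausea.",
    "What are the opening hours of the pharmacy?",
    "Ibuprofen dosage guidelines for adults.",
]

question = "What are the side effects of aspirin?"
naive_top = search(question, CORPUS)
hyde_top = hyde_search(question, fake_llm, CORPUS)
print("naive:", naive_top)
print("HyDE: ", hyde_top)
```

With this toy data, the raw question is most similar to the other question-shaped text (the pharmacy opening hours), while the hypothetical answer is most similar to the document that actually describes aspirin's adverse reactions.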

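from langchain.chains import HypotheticalDocumentEmbedder, LLMChain
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_openai import OpenAIEmbeddings

# The base embedding model; `llm` is the ChatOpenAI instance from section 7.1
embeddings = OpenAIEmbeddings()

# 1. Prompt to hallucinate a document
prompt = PromptTemplate(
    input_variables=["question"],
    template="Please write a scientific paper answering the following question: {question}"
)
llm_chain = LLMChain(llm=llm, prompt=prompt)

# 2. Create the HyDE embedder
# Only queries are expanded into a hypothetical document;
# stored documents are still embedded directly by base_embeddings.
hyde_embeddings = HypotheticalDocumentEmbedder(
    llm_chain=llm_chain,
    base_embeddings=embeddings
)

# 3. Use these special embeddings to search the database!
# (add or load your documents into this store before searching)
vector_db = Chroma(embedding_function=hyde_embeddings)
results = vector_db.similarity_search("What are the side effects of aspirin?")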

7.3 The Self-Querying Retriever

In Chapter 5, we learned that we must use metadata filtering for structured attributes (like dates, tenant IDs, or authors). Dense vector similarity cannot express comparison operators like year > 2022.

But what if the user types the metadata naturally in their question? "Show me the engineering reports written by Sarah in 2023 about database migrations."

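A naive vector search will look for the semantic meaning of the word "2023", which carries very little signal and cannot guarantee that the returned reports were actually written in 2023 or by Sarah.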

A Self-Querying Retriever solves this. It uses an LLM to parse the natural language query and separate it into two parts:

The Semantic Query: "database migrations"
The Metadata Filter: {"author": "Sarah", "year": {"$eq": 2023}}

It then passes both parts into the vector database simultaneously.
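To make the two-part split concrete, here is a minimal sketch of what happens after the LLM has produced the semantic query and the filter. Everything in it is an assumption made for illustration: the `Doc` class, the small operator table, the keyword-overlap ranking (standing in for vector similarity), and the sample documents. The parsing step itself is hard-coded rather than performed by an LLM.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List

# Comparison operators the toy filter understands (a hypothetical subset).
OPS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


@dataclass
class Doc:
    text: str
    metadata: Dict[str, Any]


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if the metadata satisfies every condition in the filter."""
    for field, condition in flt.items():
        if field not in metadata:
            return False
        value = metadata[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 2022}
            if not all(OPS[op](value, operand) for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain value means equality
            return False
    return True


def self_query_search(semantic_query: str, flt: Dict[str, Any], docs: List[Doc]) -> List[Doc]:
    """Apply the metadata filter exactly, then rank survivors by keyword overlap
    (a stand-in for vector similarity)."""
    candidates = [d for d in docs if matches(d.metadata, flt)]
    terms = set(semantic_query.lower().split())
    return sorted(
        candidates,
        key=lambda d: len(terms & set(d.text.lower().split())),
        reverse=True,
    )


DOCS = [
    Doc("Plan for database migrations to the new cluster", {"author": "Sarah", "year": 2023}),
    Doc("Database migrations retrospective", {"author": "Sarah", "year": 2021}),
    Doc("Database migrations checklist", {"author": "Tom", "year": 2023}),
    Doc("Office seating plan", {"author": "Sarah", "year": 2023}),
]

# Hard-coded result of the parsing step (an LLM would produce this in practice).
semantic_query = "database migrations"
metadata_filter = {"author": "Sarah", "year": {"$eq": 2023}}

results = self_query_search(semantic_query, metadata_filter, DOCS)
for doc in results:
    print(doc.text, doc.metadata)
```

The filter removes the 2021 report and Tom's report outright, no matter how similar their text is, and only then does similarity decide the order of what remains.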

# Note: the self-query retriever needs the `lark` package installed (pip install lark)
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# You must explicitly teach the LLM what metadata exists in your database!
metadata_field_info = [
    AttributeInfo(
        name="author",
        description="The name of the author of the report",
        type="string",
    ),
    AttributeInfo(
        name="year",
        description="The year the report was published",
        type="integer",
    ),
]

document_content_description = "Engineering and technical reports"

retriever = SelfQueryRetriever.from_llm(
    llm,
    vector_db,
    document_content_description,
    metadata_field_info,
    verbose=True
)

# The LLM automatically extracts the filters and executes a highly precise search!
docs = retriever.invoke("Show me engineering reports written by Sarah in 2023 about database migrations.")

7.4 What's Next?

With Multi-Query, HyDE, and Self-Querying, we have substantially improved our recall. We are now fetching a large number of potentially relevant documents.

But we have introduced a new problem: we are pulling too many documents. If we stuff 20 documents into the final LLM prompt, the model may overlook information buried in the middle of the context (the "Lost in the Middle" effect), and we may exceed the context window.
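A rough back-of-the-envelope check shows how quickly the budget runs out. The numbers below are assumptions chosen for illustration (an 8,000-token window, 1,000 tokens reserved for the question and answer, 2,000-character documents), and the four-characters-per-token ratio is only a crude heuristic for English text, not a real tokenizer.

```python
import math
from typing import List, Tuple


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; a real tokenizer will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def docs_that_fit(
    docs: List[str], context_window: int, reserved_tokens: int
) -> Tuple[List[str], int]:
    """Keep documents in order until the estimated token budget is exhausted."""
    budget = context_window - reserved_tokens
    kept: List[str] = []
    used = 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept, used


# Hypothetical workload: 20 retrieved documents of 2,000 characters each.
retrieved = ["x" * 2000 for _ in range(20)]
kept, used = docs_that_fit(retrieved, context_window=8000, reserved_tokens=1000)
print(f"{len(kept)} of {len(retrieved)} documents fit, using about {used} tokens")
```

Under these assumptions only 14 of the 20 documents fit, which is the gap that re-ranking and compression are meant to close.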

In Chapter 8, we will explore Re-ranking and Context Compression. We will learn how to take our large pool of retrieved documents, score them, and aggressively compress them down to only the most critical sentences before generating the final answer.