Chapter 3: Advanced Text Splitters & Semantic Chunking
In Chapter 2, we successfully loaded a 500-page PDF into our system. However, an LLM's context window is finite, and embedding models have strict token limits (often 512 or 8192 tokens max). Furthermore, injecting a massive, multi-topic document into a prompt dilutes the LLM's attention, causing it to hallucinate or miss critical facts.
We must chunk the document into smaller, semantically cohesive pieces.
Chunking is arguably the most critical and under-appreciated step in the RAG pipeline. Bad chunking breaks sentences in half, separates pronouns from their nouns, and destroys the context required for vector similarity search to work. In this chapter, we will move beyond basic splits and master structural and semantic chunking.
3.1 The Baseline: Recursive Character Splitting
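The most common mistake beginners make is naive splitting, whether by a fixed character count or with a basic CharacterTextSplitter. If you cut exactly every 1,000 characters, you will inevitably slice a word or a sentence in half, destroying the meaning of the text at that boundary. CharacterTextSplitter avoids mid-word cuts by splitting on a single separator, but it has no fallback when a piece between separators is still too large.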
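To make the failure concrete, here is a minimal sketch of fixed-width splitting. The helper `fixed_width_split` is a hypothetical name written for this illustration and is not part of LangChain.

```python
def fixed_width_split(text: str, width: int) -> list[str]:
    """Cut text every `width` characters, ignoring word and sentence boundaries."""
    return [text[i:i + width] for i in range(0, len(text), width)]


text = "Vector search needs coherent chunks."
pieces = fixed_width_split(text, 10)

for piece in pieces:
    print(repr(piece))
# 'Vector sea'
# 'rch needs '
# 'coherent c'
# 'hunks.'
```

Three of the four pieces start or end in the middle of a word, so none of them would embed as the idea the sentence expresses.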
The industry baseline is the RecursiveCharacterTextSplitter. It attempts to split text hierarchically using a list of separators (by default: ["\n\n", "\n", " ", ""]).
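It tries to split by paragraphs first (\n\n). If a paragraph is still too large, it falls back to splitting by lines (\n), then by words (the space character), and finally by individual characters (the empty string), so chunks stay as readable as possible. Note that the default list has no sentence-level separator.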
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Rule of thumb: chunk sizes of 500-1000 characters work well for dense text.
# Overlap is strongly recommended: it ensures that a concept split across two
# chunks keeps some connecting context in both.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,        # maximum chunk length, measured by length_function
    chunk_overlap=150,     # characters shared between neighboring chunks
    length_function=len,   # count characters, not tokens
    is_separator_regex=False,
)

# massive_document_list is assumed to be the list of Document objects
# produced by the loader in Chapter 2.
chunks = splitter.split_documents(massive_document_list)
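The fallback logic is easier to follow in a stripped-down form. The sketch below is a simplified illustration of the recursive idea, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, and overlap handling is deliberately left out.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest separator that still fits."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
result = recursive_split(sample, chunk_size=20)
print(result)
# ['Alpha beta gamma.', 'Delta epsilon zeta', 'eta theta.', 'Iota.']
```

The first and last paragraphs fit as they are. Only the oversized middle paragraph is split further, and that split falls on a word boundary.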
3.2 Structural Chunking: Preserving Document Hierarchy
If you are parsing Markdown files (like this book!) or HTML, relying on character counts is dangerous. You want chunks that respect the document's logical structure (e.g., keeping everything under "Header 1" grouped together).
The MarkdownHeaderTextSplitter extracts headers and injects them into the metadata of the resulting chunks. This allows the vector database to know exactly which section a chunk originated from.
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_document = """
# RAG Guidelines
## Chunking
Chunking is very important. You must overlap chunks.
## Databases
Always use metadata filtering when possible.
"""
# Define which headers we care about and what to name the metadata keys
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# This will result in chunks that inherently 'know' they belong to 'Chunking' or 'Databases'
structural_chunks = markdown_splitter.split_text(markdown_document)
print(structural_chunks[0].metadata)
# Output: {'Header 1': 'RAG Guidelines', 'Header 2': 'Chunking'}
By retaining the structural hierarchy in metadata, you enable powerful downstream filters (e.g., "Only search chunks where Header 1 == 'RAG Guidelines'").
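As a rough illustration of such a filter, the sketch below applies an exact-match metadata filter to plain dictionaries. The chunk data (including the third "Setup" entry) and the helper `filter_by_metadata` are hypothetical; a real vector database exposes its own filter syntax.

```python
# Plain dictionaries standing in for chunks with header metadata.
chunks = [
    {"text": "Chunking is very important. You must overlap chunks.",
     "metadata": {"Header 1": "RAG Guidelines", "Header 2": "Chunking"}},
    {"text": "Always use metadata filtering when possible.",
     "metadata": {"Header 1": "RAG Guidelines", "Header 2": "Databases"}},
    {"text": "Install the package with pip.",
     "metadata": {"Header 1": "Setup", "Header 2": "Installation"}},
]


def filter_by_metadata(chunks: list[dict], required: dict[str, str]) -> list[dict]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]


guideline_chunks = filter_by_metadata(chunks, {"Header 1": "RAG Guidelines"})
database_chunks = filter_by_metadata(chunks, {"Header 2": "Databases"})
print(len(guideline_chunks), len(database_chunks))
# 2 1
```

Filtering before similarity search narrows the candidate set, so the vector comparison only runs over chunks from the relevant section.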
3.3 The Frontier: Semantic Chunking
Structural and recursive splitting rely on hardcoded separators (\n or #). But what if the text is a transcript of an hour-long podcast with no paragraphs or headers?
Semantic Chunking solves this. Instead of counting characters, it uses a fast embedding model to embed every sentence and then compares each sentence with its neighbor. Consecutive sentences are grouped together until the splitter detects a sudden shift in meaning (a high cosine distance between adjacent sentences), at which point it establishes a "break point" and starts a new chunk.
This yields chunks that follow complete, cohesive thoughts rather than a fixed length, so chunk sizes can vary widely.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
# Note: Semantic chunking requires an embedding model to compute sentence similarity.
embedder = OpenAIEmbeddings()
# 'percentile' splitting sets a threshold: if the distance between two adjacent
# sentences exceeds the 95th percentile (the default) of all such distances in
# the document, the splitter starts a new chunk there.
semantic_splitter = SemanticChunker(
    embedder, breakpoint_threshold_type="percentile"
)
semantic_chunks = semantic_splitter.create_documents([long_unformatted_text])
Note: Semantic chunking is computationally expensive compared to recursive splitting because it requires an extra embedding pass over every sentence during ingestion, before the final chunks are embedded for storage.
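The break-point idea can be sketched without any embedding service. The example below is an illustration under stated assumptions, not the SemanticChunker implementation: the two-dimensional vectors are hand-made stand-ins for real sentence embeddings, the function names are hypothetical, and a fixed threshold replaces the percentile-derived one.

```python
import math


def cosine_distance(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def semantic_chunks(sentences: list[str], vectors: list[tuple[float, ...]],
                    threshold: float) -> list[str]:
    """Start a new chunk wherever adjacent sentences are farther apart than threshold."""
    if not sentences:
        return []
    chunks: list[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))  # break point: close the chunk
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run matrix math.", "GPUs need cooling."]
# Toy "embeddings": the first axis loosely means "animals", the second "hardware".
vectors = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]

topic_chunks = semantic_chunks(sentences, vectors, threshold=0.5)
print(topic_chunks)
# ['Cats purr. Cats nap a lot.', 'GPUs run matrix math. GPUs need cooling.']
```

Only the jump from the second to the third sentence exceeds the threshold, so the text splits exactly where the topic changes.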
3.4 Small-to-Big Retrieval (Parent Document Pattern)
There is a fundamental tension in RAG:
- For Retrieval (Vector Search): Smaller chunks are better. Each one represents a single, highly specific concept, so its embedding is not diluted by unrelated text and similarity search becomes more precise.
- For Generation (The LLM): Larger chunks are better. The LLM needs broad, surrounding context to synthesize a comprehensive answer.
The advanced solution is the Parent Document Retriever pattern.
- We split our document into massive "Parent Chunks" (e.g., 2000 characters).
- We split those Parent Chunks into highly specific "Child Chunks" (e.g., 200 characters).
- We embed and store only the Child Chunks in the vector database.
- We store the Parent Chunks in a standard Key-Value document store (like Redis or a local dictionary).
- The Magic: When the user searches, the vector database finds the highly relevant Child Chunk. But instead of sending the child to the LLM, it looks up the child's parent_id, fetches the massive Parent Chunk, and sends that to the LLM.
This gives you the precision of small-chunk retrieval with the massive context of large-chunk generation.
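The steps above can be sketched in plain Python. Everything here is a stand-in chosen for illustration: a dictionary plays the key-value document store, word overlap plays vector similarity, and the names `parents`, `children`, and `retrieve_parent` are hypothetical.

```python
import re

# Parent chunks live in a key-value store (a plain dict in this sketch).
parents = {
    "p1": "Chunk overlap keeps context across boundaries. "
          "A common starting point is 150 characters. Tune it per corpus.",
    "p2": "Redis can act as a key-value document store. "
          "It holds the large parent chunks. Lookups by id are fast.",
}

# Child chunks are single sentences tagged with their parent_id.
# In a real system only these would be embedded and indexed.
children = [
    (parent_id, sentence.strip())
    for parent_id, text in parents.items()
    for sentence in text.split(". ")
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_parent(query: str) -> str:
    """Find the best-matching child, then return its parent chunk instead."""
    query_words = tokenize(query)
    best_parent_id, _ = max(children, key=lambda c: len(query_words & tokenize(c[1])))
    return parents[best_parent_id]


context = retrieve_parent("Where are parent chunks stored?")
print(context)
```

The query matches the short child sentence "It holds the large parent chunks", but the LLM would receive the whole p2 parent, including the surrounding sentences about Redis.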
3.5 What's Next?
We have successfully sliced our raw data into semantically meaningful, structured chunks.
The next step is to translate these human-readable text chunks into numbers that a machine can search. In Chapter 4, we dive into the mathematics and practical application of Embeddings and Dimensionality.