Chapter 2: Getting the Data (Document Loaders)

Think of a Document Loader as the extraction engine of your RAG pipeline.

Before an AI can answer questions about your data, you must extract that data from its native habitat. Data is rarely clean. It lives in nested folders of PDFs, large CSV exports, password-protected web pages, and enterprise APIs.

A Document Loader is responsible for connecting to these sources, extracting the raw information, and packaging it into a standardized format. In this chapter, we will explore advanced ingestion methods, from memory-efficient lazy loading to crawling entire websites.

2.1 The Core Abstraction: The `Document` Object

In the LangChain ecosystem, every loader outputs a standardized object called a Document. Regardless of whether the source was a SQL table row or a massive PDF, it becomes a Document.

A Document object consists of two critical properties:

page_content (String): The actual raw text that the LLM will eventually read.
metadata (Dictionary): Key-value pairs describing the text. This includes the source file path, page numbers, authors, or timestamps.

[!IMPORTANT]
Metadata is just as important as the text. If a user asks "What were our Q3 earnings in 2023?", a metadata filter lets the retriever exclude documents from 2022. Without rich metadata, retrieval has only text similarity to go on, and near-identical passages from the wrong year will crowd the results.
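
To make the two properties and the role of metadata concrete, here is a minimal sketch. The Document class below is a stand-in dataclass with the same two fields as LangChain's Document, so the snippet runs without LangChain installed. The file names, metadata keys (year, quarter), and the filter_by_metadata helper are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the same two fields as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Two nearly identical passages; only the metadata tells them apart.
docs = [
    Document("Q3 revenue grew 12%.",
             {"source": "earnings_2023_q3.pdf", "year": 2023, "quarter": "Q3"}),
    Document("Q3 revenue grew 4%.",
             {"source": "earnings_2022_q3.pdf", "year": 2022, "quarter": "Q3"}),
]


def filter_by_metadata(documents, **criteria):
    """Keep only documents whose metadata matches every key/value given."""
    return [
        d for d in documents
        if all(d.metadata.get(key) == value for key, value in criteria.items())
    ]


matches = filter_by_metadata(docs, year=2023, quarter="Q3")
print([d.metadata["source"] for d in matches])
```

A real vector store applies this kind of filter on its side of the query; the point here is only that the filter needs the metadata to exist in the first place.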

2.2 Memory Management: `.load()` vs `.lazy_load()`

When dealing with massive datasets (e.g., thousands of PDFs or gigabytes of JSON), calling loader.load() can exhaust your server's memory, because it builds a list of every document before returning.

The alternative is .lazy_load(). It returns a Python generator that yields one document at a time, so you can process, embed, and store documents in batches. How much memory this saves depends on the loader: one that yields a Document per row or per page keeps memory use small, while one that yields a single Document per file still has to hold that whole file in memory.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./massive_server_logs.txt")

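# Lazy loading: nothing is read until the generator is iterated.
document_generator = loader.lazy_load()

# Note: TextLoader yields a single Document per file, so the whole file is
# read on the first iteration. Loaders that emit one Document per row or page
# (CSVLoader, PyPDFLoader) are where lazy loading keeps memory use low.
for doc in document_generator:
    print(f"Processing document from: {doc.metadata['source']}")
    # Send this 'doc' to your text splitter or database,
    # then let it go out of scope so it can be garbage-collected.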
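
Processing in batches, as described above, can be sketched with a small helper that groups any generator into fixed-size lists. The batched function and the fake_lazy_load generator are hypothetical names for this illustration; in practice you would pass loader.lazy_load() in place of the fake generator.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size items without materializing the input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fake_lazy_load(count: int) -> Iterator[str]:
    """Stand-in for loader.lazy_load(); yields placeholder documents."""
    for i in range(count):
        yield f"doc-{i}"


for batch in batched(fake_lazy_load(5), batch_size=2):
    # Embed and store the batch here, then let it be garbage-collected.
    print(len(batch), batch)
```

Only one batch is held in memory at a time, which is what makes the generator approach worthwhile.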

2.3 Advanced Local Loaders: Structured and Unstructured Data

Parsing Structured Data (CSVs)

Loading a CSV is not as simple as reading it as text. If you read a 10,000-row CSV as a single string, the LLM will lose the row-by-row context.

The CSVLoader treats every single row as a separate Document.

from langchain_community.document_loaders.csv_loader import CSVLoader

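# source_column sets the 'source' metadata value for each row.
# content_columns picks the columns that become page_content, and
# metadata_columns copies the listed columns into metadata
# (both are available in recent langchain_community releases).
loader = CSVLoader(
    file_path='./customer_feedback.csv',
    source_column="Ticket_ID",
    content_columns=["Feedback"],
    metadata_columns=["Customer_Name", "Date"],
    csv_args={
        'delimiter': ',',
        # Only pass 'fieldnames' when the file has no header row;
        # otherwise the header line is loaded as a data row.
        'fieldnames': ['Ticket_ID', 'Customer_Name', 'Feedback', 'Date']
    }
)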

docs = loader.load()
# Each row is now its own Document object, perfectly isolated for search.
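
To see what "one Document per row" means in practice, here is a rough stdlib-only approximation of the idea. It is a sketch, not CSVLoader's actual implementation: the rows_to_documents helper and the sample data are hypothetical, and documents are represented as plain dictionaries.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list:
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # Render the row as "column: value" lines so each value keeps its label.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({
            "page_content": content,
            "metadata": {"source": source, "row": row_index},
        })
    return documents


sample = "Ticket_ID,Feedback\nT-1,Login is slow\nT-2,Great support\n"
for doc in rows_to_documents(sample, source="customer_feedback.csv"):
    print(doc)
```

Keeping the column name next to each value is what preserves row-level context once the row is embedded on its own.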

Parsing PDFs: PyPDF vs. PyMuPDF

LangChain offers multiple PDF loaders because PDF parsing is notoriously difficult.

PyPDFLoader: Good for basic text; extracts page by page.
PyMuPDFLoader: Usually faster, and often better at preserving paragraph structure and extracting text from complex layouts.

from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("./financial_report.pdf")
docs = loader.load()

# PyMuPDFLoader attaches metadata to each page's Document, for example:
# {'source': './financial_report.pdf', 'page': 4, 'total_pages': 102, 'author': 'Finance Team'}

2.4 Web Scraping and Recursive Crawling

Often, your data lives on the internet or an internal company wiki.

Single Page Extraction

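The WebBaseLoader fetches a URL and uses BeautifulSoup to extract the page's text. By default it keeps text from the whole page, including headers, footers, and navigation bars. To keep only the core readable text, restrict parsing through its bs_kwargs argument (for example, with a BeautifulSoup SoupStrainer that targets the article container).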

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()
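
Where navigation or footer text still ends up in the extracted content, one option is to skip those elements while extracting text. The sketch below uses only Python's standard library and assumes that boilerplate lives in script, style, nav, header, and footer tags, which will not hold for every site. MainTextExtractor and extract_main_text are hypothetical names, not part of LangChain.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is boilerplate for our purposes.
SKIP_TAGS = {"script", "style", "nav", "header", "footer"}


class MainTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><head><style>p{}</style></head><body>"
    "<nav><a href='/'>Home</a></nav>"
    "<p>Hello <b>world</b></p>"
    "<footer>Copyright</footer></body></html>"
)
print(extract_main_text(page))
```

A tag-based rule like this is crude; sites that build their navigation out of generic div elements need a selector tailored to their markup.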

Recursive Web Crawling

If you want to ingest an entire documentation website (like the LangChain docs), passing hundreds of URLs by hand is impractical. Use the RecursiveUrlLoader instead.

It starts at a root URL, searches for links, and recursively downloads child pages up to a maximum depth.

from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader
from bs4 import BeautifulSoup

# A custom function to clean the HTML of every crawled page
def bs4_extractor(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    return soup.get_text(separator="\n", strip=True)

loader = RecursiveUrlLoader(
    url="https://docs.python.org/3/",
    max_depth=2,          # Limit how deep the crawl recurses from the root URL
    extractor=bs4_extractor
)

# Depending on the site, this can return hundreds of pages
docs = loader.load()
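
The idea of a depth-limited crawl can be sketched without any network access. This is an illustration of the general technique (breadth-first traversal with a visited set), not RecursiveUrlLoader's implementation, and its depth convention is an assumption: here max_depth counts the number of links followed from the root. The crawl function and the in-memory link graph are hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(root: str, get_links: Callable[[str], Iterable[str]], max_depth: int) -> List[str]:
    """Return pages reachable from root by following at most max_depth links."""
    visited = {root}
    queue = deque([(root, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # Do not follow links beyond the depth limit.
        for link in get_links(url):
            if link not in visited:  # Avoid cycles and duplicate downloads.
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory "website": each page maps to the links it contains.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/a"],
    "/a/1": ["/deep"],
}

print(crawl("/", lambda url: site.get(url, []), max_depth=2))
```

The visited set matters as much as the depth limit: without it, pages that link back to each other would be fetched repeatedly.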

2.5 Bulk Ingestion: The `DirectoryLoader`

In a production environment, users will upload hundreds of files of different types. Once those files are in a local directory, you need a way to load them concurrently, applying the correct loader to each type (PDF loader for PDFs, text loader for text files).

The DirectoryLoader supports multithreading to speed up ingestion. It applies a single loader class (loader_cls, which defaults to an Unstructured-based loader) to every file its glob matches, so routing by file type means creating one DirectoryLoader per pattern.

from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader, TextLoader

# Map each glob pattern to its loader class and that class's keyword arguments
loaders = {
    "**/*.pdf": (PyMuPDFLoader, {}),
    "**/*.txt": (TextLoader, {"autodetect_encoding": True}),
}

docs = []
for glob_pattern, (loader_cls, loader_kwargs) in loaders.items():
    loader = DirectoryLoader(
        path='./company_knowledge_base',
        glob=glob_pattern,           # Recursively match this file type
        loader_cls=loader_cls,       # One loader class per DirectoryLoader
        loader_kwargs=loader_kwargs,
        use_multithreading=True,     # Speeds up IO-bound loading
        show_progress=True,          # Progress bar in the terminal (requires tqdm)
    )
    docs.extend(loader.load())

2.6 Best Practices for Document Ingestion

Garbage In, Garbage Out: If your PDF has unreadable tables or messy OCR (scanned text), retrieval will surface garbled passages and the model is more likely to hallucinate. Always inspect your page_content before passing it to the database.
Sanitize Content: Remove excessive newlines, stray control or unusual unicode characters, and boilerplate headers. Clean text generally yields better embeddings.
Inject Custom Metadata: If you are building a multi-tenant SaaS application, always inject a user_id or org_id into the metadata of every document immediately after loading. The metadata does not enforce isolation by itself; it keeps users out of each other's private documents only when every query filters on it.
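
The last two practices can be sketched in a few lines. This is one possible approach under stated assumptions, not a prescribed pipeline: the Document class is a stand-in dataclass so the snippet runs without LangChain, and sanitize, tag_documents, and the org_id value are hypothetical. The cleaning rules (NFKC normalization, dropping control and format characters, collapsing whitespace) may be too aggressive for some scripts and should be checked against your own data.

```python
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in with the same two fields as LangChain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def sanitize(text: str) -> str:
    """Normalize unicode, drop control characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs; drop every other character in a "C" category.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    text = re.sub(r"[ \t]+", " ", text)      # Collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # Trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # Allow at most one blank line
    return text.strip()


def tag_documents(documents, org_id: str):
    """Clean each document and stamp it with the owning organization."""
    for doc in documents:
        doc.page_content = sanitize(doc.page_content)
        doc.metadata["org_id"] = org_id
    return documents


raw = [Document("Hello\u00a0 world\x00\n\n\n\nNext  line ", {"source": "a.txt"})]
tagged = tag_documents(raw, org_id="org-123")
print(repr(tagged[0].page_content), tagged[0].metadata)
```

Tagging right after loading means every chunk produced later inherits the tenant identifier, provided the splitter copies metadata through to its chunks.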

2.7 What's Next?

We now have the ability to ingest large document collections from file systems, CSV exports, and websites.

However, we cannot send a 500-page PDF directly into a vector database or an LLM. We must chop these massive documents into small, semantically meaningful pieces. In Chapter 3, we will master Text Splitters and the science of chunking.
