Chapter 23: Search and Filtering Architecture

Why This Exists

If a user cannot find a product, they cannot buy it. In a catalog of 100,000 items, relying on users to click through category menus is a death sentence for conversion rates. Modern users have been trained by Google to expect a single search bar that instantly understands their intent, forgives their typos, and allows them to slice the data using faceted filters. Search architecture exists to parse human language and intent into instant product discovery.

Real World Problem

A company builds their search using standard SQL: SELECT * FROM products WHERE name LIKE '%IPad%'. A customer types "Apple iPad". The query returns 0 results because the exact string isn't an identical match. Another customer types "iPads" (plural). 0 results. Another types "sneakers". The database has them listed as "running shoes". 0 results. The customers assume the store is empty and go to a competitor. The real-world problem is that relational databases do not understand the nuance, synonyms, and variations of human language.

Everyday Analogy

Imagine walking into a massive library.

SQL Database: You ask the librarian for "The History of Rome." The librarian walks down every single aisle, looking at the cover of every single book, one by one, until they find it. It takes hours.
Search Engine: You ask the librarian. They immediately walk to a small filing cabinet (The Index), pull the card for "Rome," and say, "Aisle 4, Shelf 2." It takes 5 seconds.

Beginner Explanation

A search engine does a lot of work before you even type anything. When you add a product called "Men's Running Shoes" to the store, the search engine chops it up into tags: [men, run, shoe]. When a user types "Running Shoe", the engine looks at its tags, matches them instantly, and returns the product. It also provides "Filters" (like a sidebar) that let the user click "Size 10" and "Color Blue" to narrow down the results instantly.

Intermediate Explanation

The core of search architecture is the Inverted Index. A standard database maps a Document ID to its contents:

ID 1 -> "Red Apple"
ID 2 -> "Green Apple"

An Inverted Index maps the Contents to the Document ID:

"Red" -> [ID 1]
"Green" -> [ID 2]
"Apple" -> [ID 1, ID 2]

When a user searches for "Apple", the system doesn't scan the database. It looks up "Apple" in the index, instantly sees [ID 1, ID 2], and returns them. This is why search engines are exponentially faster than SQL LIKE queries.

Advanced Explanation

Search requires a complex text analysis pipeline:

Tokenization: Breaking "Sony 55-inch TV" into [Sony, 55, inch, TV].
Lowercasing: Converting Sony to sony.
Stop Words: Removing useless words (e.g., "the", "and").
Stemming: Reducing words to their root (e.g., "running" becomes "run", "shoes" becomes "shoe").
Synonyms: Linking words. If the query is "couch", secretly expand the query to search for [couch, sofa, loveseat].
Fuzzy Matching (Levenshtein Distance): If a user types "snackers", the system calculates that it is only a 1-character typo away from "sneakers" and returns sneaker results.

Real World Example

Faceted Navigation (Etsy/Amazon): Search isn't just about the text bar; it's about the sidebar. When you search "Shirts," the sidebar shows:

Size: Small (15), Medium (42), Large (10)
Color: Red (5), Blue (62)

The search engine calculates these counts (Aggregations) dynamically based on the current search results in milliseconds. Attempting to do this with SQL GROUP BY across millions of rows would crash the server.

Architecture Design

Search is a Read-Model separated from the primary Write-Database (CQRS pattern).

graph TD
    UI[Frontend] -->|1. Search "Shoes"| SearchAPI[Search API]
    SearchAPI -->|2. Query DSL| ES[(Search Engine - Algolia/ES)]
    ES -- 3. Instant Results + Facets --> UI
    
    Admin[Merchant] -->|Updates Product| SQL[(Primary SQL DB)]
    SQL -- Async Event --> SyncWorker[Index Sync Worker]
    SyncWorker -->|Updates Index| ES

Database Design

Search engines use flat, denormalized JSON documents, not relational tables.

Elasticsearch Document Structure:

{
  "id": "prod_123",
  "name": "Nike Air Max",
  "brand": "Nike",
  "price": 120.00,
  "categories": ["Shoes", "Mens", "Running"], // Denormalized hierarchy
  "attributes": [
    { "key": "color", "value": "Red" },
    { "key": "size", "value": "10" }
  ]
}

API Design

Search Request: GET /api/search?q=nike+shoes&filters[color]=Red&sort=price_desc

Search Response:

{
  "results": [ { "id": "prod_123", "name": "Nike Air Max" } ],
  "facets": {
    "color": [ {"value": "Red", "count": 1}, {"value": "Blue", "count": 5} ]
  },
  "total_hits": 6
}

Production Considerations

Eventual Consistency: When a merchant updates a price in SQL, it takes a few milliseconds (or seconds) for the Index Sync Worker to update the Search Engine. The UI must handle situations where the search results show $100, but clicking the product page shows the real-time price of $110.
Typo Tolerance vs. Exact Match: If a user searches for a specific model number (e.g., RTX-3080), applying fuzzy typo-tolerance might ruin the results by showing RTX-3090 items. The engine must be configured to prioritize exact matches on specific fields (like SKU) while allowing fuzzy matching on descriptive text.

Security Considerations

Denial of Service (DoS): Heavy search queries (like wildcard searches * or complex Regex) consume massive CPU on the search cluster. The Search API must heavily sanitize and restrict the complexity of queries passed from the frontend to the backend engine to prevent users from intentionally crashing the search cluster.

Common Mistakes

Using SQL for Main Search: Trying to survive production using LIKE '%term%' or even PostgreSQL's built-in tsvector. They work for small catalogs but fail to provide the dynamic faceted counts required for modern sidebars.
Indexing HTML: Pushing the raw description field containing HTML tags (<p>Great shirt!</p>) into the search index. The user searches for the word "header" and finds the shirt because the HTML contained <h1>. Strip HTML before indexing.

Tradeoffs and Alternatives

Self-Hosted (Elasticsearch) vs. SaaS (Algolia / Typesense Cloud):
- Elasticsearch is free to use but incredibly difficult to tune, monitor, and scale (it consumes massive amounts of RAM).
- Algolia provides an unbelievable developer experience, instant typo-tolerance, and zero ops, but is notoriously expensive for high-volume catalogs.

Interview Questions

Explain the difference between how a SQL database and a Search Engine store data (B-Tree vs Inverted Index).
What is Tokenization and Stemming in the context of full-text search?
How do you ensure the Search Engine stays synchronized with the primary SQL catalog database?

Hands-On Exercise

Go to Amazon.com.
Search for a completely misspelled word (e.g., "bluthooth hedphones").
Notice how it instantly autocorrects and shows you the correct items.
Look at the sidebar. Notice the counts next to the brands (e.g., Sony (412)).
Understand that all of this was computed in less than 50 milliseconds using an Inverted Index and Aggregations.

Key Takeaways

SQL is for storing truth; Search Engines are for discovering truth.
Search relies on Inverted Indexes, which map Words to Documents.
A strong search pipeline requires Tokenization, Stemming, Synonyms, and Fuzzy Matching.
Faceted navigation (the filter sidebar) is powered by Search Engine Aggregations, not SQL GROUP BY queries.