Chapter 4: Product Catalog Design

Why This Exists

The product catalog is the heart of any e-commerce system. If customers cannot find products, filter them, or understand what they are buying, no transactions occur. Designing a catalog seems simple until you realize that a t-shirt has a size and color, a laptop has RAM and a processor, and a refrigerator has dimensions and energy rating. The catalog architecture exists to store, search, and retrieve highly unstructured, varied product data at lightning speed.

Real World Problem

A junior developer is tasked with building a database for an electronics store. They create a products table with columns: id, name, price, ram, cpu, screen_size. A month later, the business decides to sell shoes. The developer adds shoe_size and color columns. Then they sell books, adding author and isbn. Soon, the table has 500 columns, most of which are NULL for any given product (a shoe doesn't have RAM). The database becomes a bloated, unqueryable nightmare.

Everyday Analogy

Imagine a physical library. If you organize the library by creating a custom shelf for every possible combination of book traits (e.g., "Red cover, written in 1990, sci-fi"), you'll never find anything. Instead, you need a flexible catalog system. You store the physical books efficiently in the back, but you build a highly optimized, flexible index card system in the front that lets people search by author, genre, or publication year without moving the physical books.

Beginner Explanation

At the most basic level, a product needs a few standard things: a Name, a Description, a Price, and an Image. Every single product has these.

But products also have unique attributes. To solve the problem of varied attributes, modern systems use flexible data formats like JSON. Instead of making a separate database column for every possible feature (like RAM or shoe size), we create one column called attributes and stuff all the specific details in there. This allows a shoe and a laptop to live happily in the same database table.

Intermediate Explanation

While storing flexible data in a JSON column (in PostgreSQL or MongoDB) solves the storage problem, it creates a read problem. Customers want to filter products: "Show me all laptops under $1000 with 16GB RAM."

Querying deep inside JSON structures across millions of rows in a relational database is slow. To fix this, e-commerce architectures split the catalog into two systems:

The Source of Truth (Write Database): A relational database (SQL) that stores the core product data reliably.
The Search Engine (Read Database): A system like Elasticsearch or Typesense optimized entirely for fast filtering, full-text search, and faceted navigation.

Advanced Explanation

At production scale, Catalog systems utilize CQRS (Command Query Responsibility Segregation).

When a merchant updates a product in the admin panel, the system executes a Command. It writes the data to the highly normalized SQL database (the Source of Truth).

This write triggers an Event (via a message queue like Kafka). A separate service listens to this event, transforms the complex SQL relational data into a massive, flat JSON document, and pushes it into Elasticsearch (the Query side).

When a user searches the website, the API only talks to Elasticsearch. The frontend never touches the SQL database. This ensures that massive traffic spikes on the frontend do not affect the backend administration systems.

Real World Example

Consider Wayfair, the furniture e-commerce giant. Furniture has thousands of attributes (material, dimensions, weight capacity, style). Wayfair relies heavily on a decoupled catalog architecture. Their administrative tools write to a source-of-truth database, which then asynchronously updates vast Elasticsearch clusters. When you use their sidebar to filter by "Mid-Century Modern," "Wood," and "Under $500," you are querying the search cluster, returning results in milliseconds.

Architecture Design

Here is how a CQRS Catalog Architecture operates:

graph TD
    Admin[Merchant / PIM System] -->|Writes| WriteAPI[Catalog Write API]
    WriteAPI --> DB_SQL[(Relational DB - Source of Truth)]
    
    DB_SQL -- Change Data Capture / Events --> MQ[Message Queue - Kafka]
    
    MQ --> SyncWorker[Catalog Sync Worker]
    SyncWorker -->|Indexes| ES[(Elasticsearch / Read Store)]
    
    Customer[Shopper] -->|Searches & Filters| ReadAPI[Catalog Read API]
    ReadAPI --> ES

Database Design

1. The Anti-Pattern (Wide Table): Products(id, name, price, ram, color, shoe_size, author, ...) -> Do not do this.

2. The EAV Pattern (Entity-Attribute-Value): Historically used by Magento. Extremely flexible, but notoriously slow to query.

Products: (id, name, price)
Attributes: (id, name) e.g., 'Color', 'RAM'
Product_Values: (product_id, attribute_id, value) e.g., (1, 'Color', 'Red')

3. The Modern Approach (JSONB in SQL / NoSQL):

CREATE TABLE products (
    id UUID PRIMARY KEY,
    name VARCHAR(255),
    base_price DECIMAL(10,2),
    attributes JSONB -- Stores {"ram": "16gb", "color": "silver"}
);

Note: This table is then synchronized to Elasticsearch for fast read operations.

API Design

Write API (Admin): PUT /api/admin/products/123 Updates the core data and triggers the sync event.

Read API (Frontend): GET /api/catalog/search?q=laptop&filters[ram]=16gb&sort=price_asc This API translates the query string directly into an Elasticsearch query DSL.

Production Considerations

Eventual Consistency: Because the SQL database syncs to the Search Engine asynchronously, there is a delay (often milliseconds, but sometimes longer). A merchant might update a price to $90, but the search page still shows $100 for a few seconds. The UI must account for this eventual consistency.
Cache Invalidation: Product details pages (PDPs) are heavily cached in CDNs or Redis. When a product changes, you must have a mechanism to purge the cache for that specific product ID.

Security Considerations

XSS (Cross-Site Scripting): Merchants or vendors might input malicious JavaScript into the description field (e.g., <script>stealCookie()</script>). The Catalog API must strictly sanitize all HTML inputs before saving them to the database.
Denial of Service via Search: Complex, nested search queries can consume massive CPU on the search engine. APIs must restrict the depth and complexity of search filters allowed from the frontend.

Common Mistakes

Reading from the Write Database: Connecting the consumer-facing website directly to the primary SQL catalog database. A simple bot scraping your site will take down your entire infrastructure.
Over-normalizing: Splitting product data into 20 different SQL tables (categories, tags, dimensions, variants) and trying to run a 20-table JOIN every time a user views a product page.

Tradeoffs and Alternatives

EAV vs. JSON: EAV allows strict typing (you can enforce that 'Price' is an integer at the DB level), but requires massive SQL JOINs. JSON columns are schemaless and fast, but rely on the application layer to validate data types.
PIM vs. Custom Build: Large enterprises use dedicated Product Information Management (PIM) software (like Akeneo or Salsify) as their Source of Truth, rather than building the catalog admin panel from scratch.

Interview Questions

Design the database schema for a catalog that sells clothing, electronics, and digital downloads. How do you handle the different attributes?
A customer complains that they updated a product's price, but the search results still show the old price. What architectural pattern is causing this, and how do you explain it?
Why would you choose Elasticsearch over PostgreSQL for rendering a product listing page with faceted filters?

Hands-On Exercise

Choose a complex product (like a high-end digital camera).
Write a JSON document that represents all the attributes, specifications, and pricing for this camera.
Think about which of those JSON fields need to be "Searchable" (e.g., Megapixels) versus "Display Only" (e.g., the exact dimensions of the box).

Key Takeaways

Catalogs suffer from the "Sparse Data Problem"—products have varied attributes.
Modern architectures solve this using JSON document storage combined with CQRS.
The system that writes the catalog data (SQL) should almost never be the same system that reads the catalog data (Elasticsearch).
Designing for eventual consistency is mandatory at scale.