Chapter 24: Elasticsearch and Search Engines

Why This Exists

Chapter 23 explained why we need search engines and the concepts behind them (inverted indexes, tokenization). This chapter looks at how the most widely used search engine, Elasticsearch, works under the hood. The goal is to give architects what they need to build distributed, highly available search clusters that can aggregate millions of product attributes in milliseconds without overloading the servers.

Real World Problem

A company’s catalog grows to 5 million products. They install Elasticsearch on a single server. During a Black Friday sale, 10,000 users hit the search bar simultaneously. The Elasticsearch server runs out of RAM (Java Heap Space), crashes, and the entire website’s navigation disappears. The real-world problem is that search is computationally expensive, and a single machine has a hard ceiling on memory, CPU, and disk that a large catalog under peak traffic will exceed. The work must be distributed.

Everyday Analogy

Imagine a massive encyclopedia set containing 100 books.

If one person tries to find all mentions of the word "Architecture," it will take them days.
Elasticsearch (The Cluster): You hire 10 people (Nodes). You give each person 10 books (Shards). You stand at the front (The Coordinator) and yell, "Find the word Architecture!" All 10 people search their 10 books simultaneously and hand you their results. You merge the lists and hand the combined result to the user. This is distributed search.

Beginner Explanation

Elasticsearch is a specialized database. Instead of storing data in tables and rows, it stores data as JSON Documents. You send it a JSON document representing a product. Elasticsearch extracts the words from the text fields, builds an inverted index (a sorted list of terms, each pointing to the documents that contain it), and writes that index to disk, where the operating system's filesystem cache keeps the frequently used parts in memory. When you ask it a question, it doesn't scan the documents; it looks the terms up in the index, which is what makes it fast.
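To make the "search the index, not the documents" idea concrete, here is a minimal sketch of an inverted index in Python. It is a toy: the tokenizer, the sample products, and the function names are all assumptions made for illustration, and the real index structures inside Lucene are far more compact and sophisticated.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each term to the set of document IDs that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return IDs of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Look each term up in the index; the documents themselves are never scanned.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample catalog.
docs = {
    1: "Red cotton shirt",
    2: "Blue linen shirt",
    3: "Red leather wallet",
}
index = build_inverted_index(docs)
red_shirts = search(index, "red shirt")
```

The query touches only two index entries ("red" and "shirt") and intersects their document sets, no matter how many documents exist.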

Intermediate Explanation

Elasticsearch operates as a Cluster made of multiple Nodes (servers). Data is stored inside an Index (think of this as a database table). To distribute the data, an Index is broken into Shards.

If an Index has 3 Shards, and you have 3 Nodes, Elasticsearch puts one shard on each node.
When a search query comes in, it runs in parallel across all 3 nodes simultaneously.

To prevent data loss if a node catches fire, Elasticsearch uses Replicas. A Replica is an exact copy of a Shard, stored on a different node. Replicas also help handle high read traffic, as search queries can be routed to either the Primary shard or the Replica shard.
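The placement rule described above (a replica never shares a node with its primary) can be sketched as follows. This is a simplified round-robin model with hypothetical node names, not Elasticsearch's actual allocator, which also weighs disk usage, shard counts, and other factors. Raising an error when there are too few nodes is also a simplification of this sketch.

```python
def place_shards(num_shards: int, num_replicas: int, nodes: list[str]) -> dict[str, list[str]]:
    """Assign each primary (P) and its replicas (R) to distinct nodes, round-robin."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas so copies never share a node")
    layout: dict[str, list[str]] = {node: [] for node in nodes}
    for shard in range(num_shards):
        # Copy 0 is the primary; copies 1..num_replicas are its replicas.
        for copy in range(num_replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            layout[node].append(label)
    return layout


def all_shards_survive(layout: dict[str, list[str]], failed_node: str, num_shards: int) -> bool:
    """Check that every shard still has at least one copy after one node is lost."""
    surviving = {
        label[1:]  # strip the P/R prefix, keeping the shard number
        for node, labels in layout.items()
        if node != failed_node
        for label in labels
    }
    return all(str(shard) in surviving for shard in range(num_shards))


layout = place_shards(num_shards=3, num_replicas=1, nodes=["node-1", "node-2", "node-3"])
```

With 3 shards, 1 replica, and 3 nodes, each node ends up holding one primary and one replica of a different shard, so losing any single node leaves a full copy of the index available.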

Advanced Explanation

The true power of Elasticsearch for e-commerce is Aggregations (facets). When a query searches for "Shirts," Elasticsearch doesn't just return the documents. It runs analytical functions on the result set in real-time. It uses specialized data structures called Doc Values (column-oriented data structures built at index time) to rapidly count attributes.

"aggs": {
  "colors": { "terms": { "field": "color.keyword" } },
  "sizes": { "terms": { "field": "size.keyword" } }
}

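The cluster distributes this aggregation request to all shards. Each shard calculates the counts for the documents it holds. The coordinating node merges these counts and returns the final sidebar filters: Red (40), Blue (20). Because each shard reports only its top terms, the merged counts of a terms aggregation can be approximate when a field has many distinct values.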
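The per-shard counting and coordinator merge can be sketched in a few lines of Python. The shard contents below are made up so that the totals match the Red (40), Blue (20) example, and the function names are hypothetical. Unlike this sketch, a real shard returns only its top terms rather than full counts.

```python
from collections import Counter


def shard_terms_agg(docs: list[dict], field: str) -> Counter:
    """Run on each shard: count how often each value of `field` occurs locally."""
    return Counter(doc[field] for doc in docs if field in doc)


def merge_shard_counts(partials: list[Counter], size: int = 10) -> list[tuple[str, int]]:
    """Run on the coordinating node: sum the per-shard counts, keep the top `size`."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total.most_common(size)


# Hypothetical documents spread across three shards.
shards = [
    [{"color": "Red"}] * 25 + [{"color": "Blue"}] * 5,
    [{"color": "Red"}] * 10 + [{"color": "Blue"}] * 10,
    [{"color": "Red"}] * 5 + [{"color": "Blue"}] * 5,
]
partials = [shard_terms_agg(docs, "color") for docs in shards]
facets = merge_shard_counts(partials)
```

Here `facets` holds the merged sidebar counts, with Red at 40 and Blue at 20.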

Real World Example

GitHub Code Search: For many years, searching for code on GitHub meant querying Elasticsearch (GitHub has since moved code search to a custom-built engine, but the scale lesson still applies). GitHub stores billions of lines of code, so they sharded their Elasticsearch clusters heavily. A query was parallelized across many shards and returned the files and lines where the search term appears.

Architecture Design

Here is the internal architecture of an Elasticsearch Query:

graph TD
    Client[Web API] -->|Search 'Laptop'| CoordNode[Coordinating Node]

    CoordNode -->|Scatter| Node1[Data Node 1: Shard 1]
    CoordNode -->|Scatter| Node2[Data Node 2: Shard 2]
    CoordNode -->|Scatter| Node3[Data Node 3: Shard 3]

    Node1 -->|Local hits and facets| CoordNode
    Node2 -->|Local hits and facets| CoordNode
    Node3 -->|Local hits and facets| CoordNode

    CoordNode -->|Gather phase: merge and sort, return final 10 results| Client
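
The scatter and gather steps in the diagram can be sketched in Python. This is an in-process simulation under stated assumptions: each "shard" is just a list of hypothetical (score, document ID) pairs, and threads stand in for network calls to data nodes.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

Hit = tuple[float, str]  # (relevance score, document ID)


def search_shard(shard: list[Hit], k: int) -> list[Hit]:
    """Scatter side: one shard returns only its own k best hits."""
    return heapq.nlargest(k, shard)


def scatter_gather(shards: list[list[Hit]], k: int) -> list[Hit]:
    """Coordinator: query all shards in parallel, then merge and keep the global top k."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, k), shards))
    merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


# Hypothetical scored hits held by three shards.
shards = [
    [(0.9, "doc-a"), (0.2, "doc-b")],
    [(0.7, "doc-c"), (0.8, "doc-d")],
    [(0.1, "doc-e"), (0.95, "doc-f")],
]
top3 = scatter_gather(shards, k=3)
```

Each shard only has to return k candidates, because no document outside a shard's local top k can appear in the global top k.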

Database Design (Mapping)

In SQL, you define schemas. In Elasticsearch, you define Mappings. Elasticsearch treats text differently depending on the mapping:

{
  "mappings": {
    "properties": {
      "title": { 
        "type": "text", // Analyzed: chopped into words for full-text search
        "analyzer": "english" 
      },
      "brand": { 
        "type": "keyword" // Not analyzed: used for exact filtering & aggregations
      },
      "price": { 
        "type": "double" 
      }
    }
  }
}
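
The practical difference between the two field types can be illustrated with a toy Python model. The `analyze` function below is a crude stand-in for an analyzer (real analyzers such as `english` also stem words and remove stop words), and the matching functions only imitate the behavior of a match query and a term query; all names here are hypothetical.

```python
import re


def analyze(value: str) -> list[str]:
    """Crude analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def match_text(field_value: str, query: str) -> bool:
    """Imitates a match query on a text field: any analyzed query term may match."""
    doc_terms = set(analyze(field_value))
    return any(term in doc_terms for term in analyze(query))


def term_keyword(field_value: str, query: str) -> bool:
    """Imitates a term query on a keyword field: exact, case-sensitive equality."""
    return field_value == query


# Hypothetical product document.
product = {"title": "Apple MacBook Pro Laptop", "brand": "Apple"}

title_hit = match_text(product["title"], "laptop")     # analyzed, so case is ignored
brand_miss = term_keyword(product["brand"], "apple")   # exact match fails on case
brand_hit = term_keyword(product["brand"], "Apple")
```

This is why free-text search boxes target text fields, while filters and facets, which need stable exact values, target keyword fields.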

API Design

Elasticsearch exposes a REST API. Searches are expressed in the Query DSL, a JSON-based query language sent in the request body.

Search Request:

GET /products/_search
{
  "query": {
    "bool": {
      "must": [ { "match": { "title": "laptop" } } ],
      "filter": [ { "term": { "brand": "Apple" } } ]
    }
  },
  "aggs": {
    "average_price": { "avg": { "field": "price" } }
  }
}
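
Application code usually assembles this request body programmatically rather than by hand. The sketch below builds the same bool query as a Python dictionary; the function name and its parameters are assumptions for illustration, and only the standard library is used, so nothing is sent to a cluster.

```python
import json
from typing import Optional


def build_product_search(text: str, brand: Optional[str] = None, size: int = 10) -> dict:
    """Build a Query DSL body: full-text match on title, optional exact brand filter."""
    bool_query: dict = {"must": [{"match": {"title": text}}]}
    if brand is not None:
        # Filters do not affect relevance scoring; they only include or exclude documents.
        bool_query["filter"] = [{"term": {"brand": brand}}]
    return {
        "size": size,
        "query": {"bool": bool_query},
        "aggs": {"average_price": {"avg": {"field": "price"}}},
    }


body = build_product_search("laptop", brand="Apple")
payload = json.dumps(body)  # the string to send as the body of GET /products/_search
```

Keeping the brand in `filter` rather than `must` matches the request above: the title decides ranking, and the brand only narrows the result set.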

Production Considerations

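The JVM Heap Limit: Elasticsearch runs on Java, so you must size the JVM Heap. The long-standing rule is to give the Heap no more than half of the server's RAM and to keep it below roughly 32GB (the exact cutoff varies by JVM, so many teams stop at about 30GB), regardless of how big the server is. Above that threshold the JVM can no longer use "Compressed OOPs" (compressed ordinary object pointers), so every object reference grows from 4 to 8 bytes and a somewhat larger heap holds no more data than a slightly smaller one. Leave the remaining server RAM for the OS filesystem cache (which Elasticsearch relies on heavily).
Split-Brain Problem: With only 2 master-eligible nodes and a misconfigured quorum, a network partition can lead each node to decide the other is dead and elect itself Master, leaving two halves of the cluster that accept conflicting writes. Use at least 3 master-eligible nodes (an odd number) so that only the side holding a majority vote (quorum) can elect a Master. Elasticsearch 7.0 and later manage the quorum automatically; older versions required setting discovery.zen.minimum_master_nodes by hand.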
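
Both rules above reduce to simple arithmetic, sketched here in Python. The function names are hypothetical, and the 31GB cap is an assumed stand-in for "just under the Compressed OOPs threshold"; the real cutoff depends on the JVM. The half-of-RAM rule is common sizing guidance that leaves the rest for the filesystem cache.

```python
def quorum(master_eligible: int) -> int:
    """Votes needed for a majority among master-eligible nodes."""
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible - quorum(master_eligible)


def suggested_heap_gb(ram_gb: float, oops_cap_gb: float = 31.0) -> float:
    """Half of RAM, capped below the assumed Compressed OOPs threshold."""
    return min(ram_gb / 2, oops_cap_gb)


for nodes in (2, 3, 4, 5):
    print(nodes, "nodes -> quorum", quorum(nodes), "-> tolerates", tolerated_failures(nodes))
```

The loop shows why odd counts are preferred: 2 nodes tolerate no failures, and 4 nodes tolerate only 1, the same as 3.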

Security Considerations

Exposed Clusters: Historically, Elasticsearch shipped with no authentication enabled and served its HTTP API on Port 9200. Thousands of companies accidentally left this port open to the public internet. Automated bots scanned the internet, found the clusters, deleted all the data, and left an index named READ_ME_TO_RECOVER_DATA demanding Bitcoin. Since version 8.0, security (authentication and TLS) is enabled by default; on older versions, enable the security features (formerly X-Pack Security) explicitly. In either case, restrict access via Virtual Private Clouds (VPCs) and never expose Port 9200 directly to the internet.

Common Mistakes

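Deep Pagination: Allowing a user to request page 10,001 (from=100000, size=10). To serve it, every shard must collect and sort its top 100,010 hits, and the coordinating node must merge all of those lists in memory only to discard all but 10 results. By default Elasticsearch rejects any request where from + size exceeds 10,000 (the index.max_result_window setting); raising that limit to allow deep pages puts heavy memory and CPU pressure on the cluster and can bring nodes down. Restrict UI pagination to a maximum of 1,000 results, and use the search_after parameter for deep scrolling.
Mapping Explosions: Using dynamic mapping where every new unique key in a JSON document creates a new field in the index. Every field definition lives in the cluster state, so if a hacker sends payloads with thousands of random JSON keys, the mapping bloats and the cluster can become unstable. Elasticsearch caps an index at 1,000 fields by default (index.mapping.total_fields.limit), but the safer fix is to define mappings explicitly and set dynamic to strict (or false) on indexes that accept untrusted input.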
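
The cost of deep pagination, and the shape of a search_after request, can be sketched in Python. The helper names are hypothetical, and `product_id` is an assumed unique field used as a sort tiebreaker; no request is actually sent.

```python
from typing import Optional


def entries_to_merge(from_: int, size: int, num_shards: int) -> int:
    """Sorted entries the coordinating node receives for a from/size page."""
    return (from_ + size) * num_shards


def next_page_body(query: dict, sort: list, last_sort_values: Optional[list], size: int = 10) -> dict:
    """Build a search body that resumes after the last hit of the previous page."""
    body: dict = {"size": size, "query": query, "sort": sort}
    if last_sort_values is not None:
        # search_after takes the sort values of the previous page's final hit.
        body["search_after"] = last_sort_values
    return body


deep_cost = entries_to_merge(from_=100_000, size=10, num_shards=5)

query = {"match": {"title": "laptop"}}
sort = [{"price": "asc"}, {"product_id": "asc"}]  # tiebreaker keeps the order stable
first_page = next_page_body(query, sort, last_sort_values=None)
second_page = next_page_body(query, sort, last_sort_values=[999.99, "sku-123"])
```

On a 5-shard index the deep page forces the coordinator to merge 500,050 entries, while each search_after page only ever asks every shard for `size` hits past a known position.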

Tradeoffs and Alternatives

OpenSearch vs Elasticsearch: After Elastic moved Elasticsearch off the Apache 2.0 license in 2021, AWS forked the project into "OpenSearch." They are structurally similar, but divergence in features (like vector search) is growing.
Solr vs Elasticsearch: Both are built on the underlying Apache Lucene library. Elasticsearch gained the larger market share, which is commonly attributed to its REST API and developer experience.

Interview Questions

Explain the "Scatter-Gather" mechanism in distributed search engines.
What is the difference between a text field and a keyword field in an Elasticsearch mapping?
Why should you never allocate more than 32GB of RAM to the Elasticsearch JVM Heap?

Hands-On Exercise

Download and run Elasticsearch locally via Docker.
Use cURL or Postman to create an index with PUT /my_index, then index a simple JSON document with POST /my_index/_doc.
Write a GET /my_index/_search query to retrieve it.
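
If you prefer a script to cURL, the three requests can be prepared with Python's standard library. This is a sketch under assumptions: a local single node at http://localhost:9200 with security disabled, and a made-up sample document. The requests are only built here; call `send` on each one once your node is running.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9200"  # assumption: local node without authentication


def make_request(method: str, path: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Prepare (but do not send) a JSON request against the local node."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


steps = [
    make_request("PUT", "/my_index"),  # create the index
    # refresh=true makes the document searchable immediately
    make_request("POST", "/my_index/_doc?refresh=true", {"title": "Hello Elasticsearch"}),
    make_request("GET", "/my_index/_search", {"query": {"match": {"title": "hello"}}}),
]

# With Elasticsearch running: for step in steps: print(send(step))
```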

Key Takeaways

Elasticsearch is a distributed JSON document store built on top of Apache Lucene.
Indexes are split into Shards (for parallel processing), and each shard can have Replicas (copies on other nodes, for high availability and extra read capacity).
Full-text search uses text fields; Facets/Filtering use keyword fields.
Deep pagination and unchecked dynamic mappings are the easiest ways to crash a cluster.