Chapter 14: Webhooks and Event Processing

Why This Exists

Modern e-commerce architectures are distributed systems. Your servers must communicate with Stripe, FedEx, Shopify, and Mailchimp. When Stripe processes a payment, or FedEx scans a package, your system needs to know immediately. You cannot keep an HTTP connection open for 3 days waiting for a package to ship. Webhooks and event processing exist to enable asynchronous, real-time communication between decoupled systems over the internet.

Real World Problem

A merchant uses a slow bank transfer payment method (like ACH or SEPA) which takes 3 days to clear. The customer completes checkout on Monday. On Thursday, the bank approves the transfer. How does your backend know the order is now paid so it can tell the warehouse to ship it? You could write a script that asks the bank API every 5 minutes, "Is it paid yet?" This approach is called polling, and it wastes server resources because almost every request comes back with "no". The real-world problem is receiving data efficiently when the timing of the event is entirely unpredictable.
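
To put a rough number on the polling cost in this scenario, here is a small back-of-the-envelope calculation. The 5-minute interval and 3-day clearing time come from the scenario; the figure of 1,000 pending orders is purely hypothetical.

```python
# Back-of-the-envelope cost of polling for the bank transfer scenario.
POLL_INTERVAL_MINUTES = 5   # from the scenario
CLEARING_DAYS = 3           # from the scenario
PENDING_ORDERS = 1_000      # hypothetical volume, for illustration only

# One poll every 5 minutes for 3 days, per order.
polls_per_order = CLEARING_DAYS * 24 * 60 // POLL_INTERVAL_MINUTES

# Only the final poll for each order returns "paid"; the rest are wasted.
total_polls = polls_per_order * PENDING_ORDERS
wasted_polls = total_polls - PENDING_ORDERS

print(polls_per_order, total_polls, wasted_polls)
```

With a webhook, the same 1,000 orders would need roughly 1,000 inbound requests in total.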

Everyday Analogy

  • Polling: You are waiting for a package. You open your front door every 5 minutes to check if the delivery truck is there. It is exhausting and highly inefficient.
  • Webhooks: You install a doorbell. You sit on your couch and watch TV. When the delivery driver arrives, they ring the doorbell, and you go to the door. Webhooks are the API equivalent of a doorbell.

Beginner Explanation

A Webhook is just a URL (like https://api.yourstore.com/webhooks/stripe) that you create on your server. You tell Stripe, "Hey, whenever a payment succeeds, send a message to this URL." When the payment clears, Stripe's servers make a POST request to your URL with a JSON message containing the details. Your server receives the message and updates the database.
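
As a minimal sketch of that flow, the receiver below uses only Python's standard library. The `orders` dictionary stands in for a real database, and the names `handle_event`, `WebhookHandler`, and `serve` are illustrative, not part of any provider's SDK. It deliberately skips signature verification, which a real endpoint must not do.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Stand-in for the database: payment id -> order status (hypothetical data).
orders = {"pi_123": "PENDING"}


def handle_event(event):
    """Update the 'database' when a payment succeeds; ignore other events."""
    if event.get("type") == "payment_intent.succeeded":
        payment_id = event["data"]["object"]["id"]
        orders[payment_id] = "PAID"


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the sender announced.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            handle_event(json.loads(body))
        except (ValueError, KeyError, TypeError, AttributeError):
            self.send_response(400)  # malformed payload
        else:
            self.send_response(200)  # acknowledge receipt
        self.end_headers()


def serve(port=8000):
    """Blocks forever; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener locally. The provider would need a publicly reachable URL that forwards to it.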

Intermediate Explanation

Webhooks flip the traditional API model. Instead of the Client calling the Server, the Server (Stripe) calls the Client (Your App).

Building a webhook receiver is easy, but making it reliable is hard. Because the sender (Stripe) expects a prompt acknowledgment, your webhook endpoint should return a 200 OK status code within a couple of seconds. If your code takes 5 seconds to generate an invoice PDF and update the database, Stripe may time out, treat the delivery as failed, and retry the webhook later. If the first attempt actually finished its work, the retry leads to duplicate processing.

Advanced Explanation

At enterprise scale, webhook endpoints must be Idempotent and act purely as a Router to a Message Queue.

When a webhook hits your server:

  1. Cryptographically verify the signature.
  2. Check if this specific event ID has been processed before (Idempotency check).
  3. Immediately push the JSON payload to an asynchronous Message Queue (like AWS SQS, RabbitMQ, or Kafka).
  4. Return 200 OK.

A separate, background worker pulls the event from the Queue and does the heavy lifting (updating the DB, sending emails). If the worker crashes, the event remains safely in the Queue to be retried, and the external provider (Stripe) is happy because they got their instant 200 OK.
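
The four steps and the background worker can be sketched in a single process as follows. This is an illustration under stated assumptions: `queue.Queue` stands in for SQS, RabbitMQ, or Kafka, the `seen_event_ids` set stands in for a database uniqueness check, and signature verification is reduced to a boolean passed in by the caller. All function names are hypothetical.

```python
import json
import queue
import threading

event_queue = queue.Queue()   # stand-in for a durable message queue
seen_event_ids = set()        # stand-in for a DB idempotency check
processed = []                # records what the worker handled


def receive_webhook(body, signature_ok):
    """HTTP-handler logic: returns the status code to send back."""
    if not signature_ok:                  # step 1: reject bad signatures
        return 401
    event = json.loads(body)
    if event["id"] in seen_event_ids:     # step 2: duplicate delivery
        return 200                        # acknowledge, but do not re-enqueue
    event_queue.put(event)                # step 3: hand off to the queue
    seen_event_ids.add(event["id"])       # mark as seen only after enqueueing
    return 200                            # step 4: fast acknowledgment


def worker():
    """Background worker: does the slow work outside the HTTP request."""
    while True:
        event = event_queue.get()
        if event is None:                 # sentinel used to stop this demo
            return
        processed.append(event["id"])     # placeholder for DB writes, emails
```

The handler never touches the slow business logic, so its response time depends only on the signature check, the duplicate lookup, and the enqueue call.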

Real World Example

Shopify App Ecosystem: If you build an app for Shopify (e.g., a Loyalty Points app), you rely entirely on Webhooks. You subscribe to the orders/create topic. Every time a merchant who has installed your app makes a sale, Shopify fires a webhook to your servers. With 1 million+ merchants on the platform, a popular app can see enormous volume: during Black Friday, your servers might receive 10,000 webhooks per second. If you process these synchronously, your database can become overloaded and fail. You must push them to a Queue and process them at a speed your database can handle.

Architecture Design

Here is the standard robust architecture for processing external webhooks:

graph TD
    Ext[External Service - e.g., Stripe] -->|POST Payload| API[Webhook API Endpoint]
    
    API -->|1. Validate Signature| Val{Signature Valid?}
    
    Val -- Yes --> Queue[(Message Queue - SQS)]
    Val -- No --> Drop[Drop Request / 401]
    
    API -- 2. Return 200 OK Fast --> Ext
    
    Worker[Background Worker] -->|3. Consume Message| Queue
    Worker -->|4. Check Idempotency| DB[(Database)]
    Worker -->|5. Execute Business Logic| DB

Database Design

To ensure idempotency and track processing errors, you need a Webhook Logs (or Inbox) table.

CREATE TABLE webhook_events (
    event_id VARCHAR(100) PRIMARY KEY, -- e.g., 'evt_345' from Stripe
    provider VARCHAR(50), -- 'Stripe', 'FedEx'
    topic VARCHAR(100), -- 'payment_intent.succeeded'
    payload JSONB,
    status VARCHAR(50), -- 'PENDING', 'PROCESSED', 'FAILED'
    received_at TIMESTAMP,
    processed_at TIMESTAMP
);

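Before processing, the worker tries to insert the event_id. Because event_id is the primary key, a second insert of the same event is rejected, and the worker skips it. Relying on the key constraint rather than a separate lookup avoids a race when two workers receive the same event at the same moment.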
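
A minimal sketch of that check, using Python's built-in sqlite3 as a stand-in for the production database. The column types are simplified for SQLite, and `claim_event` and `mark_processed` are hypothetical helper names. In PostgreSQL the equivalent insert would use `ON CONFLICT (event_id) DO NOTHING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE webhook_events (
        event_id     TEXT PRIMARY KEY,
        provider     TEXT,
        topic        TEXT,
        payload      TEXT,
        status       TEXT,
        received_at  TEXT DEFAULT CURRENT_TIMESTAMP,
        processed_at TEXT
    )
    """
)


def claim_event(event_id, provider, topic, payload):
    """Insert the event; return True only if this is the first time we see it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO webhook_events "
        "(event_id, provider, topic, payload, status) "
        "VALUES (?, ?, ?, ?, 'PENDING')",
        (event_id, provider, topic, payload),
    )
    conn.commit()
    # rowcount is 1 for a new row and 0 when the primary key already existed.
    return cur.rowcount == 1


def mark_processed(event_id):
    """Record successful processing of a previously claimed event."""
    conn.execute(
        "UPDATE webhook_events "
        "SET status = 'PROCESSED', processed_at = CURRENT_TIMESTAMP "
        "WHERE event_id = ?",
        (event_id,),
    )
    conn.commit()
```

A worker would call `claim_event` first and run the business logic only when it returns True, then call `mark_processed`.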

API Design

The Webhook Receiver: POST /api/webhooks/stripe
Headers: Stripe-Signature: t=1612345,v1=abcd...
Payload:

{
  "id": "evt_999",
  "type": "payment_intent.succeeded",
  "data": {
    "object": {
      "id": "pi_123",
      "amount": 10000
    }
  }
}
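
One possible way to consume this payload is to route on the `type` field. The sketch below assumes the payload shape shown above; the handler name `on_payment_succeeded` and the `HANDLERS` table are illustrative.

```python
import json


def on_payment_succeeded(obj):
    """Handle the object nested under data.object for a successful payment."""
    return f"payment {obj['id']} succeeded for amount {obj['amount']}"


# Map each event type we care about to its handler function.
HANDLERS = {
    "payment_intent.succeeded": on_payment_succeeded,
}


def dispatch(raw_body):
    """Parse the JSON body and call the handler registered for its type."""
    event = json.loads(raw_body)
    handler = HANDLERS.get(event["type"])
    if handler is None:
        # Unknown types are ignored rather than treated as errors, so the
        # provider can add new event types without breaking the receiver.
        return f"ignored {event['type']}"
    return handler(event["data"]["object"])
```

Supporting a new event type then means adding one function and one entry in `HANDLERS`.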

Production Considerations

  • Retries and Dead Letter Queues (DLQ): If a webhook payload has bad data (e.g., an order ID that doesn't exist in your DB), your background worker will fail. The Queue will retry it. If it keeps failing (say, 5 times), it should be moved to a DLQ, a separate queue for broken messages that engineers can manually inspect, so the main queue isn't blocked by poison messages.
  • Event Ordering: Webhooks are not guaranteed to arrive in order. You might receive subscription_canceled before you receive subscription_created. Your application logic must handle out-of-order events gracefully.
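
Both considerations can be illustrated with one small in-memory worker. This is a sketch, not a production design: a `deque` stands in for the main queue, a list stands in for the DLQ, the event field `subscription_id` is an assumption, and a real queue would wait between attempts instead of retrying immediately.

```python
from collections import deque

MAX_ATTEMPTS = 5      # matches the "5 times" example above
subscriptions = {}    # stand-in for the database: subscription id -> status


def handle(event):
    """Apply one event; raise if it cannot be applied (yet)."""
    kind = event["type"]
    sub_id = event["subscription_id"]
    if kind == "subscription_created":
        # setdefault keeps a later state if 'created' arrives late.
        subscriptions.setdefault(sub_id, "active")
    elif kind == "subscription_canceled":
        if sub_id not in subscriptions:
            # Out-of-order delivery: 'created' has not been applied yet.
            raise LookupError(f"unknown subscription {sub_id}")
        subscriptions[sub_id] = "canceled"
    else:
        raise ValueError(f"unsupported event type {kind}")


def drain(main_queue, dlq):
    """Process (event, attempts) pairs until the main queue is empty."""
    while main_queue:
        event, attempts = main_queue.popleft()
        try:
            handle(event)
        except Exception:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dlq.append(event)                     # park the poison message
            else:
                main_queue.append((event, attempts))  # retry later
```

Here a cancellation that arrives before its creation event fails once, is requeued, and succeeds on retry, while an event that can never succeed ends up in the DLQ after five attempts.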

Security Considerations

  • Spoofing: Anyone can send a POST request to your public webhook URL. Attackers can send fake payment.succeeded payloads to obtain items without paying. You MUST validate the cryptographic HMAC signature in the headers using a secret key shared only between you and the provider.
  • Replay Attacks: A valid signature does not stop an attacker from resending a captured request. Check the timestamp in the signature. If the webhook is more than 5 minutes older than the current server time, reject it.
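
Both checks can be sketched with Python's standard library. The sketch follows Stripe's documented scheme, in which the signed string is the timestamp, a dot, and the raw request body, hashed with HMAC-SHA256 using the endpoint secret. The function name, the `now` parameter (added to make the check testable), and the 300-second tolerance are assumptions for illustration. In production, prefer the provider's official SDK.

```python
import hashlib
import hmac
import time


def verify_signature(payload, header, secret, tolerance=300, now=None):
    """Return True if the header carries a fresh, valid v1 signature.

    payload: raw request body as bytes (before any JSON parsing)
    header:  value of the Stripe-Signature header, e.g. "t=...,v1=..."
    secret:  the endpoint's signing secret (str)
    """
    timestamp = None
    candidates = []
    for item in header.split(","):
        key, _, value = item.strip().partition("=")
        if key == "t":
            timestamp = value
        elif key == "v1":
            candidates.append(value)
    if timestamp is None or not timestamp.isdigit() or not candidates:
        return False

    # Replay protection: reject timestamps outside the tolerance window.
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:
        return False

    # Spoofing protection: recompute the HMAC over "<timestamp>.<body>".
    signed_payload = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()

    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return any(
        hmac.compare_digest(expected.encode(), candidate.encode())
        for candidate in candidates
    )
```

The signature must be computed over the raw bytes exactly as received. Re-serializing parsed JSON can change whitespace or key order and cause valid requests to fail verification.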

Common Mistakes

  • Synchronous Processing: Doing API calls, database writes, and email sending inside the HTTP handler.
  • Ignoring Retries: Assuming an event will only ever be sent once. If the network hiccups, Stripe will send it again. If your code isn't idempotent, you will credit the user's account twice.
  • Missing Signatures: Accepting webhooks without verifying the cryptographic signature.

Tradeoffs and Alternatives

  • Webhooks vs. Polling: Webhooks are vastly more efficient but require your server to be publicly accessible on the internet and highly available. If your server is deep behind a corporate firewall, you might be forced to use Polling (fetching data on a cron job).

Interview Questions

  1. Explain the architectural difference between Polling and Webhooks.
  2. Why is it important to use a Message Queue when receiving webhooks from a high-traffic service like Shopify or Stripe?
  3. What is Idempotency, and how would you design a database table to ensure a webhook event is only processed once?

Hands-On Exercise

  1. Look up the documentation for "Stripe Webhook Signatures."
  2. Write down the conceptual steps for validating an HMAC SHA256 signature (What data do you combine? What secret key do you use? What do you compare it against?).
  3. Create a free account on webhook.site. It gives you a unique URL. Paste that URL into a service you use (like GitHub or a mock Stripe account) and watch the raw HTTP requests come in.

Key Takeaways

  • Webhooks are the nervous system of distributed e-commerce architecture.
  • Webhook endpoints must be fast. Validate, push to a Queue, and return 200 OK immediately.
  • Never trust a webhook without validating its cryptographic signature.
  • Webhook processing logic must be idempotent to handle retries safely.

Further Reading

  • Stripe Documentation: Best Practices for Webhooks
  • Enterprise Integration Patterns: Message Channels and Queues