Node.js Interview Prep
Production and Scaling

Process Management

Clustering, PM2, and Graceful Shutdown

LinkedIn Hook

"Your Node.js server is using 12% CPU on a 16-core machine -- and you wonder why it's slow."

Node.js runs on a single thread by default. That means no matter how powerful your server is, your app is using exactly one CPU core. The other fifteen are sitting idle, watching your event loop sweat.

Worse: when you kill the process to deploy a new version, every in-flight request dies mid-response. Users see 502s. Database transactions get orphaned. Logs fill with ECONNRESET.

Production-grade Node.js requires three things most tutorials skip: clustering (one worker per core), a process manager like PM2 (auto-restart, zero-downtime reload), and graceful shutdown (drain in-flight work before exit).

Get these wrong and your "scalable" Node app falls over the first time traffic spikes. Get them right and a single 16-core box can handle what most teams reach for Kubernetes to solve.

In Lesson 10.1, I break down the cluster module, PM2 ecosystem files, SIGTERM handling, and when clustering beats horizontal scaling -- the way senior interviewers want to hear it.

Read the full lesson -> [link]

#NodeJS #Backend #DevOps #PM2 #Scaling #InterviewPrep


What You'll Learn

  • Why a single Node process leaves most of your server idle
  • How the built-in cluster module forks workers and shares a port
  • How incoming connections are distributed to workers (primary round-robin by default, OS-level on Windows)
  • What PM2 adds on top of cluster -- auto-restart, logs, ecosystem files, zero-downtime reload
  • How to handle SIGTERM properly: stop accepting, finish in-flight, close DB pools
  • When clustering is enough -- and when you should reach for Kubernetes instead
  • How load balancers use health checks to avoid routing to dying processes

The Coffee Shop Analogy — One Cashier vs A Full Counter

Imagine a coffee shop with one cashier. No matter how many baristas you hire in the back, every customer must funnel through that single cashier first. If she pauses to count change, the entire line freezes. That cashier is your Node.js event loop.

Now picture the same shop with four cashiers, each at their own register. A manager out front waves customers to whichever register is free. Throughput quadruples without changing anything about the baristas. That's clustering.

But there's a catch: if one cashier suddenly faints, the manager has to notice instantly and swap in a fresh employee from the break room -- without dropping the customer she was already serving. That swap-without-dropping behavior is what PM2 and graceful shutdown give you together.

And finally: if a single shop with four cashiers still can't keep up on Black Friday, you don't add a fifth register -- you open a second store across town and put a sign out front telling people which one is closer. That's horizontal scaling with Kubernetes.

+----------------------------------------------------------------+
|              SINGLE PROCESS (The Default Problem)              |
+----------------------------------------------------------------+
|                                                                |
|   16-core server                                               |
|                                                                |
|   [CPU0] <- node server.js  (100% busy)                        |
|   [CPU1]    idle                                               |
|   [CPU2]    idle                                               |
|   [CPU3]    idle                                               |
|   ...                                                          |
|   [CPU15]   idle                                               |
|                                                                |
|   Result: 1/16 = 6.25% theoretical max utilization             |
|                                                                |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
|              CLUSTERED PROCESS (The Solution)                  |
+----------------------------------------------------------------+
|                                                                |
|                  +-------------------+                         |
|                  |   MASTER PROCESS  |                         |
|                  |  (no HTTP traffic)|                         |
|                  +---------+---------+                         |
|                            |                                   |
|        +-------+-----------+-----------+-------+               |
|        |       |           |           |       |               |
|     [CPU0]  [CPU1]      [CPU2]      [CPU3]  [CPU4...]          |
|     WORKER  WORKER      WORKER      WORKER  WORKER             |
|       ^       ^           ^           ^       ^                |
|       |       |           |           |       |                |
|       +-------+--primary round-robin--+-------+               |
|                            |                                   |
|                       :3000 (one shared port)                  |
|                                                                |
+----------------------------------------------------------------+

The cluster Module — Node's Built-In Multiprocess Primitive

Node ships with a cluster module that lets a single primary process spawn N child processes (workers), each running the same script. All workers share the same listening port, and incoming TCP connections are distributed among them -- by the primary itself on most platforms, or by the OS on Windows.

The default scheduling policy on every platform except Windows (since Node 0.12) is round-robin: the primary process owns the listening socket, accepts incoming connections, and hands each one to a worker over IPC. On Windows the primary instead passes the listen handle to the workers, which accept connections directly, leaving the distribution to the OS. Either way, your application code doesn't change -- the workers just call app.listen(3000) and Node sorts it out.

Forking One Worker Per CPU Core

// server.js
// Built-in modules: no dependencies required
const cluster = require('node:cluster');
const os = require('node:os');
const http = require('node:http');
const process = require('node:process');

// Number of logical CPU cores available to this process
const numCPUs = os.availableParallelism(); // Node 18.14+ recommended over os.cpus()

if (cluster.isPrimary) {
  // ----- MASTER PROCESS -----
  console.log(`Primary ${process.pid} starting ${numCPUs} workers`);

  // Fork one worker per CPU core
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }

  // If a worker dies unexpectedly, replace it immediately
  cluster.on('exit', (worker, code, signal) => {
    console.log(`Worker ${worker.process.pid} died (${signal || code}). Restarting...`);
    cluster.fork();
  });
} else {
  // ----- WORKER PROCESS -----
  // Every worker creates its own HTTP server listening on the SAME port.
  // By default the primary accepts connections and round-robins them to
  // workers over IPC (on Windows, workers accept directly and the OS decides).
  http
    .createServer((req, res) => {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end(`Handled by worker ${process.pid}\n`);
    })
    .listen(3000);

  console.log(`Worker ${process.pid} listening on :3000`);
}

Run it and curl it a few times:

$ node server.js
Primary 12001 starting 8 workers
Worker 12002 listening on :3000
Worker 12003 listening on :3000
...

$ for i in 1 2 3 4; do curl -s localhost:3000; done
Handled by worker 12002
Handled by worker 12005
Handled by worker 12003
Handled by worker 12004

Each request lands on a different PID. You're now using all your cores.

Important: workers do not share memory. In-process caches, rate-limiter counters, and WebSocket session maps must move to Redis (or another shared store) the moment you cluster. This is the single biggest gotcha when migrating from single-process to clustered Node.
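To make that concrete, here is a sketch of a fixed-window rate limiter written against a Redis-style INCR/PEXPIRE interface instead of a per-process Map. The in-memory store stub, key format, and limits are illustrative -- with a real client such as ioredis you would pass the client itself, since incr and pexpire are actual Redis commands.

```javascript
// rate-limit-shared.js
// Fixed-window rate limiter against a Redis-like store (INCR + PEXPIRE),
// so every cluster worker sees the same counters. The in-memory stub
// below exists only to make this sketch runnable without a Redis server.
function makeMemoryStore() {
  const data = new Map(); // key -> { count, expiresAt }
  return {
    async incr(key) {
      const now = Date.now();
      const entry = data.get(key);
      if (!entry || entry.expiresAt <= now) {
        data.set(key, { count: 1, expiresAt: Infinity });
        return 1;
      }
      entry.count += 1;
      return entry.count;
    },
    async pexpire(key, ms) {
      const entry = data.get(key);
      if (entry) entry.expiresAt = Date.now() + ms;
    },
  };
}

// Returns true if the caller is within `limit` requests per `windowMs`.
async function allowRequest(store, clientId, limit, windowMs) {
  const key = `ratelimit:${clientId}:${Math.floor(Date.now() / windowMs)}`;
  const count = await store.incr(key);
  if (count === 1) await store.pexpire(key, windowMs); // first hit sets the TTL
  return count <= limit;
}
```

Because the counter lives in the store rather than in process memory, forking eight workers does not multiply your effective rate limit by eight.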


PM2 — The Production Process Manager

The cluster module gives you the mechanism, but it doesn't give you the operations: log aggregation, restart-on-crash limits, startup-on-boot, zero-downtime reloads, memory thresholds, or a CLI to inspect what's running. That's PM2.

PM2 wraps your app, runs it in cluster mode automatically (no need to write the master/worker code yourself), and exposes commands like pm2 list, pm2 logs, pm2 reload, and pm2 monit.

ecosystem.config.js — One File To Describe Production

// ecosystem.config.js
// PM2 reads this file when you run `pm2 start ecosystem.config.js --env production`
module.exports = {
  apps: [
    {
      name: 'api',                         // Logical app name in `pm2 list`
      script: './dist/server.js',          // Entry point (compiled output for TS)
      instances: 'max',                    // 'max' = one worker per CPU core
      exec_mode: 'cluster',                // Use Node's cluster module under the hood
      watch: false,                        // Never enable in production (reload storm)
      max_memory_restart: '500M',          // Restart any worker that exceeds 500 MB RSS
      kill_timeout: 5000,                  // Give SIGTERM 5s before sending SIGKILL
      wait_ready: true,                    // Wait for process.send('ready') before marking up
      listen_timeout: 10000,               // Max time to wait for 'ready' signal
      max_restarts: 10,                    // Stop restarting after 10 crashes in a row
      min_uptime: '30s',                   // A worker must live 30s to count as "stable"
      env: {
        NODE_ENV: 'development',
        PORT: 3000,
      },
      env_production: {
        NODE_ENV: 'production',
        PORT: 3000,
        LOG_LEVEL: 'info',
      },
    },
  ],
};

Common PM2 commands:

pm2 start ecosystem.config.js --env production   # Start in cluster mode
pm2 list                                         # See all processes + memory + CPU
pm2 logs api                                     # Tail aggregated worker logs
pm2 reload api                                   # Zero-downtime rolling restart
pm2 restart api                                  # Hard restart (drops in-flight requests)
pm2 stop api                                     # Stop but keep in process list
pm2 delete api                                   # Remove from PM2 entirely
pm2 startup                                      # Generate systemd unit for boot
pm2 save                                         # Persist current process list

The critical distinction: reload is rolling and zero-downtime. restart is not. Reload kills workers one at a time, waits for the replacement to send 'ready', then moves to the next. Restart kills them all at once.


Graceful Shutdown — The Most Skipped Production Skill

When PM2 (or Kubernetes, or systemd) wants to stop your process, it sends SIGTERM first and waits a few seconds before escalating to SIGKILL. Those few seconds are your only chance to:

  1. Stop accepting new connections (server.close())
  2. Finish requests already in flight
  3. Drain the database connection pool (pool.end())
  4. Flush logs and metrics
  5. Exit cleanly with code 0

If you skip this, every deploy drops a handful of requests, leaks DB connections, and corrupts long-running transactions. Users see intermittent 502s that nobody can reproduce locally.

A Production-Quality SIGTERM Handler

// graceful.js
// Wire up a real graceful shutdown around an Express app + Postgres pool
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

app.get('/users/:id', async (req, res) => {
  const { rows } = await pool.query('SELECT * FROM users WHERE id = $1', [req.params.id]);
  res.json(rows[0]);
});

const server = app.listen(3000, () => {
  console.log(`Worker ${process.pid} ready on :3000`);
  // Tell PM2 we are ready to accept traffic (requires wait_ready: true)
  if (process.send) process.send('ready');
});

// Track whether shutdown is already in progress to avoid double-handling
let shuttingDown = false;

async function shutdown(signal) {
  if (shuttingDown) return;
  shuttingDown = true;
  console.log(`[${process.pid}] ${signal} received -- starting graceful shutdown`);

  // Hard cap: if shutdown takes longer than 10s, force exit.
  // PM2 will SIGKILL us at kill_timeout anyway -- we want to beat it.
  const forceExit = setTimeout(() => {
    console.error(`[${process.pid}] Graceful shutdown timed out -- forcing exit`);
    process.exit(1);
  }, 10_000);
  forceExit.unref(); // Don't keep the event loop alive just for this timer

  try {
    // 1. Stop accepting NEW connections, then wait for in-flight
    //    requests to finish -- before touching anything downstream.
    await new Promise((resolve, reject) => {
      server.close((err) => (err ? reject(err) : resolve()));
    });
    console.log(`[${process.pid}] HTTP server closed`);

    // 2. Drain the Postgres pool (waits for checked-out clients to be released)
    await pool.end();
    console.log(`[${process.pid}] Database pool drained`);

    // 3. Clean exit
    clearTimeout(forceExit);
    process.exit(0);
  } catch (err) {
    console.error('Shutdown error:', err);
    process.exit(1);
  }
}

// Both signals matter: SIGTERM from PM2/k8s, SIGINT from Ctrl+C in dev
process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

// Crash on truly unexpected errors -- do NOT try to keep running
process.on('uncaughtException', (err) => {
  console.error('uncaughtException:', err);
  shutdown('uncaughtException');
});

Why this order matters: if you close the DB pool before server.close() finishes, in-flight requests will fail with Cannot use a pool after calling end on the pool. Always close the inbound layer first, the outbound layer last.


Zero-Downtime Reload — How pm2 reload Actually Works

A rolling reload is what makes deploys invisible to users. Here's the sequence PM2 (and the cluster module) execute when you run pm2 reload api:

+----------------------------------------------------------------+
|                 ZERO-DOWNTIME RELOAD SEQUENCE                  |
+----------------------------------------------------------------+
|                                                                |
|  Initial state: 4 workers (W1, W2, W3, W4) all serving        |
|                                                                |
|  Step 1: Fork NEW worker W1' with the new code                 |
|  Step 2: Wait for W1' to send process.send('ready')            |
|  Step 3: Send SIGTERM to old W1                                |
|  Step 4: Old W1 stops accepting, drains, exits cleanly         |
|  Step 5: Repeat for W2, W3, W4 -- ONE AT A TIME                |
|                                                                |
|  At every moment >=3 of 4 workers are accepting traffic.       |
|  Total user-visible downtime: 0 ms.                            |
|                                                                |
+----------------------------------------------------------------+

The two requirements that make this work:

  1. wait_ready: true in ecosystem.config.js -- so PM2 knows when the new worker is genuinely ready, not just spawned.
  2. A working SIGTERM handler in your app -- so the old worker actually drains instead of getting killed mid-request.

If either is missing, "zero-downtime" reload becomes "small-downtime" reload, and you'll see it in your error rate graph during every deploy.

// minimal-ready-signal.js
// The `ready` signal that PM2's wait_ready is waiting for
const express = require('express');
const app = express();

const server = app.listen(process.env.PORT, () => {
  // Run any startup checks BEFORE signaling ready:
  //   - Verify DB connectivity
  //   - Warm up caches
  //   - Load feature flags
  // Only THEN tell PM2 we're safe to receive traffic.
  if (process.send) {
    process.send('ready');
  }
});

Health Checks — Telling The Load Balancer You're Alive

Whether you're behind PM2's built-in proxy, an Nginx reverse proxy, or a Kubernetes service, the load balancer needs a way to ask "are you healthy?" before sending traffic. The convention is two endpoints:

// health.js
// Two endpoints with very different meanings -- do not collapse them into one
const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

let isShuttingDown = false;
let isReady = false;

// LIVENESS: "Is this process alive at all?"
// Should return 200 unless the process is fundamentally broken.
// k8s restarts the pod if this fails repeatedly.
app.get('/healthz', (req, res) => {
  res.status(200).json({ status: 'alive', pid: process.pid });
});

// READINESS: "Should I send you traffic right now?"
// Returns 503 during startup AND during shutdown.
// k8s removes the pod from the service endpoints when this fails.
app.get('/readyz', async (req, res) => {
  if (isShuttingDown) {
    return res.status(503).json({ status: 'draining' });
  }
  if (!isReady) {
    return res.status(503).json({ status: 'starting' });
  }
  try {
    // Optional: ping downstream dependencies
    await pool.query('SELECT 1');
    res.status(200).json({ status: 'ready' });
  } catch {
    res.status(503).json({ status: 'db-unreachable' });
  }
});

// Mark ready only after warm-up completes.
// warmCaches() is a stand-in for your app-specific warm-up routine.
async function startup() {
  await warmCaches();
  await pool.query('SELECT 1');
  isReady = true;
}

// Flip the readiness flag the instant SIGTERM arrives,
// so the LB stops sending new requests immediately.
process.on('SIGTERM', () => {
  isShuttingDown = true;
  // ... then the rest of the graceful shutdown logic
});

The trick most people miss: flip isShuttingDown = true before you start closing things. That way the load balancer's next health check (often within 1-2 seconds) removes you from rotation, and no new requests arrive while you drain. Without this, you race the load balancer for the duration of your kill_timeout.
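The ordering can be sketched as a small sequence, with the flag flip synchronous and first. The hook names below are placeholders for your real server and pool handles, not part of any API:

```javascript
// shutdown-order.js
// Demonstrates the required ordering: flip the readiness flag FIRST,
// then close the HTTP layer, then drain downstream resources.
async function gracefulShutdown(state, hooks) {
  state.isShuttingDown = true;   // 1. /readyz now returns 503 -- LB drains us
  await hooks.closeHttpServer(); // 2. stop accepting, finish in-flight requests
  await hooks.closeDbPool();     // 3. only now is it safe to drain the pool
  return 0;                      // 4. caller does process.exit(0)
}
```

Because the flag assignment happens before the first await, it takes effect synchronously the instant the signal handler runs -- the load balancer's very next poll sees 503 while draining is still in progress.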


Clustering vs Horizontal Scaling — When To Use Which

Clustering scales you vertically within a box. Horizontal scaling (Kubernetes, ECS, multiple VMs behind a load balancer) scales you across boxes. They're complementary, not alternatives.

+----------------------------------------------------------------+
|                CLUSTERING (single box)                         |
+----------------------------------------------------------------+
|                                                                |
|   +------------------ One Server (16 cores) -----------------+ |
|   |                                                          | |
|   |   [PM2 master]                                           | |
|   |       |                                                  | |
|   |   W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 W11 W12 W13 W14 W15 W16 | |
|   |                                                          | |
|   +----------------------------------------------------------+ |
|                                                                |
|   Pros: cheap, simple, no network hops, shared local disk     |
|   Cons: single point of failure, capped at one machine        |
|                                                                |
+----------------------------------------------------------------+

+----------------------------------------------------------------+
|              HORIZONTAL SCALING (multiple boxes / k8s)         |
+----------------------------------------------------------------+
|                                                                |
|                        [Load Balancer]                         |
|                              |                                 |
|         +---------+----------+----------+---------+            |
|         |         |          |          |         |            |
|     [Pod 1]   [Pod 2]    [Pod 3]    [Pod 4]   [Pod 5]          |
|     1 worker  1 worker   1 worker   1 worker  1 worker         |
|                                                                |
|   Pros: HA, multi-AZ, autoscaling, rolling deploys baked in   |
|   Cons: network latency, complexity, k8s operational cost     |
|                                                                |
+----------------------------------------------------------------+

Rule of thumb:

| Situation                               | Use clustering | Use horizontal scaling |
|-----------------------------------------|----------------|------------------------|
| Single VM, < 100 RPS, hobby project     | Yes            | No                     |
| One beefy box, no HA requirement        | Yes            | No                     |
| Need zero-downtime deploys              | Yes (PM2)      | Yes (k8s)              |
| Need multi-AZ failover                  | No             | Yes                    |
| Traffic exceeds one machine's capacity  | No             | Yes                    |
| Inside a k8s pod                        | No             | Yes                    |

The last row is the one interviewers love. Inside Kubernetes you typically run one Node worker per pod, NOT cluster mode. Kubernetes is already handling the "many processes, load-balanced, restart on crash" job. Running cluster mode inside a pod hides per-worker metrics from k8s, doubles memory usage, and prevents the scheduler from packing pods efficiently. Let one tool do one job.


Common Mistakes

  • Sharing in-memory state between workers. Rate limit counters, session caches, and WebSocket maps stop working the moment you fork. Move them to Redis.
  • Skipping wait_ready. Without it, PM2 marks the new worker "up" the instant it spawns, before it has connected to the DB or warmed caches. Reloads briefly route traffic to a not-actually-ready process.
  • Closing the DB pool before server.close() finishes. In-flight requests crash with Cannot use a pool after calling end.
  • No kill_timeout budget. Default is 1.6s -- not enough for a real drain. Bump to 5-10s and make sure your in-app shutdown timeout is shorter than PM2's, so you exit cleanly first.
  • Running cluster mode inside Kubernetes. Doubles memory, hides metrics, fights the scheduler. One worker per pod.
  • Using pm2 restart in deploy scripts. It drops in-flight requests. Always pm2 reload.
  • Conflating /healthz and /readyz. Liveness is "am I alive?", readiness is "should I get traffic?" -- they have very different failure semantics.
  • Forgetting that scheduling differs by platform. On Windows workers accept connections directly and the OS distributes them; everywhere else the primary round-robins over IPC. Performance characteristics differ; test on your target OS.
  • Trusting uncaughtException to keep the process running. It can't. Once it fires, your app is in an undefined state. Log, drain, exit. Let PM2 fork a fresh worker.
  • Setting instances: 'max' on a 64-core box without checking memory. Each Node worker easily costs 100-300 MB. 64 workers x 250 MB = 16 GB just for Node.
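One way to guard against that last mistake is to compute the instance count from both core count and a memory budget instead of hardcoding 'max'. This is a sketch; the 250 MB per-worker figure is an assumed budget, not a measurement of your app:

```javascript
// worker-count.js
// Pick a worker count bounded by BOTH core count and a memory budget.
function workerCount(cores, totalMemBytes, perWorkerBytes) {
  const byMemory = Math.floor(totalMemBytes / perWorkerBytes);
  return Math.max(1, Math.min(cores, byMemory)); // never fewer than 1
}

// Example wiring (perWorkerBytes is an assumed RSS budget -- measure yours):
// const os = require('node:os');
// workerCount(os.availableParallelism(), os.totalmem(), 250 * 1024 * 1024);
```

On a 64-core box with 8 GB of RAM and a 250 MB budget, memory wins and you get 32 workers, not 64.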

Interview Questions

1. Why does Node.js need clustering at all? Doesn't the event loop already handle concurrency?

The event loop handles I/O concurrency on a single thread, but it cannot use more than one CPU core for JavaScript execution. Any CPU-bound work (JSON parsing, template rendering, crypto, compression) blocks the loop and starves every other request. Clustering forks N processes so the OS can schedule them across N cores. It's the only way a pure-Node app uses a multi-core machine for compute.
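You can observe the starvation directly with nothing beyond the standard library: a synchronous busy loop delays a timer that was due in 1 ms by the full length of the loop. The 200 ms figure in the usage comment is arbitrary:

```javascript
// block-demo.js
// A synchronous CPU-bound loop starves the event loop: a timer due in
// ~1 ms cannot fire until the loop yields back to the runtime.
function demoBlocking(busyMs) {
  return new Promise((resolve) => {
    const scheduledAt = Date.now();
    setTimeout(() => resolve(Date.now() - scheduledAt), 1); // due in ~1 ms

    const end = Date.now() + busyMs;
    while (Date.now() < end) {} // synchronous work -- nothing else runs
  });
}

// demoBlocking(200).then((delay) => console.log(`timer fired after ${delay} ms`));
```

The resolved delay comes out near busyMs rather than 1 ms -- exactly the effect that makes CPU-bound work poisonous on a single worker, and clustering (or worker_threads) necessary.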

2. How does the cluster module distribute incoming connections across workers?

By default on every platform except Windows (since Node 0.12), the primary process holds the listening socket, accepts each connection, and distributes them to workers round-robin over IPC -- this is cluster.SCHED_RR. On Windows the default is SCHED_NONE: the primary creates the listen handle and passes it to the workers, which accept connections directly while the OS decides the distribution. You can switch policies via cluster.schedulingPolicy or the NODE_CLUSTER_SCHED_POLICY environment variable, but the round-robin default exists because leaving the choice to the OS tends to spread load unevenly across workers.

3. What's the difference between pm2 restart and pm2 reload?

restart kills every worker and starts new ones -- in-flight requests die and there's a brief window with zero workers serving. reload is a rolling restart: PM2 forks a new worker with the new code, waits for it to signal ready, sends SIGTERM to one old worker, waits for it to drain, then moves to the next. With wait_ready: true and a real SIGTERM handler, reload is genuinely zero-downtime. restart is not.

4. Walk me through what should happen when your Node service receives SIGTERM in production.

Five steps in order: (1) flip a isShuttingDown flag so /readyz starts returning 503 and the load balancer drains you out of rotation; (2) call server.close() to stop accepting new connections while letting in-flight ones finish; (3) wait for outstanding requests to complete, with a hard timeout (typically 5-10s) shorter than your orchestrator's kill_timeout; (4) drain external resources -- close DB pools, flush log buffers, close message queue connections; (5) process.exit(0). Never close the DB before HTTP, or in-flight queries crash.

5. Should I run PM2 cluster mode inside a Kubernetes pod?

No. Kubernetes already provides exactly what cluster mode provides -- multiple processes, load-balanced traffic, automatic restart on crash -- but at the pod level, where it can also do rolling deploys, multi-node scheduling, and autoscaling. Running cluster mode inside a pod doubles memory, hides per-worker CPU/memory metrics from the kubelet, and prevents the scheduler from bin-packing efficiently. The standard pattern is one Node worker per pod and let k8s scale the number of pods. Cluster mode + PM2 makes sense on bare VMs, single-box deploys, and traditional PaaS hosts -- not inside k8s.


Cheat Sheet

+----------------------------------------------------------------+
|              NODE.JS PROCESS MANAGEMENT CHEAT SHEET            |
+----------------------------------------------------------------+
|                                                                |
|  CLUSTER MODULE                                                |
|    require('node:cluster')                                     |
|    cluster.isPrimary -> master branch                          |
|    cluster.fork()    -> spawn worker                           |
|    cluster.on('exit', ...) -> respawn dead workers             |
|    os.availableParallelism() -> worker count                   |
|                                                                |
|  PM2 ESSENTIALS                                                |
|    pm2 start ecosystem.config.js --env production              |
|    pm2 reload api      <- ZERO-DOWNTIME, use in deploys        |
|    pm2 restart api     <- HARD restart, drops requests         |
|    pm2 logs api        <- aggregated worker logs               |
|    pm2 monit           <- live CPU/mem dashboard               |
|    pm2 startup && pm2 save  <- survive reboots                 |
|                                                                |
|  ECOSYSTEM.CONFIG.JS MUST-HAVES                                |
|    instances: 'max'                                            |
|    exec_mode: 'cluster'                                        |
|    wait_ready: true                                            |
|    listen_timeout: 10000                                       |
|    kill_timeout: 5000                                          |
|    max_memory_restart: '500M'                                  |
|    max_restarts: 10                                            |
|    min_uptime: '30s'                                           |
|                                                                |
|  GRACEFUL SHUTDOWN ORDER                                       |
|    1. isShuttingDown = true  (readyz -> 503)                   |
|    2. server.close()         (stop new connections)            |
|    3. wait for in-flight requests                              |
|    4. await pool.end()       (drain DB)                        |
|    5. process.exit(0)                                          |
|    +  setTimeout(forceExit, 10_000).unref() as safety net      |
|                                                                |
|  HEALTH ENDPOINTS                                              |
|    GET /healthz -> liveness  (am I alive?)                     |
|    GET /readyz  -> readiness (should I get traffic?)           |
|                                                                |
|  WHEN TO CLUSTER vs SCALE HORIZONTALLY                         |
|    Single box, no HA need        -> cluster (PM2)              |
|    Multi-AZ, autoscale, k8s      -> 1 worker per pod           |
|    NEVER cluster inside a pod                                  |
|                                                                |
|  TOP GOTCHAS                                                   |
|    - Workers do not share memory -> use Redis                  |
|    - kill_timeout default (1.6s) is too short                  |
|    - Close HTTP BEFORE closing DB, never the reverse           |
|    - reload != restart                                         |
|    - uncaughtException = drain and die, never keep running     |
|                                                                |
+----------------------------------------------------------------+


This is Lesson 10.1 of the Node.js Interview Prep Course -- 10 chapters, 42 lessons.
