Logging and Monitoring
Pino, Correlation IDs, Health Checks, and APM
LinkedIn Hook
"Your Node.js app is on fire in production. You SSH into the box, run pm2 logs, and stare at 40,000 lines of console.log('here') with no timestamps, no request IDs, and no way to tell which log line belongs to which user."
Logging is the single most undervalued skill in backend engineering. Teams ship apps that log everything as plain strings, lose the connection between a user's request and the seventeen downstream calls it triggered, and discover their /health endpoint returns 200 even when the database has been down for an hour.
The fix is not exotic. Use a structured logger like Pino that emits JSON. Attach a correlation ID to every request via AsyncLocalStorage so you can trace a single transaction across services. Split your health endpoints into liveness (/healthz) and readiness (/readyz) so Kubernetes knows when to restart you versus when to stop sending traffic. Plug in an APM tool (Datadog, New Relic, OpenTelemetry) so you can actually answer "why is p99 latency 4 seconds?" without grepping log files.
In Lesson 10.2, I break down production logging and monitoring in Node.js: Pino vs Winston, log levels, correlation IDs, liveness/readiness probes, event loop lag, and APM auto-instrumentation.
Read the full lesson -> [link]
#NodeJS #Observability #Logging #Monitoring #DevOps #InterviewPrep
What You'll Learn
- Why structured JSON logs beat plain strings, and how Pino achieves them with near-zero overhead
- The Pino vs Winston tradeoff: speed and simplicity versus configurability and transports
- The five standard log levels (fatal, error, warn, info, debug) and when to use each
- Correlation IDs propagated through AsyncLocalStorage so every log line carries the request context
- The difference between a /healthz liveness probe and a /readyz readiness probe in Kubernetes
- Process metrics that matter: event loop lag, RSS memory, heap used, and CPU percent
- APM tools (Datadog, New Relic, OpenTelemetry) and how auto-instrumentation works without touching your code
The Air Traffic Control Analogy — Why Observability Is Not Optional
Imagine an air traffic controller staring at a radar screen with no labels. Every dot is a plane, but none of them have a flight number, an altitude, a heading, or a destination. Two planes are about to collide and the controller has no idea which two, where they are going, or who to radio. That is what running a Node.js app with console.log('done') looks like — you have noise, but no signal.
Now picture a real control tower. Every plane has a transponder broadcasting a unique ID, every screen shows altitude and heading, every runway has a sensor reporting status, and the moment something is wrong, an alarm fires with the exact flight number and the exact problem. Controllers can rewind the radar tape, replay any flight, and see which controller handled the handoff. That is what a properly instrumented Node.js service looks like — structured logs, correlation IDs, health endpoints, and metrics all tied together.
The job of logging and monitoring is not to "have logs." It is to answer three questions in under a minute: Is the service healthy? Which request failed? What was happening when it failed? Every decision you make — the logger you pick, the format you emit, the IDs you propagate, the endpoints you expose — should be in service of those three questions.
+---------------------------------------------------------------+
| UNSTRUCTURED LOGS (The Problem) |
+---------------------------------------------------------------+
| |
| console.log("user logged in"); |
| console.log("processing order"); |
| console.log("error: " + err.message); |
| |
| Output: |
| user logged in |
| processing order |
| error: connection refused |
| |
| Problems: |
| - No timestamps, no levels, no request ID |
| - Cannot filter, search, or aggregate |
| - Cannot tell which "processing order" belongs to whom |
| - Stack traces are lost or mangled |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| STRUCTURED LOGS (The Fix) |
+---------------------------------------------------------------+
| |
| log.info({ userId: 42, reqId: "abc" }, "user logged in"); |
| |
| Output (one JSON object per line): |
| {"level":30,"time":1712345678901,"reqId":"abc", |
| "userId":42,"msg":"user logged in"} |
| |
| Benefits: |
| - Machine-parseable -> Elasticsearch, Loki, Datadog |
| - Filter by reqId to see one user's full journey |
| - Aggregate by level, userId, route, status |
| - Stack traces serialized as structured fields |
| |
+---------------------------------------------------------------+
Napkin AI Visual Prompt: "Dark gradient (#0a1a0a -> #0d2e16). Split comparison: LEFT side labeled 'console.log' shows messy unstructured strings in red and gray. RIGHT side labeled 'Pino JSON' shows the same events as clean JSON objects with green (#68a063) field names and amber (#ffb020) values. An amber arrow at the bottom points from both into a 'log aggregator' icon (Loki/ELK/Datadog) in white monospace."
Pino vs Winston — Picking a Structured Logger
Two libraries dominate Node.js logging: Pino and Winston. Both produce structured JSON. The difference is philosophy.
Pino is engineered for speed. It serializes log objects with a hand-rolled JSON stringifier, writes directly to stdout as a Node stream, and offloads pretty-printing and transport to a separate worker thread. The result is roughly 5x faster than Winston in synchronous benchmarks, and it puts almost no pressure on the event loop. The API is intentionally minimal: child loggers, log levels, serializers, and that is mostly it.
Winston is engineered for flexibility. It supports many transports out of the box (file, HTTP, syslog, MongoDB, custom), has a richer formatting pipeline, and lets you mix and match how logs are routed at every level. The cost is overhead: more allocations per log call, more code paths, and a more invasive API surface.
The rule of thumb: start with Pino. It is faster, simpler, and the JSON-first design plays well with modern log aggregators that all expect JSON over stdout anyway. Reach for Winston only if you need a transport Pino does not have, or if your team already standardized on it.
// logger.js
// Production-grade Pino setup. Single instance, exported once,
// imported everywhere. Never call require('pino')() in multiple files.
const pino = require('pino');
// Base logger configuration
const logger = pino({
// Log level read from env, defaulting to info in prod and debug in dev
level: process.env.LOG_LEVEL || (process.env.NODE_ENV === 'production' ? 'info' : 'debug'),
// Use ISO timestamps instead of the default epoch ms — easier to read
// and what most log aggregators expect
timestamp: pino.stdTimeFunctions.isoTime,
// Serializers turn rich objects (Error, req, res) into safe JSON.
// Without these, an Error logs as "{}" because its own properties
// are non-enumerable.
serializers: {
err: pino.stdSerializers.err,
req: pino.stdSerializers.req,
res: pino.stdSerializers.res,
},
// Redact sensitive fields so they never reach disk or aggregator.
// Paths use dot notation; supports wildcards.
redact: {
paths: ['req.headers.authorization', 'req.headers.cookie', '*.password', '*.token'],
censor: '[REDACTED]',
},
// Add static metadata to every log line — service name, version,
// environment. Invaluable when many services share a log pipeline.
base: {
service: 'orders-api',
version: process.env.APP_VERSION || 'dev',
env: process.env.NODE_ENV || 'development',
},
});
// Use logger.info, logger.warn, logger.error, etc.
// Pass an object as the FIRST argument and the message as the second.
// logger.info({ userId: 42 }, 'user logged in')
module.exports = logger;
Pino's standard log levels map directly to numeric values that aggregators can filter on:
+---------------------------------------------------------------+
| LOG LEVELS — WHEN TO USE WHICH |
+---------------------------------------------------------------+
| |
| fatal (60) Process is about to exit. Unrecoverable. |
| Example: cannot connect to DB at startup. |
| |
| error (50) Operation failed. Needs investigation. |
| Example: payment provider returned 500. |
| |
| warn (40) Unexpected but recoverable. |
| Example: retry succeeded on second attempt. |
| |
| info (30) Normal lifecycle events. Business signals. |
| Example: order created, user signed up. |
| |
| debug (20) Verbose detail for development only. |
| Example: SQL query text, intermediate state. |
| |
| trace (10) Per-line execution detail. Almost never on. |
| |
| RULE: Production runs at 'info'. Crank to 'debug' only for |
| short, targeted investigations — debug logs are expensive. |
| |
+---------------------------------------------------------------+
Correlation IDs with AsyncLocalStorage
The single most useful thing you can add to a logging setup is a correlation ID (also called a request ID or trace ID) — a unique string generated when a request enters your service and attached to every log line emitted while handling it. With correlation IDs, you can paste one ID into your log search and instantly see every line for that one request, including any background work it triggered. Without them, you are grepping by timestamp and praying.
The challenge in Node.js is that callbacks and await boundaries break naive context passing. You cannot just attach the ID to a global variable, because two requests overlap in time. You also cannot pass the ID as an argument to every function, because that pollutes every signature in the codebase.
The solution is AsyncLocalStorage — a built-in Node module that provides per-async-context storage, the JavaScript equivalent of thread-local storage. Set the ID at the start of a request, and any code that runs inside that request (no matter how many awaits deep) can read it back.
// request-context.js
// Per-request context stored via AsyncLocalStorage. Every async operation
// triggered inside store.run() inherits the same store, even across
// await boundaries and setTimeout calls.
const { AsyncLocalStorage } = require('node:async_hooks');
const storage = new AsyncLocalStorage();
// Wrap a function so it executes inside a fresh context with the given ID
function runWithContext(context, fn) {
return storage.run(context, fn);
}
// Read the current context from anywhere in the call stack
function getContext() {
return storage.getStore();
}
module.exports = { runWithContext, getContext };
// correlation-middleware.js
// Express middleware that assigns a correlation ID to every request,
// stores it in AsyncLocalStorage, and attaches a child logger that
// automatically includes the ID in every log line.
const { randomUUID } = require('node:crypto');
const logger = require('./logger');
const { runWithContext, getContext } = require('./request-context');
function correlationMiddleware(req, res, next) {
// Honor an upstream ID if present (from a load balancer or upstream
// service), otherwise generate a fresh one. This lets you trace a
// single request across multiple services.
const reqId = req.headers['x-request-id'] || randomUUID();
// Echo the ID back so the client (and downstream services) can use it
res.setHeader('x-request-id', reqId);
// Create a child logger that automatically includes reqId on every line
const reqLogger = logger.child({ reqId, method: req.method, url: req.url });
// Run the rest of the request inside a context that carries both
// the ID and the child logger
runWithContext({ reqId, logger: reqLogger }, () => {
reqLogger.info('request received');
// Log on response finish so we capture status and duration
const start = process.hrtime.bigint();
res.on('finish', () => {
const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
reqLogger.info({ status: res.statusCode, durationMs }, 'request completed');
});
next();
});
}
// Helper that any module can call to get the current request logger,
// without having to receive `req` as an argument.
function log() {
const ctx = getContext();
// Fall back to the root logger for code running outside a request
return ctx?.logger || logger;
}
module.exports = { correlationMiddleware, log };
// orders-service.js
// Notice this file does NOT import req or pass IDs around.
// It just calls log() and gets the right child logger automatically.
const { log } = require('./correlation-middleware');
async function createOrder(userId, items) {
log().info({ userId, itemCount: items.length }, 'creating order');
try {
const order = await db.orders.insert({ userId, items });
log().info({ orderId: order.id }, 'order created');
return order;
} catch (err) {
// Pino's err serializer turns this into structured fields
log().error({ err, userId }, 'order creation failed');
throw err;
}
}
module.exports = { createOrder };
When you grep your log aggregator for reqId:"5f2a-...", you now see the entire journey — request received, "creating order", the SQL the DB layer logged, "order created", and the response status — all with timestamps. This single feature pays for the entire logging investment.
Health Checks — Liveness vs Readiness
A health endpoint is a URL that an orchestrator (Kubernetes, ECS, a load balancer) hits to ask "is this instance OK?" Most teams ship a single /health that returns 200 and call it done. That is wrong, and it causes outages.
Kubernetes (and most modern orchestrators) distinguishes two questions:
- Liveness — Is this process alive enough to keep running, or should I kill and restart it? If liveness fails, the orchestrator restarts the container.
- Readiness — Is this process ready to receive traffic right now? If readiness fails, the orchestrator stops sending requests but does not restart.
Conflating the two causes cascading failures. Imagine your app's "health" check verifies the database connection. The database has a five-second hiccup. Every replica fails its check at the same moment, the orchestrator restarts every container simultaneously, and now you have a thundering herd reconnecting to a still-recovering database. A correct setup would have failed readiness (stop sending traffic to me) without failing liveness (do not restart me — I am fine, the DB will come back).
// health.js
// Two distinct endpoints: /healthz for liveness, /readyz for readiness.
// The "z" suffix is a long-standing convention from Google's internal
// systems — it signals "this is a machine endpoint, not a user route".
const express = require('express');
const router = express.Router();
// Track readiness state. The app starts NOT ready, flips to ready once
// dependencies (DB pool warmed, migrations done, caches loaded) are up.
let isReady = false;
function setReady(value) {
isReady = value;
}
// Liveness — answers "is the process alive?".
// Should be cheap, dependency-free, and almost never fail. The only
// thing that should make this return non-200 is the event loop being
// fully wedged or the process being deadlocked.
router.get('/healthz', (req, res) => {
res.status(200).json({ status: 'alive', pid: process.pid, uptime: process.uptime() });
});
// Readiness — answers "should I receive traffic?".
// May check downstream dependencies (DB ping, cache ping). If any
// critical dep is down, return 503 so the load balancer drains us.
router.get('/readyz', async (req, res) => {
if (!isReady) {
return res.status(503).json({ status: 'starting' });
}
const checks = {};
try {
// Cheap DB ping — SELECT 1, no application-level logic
await db.query('SELECT 1');
checks.db = 'ok';
} catch (err) {
checks.db = 'fail';
}
try {
await redis.ping();
checks.redis = 'ok';
} catch (err) {
checks.redis = 'fail';
}
const allOk = Object.values(checks).every((v) => v === 'ok');
res.status(allOk ? 200 : 503).json({ status: allOk ? 'ready' : 'degraded', checks });
});
module.exports = { router, setReady };
# kubernetes deployment snippet — the orchestrator's view of the contract
livenessProbe:
httpGet:
path: /healthz
port: 3000
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3 # restart after 3 consecutive failures (~60s)
readinessProbe:
httpGet:
path: /readyz
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
failureThreshold: 2 # remove from service after 2 failures (~10s)
The contract is now explicit: a transient DB blip drains traffic for 10 seconds, then traffic resumes when the DB recovers, and no container is ever restarted unless the process itself is genuinely wedged.
Process Metrics — Event Loop Lag, Memory, and CPU
Logs tell you what happened. Metrics tell you how the process is doing right now — and the most important Node.js metric is one that does not exist in any other runtime: event loop lag.
Event loop lag is the delay between when a timer was supposed to fire and when it actually fired. If you schedule setTimeout(fn, 0) and fn runs 200ms later, your event loop has 200ms of lag — meaning some synchronous code held the loop hostage for that long. High lag is the smoking gun for sync hot paths, blocking crypto, JSON.stringify on huge objects, or sync file I/O. Every Node.js production service should monitor it.
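You can observe this directly with nothing but a timer: schedule a 0 ms setTimeout, then block the loop synchronously, and measure how late the callback fires. A self-contained sketch (the busy-wait stands in for any synchronous hot path):

```javascript
// Schedules a 0ms timer, then blocks the event loop for `blockMs`.
// The timer cannot fire until the synchronous code yields, so the
// measured delay approximates event loop lag.
function measureLag(blockMs) {
  return new Promise((resolve) => {
    const scheduled = Date.now();
    setTimeout(() => resolve(Date.now() - scheduled), 0);
    const start = Date.now();
    while (Date.now() - start < blockMs) {
      // busy-wait: stands in for sync JSON.parse, crypto, file I/O, etc.
    }
  });
}

measureLag(100).then((lagMs) => {
  console.log(`0ms timer fired after ~${lagMs}ms of lag`);
});
```

monitorEventLoopDelay (used below) is the production-grade version of this idea: it samples continuously and keeps a histogram, instead of a single ad-hoc measurement.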
// metrics.js
// Collect the metrics that actually matter for a Node.js process and
// expose them on /metrics for Prometheus scraping (or push them into
// your APM of choice). Keep the collection interval cheap — every 10s
// is fine for almost everything except event loop lag.
const { monitorEventLoopDelay } = require('node:perf_hooks');
const { log } = require('./correlation-middleware');
// monitorEventLoopDelay samples the loop with high precision and
// keeps a histogram. Resolution=20ms means it samples every 20ms.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
function collectMetrics() {
const mem = process.memoryUsage();
const cpu = process.cpuUsage();
// Convert nanoseconds to milliseconds for the lag fields
const metrics = {
// Event loop delay percentiles (the single most important metric)
eventLoopLag: {
p50: histogram.percentile(50) / 1e6,
p99: histogram.percentile(99) / 1e6,
max: histogram.max / 1e6,
},
// Memory in MB — RSS is what the OS sees, heapUsed is what V8 holds
memory: {
rssMB: mem.rss / 1024 / 1024,
heapUsedMB: mem.heapUsed / 1024 / 1024,
heapTotalMB: mem.heapTotal / 1024 / 1024,
externalMB: mem.external / 1024 / 1024,
},
// CPU time accumulated since process start, in microseconds
cpu: {
userMicros: cpu.user,
systemMicros: cpu.system,
},
uptimeSeconds: process.uptime(),
pid: process.pid,
};
return metrics;
}
// Periodically log metrics so they show up in your log aggregator
// even without a Prometheus pipeline. Reset the histogram each cycle
// so percentiles describe the last interval, not the lifetime.
setInterval(() => {
const m = collectMetrics();
// Warn loudly if event loop lag exceeds 100ms — almost always a bug
if (m.eventLoopLag.p99 > 100) {
log().warn({ metrics: m }, 'event loop lag is high');
} else {
log().info({ metrics: m }, 'process metrics');
}
histogram.reset();
}, 10_000).unref();
module.exports = { collectMetrics };
A typical alert configuration looks like this: page on event loop lag p99 over 200ms for 5 minutes, page on RSS growing monotonically over 2 hours (memory leak), warn on CPU sustained over 80% per core. These three alerts catch the vast majority of "the app is slow" complaints before users notice.
APM Tools — Datadog, New Relic, and OpenTelemetry
Application Performance Monitoring (APM) tools take observability one step further: they automatically instrument your HTTP server, your database driver, your HTTP client, and your job queue, then build distributed traces showing the full path of every request across services. You see a flame graph of "this request took 1200ms — 80ms in Express, 50ms in auth middleware, 1000ms in a single Postgres query, 70ms serializing the response," with no manual instrumentation.
The big three for Node.js are Datadog, New Relic, and the open-source OpenTelemetry standard (which Datadog and New Relic both also accept as input).
// tracing.js
// MUST be the very first import in your entry point — before express,
// before pg, before anything else. The agent monkey-patches modules
// at require time, so any module loaded before the agent is invisible
// to it.
const tracer = require('dd-trace').init({
service: 'orders-api',
env: process.env.NODE_ENV,
version: process.env.APP_VERSION,
// Sample rate — 1.0 means trace every request, 0.1 means 10%.
// High-traffic services usually sample at 0.05 to 0.1 for cost reasons.
sampleRate: 0.1,
// Auto-instrument these libraries with no code changes needed
plugins: true,
// Tag every span with logger correlation so logs and traces link up
logInjection: true,
});
module.exports = tracer;
// index.js — entry point
// CRITICAL ordering: tracer first, then everything else
require('./tracing');
require('./logger');
const express = require('express');
const { correlationMiddleware } = require('./correlation-middleware');
const { router: healthRouter, setReady } = require('./health');
require('./metrics');
const app = express();
app.use(correlationMiddleware);
app.use(healthRouter);
// ... routes ...
async function main() {
// Warm up dependencies BEFORE flipping readiness on
await db.connect();
await redis.connect();
app.listen(3000, () => {
setReady(true);
require('./logger').info('server ready');
});
}
main();
The combination of structured logs (Pino) + correlation IDs (AsyncLocalStorage) + APM (Datadog/OTel) is the modern observability stack. Logs answer "what happened," traces answer "where was the time spent," and metrics answer "how is the process holding up." Skip any one of the three and you will eventually be paged at 3 AM with no way to find the answer.
Napkin AI Visual Prompt: "Dark gradient (#0a1a0a -> #0d2e16). Three connected circles labeled 'Logs (Pino)', 'Traces (OTel)', and 'Metrics (Prometheus)' in Node green (#68a063). Amber (#ffb020) lines connect them through a central white monospace 'reqId' badge. Below: three small icons of Datadog, New Relic, and Grafana in white. Caption: 'The Three Pillars of Observability'."
Common Mistakes
1. Using console.log in production.
console.log is unstructured, has no levels, and writes through process.stdout.write with no batching — synchronously, on POSIX, whenever stdout is a TTY or a file. Under load it becomes a measurable bottleneck and produces logs that no aggregator can parse. Replace every console.log with a Pino call before you ship. The only acceptable console.log in production code is in the bootstrap section before the logger has loaded.
2. Logging at debug level in production. Debug logs are designed to be verbose — they often outnumber info logs by 50x. Running production at debug pegs the disk I/O, fills your aggregator quota, and increases event loop lag. Production should run at info, with the option to temporarily flip a single instance to debug for an active investigation. Never leave debug on globally.
3. A single /health endpoint that checks the database.
This conflates liveness with readiness. When the database has a hiccup, every replica fails health at the same time, the orchestrator restarts every container, and you turn a 5-second blip into a 5-minute outage. Split into /healthz (process is alive — no dep checks) and /readyz (deps are up — used to drain traffic), and configure the orchestrator probes accordingly.
4. Logging the request body without redaction.
The first time you accidentally log a credit card number, password, or JWT, you have a compliance incident. Configure Pino's redact option from day one to strip authorization, cookie, password, token, and any other sensitive paths. Treat the log pipeline as if it were public — anything that reaches it should be safe to read.
5. Ignoring event loop lag. Memory and CPU dashboards are useful but they will not catch the most common Node.js failure mode: a sync hot path that wedges the loop. Always export event loop lag as a metric, and alert when p99 exceeds 100ms. The first time you catch a synchronous JSON.parse on a 50MB payload before it hits production, you will understand why.
Interview Questions
1. "What is the difference between Pino and Winston, and which would you pick for a new Node.js service?"
Pino is engineered for raw speed and JSON-first structured logging. It uses a hand-rolled serializer, writes directly to stdout, and offloads pretty-printing and external transports to a separate worker thread, making it roughly 5x faster than Winston in synchronous benchmarks and putting almost no pressure on the event loop. The API is intentionally minimal — child loggers, log levels, serializers, and redaction. Winston is older and more flexible, with a richer formatting pipeline and a larger ecosystem of transports (file, HTTP, syslog, MongoDB, custom). The cost is more allocations per log call and a heavier API surface. For a new service in 2025, I would pick Pino: modern log aggregators (Loki, Elasticsearch, Datadog) all expect JSON over stdout, which is exactly what Pino emits by default, and the speed difference matters at scale. I would only choose Winston if my team had already standardized on it or I needed a transport Pino lacks.
2. "Explain how AsyncLocalStorage enables correlation IDs across async boundaries."
AsyncLocalStorage is a built-in Node module that provides per-async-context storage, similar to thread-local storage in other languages. When you call storage.run(context, fn), Node attaches the given context to the current async resource. Any callback, promise, timer, or await continuation that runs as a descendant of that resource inherits the same context, even though they execute later on the event loop. This solves the correlation ID problem without polluting function signatures: the request middleware generates an ID, calls storage.run({ reqId, logger }, () => next()), and any code that runs inside the request — no matter how deeply nested or how many awaits down — can call storage.getStore() and retrieve the same ID. Combined with Pino's logger.child({ reqId }), every log line emitted during the request automatically carries the ID, which lets you grep your aggregator for one request and see its entire journey including downstream DB and HTTP calls.
3. "What is the difference between liveness and readiness probes, and why does conflating them cause outages?"
A liveness probe answers "is this process alive enough to keep running?" — if it fails, the orchestrator restarts the container. A readiness probe answers "is this process ready to receive traffic right now?" — if it fails, the orchestrator removes the pod from the load balancer rotation but does not restart it. Conflating the two causes cascading failures during transient downstream issues. Imagine a single /health endpoint that pings the database. The database has a five-second hiccup, every replica fails health simultaneously, Kubernetes interprets that as "the process is broken" and restarts every container at once. Now you have a thundering herd of fresh processes all trying to reconnect to a still-recovering database, turning a five-second blip into a multi-minute outage. The correct pattern is /healthz doing nothing but returning 200 (the process is alive), and /readyz doing dependency checks (the process should receive traffic). A DB blip then drains traffic for ten seconds, traffic resumes when the DB recovers, and no container is ever restarted unnecessarily.
4. "Why is event loop lag the most important metric for a Node.js process, and how do you measure it?"
Because Node.js is single-threaded, anything that holds the event loop synchronously blocks every concurrent request the process is handling. CPU and memory dashboards will not catch this — CPU may look fine because the work is brief, and memory may look fine because nothing leaked. The smoking gun is event loop lag: the delay between when a timer was scheduled to fire and when it actually fired. If you schedule a setTimeout(fn, 0) and it runs 200ms later, you have 200ms of lag, meaning some sync code held the loop hostage for that duration. Common culprits are sync crypto, large JSON.stringify/JSON.parse, sync file I/O, and CPU-heavy loops. The standard way to measure it is perf_hooks.monitorEventLoopDelay({ resolution: 20 }), which samples the loop with a high-resolution histogram and lets you read p50, p99, and max percentiles. Production services should expose lag as a metric and alert when p99 exceeds about 100ms.
5. "What does an APM tool like Datadog or New Relic actually do, and how does auto-instrumentation work in Node.js?"
An APM tool produces distributed traces — a timeline of every request showing how long was spent in each layer (HTTP server, middleware, database driver, downstream HTTP call) including across service boundaries via propagated trace headers. The result is a flame graph that lets you answer "why is this endpoint slow?" without adding manual timing code. Auto-instrumentation works by monkey-patching popular Node modules at require time. When you require('dd-trace').init() as the very first line of your entry point, the agent installs hooks into Node's module loader so that the next time someone requires express, pg, mysql2, axios, redis, bullmq, etc., the agent intercepts and wraps key methods to start and finish trace spans. Because it is a require-time patch, the tracer must load before any other module — load order is critical, and forgetting to put the tracer first is the most common reason auto-instrumentation silently produces empty traces. OpenTelemetry uses the same technique with a vendor-neutral API, and Datadog and New Relic both accept OTel data, so for a new project I would use OTel to avoid vendor lock-in.
Quick Reference — Logging and Monitoring Cheat Sheet
+---------------------------------------------------------------+
| LOGGING AND MONITORING CHEAT SHEET |
+---------------------------------------------------------------+
| |
| STRUCTURED LOGGER (Pino): |
| const log = require('pino')({ |
| level: 'info', |
| redact: ['*.password', 'req.headers.authorization'], |
| serializers: { err: pino.stdSerializers.err }, |
| }); |
| log.info({ userId: 42 }, 'user logged in'); |
| |
| LOG LEVELS: |
| fatal -> error -> warn -> info -> debug -> trace |
| Production runs at 'info'. Page on error+. Never log debug. |
| |
| CORRELATION IDS (AsyncLocalStorage): |
| storage.run({ reqId, logger }, () => next()); |
| const ctx = storage.getStore(); |
| ctx.logger.info('anywhere in the call stack'); |
| |
| HEALTH CHECKS: |
| GET /healthz -> 200 always (process alive) |
| GET /readyz -> 200/503 (deps up?) |
| |
| PROCESS METRICS: |
| perf_hooks.monitorEventLoopDelay({ resolution: 20 }) |
| process.memoryUsage() // rss, heapUsed, external |
| process.cpuUsage() // user, system in microseconds |
| |
| APM: |
| require('dd-trace').init() // FIRST line of entry point |
| Auto-instruments express, pg, redis, axios, bullmq, ... |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| KEY RULES |
+---------------------------------------------------------------+
| |
| 1. No console.log in production — use Pino |
| 2. Always emit JSON, never plain strings |
| 3. Attach a correlation ID to every request via ALS |
| 4. Split /healthz (liveness) from /readyz (readiness) |
| 5. Redact secrets in the logger config, not at call sites |
| 6. Monitor event loop lag — alert when p99 > 100ms |
| 7. Load the APM tracer FIRST, before any other require |
| 8. Production at 'info', flip to 'debug' only for triage |
| |
+---------------------------------------------------------------+
| Concern | Wrong Way | Right Way |
|---|---|---|
| Log format | console.log("msg") | log.info({ ctx }, "msg") |
| Logger choice | console everywhere | Pino single instance |
| Request tracing | grep by timestamp | Correlation ID via ALS |
| Health endpoint | one /health checking DB | /healthz + /readyz |
| Secrets in logs | logged in plaintext | Pino redact paths |
| Event loop | unmonitored | monitorEventLoopDelay p99 alert |
| Memory tracking | only RSS | RSS + heapUsed + external |
| Distributed tracing | manual console.time | Datadog/OTel auto-instrument |
| APM load order | anywhere | first line of entry point |
| Log level in prod | debug | info |
Prev: Lesson 10.1 -- Process Management (PM2 & Clustering) Next: Lesson 10.3 -- Environment Configuration
This is Lesson 10.2 of the Node.js Interview Prep Course -- 10 chapters, 42 lessons.