Database Interview Prep
Subqueries and Advanced Queries

Correlated vs Non-Correlated Subqueries

Per-Row Loops, EXISTS, and JOIN Rewrites

LinkedIn Hook

"Your dashboard query took 80 ms in dev and 47 seconds in production. The culprit was four characters: o.id inside a subquery."

Most developers learn subqueries as "a SELECT inside a SELECT" and stop there. They write one, it works on the test data, and they ship it. Then the table grows from ten thousand rows to ten million, and the same query starts melting CPUs. The reason is almost always the same: the subquery turned out to be correlated — it references a column from the outer query — and the database is re-executing it once for every single outer row.

A non-correlated subquery runs once. The database evaluates it, caches the result, and reuses that result for the whole outer query. A correlated subquery runs N times, where N is the number of outer rows. With ten outer rows you barely notice. With ten million you notice immediately, in production, on a Friday.

The fix is rarely "don't use subqueries." Sometimes a correlated subquery is exactly the right tool — EXISTS (...) is a correlated pattern, and it is often the fastest way to test "does at least one matching child row exist." Other times the same logic rewrites cleanly into a JOIN that the planner can optimise far better. Knowing which is which — and being able to read EXPLAIN to confirm — is the difference between a query that scales and a query that gets paged about at 2 AM.

In Lesson 6.2, I break down the execution model of correlated vs non-correlated subqueries: per-row vs once-only evaluation, why EXISTS is inherently correlated, the performance trade-offs, and the mechanical recipe for rewriting a correlated subquery as a JOIN when the planner needs help.

Read the full lesson -> [link]

#SQL #Database #PostgreSQL #MySQL #BackendDevelopment #InterviewPrep #QueryPerformance


Correlated vs Non-Correlated Subqueries thumbnail


What You'll Learn

  • The mechanical difference between correlated and non-correlated subqueries
  • Why a non-correlated subquery executes exactly once and a correlated one executes per outer row
  • How to spot a correlation by looking for outer-table column references inside the subquery
  • Why EXISTS is inherently correlated and when it is the fastest possible pattern
  • How IN with a subquery is usually non-correlated, but with a correlation becomes per-row
  • The cost model: when N small executions beat one large materialisation, and vice versa
  • The recipe for rewriting a correlated subquery as an equivalent JOIN
  • How modern planners (PostgreSQL especially) auto-decorrelate subqueries — and when they cannot
  • How to read EXPLAIN to confirm whether your subquery runs once or N times

The Library Index Card Analogy — Look Up Once vs Walk to the Shelf Every Time

Imagine you are a librarian processing a stack of return requests. Each request says "did this patron ever borrow a banned book?" You have two ways to answer.

Strategy A — look up once. You walk to the back room, pull the master list of banned books, photocopy it, and bring the copy back to your desk. Now for every patron in your stack, you just glance at the photocopy. The walk to the back room happened once. The list never changes mid-batch. You process the stack quickly because the expensive work is already done.

Strategy B — walk to the shelf for every patron. Instead of photocopying, you read the first patron's name, walk to the back room, scan the banned-book list looking specifically for that patron's borrowing history, walk back, write the answer, then read the next patron's name, walk to the back room again, scan again, walk back again, and so on. The work per patron is small. The number of patrons is huge. You spend the entire afternoon walking.

A non-correlated subquery is Strategy A. The inner query has no reference to the outer row, so the database can run it once, cache the result, and reuse it for every outer row. A correlated subquery is Strategy B. The inner query mentions a column from the outer row (o.id, c.customer_id), so the database has to plug in a different outer value each time and re-execute. With ten outer rows you barely notice the walking. With ten million you walk yourself into the ground.

+---------------------------------------------------------------+
|           THE TWO EXECUTION SHAPES                            |
+---------------------------------------------------------------+
|                                                                |
|   NON-CORRELATED (Strategy A):                                 |
|                                                                |
|     [ inner query ]  ->  [ result cached ]                     |
|             |                    |                            |
|             v                    v                            |
|     runs ONCE total      reused for every outer row           |
|                                                                |
|   CORRELATED (Strategy B):                                     |
|                                                                |
|     outer row 1  ->  inner query (uses row 1's id)  ->  result|
|     outer row 2  ->  inner query (uses row 2's id)  ->  result|
|     outer row 3  ->  inner query (uses row 3's id)  ->  result|
|     ...                                                        |
|     outer row N  ->  inner query (uses row N's id)  ->  result|
|                                                                |
|     runs N times -- once per outer row                         |
|                                                                |
+---------------------------------------------------------------+

The single most important skill in this lesson is being able to read a subquery and immediately answer one question: does the inner SELECT reference any column from the outer query? If yes, it is correlated and runs per outer row. If no, it is non-correlated and runs exactly once. Everything else — performance, EXISTS semantics, JOIN rewrites — flows from that one distinction.


Setting Up the Sample Schema

Every example below uses the same two tables. Run these once if you want to reproduce the queries.

-- Customers and their orders -- the standard subquery teaching schema
CREATE TABLE customers (
  id      SERIAL PRIMARY KEY,
  name    TEXT NOT NULL,
  country TEXT NOT NULL
);

CREATE TABLE orders (
  id          SERIAL PRIMARY KEY,
  customer_id INT NOT NULL REFERENCES customers(id),
  total       NUMERIC NOT NULL,
  status      TEXT NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

INSERT INTO customers (name, country) VALUES
  ('Alice', 'US'),
  ('Bob',   'UK'),
  ('Carol', 'US'),
  ('Dave',  'DE'),
  ('Eve',   'US');

INSERT INTO orders (customer_id, total, status) VALUES
  (1, 120.00, 'shipped'),
  (1,  55.00, 'shipped'),
  (2, 300.00, 'pending'),
  (3,  42.00, 'shipped'),
  (3, 999.00, 'cancelled'),
  (5,  18.00, 'shipped');
-- Note: customer 4 (Dave) has zero orders.

Non-Correlated Subqueries — Run Once, Cache, Reuse

A non-correlated subquery is fully self-contained. You could copy it out, paste it into a new query window, and run it independently. It does not mention any column from the outer query, which means the database can evaluate it exactly once, materialise the result, and use that result for the entire outer scan.

-- Non-correlated: the inner query knows nothing about the outer row.
-- It computes "the average order total across all orders" once,
-- and the outer query compares each order's total against that single
-- cached number.
SELECT id, customer_id, total
FROM orders
WHERE total > (SELECT AVG(total) FROM orders);
-- id | customer_id | total
-- ---+-------------+-------
--  3 |           2 | 300.00
--  5 |           3 | 999.00
--
-- The inner SELECT AVG(total) FROM orders runs ONCE, returns 255.67,
-- and the outer WHERE then becomes effectively WHERE total > 255.67.

You can spot it immediately: the inner SELECT has no reference to orders o aliased from the outer query. It is a closed expression. The planner sees this and produces a plan that says "evaluate subquery once, then use the constant in the filter."

EXPLAIN ANALYZE
SELECT id FROM orders WHERE total > (SELECT AVG(total) FROM orders);
-- Seq Scan on orders  (cost=37.78..78.05 rows=567 width=4)
--   Filter: (total > $0)
--   InitPlan 1 (returns $0)
--     ->  Aggregate  (cost=37.77..37.78 rows=1 width=32)
--           ->  Seq Scan on orders orders_1
--
-- Notice the "InitPlan" node: that is the planner's signal that the
-- subquery is evaluated ONCE, before the outer scan starts, and the
-- single result ($0) is plugged into the filter for every row.

The key signal in EXPLAIN is the InitPlan node. InitPlan = "computed once at the start, before the outer query runs." If you see InitPlan, you have a non-correlated subquery and you do not have a per-row cost problem.

Non-correlated subqueries also commonly appear with IN:

-- Non-correlated IN: get all orders from US customers.
-- The inner query "give me every customer id where country = US"
-- runs once and produces the set {1, 3, 5}.
-- The outer query then does customer_id IN (1, 3, 5).
SELECT id, customer_id, total
FROM orders
WHERE customer_id IN (
  SELECT id FROM customers WHERE country = 'US'
);
-- id | customer_id | total
-- ---+-------------+-------
--  1 |           1 | 120.00
--  2 |           1 |  55.00
--  4 |           3 |  42.00
--  5 |           3 | 999.00
--  6 |           5 |  18.00

Correlated Subqueries — Per-Row Re-Execution

A correlated subquery references a column from the outer query. It cannot be run in isolation — paste it into a new window and the database complains that the outer column does not exist. Because the inner query depends on a value that changes per outer row, the database has to re-execute the inner query once per outer row, plugging in a different outer value each time.

-- Correlated: the inner query references c.id from the OUTER customers row.
-- Read this as: "for each customer c, compute their total spend by
-- summing orders WHERE customer_id matches THIS customer."
SELECT
  c.id,
  c.name,
  (SELECT COALESCE(SUM(o.total), 0)
   FROM orders o
   WHERE o.customer_id = c.id) AS total_spent
FROM customers c
ORDER BY c.id;
-- id | name  | total_spent
-- ---+-------+-------------
--  1 | Alice |      175.00
--  2 | Bob   |      300.00
--  3 | Carol |     1041.00
--  4 | Dave  |        0.00
--  5 | Eve   |       18.00
--
-- The inner SUM ran 5 times -- once per customer -- with a different
-- value of c.id plugged in each time.

The correlation is the four characters c.id. Without that reference, the inner query would always return the same total (the sum of all orders) and you would get the same number on every row. With the reference, the planner is forced into a nested loop: for each row of customers, run the inner SUM against the matching orders rows.

EXPLAIN ANALYZE
SELECT c.id,
       (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id)
FROM customers c;
-- Seq Scan on customers c  (cost=0.00..117.50 rows=5 width=36)
--   SubPlan 1
--     ->  Aggregate  (cost=22.51..22.52 rows=1 width=32)
--           ->  Seq Scan on orders o
--                 Filter: (customer_id = c.id)
--
-- The "SubPlan" node (vs InitPlan) is the giveaway. SubPlan means
-- the subquery is evaluated FOR EACH outer row, with the correlated
-- column substituted in. Five customers -> five Aggregate runs.

InitPlan vs SubPlan is the cheat code. Look at any EXPLAIN output for a query with a subquery. If you see InitPlan, the subquery runs once. If you see SubPlan, it runs per outer row. Postgres also sometimes produces Hashed SubPlan (semi-join optimisation) which is closer to once-per-distinct-value — better than per-row but not as good as InitPlan.

+---------------------------------------------------------------+
|           HOW TO TELL THEM APART                              |
+---------------------------------------------------------------+
|                                                                |
|   QUESTION: does the inner SELECT mention an outer column?    |
|                                                                |
|       NO  ->  non-correlated  ->  EXPLAIN shows InitPlan      |
|                                ->  runs 1 time total          |
|                                ->  cached, plugged into outer |
|                                                                |
|       YES ->  correlated      ->  EXPLAIN shows SubPlan       |
|                                ->  runs N times (N = outer)   |
|                                ->  re-evaluated per row       |
|                                                                |
|   SHORTCUT: if you can copy the inner SELECT into a new       |
|   query window and run it standalone, it is non-correlated.   |
|   If it errors with "column does not exist", it is correlated.|
|                                                                |
+---------------------------------------------------------------+

EXISTS — The Canonical Correlated Pattern

EXISTS (subquery) is almost always correlated, and that is by design. It asks the question "does at least one row exist that matches some condition tied to this outer row?" The subquery references the outer row, returns true on the first matching row, and short-circuits — it does not need to count, it does not need to fetch all matches, it just needs one. This makes EXISTS one of the fastest patterns in SQL when used correctly.

-- Find customers who have placed at least one order.
-- The subquery runs per customer, but it stops at the first matching order
-- because EXISTS only cares whether ANY row exists.
SELECT c.id, c.name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
-- id | name
-- ---+-------
--  1 | Alice
--  2 | Bob
--  3 | Carol
--  5 | Eve
--
-- Dave (id=4) is excluded because no orders row matches customer_id=4.
-- Note: SELECT 1 inside EXISTS is idiomatic. The actual columns
-- selected by EXISTS are ignored -- only the existence of a row matters.
-- SELECT *, SELECT 1, SELECT NULL all behave identically.

The matching NOT EXISTS is the standard way to express "anti-join" — find outer rows for which no matching inner row exists.

-- Find customers who have NEVER placed an order.
SELECT c.id, c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
-- id | name
-- ---+------
--  4 | Dave

Why EXISTS often beats IN and JOIN + DISTINCT for "does any match exist":

  • It short-circuits on the first match (no need to count or fetch all)
  • It naturally returns each outer row at most once (no need for DISTINCT)
  • The planner has decades of dedicated optimisations for EXISTS / NOT EXISTS
  • Modern Postgres and SQL Server convert EXISTS into a semi-join internally — effectively a hash join that stops on first match per outer row
-- These three queries answer the same question.
-- On modern Postgres they often produce the same plan (semi-join),
-- but historically EXISTS has been the most reliable choice.

-- 1. EXISTS  (preferred for "any match" semantics)
SELECT c.id FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- 2. IN with non-correlated subquery
SELECT c.id FROM customers c
WHERE c.id IN (SELECT customer_id FROM orders);

-- 3. JOIN + DISTINCT
SELECT DISTINCT c.id FROM customers c
INNER JOIN orders o ON o.customer_id = c.id;

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Center: a sky blue (#4fc3f7) box labeled 'EXISTS (SELECT 1 ...)' with a short-circuit symbol (lightning bolt) inside. Around it, three rose (#ff5c8a) outer rows feeding in. Each row sends a query to the EXISTS box, which lights up after finding one match and immediately returns true (sky blue arrow back). Below the diagram, three equivalent SQL snippets stacked: EXISTS (sky blue border), IN subquery (white border), JOIN + DISTINCT (white border), with a sky blue checkmark next to EXISTS labeled 'short-circuits'. White monospace labels throughout."


Performance — When Per-Row Wins, When It Loses

The instinct "correlated is bad, non-correlated is good" is wrong. Both have valid uses. The right way to think about it is a cost equation:

non-correlated cost = cost(inner query, executed once)
                    + cost(outer scan + lookup in cached result)

correlated cost     = cost(outer scan)
                    + N * cost(inner query, with one outer value bound)

Correlated wins when each individual inner execution is cheap (because the correlation column is indexed) and N is small or the inner query short-circuits (EXISTS). Non-correlated wins when the inner query is expensive to compute but produces a result that is cheap to look up in — for example a small set of ids you can hash.

+---------------------------------------------------------------+
|           WHEN EACH SHAPE WINS                                |
+---------------------------------------------------------------+
|                                                                |
|   CORRELATED is fast when:                                     |
|     - The correlation column is indexed (each lookup O(log N)) |
|     - The pattern is EXISTS / NOT EXISTS (short-circuits)      |
|     - The outer row count is small                             |
|     - The inner query returns a tiny number of rows per call   |
|                                                                |
|   NON-CORRELATED is fast when:                                 |
|     - The inner query is expensive to compute (heavy aggregate)|
|     - The result is small enough to materialise / hash         |
|     - The outer row count is large                             |
|     - The same value is needed by every outer row              |
|                                                                |
|   DANGER ZONE -- correlated subquery, no index on the          |
|   correlation column, large outer table:                       |
|     -> N full table scans of the inner table                   |
|     -> O(N * M) total work                                     |
|     -> 47 seconds becomes a real production story              |
|                                                                |
+---------------------------------------------------------------+

The single most common production performance bug in this area is a correlated subquery whose correlation column has no index. Each inner execution scans the full inner table, and you multiply that cost by the number of outer rows. The fix is almost always one of two things: add the missing index, or rewrite the correlated subquery as a JOIN that the planner can hash.


Modern Planners Auto-Decorrelate (Sometimes)

PostgreSQL, SQL Server, and Oracle have all gotten smarter at recognising correlated subqueries that are mathematically equivalent to JOINs and rewriting them automatically. This is called subquery decorrelation or subquery unnesting. When it works, the planner takes a query like:

SELECT c.id, c.name
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

...and internally rewrites it to a semi-join, which executes as a hash join that stops on first match per outer row. The EXPLAIN output shows Hash Semi Join instead of SubPlan, and the cost is comparable to a clean JOIN.

But auto-decorrelation has limits. It tends to break down when:

  • The subquery is in the SELECT list (a "scalar subquery") rather than in WHERE
  • The subquery uses LIMIT or ORDER BY
  • The correlation involves a non-equi predicate (e.g. o.created_at > c.signup_date - INTERVAL '30 days')
  • The subquery uses aggregates with the correlation in a HAVING clause
  • MySQL's planner is older and decorrelates fewer cases than Postgres

When the planner cannot decorrelate, you get the SubPlan node and N executions. That is when you reach for the manual rewrite.


The Recipe — Rewriting a Correlated Subquery as a JOIN

Most correlated subqueries can be rewritten as a JOIN against an aggregated subquery. The recipe is mechanical:

  1. Pull the correlated subquery out into its own derived table.
  2. Replace the correlation with a GROUP BY on the same column.
  3. JOIN the derived table back to the outer query on that column.
  4. Use LEFT JOIN + COALESCE if the original used a default for missing matches.
-- BEFORE: correlated subquery in the SELECT list.
-- Each customer triggers one SUM execution -- N customers -> N runs.
SELECT
  c.id,
  c.name,
  (SELECT COALESCE(SUM(o.total), 0)
   FROM orders o
   WHERE o.customer_id = c.id) AS total_spent
FROM customers c;

-- AFTER: derived table aggregated once, joined back to customers.
-- The aggregate runs ONE time over the entire orders table,
-- producing one row per customer_id. The LEFT JOIN attaches it
-- to the customers row, and COALESCE handles customers with no orders.
SELECT
  c.id,
  c.name,
  COALESCE(o.total_spent, 0) AS total_spent
FROM customers c
LEFT JOIN (
  SELECT customer_id, SUM(total) AS total_spent
  FROM orders
  GROUP BY customer_id
) o ON o.customer_id = c.id;
-- id | name  | total_spent
-- ---+-------+-------------
--  1 | Alice |      175.00
--  2 | Bob   |      300.00
--  3 | Carol |     1041.00
--  4 | Dave  |        0.00
--  5 | Eve   |       18.00
--
-- Same result. Different execution shape. With 1M customers and
-- 10M orders, this version is typically 50-500x faster because the
-- inner aggregate runs once (one hash aggregate over orders) instead
-- of one million times.

The JOIN form is also what the planner internally produces when it auto-decorrelates a SELECT-list subquery — but writing it explicitly removes the dependency on the planner's decorrelation rules and gives you predictable performance across engines.

+---------------------------------------------------------------+
|           THE REWRITE PATTERN                                 |
+---------------------------------------------------------------+
|                                                                |
|   CORRELATED SHAPE:                                            |
|                                                                |
|     SELECT outer.x,                                            |
|            (SELECT agg(inner.y)                                |
|             FROM inner                                         |
|             WHERE inner.fk = outer.pk)                         |
|     FROM outer                                                 |
|                                                                |
|     -> N executions of the inner aggregate                     |
|                                                                |
|   JOIN-REWRITE SHAPE:                                          |
|                                                                |
|     SELECT outer.x,                                            |
|            COALESCE(g.agg_y, 0)                                |
|     FROM outer                                                 |
|     LEFT JOIN (                                                |
|       SELECT fk, agg(y) AS agg_y                               |
|       FROM inner                                               |
|       GROUP BY fk                                              |
|     ) g ON g.fk = outer.pk                                     |
|                                                                |
|     -> 1 execution of the inner aggregate                      |
|                                                                |
+---------------------------------------------------------------+

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Two side-by-side execution boxes. LEFT box labeled 'Correlated -> N runs' shows a rose (#ff5c8a) loop with five small SUM nodes feeding into five customer rows. RIGHT box labeled 'JOIN-Rewrite -> 1 run' shows a single sky blue (#4fc3f7) hash aggregate over an orders table feeding once into a LEFT JOIN node, then into the customers result. Below both, an EXPLAIN snippet comparison: rose 'SubPlan, loops=5' vs sky blue 'Hash Right Join'. White monospace labels."


A Note on NOT IN vs NOT EXISTS and NULLs

NOT IN with a subquery has a famous foot-gun: if the subquery returns even one NULL, the entire NOT IN evaluates to UNKNOWN for every outer row, and you get zero results. NOT EXISTS does not have this problem because it works at the row level rather than evaluating membership in a set.

-- Suppose orders.customer_id is nullable and one row has NULL.
-- NOT IN silently returns ZERO customers -- a real bug in production code.
SELECT c.id FROM customers c
WHERE c.id NOT IN (SELECT customer_id FROM orders);
-- (returns no rows if any customer_id is NULL)

-- NOT EXISTS handles this correctly: it returns Dave as expected.
SELECT c.id FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.id
);
-- id
-- ----
--  4

The rule: prefer NOT EXISTS over NOT IN whenever the subquery column might be nullable. It is also typically faster, because the planner can use a hash anti-join with short-circuit semantics.


MySQL vs PostgreSQL — Dialect Notes

Both engines support correlated and non-correlated subqueries with the same SQL syntax. The differences live in the planner:

  • Postgres auto-decorrelates EXISTS, NOT EXISTS, and many IN subqueries into semi/anti joins. Scalar subqueries in the SELECT list are decorrelated less aggressively. EXPLAIN distinguishes InitPlan, SubPlan, and Hashed SubPlan.
  • MySQL historically had weaker subquery decorrelation. Pre-5.6, correlated IN subqueries famously produced "DEPENDENT SUBQUERY" plans that ran per row even when a JOIN would have been trivially equivalent. MySQL 5.6+ and especially 8.0 are dramatically better, but it is still safer in MySQL to write JOINs directly when performance matters.
  • In MySQL EXPLAIN, look for DEPENDENT SUBQUERY (correlated, runs per row) vs SUBQUERY (non-correlated, runs once). The MySQL nomenclature is the equivalent of Postgres's SubPlan vs InitPlan.

A query that is fine on Postgres can be catastrophic on MySQL because of this difference. If you maintain code that runs on both, write the JOIN form explicitly — it executes well everywhere.


Common Mistakes

1. Assuming any subquery is "evaluated once." The most common misconception is that subqueries are like local variables — defined once, used everywhere. They are not. The moment the inner SELECT references an outer column, the database is forced into per-row evaluation, and that one keystroke (o.id instead of id) can change a 50 ms query into a 50 second query. Always trace each subquery and check whether it references outer columns.

2. Forgetting to index the correlation column. A correlated subquery whose correlation column has no index forces the inner query to scan its entire table on every outer row. That is N full scans for N outer rows — quadratic work. The fix is almost always a single CREATE INDEX on the foreign key column inside the inner table. Always check that the correlation key is indexed before declaring a correlated subquery "slow by design."

3. Using NOT IN with a nullable subquery. NOT IN (SELECT col FROM ...) returns zero rows the moment a NULL appears in the subquery output, due to three-valued logic. This is not a planner bug — it is the SQL standard — and it produces silent, hard-to-debug correctness errors. Always use NOT EXISTS instead when the inner column might be nullable.

4. Rewriting correlated subqueries that the planner already decorrelates. On modern Postgres, an EXISTS correlated subquery is internally converted to a hash semi-join and is already optimal. Hand-rewriting it as a JOIN + DISTINCT can actually be slower because you lose the short-circuit. Run EXPLAIN before "optimising" — if you see Hash Semi Join or Hash Anti Join already, leave it alone.

5. Confusing SubPlan with InitPlan in EXPLAIN. Both are subquery markers in Postgres, but they mean opposite things. InitPlan runs once before the outer query starts (cheap, non-correlated). SubPlan runs per outer row (potentially expensive, correlated). Misreading the plan as "fine, it has an InitPlan" when it actually shows SubPlan is a surprisingly common reason teams ship slow queries to production.


Interview Questions

1. "What is the difference between a correlated and a non-correlated subquery, and how does it affect performance?"

A non-correlated subquery is fully self-contained — its inner SELECT does not reference any column from the outer query, so the database can evaluate it exactly once, cache the result, and reuse it for the entire outer scan. A correlated subquery references one or more columns from the outer query, which means the inner SELECT depends on the current outer row, and the database has to re-execute the inner query once per outer row, plugging in a different outer value each time. The performance impact is enormous when the outer table is large: a non-correlated subquery is O(inner) + O(outer), while a correlated subquery is O(outer * inner) in the worst case. The correlated form is still fast when the correlation column is indexed (each inner lookup is O(log N)) or when the pattern is EXISTS / NOT EXISTS (which short-circuits on the first match), but in any other case it can produce catastrophic per-row costs that only show up on production-sized data.

2. "Why is EXISTS considered an inherently correlated pattern, and when should you prefer it over IN or a JOIN?"

EXISTS (subquery) is correlated by design because the question it answers is "for this specific outer row, does at least one matching inner row exist," which only makes sense when the inner query references the outer row. The subquery short-circuits on the first match — it does not count, sum, or fetch additional rows — so each per-row execution is extremely cheap, and modern planners further optimise it into a hash semi-join that runs in linear time. You should prefer EXISTS over IN when the inner column might be nullable (because NOT IN silently returns zero rows in the presence of NULLs due to SQL three-valued logic), and over JOIN + DISTINCT because EXISTS naturally returns each outer row at most once without the cost of the deduplication step. The general rule is: if you are asking "does any matching child exist," reach for EXISTS first; if you are asking "give me data from both tables," reach for a JOIN.

3. "Show how you would rewrite a correlated subquery in the SELECT list as a JOIN, and explain why the JOIN form is usually faster."

A scalar subquery in the SELECT list — like SELECT c.name, (SELECT SUM(o.total) FROM orders o WHERE o.customer_id = c.id) FROM customers c — runs the inner aggregate once per outer customer row. With a million customers, that is a million aggregate executions. The JOIN rewrite is mechanical: pull the inner subquery out into a derived table that aggregates by the correlation column, then LEFT JOIN it back to the outer table on that column, wrapping the aggregated value in COALESCE for any rows without matches. The result is SELECT c.name, COALESCE(o.total_spent, 0) FROM customers c LEFT JOIN (SELECT customer_id, SUM(total) AS total_spent FROM orders GROUP BY customer_id) o ON o.customer_id = c.id. Now the inner aggregate runs exactly once over the entire orders table, producing one row per customer in a single hash aggregate, and the LEFT JOIN attaches the result. The JOIN form is faster because the work is amortised across all outer rows in a single batch operation rather than repeated per row, and the planner can use hash aggregation and hash joining — both extremely well-optimised paths.

4. "How can you tell from a PostgreSQL EXPLAIN plan whether a subquery is correlated or not, and what should you look for to spot a performance problem?"

Postgres labels subqueries in the plan with one of three node types. InitPlan means the subquery is non-correlated and is evaluated exactly once before the outer query starts, with the result cached as a parameter ($0, $1, etc.) and plugged into the outer scan; this is the cheap case and you can stop worrying. SubPlan means the subquery is correlated and is re-executed once per outer row, which is the expensive case — look for the loops count in the inner node and multiply by the inner cost to get the real total work. Hashed SubPlan means the planner converted the subquery into a hash structure built once and probed per outer row, which is a middle ground better than SubPlan but worse than InitPlan. The performance red flags are: a SubPlan node feeding off a Seq Scan of a large inner table (no index on the correlation column), loops= numbers in the millions, and total inner cost multiplied by loops dwarfing the outer scan cost. The fix is usually adding an index on the correlation column or rewriting as a JOIN with a derived aggregated table.

5. "Explain why NOT IN (subquery) is dangerous when the subquery column can be NULL, and what to use instead."

NOT IN is implemented in terms of three-valued logic. The expression x NOT IN (a, b, c, NULL) evaluates as x != a AND x != b AND x != c AND x != NULL, and x != NULL is always UNKNOWN, which makes the entire conjunction UNKNOWN for every value of x. The WHERE clause filters out rows where the predicate is not TRUE, so any UNKNOWN result is dropped, and the query returns zero rows even if there are obvious matches that should pass. This is one of the most subtle bugs in SQL because the query produces some result silently — not an error — and it only manifests when at least one NULL appears in the subquery output, which can happen long after the code was written and tested. The safe replacement is NOT EXISTS, which works at the row level instead of as a set membership check: SELECT c.id FROM customers c WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) returns the customers with no orders correctly regardless of NULLs. As a portable rule, prefer NOT EXISTS over NOT IN for any subquery whose column is not declared NOT NULL.


Quick Reference — Cheat Sheet

+---------------------------------------------------------------+
|           CORRELATED vs NON-CORRELATED CHEAT SHEET            |
+---------------------------------------------------------------+
|                                                                |
|  THE TEST:                                                     |
|    Does the inner SELECT reference an outer column?           |
|      NO  -> non-correlated -> runs ONCE                        |
|      YES -> correlated     -> runs per OUTER ROW               |
|                                                                |
|  EXPLAIN MARKERS (PostgreSQL):                                 |
|    InitPlan       -> evaluated once, parameterised             |
|    SubPlan        -> per-row execution (correlated)            |
|    Hashed SubPlan -> hashed once, probed per row               |
|    Hash Semi Join -> EXISTS auto-decorrelated (good)           |
|    Hash Anti Join -> NOT EXISTS auto-decorrelated (good)       |
|                                                                |
|  EXPLAIN MARKERS (MySQL):                                      |
|    SUBQUERY            -> non-correlated, runs once            |
|    DEPENDENT SUBQUERY  -> correlated, runs per row             |
|                                                                |
|  EXISTS PATTERNS:                                              |
|    EXISTS (SELECT 1 FROM ... WHERE inner.fk = outer.pk)        |
|      -> "any matching child" -- short-circuits on first hit    |
|    NOT EXISTS (...)                                            |
|      -> "no matching child"  -- safe with NULLs                |
|                                                                |
|  REWRITE RECIPE (correlated SELECT-list subquery -> JOIN):     |
|    1. Pull inner subquery into a derived table                 |
|    2. GROUP BY the correlation column                          |
|    3. LEFT JOIN derived table back on that column              |
|    4. COALESCE for default values on no-match rows             |
|                                                                |
|  PERFORMANCE INSTINCTS:                                        |
|    - Always index the correlation column                       |
|    - Prefer NOT EXISTS over NOT IN                             |
|    - Trust EXISTS in modern Postgres (semi-join)               |
|    - Hand-rewrite scalar correlated subqueries as JOINs        |
|    - Run EXPLAIN before AND after every rewrite                |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. Correlation = inner SELECT mentions an outer column        |
|  2. Non-correlated runs 1x; correlated runs N times            |
|  3. EXISTS is correlated by design and short-circuits          |
|  4. Prefer NOT EXISTS over NOT IN when NULLs are possible      |
|  5. Index every correlation column or pay quadratic cost       |
|  6. InitPlan = once; SubPlan = per row; learn the difference   |
|  7. Scalar correlated subqueries in SELECT often beg for JOINs |
|  8. Modern Postgres auto-decorrelates EXISTS -- check EXPLAIN  |
|  9. MySQL planner is weaker -- write JOINs explicitly there    |
| 10. Always EXPLAIN ANALYZE before and after a rewrite          |
|                                                                |
+---------------------------------------------------------------+
ConcernWrong WayRight Way
"Any match exists"JOIN ... GROUP BY ... HAVING COUNT > 0WHERE EXISTS (SELECT 1 ...)
"No match exists"NOT IN (SELECT col ...)NOT EXISTS (SELECT 1 ...)
Scalar aggregate per rowCorrelated subquery in SELECT listLEFT JOIN derived aggregated table
Slow correlated subqueryAdd LIMIT to "fix" itIndex the correlation column
Reading EXPLAIN"There's a subquery, must be slow"Check InitPlan vs SubPlan vs Hash Semi
Portable SQLRely on planner decorrelationWrite JOINs explicitly when in doubt
Nullable inner colNOT INNOT EXISTS
EXISTS column choiceSELECT * FROM ...SELECT 1 FROM ... (idiomatic)

Prev: Lesson 6.1 -- Subqueries Next: Lesson 6.3 -- UNION, INTERSECT, EXCEPT


This is Lesson 6.2 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

On this page