INNER JOIN
Set Intersection, ON vs USING, and the Many-Table Pipeline
LinkedIn Hook
"Your INNER JOIN dropped 40% of the orders. The bug wasn't in the WHERE clause — it was the NULL customer_id on rows you forgot existed."
Most developers learn INNER JOIN as "match rows in two tables where the keys are equal" and stop there. That sentence is technically correct and operationally dangerous. INNER JOIN is the only join type that silently deletes rows from your result set with zero warning. Any row on either side that fails to find a match — because the foreign key is NULL, because the parent row was archived, because of a typo in a status filter applied before the join — vanishes without a trace.
Then you put the join inside a subquery, and a row that "should be there" never makes it to the dashboard. Then you write a three-way join and end up with row counts higher than either source table because one of the joins quietly became a fan-out. Then you join four tables and EXPLAIN shows 200 million rows from a 50k-row source, because the planner picked the wrong join order and the optimizer hint you wrote for MySQL is being ignored by Postgres.
INNER JOIN is also where you first meet the ON-vs-USING distinction (USING collapses both join columns into one), the difference between filtering in the ON clause and filtering in WHERE (for INNER JOIN they are equivalent — for OUTER JOIN they are not), and the rule that NULL never equals NULL so a NULL-vs-NULL join match never happens.
In Lesson 5.2, I take INNER JOIN apart end to end: the set-theory intuition, the row-by-row execution model, ON vs USING vs NATURAL JOIN, the NULL trap, multi-table chains, the ON-vs-WHERE filter equivalence, and the common pitfalls that turn a "simple join" into a four-hour debugging session.
Read the full lesson -> [link]
#SQL #Database #PostgreSQL #MySQL #BackendDevelopment #InterviewPrep #Joins
What You'll Learn
- The set-theory model of INNER JOIN as the intersection of two row sets on a key
- The conceptual row-by-row execution model and why it explains every join surprise
- The exact difference between
ON,USING, andNATURAL JOINand when each is safe - Why NULL on either side of the join key always drops the row, and how to detect it
- How to chain multiple INNER JOINs and read the execution shape from EXPLAIN
- The ON-vs-WHERE equivalence for INNER JOIN (and why it stops being true for OUTER JOIN)
- The fan-out pitfall when joining one row to many rows and how it inflates aggregates
- Index-backed joins (Hash Join, Merge Join, Nested Loop) and when each is fast
The Two-List Clipboard Analogy
Picture two clipboards on a desk. The left clipboard is a list of orders, each with an order number and a customer ID. The right clipboard is a list of customers, each with a customer ID and a name. Your job is to walk down the left list, and for every order, find the matching customer on the right list and write down the pair (order number, customer name).
That walk is an INNER JOIN. The join key is customer_id. For every row on the left, you ask "is there a row on the right where the customer_id is the same?" If yes, you emit a combined row. If no — say the order has a NULL customer_id, or its customer_id refers to a customer who was deleted — the order disappears from your output. Same on the other side: a customer with no orders does not appear either, because the walk is symmetric.
This is the entire mental model. Every confusing INNER JOIN behavior — disappearing rows, doubled row counts, NULL handling, multi-table chains — is just an elaboration of "for every row on the left, find matching rows on the right; if none, drop the left row."
+---------------------------------------------------------------+
| THE TWO-CLIPBOARD WALK |
+---------------------------------------------------------------+
| |
| ORDERS (left) CUSTOMERS (right) |
| +----+-------------+ +----+----------+ |
| | id | customer_id | | id | name | |
| +----+-------------+ +----+----------+ |
| | 1 | 10 | --> | 10 | Alice | match -> emit |
| | 2 | 20 | --> | 20 | Bob | match -> emit |
| | 3 | 99 | --> | - | - | miss -> drop |
| | 4 | NULL | --> | - | - | miss -> drop |
| +----+-------------+ | 30 | Carol | miss -> drop |
| +----+----------+ |
| |
| RESULT (intersection only): |
| (order 1, Alice), (order 2, Bob) |
| |
+---------------------------------------------------------------+
Notice three things in that diagram. Order 3 referenced a customer who does not exist (broken foreign key) and was dropped. Order 4 had a NULL customer_id and was dropped because NULL never matches anything, including another NULL. And Carol the customer was dropped because she had no orders. Those three drops are the source of almost every "where did my row go?" bug in real-world SQL.
The Conceptual Execution Model
Internally the database does not literally walk both clipboards row by row — it picks a fast algorithm (Hash, Merge, or Nested Loop). But every join algorithm produces the same output as the conceptual model, and the conceptual model is what you reason about when reading or writing SQL. Here it is in pseudocode:
for each row L in left_table:
for each row R in right_table:
if join_condition(L, R) is TRUE:
emit combined row (L, R)
Three crucial things about this model:
- One left row can produce many output rows. If L matches three rows in R, you get three combined rows. This is the fan-out, and it is a feature, not a bug — but it surprises people who think a join "looks up one customer per order."
- The join condition is evaluated as a three-valued logic expression. It must evaluate to TRUE for a match. UNKNOWN (which is what NULL = anything returns) is not TRUE, so the row is dropped.
- The set of output columns is the union of left columns and right columns (with USING and NATURAL JOIN merging the join key into a single column — more on that below).
ON vs USING vs NATURAL JOIN
PostgreSQL gives you three syntaxes for specifying the join condition. They produce semantically identical results when the keys match cleanly, but they differ in column output and in how dangerous they are when the schema changes.
-- Sample tables used throughout this lesson
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
city TEXT
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
total NUMERIC,
status TEXT
);
INSERT INTO customers (name, city) VALUES
('Alice', 'Seattle'),
('Bob', 'Austin'),
('Carol', 'Boston'),
('Dave', 'Denver');
INSERT INTO orders (customer_id, total, status) VALUES
(1, 120.00, 'shipped'),
(1, 45.00, 'shipped'),
(2, 55.50, 'pending'),
(3, 300.00, 'shipped'),
(NULL, 10.00, 'pending'), -- guest checkout, no customer
(99, 77.00, 'shipped'); -- orphan, customer 99 does not exist
Note: the last INSERT will fail in real life because of the foreign key. For demonstration, drop the FK or pretend it's a legacy import. The point is to show the join behavior on dirty data.
ON — explicit and the only one you should use in production
-- ON: explicit join condition. The join column appears twice in the
-- output (once from each side) unless you list columns explicitly.
SELECT o.id AS order_id, c.name, o.total
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id
ORDER BY o.id;
-- order_id | name | total
-- ---------+-------+--------
-- 1 | Alice | 120.00
-- 2 | Alice | 45.00
-- 3 | Bob | 55.50
-- 4 | Carol | 300.00
-- (Dave is gone -- no orders. Order 5 is gone -- NULL key.
-- Order 6 is gone -- no customer 99.)
USING — collapses the join column into one
-- USING: the join column must have the same name on both sides.
-- It appears exactly once in the output, not twice.
-- Requires renaming customers.id to customer_id (or vice versa).
ALTER TABLE customers RENAME COLUMN id TO customer_id;
SELECT order_id, customer_id, name, total
FROM orders o
INNER JOIN customers c USING (customer_id)
-- Note: customer_id is now an unqualified column -- you cannot
-- write o.customer_id or c.customer_id after USING.
ORDER BY order_id;
USING is concise and often nicer to read. The cost: it requires the columns on both sides to share a name, and it removes your ability to qualify the column with a table alias, which can make schema refactors painful.
NATURAL JOIN — never use this
-- NATURAL JOIN: joins on EVERY column with the same name in both tables.
-- It looks at the schema and silently picks the keys.
SELECT * FROM orders NATURAL JOIN customers;
-- This is a time bomb. If someone adds a column named "status" to
-- customers next year, the NATURAL JOIN silently starts joining on
-- (customer_id, status) and your query starts returning fewer rows.
The rule: use ON in production code because it is the most explicit and survives schema changes. Use USING for ad-hoc queries when both sides really do share a column name and you want clean output. Never use NATURAL JOIN. It has the worst failure mode of any SQL feature: silent wrong answers when an unrelated column is added to either table.
The NULL Trap — Why Three-Valued Logic Drops Rows
SQL uses three-valued logic: TRUE, FALSE, and UNKNOWN. The expression NULL = NULL evaluates to UNKNOWN, not TRUE. The join condition only emits a row when it evaluates to TRUE, so any join row where either side of the equality is NULL is dropped silently.
-- Find orders with their customer name. Watch the row count.
SELECT COUNT(*) FROM orders;
-- count
-- -----
-- 6
SELECT COUNT(*)
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;
-- count
-- -----
-- 4
-- We "lost" 2 rows: the NULL-key guest checkout and the orphan id=99.
-- Neither produced an error. The dashboard just shows 4.
If you want NULL-key rows to participate in some way, you almost always want LEFT JOIN (next lesson) instead of INNER JOIN. If you specifically want NULL-vs-NULL to count as a match (rare, usually a sign of bad schema design), you need IS NOT DISTINCT FROM:
-- Postgres-specific: IS NOT DISTINCT FROM treats two NULLs as equal.
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c
ON o.customer_id IS NOT DISTINCT FROM c.customer_id;
-- This is almost never what you want. It is shown only so you
-- recognize the pattern in legacy code.
The healthier fix: enforce NOT NULL on foreign keys at the schema level whenever the relationship is mandatory, and use LEFT JOIN whenever the relationship is optional and you want the optional side to survive.
Multi-Table INNER JOINs — The Pipeline
Joins compose. You can chain three, four, or twenty INNER JOINs in a single query, and the conceptual model is the same: each join takes the result of the previous join as its left side, and matches it against the next table.
CREATE TABLE products (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
price NUMERIC
);
CREATE TABLE order_items (
id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(id),
product_id INT REFERENCES products(id),
quantity INT
);
INSERT INTO products (name, price) VALUES
('Widget', 10.00),
('Gadget', 25.00),
('Gizmo', 8.50);
INSERT INTO order_items (order_id, product_id, quantity) VALUES
(1, 1, 3), -- order 1: 3 widgets
(1, 2, 1), -- order 1: 1 gadget
(2, 3, 2), -- order 2: 2 gizmos
(3, 1, 1), -- order 3: 1 widget
(4, 2, 4); -- order 4: 4 gadgets
-- Three-table INNER JOIN: customer -> order -> line items -> product
SELECT
c.name AS customer,
o.id AS order_id,
p.name AS product,
oi.quantity AS qty,
p.price AS unit_price,
oi.quantity * p.price AS line_total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON oi.order_id = o.id
INNER JOIN products p ON p.id = oi.product_id
ORDER BY c.name, o.id, p.name;
-- customer | order_id | product | qty | unit_price | line_total
-- ---------+----------+---------+-----+------------+-----------
-- Alice | 1 | Gadget | 1 | 25.00 | 25.00
-- Alice | 1 | Widget | 3 | 10.00 | 30.00
-- Alice | 2 | Gizmo | 2 | 8.50 | 17.00
-- Bob | 3 | Widget | 1 | 10.00 | 10.00
-- Carol | 4 | Gadget | 4 | 25.00 | 100.00
+---------------------------------------------------------------+
| THE JOIN PIPELINE |
+---------------------------------------------------------------+
| |
| customers |
| | |
| v |
| [JOIN ON c.customer_id = o.customer_id] |
| | |
| v |
| customers + orders |
| | |
| v |
| [JOIN ON o.id = oi.order_id] |
| | |
| v |
| customers + orders + order_items |
| | |
| v |
| [JOIN ON oi.product_id = p.id] |
| | |
| v |
| customers + orders + order_items + products -> SELECT |
| |
+---------------------------------------------------------------+
The order in which you write the joins is not necessarily the order in which the database executes them. Postgres's optimizer is free to reorder INNER JOINs because the result is mathematically the same regardless of evaluation order (INNER JOIN is associative and commutative). The planner picks the order that minimizes intermediate row counts — usually starting from the smallest filtered table and joining outward.
ON vs WHERE — Equivalent for INNER JOIN
A surprising property of INNER JOIN: filters written in the ON clause and filters written in the WHERE clause produce identical results. This is not true for OUTER JOIN (which we cover in the next lesson, and which is one of the most common interview questions).
-- These two queries are exactly equivalent for INNER JOIN.
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
AND o.status = 'shipped';
SELECT o.id, c.name
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id
WHERE o.status = 'shipped';
-- Both return the same rows and the planner usually produces the
-- same plan for both.
The convention: put join conditions (the keys that link the tables) in the ON clause and put row filters (predicates on individual columns) in WHERE. This makes the query easier to read and easier to convert to a LEFT JOIN later without quietly changing the semantics.
Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Side-by-side comparison of two SQL snippets in white monospace. LEFT box labeled 'ON clause' shows 'INNER JOIN c ON o.customer_id = c.id AND o.status = shipped' with sky blue (#4fc3f7) highlight on the AND. RIGHT box labeled 'WHERE clause' shows 'INNER JOIN c ON o.customer_id = c.id WHERE o.status = shipped'. Below both, a single result table with an equals sign between, glowing sky blue. Caption in rose (#ff5c8a): 'INNER JOIN: identical. OUTER JOIN: very different.' Subtle grid overlay."
The Fan-Out Pitfall
When you join a parent row to a child table and the child has multiple rows per parent, the parent row is duplicated in the output once per child. This is correct join behavior, but it breaks aggregations that assume one row per parent.
-- BUG: counts customers but joins through order_items, which fans out.
SELECT COUNT(*) AS customer_count
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON oi.order_id = o.id;
-- customer_count
-- --------------
-- 5
-- This is the count of LINE ITEMS that map to a customer with
-- at least one order -- not the count of customers!
-- FIX: count distinct customers, or aggregate before joining.
SELECT COUNT(DISTINCT c.customer_id) AS customer_count
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
INNER JOIN order_items oi ON oi.order_id = o.id;
-- customer_count
-- --------------
-- 3
-- Even better: pre-aggregate to avoid the fan-out entirely.
SELECT c.name,
(SELECT COALESCE(SUM(oi.quantity * p.price), 0)
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN products p ON p.id = oi.product_id
WHERE o.customer_id = c.customer_id) AS revenue
FROM customers c
ORDER BY revenue DESC;
The rule: any time you SUM, COUNT, or AVG over a join, ask "could this join have multiplied the rows from the table I'm aggregating over?" If yes, either use COUNT(DISTINCT ...), pre-aggregate in a subquery or CTE, or restructure the query. The classic symptom is a revenue number that is exactly 2x or 3x the right answer because some customers have two or three orders.
Join Algorithms — How EXPLAIN Reads
PostgreSQL has three physical join algorithms: Nested Loop, Hash Join, and Merge Join. The planner picks one based on table sizes, available indexes, and statistics. You do not choose directly, but understanding what each one does makes EXPLAIN output legible.
+---------------------------------------------------------------+
| PHYSICAL JOIN ALGORITHMS |
+---------------------------------------------------------------+
| |
| NESTED LOOP |
| For each outer row -> probe inner table (ideally indexed). |
| Best when: outer is tiny OR inner has a matching index. |
| Cost: O(N * log M) with index, O(N * M) without. |
| |
| HASH JOIN |
| Build a hash of the smaller table, stream the larger one. |
| Best when: both sides are large and one fits in work_mem. |
| Cost: O(N + M), but needs memory for the hash. |
| |
| MERGE JOIN |
| Sort both sides on the key, then walk in lockstep. |
| Best when: both sides are already sorted (e.g. from index). |
| Cost: O(N log N + M log M), or O(N + M) if pre-sorted. |
| |
+---------------------------------------------------------------+
EXPLAIN ANALYZE
SELECT c.name, o.id, o.total
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id
WHERE c.city = 'Seattle';
-- Hash Join (cost=1.05..2.21 rows=2 width=44)
-- Hash Cond: (o.customer_id = c.customer_id)
-- -> Seq Scan on orders o (cost=0.00..1.06 rows=6)
-- -> Hash (cost=1.04..1.04 rows=1)
-- -> Seq Scan on customers c
-- Filter: (city = 'Seattle')
-- Postgres built a hash on the (small) Seattle-customers set and
-- streamed orders through it. With an index on orders.customer_id
-- and a large customers table, the planner might switch to Nested
-- Loop or Merge Join instead.
MySQL only had Nested Loop until version 8.0.18 added Hash Join — older MySQL versions perform very poorly on large unindexed joins and the standard fix is to add the right index on the inner table's join key. In Postgres, the planner has all three algorithms available from day one.
Common Mistakes
1. Forgetting that NULL in the join key drops rows.
A NOT NULL-less foreign key column is the most common source of "missing rows" bugs. You write INNER JOIN customers c ON o.customer_id = c.id and any order with a NULL customer_id (guest checkout, soft-deleted parent, broken import) silently disappears from the result. Always check the row count before and after a join — if it dropped unexpectedly, search for NULLs in the join keys with SELECT COUNT(*) FROM orders WHERE customer_id IS NULL; and decide whether you really want INNER JOIN or LEFT JOIN.
2. Aggregating across a fan-out join.
SUM(o.total) after joining orders to order_items double-counts every order that has more than one line item, because the join repeats the order row once per item. The result is a revenue number that looks plausible but is wrong by 30-200% depending on average items-per-order. Either aggregate order_items to one row per order in a subquery before joining, or use SUM(DISTINCT o.id, o.total) patterns very carefully.
3. Using NATURAL JOIN because it looks clean.
NATURAL JOIN auto-joins on every column that shares a name. If anyone ever adds a created_at, updated_at, or status column to either table, the join silently changes meaning and starts dropping rows. It produces no error, no warning — the query just starts returning fewer results. Outlaw NATURAL JOIN in code review and prefer explicit ON.
4. Writing filters that defeat indexes inside ON or WHERE.
A predicate like ON LOWER(o.email) = LOWER(c.email) cannot use a plain index on email and forces a full scan and hash join over the wider table. Either store the lowercase form in a separate column with its own index, or create a functional index CREATE INDEX ON orders (LOWER(email));. The same applies to o.created_at::date = c.signup_date and similar wraps.
5. Confusing INNER JOIN order with execution order.
The order you write joins in is not necessarily the order they execute in. Postgres's planner reorders INNER JOINs to minimize intermediate row counts. If you assume "the database joins customers first, then orders, then items because that's how I wrote it," you will misread EXPLAIN output and chase the wrong tuning. Use EXPLAIN (ANALYZE, BUFFERS) and read the plan tree, not the SQL text.
Interview Questions
1. "What is the difference between INNER JOIN, ON, USING, and NATURAL JOIN, and which would you use in production?"
INNER JOIN is the operation: it returns rows where the join condition evaluates to TRUE for some pair from the two tables. ON, USING, and NATURAL JOIN are three syntaxes for specifying that condition. ON is the most general — you write any boolean expression you like, including expressions across columns with different names, and the join columns appear once from each side in the output. USING (col) requires both tables to have a column with the same name and merges them into a single output column; you can no longer qualify it with a table alias afterward. NATURAL JOIN automatically uses every shared column name as a join key, which makes it dangerous because a future schema change can silently start joining on additional columns and silently drop rows. In production code I always use ON because it is explicit, survives column renames, and makes intent obvious to the next reader. USING is fine in ad-hoc queries when both sides genuinely share a clean key name. NATURAL JOIN should be banned from any codebase.
2. "Why does an INNER JOIN sometimes return fewer rows than the smaller of the two tables?"
Three reasons, all stemming from the rule that INNER JOIN only emits a row when the join condition evaluates to TRUE. First, NULL in the join key on either side: NULL = anything returns UNKNOWN, not TRUE, so any row with a NULL key is dropped. Second, broken referential integrity: if a foreign key points to a parent row that no longer exists (because of a hard delete, a failed migration, or a typo on import), the child row has no match and is dropped. Third, additional predicates in the ON clause or downstream WHERE clauses that filter out rows. The diagnostic workflow is to count the source tables, count the result, count NULLs in the join key (COUNT(*) FILTER (WHERE key IS NULL)), and count orphans (SELECT COUNT(*) FROM child WHERE NOT EXISTS (SELECT 1 FROM parent WHERE parent.id = child.fk)). The numbers always reconcile, and usually the missing rows fall into one of those three buckets.
3. "For an INNER JOIN, are these two queries equivalent? JOIN c ON o.cid = c.id AND o.status = 'shipped' versus JOIN c ON o.cid = c.id WHERE o.status = 'shipped'."
For INNER JOIN, yes, they are equivalent — they return identical row sets and the planner usually produces the same plan. INNER JOIN treats the conjunction of the ON predicates and the WHERE predicates as a single boolean filter applied to the cross-product of the two tables. The reason it matters to know they are equivalent is that the answer changes for OUTER JOIN. In a LEFT JOIN, a predicate in the ON clause is evaluated before the outer-join row preservation kicks in, so it controls which rows on the right side count as a match — non-matching left rows still survive with NULLs on the right. The same predicate in the WHERE clause is evaluated after the join, so it filters out the NULL-padded rows and effectively turns the LEFT JOIN back into an INNER JOIN. Knowing the difference is one of the most common SQL interview questions because the bug it produces is silent and expensive in production.
4. "What is a fan-out and how do you avoid it when aggregating across a join?"
A fan-out is what happens when you join a parent row to a child table that has many rows per parent. The parent row is repeated in the output once per matching child row. This is correct join behavior — it is exactly what the conceptual model says — but it breaks any aggregate that assumes one row per parent. If you SUM(orders.total) after joining orders to order_items, an order with three line items contributes its total three times, inflating revenue by 200%. The fix has three flavors. First, use SUM(DISTINCT o.id, o.total) patterns or COUNT(DISTINCT) where appropriate, but be careful: SUM(DISTINCT col) deduplicates only on col, not on the row. Second, pre-aggregate the child table in a subquery or CTE so you join customers to (one row per order with its items pre-summed), restoring the one-to-one shape. Third, restructure as separate aggregating queries and combine in the application or with a UNION/CTE. The general rule: never aggregate across a join unless you can prove the join did not multiply the rows on the side you're aggregating over.
5. "Postgres has three physical join algorithms. Which one runs and why does it matter?"
Postgres picks between Nested Loop, Hash Join, and Merge Join based on the planner's row estimates and the available indexes. Nested Loop iterates the outer side and probes the inner side once per outer row; it is fast when the outer is tiny or the inner has a matching index, and catastrophic when both sides are large and unindexed (O(N*M)). Hash Join builds an in-memory hash table on the smaller side and streams the larger side through it; it is the workhorse for medium-to-large joins and runs in O(N+M), but needs work_mem to hold the hash. If the hash spills to disk, it switches to a batched algorithm that is much slower. Merge Join sorts both sides on the join key and walks them in lockstep; it is the winner when both sides are already sorted (typically because the optimizer is reading them via a matching index) and degrades when sorting from scratch is expensive. You read which one was chosen in EXPLAIN ANALYZE output — the join node is labeled Nested Loop, Hash Join, or Merge Join. It matters because the wrong choice is usually the difference between a 5 ms query and a 5 minute query, and the fix is almost always either creating the right index (helping Nested Loop or Merge Join) or raising work_mem (helping Hash Join stay in memory).
Quick Reference — INNER JOIN Cheat Sheet
+---------------------------------------------------------------+
| INNER JOIN CHEAT SHEET |
+---------------------------------------------------------------+
| |
| SET MODEL: |
| INNER JOIN = intersection of two tables on a key |
| Drops left rows with no match, drops right rows with none |
| |
| CONCEPTUAL EXECUTION: |
| for L in left: |
| for R in right: |
| if cond(L, R) is TRUE: emit (L, R) |
| |
| SYNTAX: |
| JOIN c ON o.cid = c.id -> use this in production |
| JOIN c USING (cid) -> ok for ad-hoc, same name |
| NATURAL JOIN c -> NEVER use, time bomb |
| |
| NULL RULE: |
| NULL = NULL -> UNKNOWN -> not TRUE -> row dropped |
| Use IS NOT DISTINCT FROM only if you really mean it |
| (rare; usually a sign of bad schema) |
| |
| ON vs WHERE: |
| INNER JOIN: equivalent (same result, often same plan) |
| OUTER JOIN: NOT equivalent -- the difference is huge |
| Convention: keys in ON, row filters in WHERE |
| |
| FAN-OUT: |
| parent + child(many) -> parent row repeated |
| SUM/COUNT over fan-out joins double-counts |
| Fix: COUNT(DISTINCT) or pre-aggregate in subquery/CTE |
| |
| ALGORITHMS: |
| Nested Loop -> small outer or indexed inner |
| Hash Join -> medium-large, fits work_mem |
| Merge Join -> both sides pre-sorted on key |
| |
| MULTI-TABLE: |
| INNER JOIN is associative + commutative |
| Planner reorders for the cheapest evaluation |
| Write order != execution order |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| KEY RULES |
+---------------------------------------------------------------+
| |
| 1. INNER JOIN drops unmatched rows on BOTH sides |
| 2. NULL in the join key always drops the row |
| 3. Use ON in production -- never NATURAL JOIN |
| 4. ON and WHERE are equivalent for INNER JOIN, NOT for OUTER |
| 5. Aggregating across a fan-out join double-counts |
| 6. The planner reorders INNER JOINs freely |
| 7. Read EXPLAIN to see which physical algorithm was picked |
| 8. Add an index on the inner side's join key for hot joins |
| 9. Always count rows before and after a join while debugging |
| 10. Prefer LEFT JOIN whenever the right-side row is optional |
| |
+---------------------------------------------------------------+
| Concern | Wrong Way | Right Way |
|---|---|---|
| Syntax | NATURAL JOIN customers | INNER JOIN customers c ON o.cid = c.id |
| Optional FK | INNER JOIN with NULL keys | LEFT JOIN, handle NULLs explicitly |
| Aggregation | SUM(total) over fan-out | Pre-aggregate child in a CTE first |
| Equality on NULLs | o.cid = c.id (drops NULLs) | IS NOT DISTINCT FROM (only if intended) |
| Filter location | All filters in WHERE for OUTER | Keys in ON, predicates in WHERE |
| Index strategy | No index on join key | Index inner side's FK column |
| Functional join | LOWER(a) = LOWER(b) | Functional index on LOWER(col) |
| Debugging | Trust the row count | Count source tables, NULLs, orphans |
| Plan reading | Assume write order = exec order | Read EXPLAIN tree bottom-up |
| MySQL < 8.0.18 | Hash join fallback | Index the inner join key |
Prev: Lesson 5.1 -- What Are Joins? Next: Lesson 5.3 -- LEFT JOIN
This is Lesson 5.2 of the Database Interview Prep Course -- 12 chapters, 58 lessons.