UNION, INTERSECT, EXCEPT
Set Operators, Column Compatibility, and the MySQL Workaround
LinkedIn Hook
"Your monthly report doubled in row count after a refactor. Someone changed UNION ALL to UNION — or the other way around — and nobody noticed for three weeks."
Set operators are the quiet workhorses of SQL. UNION stitches two result sets into one. INTERSECT keeps only the rows that appear in both. EXCEPT (called MINUS in Oracle) keeps the rows from the first that do not appear in the second. Together they let you express "either / both / neither" questions over query results in a way that joins simply cannot match for clarity.
And yet they are the source of an astonishing number of production bugs. UNION quietly deduplicates by sorting the entire combined result, which on a 50-million-row report turns a 200 ms query into a 90-second sort spill. UNION ALL keeps every row including duplicates, which is what you almost always actually wanted — and is dramatically faster. The two are one keyword apart and behave completely differently.
Then there is column compatibility. Set operators require both branches to have the same number of columns, in the same order, with compatible types. The output column names come from the first branch only. NULL handling differs from joins. And MySQL did not even ship INTERSECT or EXCEPT until version 8.0.31 in late 2022, so a query that runs perfectly on PostgreSQL throws a syntax error on a MySQL 5.7 production replica.
In Lesson 6.3, I break down all three set operators end to end: UNION vs UNION ALL, INTERSECT, EXCEPT, the column compatibility rules, ordering combined results, the MySQL workaround using JOIN and NOT EXISTS, and how to read EXPLAIN to tell whether your set operator is sorting or hashing.
Read the full lesson -> [link]
#SQL #Database #PostgreSQL #MySQL #BackendDevelopment #InterviewPrep #SetTheory
What You'll Learn
- How UNION, INTERSECT, and EXCEPT combine result sets at the row level using set semantics
- The crucial difference between UNION (deduplicates) and UNION ALL (keeps duplicates) and why ALL is usually what you want
- Column compatibility rules — same count, same order, compatible types — and how the output column names are decided
- How to ORDER BY a combined result and why the ORDER BY can only appear at the very end
- How NULLs behave under set operators and why two NULLs are treated as equal here even though they are not equal under
= - How to emulate INTERSECT and EXCEPT in MySQL versions that do not support them natively (pre-8.0.31)
- How to read EXPLAIN to see whether the engine is using HashSetOp, SetOp, or Sort+Unique to perform the operation
- When to reach for set operators vs joins vs subqueries to express the same idea
The Two Email Lists Analogy — Either, Both, Neither
Imagine you run a small conference and you have two email lists. List A is everyone who registered for the morning workshop. List B is everyone who registered for the afternoon workshop. A few people signed up for both. You want three different reports:
- Everyone who attended at least one workshop — combine both lists, but only count each person once. That is
UNION. - People who attended both workshops — only those who appear on both lists. That is
INTERSECT. - People who only attended the morning workshop — those on list A but not on list B. That is
EXCEPT.
Now consider a fourth question: "How many seats did we fill in total across both sessions?" Here you actually want to count someone twice if they showed up to both — once for morning, once for afternoon. That is UNION ALL. The "ALL" keyword tells the database "do not deduplicate — keep every row from every input even if some are identical." This sounds like a small distinction, but it is the difference between counting unique humans and counting filled seats. Both questions are valid; you have to know which one you are asking.
+---------------------------------------------------------------+
| SET OPERATORS AS VENN DIAGRAMS |
+---------------------------------------------------------------+
| |
| UNION (A or B, deduped) |
| ___ ___ |
| / \ / \ result = entire shaded area |
| | A |---| B | each row appears at most once |
| \___/ \___/ |
| [############] |
| |
| UNION ALL (A or B, every row kept) |
| ___ ___ |
| / \ / \ result = every row from A |
| | A |---| B | plus every row from B |
| \___/ \___/ duplicates included |
| [#####] [#####] |
| |
| INTERSECT (A and B) |
| ___ ___ |
| / \ / \ result = lens region only |
| | A |[##]| B | rows that exist in BOTH |
| \___/ \___/ |
| |
| EXCEPT (A minus B) |
| ___ ___ |
| / \ / \ result = left crescent only |
| | A##|---| B | rows in A that are NOT in B |
| \___/ \___/ |
| |
+---------------------------------------------------------------+
The mental model is set theory, not row-by-row processing. You are combining two sets of rows according to a logical rule, and the database is responsible for figuring out the most efficient way to compute it.
UNION vs UNION ALL — The One-Word Performance Cliff
UNION combines two result sets and removes duplicates. UNION ALL combines two result sets and keeps every row, duplicates and all. They differ by a single keyword and they have very different costs.
-- Sample tables used throughout this lesson
CREATE TABLE employees (
id SERIAL PRIMARY KEY,
name TEXT,
email TEXT,
dept TEXT
);
CREATE TABLE contractors (
id SERIAL PRIMARY KEY,
name TEXT,
email TEXT,
dept TEXT
);
INSERT INTO employees (name, email, dept) VALUES
('Alice', 'alice@co.io', 'Eng'),
('Bob', 'bob@co.io', 'Eng'),
('Carol', 'carol@co.io', 'Sales'),
('Dave', 'dave@co.io', 'Eng');
INSERT INTO contractors (name, email, dept) VALUES
('Eve', 'eve@ext.io', 'Eng'),
('Bob', 'bob@co.io', 'Eng'), -- same person as employee Bob
('Frank', 'frank@ext.io', 'Sales');
-- UNION: deduplicates by sorting the combined result.
-- Bob appears once even though he is in both tables.
SELECT name, email FROM employees
UNION
SELECT name, email FROM contractors;
-- name | email
-- ------+----------------
-- Alice | alice@co.io
-- Bob | bob@co.io <- only one Bob row, deduped
-- Carol | carol@co.io
-- Dave | dave@co.io
-- Eve | eve@ext.io
-- Frank | frank@ext.io
-- (6 rows)
-- UNION ALL: keeps every row. Bob appears twice.
-- This is the "headcount of all engagements" version.
SELECT name, email FROM employees
UNION ALL
SELECT name, email FROM contractors;
-- name | email
-- ------+----------------
-- Alice | alice@co.io
-- Bob | bob@co.io <- employee Bob
-- Carol | carol@co.io
-- Dave | dave@co.io
-- Eve | eve@ext.io
-- Bob | bob@co.io <- contractor Bob
-- Frank | frank@ext.io
-- (7 rows)
The performance gap is enormous. UNION has to either sort the entire combined result and discard adjacent duplicates, or build a hash table of every row seen so far and probe it for every new row. Either way, every byte of every column you select is part of the dedup key, and on a wide table with millions of rows the work is significant.
UNION ALL does no such work. It simply concatenates the two streams and emits them. There is no sort, no hash, no comparison. The cost is essentially "scan A, then scan B."
+---------------------------------------------------------------+
| UNION vs UNION ALL EXECUTION |
+---------------------------------------------------------------+
| |
| UNION (deduplicate) |
| Scan A -> |
| HashAggregate / Sort+Unique -> output |
| Scan B -> |
| Cost: O(N log N) for sort, plus full materialization |
| |
| UNION ALL (concatenate) |
| Scan A -> Append -> output |
| Scan B -> |
| Cost: O(N) -- two sequential scans, no extra work |
| |
| RULE: |
| If duplicates are impossible (e.g. the two branches use |
| mutually exclusive WHERE clauses) -> UNION ALL is correct |
| AND faster. Use UNION only when you actively need dedup. |
| |
+---------------------------------------------------------------+
The most common mistake is reflexively typing UNION when there cannot possibly be duplicates because the two branches are mutually exclusive — for example, "active orders from this year" UNION "active orders from last year" partitioned by year. There is no overlap, the dedup is wasted work, and the query runs many times slower than it should.
Column Compatibility — The Strict Rules
Set operators have stricter requirements than joins. Both branches must:
- Have the same number of columns. Mismatched counts are a syntax error.
- Have the columns in the same positional order. Set operators do not match by name, only by position.
- Have compatible data types per column. Postgres is fairly forgiving (it will coerce
INTandNUMERIC), but mixingTEXTandINTwill fail. - Use the column names from the first branch only. The second branch's column names are discarded.
-- Both branches have two columns of compatible types.
-- Output column names come from the FIRST SELECT only.
SELECT name AS person_name, email AS contact FROM employees
UNION ALL
SELECT name, email FROM contractors; -- the AS aliases here are ignored
-- Output columns: person_name, contact
-- Mismatched count -> error
SELECT name, email FROM employees
UNION
SELECT name FROM contractors;
-- ERROR: each UNION query must have the same number of columns
-- Mismatched type -> error (no implicit cast between text and integer)
SELECT name, id FROM employees
UNION
SELECT name, email FROM contractors;
-- ERROR: UNION types integer and text cannot be matched
-- Fix by casting both sides to a common type
SELECT name, id::TEXT FROM employees
UNION
SELECT name, email FROM contractors;
-- Works -- both columns are now TEXT.
Because matching is positional, a refactor that adds a column to one branch but forgets the other will fail loudly at parse time, which is good. A worse refactor — changing the order of columns in one branch — will succeed silently and produce nonsense rows where email from one branch lines up with dept from the other. Always list columns explicitly in set-operator branches; never use SELECT * because adding a column to the underlying table will silently shift positions in one branch and break the alignment.
INTERSECT — Rows in Both Result Sets
INTERSECT returns rows that appear in both inputs. Like UNION, it deduplicates by default. INTERSECT ALL (Postgres) preserves the multiplicity of duplicates — if a row appears twice in A and three times in B, INTERSECT ALL returns it twice (the minimum count).
-- Find people who appear as both an employee and a contractor.
-- Matching is by ALL selected columns, not just one.
SELECT name, email FROM employees
INTERSECT
SELECT name, email FROM contractors;
-- name | email
-- -----+----------------
-- Bob | bob@co.io
-- (1 row)
-- INTERSECT ALL keeps the minimum count of duplicates.
-- If Bob appeared twice in employees and once in contractors,
-- INTERSECT ALL would return him once. INTERSECT (without ALL)
-- would also return him once because of dedup.
SELECT name, email FROM employees
INTERSECT ALL
SELECT name, email FROM contractors;
A common pitfall: INTERSECT compares all selected columns together, not just one. If you want "people whose name appears in both tables regardless of email," you must select only name in both branches:
-- WRONG: matches by (name, email) tuple. A person with the same name
-- but a different email will NOT intersect.
SELECT name, email FROM employees
INTERSECT
SELECT name, email FROM contractors;
-- RIGHT: match by name alone
SELECT name FROM employees
INTERSECT
SELECT name FROM contractors;
-- name
-- -----
-- Bob
-- (1 row)
EXCEPT — Rows in A but Not in B
EXCEPT (called MINUS in Oracle) returns rows from the first input that do not appear in the second. Like the others it deduplicates by default; EXCEPT ALL is the multiset version.
-- Employees who are NOT also contractors.
-- Bob is excluded because he appears in both tables.
SELECT name, email FROM employees
EXCEPT
SELECT name, email FROM contractors;
-- name | email
-- ------+--------------
-- Alice | alice@co.io
-- Carol | carol@co.io
-- Dave | dave@co.io
-- (3 rows)
-- The opposite direction: contractors who are NOT employees.
SELECT name, email FROM contractors
EXCEPT
SELECT name, email FROM employees;
-- name | email
-- ------+----------------
-- Eve | eve@ext.io
-- Frank | frank@ext.io
-- (2 rows)
EXCEPT is asymmetric — A EXCEPT B is not the same as B EXCEPT A, just as A - B is not B - A in arithmetic. This is a frequent source of bugs when refactoring a query: swapping the order of the two branches silently changes the meaning.
NULL Handling in Set Operators
Set operators treat two NULLs as equal for the purposes of dedup, intersection, and difference. This is the opposite of how WHERE col1 = col2 behaves, where NULL = NULL evaluates to NULL (not true). The set operators use a stricter "is not distinct from" semantics, which is exactly what you want for set arithmetic.
INSERT INTO employees (name, email, dept) VALUES ('Grace', NULL, 'Eng');
INSERT INTO contractors (name, email, dept) VALUES ('Grace', NULL, 'Eng');
SELECT name, email FROM employees
INTERSECT
SELECT name, email FROM contractors;
-- name | email
-- ------+-------
-- Bob | bob@co.io
-- Grace | NULL <- two NULLs match here, even though NULL = NULL is not true
Ordering a Combined Result
ORDER BY can only appear once in a set-operator query, and it must appear at the very end — after the last branch. It applies to the combined result, not to either branch individually. You cannot put an ORDER BY inside a UNION branch (without wrapping it in a subquery), and you cannot have separate ORDER BY clauses on each branch.
-- ORDER BY at the end sorts the entire combined result
SELECT name, email, 'employee' AS source FROM employees
UNION ALL
SELECT name, email, 'contractor' AS source FROM contractors
ORDER BY name ASC;
-- name | email | source
-- ------+----------------+-----------
-- Alice | alice@co.io | employee
-- Bob | bob@co.io | contractor
-- Bob | bob@co.io | employee
-- Carol | carol@co.io | employee
-- Dave | dave@co.io | employee
-- Eve | eve@ext.io | contractor
-- Frank | frank@ext.io | contractor
-- Grace | NULL | contractor
-- Grace | NULL | employee
-- WRONG: ORDER BY inside a branch is a syntax error
SELECT name FROM employees ORDER BY name
UNION ALL
SELECT name FROM contractors;
-- ERROR: syntax error at or near "UNION"
-- If you NEED to limit one branch before combining, wrap it in parens
( SELECT name FROM employees ORDER BY id DESC LIMIT 5 )
UNION ALL
( SELECT name FROM contractors ORDER BY id DESC LIMIT 5 )
ORDER BY name;
-- This gives you "5 most recent employees + 5 most recent contractors,
-- combined and then sorted by name."
The ORDER BY at the end can reference column names or positional numbers. Because column names come from the first branch, that is what you sort by — even if the second branch used different aliases.
Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Diagram showing a SQL query split into two stacked SELECT branches with a 'UNION ALL' bar between them in sky blue (#4fc3f7), feeding into a single 'ORDER BY name' bar at the bottom in rose (#ff5c8a). On the right, a parallel 'WRONG' diagram showing an ORDER BY embedded in the upper branch with a red strikethrough. White monospace labels throughout. Below: a small note 'ORDER BY belongs at the end -- applies to the combined result.'"
The MySQL Workaround — Emulating INTERSECT and EXCEPT
MySQL did not support INTERSECT or EXCEPT natively until version 8.0.31, released in October 2022. Any production MySQL on 5.7 or early 8.0 will throw a syntax error. You also cannot rely on these operators in cross-engine code that needs to support older deployments.
The standard workaround uses INNER JOIN for INTERSECT and LEFT JOIN ... WHERE NULL (or NOT EXISTS) for EXCEPT.
-- INTERSECT emulated via INNER JOIN
-- "People who appear as both employees and contractors"
SELECT DISTINCT e.name, e.email
FROM employees e
INNER JOIN contractors c
ON e.name = c.name AND e.email <=> c.email;
-- The <=> operator is MySQL's NULL-safe equality.
-- It returns true when both sides are NULL, matching INTERSECT semantics.
-- Standard SQL equivalent: e.email IS NOT DISTINCT FROM c.email
-- EXCEPT emulated via LEFT JOIN with IS NULL
-- "Employees who are NOT contractors"
SELECT DISTINCT e.name, e.email
FROM employees e
LEFT JOIN contractors c
ON e.name = c.name AND e.email <=> c.email
WHERE c.name IS NULL;
-- EXCEPT emulated via NOT EXISTS (often cleaner and faster)
SELECT DISTINCT e.name, e.email
FROM employees e
WHERE NOT EXISTS (
SELECT 1 FROM contractors c
WHERE c.name = e.name AND c.email <=> e.email
);
+---------------------------------------------------------------+
| SET OPERATOR -> JOIN EQUIVALENTS |
+---------------------------------------------------------------+
| |
| PostgreSQL / Standard MySQL Workaround |
| -------------------- ---------------- |
| A INTERSECT B -> A INNER JOIN B + DISTINCT |
| A EXCEPT B -> A LEFT JOIN B WHERE B.k IS NULL |
| A EXCEPT B -> A WHERE NOT EXISTS (B) |
| A UNION B -> A UNION B (works everywhere) |
| A UNION ALL B -> A UNION ALL B (works everywhere) |
| |
| GOTCHA: NULL handling |
| Set operators treat NULL = NULL as TRUE. |
| JOINs treat NULL = NULL as UNKNOWN. |
| Use IS NOT DISTINCT FROM (Postgres) or <=> (MySQL) |
| to match set-operator semantics in a JOIN equivalent. |
| |
+---------------------------------------------------------------+
The NULL gotcha matters a lot. A naive INNER JOIN ... ON a.email = b.email will fail to match two rows where both emails are NULL, but INTERSECT would have matched them. If your workaround does not use a NULL-safe comparison, you will silently lose rows that the set-operator version would have returned.
Reading EXPLAIN — Hash, Sort, or Append
The planner picks one of three execution shapes for set operators, and you can see which from EXPLAIN.
EXPLAIN ANALYZE
SELECT name, email FROM employees
UNION
SELECT name, email FROM contractors;
-- HashAggregate (cost=...) <- dedup via in-memory hash
-- Group Key: employees.name, employees.email
-- -> Append
-- -> Seq Scan on employees
-- -> Seq Scan on contractors
EXPLAIN ANALYZE
SELECT name, email FROM employees
UNION ALL
SELECT name, email FROM contractors;
-- Append (cost=...) <- no dedup, just concatenate
-- -> Seq Scan on employees
-- -> Seq Scan on contractors
EXPLAIN ANALYZE
SELECT name, email FROM employees
INTERSECT
SELECT name, email FROM contractors;
-- HashSetOp Intersect (cost=...)
-- -> Append
-- -> Subquery Scan on "*SELECT* 1"
-- -> Seq Scan on employees
-- -> Subquery Scan on "*SELECT* 2"
-- -> Seq Scan on contractors
EXPLAIN ANALYZE
SELECT name, email FROM employees
EXCEPT
SELECT name, email FROM contractors;
-- HashSetOp Except (cost=...)
-- -> Append
-- -> ...
The key signals:
- Append alone ->
UNION ALL, the cheapest shape, just streaming both inputs through. - HashAggregate above an Append ->
UNIONdoing dedup with a hash table. - HashSetOp Intersect / Except -> the dedicated set-operator node Postgres uses for
INTERSECTandEXCEPT. Falls back toSetOp(sort-based) when work_mem is too small. - Sort + Unique -> the older sort-based dedup path, used when hashing is not viable.
If you see a Sort with Sort Method: external merge Disk: ..., your set operator is spilling to disk. The fixes are typically: switch UNION to UNION ALL if dedup is unnecessary, increase work_mem, or restructure the query so the dedup happens on a smaller pre-aggregated result.
When to Use Set Operators vs Joins vs Subqueries
The same logical question can often be expressed three ways. Knowing when each is the right tool matters for both readability and performance.
-- Question: "Which customers placed an order in BOTH 2025 and 2026?"
-- Option 1: INTERSECT
SELECT customer_id FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2025
INTERSECT
SELECT customer_id FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2026;
-- Option 2: Self-join
SELECT DISTINCT a.customer_id
FROM orders a
JOIN orders b ON a.customer_id = b.customer_id
WHERE EXTRACT(YEAR FROM a.created_at) = 2025
AND EXTRACT(YEAR FROM b.created_at) = 2026;
-- Option 3: GROUP BY + HAVING with conditional aggregation
SELECT customer_id
FROM orders
WHERE EXTRACT(YEAR FROM created_at) IN (2025, 2026)
GROUP BY customer_id
HAVING COUNT(DISTINCT EXTRACT(YEAR FROM created_at)) = 2;
All three return the same answer. The INTERSECT version reads most clearly as "customers from set A and from set B." The self-join is often the fastest because the planner has more freedom to push down filters and use indexes. The GROUP BY version scales well when you need to ask "in N specific years" for arbitrary N. Pick the one your team will most easily understand six months from now, and only optimize a different shape if profiling proves it matters.
Common Mistakes
1. Reflexive UNION when you mean UNION ALL.
This is the single most common set-operator bug. UNION deduplicates by sorting or hashing the entire combined result, which is expensive on wide tables. If the two branches cannot possibly produce duplicate rows — for example, mutually exclusive date ranges, or WHERE status = 'open' vs WHERE status = 'closed' — then UNION ALL is both correct and dramatically faster. Default to UNION ALL and only switch to UNION when you actively need deduplication.
2. Using SELECT * in set operator branches.
Because set operators match by column position, SELECT * is a time bomb. Adding a column to one of the underlying tables (or even reordering columns in a migration) will silently shift positions in one branch and produce nonsense rows where the wrong columns line up. Always list columns explicitly, in the same order, in every branch of a set operator.
3. Forgetting that ORDER BY belongs at the end.
Trying to put ORDER BY inside a UNION branch is a syntax error in standard SQL. The ORDER BY applies to the combined result and must appear after the last branch. If you genuinely need to sort or limit one branch before combining (for example, "top 5 from each table"), wrap each branch in parentheses with its own ORDER BY and LIMIT, then UNION ALL the parenthesized subqueries.
4. Assuming INTERSECT and EXCEPT exist on every database.
MySQL did not ship INTERSECT or EXCEPT until version 8.0.31 (October 2022). Code that runs perfectly on PostgreSQL or SQL Server will throw a "syntax error near INTERSECT" on a MySQL 5.7 or early 8.0 deployment. If your codebase needs to support multiple engines or older MySQL, use the JOIN-based workarounds (INNER JOIN for intersect, LEFT JOIN ... IS NULL or NOT EXISTS for except) and use a NULL-safe equality (<=> in MySQL, IS NOT DISTINCT FROM in Postgres).
5. Forgetting that NULLs are equal in set operators but not in joins.
SELECT NULL INTERSECT SELECT NULL returns one row — set operators treat NULL as equal to NULL. But SELECT * FROM a JOIN b ON a.x = b.x will not match rows where both a.x and b.x are NULL, because NULL = NULL is not true under standard SQL. When you replace a set operator with a join, you must use a NULL-safe equality operator or you will silently lose rows.
Interview Questions
1. "What is the difference between UNION and UNION ALL, and which one should you use by default?"
UNION combines two result sets and removes duplicate rows; UNION ALL combines them and keeps every row including duplicates. Mechanically, UNION must either sort the combined result and discard adjacent duplicates, or build a hash table of every row seen so far and probe it for each new row — both of which require materializing the entire result and doing per-row comparisons across every selected column. UNION ALL just concatenates the two streams with no extra work, so its cost is essentially the cost of scanning both inputs. The default should be UNION ALL unless you have a specific reason to deduplicate, because (a) on most queries the two branches are mutually exclusive and there cannot be duplicates anyway, and (b) the dedup cost on wide tables with millions of rows can turn a sub-second query into a multi-minute sort spill. The rule of thumb: type UNION ALL first and only switch to UNION when profiling or correctness explicitly requires it.
2. "What are the column compatibility rules for set operators, and what happens to column names in the output?"
Both branches of a set operator must have the same number of columns, in the same positional order, with compatible data types per column. Set operators do not match columns by name — they match by position — so a column named email in branch 1 and a column named address in branch 2 will be unioned together if they are in the same slot. The output column names come from the first branch only; aliases on the second branch are ignored. Type compatibility is per-column: PostgreSQL will coerce numeric types together (INT and NUMERIC) but will reject mixing TEXT with INT unless you cast explicitly. Because matching is positional, SELECT * in a set-operator branch is dangerous: adding a column to the underlying table in a migration silently shifts positions in one branch and produces nonsense alignment with the other. Always list columns explicitly in every branch.
3. "Why doesn't standard MySQL pre-8.0.31 support INTERSECT or EXCEPT, and how do you emulate them?"
INTERSECT and EXCEPT were part of the SQL standard since SQL-92, but MySQL historically did not implement them — they only landed in MySQL 8.0.31 in October 2022. Any production MySQL on 5.7 or early 8.0 will throw a syntax error on these keywords. The standard workarounds use joins: A INTERSECT B becomes SELECT DISTINCT ... FROM A INNER JOIN B ON <all_columns_match>, and A EXCEPT B becomes either SELECT DISTINCT ... FROM A LEFT JOIN B ON <all_columns_match> WHERE B.key IS NULL or SELECT DISTINCT ... FROM A WHERE NOT EXISTS (SELECT 1 FROM B WHERE <all_columns_match>). The critical detail is NULL handling: set operators treat two NULLs as equal, but = in a JOIN treats NULL = NULL as unknown and rejects the match. To preserve set-operator semantics in the workaround, use MySQL's NULL-safe equality <=> or, in standard SQL, IS NOT DISTINCT FROM. Otherwise you will silently lose rows where both sides have NULL in some column.
4. "Where can ORDER BY appear in a query that uses set operators, and why?"
ORDER BY can appear exactly once, at the very end of the entire set-operator chain, after the last branch. It applies to the combined result, not to either branch individually. The reason is conceptual: the set operator produces a single combined relation, and ORDER BY is a presentation concern that belongs to that final relation, not to its constituents. Putting ORDER BY inside a branch is a syntax error in standard SQL because the parser cannot tell whether the ORDER BY is meant for that branch or for the whole expression. If you genuinely need to sort or limit one branch before combining — for example "the 5 most recent employees and the 5 most recent contractors, all sorted by name" — you wrap each branch in parentheses with its own ORDER BY and LIMIT, then apply UNION ALL between the parenthesized subqueries, then add the final ORDER BY at the end. The output column names that the final ORDER BY can reference come from the first branch of the chain, regardless of what aliases the later branches used.
5. "How do NULLs behave under set operators, and how is that different from how they behave under = in a JOIN?"
Under set operators, two NULL values in the same column are treated as equal. SELECT NULL UNION SELECT NULL returns one row, not two. SELECT NULL INTERSECT SELECT NULL returns one row. This is "is not distinct from" semantics — the same comparison rule used by GROUP BY and DISTINCT. Under standard equality in a WHERE or JOIN ON clause, NULL behaves very differently: NULL = NULL evaluates to NULL (which is treated as false), so a row with a.x = NULL will never match a row with b.x = NULL. This asymmetry trips people up when they refactor a set-operator query into an equivalent join. If you replace A INTERSECT B with A INNER JOIN B ON A.col = B.col, you will silently lose any rows where col is NULL on both sides — the original INTERSECT would have matched them, but the join does not. The fix is to use a NULL-safe equality: IS NOT DISTINCT FROM in PostgreSQL and standard SQL, or <=> in MySQL. This is the single most common bug when porting set-operator queries to engines or codebases that prefer joins.
Quick Reference — Set Operators Cheat Sheet
+---------------------------------------------------------------+
| SET OPERATORS CHEAT SHEET |
+---------------------------------------------------------------+
| |
| THE FOUR OPERATORS: |
| A UNION B -> rows in A or B, deduped |
| A UNION ALL B -> rows in A or B, keep duplicates |
| A INTERSECT B -> rows in both A and B, deduped |
| A INTERSECT ALL B -> multiset intersection |
| A EXCEPT B -> rows in A not in B, deduped |
| A EXCEPT ALL B -> multiset difference |
| |
| COLUMN RULES: |
| 1. Same column count in every branch |
| 2. Same positional order (NOT matched by name) |
| 3. Compatible types per column |
| 4. Output names come from FIRST branch only |
| |
| ORDER BY: |
| Allowed once, at the very end |
| Applies to the combined result |
| Wrap branches in () to ORDER BY / LIMIT inside |
| |
| NULL HANDLING: |
| Set operators treat NULL = NULL as TRUE |
| JOINs treat NULL = NULL as UNKNOWN -- use <=> or |
| IS NOT DISTINCT FROM when emulating |
| |
| PERFORMANCE: |
| UNION ALL -> Append, O(N), no extra work |
| UNION -> HashAggregate / Sort+Unique, O(N log N) |
| INTERSECT -> HashSetOp Intersect |
| EXCEPT -> HashSetOp Except |
| Sort spill -> "Sort Method: external merge" |
| |
| MYSQL COMPATIBILITY: |
| UNION / UNION ALL -> works everywhere |
| INTERSECT / EXCEPT -> only MySQL 8.0.31+ |
| Workaround: INNER JOIN (intersect) |
| NOT EXISTS or LEFT JOIN IS NULL (except) |
| Use <=> for NULL-safe equality |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| KEY RULES |
+---------------------------------------------------------------+
| |
| 1. Default to UNION ALL -- only use UNION when dedup needed |
| 2. Never SELECT * in a set-operator branch |
| 3. Always list columns explicitly, in the same order |
| 4. ORDER BY belongs at the END of the entire chain |
| 5. Wrap branches in () to ORDER BY / LIMIT individually |
| 6. Set operators treat NULL = NULL as TRUE |
| 7. Replacing INTERSECT/EXCEPT with a JOIN needs NULL-safe = |
| 8. INTERSECT/EXCEPT are not in MySQL pre-8.0.31 |
| 9. Output column names come from the FIRST branch |
| 10. Check EXPLAIN for HashSetOp vs Sort to spot spills |
| |
+---------------------------------------------------------------+
| Concern | Wrong Way | Right Way |
|---|---|---|
| Combining mutually exclusive branches | UNION (wastes dedup) | UNION ALL |
| Branch column lists | SELECT * in each | Explicit columns, same order |
| Output naming | Aliasing both branches | Alias only the first branch |
| Sorting combined result | ORDER BY inside a branch | ORDER BY at the very end |
| Limiting one branch first | Bare LIMIT in a branch | (SELECT ... LIMIT N) then UNION |
| Emulating INTERSECT in MySQL | Plain INNER JOIN ON = | INNER JOIN ... ON a <=> b |
| Emulating EXCEPT in MySQL | LEFT JOIN ... ON = | LEFT JOIN ... <=> ... WHERE IS NULL |
| NULL semantics in JOIN equivalent | = | IS NOT DISTINCT FROM / <=> |
Prev: Lesson 6.2 -- Correlated vs Non-Correlated Subqueries Next: Lesson 6.4 -- Common Table Expressions
This is Lesson 6.3 of the Database Interview Prep Course -- 12 chapters, 58 lessons.