Database Interview Prep
Filtering, Sorting and Aggregation

GROUP BY

Bucketing Rows, Counting Piles, and the Non-Aggregate Rule

LinkedIn Hook

"Your query returns 'column must appear in GROUP BY clause or be used in an aggregate function' and you have no idea why."

Every SQL beginner hits this wall. You write SELECT customer_id, name, COUNT(*) FROM orders GROUP BY customer_id and the database refuses to run it. Meanwhile your friend on an older MySQL (or one with ONLY_FULL_GROUP_BY disabled) runs the exact same query and gets garbage data that looks fine. Which behavior is correct? Spoiler: the strict one.

GROUP BY is not just "add this clause when you use COUNT." It is a fundamental reshape of your result set from N input rows to K output buckets, and every column you project has to make sense at the bucket level — either because you grouped by it, because it is functionally determined by something you grouped by, or because you aggregated it down to a single value.

Once you internalize the bucket model, the rules stop feeling arbitrary. Grouping by multiple columns creates finer buckets. Grouping by DATE_TRUNC('month', created_at) gives you monthly roll-ups. GROUPING SETS gives you multiple grain levels in a single scan. And the query execution order (FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY) explains why you cannot reference SELECT aliases in GROUP BY in standard SQL.

In Lesson 4.4, I break down GROUP BY from first principles: the bucket model, the non-aggregate rule, expression grouping, ordinal positions, functional dependency exceptions, and GROUPING SETS/ROLLUP/CUBE.

Read the full lesson -> [link]

#SQL #Database #GroupBy #DataEngineering #BackendDevelopment #InterviewPrep


GROUP BY thumbnail


What You'll Learn

  • Why GROUP BY exists and the "N rows in, K buckets out" mental model
  • The non-aggregate rule — why every projected column must be grouped or aggregated
  • Multi-column grouping and how adding columns creates finer buckets
  • Grouping by expressions like DATE_TRUNC('month', created_at) for time-series rollups
  • GROUP BY with ordinal positions (GROUP BY 1, 2) and why it is controversial
  • The functional dependency exception in PostgreSQL when grouping by a primary key
  • GROUPING SETS, ROLLUP, and CUBE for multi-grain aggregation in a single query
  • The full logical query execution order and why it matters for aliases

The Card-Sorting Analogy — Piles, Counts, and Grain

Imagine you are handed a shuffled deck of 200 loyalty cards from a coffee shop. Each card has a customer name, a store location, a purchase date, and an amount. Your manager walks over and asks: "How much did each store sell last month?"

You do not need a database to answer this. You physically sort the cards into piles — one pile per store. Card goes into the "Downtown" pile, another into "Airport," another into "Mall," and so on. When every card is in exactly one pile, you count each pile and add up the amounts. You now have one row of output per pile, not per card. Two hundred cards collapsed into five numbers.

That is literally what GROUP BY does. The input is a flat table of rows. You pick one or more columns (the store name, in our case) and the database sorts rows into buckets where all rows in a bucket share the same value for those columns. Then for each bucket, you compute summary values — COUNT, SUM, AVG, MIN, MAX — and emit exactly one output row per bucket.

The manager asks a follow-up: "Break it down by store and month." Now each card goes into a pile defined by the pair (store, month). Downtown-January, Downtown-February, Airport-January, and so on. You end up with more piles, each containing fewer cards. That is what "multi-column grouping" means: more grouping columns -> finer buckets -> more output rows with less data each.

Here is the crucial rule the analogy makes obvious: once you have sorted the cards into piles, you can no longer ask "what is the customer name of this pile?" — there is no single customer name, there are many. You can only ask questions that make sense at the pile level: how many cards, total amount, earliest date, latest date. Any column you want to display must either be the thing you sorted by (store, month) or a pile-level summary (count, sum). That is the non-aggregate rule. It is not arbitrary; it is physics.

+---------------------------------------------------------------+
|           FROM ROWS TO BUCKETS                                 |
+---------------------------------------------------------------+
|                                                                |
|  INPUT (orders table, 200 rows)                               |
|  +----+----------+-----------+------------+--------+          |
|  | id | customer | store     | created_at | amount |          |
|  +----+----------+-----------+------------+--------+          |
|  |  1 | Alice    | Downtown  | 2026-01-03 |  12.50 |          |
|  |  2 | Bob      | Airport   | 2026-01-04 |   8.75 |          |
|  |  3 | Alice    | Downtown  | 2026-02-11 |  15.00 |          |
|  |  4 | Carol    | Mall      | 2026-01-22 |   5.25 |          |
|  |  ..| ...      | ...       | ...        |  ...   |          |
|  +----+----------+-----------+------------+--------+          |
|                                                                |
|                       |                                        |
|                       | GROUP BY store                         |
|                       v                                        |
|                                                                |
|  OUTPUT (3 buckets)                                            |
|  +-----------+-------+---------+                              |
|  | store     | count | sum     |                              |
|  +-----------+-------+---------+                              |
|  | Downtown  |    87 | 1094.25 |                              |
|  | Airport   |    54 |  612.50 |                              |
|  | Mall      |    59 |  701.75 |                              |
|  +-----------+-------+---------+                              |
|                                                                |
|  Every output row = one bucket. Every projected column must   |
|  be either the bucket key (store) or a summary (count, sum).  |
|                                                                |
+---------------------------------------------------------------+

Napkin AI Visual Prompt: "Dark gradient (#0a0f1f -> #111a2e). Left side: a messy pile of rectangular cards in mixed colors (sky blue, rose, white) representing unsorted rows, labeled 'orders: 200 rows' in white monospace. A large sky blue (#4fc3f7) arrow curves to the right labeled 'GROUP BY store'. Right side: three neat stacks of same-colored cards (one all sky blue labeled 'Downtown 87', one all rose labeled 'Airport 54', one all white labeled 'Mall 59'). Title: 'N rows in -> K buckets out' in rose (#ff5c8a) monospace."


Why GROUP BY Exists

A SQL query without GROUP BY returns one output row per input row (after filtering). That is fine for "show me every order," but useless for "show me totals per customer." The moment you need a summary — count, sum, average, maximum — you need to collapse many rows into one. GROUP BY is the operator that defines how many rows collapse into each output: all rows sharing the same values for the grouping columns become one row.

-- Without GROUP BY: one row per input row
SELECT customer_id, amount FROM orders WHERE created_at >= '2026-01-01';
-- customer_id | amount
-- ------------+--------
--           1 |  12.50
--           1 |  15.00
--           2 |   8.75
--           3 |   5.25
--           1 |  22.00
-- (5 rows — raw data)

-- With GROUP BY: one row per unique customer_id
SELECT customer_id, COUNT(*) AS order_count, SUM(amount) AS total_spend
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_id;
-- customer_id | order_count | total_spend
-- ------------+-------------+-------------
--           1 |           3 |       49.50
--           2 |           1 |        8.75
--           3 |           1 |        5.25
-- (3 rows — one per bucket)

The second query has three output rows because there are three distinct values in customer_id. Customer 1's three orders collapsed into a single row with order_count = 3 and total_spend = 49.50. The raw amounts are gone — they have been summarized away.


The Non-Aggregate Rule

Every column in your SELECT list (or ORDER BY, or HAVING) must be one of:

  1. A column listed in the GROUP BY clause, or
  2. Wrapped in an aggregate function (COUNT, SUM, AVG, MIN, MAX, STRING_AGG, ARRAY_AGG, etc.), or
  3. A column functionally determined by a grouping column (Postgres-specific, see below).

If you violate this rule, the database raises an error like column "orders.customer_name" must appear in the GROUP BY clause or be used in an aggregate function. The rule exists because the bucket contains many rows, and for a non-grouped non-aggregated column, there is no single answer to "what is its value for this bucket?"

-- WRONG: customer_name is not grouped and not aggregated
SELECT customer_id, customer_name, COUNT(*)
FROM orders
GROUP BY customer_id;
-- ERROR: column "orders.customer_name" must appear in the GROUP BY
--        clause or be used in an aggregate function

-- FIX 1: add customer_name to GROUP BY (if it is 1:1 with customer_id, same result)
SELECT customer_id, customer_name, COUNT(*)
FROM orders
GROUP BY customer_id, customer_name;

-- FIX 2: aggregate it (pick MAX/MIN arbitrarily when you know it is constant per bucket)
SELECT customer_id, MAX(customer_name) AS customer_name, COUNT(*)
FROM orders
GROUP BY customer_id;

MySQL historical quirk: before 5.7, MySQL allowed the wrong query above to run and silently returned an arbitrary customer_name from some row in each bucket. This was a constant source of subtle bugs. Modern MySQL defaults to ONLY_FULL_GROUP_BY mode and enforces the standard rule.
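You can check which behavior your MySQL session enforces by inspecting sql_mode. A sketch — the exact list of modes varies by server version and configuration:

-- MySQL only: list the active SQL modes for this session
SELECT @@sql_mode;  -- look for ONLY_FULL_GROUP_BY in the comma-separated list

-- Opt back in for the current session if your server has it disabled
SET SESSION sql_mode = CONCAT(@@sql_mode, ',ONLY_FULL_GROUP_BY');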


Multi-Column Grouping — Finer Buckets

Listing multiple columns in GROUP BY creates buckets keyed by the combination. Each unique tuple of values becomes one bucket. The more columns you group by, the more buckets you get and the fewer rows per bucket.

-- One bucket per (store, month) combination
SELECT
  store,
  DATE_TRUNC('month', created_at) AS month,
  COUNT(*)                         AS order_count,
  SUM(amount)                      AS revenue
FROM orders
GROUP BY store, DATE_TRUNC('month', created_at)
ORDER BY store, month;
-- store    | month      | order_count | revenue
-- ---------+------------+-------------+---------
-- Airport  | 2026-01-01 |          54 |  612.50
-- Airport  | 2026-02-01 |          61 |  734.00
-- Downtown | 2026-01-01 |          87 | 1094.25
-- Downtown | 2026-02-01 |          92 | 1188.50
-- Mall     | 2026-01-01 |          59 |  701.75
-- Mall     | 2026-02-01 |          63 |  744.00
-- (6 rows: 3 stores x 2 months)

Adding more grouping columns is like adding more sorting criteria to your pile of cards — you end up with more, smaller piles. The trade-off is that each bucket summarizes less data. Grouping by (store, month, hour, customer_id, product_id) might give you back nearly the original row count — at which point GROUP BY is doing no useful compression.
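A quick sanity check before committing to a grain is to compare the bucket count against the raw row count. A sketch using Postgres row-value syntax for the multi-column DISTINCT:

-- How much compression would GROUP BY store, customer_id actually buy?
SELECT
  COUNT(*)                              AS raw_rows,
  COUNT(DISTINCT (store, customer_id))  AS buckets
FROM orders;
-- If buckets is close to raw_rows, this key barely compresses anything —
-- drop columns until buckets hold meaningful groups.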


GROUP BY with Expressions

You can group by any expression, not just bare columns. This is how you do time-series rollups (grouping by month, week, day), bucketing (age ranges, price tiers), and derived keys.

-- Group by computed buckets — age brackets
SELECT
  CASE
    WHEN age < 18 THEN 'minor'
    WHEN age < 30 THEN 'young_adult'
    WHEN age < 60 THEN 'adult'
    ELSE 'senior'
  END                   AS age_bracket,
  COUNT(*)              AS user_count,
  AVG(lifetime_value)   AS avg_ltv
FROM users
GROUP BY CASE
  WHEN age < 18 THEN 'minor'
  WHEN age < 30 THEN 'young_adult'
  WHEN age < 60 THEN 'adult'
  ELSE 'senior'
END;
-- age_bracket | user_count | avg_ltv
-- ------------+------------+---------
-- young_adult |       1204 |  142.30
-- adult       |       2891 |  287.55
-- senior      |        612 |  412.00
-- minor       |         87 |   23.40

The expression must match between SELECT and GROUP BY exactly (in standard SQL). Some databases let you use a SELECT alias in GROUP BY as a convenience (Postgres, MySQL, SQLite do; Oracle and SQL Server historically do not), but the standard says no because GROUP BY is evaluated before SELECT in the logical execution order.

-- Postgres/MySQL-only shortcut — uses SELECT alias in GROUP BY
SELECT DATE_TRUNC('month', created_at) AS month, COUNT(*)
FROM orders
GROUP BY month; -- non-standard but convenient

GROUP BY Column Position (1, 2, 3)

You can also reference grouping columns by their position in the SELECT list. This is concise and avoids repeating long expressions, but it is controversial because any reorder of the SELECT list silently breaks the grouping.

-- Positional GROUP BY — refers to SELECT columns 1 and 2
SELECT
  store,                                    -- position 1
  DATE_TRUNC('month', created_at) AS month, -- position 2
  COUNT(*)                        AS n,     -- position 3 (aggregated)
  SUM(amount)                     AS rev    -- position 4 (aggregated)
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2;

Why it is controversial: if someone swaps the order of store and month in the SELECT list, the GROUP BY still says 1, 2 and now refers to different columns. The query still runs, the results are still valid-looking, but the semantics changed silently. For ad-hoc analytics queries this is fine; for production queries that live in version control and get reviewed, prefer explicit column names or expressions.

Explicit column names are standard SQL; positional GROUP BY is a widely supported extension (Postgres, MySQL, and most analytics engines) rather than part of the standard, and SQL Server rejects it outright. Style guides split on which to prefer — dbt's style guide famously recommends positional for analytics, while most application code style guides require explicit names.


The Functional Dependency Exception (Postgres)

Strict standard SQL says every non-aggregated projected column must be in GROUP BY. But Postgres (since 9.1) is smart enough to allow projecting columns that are functionally determined by the grouping columns — specifically, when you GROUP BY a table's primary key, you can project any other column of that table without listing it in GROUP BY.

-- users.id is the primary key, so name, email, created_at are
-- functionally determined by it. Postgres accepts this query.
SELECT
  u.id,
  u.name,
  u.email,
  COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;  -- name and email are functionally dependent on id

Without this exception, you would have to write GROUP BY u.id, u.name, u.email — verbose and redundant, since the primary key already uniquely determines every other column in that row. Postgres sees that u.id is the PK of users and silently allows the projection.

Caveats: Postgres recognizes only a declared primary key here. Grouping by a column with a plain UNIQUE constraint, or by a non-key column that happens to be unique in practice, does not trigger the exception. And it only applies to columns from the same table as the PK — you cannot extend it to joined tables through the dependency. MySQL 5.7+ has a broader functional-dependency relaxation of its own.
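To see the boundary of the exception, suppose users.email carries a UNIQUE constraint (but is nullable and not the primary key). A sketch on the same hypothetical schema as above — Postgres still rejects the projection:

-- email is UNIQUE but not the declared primary key,
-- so the functional dependency exception does not apply
SELECT u.email, u.name, COUNT(o.id) AS order_count
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.email;
-- ERROR: column "u.name" must appear in the GROUP BY clause
--        or be used in an aggregate function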


GROUPING SETS, ROLLUP, CUBE — Multi-Grain in One Query

Sometimes you want totals at multiple levels in a single query: per-store, per-month, per-store-per-month, and grand total — all from one scan. GROUPING SETS lets you specify multiple grouping column lists and get the union of all their groupings in one result set.

-- GROUPING SETS: three separate groupings unioned together
SELECT
  store,
  DATE_TRUNC('month', created_at) AS month,
  SUM(amount)                     AS revenue
FROM orders
GROUP BY GROUPING SETS (
  (store, DATE_TRUNC('month', created_at)),  -- by store and month
  (store),                                    -- by store only
  ()                                          -- grand total
);
-- store    | month      | revenue
-- ---------+------------+---------
-- Airport  | 2026-01-01 |  612.50
-- Airport  | 2026-02-01 |  734.00
-- Downtown | 2026-01-01 | 1094.25
-- Downtown | 2026-02-01 | 1188.50
-- Mall     | 2026-01-01 |  701.75
-- Mall     | 2026-02-01 |  744.00
-- Airport  | NULL       | 1346.50    <- store subtotal
-- Downtown | NULL       | 2282.75    <- store subtotal
-- Mall     | NULL       | 1445.75    <- store subtotal
-- NULL     | NULL       | 5075.00    <- grand total

NULL appears in the output columns that were not part of that particular grouping set — so the subtotal rows have NULL for month, and the grand total row has NULL for both. Use the GROUPING() function to distinguish "real NULL" from "aggregate-level NULL."
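Here is a sketch of GROUPING() applied to the query above. GROUPING(col) returns 0 when the column participates in the current grouping set and 1 when its NULL is synthetic:

SELECT
  store,
  DATE_TRUNC('month', created_at)            AS month,
  SUM(amount)                                AS revenue,
  GROUPING(store)                            AS store_rolled_up,
  GROUPING(DATE_TRUNC('month', created_at))  AS month_rolled_up
FROM orders
GROUP BY GROUPING SETS (
  (store, DATE_TRUNC('month', created_at)),
  (store),
  ()
);
-- Detail rows:     store_rolled_up = 0, month_rolled_up = 0
-- Store subtotals: store_rolled_up = 0, month_rolled_up = 1
-- Grand total:     store_rolled_up = 1, month_rolled_up = 1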

ROLLUP is a shortcut for hierarchical subtotals: ROLLUP(a, b, c) expands to GROUPING SETS ((a, b, c), (a, b), (a), ()).

CUBE produces all possible combinations: CUBE(a, b) expands to GROUPING SETS ((a, b), (a), (b), ()).

-- ROLLUP — hierarchical: store+month, then store, then grand total
SELECT store, DATE_TRUNC('month', created_at) AS month, SUM(amount)
FROM orders
GROUP BY ROLLUP (store, DATE_TRUNC('month', created_at));

-- CUBE — every combination of the grouping columns
SELECT store, product_category, SUM(amount)
FROM orders
GROUP BY CUBE (store, product_category);
-- Produces: (store, category), (store), (category), ()

These are the building blocks for OLAP-style reporting. Without them, you would run four separate queries and UNION them — paying four table scans instead of one.


Query Execution Order

Here is the logical order SQL evaluates a query. This is not the physical plan (the optimizer may reorder for performance) but the logical model that explains every rule about what can reference what.

+---------------------------------------------------------------+
|           LOGICAL QUERY EXECUTION ORDER                       |
+---------------------------------------------------------------+
|                                                                |
|   1. FROM / JOIN    -> produce the row source (raw rows)      |
|   2. WHERE          -> filter raw rows (no aggregates yet)    |
|   3. GROUP BY       -> collapse rows into buckets             |
|   4. HAVING         -> filter buckets (aggregates allowed)    |
|   5. SELECT         -> project columns, compute aggregates    |
|   6. DISTINCT       -> dedupe output rows                     |
|   7. ORDER BY       -> sort final result (aliases OK here)    |
|   8. LIMIT / OFFSET -> slice the result                       |
|                                                                |
+---------------------------------------------------------------+

This order explains several otherwise-mysterious rules:

  • Why WHERE cannot use aggregates: WHERE runs before GROUP BY, so the aggregates do not exist yet. Use HAVING instead (next lesson).
  • Why HAVING can use aggregates: HAVING runs after GROUP BY, so it sees bucket-level values.
  • Why SELECT aliases are not allowed in WHERE or GROUP BY (standard SQL): SELECT runs after GROUP BY, so the alias does not exist yet. ORDER BY runs after SELECT, so aliases work there.
  • Why GROUP BY cannot reference aggregates: you cannot bucket rows by a per-bucket value that does not exist until after bucketing.

Postgres and MySQL relax the alias rule in GROUP BY as a convenience, but the standard and portable practice is to repeat the expression.


Common Mistakes

1. Forgetting a non-aggregate column in GROUP BY. The canonical beginner error. You write SELECT store, product, COUNT(*) FROM sales GROUP BY store and the database rejects it because product is neither grouped nor aggregated. The question to ask yourself is: "For a single bucket, what single value should product display?" If there is no clear answer, either add it to GROUP BY (which makes finer buckets) or aggregate it with MAX, MIN, or STRING_AGG.

2. Grouping by a SELECT alias in standard SQL. SELECT DATE_TRUNC('month', created_at) AS month, COUNT(*) FROM orders GROUP BY month works in Postgres, MySQL, and SQLite but fails in Oracle, DB2, and SQL Server. The logical execution order says SELECT runs after GROUP BY, so the alias does not exist at GROUP BY time. Repeat the expression (GROUP BY DATE_TRUNC('month', created_at)) for portable code.

3. Grouping too finely and getting back nearly the original row count. If you GROUP BY order_id, customer_id, product_id, created_at_second, every row likely has a unique combination and your "aggregated" result is almost identical to the raw table. GROUP BY only compresses data when multiple rows share the same grouping key. If you see bucket counts close to row counts, you are grouping at the wrong grain — drop columns until buckets hold meaningful groups.

4. Using COUNT(column) when you meant COUNT(*). COUNT(*) counts every row in the bucket. COUNT(col) counts rows where col IS NOT NULL. On a LEFT JOIN where the right side might be NULL, these differ dramatically — COUNT(*) gives you "customers" while COUNT(orders.id) gives you "customers with at least one order." Know which one you want.
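The difference is easiest to see on a LEFT JOIN. A sketch, assuming a hypothetical users table where some customers have never ordered:

SELECT
  u.id,
  COUNT(*)    AS rows_in_bucket,  -- counts the NULL-padded join row too
  COUNT(o.id) AS order_count      -- skips NULLs: 0 for customers with no orders
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id;
-- A customer with no orders still gets one NULL-extended row from the join,
-- so rows_in_bucket = 1 while order_count = 0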

5. Assuming ORDER BY is implied by GROUP BY. In most databases, GROUP BY does not guarantee any particular output order. If you need sorted results, add ORDER BY explicitly. MySQL historically returned GROUP BY results sorted, which caused a whole generation of developers to write queries that broke silently when migrated to Postgres or when MySQL changed its optimizer.
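The fix is a one-line habit (sketch):

SELECT store, SUM(amount) AS revenue
FROM orders
GROUP BY store
ORDER BY revenue DESC;  -- explicit sort; never rely on grouping order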


Interview Questions

1. "Explain the rule that every non-aggregated column in SELECT must appear in GROUP BY. Why does this rule exist?"

GROUP BY collapses many input rows into one output row per unique combination of grouping column values. For columns listed in GROUP BY, all rows in a bucket share the same value by definition, so projecting them is unambiguous. For aggregated columns (COUNT, SUM, etc.), the aggregate function reduces the bucket's rows to a single summary value. But for any other column, the bucket contains potentially many different values and there is no principled way to pick one. The rule exists because the database refuses to silently guess — it would rather error out and force you to be explicit about what you meant. MySQL before 5.7 violated this by arbitrarily returning some value from one of the rows, which caused notorious silent-bug behavior; modern MySQL enforces the standard rule via ONLY_FULL_GROUP_BY. Postgres adds a functional-dependency relaxation: when you GROUP BY a table's primary key, other columns of that table can be projected because the PK functionally determines them, so the "which value" question has a single correct answer.

2. "What is the logical query execution order in SQL, and why does it matter?"

The logical order is: FROM/JOIN produces the row source, WHERE filters raw rows, GROUP BY collapses them into buckets, HAVING filters buckets, SELECT projects columns and computes aggregates, DISTINCT dedupes, ORDER BY sorts the final result, and LIMIT/OFFSET slices it. This is the logical model the language is defined against, not necessarily the physical plan the optimizer executes. It matters because it explains every sequencing rule in the language. WHERE cannot use aggregates because WHERE runs before GROUP BY, so aggregates do not exist yet — HAVING is the clause for post-aggregation filtering. SELECT aliases are not allowed in WHERE or (standard) GROUP BY because SELECT has not run yet — but they are allowed in ORDER BY because ORDER BY runs after SELECT. GROUP BY cannot reference an aggregate because you cannot bucket by a value that only exists after bucketing. Every "why can't I use X in clause Y" question has the same answer: check the execution order.

3. "When would you reach for GROUPING SETS, ROLLUP, or CUBE instead of multiple GROUP BY queries?"

Use them when you need aggregations at multiple grain levels in a single result — typically for dashboards, pivot reports, and OLAP summaries. GROUPING SETS lets you specify an arbitrary list of column-combinations; the result is the union of a separate grouping for each combination, in a single table scan. ROLLUP is a shortcut for hierarchical subtotals: ROLLUP(country, state, city) produces aggregates at (country, state, city), (country, state), (country), and the grand total — exactly what you want for drill-down reports where each level is a subset of the previous. CUBE produces every possible combination of the listed columns, which is what you want for pivot tables where you care about every slice and dice. The alternative is writing separate queries and UNIONing them, but that pays one table scan per query and is verbose. GROUPING SETS runs once, reads the base data once, and emits all the grouping levels as a single result set. Use the GROUPING() function to distinguish the synthetic NULLs in subtotal rows from real data NULLs.

4. "What is the difference between GROUP BY with an explicit column and GROUP BY with an ordinal position like GROUP BY 1, 2?"

Functionally they produce the same result — both reference the grouping columns, just by different names. Explicit columns name them literally (GROUP BY store, month), while positional notation names them by their index in the SELECT list (GROUP BY 1, 2 means "group by the first and second items in SELECT"). Positional is concise, avoids repeating long expressions like DATE_TRUNC('month', created_at), and is idiomatic in analytics code and the dbt community. The downside is fragility: any reorder of the SELECT list silently changes the meaning of the GROUP BY without any error, which is dangerous for queries that live in version control and get edited by many hands. For ad-hoc exploration and analytics queries where the author is the main audience, positional is fine. For production queries embedded in application code, prefer explicit column names or expressions so a reviewer can verify the grouping intent without counting commas.

5. "Why does Postgres allow you to omit a column from GROUP BY when you group by a primary key?"

Because the primary key functionally determines every other column in that row. If you group by users.id and users.id is the PK of users, then within each bucket all rows share the same id, which means they share the same name, email, created_at, and every other column of that row — because there is exactly one row per id. The "which value should we display for this bucket" question has a single, unambiguous answer: the value from the one row. Postgres recognizes this at plan time and relaxes the standard rule for columns in the same table as the grouped PK. This saves you from writing GROUP BY u.id, u.name, u.email, u.created_at, u.country, u.signup_source, ... for every column you want to project, which is redundant since the PK already determines all of them. The relaxation requires a declared primary key — Postgres will not infer functional dependency from a plain UNIQUE constraint or from data patterns — and applies only to columns of the table whose PK you grouped on, not to joined tables. MySQL 5.7+ has a similar (and somewhat broader) relaxation.


Quick Reference — GROUP BY Cheat Sheet

+---------------------------------------------------------------+
|           GROUP BY CHEAT SHEET                                 |
+---------------------------------------------------------------+
|                                                                |
|  MENTAL MODEL:                                                 |
|   N input rows -> K output buckets (one per unique key)        |
|                                                                |
|  THE RULE:                                                     |
|   Every SELECT column must be:                                 |
|     1. Listed in GROUP BY, OR                                  |
|     2. Wrapped in an aggregate function, OR                    |
|     3. Functionally determined by GROUP BY (Postgres/MySQL)    |
|                                                                |
|  SYNTAX VARIANTS:                                              |
|   GROUP BY store                     -- single column          |
|   GROUP BY store, month              -- multi-column           |
|   GROUP BY DATE_TRUNC('month', ts)   -- expression             |
|   GROUP BY 1, 2                      -- ordinal position       |
|                                                                |
|  MULTI-GRAIN:                                                  |
|   GROUP BY GROUPING SETS ((a,b),(a),())                        |
|   GROUP BY ROLLUP (a, b, c)          -- hierarchical           |
|   GROUP BY CUBE (a, b)               -- all combinations       |
|                                                                |
|  EXECUTION ORDER:                                              |
|   FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY   |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. GROUP BY collapses rows into buckets — one output per key  |
|  2. Non-aggregated columns must be in GROUP BY                 |
|  3. Adding columns to GROUP BY makes finer, smaller buckets    |
|  4. Group by expressions for time rollups and bucketing        |
|  5. Postgres allows PK functional dependency — skip repeats    |
|  6. WHERE filters rows, HAVING filters buckets                 |
|  7. Use GROUPING SETS/ROLLUP/CUBE for multi-grain in one scan  |
|  8. GROUP BY does not guarantee order — add ORDER BY           |
|                                                                |
+---------------------------------------------------------------+

Concern                  | Wrong Way                                              | Right Way
-------------------------+--------------------------------------------------------+------------------------------------------
Non-agg column           | SELECT customer, name, COUNT(*) ... GROUP BY customer  | Add name to GROUP BY or aggregate it
Time rollup              | GROUP BY created_at (one bucket per second)            | GROUP BY DATE_TRUNC('month', created_at)
SELECT alias in GROUP BY | Relying on the MySQL/PG quirk across databases         | Repeat the expression
Multi-grain              | Four separate queries + UNION                          | GROUP BY GROUPING SETS (...)
Filtering aggregates     | WHERE COUNT(*) > 5                                     | HAVING COUNT(*) > 5
PK projection            | Listing every column in GROUP BY                       | Group by the PK only (Postgres)
Output order             | Assuming GROUP BY sorts                                | Explicit ORDER BY
COUNT semantics          | COUNT(col) when you mean all rows                      | COUNT(*) for row count

Prev: Lesson 4.3 -- Aggregate Functions
Next: Lesson 4.5 -- HAVING vs WHERE


This is Lesson 4.4 of the Database Interview Prep Course -- 12 chapters, 58 lessons.
