LinkedIn Hook

"Your finance dashboard reported $0.00 in refunds for an entire quarter. The refunds table had 4,812 rows. What went wrong?"

Someone wrote SUM(refund_amount) on a column where half the rows were NULL. The SUM silently skipped the nulls, the surviving rows summed to a non-zero value, but a different bug up the pipeline returned an empty result set — and SUM on an empty set is not 0, it is NULL. The dashboard rendered NULL as $0.00 and nobody noticed for three months.

Aggregate functions look like the simplest thing in SQL. COUNT, SUM, AVG, MIN, MAX — every tutorial covers them in the first week. And yet they are the single largest source of "the numbers on the report are wrong" bugs in production analytics. The reason is that aggregates have subtle, surprising, and inconsistent behavior around NULL, around empty sets, around integer math, and around the difference between COUNT(*) and COUNT(column).

In Lesson 4.3, I break down every aggregate function, show side-by-side outputs for COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col), explain the integer-division trap in AVG, and walk through FILTER (WHERE ...) for conditional aggregates.

Read the full lesson -> [link]

#SQL #Database #DataEngineering #Postgres #Analytics #InterviewPrep

Aggregate Functions thumbnail

What You'll Learn

The five core aggregate functions — MIN, MAX, AVG, SUM, COUNT — and their exact semantics
How each aggregate handles NULL values, and why COUNT(*) is the one exception
The difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column) with side-by-side output
How SUM and AVG silently skip NULL rows, and when that becomes a bug
The FILTER (WHERE ...) clause for conditional aggregates in PostgreSQL
Combining multiple aggregates in a single SELECT for dashboard-style queries
STRING_AGG and ARRAY_AGG for concatenating values across rows
The integer-division trap in AVG and how to force a decimal result
Empty-set behavior: why SUM returns NULL but COUNT returns 0

The Calculator-on-a-Spreadsheet-Column Analogy

Picture a spreadsheet with one column of numbers. You drag your mouse down the column to select it, and the status bar at the bottom of the screen instantly shows you five things: the smallest value, the largest value, the average, the total, and the count. You did not loop through the rows, you did not write a formula, you just selected the column and the spreadsheet summarized it for you.

That is exactly what a SQL aggregate function does. SELECT MIN(price), MAX(price), AVG(price), SUM(price), COUNT(*) FROM products is the database equivalent of selecting a column in Excel and reading the status bar. The aggregate function is the calculator, the FROM clause is the column, and the WHERE clause is the optional filter you apply before selecting. One query, five numbers, zero loops.

But the calculator has quirks. If some cells in the column are empty, Excel's AVERAGE skips them — it does not treat empty as zero. SQL does the same thing with NULL, and this is where people get burned. An average over a column where 40% of the rows are NULL is the average of the other 60%, not the average of all rows with nulls treated as zero. Same for SUM, MIN, and MAX. The one exception is COUNT(*), which counts rows regardless of whether any column is null, because * does not refer to a column at all — it just counts the rows themselves.

+---------------------------------------------------------------+
|           AGGREGATES ARE CALCULATORS ON A COLUMN              |
+---------------------------------------------------------------+
|                                                                |
|    price                                                       |
|    -----                                                       |
|    10.00   -+                                                  |
|    25.50    |                                                  |
|    NULL     |  -- aggregates read this column top to bottom    |
|    15.00    |     and skip NULLs (except COUNT(*))             |
|    40.00    |                                                  |
|    NULL     |                                                  |
|    30.00   -+                                                  |
|                                                                |
|    MIN  = 10.00                                                |
|    MAX  = 40.00                                                |
|    SUM  = 120.50   (NULLs skipped)                             |
|    AVG  = 24.10    (120.50 / 5, NOT / 7)                       |
|    COUNT(*)     = 7  (all rows)                                |
|    COUNT(price) = 5  (non-NULL prices only)                    |
|                                                                |
+---------------------------------------------------------------+

The Five Core Aggregates and Their NULL Behavior

Every SQL engine implements the same five core aggregates with the same semantics defined in the SQL standard. Learn them once and they work the same in PostgreSQL, MySQL, SQL Server, Oracle, and SQLite.

MIN and MAX — Smallest and Largest Non-NULL Values

MIN(col) returns the smallest value in col across all rows, ignoring NULL. MAX(col) returns the largest. Both work on numbers, strings (lexicographic order), dates, and timestamps. They return NULL only if every row in the group is NULL or the set is empty.

-- Setup for all examples in this lesson
CREATE TABLE orders (
  id          SERIAL PRIMARY KEY,
  customer_id INT,
  amount      NUMERIC(10, 2),
  status      TEXT,
  created_at  TIMESTAMP
);

INSERT INTO orders (customer_id, amount, status, created_at) VALUES
  (1, 100.00, 'paid',     '2026-01-05'),
  (1,  50.00, 'paid',     '2026-01-10'),
  (2, 200.00, 'refunded', '2026-01-12'),
  (2,   NULL, 'pending',  '2026-01-15'),  -- NULL amount
  (3,  75.50, 'paid',     '2026-02-01'),
  (3, 300.00, 'paid',     '2026-02-03'),
  (4,   NULL, 'cancelled','2026-02-10'),  -- NULL amount
  (5,  42.00, 'paid',     '2026-02-14');

SELECT
  MIN(amount) AS smallest,
  MAX(amount) AS largest
FROM orders;

-- Result:
-- smallest | largest
-- ---------+---------
--    42.00 |  300.00
--
-- The two NULL rows were skipped. If every row had been NULL,
-- both MIN and MAX would return NULL, not an error.

SUM — Total of Non-NULL Values, or NULL on Empty Sets

SUM(col) adds all non-NULL values in col. The gotcha: SUM of an empty set returns NULL, not 0. This is surprising because "I summed nothing, the answer should be zero" is the mathematical intuition — but SQL says "there was nothing to sum, so there is no answer." Guard against this with COALESCE(SUM(col), 0).

-- SUM skips NULL amounts
SELECT SUM(amount) AS total_revenue FROM orders;
-- total_revenue
-- -------------
--        767.50
--
-- (100 + 50 + 200 + 75.50 + 300 + 42 = 767.50; two NULLs skipped)

-- SUM over an EMPTY set returns NULL, not 0 — classic dashboard bug
SELECT SUM(amount) AS total FROM orders WHERE customer_id = 999;
-- total
-- ------
-- NULL     <-- NOT 0! Always wrap in COALESCE for user-facing output.

SELECT COALESCE(SUM(amount), 0) AS total FROM orders WHERE customer_id = 999;
-- total
-- ------
--  0.00

AVG — Arithmetic Mean of Non-NULL Values

AVG(col) returns SUM(col) / COUNT(col) — that is, the total of non-NULL values divided by the number of non-NULL values. It does not divide by the number of rows. This is the source of the biggest NULL-handling confusion in SQL.

-- AVG ignores NULLs in both the numerator AND the denominator
SELECT
  SUM(amount)   AS total,     -- 767.50
  COUNT(*)      AS rows_all,  -- 8
  COUNT(amount) AS rows_nn,   -- 6 (non-null)
  AVG(amount)   AS avg_amt    -- 767.50 / 6 = 127.9166...
FROM orders;

-- total  | rows_all | rows_nn | avg_amt
-- -------+----------+---------+-----------
-- 767.50 |        8 |       6 | 127.9166...
--
-- NOT 767.50 / 8 = 95.93. The two NULL rows are invisible to AVG.
-- If you want NULLs to count as zero, use:
--   SUM(COALESCE(amount, 0)) / COUNT(*)
-- or:
--   AVG(COALESCE(amount, 0))

The Integer Division Trap

In PostgreSQL, AVG on an INTEGER column returns a NUMERIC (decimal), so you do not get integer truncation. But SUM(int_col) / COUNT(*) absolutely will truncate, because / on two integers is integer division in most databases. Always prefer AVG() over manual SUM/COUNT when you need an arithmetic mean.

CREATE TABLE scores (points INT);
INSERT INTO scores VALUES (10), (11), (12);

-- AVG returns NUMERIC: safe
SELECT AVG(points) FROM scores;
-- 11.0000000000000000

-- Manual SUM/COUNT with integer cast: WRONG — integer division
SELECT SUM(points) / COUNT(*) FROM scores;
-- 11                          <-- happens to work here
INSERT INTO scores VALUES (1), (2);
SELECT SUM(points) / COUNT(*) FROM scores;
-- 7                           <-- (10+11+12+1+2)/5 = 36/5 = 7 (truncated from 7.2)

-- Fix: cast one operand to a decimal type
SELECT SUM(points)::numeric / COUNT(*) FROM scores;
-- 7.2000000000000000

COUNT — The Most Misused Aggregate

COUNT has three forms, and they do not mean the same thing. Getting this wrong is the single most common aggregate bug.

+---------------------------------------------------------------+
|           COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)       |
+---------------------------------------------------------------+
|                                                                |
|  COUNT(*)           -> number of ROWS in the group,            |
|                        including rows where every column is   |
|                        NULL. Never returns NULL.               |
|                                                                |
|  COUNT(column)      -> number of rows where `column` IS NOT    |
|                        NULL. Effectively COUNT(*) minus the    |
|                        NULL rows of that column.               |
|                                                                |
|  COUNT(DISTINCT c)  -> number of DISTINCT non-NULL values of   |
|                        `c`. Duplicates collapse, NULLs drop.   |
|                                                                |
+---------------------------------------------------------------+

SELECT
  COUNT(*)                   AS rows_total,       -- counts rows
  COUNT(amount)              AS rows_with_amount, -- skips NULL amount
  COUNT(DISTINCT customer_id) AS unique_customers, -- unique non-NULL
  COUNT(DISTINCT status)     AS unique_statuses
FROM orders;

-- Result for our 8-row orders table:
-- rows_total | rows_with_amount | unique_customers | unique_statuses
-- -----------+------------------+------------------+-----------------
--          8 |                6 |                5 |               4
--
-- COUNT(*)            = 8  (all rows, including 2 with NULL amount)
-- COUNT(amount)       = 6  (8 rows - 2 NULL amounts = 6)
-- COUNT(DISTINCT c_id)= 5  (customers 1..5)
-- COUNT(DISTINCT stat)= 4  (paid, refunded, pending, cancelled)

Rule of thumb: if you want "how many rows matched," always write COUNT(*). Only use COUNT(column) if you specifically mean "how many of those rows had a non-null value for this column." COUNT(1) and COUNT(*) are identical in every modern database — pick COUNT(*) for clarity.

Combining Multiple Aggregates in a Single SELECT

You do not need one query per metric. A single SELECT can return as many aggregates as you want, and the database computes them all in one pass over the data. This is how dashboard summary cards are built.

-- Dashboard summary: five metrics, one scan of the orders table
SELECT
  COUNT(*)                         AS total_orders,
  COUNT(DISTINCT customer_id)      AS unique_customers,
  COALESCE(SUM(amount), 0)         AS total_revenue,
  COALESCE(AVG(amount), 0)         AS avg_order_value,
  MIN(created_at)                  AS first_order_at,
  MAX(created_at)                  AS last_order_at
FROM orders
WHERE status = 'paid';

-- total_orders | unique_customers | total_revenue | avg_order_value | first_order_at      | last_order_at
-- -------------+------------------+---------------+-----------------+---------------------+---------------------
--            5 |                4 |        567.50 |          113.50 | 2026-01-05 00:00:00 | 2026-02-14 00:00:00

The WHERE clause is applied before the aggregates run, so these numbers cover only paid orders. That is the classic pattern: filter rows, then aggregate the survivors.

Conditional Aggregates with FILTER (WHERE ...)

Sometimes you want several aggregates, each restricted to a different subset of rows — "total revenue from paid orders, total refund amount, count of pending orders" — all in one query. Before SQL:2003, you did this with SUM(CASE WHEN ... THEN amount END), which works everywhere but is verbose. PostgreSQL (and SQLite 3.30+) support the cleaner FILTER (WHERE ...) clause.

-- PostgreSQL FILTER clause — the modern way to write conditional aggregates
SELECT
  COUNT(*)                                         AS total_orders,
  COUNT(*) FILTER (WHERE status = 'paid')          AS paid_orders,
  COUNT(*) FILTER (WHERE status = 'refunded')      AS refunded_orders,
  SUM(amount) FILTER (WHERE status = 'paid')       AS paid_revenue,
  SUM(amount) FILTER (WHERE status = 'refunded')   AS refund_total,
  AVG(amount) FILTER (WHERE status = 'paid')       AS avg_paid_order
FROM orders;

-- total_orders | paid_orders | refunded_orders | paid_revenue | refund_total | avg_paid_order
-- -------------+-------------+-----------------+--------------+--------------+----------------
--            8 |           5 |               1 |       567.50 |       200.00 |         113.50

-- Portable equivalent using CASE (works in MySQL, SQL Server, Oracle)
SELECT
  COUNT(*) AS total_orders,
  SUM(CASE WHEN status = 'paid'     THEN 1 ELSE 0 END) AS paid_orders,
  SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END) AS refunded_orders,
  SUM(CASE WHEN status = 'paid'     THEN amount END)   AS paid_revenue,
  SUM(CASE WHEN status = 'refunded' THEN amount END)   AS refund_total,
  AVG(CASE WHEN status = 'paid'     THEN amount END)   AS avg_paid_order
FROM orders;

Both produce identical results. The FILTER form reads as "compute this aggregate, but only over rows matching this predicate," which is exactly how you would describe it in English. The CASE form uses the fact that SUM and AVG ignore NULL, so CASE ... WHEN matched THEN value ELSE NULL END effectively filters non-matching rows out of the aggregate.

STRING_AGG and ARRAY_AGG — Concatenating Across Rows

The five core aggregates reduce a column to a single scalar. Sometimes you want to keep all the values but pack them into one string or array. That is what STRING_AGG and ARRAY_AGG do.

-- Collapse all paid order amounts per customer into a comma-separated string
SELECT
  customer_id,
  STRING_AGG(amount::text, ', ' ORDER BY created_at) AS amounts_csv,
  ARRAY_AGG(amount ORDER BY created_at)              AS amounts_arr,
  COUNT(*)                                           AS orders
FROM orders
WHERE status = 'paid'
GROUP BY customer_id
ORDER BY customer_id;

-- customer_id | amounts_csv        | amounts_arr         | orders
-- ------------+--------------------+---------------------+--------
--           1 | 100.00, 50.00      | {100.00,50.00}      |      2
--           3 | 75.50, 300.00      | {75.50,300.00}      |      2
--           5 | 42.00              | {42.00}             |      1

STRING_AGG(expr, separator) joins the values with a separator (PostgreSQL, SQL Server 2017+, SQLite 3.44+). MySQL spells it GROUP_CONCAT(expr SEPARATOR ', '). Oracle uses LISTAGG. The optional ORDER BY inside the aggregate controls the order of concatenation, which matters when the result is user-visible.

ARRAY_AGG(expr) builds a native array (PostgreSQL) and is wonderful for returning one row per group with the full detail packed into an array column — no second query, no client-side grouping. Like the scalar aggregates, both skip NULL inputs by default.

Empty-Set Behavior — The Quirk That Burns Dashboards

When no rows match the WHERE clause, each aggregate behaves differently:

+---------------------------------------------------------------+
|           AGGREGATE RESULTS ON AN EMPTY SET                   |
+---------------------------------------------------------------+
|                                                                |
|   Aggregate          |  Result on empty set                    |
|   -------------------+-----------------------------------------|
|   COUNT(*)           |  0                                      |
|   COUNT(column)      |  0                                      |
|   COUNT(DISTINCT c)  |  0                                      |
|   SUM(column)        |  NULL    <-- NOT 0!                     |
|   AVG(column)        |  NULL                                   |
|   MIN(column)        |  NULL                                   |
|   MAX(column)        |  NULL                                   |
|   STRING_AGG(...)    |  NULL                                   |
|   ARRAY_AGG(...)     |  NULL    (PostgreSQL)                   |
|                                                                |
+---------------------------------------------------------------+

-- Prove it
SELECT
  COUNT(*)    AS c_star,   -- 0
  SUM(amount) AS s,        -- NULL
  AVG(amount) AS a,        -- NULL
  MIN(amount) AS mn        -- NULL
FROM orders
WHERE customer_id = 999;   -- matches nothing
-- c_star | s    | a    | mn
-- -------+------+------+------
--      0 | NULL | NULL | NULL

The practical consequence: always wrap SUM in COALESCE when the output is a numeric dashboard field. Otherwise a zero-row group produces NULL, which rendering layers convert to 0 in some languages and to the literal string "null" in others. Wrap it at the query layer and the bug cannot reach the UI.

Napkin AI Visual Prompt: "Dark gradient (#0a0f1f -> #111a2e). A horizontal band of five boxes, each labeled with a monospace aggregate name in sky blue (#4fc3f7): MIN, MAX, AVG, SUM, COUNT. Below each box, two pill badges: an empty-set result badge (rose #ff5c8a for NULL, sky blue for 0) and a NULL-handling badge reading 'skips NULL' or 'counts all' in white monospace. Between the five boxes, a flowing column of mock spreadsheet cells with two rose-highlighted NULL cells. A dashed sky blue arrow from the column points into each aggregate box. Title: 'Five Aggregates, Five Different Empty-Set Rules'. White monospace labels."

Common Mistakes

1. Writing AVG(int_col) and expecting integer behavior. Most databases return a decimal type from AVG even on integer columns, but SUM(int_col) / COUNT(*) will truncate because of integer division. If you ever compute an average by hand, cast one operand to NUMERIC or DOUBLE PRECISION first. Better: just use AVG() and let the engine handle the type promotion.

2. Using COUNT(column) when you meant COUNT(*). COUNT(column) silently skips rows where column IS NULL. If your orders table has a shipped_at column that is NULL for unshipped orders and you write COUNT(shipped_at), you get the number of shipped orders, not the total number of orders. COUNT(*) counts every row in the group unconditionally. When you want "how many rows," always write COUNT(*).

3. Assuming SUM of an empty set is 0. It is NULL. When you filter down to a group that has no rows, SUM returns NULL, AVG returns NULL, MIN and MAX return NULL, and only COUNT returns 0. Dashboards rendering NULL as "null" or mixing it into arithmetic (NULL + 10 = NULL) cause hours of debugging. Wrap numeric aggregates in COALESCE(agg, 0) whenever the output is user-facing.

4. Mixing aggregates with non-grouped columns in the SELECT list. SELECT customer_id, COUNT(*) FROM orders; is an error in standard SQL because customer_id is not aggregated and not in a GROUP BY. MySQL (with ONLY_FULL_GROUP_BY off) accepts it and returns an arbitrary customer_id from some random row, which is almost never what you want. Either add GROUP BY customer_id or wrap it in an aggregate like MIN(customer_id). Lesson 4.4 covers GROUP BY in depth.

5. Forgetting that DISTINCT and FILTER compose. COUNT(DISTINCT customer_id) FILTER (WHERE status = 'paid') counts unique customers who have at least one paid order — a totally different number from COUNT(DISTINCT customer_id) over the whole table. Knowing that these compose lets you express "unique paying customers this month" in a single expression.

6. Using COUNT(DISTINCT col) on a huge column when you only need an approximation. Exact distinct counts can be expensive at scale because the database must track every value seen. For analytics at billions of rows, look at APPROX_COUNT_DISTINCT (BigQuery, Snowflake, Redshift) or hll_count extensions in PostgreSQL. They trade a small error (around 1%) for dramatically lower memory usage.

Interview Questions

1. "What is the difference between COUNT(*), COUNT(1), and COUNT(column)?"

COUNT(*) and COUNT(1) are identical in every modern database — they both count rows in the group and never skip anything, including rows where every column is NULL. The 1 is not "count the column named 1" (there is no such column); it is a constant expression that is the same for every row, so there is nothing to skip. Query optimizers recognize both forms and generate the same execution plan. COUNT(column), on the other hand, counts only rows where column IS NOT NULL, so it can be strictly less than COUNT(*) whenever that column is nullable. The practical guidance is: write COUNT(*) when you mean "number of rows," write COUNT(column) only when you specifically need "number of rows with a non-null value for this column," and never write COUNT(1) because it reads like a cargo-culted micro-optimization that was never actually an optimization.

2. "Walk me through how SUM, AVG, MIN, MAX, and COUNT handle NULL, and why it matters."

All five aggregates follow the same rule with one exception: they skip NULL inputs. SUM adds only non-null values, AVG divides the sum of non-null values by the count of non-null values (not the total row count), and MIN/MAX pick the smallest/largest non-null value. The exception is COUNT(*), which counts rows and does not look at any column, so nulls are invisible to the decision. COUNT(column) follows the general rule and skips null values of that column. This matters because it is easy to write AVG(amount) on a table where half the amounts are null and silently get the average of the other half, which may not be what the business wants. If you need nulls to count as zero, write AVG(COALESCE(amount, 0)) explicitly. On empty sets, COUNT returns 0 and every other aggregate returns NULL — that asymmetry is another classic source of bugs, which is why user-facing numeric aggregates should always be wrapped in COALESCE(SUM(x), 0) or similar.

3. "What does the FILTER (WHERE ...) clause do, and how would you write the same thing in a database that does not support it?"

FILTER (WHERE predicate) is a SQL:2003 standard clause that restricts a single aggregate to rows matching the predicate, without affecting other aggregates in the same SELECT. For example, SUM(amount) FILTER (WHERE status = 'paid') sums only paid orders, while a sibling SUM(amount) FILTER (WHERE status = 'refunded') sums only refunds — both in the same query, one scan of the table. PostgreSQL, SQLite 3.30+, and a few others implement it natively. In databases that do not (MySQL, older SQL Server), you write the equivalent with CASE: SUM(CASE WHEN status = 'paid' THEN amount END). This works because the ELSE branch is implicitly NULL, and SUM skips nulls, so only rows matching the CASE contribute. The FILTER form is more readable and composes naturally with DISTINCT — COUNT(DISTINCT customer_id) FILTER (WHERE status = 'paid') reads as "unique customers who have at least one paid order" without any nesting.

4. "Why does SELECT SUM(amount) FROM orders WHERE customer_id = 999 return NULL instead of 0 when there are no matching rows, and how do you fix it?"

SUM is defined to sum the non-null values in the input set. When the input set is empty, there is literally nothing to sum, and SQL returns NULL to indicate "undefined / no answer" rather than picking an arbitrary default. This is consistent with AVG, MIN, and MAX, all of which also return NULL on empty sets — the only aggregate that returns a non-null value on an empty set is COUNT, which returns 0. The mathematical argument is that SUM over zero elements has no well-defined value (the "empty sum" convention of zero is arbitrary and not universal), while COUNT over zero elements unambiguously has cardinality zero. The practical fix is COALESCE(SUM(amount), 0), which substitutes 0 for NULL at the query layer. Doing this at the database instead of the application layer is important because many ORMs and front-end frameworks render NULL inconsistently — some as 0, some as the string "null", some as a blank cell — and you do not want your dashboard to depend on which one.

5. "How does AVG handle integer columns, and is there a situation where it would truncate?"

AVG on an integer column returns a decimal type in every major database: PostgreSQL returns NUMERIC, SQL Server returns INT only if you cast it (it defaults to the larger type), MySQL returns DECIMAL, Oracle returns NUMBER. So AVG(int_col) does not truncate. However, if you manually compute the average with SUM(int_col) / COUNT(*), both operands are integers and the / operator performs integer division, truncating the fractional part. SUM(points) / COUNT(*) on the values (10, 11, 12, 1, 2) returns 7, not 7.2. The fix is to cast one operand: SUM(points)::numeric / COUNT(*) in PostgreSQL, or SUM(points) * 1.0 / COUNT(*) as a portable trick. The broader lesson: always prefer the built-in aggregate (AVG) over a manual equivalent, because the built-in handles type promotion correctly.

6. "You need a single query that returns the total number of orders, the number of paid orders, the total revenue from paid orders, and the number of unique customers who paid. How would you write it?"

SELECT
  COUNT(*)                                                   AS total_orders,
  COUNT(*) FILTER (WHERE status = 'paid')                    AS paid_orders,
  COALESCE(SUM(amount) FILTER (WHERE status = 'paid'), 0)    AS paid_revenue,
  COUNT(DISTINCT customer_id) FILTER (WHERE status = 'paid') AS unique_payers
FROM orders;

One scan of the table, four aggregates, each with its own filter predicate. The COALESCE guards against empty results, and the FILTER clauses keep every metric scoped to its own subset without scaffolding a subquery or a self-join. In a database without FILTER, the SUM/COUNT aggregates become SUM(CASE WHEN status = 'paid' THEN amount END) and COUNT(CASE WHEN status = 'paid' THEN 1 END), but the COUNT(DISTINCT ...) FILTER (...) form has no clean rewrite — you fall back to COUNT(DISTINCT CASE WHEN status = 'paid' THEN customer_id END), which works because DISTINCT also ignores NULL.

Quick Reference — Aggregate Functions Cheat Sheet

+---------------------------------------------------------------+
|           AGGREGATE FUNCTIONS CHEAT SHEET                     |
+---------------------------------------------------------------+
|                                                                |
|  CORE FIVE:                                                    |
|   MIN(col)   -> smallest non-NULL value                        |
|   MAX(col)   -> largest  non-NULL value                        |
|   SUM(col)   -> sum of non-NULL values (NULL if empty)         |
|   AVG(col)   -> SUM(col) / COUNT(col), decimal type            |
|   COUNT(*)   -> row count, never NULL, never skips             |
|                                                                |
|  COUNT VARIANTS:                                               |
|   COUNT(*)             -> rows in group                        |
|   COUNT(column)        -> rows where column IS NOT NULL        |
|   COUNT(DISTINCT c)    -> unique non-NULL values of c          |
|   COUNT(1) == COUNT(*) -> same plan, same result               |
|                                                                |
|  EMPTY SET:                                                    |
|   COUNT -> 0                                                   |
|   SUM / AVG / MIN / MAX / STRING_AGG / ARRAY_AGG -> NULL       |
|   Fix with COALESCE(agg, 0) on numeric output                  |
|                                                                |
|  CONDITIONAL:                                                  |
|   agg(x) FILTER (WHERE pred)   -- PostgreSQL / SQLite          |
|   agg(CASE WHEN pred THEN x END) -- portable                   |
|                                                                |
|  CONCATENATION:                                                |
|   STRING_AGG(expr, sep ORDER BY ...) -- PG / SQL Server        |
|   GROUP_CONCAT(expr SEPARATOR sep)   -- MySQL                  |
|   LISTAGG(expr, sep)                 -- Oracle                 |
|   ARRAY_AGG(expr ORDER BY ...)       -- PostgreSQL             |
|                                                                |
+---------------------------------------------------------------+

Question	Use This	Not This
How many rows?	`COUNT(*)`	`COUNT(column)`
How many non-null values?	`COUNT(column)`	`COUNT(*)`
How many unique customers?	`COUNT(DISTINCT customer_id)`	`COUNT(customer_id)`
Total revenue (safe)?	`COALESCE(SUM(amount), 0)`	`SUM(amount)`
Average of integer column?	`AVG(col)`	`SUM(col) / COUNT(*)`
Conditional total (PG)?	`SUM(x) FILTER (WHERE pred)`	subquery
Conditional total (portable)?	`SUM(CASE WHEN pred THEN x END)`	two queries
All values as CSV?	`STRING_AGG(expr, ', ')`	client-side join
All values as array?	`ARRAY_AGG(expr)`	client-side collect
Fastest approximate distinct?	`APPROX_COUNT_DISTINCT(col)`	`COUNT(DISTINCT col)`

Prev: Lesson 4.2 -- Pattern Matching LIKE Next: Lesson 4.4 -- GROUP BY

This is Lesson 4.3 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

Aggregate Functions