Database Interview Prep
Subqueries and Advanced Queries

Window Functions

OVER, PARTITION BY, Ranking, LAG/LEAD, and Frame Clauses

LinkedIn Hook

"You wrote a self-join with a correlated subquery to get a running total. It took 14 seconds. A window function does it in 80 milliseconds."

Most developers learn SQL in two phases. Phase one: SELECT, WHERE, JOIN, GROUP BY. Phase two: window functions — and suddenly half the queries they used to write with subqueries, self-joins, and procedural loops collapse into a single readable line. Top-N per group? Window function. Running totals? Window function. Compare each row to the previous row? Window function. Rank within a category? Window function. Moving averages, percent of total, gap detection, deduplication — all window functions.

The mental block is that aggregate functions (SUM, COUNT, AVG) collapse rows. GROUP BY shrinks ten orders into one row per customer. Window functions do something fundamentally different: they compute a value across a set of rows but keep every original row visible. The SUM you compute follows the row around — it does not replace it. Once that clicks, an entire category of "I need to write a stored procedure" problems vanishes.

The two pieces that confuse newcomers are PARTITION BY (which divides the rows into independent buckets, like a per-group GROUP BY that does not collapse) and the frame clause (ROWS BETWEEN ... AND ..., which controls exactly which rows feed into the calculation for each output row). Master those two, and you can express running totals, moving averages, percent of partition, lead/lag comparisons, and ranking with surgical precision.

In Lesson 6.5, I break down window functions end to end: OVER, PARTITION BY, ORDER BY, ROW_NUMBER vs RANK vs DENSE_RANK, LAG/LEAD for row-to-row comparisons, running totals with SUM() OVER, and frame clauses (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) so you know exactly what window each calculation sees.

Read the full lesson -> [link]

#SQL #Database #PostgreSQL #MySQL #WindowFunctions #BackendDevelopment #InterviewPrep #QueryPerformance


Window Functions thumbnail


What You'll Learn

  • The fundamental difference between aggregate functions (collapse rows) and window functions (keep rows)
  • How OVER() defines a window of rows for each output row
  • How PARTITION BY divides rows into independent buckets, like per-group computation without collapsing
  • How ORDER BY inside OVER() defines row ordering for cumulative and ranking calculations
  • The three ranking functions: ROW_NUMBER, RANK, and DENSE_RANK, and exactly how they differ on ties
  • LAG and LEAD for comparing each row to its previous or next neighbor
  • Running totals and moving averages with SUM() OVER and explicit frame clauses
  • The frame clause: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, sliding windows, and the default frame trap
  • The classic Top-N-per-group pattern using ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)
  • PostgreSQL syntax with notes on MySQL 8+ support and gotchas

The Stadium Scoreboard Analogy — Aggregates That Follow the Player

Imagine a stadium scoreboard during a basketball game. Every player has their own running point total displayed next to their name. When a player scores, only their own total updates — the other players' totals stay where they are. At any moment, you can see all ten players on the floor, each with their personal cumulative score, and the team's total score at the bottom. The per-player totals are not separate rows that replaced the players; they are values attached to each player that follow them around as the game progresses.

That is exactly what a window function does. A traditional aggregate like SELECT player, SUM(points) FROM scores GROUP BY player collapses every shot for a player into a single row — you lose the individual shots. A window function like SELECT player, points, SUM(points) OVER (PARTITION BY player ORDER BY shot_time) keeps every original shot AND attaches a running total to each one. You see shot #1 with running total = points of shot 1, shot #2 with running total = points of shot 1 + 2, and so on. Every original row survives. The aggregate just walks alongside it.

Now extend the analogy. The "PARTITION BY player" is what tells the scoreboard to keep each player's running total separate. Without it, every player would share one global running total — useless. With it, each player gets their own independent calculation. The "ORDER BY shot_time" tells the scoreboard the chronological order in which to accumulate. The "frame" is the rule for which shots count toward the total at any given moment — by default, "all shots from the start of the game up to and including this one."

+------------------------------------------------------------------+
|           AGGREGATE vs WINDOW FUNCTION                           |
+------------------------------------------------------------------+
|                                                                  |
|   GROUP BY (collapses):                                          |
|   +----------+--------+         +----------+--------+            |
|   | player   | points |         | player   | total  |            |
|   +----------+--------+         +----------+--------+            |
|   | LeBron   |   3    |   ->    | LeBron   |  7     |            |
|   | LeBron   |   2    |         | Curry    |  5     |            |
|   | LeBron   |   2    |         +----------+--------+            |
|   | Curry    |   3    |         (4 rows became 2; shots gone)    |
|   | Curry    |   2    |                                          |
|   +----------+--------+                                          |
|                                                                  |
|   SUM() OVER (PARTITION BY player) (keeps rows):                 |
|   +----------+--------+--------+                                 |
|   | player   | points | total  |                                 |
|   +----------+--------+--------+                                 |
|   | LeBron   |   3    |   7    |  <- 7 = 3+2+2 attached to row   |
|   | LeBron   |   2    |   7    |                                 |
|   | LeBron   |   2    |   7    |                                 |
|   | Curry    |   3    |   5    |                                 |
|   | Curry    |   2    |   5    |                                 |
|   +----------+--------+--------+                                 |
|   (5 rows stay 5; total walks alongside each row)                |
|                                                                  |
+------------------------------------------------------------------+

The OVER Clause — Anatomy of a Window

Every window function call has three parts inside its OVER() clause: an optional PARTITION BY, an optional ORDER BY, and an optional frame clause. Together they define the window of rows that the function looks at when computing the value for the current row.

function_name(args) OVER (
    PARTITION BY  partition_columns   -- divides rows into independent buckets
    ORDER BY      order_columns       -- defines row order within each bucket
    frame_clause                      -- defines which rows in the bucket to use
)
  • Omit everything: SUM(x) OVER () — one giant window over the whole result set.
  • Add PARTITION BY user_id: separate window per user; each user is independent.
  • Add ORDER BY created_at: rows within each partition are ordered, enabling cumulative behavior.
  • Add a frame: control exactly which rows from the ordered partition contribute (e.g., last 3 rows, all preceding rows, the next row only).

Window functions are evaluated after WHERE, GROUP BY, and HAVING, but before ORDER BY and LIMIT of the outer query. That means you cannot filter on a window function in WHERE — you must wrap it in a subquery or CTE first. This is one of the most common gotchas.


Example 1 — Per-User Running Total of Orders

Let's set up a sample table and walk through the building blocks.

-- Sample table: customer orders over time
CREATE TABLE orders (
  id          SERIAL PRIMARY KEY,
  customer    TEXT,
  amount      NUMERIC(10, 2),
  created_at  TIMESTAMPTZ
);

INSERT INTO orders (customer, amount, created_at) VALUES
  ('Alice',  120.00, '2026-04-01 09:00'),
  ('Alice',   55.00, '2026-04-03 14:00'),
  ('Alice',   80.00, '2026-04-07 11:30'),
  ('Bob',    300.00, '2026-04-02 10:15'),
  ('Bob',     45.00, '2026-04-05 16:45'),
  ('Bob',    120.00, '2026-04-09 08:20'),
  ('Carol',  500.00, '2026-04-04 12:00'),
  ('Carol',  250.00, '2026-04-08 15:30');

-- Running total per customer, ordered chronologically.
-- For each row, SUM walks through all earlier rows of the SAME customer
-- (same partition) and adds the amounts up to and including the current row.
SELECT
  customer,
  created_at,
  amount,
  SUM(amount) OVER (
    PARTITION BY customer
    ORDER BY created_at
  ) AS running_total
FROM orders
ORDER BY customer, created_at;

-- customer | created_at        | amount | running_total
-- ---------+-------------------+--------+---------------
-- Alice    | 2026-04-01 09:00  | 120.00 | 120.00
-- Alice    | 2026-04-03 14:00  |  55.00 | 175.00
-- Alice    | 2026-04-07 11:30  |  80.00 | 255.00
-- Bob      | 2026-04-02 10:15  | 300.00 | 300.00
-- Bob      | 2026-04-05 16:45  |  45.00 | 345.00
-- Bob      | 2026-04-09 08:20  | 120.00 | 465.00
-- Carol    | 2026-04-04 12:00  | 500.00 | 500.00
-- Carol    | 2026-04-08 15:30  | 250.00 | 750.00

Notice three things. First, every original row is present — no collapsing. Second, Alice's running total resets when we cross to Bob, because PARTITION BY customer makes each customer independent. Third, the running total is cumulative because of ORDER BY created_at inside OVER — without it, every row in a partition would see the same total (the partition sum).


PARTITION BY — Per-Group Calculations Without Collapsing

PARTITION BY is the window function equivalent of GROUP BY, with one critical difference: it does not reduce row count. Each partition is computed independently, and the result is broadcast back to every row in that partition.

-- Compare each order to the customer's average order size
-- and compute that customer's percent-of-total contribution.
SELECT
  customer,
  amount,
  AVG(amount)  OVER (PARTITION BY customer)             AS customer_avg,
  amount - AVG(amount) OVER (PARTITION BY customer)     AS diff_from_avg,
  ROUND(
    100.0 * amount /
    SUM(amount) OVER (PARTITION BY customer),
    1
  )                                                     AS pct_of_customer_total
FROM orders
ORDER BY customer, created_at;

-- customer | amount | customer_avg | diff_from_avg | pct_of_customer_total
-- ---------+--------+--------------+---------------+----------------------
-- Alice    | 120.00 |    85.00     |    35.00      |   47.1
-- Alice    |  55.00 |    85.00     |   -30.00      |   21.6
-- Alice    |  80.00 |    85.00     |    -5.00      |   31.4
-- Bob      | 300.00 |   155.00     |   145.00      |   64.5
-- Bob      |  45.00 |   155.00     |  -110.00      |    9.7
-- Bob      | 120.00 |   155.00     |   -35.00      |   25.8
-- Carol    | 500.00 |   375.00     |   125.00      |   66.7
-- Carol    | 250.00 |   375.00     |  -125.00      |   33.3

Try doing that with GROUP BY and you'd need a self-join or a correlated subquery for every column. With windows, it's three lines.

+------------------------------------------------------------------+
|           PARTITION BY -- THE BUCKET METAPHOR                    |
+------------------------------------------------------------------+
|                                                                  |
|   Alice's bucket          Bob's bucket          Carol's bucket   |
|   +-----------+           +-----------+         +-----------+    |
|   | 120.00    |           | 300.00    |         | 500.00    |    |
|   |  55.00    |           |  45.00    |         | 250.00    |    |
|   |  80.00    |           | 120.00    |         +-----------+    |
|   +-----------+           +-----------+                          |
|       sum=255               sum=465               sum=750        |
|       avg=85                avg=155               avg=375        |
|                                                                  |
|   Each bucket computes independently.                            |
|   Results are broadcast to every row in the bucket.              |
|   No row count change.                                           |
|                                                                  |
+------------------------------------------------------------------+

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Three vertical buckets labeled 'Alice', 'Bob', 'Carol' in white monospace. Each bucket holds 2-3 sky blue (#4fc3f7) row cards with amounts. Below each bucket, a rose (#ff5c8a) summary card showing SUM and AVG for that bucket only. An arrow labeled 'PARTITION BY customer' points down at the buckets. A second arrow labeled 'broadcast' points back up from each summary into its rows. Caption: 'Independent calculations per partition. Row count never changes.'"


Ranking Functions — ROW_NUMBER, RANK, DENSE_RANK

These three functions assign an integer to each row within a partition based on the ORDER BY. They look identical until ties show up, and that is when the differences matter.

  • ROW_NUMBER() — Always assigns 1, 2, 3, 4, ... with no duplicates. Tied rows get arbitrary distinct numbers.
  • RANK() — Tied rows share the same rank, and the next rank skips ahead. (1, 2, 2, 4)
  • DENSE_RANK() — Tied rows share the same rank, and the next rank does not skip. (1, 2, 2, 3)
-- Sample: scores with intentional ties
CREATE TABLE exam_scores (
  student  TEXT,
  subject  TEXT,
  score    INT
);

INSERT INTO exam_scores VALUES
  ('Alice', 'Math', 95),
  ('Bob',   'Math', 88),
  ('Carol', 'Math', 88),  -- tie with Bob
  ('Dave',  'Math', 75),
  ('Eve',   'Math', 75),  -- tie with Dave
  ('Frank', 'Math', 60);

SELECT
  student,
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,
  RANK()       OVER (ORDER BY score DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk
FROM exam_scores
ORDER BY score DESC, student;

-- student | score | row_num | rnk | dense_rnk
-- --------+-------+---------+-----+-----------
-- Alice   |  95   |    1    |  1  |    1
-- Bob     |  88   |    2    |  2  |    2
-- Carol   |  88   |    3    |  2  |    2     <- tied at 2 (RANK), skip to 4 next
-- Dave    |  75   |    4    |  4  |    3     <- DENSE goes 2,3 (no skip)
-- Eve     |  75   |    5    |  4  |    3
-- Frank   |  60   |    6    |  6  |    4
+------------------------------------------------------------------+
|           ROW_NUMBER vs RANK vs DENSE_RANK ON TIES               |
+------------------------------------------------------------------+
|                                                                  |
|   Score:   95   88   88   75   75   60                           |
|                                                                  |
|   ROW_NUMBER:  1    2    3    4    5    6     -> always unique   |
|   RANK:        1    2    2    4    4    6     -> ties + gap      |
|   DENSE_RANK:  1    2    2    3    3    4     -> ties no gap     |
|                                                                  |
|   USE CASES:                                                     |
|     ROW_NUMBER -> "give me a unique sequence number per partition"|
|                   (top-N per group, deduplication)               |
|     RANK       -> "olympic medals" -- two golds, no silver        |
|     DENSE_RANK -> "tier the rows" -- distinct values get distinct |
|                   tiers, no gaps                                 |
|                                                                  |
+------------------------------------------------------------------+

Top-N Per Group — The Killer Use Case

-- Top 1 highest-scoring student per subject
INSERT INTO exam_scores VALUES
  ('Grace', 'Science', 92),
  ('Henry', 'Science', 87),
  ('Ivy',   'Science', 99);

-- Wrap the windowed query in a subquery so we can filter on row_num.
-- (Window functions cannot appear in WHERE directly.)
SELECT subject, student, score
FROM (
  SELECT
    subject,
    student,
    score,
    ROW_NUMBER() OVER (
      PARTITION BY subject
      ORDER BY score DESC
    ) AS rn
  FROM exam_scores
) ranked
WHERE rn = 1;

-- subject | student | score
-- --------+---------+------
-- Math    | Alice   |  95
-- Science | Ivy     |  99

Change WHERE rn = 1 to WHERE rn <= 3 for top-3 per subject. This pattern replaces an entire family of correlated-subquery and self-join solutions that were the only way to do top-N-per-group before window functions were standardized.


LAG and LEAD — Looking at Neighboring Rows

LAG(col, offset, default) returns the value of col from a previous row in the partition (offset = 1 by default, i.e. the previous row). LEAD(col, offset, default) does the same for a following row. These are essential for time-series comparisons: "how much did revenue change from yesterday to today?", "what is the gap between this event and the previous one?", "did the price go up or down compared to the previous day?".

-- For each order, show the previous order's amount for the same customer
-- and the difference.
SELECT
  customer,
  created_at,
  amount,
  LAG(amount, 1) OVER (
    PARTITION BY customer
    ORDER BY created_at
  ) AS prev_amount,
  amount - LAG(amount, 1) OVER (
    PARTITION BY customer
    ORDER BY created_at
  ) AS change_from_prev,
  LEAD(created_at, 1) OVER (
    PARTITION BY customer
    ORDER BY created_at
  ) AS next_order_at
FROM orders
ORDER BY customer, created_at;

-- customer | created_at        | amount | prev_amount | change_from_prev | next_order_at
-- ---------+-------------------+--------+-------------+------------------+----------------
-- Alice    | 2026-04-01 09:00  | 120.00 |    NULL     |       NULL       | 2026-04-03 14:00
-- Alice    | 2026-04-03 14:00  |  55.00 |   120.00    |     -65.00       | 2026-04-07 11:30
-- Alice    | 2026-04-07 11:30  |  80.00 |    55.00    |      25.00       |     NULL
-- Bob      | 2026-04-02 10:15  | 300.00 |    NULL     |       NULL       | 2026-04-05 16:45
-- Bob      | 2026-04-05 16:45  |  45.00 |   300.00    |    -255.00       | 2026-04-09 08:20
-- Bob      | 2026-04-09 08:20  | 120.00 |    45.00    |      75.00       |     NULL
-- Carol    | 2026-04-04 12:00  | 500.00 |    NULL     |       NULL       | 2026-04-08 15:30
-- Carol    | 2026-04-08 15:30  | 250.00 |   500.00    |    -250.00       |     NULL

The first row of each partition gets NULL from LAG (no previous row exists), and the last row gets NULL from LEAD. You can supply a third argument to give a default value instead: LAG(amount, 1, 0) returns 0 instead of NULL for the first row.


Frame Clauses — ROWS BETWEEN, and the Default You Did Not Notice

This is where window functions get powerful. The frame clause defines, for each row, exactly which other rows in the partition contribute to the calculation. There are two key things to understand:

  1. The default frame depends on whether you have ORDER BY inside OVER().
  2. The default is almost never what beginners expect.
+------------------------------------------------------------------+
|           DEFAULT FRAMES                                         |
+------------------------------------------------------------------+
|                                                                  |
|   OVER (PARTITION BY x)                                          |
|     -> default frame: entire partition                           |
|     -> good for SUM/AVG of the whole bucket                      |
|                                                                  |
|   OVER (PARTITION BY x ORDER BY y)                               |
|     -> default frame: RANGE BETWEEN UNBOUNDED PRECEDING          |
|                       AND CURRENT ROW                            |
|     -> good for running totals                                   |
|     -> WARNING: RANGE (not ROWS) means peers with the same y     |
|        all share the same window edge -- ties produce surprises  |
|                                                                  |
|   Rule of thumb: if you ORDER BY something and want true         |
|   row-by-row behavior, write the frame explicitly with ROWS:     |
|     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW             |
|                                                                  |
+------------------------------------------------------------------+

Frame Syntax

{ROWS | RANGE | GROUPS} BETWEEN <start> AND <end>

  start/end can be:
    UNBOUNDED PRECEDING       -- from the first row of the partition
    n PRECEDING               -- n rows back
    CURRENT ROW
    n FOLLOWING               -- n rows forward
    UNBOUNDED FOLLOWING       -- to the last row of the partition

ROWS counts physical rows. RANGE groups peers with equal ORDER BY values together. For most beginner-to-intermediate use cases, prefer ROWS to avoid tie surprises.

Example 2 — 3-Day Moving Average and Sliding Window

-- Daily revenue table
CREATE TABLE daily_revenue (
  day      DATE PRIMARY KEY,
  revenue  NUMERIC(10, 2)
);

INSERT INTO daily_revenue VALUES
  ('2026-04-01', 100),
  ('2026-04-02', 120),
  ('2026-04-03',  90),
  ('2026-04-04', 150),
  ('2026-04-05', 200),
  ('2026-04-06', 180),
  ('2026-04-07', 220);

-- Running total (cumulative): every day from start through current
-- 3-day moving average: current row plus the two preceding rows
SELECT
  day,
  revenue,
  SUM(revenue) OVER (
    ORDER BY day
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total,
  ROUND(
    AVG(revenue) OVER (
      ORDER BY day
      ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ),
    2
  ) AS moving_avg_3day
FROM daily_revenue
ORDER BY day;

-- day        | revenue | running_total | moving_avg_3day
-- -----------+---------+---------------+-----------------
-- 2026-04-01 |  100    |     100       |    100.00       <- only 1 row in window
-- 2026-04-02 |  120    |     220       |    110.00       <- (100+120)/2
-- 2026-04-03 |   90    |     310       |    103.33       <- (100+120+90)/3
-- 2026-04-04 |  150    |     460       |    120.00       <- (120+90+150)/3
-- 2026-04-05 |  200    |     660       |    146.67       <- (90+150+200)/3
-- 2026-04-06 |  180    |     840       |    176.67       <- (150+200+180)/3
-- 2026-04-07 |  220    |    1060       |    200.00       <- (200+180+220)/3

The ROWS BETWEEN 2 PRECEDING AND CURRENT ROW is the magic. It defines a sliding three-row window that walks across the result set. At day 1 it has only 1 row available, at day 2 only 2, and from day 3 onward it has the full 3-row window. This is exactly how you implement moving averages, rolling sums, "last 7 days" metrics, and any sliding-window calculation — without writing a single subquery.

+------------------------------------------------------------------+
|           SLIDING WINDOW: ROWS BETWEEN 2 PRECEDING AND CURRENT   |
+------------------------------------------------------------------+
|                                                                  |
|   day:  04-01  04-02  04-03  04-04  04-05  04-06  04-07          |
|   rev:   100    120    90    150    200    180    220            |
|                                                                  |
|   day 04-03 window:  [100][120][ 90]                              |
|   day 04-04 window:        [120][ 90][150]                        |
|   day 04-05 window:              [ 90][150][200]                  |
|   day 04-06 window:                    [150][200][180]            |
|   day 04-07 window:                          [200][180][220]      |
|                                                                  |
|   Each window AVG is the moving average for that day.            |
|                                                                  |
+------------------------------------------------------------------+

Example 3 — Detecting Gaps and Streaks with LAG

A classic real-world problem: given a sequence of timestamped events, find rows where a gap of more than N units exists between consecutive events for the same user. Window functions make this trivial.

CREATE TABLE login_events (
  id        SERIAL PRIMARY KEY,
  user_id   INT,
  login_at  TIMESTAMPTZ
);

INSERT INTO login_events (user_id, login_at) VALUES
  (1, '2026-04-01 09:00'),
  (1, '2026-04-01 09:05'),
  (1, '2026-04-01 14:00'),  -- 5h gap
  (1, '2026-04-02 09:00'),  -- 19h gap
  (2, '2026-04-01 10:00'),
  (2, '2026-04-01 10:30');

-- Compute the gap between each login and the previous one for the same user.
-- Then flag rows where the gap exceeds 1 hour.
SELECT
  user_id,
  login_at,
  LAG(login_at) OVER (PARTITION BY user_id ORDER BY login_at) AS prev_login,
  EXTRACT(EPOCH FROM (
    login_at - LAG(login_at) OVER (PARTITION BY user_id ORDER BY login_at)
  )) / 60 AS gap_minutes
FROM login_events
ORDER BY user_id, login_at;

-- user_id | login_at          | prev_login        | gap_minutes
-- --------+-------------------+-------------------+-------------
--    1    | 2026-04-01 09:00  |       NULL        |    NULL
--    1    | 2026-04-01 09:05  | 2026-04-01 09:00  |      5
--    1    | 2026-04-01 14:00  | 2026-04-01 09:05  |    295
--    1    | 2026-04-02 09:00  | 2026-04-01 14:00  |   1140
--    2    | 2026-04-01 10:00  |       NULL        |    NULL
--    2    | 2026-04-01 10:30  | 2026-04-01 10:00  |     30

Wrap that in a CTE and you can filter WHERE gap_minutes > 60 to find session breaks, or use the gap to assign session IDs, or compute streak lengths. The pattern of "use LAG/LEAD to compare adjacent rows, then filter or aggregate" replaces an enormous category of self-join queries.


Example 4 — Deduplication with ROW_NUMBER

A practical task: a table accidentally accumulated duplicate rows for the same email, and you want to keep only the most recent one for each email. Window functions make this a one-liner.

CREATE TABLE users (
  id          SERIAL PRIMARY KEY,
  email       TEXT,
  name        TEXT,
  updated_at  TIMESTAMPTZ
);

INSERT INTO users (email, name, updated_at) VALUES
  ('a@x.com', 'Alice v1',   '2026-04-01 10:00'),
  ('a@x.com', 'Alice v2',   '2026-04-05 12:00'),  -- newer duplicate
  ('a@x.com', 'Alice v3',   '2026-04-08 09:00'),  -- newest
  ('b@x.com', 'Bob',        '2026-04-02 15:00'),
  ('c@x.com', 'Carol v1',   '2026-04-03 11:00'),
  ('c@x.com', 'Carol v2',   '2026-04-07 14:00'); -- newest

-- Keep only the most recent row per email.
WITH ranked AS (
  SELECT
    id,
    email,
    name,
    updated_at,
    ROW_NUMBER() OVER (
      PARTITION BY email
      ORDER BY updated_at DESC
    ) AS rn
  FROM users
)
SELECT id, email, name, updated_at
FROM ranked
WHERE rn = 1
ORDER BY email;

-- id | email   | name      | updated_at
-- ---+---------+-----------+-------------------
--  3 | a@x.com | Alice v3  | 2026-04-08 09:00
--  4 | b@x.com | Bob       | 2026-04-02 15:00
--  6 | c@x.com | Carol v2  | 2026-04-07 14:00

-- To actually delete the duplicates (keep the newest):
DELETE FROM users
WHERE id IN (
  SELECT id
  FROM (
    SELECT id, ROW_NUMBER() OVER (
      PARTITION BY email ORDER BY updated_at DESC
    ) AS rn
    FROM users
  ) r
  WHERE rn > 1
);

This same pattern handles "keep oldest record", "keep highest-scoring submission per student", "keep most recent transaction per account" — all by changing the ORDER BY direction inside OVER.

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). A horizontal timeline with 7 day cards in white monospace from 04-01 to 04-07. Above the timeline, a sky blue (#4fc3f7) sliding rectangle labeled 'ROWS BETWEEN 2 PRECEDING AND CURRENT ROW' covers 3 cards at a time, drawn at multiple positions to show the slide. Each window position has a small rose (#ff5c8a) caption with the moving average. Title at top: 'Sliding Frame'. Subtitle: 'The window walks; the calculation follows.'"


Dialect Notes — PostgreSQL and MySQL 8+

All examples in this lesson are written in PostgreSQL syntax, but window functions are part of standard SQL and are supported across modern engines:

  • PostgreSQL — Full support since version 8.4 (2009). Supports ROWS, RANGE, and GROUPS frames. Supports filtered aggregates SUM(x) FILTER (WHERE cond) OVER (...). Supports named windows via the WINDOW clause.
  • MySQL 8.0+ — Full window function support since 8.0 (2018). Older MySQL (5.7 and below) has no window functions at all — you have to fall back to user variables and self-joins, which is painful.
  • SQL Server — Full support since 2012. ROWS frames since 2012, RANGE frames since 2017.
  • SQLite — Window functions added in 3.25 (2018).
  • Oracle — Has had window functions (called "analytic functions") since version 8i.

The features that vary most are: GROUPS frames (Postgres only), filtered aggregates FILTER (WHERE ...) (Postgres only), and the exact behavior of RANGE with non-numeric ORDER BY (engine-specific). Stick to ROWS frames and you'll be portable across all modern engines.


Common Mistakes

1. Trying to filter on a window function in WHERE. SELECT ... WHERE ROW_NUMBER() OVER (...) = 1 does not work in any standard SQL engine. Window functions are evaluated logically after WHERE, so the column does not exist when WHERE runs. The fix is to wrap the windowed query in a subquery or CTE and filter in the outer query. This is by far the most common window function mistake, and the error message ("window functions are not allowed in WHERE") is at least clear about what is wrong.

2. Forgetting that the default frame uses RANGE, not ROWS. When you write SUM(x) OVER (ORDER BY y) without an explicit frame, PostgreSQL uses RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With RANGE, all rows that tie on the ORDER BY value share the same window edge — meaning their running totals will all include each other, not just the rows physically before them. If your ORDER BY column has duplicates and you want true row-by-row cumulative behavior, write the frame explicitly: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.

3. Confusing PARTITION BY (window) with GROUP BY (aggregate). PARTITION BY does not collapse rows; GROUP BY does. If you write SELECT customer, SUM(amount) OVER (PARTITION BY customer) FROM orders, you get one row per order, with the customer's total repeated on every row. If you want one row per customer, use GROUP BY instead. Mixing them up produces "why am I getting duplicates" confusion.

4. Using RANK when you wanted ROW_NUMBER. Top-N-per-group with WHERE rnk <= 3 and RANK() will return more than 3 rows per group whenever there are ties at the boundary — three rows tied at rank 3 mean you get five rows total. If you specifically want at most N rows per group regardless of ties, use ROW_NUMBER(). Use RANK() only when you genuinely want "everyone tied for third place is still third place."

5. Omitting ORDER BY when computing a running total. SUM(x) OVER (PARTITION BY user_id) without ORDER BY computes the partition total and broadcasts it to every row — not a running total. If you want a running total, you must write SUM(x) OVER (PARTITION BY user_id ORDER BY ts). The presence of ORDER BY is what tells the function "accumulate progressively"; without it, you get the grand total per partition.


Interview Questions

1. "What is the fundamental difference between an aggregate function with GROUP BY and a window function with PARTITION BY?"

GROUP BY collapses rows: every group of input rows that share the grouping columns becomes a single output row, and you can only project the grouping columns and aggregate functions of the other columns. PARTITION BY in a window function does not collapse rows — every input row produces an output row, and the windowed aggregate is computed across the partition and broadcast to each row in that partition. So SELECT customer, SUM(amount) FROM orders GROUP BY customer returns one row per customer with their total, while SELECT customer, amount, SUM(amount) OVER (PARTITION BY customer) FROM orders returns one row per order, each carrying the customer's total alongside the original amount. Window functions are the right tool when you want per-group statistics but also need to keep the original row-level detail visible — for example, to compute "this order's percent of the customer's total spend" without two queries or a self-join.

2. "Explain the difference between ROW_NUMBER, RANK, and DENSE_RANK with an example involving ties."

All three assign integers to rows ordered within a partition, but they handle ties differently. Imagine scores 95, 88, 88, 75. ROW_NUMBER assigns 1, 2, 3, 4 — every row gets a unique number, and ties are broken arbitrarily by the engine. RANK assigns 1, 2, 2, 4 — tied rows share the same rank, and the next rank skips ahead by the number of tied rows (this is the "Olympic medals" model: two silvers, no bronze). DENSE_RANK assigns 1, 2, 2, 3 — tied rows share the same rank, but the next rank does not skip; it simply moves to the next distinct value. The choice depends on intent. For "keep the top row per group" use ROW_NUMBER (it guarantees at most N rows). For "show standings where ties matter" use RANK. For "tier rows into distinct levels with no gaps in tier numbers" use DENSE_RANK.

3. "How do you find the top 3 highest-paid employees per department using window functions?"

You use ROW_NUMBER (or RANK, depending on tie behavior) partitioned by department, ordered by salary descending, and then filter for rows with the resulting rank less than or equal to 3. Because window functions cannot appear directly in WHERE, you wrap the query in a subquery or CTE: WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rn FROM employees) SELECT * FROM ranked WHERE rn <= 3. ROW_NUMBER guarantees exactly three rows per department even when salaries tie at the cutoff. If you want "everyone who is in the top three salary tiers including ties at the boundary," use DENSE_RANK instead. This pattern replaced a nightmare of correlated subqueries and lateral joins that used to be the only way to express top-N-per-group in SQL.

4. "What is a frame clause, what is the default frame, and why does the default sometimes produce surprising results?"

The frame clause defines, for each output row, exactly which rows in the partition contribute to the windowed calculation — for example, "the current row and the two preceding rows" for a 3-row moving average. Syntax is ROWS BETWEEN <start> AND <end> or RANGE BETWEEN <start> AND <end>, with bounds like UNBOUNDED PRECEDING, n PRECEDING, CURRENT ROW, n FOLLOWING, UNBOUNDED FOLLOWING. The default depends on whether ORDER BY is present in OVER: with no ORDER BY, the default frame is the entire partition; with ORDER BY, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The surprise is the word RANGE: with RANGE, all rows that tie on the ORDER BY column share the same window edge, so running totals can include rows that are "logically after" the current row if they have the same ORDER BY value. The fix is to explicitly write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when you want strict row-by-row cumulative behavior. ROWS counts physical rows; RANGE groups peers — and unless you specifically want peer behavior, ROWS is what you want.

5. "How would you compute the difference between consecutive rows for a time series in SQL, and why is that hard without window functions?"

With window functions, you use LAG: SELECT day, revenue, revenue - LAG(revenue) OVER (ORDER BY day) AS daily_change FROM daily_revenue. LAG returns the previous row's value within the partition (or NULL for the first row), so subtracting it from the current value gives the row-to-row delta. Without window functions, you have to self-join the table to itself on a "previous row" condition — typically by computing a row number with a correlated subquery or by joining on t1.day = t2.day - INTERVAL '1 day', which only works if your time series has no gaps. For irregular time series (where "previous row" is not equal to "previous calendar day"), pre-window-function SQL forced you into either a stored procedure with cursors or a correlated subquery using (SELECT MAX(day) FROM t t2 WHERE t2.day < t.day) for every row, which scales quadratically. LAG and LEAD turn that O(n^2) self-join nightmare into a single O(n) sequential scan.


Quick Reference — Window Functions Cheat Sheet

+------------------------------------------------------------------+
|           WINDOW FUNCTIONS CHEAT SHEET                           |
+------------------------------------------------------------------+
|                                                                  |
|  GENERAL FORM:                                                   |
|    func(args) OVER (                                             |
|      PARTITION BY part_cols    -- independent buckets            |
|      ORDER BY     ord_cols     -- ordering within bucket         |
|      ROWS BETWEEN start AND end -- frame (optional)              |
|    )                                                             |
|                                                                  |
|  RANKING:                                                        |
|    ROW_NUMBER()  -> always unique 1,2,3,4...                     |
|    RANK()        -> ties share, gaps follow (1,2,2,4)            |
|    DENSE_RANK()  -> ties share, no gaps    (1,2,2,3)             |
|    NTILE(n)      -> divide into n buckets                        |
|                                                                  |
|  NEIGHBOR ACCESS:                                                |
|    LAG(col, n, default)   -> previous nth row                    |
|    LEAD(col, n, default)  -> next nth row                        |
|    FIRST_VALUE(col)       -> first row of frame                  |
|    LAST_VALUE(col)        -> last row of frame                   |
|    NTH_VALUE(col, n)      -> nth row of frame                    |
|                                                                  |
|  AGGREGATES OVER A WINDOW:                                       |
|    SUM, AVG, MIN, MAX, COUNT all work as window functions        |
|    SUM(x) OVER (PARTITION BY g ORDER BY t)                       |
|      -> running total per group                                  |
|                                                                  |
|  FRAME BOUNDS:                                                   |
|    UNBOUNDED PRECEDING                                           |
|    n PRECEDING                                                   |
|    CURRENT ROW                                                   |
|    n FOLLOWING                                                   |
|    UNBOUNDED FOLLOWING                                           |
|                                                                  |
|  COMMON FRAMES:                                                  |
|    Running total: ROWS BETWEEN UNBOUNDED PRECEDING               |
|                              AND CURRENT ROW                     |
|    3-row moving: ROWS BETWEEN 2 PRECEDING AND CURRENT ROW        |
|    Centered:     ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING        |
|    Whole part:   ROWS BETWEEN UNBOUNDED PRECEDING                |
|                              AND UNBOUNDED FOLLOWING             |
|                                                                  |
+------------------------------------------------------------------+

+------------------------------------------------------------------+
|           KEY RULES                                              |
+------------------------------------------------------------------+
|                                                                  |
|  1. Window functions KEEP rows; GROUP BY collapses them          |
|  2. PARTITION BY = independent buckets, broadcast to all rows    |
|  3. ORDER BY inside OVER enables cumulative behavior             |
|  4. Cannot use window functions in WHERE -- wrap in CTE/subquery |
|  5. Default frame with ORDER BY is RANGE, not ROWS -- be explicit|
|  6. Top-N per group: ROW_NUMBER() OVER (PARTITION BY g ORDER BY) |
|  7. ROW_NUMBER for unique sequence; RANK/DENSE_RANK for ties     |
|  8. LAG/LEAD replace painful self-joins for adjacent comparisons |
|  9. Window functions run AFTER WHERE/GROUP BY, BEFORE ORDER/LIMIT|
| 10. MySQL 5.7 has no window functions; MySQL 8+ supports them    |
|                                                                  |
+------------------------------------------------------------------+
ConcernWrong WayRight Way
Top-N per groupCorrelated subquery + LIMITROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) + filter
Running totalSelf-join on t2.id <= t.idSUM(x) OVER (ORDER BY t)
Compare to previous rowSelf-join on id - 1LAG(x) OVER (ORDER BY t)
Percent of totalSubquery for grand totalx / SUM(x) OVER ()
Per-group avg with detailTwo queries + JOINAVG(x) OVER (PARTITION BY g)
Filter on rankingWHERE ROW_NUMBER() ...Wrap in CTE, filter outside
Moving averageDate arithmetic + self-joinAVG(x) OVER (... ROWS BETWEEN n PRECEDING ...)
DeduplicationDELETE with NOT INROW_NUMBER() + DELETE WHERE rn > 1
Default frameTrust the default with ORDER BYAlways specify ROWS BETWEEN ...
Ties at top-N cutoffRANK() <= 3 for "exactly 3"ROW_NUMBER() <= 3

Prev: Lesson 6.4 -- Common Table Expressions Next: Lesson 7.1 -- Sequences


This is Lesson 6.5 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

On this page