Database Interview Prep
CRUD Operations

READ

SELECT Data: Projections, Aliases, DISTINCT, and Pagination

LinkedIn Hook

"Your dashboard query returns in 40ms on staging and times out at page 900 in production. Same query. Same index. Different world."

Most developers learn SELECT * FROM users on day one and never unlearn it. They ship SELECT * into production, watch their ORM hydrate 40 columns when the UI needs 3, and then add LIMIT 20 OFFSET 18000 for pagination and wonder why the database is on fire.

SELECT is the most-used statement in SQL and the one people understand least. The difference between a senior and a junior is rarely about JOIN syntax -- it is about knowing that SELECT * breaks when a column is added, that DISTINCT is a code smell 80% of the time, and that OFFSET 100000 reads and discards 100,000 rows before returning the 20 you wanted.

The fix is not complicated. List your columns explicitly. Use aliases for clarity, not cleverness. Reach for DISTINCT only when you actually have duplicates in the data, not to paper over a broken JOIN. And when pagination gets deep, switch from offset to keyset -- WHERE id > last_id ORDER BY id LIMIT N -- which stays O(log n) no matter how deep the user scrolls.

In Lesson 3.2, I break down SELECT the way senior engineers use it: explicit projections, computed columns, DISTINCT semantics, and the offset-vs-keyset pagination showdown every backend engineer gets asked in interviews.

Read the full lesson -> [link]

#SQL #Database #Postgres #BackendDevelopment #InterviewPrep #DataEngineering


READ thumbnail


What You'll Learn

  • Why SELECT * is fine in a REPL and dangerous in production code
  • Column aliases with AS and when aliases can (and cannot) be reused
  • Computed columns and expressions: price * qty, concatenation, function calls
  • DISTINCT semantics -- what it actually does under the hood
  • LIMIT and OFFSET for classic pagination and when it breaks down
  • Keyset (cursor) pagination and why senior engineers prefer it for infinite scroll
  • SELECT INTO for copying result sets into new tables
  • Subqueries in the SELECT list -- when they are useful and when they are a trap

The Library Catalogue Analogy -- Why You Never Ask for "Everything"

Imagine walking into a library and asking the librarian, "Bring me every book." You do not want every book. You want the three books on your reading list, specifically the title, author, and shelf number. The librarian who hands you the complete catalogue of 400,000 volumes has technically answered your question and completely wasted your afternoon. You will spend ten minutes flipping through the stack before you find what you actually needed.

Now imagine the librarian who asks, "Which books? What information about them?" and hands you a single index card with exactly three rows and three columns. That is a good librarian. That is also a good SELECT statement.

The database is the librarian. Every column you list is a request the database must fulfill -- reading from disk, transmitting over the network, allocating memory in your application. When you write SELECT *, you are asking for every book in the library every single time, even when you only need three titles. The database does it because you asked, and the cost shows up in network bandwidth, memory usage, cache pressure, and the day someone adds a 2 MB profile_picture BLOB column to the users table and your list endpoint suddenly returns 200 MB responses.

+---------------------------------------------------------------+
|           SELECT * (The Problem)                              |
+---------------------------------------------------------------+
|                                                                |
|  SELECT * FROM users WHERE id = 42;                           |
|                                                                |
|  Returns: id, name, email, password_hash, bio, avatar_url,    |
|           created_at, updated_at, last_login, ip_address,     |
|           preferences_json, profile_picture_blob, ...         |
|                                                                |
|  Problems:                                                     |
|   - Reads 40 columns when you needed 2                         |
|   - Breaks when a new column is added upstream                 |
|   - Leaks password_hash into logs and API responses            |
|   - Defeats covering-index optimizations                       |
|   - Ships megabytes of BLOB data across the wire               |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           EXPLICIT COLUMNS (The Fix)                          |
+---------------------------------------------------------------+
|                                                                |
|  SELECT id, name FROM users WHERE id = 42;                    |
|                                                                |
|  Benefits:                                                     |
|   - Tiny payload, predictable shape                            |
|   - Stable schema contract with the application                |
|   - Query planner can use covering indexes                     |
|   - No accidental leakage of sensitive columns                 |
|   - Code review catches new column usage explicitly            |
|                                                                |
+---------------------------------------------------------------+

Napkin AI Visual Prompt: "Dark gradient (#0a0f1f -> #111a2e). Split comparison: LEFT side labeled 'SELECT *' shows a librarian wheeling out a massive cart of 40 books, with red highlights on items like 'password_hash' and 'avatar_blob'. RIGHT side labeled 'Explicit columns' shows the same librarian handing over a single index card with three neat rows, highlighted in sky blue (#4fc3f7). White monospace labels. Rose (#ff5c8a) arrow labeled 'network + memory cost' pointing at the large stack on the left."


SELECT Basics -- Explicit Columns First

The canonical shape of a SELECT is SELECT <columns> FROM <table>. Filtering with WHERE is covered in detail in Lesson 4.1 -- here we focus purely on the projection step: which columns come back and in what shape.

-- users table shape for all examples in this lesson
-- id | name      | email              | created_at          | status
-- ---+-----------+--------------------+---------------------+--------
--  1 | Alice     | alice@example.com  | 2024-01-15 10:00:00 | active
--  2 | Bob       | bob@example.com    | 2024-02-20 14:30:00 | active
--  3 | Charlie   | charlie@ex.com     | 2024-03-10 09:15:00 | inactive
--  4 | Dana      | dana@example.com   | 2024-04-01 16:45:00 | active

-- Explicit projection: list every column you actually need.
SELECT id, name, email
FROM users;

-- Output:
-- id | name    | email
-- ---+---------+---------------------
--  1 | Alice   | alice@example.com
--  2 | Bob     | bob@example.com
--  3 | Charlie | charlie@ex.com
--  4 | Dana    | dana@example.com

The two rules worth internalizing from day one:

  1. Order matters for the output, not for performance. SELECT name, id and SELECT id, name produce identical rows in different column orders. Pick the order that makes the result easiest to read.
  2. Column names bind at parse time. Misspelling emial throws a "column does not exist" error immediately -- before any rows are read.

SELECT * in Production -- Why Senior Engineers Avoid It

SELECT * is fine at a psql prompt when you are exploring a schema. It is a liability inside application code for four concrete reasons:

  • Schema drift. Tomorrow someone adds a password_reset_token column. Your SELECT * endpoint now exposes it in every API response until someone notices.
  • Wasted bandwidth. Hydrating 40 columns to render a list of user names transfers 20x the bytes you actually need, hits the TCP window harder, and slows the whole request.
  • Covering-index defeat. If an index on (status, id, name) could answer your query without touching the heap, SELECT * forces the database to look up every row's full record anyway.
  • Brittle code. Your application code says row.name, row.email. Adding a column you do not use is harmless. But SELECT * in an INSERT INTO ... SELECT * can silently break when column counts shift between tables.

The rule of thumb is simple: list every column you use, and no columns you do not use. If that list is long, copy-paste it -- a long explicit list is still better than a short implicit one.
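The schema-drift failure is easy to reproduce. A minimal sketch using SQLite via Python's sqlite3 (the table and the password_reset_token column are illustrative): adding a column silently changes the shape of SELECT * while the explicit projection stays stable.

```python
import sqlite3

# Hypothetical schema for illustration; names mirror the lesson's examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Alice')")

# SELECT * today returns 2 columns.
cur = conn.execute("SELECT * FROM users")
cols_before = [d[0] for d in cur.description]

# Someone adds a sensitive column tomorrow...
conn.execute("ALTER TABLE users ADD COLUMN password_reset_token TEXT")

# ...and every SELECT * silently grows to 3 columns.
cur = conn.execute("SELECT * FROM users")
cols_after = [d[0] for d in cur.description]

# The explicit projection keeps its shape no matter what gets added.
cur = conn.execute("SELECT id, name FROM users")
cols_explicit = [d[0] for d in cur.description]

print(cols_before)    # ['id', 'name']
print(cols_after)     # ['id', 'name', 'password_reset_token']
print(cols_explicit)  # ['id', 'name']
```

The explicit query is the one whose response shape a code reviewer can verify without opening the migration history.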


Column Aliases with AS

An alias renames a column in the result set. It does not change the table. It is the rename layer between what the database stores and what the application sees.

-- Alias with AS (the AS is optional in most dialects but improves readability)
SELECT
  id          AS user_id,
  name        AS full_name,
  email       AS contact_email
FROM users;

-- Output:
-- user_id | full_name | contact_email
-- --------+-----------+---------------------
--       1 | Alice     | alice@example.com
--       2 | Bob       | bob@example.com

Aliases are especially valuable when:

  • Joining tables with name collisions. users.id and orders.id both exist; aliasing them user_id and order_id disambiguates the result.
  • Returning a computed column. price * qty has no natural name -- SELECT price * qty AS line_total gives it one.
  • Matching application DTOs. If your API returns fullName, aliasing name AS "fullName" lets the database layer hand you the exact key your code expects.
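The rename layer is visible directly from the driver. A small sketch using SQLite via Python's sqlite3 (schema mirrors the lesson's users table): the cursor reports the alias names, while the table's own columns are unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Alice', 'alice@example.com')")

# Aliases rename the columns the driver sees, not the columns the table stores.
cur = conn.execute("""
    SELECT id AS user_id, name AS full_name, email AS contact_email
    FROM users
""")
print([d[0] for d in cur.description])  # ['user_id', 'full_name', 'contact_email']

# The table itself is untouched.
cur = conn.execute("SELECT * FROM users")
print([d[0] for d in cur.description])  # ['id', 'name', 'email']
```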

The Alias Scope Gotcha

Aliases defined in the SELECT list are not visible in the WHERE clause in standard SQL (Postgres, SQL Server, and Oracle all enforce this). They are reliably available only in ORDER BY, which is evaluated after SELECT; some dialects also accept them in GROUP BY as an extension. This trips up almost every beginner.

-- DOES NOT WORK in Postgres / SQL Server / Oracle:
SELECT price * qty AS line_total
FROM order_items
WHERE line_total > 100;
-- ERROR: column "line_total" does not exist

-- Works -- repeat the expression in WHERE
SELECT price * qty AS line_total
FROM order_items
WHERE price * qty > 100;

-- Or use a subquery / CTE to give the alias a real scope
WITH scored AS (
  SELECT id, price * qty AS line_total
  FROM order_items
)
SELECT id, line_total
FROM scored
WHERE line_total > 100;

A few engines are more permissive -- SQLite, for example, does allow aliases in WHERE -- but relying on that makes your SQL non-portable. (MySQL, despite its reputation for laxness, rejects aliases in WHERE too.) Assume the standard rule: aliases exist after the SELECT runs, not before.


Computed Columns and Expressions

The SELECT list is not limited to table columns. Any SQL expression can appear there: arithmetic, string concatenation, function calls, CASE expressions, type casts. The result gets a synthetic column in the output.

-- order_items table
-- id | product      | price | qty
-- ---+--------------+-------+----
--  1 | Widget       | 10.00 |  3
--  2 | Gadget       | 25.00 |  2
--  3 | Thingamajig  |  4.50 | 10

-- Arithmetic on columns
SELECT
  id,
  product,
  price,
  qty,
  price * qty              AS line_total,
  price * qty * 0.08       AS tax,
  price * qty * 1.08       AS total_with_tax
FROM order_items;

-- Output:
-- id | product     | price | qty | line_total | tax  | total_with_tax
-- ---+-------------+-------+-----+------------+------+---------------
--  1 | Widget      | 10.00 |   3 |      30.00 | 2.40 |         32.40
--  2 | Gadget      | 25.00 |   2 |      50.00 | 4.00 |         54.00
--  3 | Thingamajig |  4.50 |  10 |      45.00 | 3.60 |         48.60

String and function expressions work the same way:

-- String concat, date functions, CASE expressions, casts
SELECT
  id,
  name || ' <' || email || '>'          AS mailbox,      -- Postgres concat
  EXTRACT(YEAR FROM created_at)         AS signup_year,
  AGE(NOW(), created_at)                AS account_age,
  CASE
    WHEN status = 'active'   THEN 'paying'
    WHEN status = 'inactive' THEN 'churned'
    ELSE 'unknown'
  END                                    AS segment,
  CAST(id AS TEXT)                       AS id_string
FROM users;

-- Output:
-- id | mailbox                         | signup_year | account_age     | segment | id_string
-- ---+---------------------------------+-------------+-----------------+---------+----------
--  1 | Alice <alice@example.com>       |        2024 | 1 year 2 mons   | paying  | 1
--  2 | Bob <bob@example.com>           |        2024 | 1 year 1 mon    | paying  | 2
--  3 | Charlie <charlie@ex.com>        |        2024 | 1 year          | churned | 3

Computed columns add no extra I/O -- they do not read extra data, they only transform what the SELECT already fetched. The cost is CPU on the database server, which is almost always the right tradeoff compared to pulling raw columns into the application and computing there.


DISTINCT -- What It Actually Does

DISTINCT removes duplicate rows from the result set. The subtlety that trips people up is that "duplicate" is defined across all selected columns together, not any one column in isolation.

-- orders table
-- id | user_id | product | qty
-- ---+---------+---------+----
--  1 |     1   | Widget  |  2
--  2 |     1   | Gadget  |  1
--  3 |     2   | Widget  |  3
--  4 |     1   | Widget  |  2   <- exact duplicate of row 1 (except id)

-- Unique users who have placed orders
SELECT DISTINCT user_id FROM orders;
-- Output:
-- user_id
-- -------
--       1
--       2

-- Unique (user_id, product) pairs -- "which products has each user bought?"
SELECT DISTINCT user_id, product FROM orders;
-- Output:
-- user_id | product
-- --------+--------
--       1 | Widget
--       1 | Gadget
--       2 | Widget

-- DISTINCT applies to the ENTIRE row, not just user_id
-- Common beginner mistake: "give me distinct users with their products"
-- This does NOT collapse to one row per user.

How DISTINCT Works Under the Hood

The database has two strategies, and the planner picks based on statistics and indexes:

  • Sort + unique scan. Sort the entire result set, then walk through and drop adjacent duplicates. O(n log n). Used when the row count is large and memory is tight.
  • Hash aggregation. Build an in-memory hash set keyed on the distinct columns, insert each row if not already present. O(n) average but needs memory proportional to the number of distinct groups.

Either way, DISTINCT is not free. It materializes intermediate results, it may spill to disk, and it defeats streaming output. Running SELECT DISTINCT * over a million-row table will eat memory and slow your query noticeably even if no duplicates exist.

+---------------------------------------------------------------+
|           DISTINCT EXECUTION                                   |
+---------------------------------------------------------------+
|                                                                |
|  Input rows (8 million)                                        |
|          |                                                     |
|          v                                                     |
|  +---------------+   HashAggregate                             |
|  | hash set on   |   (if memory permits)                       |
|  | (user_id,     |   ->  O(n) time, O(distinct groups) space   |
|  |  product)     |                                             |
|  +---------------+                                             |
|          |                                                     |
|          v                                                     |
|  Output rows (~4000 unique pairs)                              |
|                                                                |
|  Alternative: Sort -> Unique                                   |
|   - O(n log n) time                                            |
|   - spills to disk if work_mem exceeded                        |
|   - streams output once sort completes                         |
|                                                                |
+---------------------------------------------------------------+

The warning sign. If you are adding DISTINCT to fix duplicates that appeared after a JOIN, pause. Nine times out of ten, the JOIN is wrong -- you are joining on a non-unique key and silently multiplying rows, and DISTINCT is papering over the bug instead of fixing it. The correct fix is usually a proper join condition, a GROUP BY, or an EXISTS subquery.
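Here is that failure mode in miniature, sketched with SQLite via Python's sqlite3 (tables are illustrative): the JOIN multiplies rows on the non-unique user_id key, DISTINCT deduplicates after the fact, and EXISTS asks the intended question directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, product TEXT);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
INSERT INTO orders VALUES (1, 1, 'Widget'), (2, 1, 'Gadget'), (3, 2, 'Widget');
""")

# "Users who have ordered" via JOIN: the non-unique join key multiplies Alice.
joined = conn.execute(
    "SELECT u.name FROM users u JOIN orders o ON o.user_id = u.id ORDER BY u.name"
).fetchall()
print(joined)   # [('Alice',), ('Alice',), ('Bob',)] -- Alice appears twice

# DISTINCT papers over the multiplication after the work is already done.
deduped = conn.execute(
    "SELECT DISTINCT u.name FROM users u JOIN orders o ON o.user_id = u.id "
    "ORDER BY u.name"
).fetchall()

# EXISTS asks the question you actually meant: one row per user, no dedup pass.
exists = conn.execute(
    "SELECT name FROM users u WHERE EXISTS "
    "(SELECT 1 FROM orders o WHERE o.user_id = u.id) ORDER BY name"
).fetchall()

print(deduped)  # [('Alice',), ('Bob',)]
print(exists)   # [('Alice',), ('Bob',)] -- same answer, no row multiplication
```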


LIMIT and OFFSET -- The Classic Pagination Pattern

LIMIT N caps the result at N rows. OFFSET M skips the first M rows before applying the limit. Together they implement "page M of N per page" pagination in the most straightforward way possible.

-- Page 1: rows 1-20
SELECT id, name, created_at
FROM users
ORDER BY created_at DESC
LIMIT 20 OFFSET 0;

-- Page 2: rows 21-40
SELECT id, name, created_at
FROM users
ORDER BY created_at DESC
LIMIT 20 OFFSET 20;

-- Page 50: rows 981-1000
SELECT id, name, created_at
FROM users
ORDER BY created_at DESC
LIMIT 20 OFFSET 980;

A few rules that are easy to miss:

  • Always ORDER BY with LIMIT. Without ORDER BY, the database returns rows in any order it likes, and "page 2" might repeat rows from "page 1" or skip rows entirely. The order must also be deterministic -- if you order by a column with ties (created_at), add id as a tiebreaker: ORDER BY created_at DESC, id DESC.
  • LIMIT without OFFSET is fine. LIMIT 10 just returns the first 10 rows in the ordered result. Use it whenever you need "top N".
  • OFFSET 0 is the default. You can omit it for the first page.
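Those rules can be checked mechanically. A sketch using SQLite via Python's sqlite3 (schema illustrative), with deliberate ties on created_at to show why the id tiebreaker keeps pages from overlapping or skipping:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, created_at TEXT)")
# Deliberate ties: ids 1-100 share only five distinct created_at dates.
conn.executemany(
    "INSERT INTO users (id, created_at) VALUES (?, ?)",
    [(i, f"2024-01-{(i % 5) + 1:02d}") for i in range(1, 101)],
)

def fetch_page(page_num, size=20):
    # Tie-break on id so the order is fully deterministic.
    return [row[0] for row in conn.execute(
        "SELECT id FROM users ORDER BY created_at DESC, id DESC LIMIT ? OFFSET ?",
        (size, page_num * size),
    )]

seen = [uid for n in range(5) for uid in fetch_page(n)]
print(len(seen), len(set(seen)))  # 100 100 -- every row exactly once
```

Drop the id tiebreaker and the database is free to order tied rows however it likes on each execution, which is exactly how rows get repeated or lost between pages.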

The Hidden Cost of Deep OFFSET

OFFSET looks cheap in SQL. It is not. The database must read and discard every row it skips. OFFSET 100000 literally reads 100,000 rows from disk, sorts them, throws them away, and then returns the next 20. At page 5000 of a large table, your "fast" pagination query is doing five thousand times the work it did for page 1.

+---------------------------------------------------------------+
|           OFFSET AT SCALE -- THE SLOWDOWN                     |
+---------------------------------------------------------------+
|                                                                |
|  Page 1:    OFFSET 0     LIMIT 20  ->  reads 20 rows          |
|  Page 10:   OFFSET 180   LIMIT 20  ->  reads 200 rows         |
|  Page 100:  OFFSET 1980  LIMIT 20  ->  reads 2000 rows        |
|  Page 1000: OFFSET 19980 LIMIT 20  ->  reads 20000 rows       |
|  Page 5000: OFFSET 99980 LIMIT 20  ->  reads 100000 rows      |
|                                                                |
|  Query latency grows LINEARLY with the offset.                 |
|  Users on page 5000 wait 50x longer than users on page 100.    |
|                                                                |
+---------------------------------------------------------------+

This is the pain point every backend engineer eventually hits in production. Small tables survive fine. Tables with millions of rows turn deep pagination into a denial-of-service vector -- a single scraper hitting ?page=50000 can pin a CPU core.


Keyset (Cursor) Pagination -- The Senior Engineer's Answer

Keyset pagination replaces "skip N rows" with "give me the rows after this specific key." You remember the last row the client saw, pass its key back as a cursor, and the database jumps straight to that position using an index.

-- First page: no cursor yet
SELECT id, name, created_at
FROM users
ORDER BY id DESC
LIMIT 20;

-- Output (last id returned: 981):
-- id  | name      | created_at
-- ----+-----------+---------------------
-- 1000| Zoe       | 2025-04-10 ...
--  999| Yuki      | 2025-04-09 ...
-- ...
--  981| Sam       | 2025-03-22 ...

-- Next page: use the last id the client saw as the cursor
SELECT id, name, created_at
FROM users
WHERE id < 981                  -- key cursor
ORDER BY id DESC
LIMIT 20;

-- Output (last id returned: 961):
-- id  | name      | created_at
-- ----+-----------+---------------------
--  980| Riley     | 2025-03-21 ...
-- ...
--  961| Noah      | 2025-03-02 ...

With an index on id, the database performs an index seek to id < 981 and reads exactly 20 rows. That cost stays constant as the user paginates deeper -- page 5000 is just as fast as page 2.

Keyset with Composite Keys

If you want to paginate by created_at DESC, but created_at has ties, you need a composite cursor: (created_at, id). The comparison is lexicographic.

-- Cursor: last row had (created_at='2025-03-22 10:00', id=981)
SELECT id, name, created_at
FROM users
WHERE (created_at, id) < ('2025-03-22 10:00:00', 981)
ORDER BY created_at DESC, id DESC
LIMIT 20;

-- Postgres supports tuple comparison natively.
-- In MySQL, expand it by hand:
-- WHERE (created_at <  '2025-03-22 10:00:00')
--    OR (created_at =  '2025-03-22 10:00:00' AND id < 981)

Offset vs Keyset -- The Tradeoffs

+---------------------------------------------------------------+
|           OFFSET vs KEYSET                                     |
+---------------------------------------------------------------+
|                                                                |
|  OFFSET PAGINATION                                             |
|   + Simple: "page 37 of 1200"                                  |
|   + Jump to any page by number                                 |
|   + Accurate total page count                                  |
|   - O(offset) cost -- deep pages are slow                      |
|   - Rows shift when new data is inserted                       |
|   - Can show duplicates or skip rows during writes             |
|                                                                |
|  KEYSET PAGINATION                                             |
|   + O(log n) regardless of depth -- constant latency           |
|   + Stable under concurrent inserts                            |
|   + Great for infinite scroll / "load more" UIs                |
|   - Cannot jump to "page 37" directly                          |
|   - No total count without a separate COUNT query              |
|   - Cursor must be opaque and URL-safe                         |
|                                                                |
|  RULE OF THUMB:                                                |
|   Admin dashboards with 20 pages -> OFFSET                     |
|   Public feeds / infinite scroll  -> KEYSET                    |
|   > 1000 pages anywhere           -> KEYSET                    |
|                                                                |
+---------------------------------------------------------------+

The interview answer. Offset is easy and works for small result sets. Keyset is the right default whenever the result set might exceed a few hundred pages or whenever you care about consistent performance under load. Switching from offset to keyset is one of the highest-leverage backend changes you can ship on a growing product.
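A minimal keyset walker, sketched with SQLite via Python's sqlite3 (the keyset_page helper and schema are illustrative, not a library API): each call seeks past the last id the client saw and reads exactly one page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(i, f"user{i}") for i in range(1, 101)])

def keyset_page(cursor=None, size=20):
    """Fetch the next page of users after the given id cursor (descending)."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id, name FROM users ORDER BY id DESC LIMIT ?", (size,)
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id < ? ORDER BY id DESC LIMIT ?",
            (cursor, size),
        ).fetchall()
    next_cursor = rows[-1][0] if rows else None
    return rows, next_cursor

# Walk the whole table 20 rows at a time, feeding the last id back as the cursor.
seen, cursor = [], None
while True:
    rows, cursor = keyset_page(cursor)
    if not rows:
        break
    seen.extend(r[0] for r in rows)

print(len(seen))  # 100 -- every row visited exactly once, newest first
```

In a real API the cursor would be serialized into an opaque token rather than exposed as a raw id, but the WHERE-seek-plus-LIMIT core is exactly this.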


SELECT INTO -- Copying Result Sets into New Tables

SELECT INTO creates a brand new table from the result of a query. It is shorthand for "take whatever this SELECT returns and materialize it as a table."

-- Postgres / SQL Server syntax: creates active_users as a new table
SELECT id, name, email, created_at
INTO active_users
FROM users
WHERE status = 'active';

-- After this runs:
-- - A table named active_users exists
-- - It has the columns id, name, email, created_at
-- - It contains a snapshot of the matching rows at the time the query ran
-- - No indexes or constraints are copied over -- add them separately

-- MySQL does not support SELECT INTO for this purpose. Use CREATE TABLE AS:
CREATE TABLE active_users AS
SELECT id, name, email, created_at
FROM users
WHERE status = 'active';

Common uses for SELECT INTO:

  • Quick backups before a risky migration. SELECT * INTO users_backup FROM users; preserves the full table as a safety net.
  • Materializing expensive query results for reporting jobs that can tolerate stale data.
  • ETL staging tables where you want the exact shape of a query snapshot.

Note that SELECT INTO is a DDL operation (it creates a table) wrapped in what looks like a DML statement. It bypasses some protections -- no indexes, no constraints, no foreign keys are copied -- so treat the resulting table as raw data that needs schema hardening before anything depends on it.

Inside PL/pgSQL and other procedural languages, SELECT INTO has a different meaning: it assigns query results into variables. That is a separate feature with the same keyword -- do not confuse the two.
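Both behaviors -- the frozen snapshot and the missing indexes -- are observable in a few lines. A sketch using SQLite via Python's sqlite3, which takes the CREATE TABLE AS form like MySQL (schema illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, status TEXT);
CREATE INDEX idx_users_status ON users (status);
INSERT INTO users VALUES (1, 'Alice', 'active'), (2, 'Bob', 'inactive');
""")

# Materialize a snapshot of the active users.
conn.execute(
    "CREATE TABLE active_users AS "
    "SELECT id, name FROM users WHERE status = 'active'"
)

# The snapshot is frozen: later writes to users do not touch it.
conn.execute("UPDATE users SET name = 'Alicia' WHERE id = 1")
snapshot = conn.execute("SELECT name FROM active_users").fetchall()
print(snapshot)  # [('Alice',)] -- still the value at snapshot time

# And no indexes came along for the ride.
idx_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master "
    "WHERE type = 'index' AND tbl_name = 'active_users'"
).fetchone()[0]
print(idx_count)  # 0
```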


Subqueries in the SELECT List

A scalar subquery can appear as a column in the SELECT list. The database runs it once per outer row and returns a single value.

-- For each user, count their orders
SELECT
  u.id,
  u.name,
  (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count,
  (SELECT MAX(created_at) FROM orders o WHERE o.user_id = u.id) AS last_order_at
FROM users u;

-- Output:
-- id | name    | order_count | last_order_at
-- ---+---------+-------------+---------------------
--  1 | Alice   |           3 | 2025-04-10 12:00:00
--  2 | Bob     |           1 | 2025-03-15 09:00:00
--  3 | Charlie |           0 | NULL

Scalar subqueries are convenient and readable for "for each row, give me one aggregate from another table." The catch is that the subquery runs once per outer row unless the optimizer rewrites it as a join -- and for large user tables that can be thousands of identical lookups. For production code, prefer a LEFT JOIN ... GROUP BY or a correlated LATERAL subquery, both of which give the planner more room to optimize.

Use scalar subqueries for ad hoc queries and reporting. Reach for JOINs or CTEs when the same shape ships in application code.
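The equivalence of the two shapes is easy to verify on small data. A sketch with SQLite via Python's sqlite3 (tables illustrative); note COUNT(o.id) rather than COUNT(*) in the join form so users with no orders count as 0 instead of 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
INSERT INTO orders VALUES (1, 1), (2, 1), (3, 1), (4, 2);
""")

# Scalar subquery: readable, runs once per outer row.
scalar = conn.execute("""
    SELECT u.id, u.name,
           (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
    FROM users u ORDER BY u.id
""").fetchall()

# LEFT JOIN + GROUP BY: one aggregation pass the planner can optimize.
joined = conn.execute("""
    SELECT u.id, u.name, COUNT(o.id) AS order_count
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.name ORDER BY u.id
""").fetchall()

print(scalar == joined)  # True -- same result, different execution shape
print(scalar)            # [(1, 'Alice', 3), (2, 'Bob', 1), (3, 'Charlie', 0)]
```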


Common Mistakes

1. Using SELECT * in application code. Fine at the psql prompt, dangerous in production. Adding a column to the table silently changes every response shape, wastes bandwidth, leaks sensitive columns, and defeats covering indexes. List the columns you need explicitly -- your future self will thank you when you add password_reset_token next month and it does not immediately ship in every API response.

2. Paginating with OFFSET on large tables. LIMIT 20 OFFSET 100000 reads 100,020 rows to return 20. Latency grows linearly with depth, and a single scraper hitting ?page=99999 can pin a database core. Switch to keyset pagination (WHERE id > last_id ORDER BY id LIMIT N) whenever the table exceeds a few thousand rows or the UI is infinite-scroll rather than numbered pages.

3. Using DISTINCT to hide broken JOINs. When a JOIN returns duplicates, the first instinct is to slap DISTINCT on the outer query. That hides the bug instead of fixing it -- the underlying JOIN is still multiplying rows and wasting work. Investigate the JOIN condition, confirm that you are joining on a unique key, and use GROUP BY, EXISTS, or a proper LEFT JOIN ... WHERE table.key IS NOT NULL instead. DISTINCT belongs on queries where the raw data actually has duplicates, not as a post-processing Band-Aid.

4. Referencing column aliases in the WHERE clause. SELECT price * qty AS total FROM items WHERE total > 100 throws an error in Postgres, SQL Server, and Oracle. Aliases are assigned after WHERE runs, so the alias does not exist yet. Repeat the expression in WHERE, or wrap the whole thing in a subquery or CTE where the alias has a real scope.

5. Omitting ORDER BY with LIMIT. LIMIT 20 without ORDER BY returns 20 rows in whatever order the storage engine hands them over -- usually insertion order, sometimes not. Paginating produces inconsistent results and duplicated rows across pages. Always order with a deterministic key, and add a tiebreaker column (ORDER BY created_at DESC, id DESC) whenever the sort column can have ties.

6. Forgetting that DISTINCT applies to the whole row. SELECT DISTINCT user_id, product FROM orders does not give you "distinct users with one product each" -- it gives you every unique (user_id, product) pair. If you want one row per user, use GROUP BY user_id with aggregates, not DISTINCT.


Interview Questions

1. "Why is SELECT * considered an anti-pattern in production code?"

Four reasons stack together. First, it defeats the implicit contract between the application and the schema -- adding a column changes every response shape, and sensitive fields (password hashes, reset tokens, internal flags) can leak into logs or API responses without anyone noticing. Second, it wastes bandwidth and memory by transferring columns the application does not use, which gets worse as new columns are added over the table's lifetime. Third, it defeats covering-index optimizations -- an index that could answer the query without touching the heap is forced to do a heap lookup just because the query asked for columns not in the index. Fourth, it makes code review harder because you cannot see at a glance which columns the query actually needs. The fix is to list every column explicitly. Even long lists are better than * because they make the schema dependency visible and stable.

2. "Explain the difference between OFFSET and keyset pagination, and when you would use each."

OFFSET pagination uses LIMIT N OFFSET M -- the database reads and discards M rows, then returns the next N. It is simple and supports arbitrary page jumps ("go to page 37"), which is why it shows up in admin dashboards. Its fatal flaw is that cost scales linearly with the offset: at page 5000 the query reads 100,000 rows to return 20. For large tables and public-facing feeds, this becomes a performance cliff and a DoS vector. Keyset pagination instead remembers the last key the client saw and issues WHERE id > last_id ORDER BY id LIMIT N. With an index on id, the database does an index seek straight to the cursor position and reads exactly N rows, so latency stays constant regardless of depth. Keyset is also stable under concurrent inserts -- rows do not shift under the user as new data arrives. The tradeoff is that you cannot jump directly to page 37 and you cannot cheaply report total page count. The rule of thumb: offset for small numbered-page admin UIs, keyset for infinite scroll and anywhere the table is large enough that users can reach deep pages.

3. "What does DISTINCT actually do under the hood, and why is it often a sign of a bug?"

DISTINCT removes duplicate rows from the result set, where "duplicate" means all selected columns match across two rows. Internally the database picks between two strategies: a hash aggregation that builds an in-memory set keyed on the distinct columns (O(n) time, memory proportional to distinct groups) or a sort-and-unique that sorts the entire result set and drops adjacent duplicates (O(n log n), possibly spilling to disk). Either way, it materializes intermediate results and is not free. The reason it is usually a bug signal is that most "unexpected duplicates" in query results come from a JOIN that is multiplying rows on a non-unique key. Slapping DISTINCT on the outer query hides the multiplication without preventing the wasted work -- the database still joins and still produces all the duplicate rows, then pays again to deduplicate them. The correct fix is almost always to tighten the JOIN condition, use EXISTS instead of a JOIN when you only need existence, or switch to GROUP BY with aggregates. Legitimate uses of DISTINCT exist -- "unique countries in this orders table" -- but whenever DISTINCT appears on top of a JOIN, investigate the JOIN first.

4. "What is the alias scope rule in standard SQL, and why does SELECT price * qty AS total FROM items WHERE total > 100 fail?"

Standard SQL evaluates clauses in a specific logical order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY. Aliases are created by the SELECT clause, so they exist only after SELECT has run. WHERE runs before SELECT, so when the database evaluates WHERE total > 100, the alias total does not yet exist and the query errors with "column does not exist." Aliases are reliably usable in ORDER BY because it runs after SELECT; some dialects extend this to GROUP BY as a convenience. The fix is to either repeat the expression (WHERE price * qty > 100) or wrap the base query in a subquery or CTE where the alias is an actual column. A few permissive engines such as SQLite do allow aliases in WHERE, but relying on that produces non-portable SQL.

5. "You are asked to return the 20 most recent users along with each user's order count. What are the tradeoffs between a scalar subquery in the SELECT list and a LEFT JOIN with GROUP BY?"

A scalar subquery in SELECT -- (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count -- is readable and local; the reader sees exactly what the column means. The risk is that the subquery runs once per outer row unless the optimizer rewrites it as a join, so for a list of 20 users it is fine, but for 20,000 users it is 20,000 separate aggregates. A LEFT JOIN orders ... GROUP BY users.id pushes the work into a single aggregation pass, which the planner can optimize with hash or sort aggregates and parallelize. It is more verbose and requires GROUP BY on every user column you project, which can be painful (Postgres relaxes this: grouping by the primary key lets you project the table's other columns). A third option worth mentioning in interviews is a LATERAL subquery (Postgres) or CROSS APPLY (SQL Server), which keeps the per-row readability of a scalar subquery while letting the planner treat it as a join. For a 20-row LIMIT query with a scalar subquery, the subquery only runs 20 times regardless of users table size, so the scalar approach is genuinely fine -- but that caveat is exactly what the interviewer wants you to articulate.
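Both shapes produce identical results, which is worth demonstrating before discussing cost. A minimal sketch with Python's sqlite3 (schema and sample rows are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT, created_at TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
    INSERT INTO users  VALUES (1, 'ada', '2024-01-01'), (2, 'bob', '2024-02-01');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Option A: scalar subquery -- one aggregate per outer row, but the
# LIMIT bounds how many times it runs.
a = con.execute("""
    SELECT u.name,
           (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) AS order_count
    FROM users u
    ORDER BY u.created_at DESC, u.id DESC
    LIMIT 20
""").fetchall()

# Option B: LEFT JOIN + GROUP BY -- a single aggregation pass over orders.
b = con.execute("""
    SELECT u.name, COUNT(o.id) AS order_count
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.name
    ORDER BY u.created_at DESC, u.id DESC
    LIMIT 20
""").fetchall()

print(a)       # [('bob', 1), ('ada', 2)]
print(a == b)  # True
```

Note COUNT(o.id) rather than COUNT(*) in option B: with a LEFT JOIN, a user with no orders should count 0, and COUNT(*) would wrongly count the NULL-padded row.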


Quick Reference -- SELECT Cheat Sheet

+---------------------------------------------------------------+
|           SELECT CHEAT SHEET                                   |
+---------------------------------------------------------------+
|                                                                |
|  BASIC SHAPE:                                                  |
|    SELECT <columns> FROM <table>                               |
|    [WHERE ...] [ORDER BY ...] [LIMIT N [OFFSET M]]             |
|                                                                |
|  PROJECTIONS:                                                  |
|    SELECT id, name                 -- explicit (good)          |
|    SELECT *                        -- exploration only         |
|    SELECT price * qty AS total     -- computed column          |
|    SELECT name || ' <' || email    -- string expression        |
|                                                                |
|  ALIASES:                                                      |
|    SELECT id AS user_id            -- AS is optional           |
|    Visible in: ORDER BY (GROUP BY/HAVING vary by dialect)     |
|    NOT visible in: WHERE (standard SQL)                        |
|                                                                |
|  DISTINCT:                                                     |
|    SELECT DISTINCT col1, col2      -- whole-row uniqueness     |
|    Cost: hash or sort, not free                                |
|                                                                |
|  LIMIT / OFFSET (classic pagination):                          |
|    ORDER BY created_at DESC, id DESC  -- deterministic         |
|    LIMIT 20 OFFSET 40                 -- page 3                |
|    Cost: O(offset) -- slow on deep pages                       |
|                                                                |
|  KEYSET (cursor pagination):                                   |
|    WHERE id < :last_seen_id                                    |
|    ORDER BY id DESC                                            |
|    LIMIT 20                                                    |
|    Cost: O(log n) regardless of depth                          |
|                                                                |
|  SELECT INTO (Postgres / SQL Server):                          |
|    SELECT cols INTO new_table FROM source WHERE ...            |
|    MySQL: CREATE TABLE new_table AS SELECT ...                 |
|                                                                |
|  SCALAR SUBQUERY IN SELECT:                                    |
|    SELECT id,                                                  |
|      (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id)    |
|    FROM users u                                                |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. List columns explicitly -- never SELECT * in prod code     |
|  2. Always ORDER BY when using LIMIT                           |
|  3. Add a tiebreaker column to any non-unique ORDER BY         |
|  4. OFFSET is O(offset) -- switch to keyset above ~1000 pages  |
|  5. DISTINCT on top of a JOIN is usually a bug signal          |
|  6. Aliases do not work in WHERE -- repeat the expression      |
|  7. Scalar subqueries run per outer row -- safe with LIMIT     |
|  8. SELECT INTO copies data, not indexes or constraints        |
|                                                                |
+---------------------------------------------------------------+
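The keyset pattern from the cheat sheet is short enough to see end to end. A minimal sketch with Python's sqlite3 (the users table and the sentinel cursor value are illustrative): each page's query seeks past the last id seen instead of counting an OFFSET, so page 50 costs the same as page 1.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                [(i, f"user{i}") for i in range(1, 101)])

def page(last_seen_id, size=20):
    # Keyset pagination: seek past the cursor via the primary-key index
    # instead of reading and discarding OFFSET rows.
    return con.execute(
        "SELECT id, name FROM users WHERE id < ? ORDER BY id DESC LIMIT ?",
        (last_seen_id, size),
    ).fetchall()

first = page(10**9)            # sentinel above any real id = page 1
second = page(first[-1][0])    # cursor = last id of the previous page

print(first[0][0], first[-1][0], second[0][0], second[-1][0])  # 100 81 80 61
```

The tradeoff the cheat sheet implies: a keyset cursor cannot jump straight to page N, which is why it pairs with infinite scroll rather than numbered page links.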
| Concern                   | Wrong Way                 | Right Way                     |
|---------------------------|---------------------------|-------------------------------|
| Projection                | SELECT *                  | SELECT id, name, email        |
| Deep pagination           | LIMIT 20 OFFSET 100000    | WHERE id < :cursor LIMIT 20   |
| Duplicate rows after JOIN | SELECT DISTINCT ...       | Fix the JOIN or use EXISTS    |
| Alias in WHERE            | WHERE line_total > 100    | WHERE price * qty > 100       |
| LIMIT ordering            | LIMIT 20 alone            | ORDER BY id DESC LIMIT 20     |
| Ties in ORDER BY          | ORDER BY created_at       | ORDER BY created_at, id       |
| Per-row aggregate         | Loop in app code          | Scalar subquery or JOIN       |
| Snapshot a query          | CREATE TABLE, then INSERT | SELECT INTO / CREATE TABLE AS |

Prev: Lesson 3.1 -- Insert Data | Next: Lesson 3.3 -- Update Data


This is Lesson 3.2 of the Database Interview Prep Course -- 12 chapters, 58 lessons.
