Database Interview Prep
Indexing

Primary and Secondary Indexes

Clustered Storage, Heap Tables, and Lookup Cost

LinkedIn Hook

"Your Postgres query reads the index in 0.2 ms and then spends 40 ms fetching the row. Welcome to the heap lookup tax."

Most developers think of an index as a single thing — a magical accelerator you sprinkle on a column and the query gets faster. But there are actually two fundamentally different kinds of indexes, and the distinction shapes everything: how rows are physically stored on disk, how much I/O each lookup costs, why your primary key choice changes write throughput by 10x, and why the same query plan looks completely different on PostgreSQL versus MySQL InnoDB.

A primary index decides where the row physically lives. A secondary index points at the row from somewhere else. In InnoDB, the primary key IS the table — rows are stored inside the leaves of the primary key B-tree, so every secondary index lookup has to do TWO B-tree walks: one to find the primary key value, and a second to find the actual row. In PostgreSQL, every index is a secondary index, the table is a separate heap, and every index lookup has to do an extra random I/O to fetch the row from the heap (unless the planner can use an index-only scan).

This is why a UUIDv4 primary key on InnoDB destroys write performance — random insertions fragment the clustered B-tree across the entire data file. It is why "covering indexes" matter so much in Postgres — they let you skip the heap visit entirely. It is why the rule "primary keys should be small, monotonic, and immutable" exists at all.

In Lesson 9.2, I break down primary versus secondary indexes: clustered vs non-clustered storage, InnoDB's clustered PK, Postgres's heap plus secondary indexes, the per-lookup cost model, and the architectural impact on primary key choice.

Read the full lesson -> [link]

#SQL #Database #PostgreSQL #MySQL #InnoDB #Indexing #BackendDevelopment #InterviewPrep


Primary and Secondary Indexes thumbnail


What You'll Learn

  • The difference between a primary (clustered) index and a secondary (non-clustered) index
  • How InnoDB stores rows inside the leaves of the primary key B-tree
  • How PostgreSQL stores rows in an unordered heap with every index pointing into it
  • The per-lookup cost difference: one B-tree walk vs two B-tree walks vs B-tree + heap visit
  • Why InnoDB secondary indexes store the primary key value (not a row pointer)
  • Why PostgreSQL has no clustered index and what CLUSTER actually does
  • How index-only scans avoid the heap visit when the index covers all queried columns
  • The architectural impact on primary key choice — why monotonic small keys matter

The Library Analogy — Two Ways to Find a Book

Picture a library with 10 million books. There are two completely different ways the library can be organized.

In the first library, the books themselves are arranged on the shelves in order by ISBN. The shelf IS the catalog. To find a book by ISBN, you walk the shelves in order until you arrive at the right spot — and the book is right there, in your hand, the moment you find its position. There is no "look up the location and then walk to it" step. The location and the book are the same thing. This is a clustered index: the index order and the physical row order are identical, because the index leaves ARE the rows.

Now suppose you want to find a book by author name. You go to a separate index card catalog, look up "Tolkien", and the card tells you the ISBN. Then you have to walk back to the shelves and look up that ISBN to actually fetch the book. Two trips: card catalog, then shelves. This is a secondary index in a clustered-storage system: it gives you the primary key, and you do a second lookup to find the row.

In the second library, books are dumped on shelves in arrival order — no organization at all. Every catalog (ISBN, author, title) is a separate card file, and every card has a slip number telling you which shelf and which position the book sits at. To find any book, you ALWAYS go to a card catalog first, then walk to the indicated shelf. There is no "free" lookup — every search costs a card-catalog visit plus a shelf visit. This is the heap + secondary index model: the table is a pile, and every index is non-clustered.

+---------------------------------------------------------------+
|           CLUSTERED vs HEAP STORAGE                            |
+---------------------------------------------------------------+
|                                                                |
|   InnoDB (clustered)             PostgreSQL (heap)             |
|   +-----------------+            +-----------------+           |
|   |  PK B-tree      |            |  PK B-tree      |           |
|   |  +-----------+  |            |  +-----------+  |           |
|   |  | leaf      |  |            |  | leaf      |  |           |
|   |  |  id=1 row |  |            |  |  id=1 ->  |--+----+      |
|   |  |  id=2 row |  |            |  |  id=2 ->  |--+--+ |      |
|   |  |  id=3 row |  |            |  |  id=3 ->  |--+ | |      |
|   |  +-----------+  |            |  +-----------+  | | |      |
|   +-----------------+            +-----------------+ | |      |
|                                                       | |      |
|   Index leaves CONTAIN              +-----------------v-v---+  |
|   the full rows. No                 |  HEAP (unordered)     |  |
|   heap. No second                   |  page 7: row id=2     |  |
|   lookup. The PK index              |  page 7: row id=1     |  |
|   IS the table.                     |  page 9: row id=3     |  |
|                                     +-----------------------+  |
|                                                                |
+---------------------------------------------------------------+

The library analogy is exact, and the consequences cascade through everything: how you design primary keys, why InnoDB and Postgres have different performance characteristics for the same workload, why "covering indexes" matter, and why some queries that fly on one engine crawl on the other.


Primary Index — The Index That Decides Where Rows Live

A primary index is the index that determines the physical storage order of the rows. It is also called a clustered index because rows are clustered together on disk in the order of this index. There is exactly one primary index per table — you cannot physically order the same rows two ways at the same time.

In InnoDB (the default MySQL storage engine), every table is organized as a clustered index on the primary key. The leaves of the primary key B-tree contain the actual row data. There is no separate "table" file — the B-tree IS the table. If you do not declare a PRIMARY KEY, InnoDB uses the first non-null UNIQUE index, and if there is none, it invents a hidden 6-byte rowid.

In PostgreSQL, there is no clustered index at all. The table is stored as an unordered heap — a stack of 8 KB pages with rows packed in arrival order. Even the column you declared as PRIMARY KEY is just a regular B-tree index that points into the heap with a (page, offset) tuple identifier called a CTID. Every index in Postgres is technically a secondary index — there is no privileged "primary" storage layout.

+---------------------------------------------------------------+
|           PRIMARY INDEX BY ENGINE                              |
+---------------------------------------------------------------+
|                                                                |
|   ENGINE       | Primary Index Model                           |
|   -------------+---------------------------------              |
|   InnoDB       | Clustered: PK B-tree leaves = rows            |
|   SQL Server   | Clustered: one CLUSTERED INDEX per table      |
|   Oracle       | Heap by default; IOT for clustered            |
|   PostgreSQL   | Heap only. CLUSTER is one-shot, not          |
|                | maintained. Every index is secondary.         |
|   SQLite       | Clustered on rowid (or PK if INTEGER PK)      |
|                                                                |
+---------------------------------------------------------------+

This single architectural choice — clustered vs heap — drives most of the performance differences between InnoDB and Postgres. It is also the reason "what should my primary key be" is a much weightier question on InnoDB than on Postgres.


Secondary Index — The Pointer From Somewhere Else

A secondary index is any index that is not the primary index. It sorts rows by some other column (or set of columns), and at the leaves it stores a reference back to the actual row. The form of that reference depends on the engine.

In InnoDB, a secondary index leaf stores (secondary_key, primary_key). Notice what is missing: there is no direct row pointer. To fetch the actual row, the engine takes the primary key value from the secondary index leaf and walks the primary key B-tree a second time to find the row. This is called a bookmark lookup and it costs roughly the same as the secondary index walk itself — every secondary index query in InnoDB is two B-tree traversals when you need columns outside the index.

In PostgreSQL, a secondary index leaf stores (key, ctid) where the CTID is the physical address (page_number, offset_within_page) of the row in the heap. To fetch the row, the engine reads the leaf, takes the CTID, and does one direct page read into the heap. There is no second B-tree walk — but there IS a random heap access, which is often slower than a sequential page read.

+---------------------------------------------------------------+
|           SECONDARY INDEX LOOKUP COST                          |
+---------------------------------------------------------------+
|                                                                |
|   InnoDB:                                                      |
|     1. Walk secondary index B-tree   -> O(log N)               |
|     2. Read PK value from leaf                                 |
|     3. Walk PRIMARY index B-tree     -> O(log N)               |
|     4. Read row from PK leaf                                   |
|     Total: 2 B-tree walks per row                              |
|                                                                |
|   PostgreSQL:                                                  |
|     1. Walk secondary index B-tree   -> O(log N)               |
|     2. Read CTID from leaf                                     |
|     3. Random read of heap page      -> 1 page I/O             |
|     4. Read row from heap page                                 |
|     Total: 1 B-tree walk + 1 heap visit per row                |
|                                                                |
|   COVERING / INDEX-ONLY SCAN:                                  |
|     If all SELECTed columns live in the index, both engines    |
|     skip step 3-4 entirely. This is why "covering indexes"     |
|     matter so much for hot read paths.                         |
|                                                                |
+---------------------------------------------------------------+

The two engines have different performance shapes as a result. InnoDB's secondary index lookups get more expensive as the table grows (because the second B-tree walk gets deeper), but the heap is never random because there is no heap. Postgres's secondary index lookups stay flat in B-tree depth, but every fetch is a random page read into the heap, which can be brutal under cache pressure.


Example 1 — Postgres Heap + Secondary Index in Action

Let's see exactly what happens when you query a Postgres table by an indexed column.

-- Create a small table and indexes
CREATE TABLE customers (
  id          BIGSERIAL PRIMARY KEY,
  email       TEXT NOT NULL,
  full_name   TEXT NOT NULL,
  signup_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- The PRIMARY KEY automatically created an index named customers_pkey.
-- Despite the name, this is NOT a clustered index. The table is still
-- a heap; customers_pkey is just a B-tree pointing into it via CTIDs.
CREATE UNIQUE INDEX idx_customers_email ON customers (email);

INSERT INTO customers (email, full_name) VALUES
  ('alice@example.com', 'Alice Anderson'),
  ('bob@example.com',   'Bob Brown'),
  ('carol@example.com', 'Carol Chen'),
  ('dave@example.com',  'Dave Davis'),
  ('eve@example.com',   'Eve Evans');

-- Inspect the physical row addresses (CTIDs).
-- Each CTID is (block_number, offset_within_block).
SELECT ctid, id, email FROM customers ORDER BY id;
--   ctid  | id |       email
-- --------+----+--------------------
--  (0,1)  |  1 | alice@example.com
--  (0,2)  |  2 | bob@example.com
--  (0,3)  |  3 | carol@example.com
--  (0,4)  |  4 | dave@example.com
--  (0,5)  |  5 | eve@example.com
-- All five rows live in heap page 0, slots 1-5. Inserted in id order
-- because the heap appends in arrival order. There is NO physical
-- sort by email even though we have an email index.

-- Query by email -> Index Scan + Heap Fetch
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, full_name FROM customers WHERE email = 'carol@example.com';
--                          QUERY PLAN
-- ----------------------------------------------------------------
--  Index Scan using idx_customers_email on customers
--    Index Cond: (email = 'carol@example.com'::text)
--    Buffers: shared hit=3
--  Planning Time: 0.080 ms
--  Execution Time: 0.030 ms

-- "Buffers: shared hit=3" tells the story:
--   2 buffers for walking the email index B-tree
--   1 extra buffer for the heap visit to fetch full_name
-- If full_name were already in the index leaf, the heap visit
-- would be skipped (index-only scan).

-- Index-only scan: every selected column lives in the index
CREATE INDEX idx_customers_email_name ON customers (email) INCLUDE (full_name);

EXPLAIN (ANALYZE, BUFFERS)
SELECT full_name FROM customers WHERE email = 'carol@example.com';
--                              QUERY PLAN
-- ----------------------------------------------------------------
--  Index Only Scan using idx_customers_email_name on customers
--    Index Cond: (email = 'carol@example.com'::text)
--    Heap Fetches: 0
--    Buffers: shared hit=2
-- "Heap Fetches: 0" -- the row never had to be visited in the heap.
-- This is roughly 30-50% faster on cache-miss workloads.

The key insight: in Postgres, every index is a secondary index, and every non-covering query pays for one heap visit per matching row. The fix is either an INCLUDE covering index or accepting the cost.


Example 2 — InnoDB Clustered Storage and the Secondary Index Detour

Now the same exercise on MySQL InnoDB, where the primary key IS the table.

-- MySQL / InnoDB
CREATE TABLE customers (
  id          BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  email       VARCHAR(255) NOT NULL,
  full_name   VARCHAR(255) NOT NULL,
  signup_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  UNIQUE KEY idx_customers_email (email)
) ENGINE=InnoDB;

INSERT INTO customers (email, full_name) VALUES
  ('alice@example.com', 'Alice Anderson'),
  ('bob@example.com',   'Bob Brown'),
  ('carol@example.com', 'Carol Chen');

-- The PRIMARY KEY is the clustered index. The table file (.ibd)
-- contains a B-tree whose leaves hold the FULL row data, ordered by id.
-- There is no separate "data" storage; the PK B-tree IS the data.

-- The secondary index on email stores (email, id) at the leaves -- NOT
-- a row pointer. To get full_name, the engine must walk the PK B-tree
-- using the id from the email index leaf.

EXPLAIN FORMAT=TREE
SELECT full_name FROM customers WHERE email = 'carol@example.com';
-- -> Index lookup on customers using idx_customers_email
--    (email='carol@example.com')
--    -> Single-row index lookup on customers using PRIMARY (id=...)
--
-- Two B-tree walks per row: first the email index, then the PK index.
-- This is called a "bookmark lookup" or "back-to-clustered-index lookup".

-- Covering index avoids the second walk
ALTER TABLE customers ADD INDEX idx_email_name (email, full_name);

EXPLAIN FORMAT=TREE
SELECT full_name FROM customers WHERE email = 'carol@example.com';
-- -> Covering index lookup on customers using idx_email_name
--    (email='carol@example.com')
-- ONE B-tree walk. full_name is read directly from the secondary
-- index leaf -- no return trip to the clustered PK index.

The architectural takeaway: in InnoDB, the primary key is special. Every secondary index implicitly stores it, every secondary index lookup implicitly uses it, and the choice of primary key shapes the performance of every query in the database.


Why Primary Key Choice Matters So Much (Especially on InnoDB)

Because InnoDB stores rows in primary key order inside the clustered index, the PK choice has architectural consequences that reach into every layer of performance.

Monotonic vs random insertion. When the primary key is monotonically increasing (e.g. BIGINT AUTO_INCREMENT, ULID, Snowflake ID), every new row goes at the right edge of the clustered B-tree. Pages fill sequentially, splits are rare, and the working set for inserts stays tiny. When the primary key is random (UUIDv4), every insert lands at a random position in the clustered B-tree, forcing page splits across the entire data file. Insert throughput on a 100 GB InnoDB table can drop by 5-10x just by switching from BIGINT AUTO_INCREMENT to UUIDv4. UUIDv7 (time-ordered UUIDs, 2024 RFC 9562) restores the monotonic property and is the right answer when you need globally unique IDs without giving up clustered-index performance.

Secondary index size. Every InnoDB secondary index leaf contains the primary key value. If your PK is a 16-byte UUID, every row in every secondary index pays an extra 16 bytes — and a table with 10 secondary indexes and a billion rows pays an extra 150 GB. With an 8-byte BIGINT, that bill is 75 GB. The PK is replicated into every index whether you like it or not, so smaller PKs mean smaller indexes, more cache hits, and faster everything.

Lookup width. A wide PK (composite of 4 columns, or a 64-character string) makes the clustered index taller, the secondary indexes fatter, and every join slower. A narrow integer PK fits more entries per page, reduces tree depth, and keeps the bufferpool hot.

+---------------------------------------------------------------+
|           PRIMARY KEY DESIGN RULES (InnoDB)                    |
+---------------------------------------------------------------+
|                                                                |
|   GOOD                          BAD                            |
|   --------------------          ---------------------          |
|   BIGINT AUTO_INCREMENT         UUIDv4 (random)                |
|   ULID / Snowflake / UUIDv7     Composite of 4 columns         |
|   8 bytes, monotonic, unique    32-char string PK              |
|   Set once, never changed       Mutable PK (rare, dangerous)   |
|                                                                |
|   THE FOUR RULES:                                              |
|   1. Small  -> fits in cache, narrow secondary indexes         |
|   2. Monotonic -> append-only writes, no page splits           |
|   3. Immutable -> changing the PK rewrites every index entry   |
|   4. Unique   -> InnoDB requires it (will invent a rowid)      |
|                                                                |
+---------------------------------------------------------------+

On Postgres the rules are softer because the heap is unordered and indexes use CTIDs, but they still apply: smaller PKs make all secondary indexes smaller, monotonic PKs help the BTree stay balanced, and immutable PKs avoid the cost of HOT-update misses. The rules are universal — InnoDB just punishes violations harder.


What CLUSTER Does in PostgreSQL (and Why It Is Not What You Think)

PostgreSQL has a CLUSTER command that sounds like it creates a clustered index. It does not.

CLUSTER table USING index_name is a one-shot physical reorganization. It rewrites the heap so that rows are physically stored in the order of the chosen index, at that moment in time. After the command finishes, new inserts and updates go wherever there is free space — exactly as before — and the table gradually drifts back out of cluster order. There is no maintained clustered index; CLUSTER is a maintenance operation, not a structural change.

-- Create an index and physically reorder the heap to match it (one-shot)
CREATE INDEX idx_customers_signup ON customers (signup_at);
CLUSTER customers USING idx_customers_signup;

-- After CLUSTER, range scans by signup_at are sequential (fast).
-- New INSERTs do NOT honor the cluster order -- they go in heap append order.
-- Over time the cluster degrades and you need CLUSTER again.

-- Check cluster status
SELECT relname, reltablespace, relhasindex
FROM pg_class WHERE relname = 'customers';

This means Postgres has no clustered index in the InnoDB sense. The closest you get is "I ran CLUSTER recently and the heap happens to be in index order today." For a permanently sorted layout you would need a partitioned table, a materialized view, or a different storage layer (e.g. pg_partman, BRIN indexes on naturally sorted data like time series).

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Three-panel comparison. PANEL 1 labeled 'InnoDB Clustered': a B-tree where the leaves are filled with rounded rectangle 'row' shapes containing all column values, sky blue (#4fc3f7) glow on the leaf rectangles. PANEL 2 labeled 'Postgres Heap': a separate scattered grid of unordered row-page rectangles in rose (#ff5c8a), with two B-tree icons on top labeled 'PK index' and 'email index' both shooting arrows down at random heap pages. PANEL 3 labeled 'InnoDB Secondary': a small B-tree labeled 'idx_email' whose leaf contains '(email, id)', with a sky blue arrow looping back up to the clustered PK B-tree to fetch the row. White monospace labels. Subtitle below all three: 'Where rows live shapes everything.'"


Example 3 — Measuring the Heap Visit Cost in Postgres

Let's quantify what a heap fetch actually costs versus an index-only scan.

-- Generate 1M rows so the difference is measurable
CREATE TABLE events (
  id        BIGSERIAL PRIMARY KEY,
  user_id   BIGINT NOT NULL,
  action    TEXT NOT NULL,
  amount    NUMERIC(10,2) NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

INSERT INTO events (user_id, action, amount)
SELECT (random() * 100000)::bigint,
       (ARRAY['login','purchase','logout','view'])[1 + (random()*3)::int],
       (random() * 1000)::numeric(10,2)
FROM generate_series(1, 1000000);

CREATE INDEX idx_events_user ON events (user_id);
ANALYZE events;

-- Query 1: Index Scan + Heap Fetch (we need the amount column)
EXPLAIN (ANALYZE, BUFFERS)
SELECT amount FROM events WHERE user_id = 42;
--  Index Scan using idx_events_user on events
--    Index Cond: (user_id = 42)
--    Buffers: shared hit=2 read=11
--  Execution Time: 1.842 ms
-- 11 heap pages had to be read to fetch the amount column for ~10 rows.

-- Query 2: Add a covering index that includes amount
CREATE INDEX idx_events_user_amount ON events (user_id) INCLUDE (amount);

EXPLAIN (ANALYZE, BUFFERS)
SELECT amount FROM events WHERE user_id = 42;
--  Index Only Scan using idx_events_user_amount on events
--    Index Cond: (user_id = 42)
--    Heap Fetches: 0
--    Buffers: shared hit=4
--  Execution Time: 0.087 ms
-- 21x faster. Zero heap pages read. Heap Fetches: 0 means the
-- visibility map confirmed all rows were visible without checking the heap.

The covering index is roughly an order of magnitude faster on a cold cache, because it converts random heap I/O into sequential index page reads. This is the single most reliable optimization for read-heavy workloads on PostgreSQL.


Common Mistakes

1. Using UUIDv4 as a primary key on InnoDB without measuring write throughput. UUIDv4 is random, which means every insert lands at a random spot in the clustered B-tree, splitting pages across the entire data file. On large tables (50 GB+), insert throughput can drop by 5-10x compared to BIGINT AUTO_INCREMENT, and the working set for writes balloons because dirty pages spread across the whole bufferpool. If you need globally unique IDs, use UUIDv7 (time-ordered, monotonic) or a Snowflake/ULID variant. If you do not strictly need them, BIGINT AUTO_INCREMENT is almost always the right answer on InnoDB.

2. Assuming Postgres has a clustered index because you declared a PRIMARY KEY. The Postgres PRIMARY KEY constraint creates a regular B-tree index (<table>_pkey). The table itself is still a heap. Your rows are stored in arrival order regardless of what the PK says, every PK lookup pays one heap visit, and even running CLUSTER is only a one-shot reordering that drifts back out of order on the next write. If you need physical clustering, partitioning or a different storage approach is the right tool — not relying on CLUSTER.

3. Forgetting that InnoDB secondary indexes implicitly contain the primary key. Every secondary index leaf in InnoDB stores (secondary_key, primary_key). If your PK is a 36-character UUID string, every row of every secondary index carries that 36-byte tag, multiplying index sizes dramatically. A table with 10 secondary indexes and 1 billion rows pays 36 GB in extra storage just for the PK replication — versus 8 GB for a BIGINT. This is why "small primary key" is not a stylistic preference but a measurable performance lever.

4. Querying lots of columns from a secondary index without a covering index. SELECT * FROM users WHERE email = ? on Postgres triggers a B-tree walk plus a heap fetch. On a hot table with cold cache, the heap fetch is a random page read that can dominate query time. The fix is either to select fewer columns (so the planner can use an index-only scan) or to add an INCLUDE clause that puts the additional columns inside the index leaves. On InnoDB, the same query triggers a secondary index walk plus a clustered index walk (the bookmark lookup) — the fix is a covering composite index.

5. Treating CLUSTER in Postgres as a permanent setting. After CLUSTER customers USING idx_signup, the heap is in signup_at order — but only at that moment. Every subsequent INSERT and UPDATE adds rows in arrival order, gradually decaying the cluster. There is no background job that maintains the cluster, no setting that enforces it, and no warning when it has drifted. If you depend on physical ordering, rerun CLUSTER periodically as a maintenance step, or restructure the schema (partitioning, BRIN-friendly append-only tables) to avoid the dependency.


Interview Questions

1. "Explain the difference between a primary (clustered) index and a secondary (non-clustered) index, and how InnoDB and PostgreSQL differ in this regard."

A primary or clustered index physically orders the rows on disk by the index key — the leaves of the index B-tree contain the actual row data, so the index IS the table. A secondary or non-clustered index is sorted by some other key and stores at its leaves a reference back to the row, not the row itself. InnoDB uses a clustered index model: every table is stored as a B-tree on the primary key, with full rows in the leaves. Secondary indexes in InnoDB store (secondary_key, primary_key) tuples and require a second walk through the clustered PK B-tree to fetch the row — a "bookmark lookup". PostgreSQL has no clustered index at all: the table is an unordered heap of 8 KB pages, and every index (including the one created by the PRIMARY KEY constraint) is technically a secondary index that stores (key, ctid) where the CTID is the physical heap address. So in InnoDB the PK is structurally privileged and shapes everything; in Postgres the PK is just another B-tree pointing into the heap, and you can never have true clustered storage.

2. "Why does primary key choice matter so much more on InnoDB than on PostgreSQL?"

Three reasons. First, InnoDB stores rows physically in primary key order inside the clustered index, so a random PK like UUIDv4 forces every insert into a random spot in the B-tree, causing page splits across the entire data file and dropping insert throughput by 5-10x on large tables. A monotonic PK (BIGINT AUTO_INCREMENT, UUIDv7, ULID, Snowflake) appends to the right edge of the tree and keeps the write working set tiny. Second, every InnoDB secondary index leaf implicitly contains the primary key value, so a wide PK multiplies the storage cost of every secondary index — a 36-byte UUID PK on a table with 10 indexes and 1B rows costs 36 GB extra just for the PK replication, versus 8 GB for a BIGINT. Third, every secondary index lookup that needs columns outside the index has to walk the clustered PK B-tree a second time, so a deeper or wider PK index slows down every query in the database. PostgreSQL has none of these effects because the heap is unordered and indexes use CTIDs, so PK choice is mostly about indexing semantics, not about storage layout — the rules still apply (small, immutable, unique) but the penalties for violations are much smaller.

3. "Walk me through what happens when PostgreSQL executes SELECT name FROM users WHERE email = 'foo@bar.com' with an index on email."

The planner sees the email = constant predicate and chooses an index scan on the email B-tree. Step 1: walk the email B-tree from the root to the leaf containing foo@bar.com — typically 3-4 page reads on a tree with a few million entries. Step 2: read the leaf entry and extract the CTID, which is a (page_number, offset_within_page) tuple identifying where the row lives in the heap. Step 3: random-read the heap page at page_number into the buffer cache (one page I/O if not already cached). Step 4: read the slot at offset_within_page to find the actual row, then project the name column out of it. Total cost: one B-tree walk plus one heap visit. In EXPLAIN (ANALYZE, BUFFERS) you see Index Scan using idx_email with a Buffers: shared hit/read count slightly higher than the index depth alone — the extra buffer is the heap fetch. If name had been included in the index via CREATE INDEX ... ON users (email) INCLUDE (name), the planner could use an Index Only Scan, skip the heap visit, and report Heap Fetches: 0 — typically 5-30x faster on cold cache.

4. "What does CLUSTER do in PostgreSQL? Is it a substitute for InnoDB's clustered index?"

CLUSTER table USING index_name is a one-shot physical reorganization that rewrites the heap so rows are stored in the order specified by the chosen index. It is essentially a VACUUM FULL that sorts as it rewrites. After it finishes, range scans on that index are roughly sequential and very fast. But it is NOT a maintained clustered index in the InnoDB sense: new INSERTs go to wherever there is free space in the heap (arrival order), UPDATEs that don't fit in the current page move elsewhere, and there is no background process keeping the table in cluster order. Within hours or days of normal write traffic, the table drifts back out of cluster order. Postgres only "remembers" which index you last clustered on (via pg_class.relhasindex plus an indisclustered marker on pg_index) so you can rerun CLUSTER table without specifying the index name. So no — it is not a substitute for a real clustered index. If you need permanent physical clustering in Postgres, the right tools are partitioning (range/list partitions), BRIN indexes on naturally sorted columns like timestamps, or accepting that the heap is unordered and using covering indexes to avoid the heap visit.

5. "What is a covering index, and how does it interact with the primary/secondary index distinction in InnoDB and Postgres?"

A covering index is an index that contains every column needed to answer a particular query, so the database can serve the query entirely from the index without visiting the underlying row. In InnoDB, a covering secondary index avoids the second B-tree walk back into the clustered primary key index — the engine reads the row data directly from the secondary index leaf and never bookmarks back to the PK. This converts a two-walk query into a one-walk query, often halving latency. In PostgreSQL, a covering index avoids the heap visit — what the planner calls an "Index Only Scan" — so a query that would otherwise do a B-tree walk plus a random heap page read does only the B-tree walk, often dropping cold-cache latency by an order of magnitude. Postgres exposes this explicitly via CREATE INDEX ... INCLUDE (col1, col2), which adds the columns to the index leaves but not to the index key itself, so they don't affect ordering or uniqueness. InnoDB has no INCLUDE syntax; you make a covering index by adding the columns to the index key itself, e.g. INDEX idx_email_name (email, full_name). The architectural lesson: on both engines, the cost of "fetch row from somewhere else" is real and measurable, and covering indexes are the most reliable single-query optimization for read-heavy workloads.


Quick Reference — Primary and Secondary Index Cheat Sheet

+---------------------------------------------------------------+
|           PRIMARY vs SECONDARY INDEX CHEAT SHEET               |
+---------------------------------------------------------------+
|                                                                |
|  PRIMARY (CLUSTERED) INDEX:                                    |
|    Determines physical row order on disk                       |
|    One per table -- you cannot cluster two ways at once        |
|    Index leaves CONTAIN the rows                               |
|    InnoDB: yes (PK).  Postgres: no (heap only).                |
|                                                                |
|  SECONDARY (NON-CLUSTERED) INDEX:                              |
|    Sorts rows by some other key                                |
|    Leaves contain a REFERENCE to the row                       |
|    InnoDB leaf: (secondary_key, primary_key)                   |
|    Postgres leaf: (key, ctid)  -- ctid = (page, offset)        |
|                                                                |
|  LOOKUP COST:                                                  |
|    InnoDB clustered query : 1 B-tree walk                      |
|    InnoDB secondary query : 2 B-tree walks (bookmark lookup)   |
|    Postgres any query     : 1 B-tree walk + 1 heap visit       |
|    Covering / Index-Only  : 1 B-tree walk only                 |
|                                                                |
|  PRIMARY KEY DESIGN (esp. InnoDB):                             |
|    Small      -> tiny secondary indexes                        |
|    Monotonic  -> no page splits, fast inserts                  |
|    Immutable  -> changing PK rewrites every index              |
|    Unique     -> required by InnoDB (will invent rowid)        |
|    Prefer: BIGINT AUTO_INCREMENT, ULID, UUIDv7                 |
|    Avoid:  UUIDv4, wide composites, mutable strings            |
|                                                                |
|  POSTGRES CLUSTER:                                             |
|    One-shot reorder, NOT maintained                            |
|    Drifts back to arrival order on new writes                  |
|    Not a substitute for InnoDB clustered index                 |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. InnoDB: the PRIMARY KEY is the table -- choose carefully   |
|  2. Postgres: every index is secondary; the heap is unordered  |
|  3. InnoDB secondary leaves store the PK, not a row pointer    |
|  4. Postgres secondary leaves store a CTID into the heap       |
|  5. Non-covering query = extra lookup (B-tree or heap visit)   |
|  6. Covering index avoids the extra fetch on both engines      |
|  7. Postgres INCLUDE clause = covering without changing keys   |
|  8. Random PK (UUIDv4) destroys InnoDB write throughput        |
|  9. Use UUIDv7/ULID/Snowflake when you need globally unique    |
| 10. CLUSTER in Postgres is a maintenance op, not a structure   |
|                                                                |
+---------------------------------------------------------------+
ConcernInnoDBPostgreSQL
Clustered indexYes (the PK)No (heap only)
Secondary leaf referencePrimary key valueCTID (page, offset)
Cost of secondary lookup2 B-tree walks1 B-tree walk + 1 heap visit
Covering syntaxAdd cols to keyINCLUDE (cols)
Random UUID PK impactSevere (5-10x slower writes)Mild (slightly larger indexes)
Best PK typeBIGINT AUTO_INCREMENTBIGSERIAL or BIGINT identity
If globally unique neededUUIDv7 / ULID / SnowflakeUUIDv7 / ULID (less critical)
Physical clustering toolBuilt-in (always)CLUSTER (one-shot) or partitioning
Index-only scan supportYes (covering composite)Yes (with visibility map)

Prev: Lesson 9.1 -- What Is an Index? Next: Lesson 9.3 -- B-Tree Index


This is Lesson 9.2 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

On this page