Database Interview Prep
Keys

Primary Key

Row Identity, Uniqueness, and the UUID vs Integer Debate

LinkedIn Hook

"Your users table has no primary key. You just do not know it yet."

Most junior developers think a primary key is just "that id column you add because the ORM told you to." Then they hit a real bug: a duplicate row silently breaks a financial report, a foreign key cannot point to anything stable, a replication stream stalls because the same row cannot be identified across nodes. Suddenly the id column is not a formality — it is the single most important design decision in the entire schema.

A primary key is three promises rolled into one column (or set of columns): every row is unique, no row is NULL, and this identity never changes. Break any of those promises and every foreign key, every index, every replication log, and every join downstream inherits the bug.

The interview question is never "what is a primary key." It is "bigserial or UUID, and why?" The right answer is not a preference — it is a tradeoff between index size, insert locality, security, and whether your writes happen on one machine or fifty.

In Lesson 2.1, I break down primary keys in PostgreSQL: uniqueness and NOT NULL guarantees, composite keys, GENERATED ALWAYS AS IDENTITY vs SERIAL, UUID vs bigserial with real tradeoffs, and how to pick a primary key you will not regret in production.

Read the full lesson -> [link]

#Database #PostgreSQL #SQL #DataModeling #BackendDevelopment #InterviewPrep


Primary Key thumbnail


What You'll Learn

  • What a primary key actually guarantees — uniqueness plus NOT NULL, enforced by the engine
  • Why every table needs one, and what breaks when you skip it
  • Single-column vs composite primary keys, and when composites are the right answer
  • SERIAL vs GENERATED ALWAYS AS IDENTITY in PostgreSQL — and why IDENTITY is the modern choice
  • UUID vs bigserial with real tradeoffs: index size, insert locality, leaks, and distributed writes
  • How to choose a primary key you will not regret in production
  • Why changing a primary key value is almost always a design smell

The Passport Number Analogy — One Row, One Identity, Forever

Think about a passport number. Every passport on the planet has one, no two passports share a number, and the number is printed once at issuance and never changes for the life of that passport. If your name changes, the passport gets reissued but the number tracks a new document — the old one is retired. The government never says "let us reuse the passport number of a deceased person for a newborn" because every database, every airline, every border checkpoint in the world has that number stamped on records that still need to point at exactly one person.

A primary key is the passport number of a database row. It uniquely identifies one row, it is never NULL (a row without an ID is a row that cannot be referenced), and it is stable for the entire lifetime of that row. Foreign keys from other tables point at it. Indexes are built on it. Replication streams identify rows by it. The moment you change a primary key value, every one of those pointers becomes a lie.

That is why the interview answer to "can you update a primary key?" is technically yes and practically never. You can issue UPDATE users SET id = 999 WHERE id = 1, PostgreSQL will let you, and every ON UPDATE CASCADE foreign key will propagate the change. But the fact that you need cascade propagation is the red flag — it means your primary key was carrying information that belonged in a separate column.

+---------------------------------------------------------------+
|           WHAT A PRIMARY KEY GUARANTEES                       |
+---------------------------------------------------------------+
|                                                                |
|   1. UNIQUENESS     -> no two rows share the same value        |
|   2. NOT NULL       -> every row has a value                   |
|   3. STABILITY      -> the value never changes (by convention) |
|   4. ONE PER TABLE  -> exactly one primary key per table       |
|                                                                |
|   Enforced by: a unique B-tree index created automatically     |
|   Used for:    row identity, FK targets, replication, joins    |
|                                                                |
+---------------------------------------------------------------+

Definition — What a Primary Key Actually Is

A primary key is a column (or combination of columns) that uniquely identifies each row in a table. The database engine enforces two hard constraints on a primary key:

  1. Unique — no two rows may share the same primary key value.
  2. NOT NULL — every row must have a value. NULL is explicitly disallowed, even in columns that would otherwise accept it.

A table can have at most one primary key, though it may have additional unique constraints on other columns. The primary key is special because it is the designated row identity — the column foreign keys point at by default, the column logical replication uses to identify rows, and the column the query planner assumes is the cheapest way to look up a single row.

When you declare a primary key, PostgreSQL automatically creates a unique B-tree index on the column(s) behind the scenes. This is not optional. The index is how the engine enforces uniqueness (it checks for a collision on every insert) and how it makes primary-key lookups fast (O(log n) instead of a full scan).


Why Primary Keys Exist — Four Jobs in One Column

A primary key earns its place by doing four distinct jobs at once. Understanding these four jobs is how you pick a good primary key instead of a cargo-culted one.

1. Row identity. Every row needs a stable handle so application code, ORMs, and admin tools can refer to "that row" without ambiguity. WHERE id = 42 means exactly one row, always.

2. Index backbone. The unique B-tree index created for the primary key is usually the table's most important index. In databases that use clustered (index-organized) storage — SQL Server by default, MySQL/InnoDB always — the primary key is the physical storage order of the table. PostgreSQL is the exception: its tables are heap-organized, so the primary key is just another index, though still the most frequently used one.

3. Foreign key target. Other tables declare REFERENCES users(id) to express "this row belongs to that user." The foreign key mechanism depends on the target column being unique and stable, which is exactly what a primary key guarantees. Without a primary key, foreign keys cannot point at anything.

4. Replication and CDC identity. Logical replication (PostgreSQL's pgoutput, Debezium, etc.) must identify each row to stream changes. By default it uses the primary key. A table without a primary key cannot be logically replicated unless you manually set a REPLICA IDENTITY FULL, which ships the entire old row on every update — slow and bandwidth-hungry.


Declaring a Primary Key — Column-Level vs Table-Level

PostgreSQL supports two equivalent syntaxes for declaring a primary key: inline on the column, and separately in the table constraints block. For single-column keys, either works. For composite keys, you must use the table-level form because a primary key spanning multiple columns cannot be declared inline.

-- users.sql
-- Column-level primary key declaration.
-- The PRIMARY KEY keyword right after the type does three things:
--   1. Marks the column as the table's primary key
--   2. Implicitly adds NOT NULL
--   3. Creates a unique B-tree index named users_pkey
CREATE TABLE users (
    id         bigint PRIMARY KEY,        -- inline form
    email      text NOT NULL UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Equivalent table-level form. Functionally identical, but required
-- when the primary key spans more than one column.
CREATE TABLE users_alt (
    id         bigint,
    email      text NOT NULL UNIQUE,
    created_at timestamptz NOT NULL DEFAULT now(),
    CONSTRAINT users_alt_pkey PRIMARY KEY (id)   -- table-level form
);

-- Insert and verify
INSERT INTO users (id, email) VALUES (1, 'ada@example.com');
INSERT INTO users (id, email) VALUES (2, 'linus@example.com');

-- Uniqueness is enforced by the engine
INSERT INTO users (id, email) VALUES (1, 'dup@example.com');
-- ERROR:  duplicate key value violates unique constraint "users_pkey"
-- DETAIL: Key (id)=(1) already exists.

-- NOT NULL is enforced automatically
INSERT INTO users (id, email) VALUES (NULL, 'null@example.com');
-- ERROR:  null value in column "id" of relation "users" violates not-null constraint

Composite Primary Keys — When One Column Is Not Enough

A composite primary key spans two or more columns. The uniqueness guarantee is on the combination of the columns: two rows may share any single column value as long as the full tuple differs. Composite keys are the natural fit for junction tables (many-to-many relationships) and for tables where a natural compound identity exists.

-- enrollments.sql
-- A student can enroll in many courses; a course has many students.
-- The primary key is the pair (student_id, course_id) — the same
-- student cannot enroll in the same course twice, but can enroll in
-- many courses, and many students can enroll in the same course.
CREATE TABLE enrollments (
    student_id bigint NOT NULL REFERENCES students(id),
    course_id  bigint NOT NULL REFERENCES courses(id),
    enrolled_at timestamptz NOT NULL DEFAULT now(),
    grade      text,
    PRIMARY KEY (student_id, course_id)   -- composite key, table-level only
);

-- Valid: same student, different courses
INSERT INTO enrollments (student_id, course_id) VALUES (1, 101);
INSERT INTO enrollments (student_id, course_id) VALUES (1, 102);  -- ok

-- Valid: different students, same course
INSERT INTO enrollments (student_id, course_id) VALUES (2, 101);  -- ok

-- Rejected: same pair
INSERT INTO enrollments (student_id, course_id) VALUES (1, 101);
-- ERROR:  duplicate key value violates unique constraint "enrollments_pkey"
-- DETAIL: Key (student_id, course_id)=(1, 101) already exists.

Composite keys are correct for junction tables but have one downside worth knowing: every foreign key pointing at this table must also carry both columns. If grades wants to reference a specific enrollment, it needs (student_id, course_id) as its foreign key — two columns instead of one. For junction tables this is usually fine because nothing typically references them. For richer many-to-many relationships where other tables do reference the relationship row, many teams add a single surrogate id bigserial PRIMARY KEY and demote the composite to a UNIQUE constraint.

+---------------------------------------------------------------+
|           SURROGATE vs COMPOSITE PRIMARY KEY                  |
+---------------------------------------------------------------+
|                                                                |
|  COMPOSITE PK (natural):                                       |
|    PRIMARY KEY (student_id, course_id)                         |
|    + Models the real constraint directly                      |
|    + No extra column                                          |
|    - FK references carry two columns everywhere                |
|    - Composite joins in every query touching this table       |
|                                                                |
|  SURROGATE PK (id) + UNIQUE (student_id, course_id):           |
|    id bigserial PRIMARY KEY,                                  |
|    UNIQUE (student_id, course_id)                              |
|    + FKs are one column                                        |
|    + Easier ORM mapping, cleaner joins                         |
|    - Extra 8 bytes per row                                     |
|    - Two indexes (pk + unique) instead of one                  |
|                                                                |
+---------------------------------------------------------------+

Auto-Generating Keys — SERIAL vs GENERATED AS IDENTITY

In PostgreSQL, you almost never want to supply primary key values by hand. Instead, you let the engine generate them. There are two ways to do this, and the older one is a trap for new projects.

SERIAL — The Legacy Shortcut

SERIAL is not a real type. It is shorthand that PostgreSQL expands into three things: a bigint column, a sequence object, and a DEFAULT nextval(...) clause. It has been around since the beginning and still works, but it has quirks — the sequence is technically separate from the column, permissions are confusing, and the SQL standard does not know about it.

-- serial_demo.sql
-- SERIAL expands to roughly:
--   id integer NOT NULL DEFAULT nextval('orders_id_seq'::regclass)
-- with a sequence orders_id_seq owned by the column.
CREATE TABLE orders (
    id     serial PRIMARY KEY,       -- 4-byte int, up to ~2.1 billion
    total  numeric(10,2)
);

-- Use bigserial for tables you expect to grow past 2 billion rows.
-- Going from serial to bigserial later is painful; start with bigserial.
CREATE TABLE events (
    id     bigserial PRIMARY KEY,    -- 8-byte int, up to ~9.2 quintillion
    payload jsonb
);

INSERT INTO orders (total) VALUES (19.99);  -- id auto-assigned to 1
INSERT INTO orders (total) VALUES (42.00);  -- id auto-assigned to 2

SELECT * FROM orders;
-- id | total
-- ---+-------
--  1 | 19.99
--  2 | 42.00

GENERATED ALWAYS AS IDENTITY — The Modern Choice

PostgreSQL 10 added GENERATED ... AS IDENTITY, the SQL-standard way to declare an auto-incrementing column. It is functionally similar to SERIAL but cleaner: the sequence is an integral part of the column, permissions are automatic, and you can choose between ALWAYS (refuse user-supplied values) and BY DEFAULT (allow overrides).

-- identity_demo.sql
-- GENERATED ALWAYS: the engine refuses any user-supplied id, even NULL.
-- This is the safest default — you cannot accidentally insert with a
-- hardcoded id that collides with the sequence later.
CREATE TABLE invoices (
    id     bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    amount numeric(10,2) NOT NULL
);

INSERT INTO invoices (amount) VALUES (100.00);     -- id = 1
INSERT INTO invoices (amount) VALUES (250.00);     -- id = 2

-- User-supplied id is rejected
INSERT INTO invoices (id, amount) VALUES (999, 300.00);
-- ERROR:  cannot insert a non-DEFAULT value into column "id"
-- DETAIL: Column "id" is an identity column defined as GENERATED ALWAYS.
-- HINT: Use OVERRIDING SYSTEM VALUE to override.

-- GENERATED BY DEFAULT: engine picks a value if you do not supply one,
-- but accepts your value if you do. Use this only for data migrations
-- where you need to preserve existing ids from a legacy system.
CREATE TABLE legacy_orders (
    id     bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    label  text
);

INSERT INTO legacy_orders (label) VALUES ('auto');         -- id = 1
INSERT INTO legacy_orders (id, label) VALUES (500, 'manual'); -- id = 500
INSERT INTO legacy_orders (label) VALUES ('auto again');   -- id = 2 (!)
-- The sequence is unchanged by your manual insert, so the next auto
-- value collides with 2, not 501. This is why ALWAYS is the safer default.

Rule of thumb for new PostgreSQL projects: use bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY. It is standards-compliant, safer than SERIAL, and gives you 8 bytes of headroom instead of 4.


UUID as a Primary Key — gen_random_uuid()

Sometimes you do not want sequential integers at all. Maybe you generate IDs on clients before they reach the database, maybe you shard writes across many databases that cannot coordinate a sequence, maybe you do not want primary key values to leak row counts to competitors. The alternative is a UUID — a 128-bit random identifier, written as a hex string like 550e8400-e29b-41d4-a716-446655440000.

PostgreSQL has a native uuid type that stores the value as 16 bytes (not as a 36-character string — the text form is only for display). As of PostgreSQL 13, gen_random_uuid() is built into core and generates a random UUIDv4 without needing the pgcrypto extension.

-- uuid_demo.sql
-- PostgreSQL 13+: gen_random_uuid() is built in.
-- Older versions: CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE TABLE sessions (
    id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    user_id    bigint NOT NULL REFERENCES users(id),
    created_at timestamptz NOT NULL DEFAULT now()
);

INSERT INTO sessions (user_id) VALUES (1);
INSERT INTO sessions (user_id) VALUES (1);

SELECT * FROM sessions;
-- id                                   | user_id | created_at
-- -------------------------------------+---------+----------------------
-- 550e8400-e29b-41d4-a716-446655440000 |       1 | 2026-04-13 10:00:00+00
-- 7b3e1234-9f0a-4c5d-b6e7-f01234567890 |       1 | 2026-04-13 10:00:01+00

The surface syntax is as clean as bigserial. The interesting question is what changes under the surface — and that is where the real tradeoffs live.


The Real Tradeoffs — bigserial vs uuid

The UUID vs integer debate gets treated as a matter of taste. It is not. It is a set of concrete, measurable tradeoffs that affect index size, insert throughput, security, and distributed-system correctness. Here is the honest comparison.

Concernbigserial / IDENTITYuuid (v4 random)
Storage per row8 bytes16 bytes
B-tree index sizeSmaller (~half)Larger (~2x)
Insert localitySequential — all inserts hit the rightmost leaf pageRandom — inserts scatter across the index
Page cache behaviorHot tail, great cache hit rateWrites touch many pages, worse cache hit rate
Index bloat under writesMinimalHigher due to random insertion points
GenerationRequires the database (sequence call)Can be generated on the client, offline, across nodes
Distributed writesCoordination needed across shardsCollision-safe without coordination
Information leakid=1000 leaks "we have ~1000 rows" to any authenticated userOpaque — leaks nothing
URL appearanceShort, guessable (/orders/1234)Long, unguessable (/orders/550e8400-...)
Debugging/logsHuman-friendlyHard to remember or read aloud
JOIN performanceSlightly faster (smaller keys, better cache)Slightly slower for huge tables

Insert Locality — The Hidden Performance Cost of Random UUIDs

+---------------------------------------------------------------+
|           SEQUENTIAL vs RANDOM INSERTS (B-tree)               |
+---------------------------------------------------------------+
|                                                                |
|  BIGSERIAL (sequential) inserts:                               |
|                                                                |
|    [root]                                                      |
|      |                                                         |
|      +-- [leaf A: 1-100]                                       |
|      +-- [leaf B: 101-200]                                     |
|      +-- [leaf C: 201-300]  <- ALL inserts land here           |
|                                                                |
|    Only one hot page; perfect cache behavior.                  |
|                                                                |
|  UUIDv4 (random) inserts:                                      |
|                                                                |
|    [root]                                                      |
|      |                                                         |
|      +-- [leaf A: 0x00..]  <- insert                           |
|      +-- [leaf B: 0x3f..]  <- insert                           |
|      +-- [leaf C: 0x7a..]  <- insert                           |
|      +-- [leaf D: 0xc1..]  <- insert                           |
|                                                                |
|    Every insert touches a different page; cache thrashes.     |
|                                                                |
+---------------------------------------------------------------+

This is why very high-write workloads often prefer bigserial even when a UUID would otherwise be cleaner. If insert locality matters but you still need UUIDs (e.g. for distributed generation), use UUIDv7 — a newer draft standard where the high bits are a timestamp, making UUIDs mostly monotonic and preserving insert locality. PostgreSQL does not generate UUIDv7 natively yet, but libraries in most languages do.

When to Pick Which

  • Default choice for most apps: bigint GENERATED ALWAYS AS IDENTITY. Small, fast, simple, and you can always add a separate public-facing UUID column later if you need one.
  • Pick UUID when: clients generate IDs offline, multiple databases write to the same logical table without coordination, you need to hide row counts from external users, or you merge tables from separate systems that each used their own 1, 2, 3 sequence.
  • Hybrid pattern: bigint IDENTITY as the internal primary key, plus a separate public_id uuid UNIQUE DEFAULT gen_random_uuid() column for external URLs. You get tight internal joins and opaque external identifiers.

Napkin AI Visual Prompt: "Dark gradient (#0a0f1f -> #111a2e). Split comparison: LEFT side labeled 'bigserial' shows a B-tree with a single hot sky-blue (#4fc3f7) leaf node at the rightmost edge, with an arrow showing sequential inserts stacking. RIGHT side labeled 'uuid v4' shows the same B-tree with rose (#ff5c8a) arrows scattering randomly across every leaf. White monospace labels. Bottom row of badges: '8 bytes / 16 bytes', 'sequential / random', 'leaks count / opaque', 'needs DB / client-generated'. Subtle grid overlay."


Common Mistakes

1. Using business data as a primary key. Email address, username, national ID, SKU — all look like natural keys until the user updates their email, a typo needs to be corrected, or a government reissues national IDs (it happens). When the primary key changes, every foreign key, every cached reference, and every replication consumer breaks. The fix is a synthetic surrogate key (bigint IDENTITY or uuid) for row identity, with the business column as a separate UNIQUE constraint. The surrogate is stable forever; the business column can change without touching foreign keys.

2. Updating primary key values to "fix" data. Once a row has a primary key, that value is a promise to every table that references it. UPDATE orders SET id = 999 WHERE id = 1 may succeed (especially with ON UPDATE CASCADE), but it tells you your primary key was doing a job it should not have been doing — carrying meaning. The correct fix is almost always to insert a new row and delete or deprecate the old one, or to move the changing value to a non-key column. If you find yourself reaching for ON UPDATE CASCADE on an integer surrogate key, stop and redesign.

3. Skipping the primary key entirely. Tables without a primary key are legal in PostgreSQL but painful in practice: logical replication refuses to include them by default, ORMs cannot generate update statements cleanly, and duplicate rows can sneak in undetected. "This is just a log table, it does not need a key" is the famous last words — the moment you need to deduplicate, reference, or stream changes from it, you regret the decision. Every table gets a primary key, even append-only logs (id bigserial PRIMARY KEY is free insurance).

4. Mixing UUIDs and integers inconsistently across tables. Using bigserial for users.id and uuid for orders.id works, but every join across the two requires the application layer to track which table uses which ID style. Worse, if orders.user_id is a bigint pointing at users.id but order_items.order_id is a uuid pointing at orders.id, a query joining user -> orders -> order_items has two completely different key styles. Pick one convention per service and stick with it. The hybrid "internal bigint + external uuid" pattern is fine, but apply it uniformly.

5. Using SERIAL (4-byte int) on tables that will grow. serial caps out at about 2.1 billion. This sounds like a lot until you write one row per user action per second on a busy service and hit the ceiling in four years. Migrating a serial column to bigserial on a billion-row table is a multi-hour operation that usually requires downtime. For anything user-facing or event-driven, start with bigserial or bigint IDENTITY on day one — the extra 4 bytes per row is rounding error compared to the pain of migrating later.


Interview Questions

1. "What is the difference between SERIAL and GENERATED ALWAYS AS IDENTITY in PostgreSQL, and which should you use for new projects?"

SERIAL is a PostgreSQL-specific shorthand that predates the SQL standard's identity columns. When you write id serial PRIMARY KEY, PostgreSQL expands this into three separate things: an integer column, a standalone sequence object, and a DEFAULT nextval(...) clause on the column. The sequence is technically decoupled from the column — permissions are granted separately, the sequence can be dropped independently, and tools that do not understand the shorthand see three objects instead of one. GENERATED ALWAYS AS IDENTITY was added in PostgreSQL 10 and follows the SQL standard. It ties the auto-generation directly to the column, handles permissions automatically, and gives you an ALWAYS mode that refuses user-supplied values outright — preventing the classic bug where someone inserts a hardcoded id = 500 and later auto-generated rows collide with it. For new projects in PostgreSQL 10 or newer, always prefer bigint GENERATED ALWAYS AS IDENTITY. SERIAL still works and you will see it in older code, but there is no reason to write it fresh in 2025 or later.

2. "UUID vs bigserial as a primary key — walk me through the real tradeoffs."

There are four dimensions that matter. First, storage and index size: bigserial is 8 bytes per value, uuid is 16, and the primary key index is usually your biggest index — a 2x size difference affects the total working set and the page cache hit rate, especially on tables over a few hundred million rows. Second, insert locality: bigserial inserts always land at the rightmost leaf of the B-tree, giving you a single hot page with near-perfect cache behavior, while random UUIDv4 inserts scatter across every leaf page, causing cache thrashing and higher write amplification under heavy write load. This is why UUIDv7, which puts a timestamp in the high bits, exists — it recovers sequential locality while keeping the other UUID properties. Third, distributed generation: bigserial requires a database call (or a pre-claimed block) to generate each ID, which is a coordination point across sharded or multi-master setups; UUIDs can be generated on the client, offline, in any service, with astronomical collision odds. Fourth, information leakage: id = 1000 tells any authenticated user that you have around 1000 of something, which is a real competitive-intelligence concern for some businesses, while UUIDs leak nothing. The right choice depends on the workload. For a single-writer OLTP app with no information-leak concerns, bigint IDENTITY is simpler, smaller, and faster. For distributed systems, client-generated IDs, or public URLs, UUIDs earn their extra bytes. A common hybrid pattern is bigint IDENTITY as the internal primary key plus a separate uuid column for external-facing URLs — you get tight internal joins and opaque external identifiers.

3. "Can you update a primary key value in PostgreSQL, and should you?"

Technically yes, practically almost never. PostgreSQL will let you issue UPDATE users SET id = 999 WHERE id = 1, and if you have declared foreign keys with ON UPDATE CASCADE, the change will propagate to every referencing row in every referencing table. The question is why you want to. If the primary key is a synthetic surrogate like bigint IDENTITY or uuid, there is no meaningful reason to change it — by definition, it carries no business meaning, so changing it is just churn that disrupts foreign keys, replication streams, and any external system that cached the ID. If the primary key is a natural key like an email address or an SKU that can legitimately change, the fact that you need to update it is itself the bug — the primary key was carrying business meaning that should have been stored in a separate column with a unique constraint, backed by a surrogate primary key that is guaranteed never to change. The interview-safe answer is: "Yes, you can, but needing to is a design smell. The correct response is to redesign the key so the stable identity and the changeable business value are separate columns."

4. "Why does every table need a primary key? What breaks if you skip it?"

Four things break, in order of severity. First, logical replication: PostgreSQL's logical replication (and tools like Debezium built on top of it) identifies rows by their primary key. Without one, the table is excluded from replication unless you explicitly set REPLICA IDENTITY FULL, which ships the entire old row on every update — slow, bandwidth-hungry, and often unworkable for wide tables. Second, foreign key targets: other tables cannot reference this table, because REFERENCES table(column) requires the target to be unique. If you later decide you need a relationship, you cannot add it without first adding a primary key and backfilling consistent values. Third, ORM update and delete statements: most ORMs generate UPDATE ... WHERE id = ? and DELETE ... WHERE id = ? automatically using the primary key; without one, they either refuse to work, require manual SQL for every mutation, or generate dangerously broad WHERE clauses. Fourth, silent duplicates: without a unique constraint on some identity column, a retry of a failed insert can produce two copies of the same logical row, and nothing in the database will tell you. Even tables that look like they do not need a key — append-only event logs, audit trails — benefit from id bigserial PRIMARY KEY as free insurance against all four of these problems.

5. "When would you use a composite primary key instead of a single surrogate key?"

The classic case is a junction table for a many-to-many relationship — a table whose only purpose is to express "these two entities are linked." An enrollments(student_id, course_id) table, a tags(post_id, tag_id) table, a memberships(user_id, group_id) table. In each of these, the natural primary key is the pair itself, because the business rule is "a student cannot enroll in the same course twice" and the primary key enforces that rule directly with no extra constraint needed. A composite key is also correct when the table has a genuine compound identity that no other table will ever reference — for example, a time-series table partitioned by (device_id, bucket_timestamp) where each row is uniquely identified by the pair and nothing downstream points at individual rows. The tradeoff is that every foreign key pointing at a composite-keyed table must also carry both columns, so junction tables that are referenced by other tables (for example, an enrollments row that has grades and assignments referencing it) usually add a surrogate id bigserial PRIMARY KEY and demote the composite to a UNIQUE constraint. The composite still enforces the business rule; the surrogate gives downstream tables a single-column handle. The rule of thumb: composite PK for pure junction tables nobody else references, surrogate PK + unique constraint for everything else.


Quick Reference — Primary Key Cheat Sheet

+---------------------------------------------------------------+
|           PRIMARY KEY CHEAT SHEET                             |
+---------------------------------------------------------------+
|                                                                |
|  DECLARATION (PostgreSQL 10+):                                 |
|    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY          |
|                                                                |
|  UUID ALTERNATIVE (PostgreSQL 13+):                            |
|    id uuid PRIMARY KEY DEFAULT gen_random_uuid()               |
|                                                                |
|  COMPOSITE (junction tables):                                  |
|    PRIMARY KEY (student_id, course_id)                         |
|                                                                |
|  GUARANTEES:                                                   |
|    - Unique across the table                                   |
|    - NOT NULL (automatic, even without declaring it)           |
|    - Exactly one primary key per table                         |
|    - Backed by a unique B-tree index (_pkey)                   |
|                                                                |
|  GENERATED ALWAYS vs BY DEFAULT:                               |
|    ALWAYS     -> rejects user-supplied values (safer default)  |
|    BY DEFAULT -> accepts user-supplied values (migrations)     |
|                                                                |
|  bigserial vs uuid (pick one):                                 |
|    bigserial -> small, sequential, needs DB, leaks counts     |
|    uuid      -> 2x size, random, client-gen, opaque           |
|    UUIDv7    -> UUID shape with sequential locality            |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. Every table gets a primary key. No exceptions.             |
|  2. Prefer synthetic surrogates over natural business keys.    |
|  3. Never update a primary key value — redesign instead.       |
|  4. Use GENERATED ALWAYS AS IDENTITY, not SERIAL, for new work.|
|  5. Start with bigint (8 bytes), not int (4 bytes).            |
|  6. UUID only when distributed, offline-gen, or opaque needed. |
|  7. Composite PK for pure junction tables; surrogate otherwise.|
|  8. Primary key values are stable for the life of the row.     |
|                                                                |
+---------------------------------------------------------------+
ConcernWrong WayRight Way
New int PKid serial PRIMARY KEYid bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
Email as PKemail text PRIMARY KEYSurrogate PK + email text UNIQUE
No PK "for now"Table with no keyid bigserial PRIMARY KEY always
Mutable PKON UPDATE CASCADE on idImmutable surrogate, move the value to another column
Public URLsExpose bigserial idSeparate public_id uuid column
Junction table (referenced)Composite PK onlySurrogate PK + UNIQUE (a, b)
Junction table (not referenced)Surrogate + uniqueComposite PK (a, b)
Sizeint (4 bytes)bigint (8 bytes) from day one

Prev: Lesson 1.4 -- Schema Design Basics Next: Lesson 2.2 -- Foreign Key


This is Lesson 2.1 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

On this page