Many-to-Many Relationships
Junction Tables, Composite Keys, and Relationship Attributes
LinkedIn Hook
"You added a
course_idscolumn to thestudentstable. It is a comma-separated string. You are now writing your own query language."Many-to-many relationships are the third and most misunderstood cardinality in relational design. One-to-one is rare. One-to-many fits inside the foreign key model you already know — the "many" side carries a single FK column and you are done. But many-to-many breaks that pattern entirely. A student takes many courses. A course enrolls many students. There is no single side that owns the link, so there is nowhere to put a single foreign key. You cannot add
course_idtostudentsbecause a student has many. You cannot addstudent_idtocoursesbecause a course has many.Beginners reach for arrays, JSON columns, or comma-separated strings. Each of those choices throws away the relational model: you lose foreign key enforcement, you cannot index efficiently, you cannot join cleanly, and you cannot answer simple questions like "how many students are in CS101" without parsing strings inside SQL.
The right answer is older than every database you have ever used. Create a third table — a junction table, also called a join table, link table, associative entity, or bridge table — whose only job is to hold pairs of foreign keys, one to each side. Two FKs become a composite primary key. Suddenly every question is a clean two-join query, every constraint is enforced by the database, and every "wait, when did they enroll?" question can be answered by adding a column to the junction table itself.
In Lesson 8.3, I break down many-to-many relationships end to end: why junction tables exist, how to design the composite primary key, when to add attributes to the relationship, and the runnable PostgreSQL patterns that make students
<->courses through enrollments work in production.Read the full lesson -> [link]
#SQL #Database #PostgreSQL #DataModeling #BackendDevelopment #InterviewPrep #ERD
What You'll Learn
- Why many-to-many relationships cannot be modeled with a single foreign key column
- How a junction table (join table, bridge table, associative entity) decomposes M:N into two 1:N relationships
- How to design a composite primary key on the two foreign key columns and why it matters
- When to use a surrogate
idon the junction table vs the natural composite PK - How to add attributes to the relationship itself (enrolled_at, grade, status) without breaking normalization
- The students
<->courses example end to end with runnable PostgreSQL DDL and queries - Indexing strategy: why one composite PK index is not enough and you need a second index in reverse order
- Common anti-patterns: comma-separated columns, JSON arrays of IDs, EAV-style fakes
- PostgreSQL vs MySQL differences for composite keys and ON DELETE behavior
The Classroom Sign-Up Sheet Analogy
Picture a small university with two bulletin boards in the hallway. The first board lists every student in the school — name, ID number, year. The second board lists every course offered this semester — code, title, room. Now the registrar asks a simple question: which students are taking which courses this semester?
You cannot answer that question by writing on either of the existing boards. If you try to write course codes next to each student name, you run out of space the moment a student takes more than one class — and you cannot easily search "who is taking CS101?" because the answer is scattered across dozens of student rows. If you instead try to write student names next to each course, you have the same problem in the opposite direction. The information is fundamentally about pairs, and pairs do not belong to either side alone.
So the registrar puts up a third board: the enrollment sheet. Each row on this board has exactly two things — a student ID and a course code. One row per pair. If a student takes five courses, there are five rows with that student's ID. If a course has thirty students, there are thirty rows with that course's code. The sheet itself owns nothing; it just records connections. And because each row is a pair, you can sort the sheet by student to see what one student is taking, or sort it by course to see who is in one class. The same data answers both questions.
That third board is exactly what a junction table is in SQL. The students table is one bulletin board. The courses table is the other. The enrollments table is the sign-up sheet — a thin table whose only purpose is to hold pairs of foreign keys, one pointing back to each side. The relational model has no built-in concept of a "many-to-many link." Instead, you decompose every M:N relationship into two perfectly normal one-to-many relationships that meet in the middle at a junction table.
+---------------------------------------------------------------+
| WHY YOU CANNOT USE A SINGLE FK |
+---------------------------------------------------------------+
| |
| Wrong attempt 1: put course_id on students |
| |
| students (id, name, course_id) |
| -> Alice can only take ONE course. Broken. |
| |
| Wrong attempt 2: put student_id on courses |
| |
| courses (id, title, student_id) |
| -> CS101 can only enroll ONE student. Broken. |
| |
| Wrong attempt 3: comma-separated string |
| |
| students (id, name, course_ids TEXT) -- "1,4,7" |
| -> No FK enforcement. No index. String parsing in SQL. |
| -> Cannot answer "who is in course 4" without LIKE '%4%'. |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| THE JUNCTION TABLE PATTERN |
+---------------------------------------------------------------+
| |
| students courses |
| +-----------+ +-------------+ |
| | id (PK) | | id (PK) | |
| | name | | code | |
| | year | | title | |
| +-----+-----+ +------+------+ |
| | | |
| | 1 1 | |
| | | |
| | enrollments | |
| | +----------------------+ | |
| +--N>| student_id (FK, PK) |<N------+ |
| | course_id (FK, PK) | |
| | enrolled_at | |
| | grade | |
| +----------------------+ |
| |
| M:N (students `<->` courses) |
| decomposes into |
| 1:N (students -> enrollments) + N:1 (enrollments -> courses) |
| |
+---------------------------------------------------------------+
Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Three white monospace table cards. Left card 'students' with rows for Alice, Bob, Carol. Right card 'courses' with rows for CS101, MATH201, ENG110. Center card glowing rose (#ff5c8a) labeled 'enrollments' showing pairs (Alice, CS101), (Alice, MATH201), (Bob, CS101). Sky blue (#4fc3f7) arrows from students -> enrollments labeled '1:N' and from courses -> enrollments labeled '1:N'. A small caption beneath in white monospace reads 'M:N = two 1:N meeting in the middle'. Subtle grid overlay."
Building the Schema in PostgreSQL
The classic example is students and courses. A student takes many courses. A course has many students. We need three tables: students, courses, and the junction table enrollments. The junction table holds two foreign keys, and together those two columns form the composite primary key — meaning the database physically prevents the same student from being enrolled in the same course twice.
Example 1 — DDL with Composite Primary Key
-- Two parent tables. Nothing special — each has its own surrogate key.
CREATE TABLE students (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
year INT NOT NULL CHECK (year BETWEEN 1 AND 6)
);
CREATE TABLE courses (
id SERIAL PRIMARY KEY,
code TEXT NOT NULL UNIQUE, -- e.g. 'CS101'
title TEXT NOT NULL,
credits INT NOT NULL DEFAULT 3
);
-- The junction table. Two foreign keys, one composite primary key.
-- ON DELETE CASCADE means: if a student is deleted, their enrollments
-- vanish with them. Same for courses. This is almost always what you
-- want for a junction table — orphaned link rows are meaningless.
CREATE TABLE enrollments (
student_id INT NOT NULL REFERENCES students(id) ON DELETE CASCADE,
course_id INT NOT NULL REFERENCES courses(id) ON DELETE CASCADE,
enrolled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
grade CHAR(1) CHECK (grade IN ('A','B','C','D','F') OR grade IS NULL),
-- The composite primary key: each (student, course) pair appears
-- at most once. This single line replaces a UNIQUE constraint AND
-- gives us the primary index for free.
PRIMARY KEY (student_id, course_id)
);
-- Critical second index! The composite PK above is sorted as
-- (student_id, course_id), which makes "find courses for student X"
-- fast but "find students in course Y" a full table scan. Add a
-- second index in REVERSE order for the other direction.
CREATE INDEX idx_enrollments_course_student
ON enrollments (course_id, student_id);
-- Sample data for the next few examples
INSERT INTO students (name, year) VALUES
('Alice', 2),
('Bob', 1),
('Carol', 3);
INSERT INTO courses (code, title, credits) VALUES
('CS101', 'Intro to Computer Science', 4),
('MATH201', 'Linear Algebra', 3),
('ENG110', 'Composition I', 3);
INSERT INTO enrollments (student_id, course_id, grade) VALUES
(1, 1, 'A'), -- Alice in CS101
(1, 2, 'B'), -- Alice in MATH201
(2, 1, NULL), -- Bob in CS101 (no grade yet)
(3, 2, 'A'), -- Carol in MATH201
(3, 3, 'B'); -- Carol in ENG110
Why the composite PK and not a surrogate id? The composite primary key (student_id, course_id) does two jobs at once: it identifies each row uniquely and it enforces the business rule "no student can be enrolled in the same course twice." If you instead added a SERIAL id PRIMARY KEY to enrollments, you would still need a separate UNIQUE (student_id, course_id) constraint to prevent duplicates — so you would carry an extra column and an extra index for no benefit. Use the natural composite key unless you have a strong reason (for example, another table needs to FK to a single enrollment row, in which case a surrogate id becomes useful).
Querying Across the Junction
Once the junction table is in place, every "many-to-many" question becomes a clean two-join query: from one side, through enrollments, to the other side. Read the joins as walking across the bridge.
Example 2 — "Which courses is Alice taking?"
-- Walk: students -> enrollments -> courses
SELECT s.name AS student, c.code, c.title, e.grade
FROM students s
JOIN enrollments e ON e.student_id = s.id
JOIN courses c ON c.id = e.course_id
WHERE s.name = 'Alice'
ORDER BY c.code;
student | code | title | grade
---------+---------+-------------------------+-------
Alice | CS101 | Intro to Computer Sci.. | A
Alice | MATH201 | Linear Algebra | B
(2 rows)
Example 3 — "Who is enrolled in CS101, and what is the class average?"
-- Walk the bridge in the other direction: courses -> enrollments -> students.
-- This is the query the second index (course_id, student_id) was built for.
SELECT c.code, s.name AS student, e.grade
FROM courses c
JOIN enrollments e ON e.course_id = c.id
JOIN students s ON s.id = e.student_id
WHERE c.code = 'CS101'
ORDER BY s.name;
code | student | grade
--------+---------+-------
CS101 | Alice | A
CS101 | Bob |
(2 rows)
-- Aggregate against the junction table directly: count enrollments
-- per course, ignoring NULL grades when computing the average.
SELECT c.code,
COUNT(*) AS enrolled,
COUNT(e.grade) AS graded,
ROUND(AVG(
CASE e.grade
WHEN 'A' THEN 4 WHEN 'B' THEN 3
WHEN 'C' THEN 2 WHEN 'D' THEN 1
WHEN 'F' THEN 0
END
)::numeric, 2) AS gpa
FROM courses c
LEFT JOIN enrollments e ON e.course_id = c.id
GROUP BY c.code
ORDER BY c.code;
code | enrolled | graded | gpa
---------+----------+--------+------
CS101 | 2 | 1 | 4.00
ENG110 | 1 | 1 | 3.00
MATH201 | 2 | 2 | 3.50
(3 rows)
Notice that the junction table is no longer just a passive link — it now carries a grade column that belongs to neither student nor course alone. A grade is an attribute of the enrollment, not of the student (Alice has many grades, one per course) and not of the course (CS101 has many grades, one per student). Anything that depends on the pair belongs on the junction table. This is the moment a junction table stops being a "join table" and becomes what data modelers call an associative entity — a real entity in its own right, with its own attributes.
When the Junction Table Grows Up
A pure junction table has only the two FK columns and nothing else. The moment you add enrolled_at, grade, status, dropped_at, or final_score, you are admitting that the relationship itself has properties. That is fine — and in fact, most production junction tables end up looking like full entities. There is a useful test: if the column would not make sense on either parent table alone, it belongs on the junction.
Example 4 — Dropping students, soft deletes, and history
-- Add status and timestamps so we can model real enrollment lifecycle.
ALTER TABLE enrollments
ADD COLUMN status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active','dropped','completed')),
ADD COLUMN dropped_at TIMESTAMPTZ;
-- Bob drops CS101
UPDATE enrollments
SET status = 'dropped', dropped_at = now()
WHERE student_id = 2 AND course_id = 1;
-- Find every active enrollment, with how long the student has been enrolled.
SELECT s.name,
c.code,
e.status,
date_trunc('day', now() - e.enrolled_at) AS duration
FROM enrollments e
JOIN students s ON s.id = e.student_id
JOIN courses c ON c.id = e.course_id
WHERE e.status = 'active'
ORDER BY s.name, c.code;
name | code | status | duration
-------+---------+--------+----------
Alice | CS101 | active | 12 days
Alice | MATH201 | active | 12 days
Carol | ENG110 | active | 12 days
Carol | MATH201 | active | 12 days
(4 rows)
If you ever need to track enrollment history — the same student dropping and re-enrolling in the same course across semesters — the natural composite PK (student_id, course_id) becomes too restrictive (it only allows one row per pair). At that point you either add term_id to the composite key, making it (student_id, course_id, term_id), or switch to a surrogate id on the enrollment row. Both are valid; pick the one that matches how the rest of the schema references enrollments.
Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). A timeline running left to right in white monospace. Three rows representing one student-course pair across three semesters. Each row is a card glowing sky blue (#4fc3f7) showing 'enrollments' with values: Fall 2024 status=completed grade=B, Spring 2025 status=dropped, Fall 2025 status=active. A rose (#ff5c8a) caption above reads 'composite PK = (student_id, course_id, term_id)'. Subtle grid overlay."
PostgreSQL vs MySQL Notes
The junction table pattern is identical across every relational database, but there are small dialect differences worth knowing for interviews and migrations.
- Composite PK syntax is the same in Postgres and MySQL:
PRIMARY KEY (student_id, course_id)works in both. SQL Server and Oracle also accept the identical form. - ON DELETE CASCADE behaves the same in Postgres and MySQL/InnoDB. SQLite supports it but requires
PRAGMA foreign_keys = ON;per connection — easy to forget and a frequent bug source. - Index ordering — both engines use B-tree indexes that are useful only for the leading column or a leading prefix. The composite PK
(student_id, course_id)accelerates lookups bystudent_idalone or by(student_id, course_id)together, but not bycourse_idalone. Both Postgres and MySQL need a second index(course_id, student_id)to make reverse-direction lookups fast. This is the most commonly forgotten index in junction-table designs. - Postgres-only feature:
EXCLUDEconstraints let you express "no overlapping enrollments per term" using GiST indexes — powerful for scheduling problems but unnecessary for plain M:N linking. - MySQL historically did not support deferred FK checking, which can complicate bulk inserts that load both parents and the junction in one transaction. Postgres supports
DEFERRABLE INITIALLY DEFERRED. - Naming conventions — Rails calls them
has_many :throughtables, Django calls themthroughmodels, SQLAlchemy calls themsecondarytables, and database textbooks call them associative entities or bridge tables. They are all the same shape: two FKs forming a composite PK, optionally with extra columns. Knowing the vocabulary across stacks avoids a lot of confusion in interviews and code reviews. - Naming the table itself — prefer a name that describes the relationship (
enrollments,memberships,subscriptions,tags_on_posts) rather than concatenating the two parents (students_courses). A descriptive name reads better in queries and gracefully accommodates the moment you add attributes and the table becomes a real entity.
-- Quick sanity check: confirm the indexes are in place and the
-- composite PK is doing its job. Run this after creating the schema.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'enrollments'
ORDER BY indexname;
indexname | indexdef
----------------------------------+--------------------------------------------------------------------
enrollments_pkey | CREATE UNIQUE INDEX enrollments_pkey ON enrollments (student_id, course_id)
idx_enrollments_course_student | CREATE INDEX idx_enrollments_course_student ON enrollments (course_id, student_id)
(2 rows)
Two indexes, one in each direction. That is the minimum healthy state for a junction table that gets queried from both sides — and it is the configuration you want to verify on every M:N table you inherit.
Common Mistakes
1. Storing the "many" side as a comma-separated string or JSON array.
Putting course_ids = '1,4,7' on the students table or course_ids JSONB is the most common rookie shortcut, and it throws away every benefit of the relational model. You lose foreign key enforcement (nothing prevents course_id = 9999), you lose efficient joins (you must parse strings inside SQL), you lose simple aggregates ("how many students per course?" becomes a LIKE '%4%' scan), and you create a maintenance nightmare when a course is deleted. The junction table costs one extra CREATE TABLE and pays back every single query you will ever write.
2. Forgetting the second index in reverse column order.
A composite primary key on (student_id, course_id) only accelerates queries that filter by student_id (alone or with course_id). It does not help "find all students in CS101" — that becomes a full sequential scan over the entire enrollments table. Always add a second index (course_id, student_id) so both directions of the bridge are fast. This single missing index is the most common performance regression in junction-table queries.
3. Adding a surrogate id to the junction table without a UNIQUE on the FK pair.
If you must use a surrogate id (some ORMs require it), you still need UNIQUE (student_id, course_id) to prevent double enrollments. Skipping this constraint lets the application insert the same pair twice and corrupts every COUNT, AVG, and SUM that runs against the table. The composite PK gives you uniqueness for free; the surrogate id does not.
4. Putting relationship attributes on the wrong table.
Beginners sometimes add a last_grade column to students or a student_grades array to courses. Both are wrong: a grade depends on the pair of student and course, so it must live on the junction row. The test is simple: if you cannot answer the question "for which course?" or "for which student?" by looking at a single attribute, it does not belong on the parent — it belongs on the junction.
5. Using ON DELETE NO ACTION (the default) when you actually want CASCADE.
By default, deleting a student that has enrollments fails with a foreign key violation. That sometimes is correct — but for a pure link table, almost always you want ON DELETE CASCADE so that orphaned link rows disappear with the parent. Choose intentionally; do not just accept the default.
Interview Questions
1. "Why can't you model a many-to-many relationship with a single foreign key column on either side?"
A single foreign key column can only point to one parent row. If a student takes many courses, a course_id column on the students table has room for exactly one course — every additional course requires either duplicating the student row (which breaks the primary key on students) or storing multiple values in one cell (which breaks first normal form). The same applies in the opposite direction: a student_id column on courses can only enroll one student. Many-to-many is fundamentally a relationship between pairs, and pairs do not belong to either side alone. The relational model decomposes M:N into two normal one-to-many relationships joined through a third table — the junction table — whose rows each hold one pair of foreign keys. This is the only way to preserve foreign key enforcement, allow efficient indexing in both directions, and keep the schema in third normal form.
2. "What goes into the primary key of a junction table, and why?"
The natural choice is a composite primary key on the two foreign key columns — for example PRIMARY KEY (student_id, course_id) on an enrollments table. This serves two purposes simultaneously: it identifies each link row uniquely and it enforces the business rule that the same pair cannot appear twice. Adding a surrogate id column instead works but forces you to also add a separate UNIQUE (student_id, course_id) constraint to prevent duplicates, costing an extra column and an extra index for no real benefit. The exception is when you need to track multiple instances of the same pair across time (a student dropping and re-enrolling each semester) — then you either widen the composite key to include term_id or switch to a surrogate id. Pick the natural composite key by default and reach for a surrogate only when another table needs to FK to a single enrollment row.
3. "Why do you need TWO indexes on a junction table when there's already a composite primary key?"
Because a composite B-tree index is only useful for queries that filter on the leading column or a leading prefix of the columns. A primary key on (student_id, course_id) accelerates "find courses for student X" and "find the specific (X, Y) pair," but it provides no help at all for "find students in course Y" — that query must scan the entire junction table. The fix is a second index in reverse order, (course_id, student_id), which makes the reverse-direction lookups equally fast. Both directions of the bridge are queried in real applications, so both directions must be indexed. This is the single most commonly forgotten optimization in M:N schemas, and it shows up as a sudden production slowdown the first time someone runs "list everyone enrolled in this course."
4. "When does a junction table stop being a 'join table' and become a real entity?"
The moment you add columns that depend on the pair rather than on either parent alone. A pure junction table has just the two foreign key columns. Add enrolled_at, grade, status, dropped_at, or final_score and you have admitted that the relationship itself has attributes. Data modelers call this an associative entity — it is a real entity that happens to derive its identity from the two parents it connects. Once attributes start landing on the junction, you may also want a surrogate id, soft deletes, audit columns, and even relationships from other tables back to the junction row itself (for example, feedback_comments(enrollment_id)). The test for whether something belongs on the junction is simple: if the column would not make sense on either parent alone, it belongs on the link.
5. "How would you store enrollment history if the same student can take the same course multiple times across semesters?"
A pure composite primary key on (student_id, course_id) only allows one row per pair, so it cannot represent the same student retaking the same course in a later semester. Two valid fixes exist. The first is to widen the composite key to include the term: PRIMARY KEY (student_id, course_id, term_id), which permits multiple enrollments for the same pair as long as each is in a distinct term. This stays close to the natural-key approach and makes "Alice's history in CS101" a clean filter. The second is to switch the junction table to a surrogate id primary key with a non-unique (student_id, course_id) index, allowing arbitrary repeats without modeling the term in the key. The first option is preferable when terms are a first-class concept in the schema, the second when other tables need a single stable ID to reference an individual enrollment row.
Quick Reference — Many-to-Many Cheat Sheet
+---------------------------------------------------------------+
| MANY-TO-MANY CHEAT SHEET |
+---------------------------------------------------------------+
| |
| THE PATTERN: |
| M:N -> three tables |
| parent_a, parent_b, junction(a_id, b_id) |
| Decomposes into 1:N + N:1 meeting at the junction |
| |
| JUNCTION TABLE TEMPLATE: |
| CREATE TABLE link ( |
| a_id INT NOT NULL REFERENCES a(id) ON DELETE CASCADE, |
| b_id INT NOT NULL REFERENCES b(id) ON DELETE CASCADE, |
| <attributes of the relationship>, |
| PRIMARY KEY (a_id, b_id) |
| ); |
| CREATE INDEX link_b_a_idx ON link (b_id, a_id); |
| |
| COMPOSITE PRIMARY KEY GIVES YOU: |
| 1. Uniqueness of the pair (no double-link) |
| 2. The primary index in (a_id, b_id) order |
| -> still need a second index (b_id, a_id) for reverse |
| |
| RELATIONSHIP ATTRIBUTES LIVE ON THE JUNCTION: |
| enrolled_at, grade, status, dropped_at, final_score |
| Test: would this column make sense on EITHER parent alone? |
| -> If no, it belongs on the link. |
| |
| WHEN TO USE A SURROGATE id INSTEAD OF COMPOSITE PK: |
| - Other tables need to FK to a single link row |
| - Same pair can repeat across time (or add term_id to PK) |
| - ORM strongly prefers single-column PKs |
| -> Still add UNIQUE (a_id, b_id) unless repeats are valid |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| ANTI-PATTERNS TO AVOID |
+---------------------------------------------------------------+
| |
| X course_ids TEXT -- "1,4,7" |
| X course_ids JSONB -- [1, 4, 7] |
| X course_ids INT[] -- {1, 4, 7} |
| X Adding course_id columns 1..N to students |
| X Surrogate id with no UNIQUE on the FK pair |
| X Composite PK only, no reverse-order second index |
| |
+---------------------------------------------------------------+
| Concern | Wrong Way | Right Way |
|---|---|---|
| Storing M:N | CSV / JSON array of IDs | Junction table with two FKs |
| Junction PK | Surrogate id only | Composite (a_id, b_id) |
| Reverse lookups | Hope the PK helps | Second index (b_id, a_id) |
| Relationship attributes | On a parent table | On the junction row |
| Deleting a parent | Default NO ACTION | ON DELETE CASCADE on junction |
| Repeat enrollments | Break the unique constraint | Add term_id to composite key |
| Bidirectional queries | One index, slow one direction | Two indexes, fast both ways |
| Enforcing uniqueness | Application-level check | Composite PK or UNIQUE constraint |
Prev: Lesson 8.2 -- One-to-Many Relationships Next: Lesson 8.4 -- ER Diagrams
This is Lesson 8.3 of the Database Interview Prep Course -- 12 chapters, 58 lessons.