What Is Normalization?
Reducing Redundancy and Killing Anomalies
LinkedIn Hook
"You changed a customer's email in one row. The other 47 rows still show the old one. Welcome to the update anomaly."
Most developers learn SQL by stuffing everything into one wide table. Customer name, customer email, order date, product name, product price, shipping address — all sitting comfortably in a single row. It feels intuitive. It even feels fast. Then a customer changes their email, and you realize you have to update 47 rows. You miss one. Now your reports show two emails for the same person, your marketing automation sends to the dead address, and your support team cannot figure out which record is canonical.
That is the update anomaly, and it is one of three classic bugs that normalization was invented to eliminate. The other two are even worse. The insert anomaly says you cannot record a new product until at least one customer orders it, because product data only lives in the orders table. The delete anomaly says deleting the last order from a customer accidentally erases the customer entirely, because customer data only existed inside that order row.
Normalization is the discipline of organizing data so that every fact lives in exactly one place. It is not academic SQL trivia — it is the difference between a database that grows gracefully and a database that becomes a minefield of inconsistencies after six months of production traffic. The normal forms (1NF through BCNF and beyond) are a step-by-step recipe for getting there, and each form fixes a specific class of anomaly that the previous form left behind.
But normalization is not a religion. There are real cases — analytics tables, denormalized read models, materialized aggregates — where deliberately violating the normal forms makes the system faster and simpler. The skill is knowing the rules well enough to break them on purpose, not by accident.
In Lesson 10.1, I break down what normalization actually is, the three anomalies it prevents, an overview of all the normal forms, and the specific situations where denormalization wins.
Read the full lesson -> [link]
#SQL #Database #PostgreSQL #DataModeling #Normalization #BackendDevelopment #InterviewPrep
What You'll Learn
- What normalization is and the two goals it serves: reducing redundancy and preventing anomalies
- The three classic anomalies — insert, update, and delete — and exactly how each one breaks a denormalized table
- A high-level overview of the normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF) and what each one fixes
- How to recognize an unnormalized table from its symptoms before it bites you in production
- Runnable PostgreSQL examples that demonstrate each anomaly and the normalized fix
- When to deliberately denormalize: read-heavy analytics, materialized views, caching aggregates, and OLAP star schemas
- The mental model a senior engineer uses when deciding how normalized a given table should be
The Messy Closet Analogy — One Fact, One Place
Picture a closet where every shirt has the laundry instructions sewn into the collar, the brand sewn into the sleeve, the size printed on the back, and the price tag still attached. Multiply that by 200 shirts. Now the brand decides to change its logo. To update your closet, you have to find every single shirt from that brand and re-sew the sleeves. Miss one shirt? Your closet now disagrees with itself about what the brand looks like.
A well-organized closet works the opposite way. The brand has one card in a drawer somewhere. The size is written on a single label per garment. The laundry instructions live in a binder on the shelf, indexed by fabric type. Each fact about each thing exists in exactly one place. When the brand updates its logo, you change one card and every shirt that references it now shows the new logo automatically. There is no risk of two shirts disagreeing because there is no second copy of the truth.
That is the entire intuition behind database normalization. A denormalized table copies the same fact across many rows — customer name in every order, product price in every line item, supplier address in every shipment. The moment one of those facts changes, you have to find and update every copy. Normalization breaks the wide table into smaller related tables so that every fact lives in exactly one row, and other tables refer to it by a foreign key. Update the one row, and every reader sees the new value automatically.
+---------------------------------------------------------------+
| THE MESSY CLOSET vs THE ORGANIZED CLOSET |
+---------------------------------------------------------------+
| |
| MESSY (Denormalized) ORGANIZED (Normalized) |
| +--------------------+ +------------------+ |
| | Shirt 1 | | brands | |
| | brand: "Acme" | | id | name | |
| | brand_logo: ... | | 1 | Acme | |
| | size: M | +------------------+ |
| +--------------------+ ^ |
| | Shirt 2 | | |
| | brand: "Acme" | | brand_id |
| | brand_logo: ... | +------------------+ |
| | size: L | | shirts | |
| +--------------------+ | id | brand_id | size | |
| | 1 | 1 | M | |
| Change brand logo -> | 2 | 1 | L | |
| Update 200 shirts +------------------+ |
| Change brand logo -> |
| Update 1 brand row |
| |
+---------------------------------------------------------------+
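The organized-closet side of the diagram maps directly to SQL. Here is a minimal sketch (table and column names are illustrative, not part of the lesson's running example):

```sql
-- One fact, one place: the brand's logo lives in exactly one row.
CREATE TABLE brands (
    brand_id SERIAL PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    logo_url TEXT NOT NULL
);

CREATE TABLE shirts (
    shirt_id SERIAL PRIMARY KEY,
    brand_id INT NOT NULL REFERENCES brands(brand_id),
    size     TEXT NOT NULL
);

-- Rebranding touches exactly one row, no matter how many shirts exist.
UPDATE brands SET logo_url = '/logos/acme-v2.png' WHERE name = 'Acme';
```

Every shirt "sees" the new logo through its brand_id reference, because there was never a second copy of the logo to drift out of date.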
The Two Goals: Reduce Redundancy, Prevent Anomalies
Normalization has exactly two jobs. Everything else — the formal definitions, the dependency theory, the parade of normal forms — is just machinery in service of these two goals.
Goal 1: Reduce redundancy. A fact like "customer 42's email is alice@example.com" should appear in exactly one row of one table. Not in every order they ever placed. Not duplicated in a shipping table. One row, one place. Redundancy wastes storage, but more importantly it creates the conditions for inconsistency — and inconsistency is what destroys trust in a database.
Goal 2: Prevent anomalies. An anomaly is a situation where a normal database operation (insert, update, delete) produces a result that is logically wrong, inconsistent, or impossible. Anomalies are not edge cases — they are the predictable consequence of storing the same fact in multiple places. Normalization is the only general-purpose technique for eliminating them at the schema level.
Notice what is NOT on that list: performance. Normalization is fundamentally about correctness, not speed. A more normalized schema usually has more tables and requires more joins to answer a query, which can make reads slower. The trade-off is that writes become safer and the data stays internally consistent. When you reach for denormalization later in this lesson, you are explicitly trading correctness guarantees for read performance — and you should know exactly which guarantees you are giving up.
The Three Anomalies — and How They Bite You
The three anomalies are the canonical motivation for normalization. They show up the moment you stuff multiple entities into a single table, and they get worse as the table grows. Every database engineer should be able to recite them in their sleep.
+---------------------------------------------------------------+
| THE THREE ANOMALIES |
+---------------------------------------------------------------+
| |
| INSERT ANOMALY |
| "I cannot store a new fact because a related fact |
| does not exist yet." |
| Example: cannot add a new product until someone |
| orders it, because products only live in orders. |
| |
| UPDATE ANOMALY |
| "Changing one fact requires changing many rows, |
| and missing one leaves the database inconsistent." |
| Example: customer renames themselves; you update |
| 46 of 47 order rows; now the same customer has two |
| different names depending on which order you read. |
| |
| DELETE ANOMALY |
| "Deleting one row accidentally erases unrelated facts." |
| Example: deleting a customer's only order erases the |
| customer entirely, because customer info only lived |
| inside that one order row. |
| |
+---------------------------------------------------------------+
Let us see all three in real PostgreSQL.
Example 1: The Update Anomaly
We start with a denormalized orders table that crams customer information directly into every order row. This is a very common starting point for a junior engineer who has not been bitten yet.
-- The denormalized version: customer info lives inside every order
CREATE TABLE orders_bad (
    order_id       SERIAL PRIMARY KEY,
    customer_id    INT,
    customer_name  TEXT,
    customer_email TEXT,
    product        TEXT,
    qty            INT,
    ordered_at     TIMESTAMPTZ DEFAULT now()
);
INSERT INTO orders_bad (customer_id, customer_name, customer_email, product, qty) VALUES
(1, 'Alice', 'alice@old.com', 'Keyboard', 1),
(1, 'Alice', 'alice@old.com', 'Mouse', 2),
(1, 'Alice', 'alice@old.com', 'Monitor', 1),
(2, 'Bob', 'bob@example.com', 'Cable', 3);
-- Alice changes her email. We try to update.
UPDATE orders_bad SET customer_email = 'alice@new.com'
WHERE customer_id = 1 AND product <> 'Mouse'; -- oops, missed one row
-- Now look at what we have
SELECT customer_id, customer_email, product FROM orders_bad
WHERE customer_id = 1;
-- customer_id | customer_email | product
-- ------------+----------------+----------
-- 1 | alice@new.com | Keyboard
-- 1 | alice@old.com | Mouse <- inconsistent!
-- 1 | alice@new.com | Monitor
The same customer now has two different emails depending on which order you ask. There is no single source of truth. Every reporting query, every analytics dashboard, every "send a receipt" job has to guess which email is right — and they will guess differently.
Example 2: The Normalized Fix
The fix is to split the wide table into two related tables: one for customers, one for orders. Each customer's email lives in exactly one row, and orders point to that row by customer_id.
-- The normalized version: customer info lives once, orders reference it
CREATE TABLE customers (
    customer_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    email       TEXT NOT NULL UNIQUE
);
CREATE TABLE orders_good (
    order_id    SERIAL PRIMARY KEY,
    customer_id INT NOT NULL REFERENCES customers(customer_id),
    product     TEXT NOT NULL,
    qty         INT NOT NULL,
    ordered_at  TIMESTAMPTZ DEFAULT now()
);
INSERT INTO customers (name, email) VALUES
('Alice', 'alice@old.com'),
('Bob', 'bob@example.com');
INSERT INTO orders_good (customer_id, product, qty) VALUES
(1, 'Keyboard', 1),
(1, 'Mouse', 2),
(1, 'Monitor', 1),
(2, 'Cable', 3);
-- Alice changes her email. ONE row to update. Impossible to miss.
UPDATE customers SET email = 'alice@new.com' WHERE customer_id = 1;
-- Every order automatically reflects the new email through the join
SELECT o.product, c.email
FROM orders_good o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.customer_id = 1;
-- product | email
-- ---------+---------------
-- Keyboard | alice@new.com
-- Mouse | alice@new.com <- automatically correct
-- Monitor | alice@new.com
One update. Zero risk of inconsistency. This is the entire promise of normalization in one example.
Example 3: The Insert Anomaly
The denormalized table also breaks in the other direction. Suppose we want to add a new product to the catalog, but no one has ordered it yet.
-- We have no place to put a product that has not been ordered.
-- Every column in orders_bad expects a customer and an order.
-- We could insert a fake row with NULLs everywhere:
INSERT INTO orders_bad (customer_id, customer_name, customer_email, product, qty)
VALUES (NULL, NULL, NULL, 'Webcam', NULL);
-- But now we have a phantom order with no customer and no quantity,
-- which corrupts every report that counts orders or sums quantities.
SELECT COUNT(*) AS order_count FROM orders_bad;
-- order_count
-- ------------
-- 5 <- includes the phantom Webcam row
-- The normalized fix: products live in their own table
CREATE TABLE products (
    product_id  SERIAL PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE,
    price_cents INT NOT NULL
);
-- Now adding a brand-new product is a clean, independent operation
INSERT INTO products (name, price_cents) VALUES ('Webcam', 4999);
-- And orders just reference the product by id
ALTER TABLE orders_good ADD COLUMN product_id INT REFERENCES products(product_id);
-- (A real migration would backfill product_id from the old product text
--  column, set it NOT NULL, then drop the text column. Kept simple here.)
Each entity gets its own table. Each table can be inserted into independently. No phantom rows, no NULL columns standing in for missing relationships.
Example 4: The Delete Anomaly
The third anomaly is the mirror image of the insert anomaly. Watch what happens when we delete the only order belonging to Bob in the denormalized table.
-- Bob has exactly one order. We delete it.
DELETE FROM orders_bad WHERE customer_id = 2;
-- Is there any record left that Bob existed?
SELECT * FROM orders_bad WHERE customer_id = 2;
-- (0 rows)
-- Bob is gone. His name, his email, the fact that he is a customer at all
-- only existed inside that one order row, and we just erased it.
-- Now if Bob places a new order tomorrow, we have lost his email and
-- have to re-collect it. Worse, our marketing list silently shrank
-- without anyone noticing.
-- In the normalized schema, deleting an order is a clean operation
-- that does not touch the customer record.
DELETE FROM orders_good WHERE order_id = 4;
SELECT * FROM customers WHERE customer_id = 2;
-- customer_id | name | email
-- ------------+------+-----------------
-- 2 | Bob | bob@example.com <- still here, untouched
Customer existence is independent from order existence. You can delete every order Bob ever placed and Bob himself remains in the customers table. That is the only sane model.
Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Three side-by-side panels labeled 'INSERT ANOMALY', 'UPDATE ANOMALY', 'DELETE ANOMALY' in white monospace. Each panel shows a denormalized wide table on top with rose (#ff5c8a) highlighted problem cells, and a sky blue (#4fc3f7) arrow pointing down to a normalized two-table version below. Insert panel: phantom NULL row in red. Update panel: one row out of sync in red. Delete panel: customer disappearing entirely in red. Sky blue arrows on the bottom diagrams show 'one fact, one place'."
Overview of the Normal Forms
The normal forms are a graduated sequence of rules. Each form assumes the previous forms are satisfied and adds one more constraint that closes one more class of anomaly. You do not have to memorize the formal definitions on day one, but you should recognize what each form fixes.
+---------------------------------------------------------------+
| THE NORMAL FORMS — WHAT EACH ONE FIXES |
+---------------------------------------------------------------+
| |
| 1NF (First Normal Form) |
| Every column holds a single atomic value. |
| No repeating groups, no comma-separated lists, |
| no JSON blobs hiding multiple values in one cell. |
| |
| 2NF (Second Normal Form) |
| 1NF + every non-key column depends on the WHOLE |
| primary key, not just part of it. Only matters when |
| the primary key is composite (multi-column). |
| |
| 3NF (Third Normal Form) |
| 2NF + no transitive dependencies. Non-key columns |
| depend ONLY on the key, not on other non-key columns. |
| This is the practical target for most OLTP schemas. |
| |
| BCNF (Boyce-Codd Normal Form) |
| A stricter version of 3NF: every determinant must be |
| a candidate key. Closes a few rare edge cases that 3NF |
| leaves open. |
| |
| 4NF (Fourth Normal Form) |
| BCNF + no multi-valued dependencies. Splits tables that |
| accidentally encode two independent one-to-many facts. |
| |
| 5NF (Fifth Normal Form) |
| 4NF + no join dependencies that are not implied by |
| candidate keys. Mostly theoretical. |
| |
| PRACTICAL RULE OF THUMB: |
| Get to 3NF for OLTP. Reach for BCNF if you hit a |
| specific anomaly 3NF cannot fix. 4NF/5NF are rare. |
| |
+---------------------------------------------------------------+
The good news: the lessons that follow this one walk through 1NF, 2NF, 3NF, and BCNF in detail with concrete examples. For now, the only thing you need to internalize is that each form is a specific cure for a specific disease, and the diseases are all variations of "this fact is stored in more places than it should be."
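As a preview of where the series is headed, here is what the most common 3NF violation, a transitive dependency, looks like in practice. This is a sketch with illustrative table names, not part of the orders example:

```sql
-- 3NF violation: country is determined by city, not by the key.
-- The city -> country fact is duplicated once per employee.
CREATE TABLE employees_bad (
    employee_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL,
    country     TEXT NOT NULL   -- transitively dependent via city
);

-- The 3NF fix: the city -> country fact lives in exactly one row.
CREATE TABLE cities (
    city_id SERIAL PRIMARY KEY,
    name    TEXT NOT NULL,
    country TEXT NOT NULL
);

CREATE TABLE employees_good (
    employee_id SERIAL PRIMARY KEY,
    name        TEXT NOT NULL,
    city_id     INT NOT NULL REFERENCES cities(city_id)
);
```

If a city's country assignment ever changes (or was simply entered wrong), the bad schema needs one update per employee in that city; the 3NF schema needs one update in cities.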
When Denormalization Wins
Normalization is the right default for transactional systems — anywhere correctness, consistency, and the ability to update facts safely are critical. But it is not the right answer everywhere. There are real, principled situations where you should deliberately store data in a denormalized shape.
1. Read-heavy analytics and reporting. OLAP workloads — dashboards, business intelligence, monthly executive reports — read far more often than they write, and the queries involve massive aggregations across many tables. Star schemas (a central fact table joined to a small number of dimension tables) deliberately denormalize dimension data to keep joins shallow and reports fast. This is the entire architecture of data warehouses like Snowflake, BigQuery, and Redshift.
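A minimal star schema sketch makes the deliberate denormalization visible. Table and column names here are illustrative:

```sql
-- Dimension tables are intentionally flat: city AND country live together,
-- which would be a 3NF violation in an OLTP schema. Here it is on purpose.
CREATE TABLE dim_customer (
    customer_key SERIAL PRIMARY KEY,
    name         TEXT,
    city         TEXT,
    country      TEXT
);

CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,   -- e.g. 20250115
    day      DATE,
    month    INT,
    year     INT
);

-- One central fact table; every report joins one hop out from it.
CREATE TABLE fact_sales (
    customer_key INT REFERENCES dim_customer(customer_key),
    date_key     INT REFERENCES dim_date(date_key),
    amount_cents BIGINT NOT NULL
);

SELECT d.year, SUM(f.amount_cents) AS revenue_cents
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
GROUP BY d.year;
```

The payoff is that dashboard queries never chain dimension-to-dimension joins; the cost is that updating a customer's city now means updating the dimension row via a controlled ETL process rather than a free-form UPDATE.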
2. Materialized views and pre-computed aggregates. When the same expensive join runs thousands of times per second, computing it once and storing the result is a huge win. Postgres MATERIALIZED VIEW, summary tables refreshed by a cron job, and cache tables that store user-facing query results all fall into this bucket. The denormalized copy is acknowledged to be redundant — you accept the redundancy in exchange for sub-millisecond reads.
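In PostgreSQL this pattern is a few lines. A sketch, reusing the customers and orders_good tables from the examples earlier in this lesson:

```sql
-- Precompute an expensive per-customer aggregate once,
-- then serve it as a plain table read.
CREATE MATERIALIZED VIEW customer_order_totals AS
SELECT c.customer_id,
       c.name,
       COUNT(o.order_id) AS order_count,
       COALESCE(SUM(o.qty), 0) AS total_qty
FROM customers c
LEFT JOIN orders_good o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.name;

-- The copy is stale by design; refresh on a schedule you can tolerate.
-- CONCURRENTLY avoids blocking readers, but requires a unique index.
CREATE UNIQUE INDEX ON customer_order_totals (customer_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY customer_order_totals;
```

The staleness window between refreshes is the explicit price of the redundancy: readers get sub-millisecond lookups, and everyone agrees the numbers are "as of the last refresh."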
3. Caching frequently-read fields on the parent. A posts table might cache comment_count and last_commented_at even though those values can be computed from the comments table. Reading the post page is now a single row fetch instead of a join plus aggregate. The trade-off is that you must keep the cache in sync via triggers, application code, or scheduled recompute — and that synchronization is the cost of denormalization.
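One way to own that synchronization is a trigger, so the cache can never be forgotten by application code. A sketch with illustrative posts/comments tables:

```sql
CREATE TABLE posts (
    post_id       SERIAL PRIMARY KEY,
    title         TEXT NOT NULL,
    comment_count INT NOT NULL DEFAULT 0   -- denormalized, trigger-maintained
);

CREATE TABLE comments (
    comment_id SERIAL PRIMARY KEY,
    post_id    INT NOT NULL REFERENCES posts(post_id),
    body       TEXT NOT NULL
);

CREATE OR REPLACE FUNCTION sync_comment_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE posts SET comment_count = comment_count + 1
        WHERE post_id = NEW.post_id;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE posts SET comment_count = comment_count - 1
        WHERE post_id = OLD.post_id;
    END IF;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER comments_count_sync
AFTER INSERT OR DELETE ON comments
FOR EACH ROW EXECUTE FUNCTION sync_comment_count();
```

The trigger makes the database itself the sync owner: every INSERT or DELETE on comments adjusts the cached count in the same transaction, so the copy cannot silently drift.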
4. Document and event store models. NoSQL document databases (MongoDB, DynamoDB) and event sourcing systems often store entire object graphs in a single document or event payload. The data is denormalized by design, because the access pattern is "give me this whole aggregate at once" and joins are expensive or unsupported. The discipline shifts from normal forms to careful aggregate design.
5. Time-series and immutable log data. Metrics, audit logs, and event streams are append-only and never updated. The update anomaly cannot occur because nothing is ever updated. Storing wide, redundant rows costs disk but is fine because the rows are immutable.
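A sketch of what that looks like (names illustrative). The denormalized snapshot columns are safe precisely because rows are never updated:

```sql
-- Append-only event log: wide, redundant rows by design.
CREATE TABLE page_view_events (
    event_id    BIGSERIAL PRIMARY KEY,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    user_id     INT,
    user_email  TEXT,          -- snapshot of the email AS OF occurred_at
    page_url    TEXT NOT NULL
);
-- No UPDATE ever runs against this table, so the update anomaly cannot
-- occur. The redundancy costs disk, not correctness -- and the snapshot
-- is actually a feature: it records what the email WAS at event time.
```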
+---------------------------------------------------------------+
| NORMALIZE vs DENORMALIZE — DECISION GUIDE |
+---------------------------------------------------------------+
| |
| NORMALIZE WHEN: |
| - OLTP workload (orders, users, accounts, inventory) |
| - Data is updated frequently |
| - Consistency matters more than read latency |
| - You cannot tolerate two copies of the same fact |
| drifting apart |
| |
| DENORMALIZE WHEN: |
| - OLAP / reporting / analytics workload |
| - Read-to-write ratio is very high (1000:1 or more) |
| - Joins are the proven bottleneck (measured, not guessed) |
| - Data is immutable (logs, events, time-series) |
| - You have a reliable mechanism to keep copies in sync |
| |
| THE GOLDEN RULE: |
| Normalize until it hurts, then denormalize until it works. |
| Never denormalize first, never denormalize on a hunch. |
| |
+---------------------------------------------------------------+
The cardinal sin is denormalizing pre-emptively. "I might need to read this fast someday" is not a justification — it is how you end up with the update anomaly six months from now and a data quality incident at 2 AM. Normalize first, measure, and only denormalize the specific hot paths where measurements prove the joins are too slow. Even then, document the redundancy and own the sync responsibility explicitly.
Common Mistakes
1. Storing comma-separated lists in a single column.
A column called tags containing the value 'red,large,sale' violates 1NF on day one. You cannot index it efficiently, you cannot join on individual tags, and updating one tag requires string parsing and rewriting the whole field. The fix is a separate tags table joined via a post_tags junction table. Comma-separated lists are the single most common normalization sin in junior schemas.
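The fix in SQL, assuming a posts table with a post_id primary key and a title column:

```sql
-- One row per tag; a junction table for the many-to-many link.
CREATE TABLE tags (
    tag_id SERIAL PRIMARY KEY,
    name   TEXT NOT NULL UNIQUE
);

CREATE TABLE post_tags (
    post_id INT NOT NULL REFERENCES posts(post_id),
    tag_id  INT NOT NULL REFERENCES tags(tag_id),
    PRIMARY KEY (post_id, tag_id)
);

-- "All posts tagged 'large'" is now an indexed join, not string parsing.
SELECT p.post_id, p.title
FROM posts p
JOIN post_tags pt ON pt.post_id = p.post_id
JOIN tags t ON t.tag_id = pt.tag_id
WHERE t.name = 'large';
```

Adding a tag is an INSERT into post_tags, removing one is a DELETE, and counting tag usage is a GROUP BY on the junction table.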
2. Treating normalization as a performance issue instead of a correctness issue.
"I will denormalize because joins are slow" is a guess until you measure it. Modern Postgres can join a normalized schema with appropriate indexes faster than most teams' homegrown denormalized caches. Measure first. Normalize for correctness, denormalize only the proven hot paths.
3. Confusing "fewer tables" with "simpler schema."
A wide table with 60 columns looks simpler than five normalized tables with 12 columns each, but it is dramatically harder to evolve, harder to validate, and harder to query correctly. Schema simplicity is measured in clarity of intent, not in table count. Five small tables that each represent one thing are simpler than one giant table that represents nothing in particular.
4. Skipping foreign keys "for performance."
Foreign keys are how the database enforces the relationships your normalized design depends on. Without them, an orders.customer_id value can point at a customer that no longer exists, and your "normalized" schema silently devolves into a bag of disconnected rows. The cost of a foreign key check is microseconds. The cost of orphaned rows is a multi-day data cleanup project.
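With the foreign key declared (as in the orders_good example earlier), the database rejects orphans at write time. The constraint name below is PostgreSQL's default naming convention for an inline REFERENCES clause:

```sql
-- customer 999 does not exist; the FK stops the write immediately
INSERT INTO orders_good (customer_id, product, qty)
VALUES (999, 'Desk', 1);
-- ERROR:  insert or update on table "orders_good" violates foreign key
--         constraint "orders_good_customer_id_fkey"
```

One rejected INSERT at write time is vastly cheaper than discovering thousands of orphaned rows during an audit.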
5. Denormalizing without a sync strategy.
If you cache comment_count on posts, you must keep it in sync — via a trigger, an application-side update, or a scheduled recompute. Caching a value and then never updating it is worse than not caching at all, because now your reports lie and nobody knows when the lie started. Every denormalized field needs an explicit, documented owner responsible for keeping it accurate.
Interview Questions
1. "What is database normalization, and what are the two main goals it serves?"
Normalization is the process of organizing a relational schema so that every fact is stored in exactly one place, by decomposing wide tables into smaller related tables connected via foreign keys. It serves two main goals. First, it reduces redundancy: data like a customer's email or a product's price exists in one row of one table, not duplicated across many rows of an orders table or a line items table. Second, and more importantly, it prevents anomalies — the three classic insert, update, and delete anomalies that occur when the same fact is stored in multiple places. Notice that "performance" is deliberately not on the list. Normalization is fundamentally about correctness and consistency, not speed. A normalized schema usually requires more joins to read, which can be slower than a denormalized table, but it makes writes safer and prevents the database from contradicting itself. The trade-off is explicit: normalize for correctness, denormalize specific hot paths only when measurements prove you need to.
2. "Explain the three anomalies that normalization prevents, with a concrete example of each."
The insert anomaly is when you cannot store a fact because a related fact does not exist yet. Example: if products only live inside an orders table, you cannot add a new product to your catalog until at least one customer orders it, because every row requires a customer_id and an order_id. The workaround — inserting a phantom order with NULL customer and NULL quantity — corrupts every report that counts or sums those columns. The update anomaly is when changing one fact requires updating many rows, and missing one leaves the database internally inconsistent. Example: a customer changes their email, and you have to find and update all 47 of their order rows; if you miss one, the same customer now has two different emails and there is no single source of truth. The delete anomaly is when removing one row accidentally erases unrelated facts. Example: a customer has only ever placed one order; deleting that order removes the only row containing their name and email, so the customer effectively ceases to exist in the database even though they are still your customer in real life. The fix for all three is the same: split the entities into separate tables (customers, products, orders) so that each entity's existence and lifecycle is independent.
3. "What are the normal forms and what does each one fix? When do you stop normalizing?"
The normal forms are a graduated sequence of rules, each fixing a specific class of anomaly that the previous form left behind. First Normal Form (1NF) requires every column to hold a single atomic value — no comma-separated lists, no repeating groups, no hidden arrays. Second Normal Form (2NF) requires every non-key column to depend on the entire primary key, not just part of it; this only matters for tables with composite primary keys. Third Normal Form (3NF) requires that non-key columns depend only on the key and not on other non-key columns — no transitive dependencies. Boyce-Codd Normal Form (BCNF) is a slightly stricter version of 3NF that closes a few rare edge cases involving overlapping candidate keys. Fourth and Fifth Normal Forms (4NF, 5NF) deal with multi-valued and join dependencies and are mostly of theoretical interest. The practical answer to "when do you stop" is: get to 3NF for any OLTP schema you build. 3NF eliminates the vast majority of real-world anomalies and is the universally accepted floor for a production transactional database. Push to BCNF only if you encounter a specific anomaly 3NF cannot fix. 4NF and 5NF are rare enough in practice that you can learn them when you need them.
4. "When is denormalization the right choice, and what do you give up when you make it?"
Denormalization is the right choice when read performance demonstrably outweighs the cost of duplicated data and you have a reliable mechanism to keep the duplicates in sync. The most common cases are analytics and reporting workloads, where star schemas in a data warehouse deliberately denormalize dimension tables to keep dashboard queries fast; materialized views and pre-computed aggregate tables that cache the result of an expensive join; caching frequently-read fields on a parent row, like storing comment_count on posts; document and event-store data models where joins are unsupported or impractical; and immutable time-series and log data where the update anomaly cannot occur because nothing is ever updated. What you give up is the guarantee that the data is internally consistent. Every duplicated value becomes a synchronization problem — you need triggers, application code, or scheduled jobs to keep copies in sync, and any bug in that synchronization produces a quiet data drift that can run undetected for months. The rule of thumb is: normalize until it hurts, then denormalize until it works. Never denormalize on a hunch, always measure first, and always document who owns the sync responsibility for every redundant field.
5. "A junior developer shows you a table with 60 columns and a comma-separated tags field. What do you tell them and why?"
Two problems, both fundamental. The 60-column table almost certainly mixes multiple entities into one row — a sign that the schema was grown by addition rather than designed. I would walk through it column by column and look for groups of columns that always travel together: customer fields, product fields, address fields, audit fields. Each group is probably a separate entity that deserves its own table connected by a foreign key. This is the normalization fix and it converts a brittle wide table into a set of small, focused tables that can each be evolved independently. The comma-separated tags column is a 1NF violation and a guaranteed source of pain. You cannot index it usefully (a B-tree index on 'red,large,sale' does not help you find posts tagged "large"), you cannot join on individual tags, you cannot count usage of a tag without parsing strings, and adding or removing a single tag requires reading the row, splitting the string, modifying it, and writing it back. The fix is a tags table with one row per tag and a post_tags junction table linking posts to tags many-to-many. After the fix, finding all posts with a given tag is an indexed lookup, counting tag usage is a GROUP BY, and adding a tag is an INSERT into the junction table. The longer the team waits to make this change, the more code they will write that depends on string parsing the tags column — which makes the eventual migration painful. Fix it now, when the table is small.
Quick Reference — Normalization Cheat Sheet
+---------------------------------------------------------------+
| NORMALIZATION CHEAT SHEET |
+---------------------------------------------------------------+
| |
| THE GOAL: |
| Every fact lives in exactly ONE row of ONE table. |
| |
| THE THREE ANOMALIES: |
| INSERT -> cannot store fact A without related fact B |
| UPDATE -> changing one fact requires N row updates |
| DELETE -> removing one row erases unrelated facts |
| |
| THE NORMAL FORMS (what each fixes): |
| 1NF -> atomic columns, no lists or repeating groups |
| 2NF -> no partial dependency on a composite key |
| 3NF -> no transitive dependency between non-key columns |
| BCNF -> every determinant is a candidate key |
| 4NF -> no multi-valued dependencies |
| 5NF -> no spurious join dependencies |
| |
| PRACTICAL TARGETS: |
| OLTP transactional schema -> 3NF (sometimes BCNF) |
| OLAP analytics warehouse -> star schema (denormalized) |
| Document store / events -> aggregate-shaped, denormalized |
| |
| WHEN TO DENORMALIZE: |
| - Measured read bottleneck on a hot path |
| - Read-to-write ratio extremely high |
| - Immutable data (logs, time-series) |
| - Materialized views / cached aggregates |
| - You have an explicit sync mechanism |
| |
+---------------------------------------------------------------+
+---------------------------------------------------------------+
| KEY RULES |
+---------------------------------------------------------------+
| |
| 1. One fact, one place. That is the whole game. |
| 2. Normalize for correctness, not for performance. |
| 3. Get to 3NF by default for any OLTP schema. |
| 4. Comma-separated lists violate 1NF -- always split them. |
| 5. Foreign keys are how the schema enforces normalization. |
| 6. Denormalize only specific hot paths after measuring. |
| 7. Every denormalized field needs an explicit sync owner. |
| 8. Star schemas are denormalized on purpose for analytics. |
| 9. Immutable data tolerates redundancy because nothing |
| is ever updated. |
| 10. Normalize until it hurts, then denormalize until it works. |
| |
+---------------------------------------------------------------+
| Concern | Wrong Way | Right Way |
|---|---|---|
| Customer info | Repeated in every order row | Single customers table, FK from orders |
| Tags on a post | 'red,large,sale' in one column | tags + post_tags junction table |
| Product price | Copied to every line item | products.price, looked up at order time |
| Schema simplicity | Fewer, wider tables | More, narrower tables with clear purpose |
| Read performance | Denormalize pre-emptively | Normalize, measure, denormalize hot paths |
| Cached aggregate | Update count and forget | Trigger or scheduled recompute |
| Foreign keys | "Skip them for speed" | Always declare, microseconds well spent |
| Analytics queries | Force into OLTP schema | Build a denormalized star schema warehouse |
Prev: Lesson 9.6 -- Index Best Practices and EXPLAIN
Next: Lesson 10.2 -- First Normal Form
This is Lesson 10.1 of the Database Interview Prep Course -- 12 chapters, 58 lessons.