LinkedIn Hook

"Your full-text search query takes 8 seconds on a B-tree index. You added the wrong index type."

Most engineers learn one index — the B-tree — and try to make it do everything. They reach for it on a JSONB column and wonder why tags @> '["sale"]' still triggers a sequential scan. They build a B-tree on a 200 GB time-series table and watch the index itself balloon to 60 GB. They put a hash index on a low-cardinality flag column and lose range query support without gaining anything.

The truth is that PostgreSQL ships with a half-dozen index types, and each one is built for a specific data shape. GIN inverts your documents so a single keyword lookup is one disk seek. GiST organizes overlapping shapes so a "find all polygons near this point" query traces a tree instead of scanning the world. Hash gives you O(1) equality lookups with a smaller footprint than B-tree. BRIN summarizes huge sequential tables in kilobytes instead of gigabytes — perfect for append-only event logs and time-series data.

Picking the wrong index does not just leave performance on the table — it actively wastes disk, slows writes, and confuses the planner into choosing sequential scans over your carefully built structure. Picking the right index turns 8-second queries into 8-millisecond queries, often with a 50x smaller index file.

In Lesson 9.4, I break down the four "other" index types in PostgreSQL — GIN, GiST, Hash, and BRIN — with the data shapes they are built for, the operators they accelerate, the trade-offs in size and write cost, and runnable examples for each.

Read the full lesson -> [link]

#PostgreSQL #Database #Indexing #FullTextSearch #BackendDevelopment #InterviewPrep #QueryPerformance

GIN, GiST, Hash, BRIN thumbnail

What You'll Learn

Why B-tree is not the only index type, and the data shapes that defeat it
How GIN inverts composite values (JSONB, arrays, tsvector) for fast containment lookups
How GiST organizes overlapping ranges and geometric shapes via a balanced tree of bounding boxes
When Hash indexes beat B-tree for pure equality, and the trade-offs you accept in return
How BRIN summarizes huge sequentially-correlated tables in kilobytes instead of gigabytes
The size, build time, and write amplification trade-offs between the four types
Which operators each index type accelerates and how to verify with EXPLAIN
How MySQL differs (mostly B-tree, with FULLTEXT and SPATIAL as separate beasts)

The Library Analogy — Four Different Catalogs for Four Different Questions

Picture a city library. The librarians keep multiple catalogs because no single catalog can answer every question efficiently.

The first catalog is the alphabetical card catalog — one card per book, sorted by title. This is the B-tree. It is perfect when you know the title and want to look it up, or when you want every title between "M" and "P". But if you walk in and ask "which books mention dragons?", the alphabetical catalog is useless — you would have to read every card.

For that question, the librarians keep a subject index: one entry per topic, and under each topic, a list of every book that mentions it. "Dragons -> Book 14, Book 87, Book 203." This is exactly what GIN (Generalized Inverted iNdex) does for JSON keys, array elements, and full-text words. One lookup gives you every row that contains the value.

For "find all books shelved within 30 feet of this map pin", the librarians keep a floor map index — a hierarchy of bounding boxes that covers every shelf, then every section, then every aisle. Walking the tree narrows the search to the relevant area in a few hops, instead of pacing the whole library. This is GiST (Generalized Search Tree), used for geometric queries and overlapping ranges.

For "is the book with ISBN 978-0-123 here, yes or no?", the librarians keep a hash slip box: drop the ISBN into a hash function, jump straight to the slot, done. No range queries, no ordering — just the fastest possible single-key lookup. This is the Hash index.

And for the basement archive of fifty years of newspapers — millions of issues, all chronological — the librarians do not bother indexing every issue. They just write on each shelf "1987 January through 1987 March". To find an article from February 1987, you walk to that shelf and search. This is BRIN (Block Range INdex): a tiny summary of which value range lives in which physical block, perfect when the data is naturally sorted on disk.

+---------------------------------------------------------------+
|           FOUR INDEXES, FOUR DIFFERENT JOBS                   |
+---------------------------------------------------------------+
|                                                                |
|  B-TREE   -> "title = 'Hobbit'" or "year BETWEEN 1990 AND 2000"|
|              Sorted, balanced, range-friendly. The default.    |
|                                                                |
|  GIN      -> "tags @> ARRAY['sale']" or "doc @@ 'postgres'"    |
|              Inverted: one entry per element, lists rows.      |
|              JSONB, arrays, tsvector, trigrams.                |
|                                                                |
|  GiST     -> "polygon && ST_MakeEnvelope(...)"                 |
|              Tree of bounding boxes. Geometric, ranges, KNN.   |
|                                                                |
|  HASH     -> "session_id = 'abc123'"                           |
|              Pure equality, O(1), smaller than B-tree.         |
|              No ordering, no ranges, no LIKE.                  |
|                                                                |
|  BRIN     -> "logged_at BETWEEN '2024-01-01' AND '2024-01-02'" |
|              Tiny per-block summary. Huge correlated tables.   |
|                                                                |
+---------------------------------------------------------------+

The mental model: the index type follows the data shape and the query shape. Composite values with containment queries -> GIN. Spatial or overlapping ranges -> GiST. Equality on opaque keys -> Hash. Massive append-only tables with naturally sorted data -> BRIN. Everything else -> B-tree.

GIN — The Inverted Index for JSONB, Arrays, and Full-Text Search

GIN stands for Generalized Inverted iNdex. It does what its name suggests: instead of storing one entry per row, it stores one entry per element inside the row, and each entry points to a list of row IDs containing that element. A row with tags = ['sale', 'new', 'featured'] produces three GIN entries: one for sale, one for new, one for featured. A query like WHERE tags @> ARRAY['sale'] does a single GIN lookup on sale and immediately gets the list of matching rows.

GIN is the right tool when:

You store JSONB and query with @>, ?, ?&, or ?|
You store arrays and query with @>, &&, or <@
You run full-text search with tsvector @@ tsquery
You enable the pg_trgm extension and want fast LIKE '%substring%' queries

The trade-off: GIN indexes are expensive to update. Every insert or update has to touch one index entry per element in the value. A document with 200 keys produces 200 index updates. PostgreSQL mitigates this with the gin_pending_list_limit and the fastupdate option, which buffers updates in a pending list and flushes them lazily, but write-heavy workloads will still feel the cost.

-- Example 1: GIN on JSONB for product attribute search.
-- We want fast "find all products tagged 'sale' AND priced under $50".

CREATE TABLE products (
  id        BIGSERIAL PRIMARY KEY,
  name      TEXT NOT NULL,
  price     NUMERIC(10, 2) NOT NULL,
  -- Attributes is an opaque bag of key/value pairs that varies per product.
  -- A clothing item might have {size, color}; a laptop might have {ram, cpu}.
  attributes JSONB NOT NULL
);

-- Seed a few rows so the example is runnable end to end.
INSERT INTO products (name, price, attributes) VALUES
  ('Red Tee',    19.99, '{"size":"M","color":"red","tags":["sale","new"]}'),
  ('Blue Jeans', 49.99, '{"size":"32","color":"blue","tags":["sale"]}'),
  ('Laptop',    999.00, '{"ram":"16GB","cpu":"M2","tags":["featured"]}'),
  ('Mug',        12.50, '{"color":"white","tags":["new","clearance"]}');

-- Build a GIN index on the entire JSONB column.
-- The default operator class supports @>, ?, ?&, and ?| operators.
CREATE INDEX idx_products_attrs_gin ON products USING GIN (attributes);

-- Query: find every product whose tags array contains 'sale'.
-- The @> operator means "contains". GIN handles this in one lookup.
EXPLAIN ANALYZE
SELECT id, name, price
FROM products
WHERE attributes @> '{"tags":["sale"]}';

-- Sample output (abbreviated):
-- Bitmap Heap Scan on products  (cost=8.01..14.50 rows=2 width=...)
--   Recheck Cond: (attributes @> '{"tags": ["sale"]}'::jsonb)
--   ->  Bitmap Index Scan on idx_products_attrs_gin
--         Index Cond: (attributes @> '{"tags": ["sale"]}'::jsonb)
-- Returns: Red Tee (19.99), Blue Jeans (49.99)

-- Tip: for a more compact index that ONLY supports @>, use jsonb_path_ops:
--   CREATE INDEX ... USING GIN (attributes jsonb_path_ops);
-- It is roughly half the size and faster for containment, but loses the
-- ? / ?& / ?| operators.

For full-text search, GIN over a tsvector column (often generated from to_tsvector('english', body)) is the standard pattern. A WHERE doc_tsv @@ to_tsquery('english', 'postgres & index') query becomes a tiny intersection of two posting lists instead of a full table scan.

GiST — The Generalized Search Tree for Geometry, Ranges, and Nearest-Neighbor

GiST stands for Generalized Search Tree. Where B-tree is a tree of sorted keys and GIN is an inverted index of elements, GiST is a balanced tree of bounding predicates. Each internal node says "everything below me is contained inside this bounding box." A query walks down the branches whose bounding boxes overlap the search predicate and ignores the rest. The structure is generic enough that PostGIS, range types, trigrams, and even custom data types all use GiST under the hood.

GiST is the right tool when:

You store geometry or geography (PostGIS) and query with &&, ST_Intersects, or KNN ordering
You store range types (int4range, tsrange) and need overlap (&&) or containment (@>) queries
You need nearest-neighbor search (ORDER BY column <-> point LIMIT 10)
You want to enforce an exclusion constraint like "no two bookings can overlap for the same room"

The trade-off compared to B-tree: GiST is slower for exact-match queries and produces a lossy first pass — it returns a candidate set, and PostgreSQL rechecks the predicate on each row. For pure equality on a scalar column, B-tree wins. For overlapping or spatial predicates, GiST is the only practical option.

-- Example 2: GiST on a tsrange to enforce non-overlapping room bookings.
-- A meeting room can only be reserved by one person at a time. We use a
-- GiST exclusion constraint to make double-booking a database error.

CREATE EXTENSION IF NOT EXISTS btree_gist;  -- lets us mix scalar + range

CREATE TABLE room_bookings (
  id       BIGSERIAL PRIMARY KEY,
  room_id  INT NOT NULL,
  -- The booking window is a single tsrange value, e.g. [10:00, 11:00).
  during   TSRANGE NOT NULL,
  -- The exclusion constraint says: no two rows may have the same room_id
  -- AND overlapping 'during' ranges. PostgreSQL enforces this with a GiST
  -- index because GiST understands the && (overlap) operator on tsranges.
  EXCLUDE USING GIST (room_id WITH =, during WITH &&)
);

-- First booking: succeeds.
INSERT INTO room_bookings (room_id, during)
VALUES (101, '[2026-04-14 10:00, 2026-04-14 11:00)');

-- Second booking, same room, overlapping window: fails.
INSERT INTO room_bookings (room_id, during)
VALUES (101, '[2026-04-14 10:30, 2026-04-14 11:30)');
-- ERROR:  conflicting key value violates exclusion constraint
--         "room_bookings_room_id_during_excl"
-- DETAIL: Key (room_id, during)=(101, ["2026-04-14 10:30:00","2026-04-14 11:30:00"))
--         conflicts with existing key
--         (101, ["2026-04-14 10:00:00","2026-04-14 11:00:00")).

-- Same room, non-overlapping: succeeds.
INSERT INTO room_bookings (room_id, during)
VALUES (101, '[2026-04-14 11:00, 2026-04-14 12:00)');

-- Query: which bookings overlap a given window? GiST answers in log time.
EXPLAIN ANALYZE
SELECT id, room_id, during
FROM room_bookings
WHERE during && '[2026-04-14 10:45, 2026-04-14 11:15)'::tsrange;

-- Sample output (abbreviated):
-- Index Scan using room_bookings_room_id_during_excl on room_bookings
--   Index Cond: (during && '[2026-04-14 10:45:00, 2026-04-14 11:15:00)'::tsrange)
-- Returns: id=1 (10:00-11:00) and id=3 (11:00-12:00) if 11:00 boundary overlaps;
-- with [11:00,12:00) and (...,11:15) the overlap is computed correctly.

For PostGIS, the same pattern applies with CREATE INDEX ... USING GIST (geom) and queries like WHERE geom && ST_MakeEnvelope(...) or ORDER BY geom <-> ST_MakePoint(...) LIMIT 10. The KNN ordering (<->) is one of GiST's killer features — it walks the tree in distance order and stops as soon as LIMIT is satisfied.

Hash — Equality-Only, Smaller Than B-Tree

Hash indexes do exactly one thing: they map a key through a hash function to a bucket, and the bucket points at the matching rows. There is no ordering, no range support, no LIKE, no ORDER BY acceleration. In exchange for that narrowness, hash indexes are typically smaller than B-tree for the same column and offer O(1) average lookup instead of O(log n).

Before PostgreSQL 10, hash indexes were not WAL-logged, which made them unsafe for crash recovery and effectively unused. Since PostgreSQL 10 they are crash-safe and replicated, and they are a legitimate choice when:

You only ever query the column with = (no <, >, BETWEEN, LIKE, or ORDER BY)
The column holds long, opaque values (UUIDs, session tokens, content hashes) where B-tree would store the full key in every internal node
You want to save disk space on a very large lookup table

The trade-off: you lose every operator that is not equality. If a single query later needs to sort or range-scan that column, the hash index cannot help. Most teams default to B-tree for this reason — it is "good enough at equality" and stays useful when requirements change. Hash is a deliberate optimization, not a default.

-- Example 3: Hash index on a session token column.
-- Sessions are looked up only by exact token from a cookie. We never sort
-- or range-scan tokens, so a hash index gives us a smaller footprint.

CREATE TABLE sessions (
  token       TEXT PRIMARY KEY,
  user_id     BIGINT NOT NULL,
  expires_at  TIMESTAMPTZ NOT NULL
);

-- The PRIMARY KEY already creates a B-tree on token. For demo purposes,
-- we drop it and replace with a hash index to compare sizes.
ALTER TABLE sessions DROP CONSTRAINT sessions_pkey;
CREATE INDEX idx_sessions_token_hash ON sessions USING HASH (token);

-- Seed a few rows.
INSERT INTO sessions (token, user_id, expires_at) VALUES
  ('a8f3c1d2e4b5f6a7', 42, now() + interval '1 day'),
  ('9b2e7a1c3d4f5e60', 17, now() + interval '2 hours'),
  ('ff0099aabbccdd11',  3, now() + interval '7 days');

-- Equality lookup: hash index is consulted directly.
EXPLAIN ANALYZE
SELECT user_id, expires_at
FROM sessions
WHERE token = 'a8f3c1d2e4b5f6a7';

-- Sample output (abbreviated):
-- Index Scan using idx_sessions_token_hash on sessions
--   Index Cond: (token = 'a8f3c1d2e4b5f6a7'::text)
-- Returns: user_id=42, expires_at=...

-- Range query: hash index CANNOT help. The planner falls back to a Seq Scan.
EXPLAIN ANALYZE
SELECT user_id
FROM sessions
WHERE token > 'a' AND token < 'b';

-- Sample output (abbreviated):
-- Seq Scan on sessions
--   Filter: ((token > 'a'::text) AND (token < 'b'::text))
-- Note: no index involved. This is the cost of choosing hash over B-tree.

The rule of thumb: use hash only if you are sure the column will never be queried any other way. If in doubt, use B-tree.

BRIN — Block Range Index for Huge Sequential Tables

BRIN stands for Block Range INdex. Instead of storing one entry per row, BRIN stores one entry per range of physical blocks (typically 128 blocks, or about 1 MB on default 8 KB pages). Each entry summarizes the minimum and maximum values in that block range. To answer WHERE logged_at BETWEEN x AND y, PostgreSQL walks the tiny BRIN summary, finds block ranges whose min/max overlap the query window, and scans only those blocks.

The win is dramatic on the right data. A B-tree on a billion-row event table can easily occupy 30 GB. The equivalent BRIN index might be 2 MB. That is not a typo — BRIN is roughly four orders of magnitude smaller. It is also extremely cheap to build and to maintain on inserts, because new rows usually land in the latest block range and only the latest summary entry needs updating.

The catch is that BRIN only works when the physical row order is correlated with the indexed value. Time-series data inserted in chronological order is the canonical fit: created_at increases monotonically, so each block range covers a tight time window, and a query for "events from yesterday" touches only a handful of block ranges. If the data is unordered — random UUIDs, hash-partitioned ingestion — every block range covers the full value domain, and BRIN degenerates into a full table scan.

-- Example 4: BRIN on a huge append-only event log.
-- Events stream in chronologically. We query by time window. A B-tree
-- would be enormous; a BRIN summary fits in kilobytes.

CREATE TABLE events (
  id          BIGSERIAL PRIMARY KEY,
  occurred_at TIMESTAMPTZ NOT NULL,
  user_id     BIGINT NOT NULL,
  event_type  TEXT NOT NULL,
  payload     JSONB
);

-- Simulate 1 million chronological events spread across one year.
-- Because we insert in time order, the physical row order matches
-- occurred_at — exactly the precondition BRIN needs.
INSERT INTO events (occurred_at, user_id, event_type, payload)
SELECT
  '2025-01-01'::timestamptz + (i * interval '30 seconds'),
  (random() * 10000)::bigint,
  (ARRAY['click', 'view', 'purchase'])[1 + (i % 3)],
  jsonb_build_object('seq', i)
FROM generate_series(1, 1000000) AS i;

-- Build the BRIN index. The pages_per_range option tunes the trade-off
-- between index size (smaller with larger ranges) and selectivity
-- (better with smaller ranges). 128 is the default.
CREATE INDEX idx_events_occurred_brin
  ON events USING BRIN (occurred_at) WITH (pages_per_range = 64);

-- Compare sizes (run after VACUUM ANALYZE events;)
SELECT
  pg_size_pretty(pg_relation_size('events'))                       AS table_size,
  pg_size_pretty(pg_relation_size('idx_events_occurred_brin'))     AS brin_size;

-- Sample output:
--   table_size | brin_size
--   -----------+-----------
--   88 MB      | 48 kB
-- The BRIN summary is roughly 0.05% the size of the table.

-- Range query: BRIN narrows the scan to a tiny slice of block ranges.
EXPLAIN ANALYZE
SELECT count(*)
FROM events
WHERE occurred_at BETWEEN '2025-06-01' AND '2025-06-02';

-- Sample output (abbreviated):
-- Aggregate
--   ->  Bitmap Heap Scan on events
--         Recheck Cond: (occurred_at >= '2025-06-01' AND occurred_at < '2025-06-02')
--         Rows Removed by Index Recheck: ~1500
--         ->  Bitmap Index Scan on idx_events_occurred_brin
--               Index Cond: (occurred_at >= '2025-06-01' AND occurred_at < '2025-06-02')
-- count = 2880  (one event every 30 seconds for 24 hours)

The two killer use cases for BRIN are append-only event logs (clickstream, audit logs, metrics) and time-series tables (IoT readings, financial ticks). On those workloads BRIN gives you 95% of the query speedup of a B-tree at less than 1% of the disk cost.

MySQL Differences

MySQL's index landscape is much narrower than PostgreSQL's:

B-tree is the default and only general-purpose index in InnoDB. Every secondary index is a B-tree.
HASH indexes exist only in the in-memory MEMORY engine (rare in production) and the InnoDB adaptive hash index, which is automatic and not user-controlled.
FULLTEXT indexes provide full-text search on CHAR, VARCHAR, and TEXT columns, used with MATCH ... AGAINST. Conceptually similar to GIN over tsvector but with a separate query syntax.
SPATIAL (R-tree) indexes support geometric data via POINT, POLYGON, etc. Conceptually similar to PostgreSQL's GiST for spatial queries, but limited to spatial types only.

There is no MySQL equivalent to BRIN, no GIN-style inverted index for arbitrary JSON containment, and no GiST-style exclusion constraints. If your workload depends on those features, PostgreSQL is the natural choice.

Putting It Together — Choosing an Index Type

+---------------------------------------------------------------+
|           DECISION FLOW                                        |
+---------------------------------------------------------------+
|                                                                |
|  Is the column JSONB / array / tsvector / trigram?             |
|       YES -> GIN                                               |
|       NO  -> next                                              |
|                                                                |
|  Is the column geometric / range / overlap-checked?            |
|       YES -> GiST  (or SP-GiST for partitioned spaces)         |
|       NO  -> next                                              |
|                                                                |
|  Is the table huge AND naturally sorted on this column?        |
|       YES -> BRIN                                              |
|       NO  -> next                                              |
|                                                                |
|  Are queries strictly "= ?" with no ranges / sorts / LIKE?     |
|       YES -> Hash  (only if you are SURE)                      |
|       NO  -> B-tree (the default for everything else)          |
|                                                                |
+---------------------------------------------------------------+

The default should always be B-tree. Reach for GIN, GiST, Hash, or BRIN only when the data shape and query shape both call for it. Picking the right index type is one of the highest-leverage tuning decisions you can make in PostgreSQL — and one of the easiest interview signals that a candidate has worked on real production systems.

Napkin AI Visual Prompt: "Dark navy gradient (#0a0f1f -> #111a2e). Four-quadrant comparison panel in white monospace. Top-left quadrant 'GIN': sky blue (#4fc3f7) inverted-list diagram showing one keyword pointing to many row IDs, captioned 'JSONB / arrays / full-text'. Top-right quadrant 'GiST': overlapping bounding boxes in rose (#ff5c8a) with a search arrow tracing the tree, captioned 'Geometry / ranges / KNN'. Bottom-left quadrant 'HASH': a single sky blue arrow from a key into a bucket, captioned 'Equality only, smaller'. Bottom-right quadrant 'BRIN': a long horizontal bar split into rose-tinted ranges with min/max labels, captioned 'Huge sorted tables, kilobytes'. A central diamond labeled 'Pick by data shape' with arrows to all four. Subtle grid overlay."

Common Mistakes

1. Using B-tree on a JSONB column and wondering why containment is slow. A B-tree on a JSONB column can only answer queries on the whole value: WHERE attributes = '{...}'. It cannot accelerate attributes @> '{"tag":"sale"}' because the right-hand side is a sub-document, not the full value. The fix is CREATE INDEX ... USING GIN (attributes) (or GIN (attributes jsonb_path_ops) for the most compact form). Always check EXPLAIN to verify the planner is using the GIN index — a "Seq Scan" on a containment query is the telltale sign you forgot to switch index types.

2. Putting BRIN on an unordered column. BRIN only works when the physical row order is correlated with the indexed value. If you BRIN-index a user_id column on a table where rows are inserted in random order, every block range will contain the full range of user IDs, so every query will have to scan every block range. BRIN will look like it is doing nothing, because it effectively is. Check the correlation with SELECT correlation FROM pg_stats WHERE tablename = 'events' AND attname = 'occurred_at'; — values close to 1.0 or -1.0 are good candidates, values near 0 are not.

3. Choosing Hash because "O(1) is faster than O(log n)". The constant factors and cache behavior of B-tree make it competitive with Hash for almost all real workloads, and Hash gives up range queries, sorting, and LIKE support entirely. The moment a single query needs to sort by the column or scan a range, your hash index is dead weight and you have to add a B-tree anyway. Default to B-tree. Reach for Hash only when you have profiled the workload and confirmed that (a) the column is queried only by equality and (b) the size savings actually matter.

4. Forgetting that GIN inserts are slow. A row with a JSONB value containing 200 keys produces ~200 GIN index entries on insert. On a write-heavy ingestion workload this can dominate transaction time. Mitigations: enable fastupdate = on (the default) so updates buffer into a pending list and flush lazily, raise gin_pending_list_limit for bigger batches, or move ingestion through a staging table that gets indexed in bulk. Do not GIN-index a column you write to constantly unless you have measured the cost.

5. Not using btree_gist when you need a scalar + range exclusion. GiST out of the box does not know how to handle equality on plain integers, so you cannot write EXCLUDE USING GIST (room_id WITH =, during WITH &&) without first running CREATE EXTENSION btree_gist;. The error message is unhelpful ("data type integer has no default operator class for access method gist") and trips up everyone who tries this for the first time.

Interview Questions

1. "Walk me through when you would choose GIN over a B-tree index. What does the storage layout look like, and what does it cost on writes?"

GIN is the right choice whenever you query a composite column — JSONB, an array, a tsvector, or a trigram set — with operators that look inside the value rather than at the value as a whole. The classic examples are jsonb @> '{"tag":"sale"}', arr && ARRAY[1,2,3], and tsv @@ to_tsquery('postgres & index'). A B-tree cannot help with any of these because it only indexes the entire value. GIN inverts the structure: instead of one entry per row, it stores one entry per element inside the row, and each entry holds a posting list of row IDs that contain that element. A query for tag = 'sale' becomes a single lookup that returns the entire posting list. The trade-off is on writes: every insert or update has to add or remove one entry per element in the new value, so a JSONB document with 200 keys generates 200 index updates. PostgreSQL softens this with the fastupdate pending-list mechanism, which buffers updates and flushes them lazily, but write-heavy workloads still pay a real cost. The right way to think about GIN is "great for read-mostly composite data, painful for high-throughput ingestion."

2. "What is BRIN, and on what kind of table does it actually win? How big is a BRIN index compared to a B-tree on the same column?"

BRIN — Block Range INdex — stores a tiny summary entry for every range of physical blocks, typically 128 blocks per range. Each entry holds the minimum and maximum value of the indexed column within that block range. To answer a range query, PostgreSQL walks the BRIN summary, finds block ranges whose min/max overlaps the query window, and scans only those blocks instead of the whole table. The size difference is dramatic: a B-tree on a billion-row event table might be 30 GB, while the equivalent BRIN index is roughly 2 MB. The catch is that BRIN only wins when the physical row order is correlated with the indexed value — when the data is naturally sorted on disk. Append-only event logs, time-series metrics, and audit tables where rows arrive in chronological order are the canonical fit. If you BRIN-index a column where the values are randomly distributed across blocks, every block range will summarize the full value domain and the index will provide no selectivity at all. The check is SELECT correlation FROM pg_stats — values near 1 or -1 mean BRIN will work, values near 0 mean it will not.

3. "Why would you ever choose a Hash index over a B-tree, given that B-tree also handles equality?"

You would choose Hash for one reason: you are absolutely certain the column will never be queried with anything other than =, and you want the smaller footprint and slightly faster pure-equality lookup. Hash indexes do not support range queries, ordering, LIKE, or any operator other than equality. For long opaque keys like UUIDs, content hashes, or session tokens — where B-tree would store the full value at every internal node — Hash can be noticeably smaller. Since PostgreSQL 10 hash indexes are also WAL-logged and crash-safe, which they were not before, so they are a legitimate option in modern PostgreSQL. That said, the default should always be B-tree because the moment a future query needs to sort or range-scan that column, the hash index becomes useless and you have to add a B-tree anyway. Hash is a deliberate optimization for stable, equality-only access patterns, not a general default. Most teams I have worked with never use it — B-tree is "good enough" and stays useful as requirements change.

4. "How do you enforce 'no two bookings can overlap for the same room' at the database level? What index type makes that possible?"

You use a GiST exclusion constraint. PostgreSQL's EXCLUDE USING GIST (room_id WITH =, during WITH &&) clause says: no two rows may have the same room_id AND overlapping during ranges. The constraint is enforced by a GiST index because GiST understands the && (overlap) operator on range types — B-tree only understands equality and ordering, so it cannot express "these two ranges overlap". The one gotcha is that GiST does not natively support = on plain scalar columns, so you have to run CREATE EXTENSION btree_gist first, which teaches GiST to handle equality on integers and text. Once that is in place, the exclusion constraint behaves exactly like a unique constraint, except the conflict rule is "overlapping ranges" instead of "identical values". Any insert or update that would create an overlap fails with a constraint violation, and you get bulletproof scheduling logic at the database layer instead of having to coordinate it across application code.

5. "Compare GIN, GiST, and B-tree for full-text search. Which one would you pick and why?"

For full-text search on a tsvector column, the realistic options are GIN and GiST. B-tree is useless because it cannot index the individual lexemes inside a tsvector. GIN is the right default for almost every read-heavy full-text workload: it stores one posting list per lexeme, so a tsv @@ to_tsquery('postgres & index') query becomes a tiny intersection of two posting lists and runs in milliseconds even on millions of documents. GiST on a tsvector is lossy — it stores a signature of each document and produces candidate matches that PostgreSQL has to recheck against the actual row, so queries are slower and less precise. GiST's only advantage is faster updates, because it touches one signature per row instead of one entry per lexeme. The rule of thumb: pick GIN unless your write throughput is so high that GIN's per-element update cost is the bottleneck, in which case GiST is the fallback. In practice, most teams pick GIN, and the few that hit GIN's write ceiling solve it by batching ingestion through a staging table rather than switching to GiST.

Quick Reference — Cheat Sheet

+---------------------------------------------------------------+
|           INDEX TYPE DECISION TABLE                            |
+---------------------------------------------------------------+
|                                                                |
|  TYPE   | BEST FOR                | OPERATORS                  |
|  -------|-------------------------|----------------------------|
|  B-tree | Equality + ranges       | =, <, >, BETWEEN, LIKE 'a%'|
|         | Sorting, uniqueness     | ORDER BY, IS NULL          |
|         | Default for everything  |                            |
|  -------|-------------------------|----------------------------|
|  GIN    | JSONB containment       | @>, ?, ?&, ?|              |
|         | Array containment       | @>, &&, <@                 |
|         | Full-text search        | @@ (tsvector / tsquery)    |
|         | Trigram LIKE            | LIKE '%substring%'         |
|  -------|-------------------------|----------------------------|
|  GiST   | Geometric / spatial     | &&, ST_Intersects, <->     |
|         | Range types             | &&, @>, <@                 |
|         | Exclusion constraints   | EXCLUDE USING GIST (...)   |
|         | KNN / nearest neighbor  | ORDER BY col <-> point     |
|  -------|-------------------------|----------------------------|
|  Hash   | Equality only           | = (and nothing else)       |
|         | Long opaque keys        |                            |
|  -------|-------------------------|----------------------------|
|  BRIN   | Huge sorted tables      | =, <, >, BETWEEN           |
|         | Time-series / logs      | (only if correlated)       |
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           SIZE AND TRADE-OFF SUMMARY                          |
+---------------------------------------------------------------+
|                                                                |
|  TYPE   | SIZE       | WRITE COST | NOTES                     |
|  -------|------------|------------|---------------------------|
|  B-tree | Medium     | Low        | The default. Always safe. |
|  GIN    | Large      | High       | Posting list per element. |
|  GiST   | Medium     | Medium     | Lossy, requires recheck.  |
|  Hash   | Small      | Low        | Equality only. No sort.   |
|  BRIN   | Tiny       | Very low   | Needs sorted physical row.|
|                                                                |
+---------------------------------------------------------------+

+---------------------------------------------------------------+
|           KEY RULES                                            |
+---------------------------------------------------------------+
|                                                                |
|  1. Default to B-tree. Switch only when data + query demand it.|
|  2. JSONB / array / tsvector / trigram -> GIN                  |
|  3. Geometry / range / exclusion / KNN -> GiST                 |
|  4. Time-series / append-only / sorted on disk -> BRIN         |
|  5. Pure equality on opaque keys -> Hash (rarely)              |
|  6. Always EXPLAIN ANALYZE to confirm the index is used        |
|  7. Remember GIN write amplification on high-ingest tables     |
|  8. CREATE EXTENSION btree_gist to mix scalar = with range &&  |
|                                                                |
+---------------------------------------------------------------+

Use Case	Wrong Index	Right Index
`tags @> '["sale"]'` on JSONB	B-tree	GIN
`MATCH ... AGAINST` full-text	B-tree	GIN on tsvector
`LIKE '%substring%'`	B-tree	GIN on `gin_trgm_ops`
Polygon overlap	B-tree	GiST (PostGIS)
Non-overlapping bookings	UNIQUE constraint	GiST exclusion
Nearest 10 points to (x,y)	B-tree	GiST KNN (`<->`)
1 B-row time-series range scan	B-tree (huge)	BRIN (kilobytes)
Session token equality lookup	B-tree (fine)	Hash (smaller)
Range query on session token	Hash (broken)	B-tree

Prev: Lesson 9.3 -- B-Tree Index Next: Lesson 9.5 -- Sorted, Unsorted, Covering Indexes

This is Lesson 9.4 of the Database Interview Prep Course -- 12 chapters, 58 lessons.

GIN, GiST, Hash, BRIN