Algorithms / Phase 3E

Product Quantization: Compressing Embeddings Without Killing Search

Product Quantization is not a search index by itself. It is a compression technique that replaces large float vectors with compact codebook IDs, reducing RAM and cache pressure at the cost of spatial precision.

A 1536-dimensional float32 embedding is expensive because every vector carries 6144 bytes before metadata, graph links, or IDs.

That number gets ugly fast. A few thousand vectors are not a problem. Millions of vectors become a memory budget. Tens or hundreds of millions become an architecture decision. Vector databases must store the vector values, search structures, document IDs, delete markers, metadata pointers, and allocator overhead.

Compression matters because retrieval is often limited by memory and cache behavior, not only by distance formulas. Smaller vector representations let more candidates sit closer to the CPU. They also reduce storage cost and make large indexes possible on fewer machines.

First-Principles Purpose

Product Quantization is a compression method. It is not the same thing as IVF clustering or HNSW graphs. IVF routes queries into centroid-owned lists. HNSW walks a proximity graph. PQ changes how vectors are represented in memory.

The goal is to reduce vector footprint so more vectors fit in RAM and cache. PQ can be combined with IVF or other ANN indexes. In large vector systems, it is often the compression layer under a candidate retrieval stage.

PQ does not fix semantic mistakes. It compresses geometry. If the embedding space already maps a query to the wrong region, compression will not infer intent. That is the boundary described in The Semantic Fallacy.

Raw Vector Storage Math

1536 dimensions × 4 bytes per float32 = 6144 bytes

One million vectors are about 6.14 GB of raw vector values only. Ten million vectors are about 61.44 GB of raw vector values only. That excludes IDs, metadata, indexes, graph links, centroids, deleted flags, locks, and allocator overhead.

Raw vectors are simple and accurate, but they are expensive to move through memory. Even when CPU instructions are fast, memory bandwidth can dominate. PQ attacks that specific cost.

Product Quantization Core Idea

PQ splits one large vector into smaller sub-vectors. It trains a codebook for each sub-vector position. Each codebook contains centroid vectors for that slice of the space. Instead of storing the original float values for every sub-vector, the system stores the ID of the nearest centroid in the matching codebook.

The database stores compact integer codes instead of raw float sub-vectors. At query time, those codes are used to estimate distance. The original vector position is approximated by the selected centroids.

Vector Decomposition Math

Original vector:
1536 dimensions, float32 = 6144 bytes

PQ setup:
96 sub-vectors

Each sub-vector:
1536 / 96 = 16 dimensions

Each sub-vector code:
1 byte

Compressed code size:
96 × 1 byte = 96 bytes

Reduction:
6144 / 96 = 64x

This 64x reduction counts compressed vector codes only. Real systems also store codebooks, IDs, metadata, and index structures. Some systems keep original vectors for reranking. PQ shrinks the candidate scan representation; it does not delete every other memory cost.

Codebooks

Each sub-vector group has its own codebook. A codebook is a small set of centroid vectors trained for one sub-vector position. With a 1-byte code, the system can represent 256 possible centroid IDs. Instead of storing all 16 float values for a sub-vector, it stores one byte that points into the codebook.

The codebook is shared across many database vectors. That shared dictionary is what makes the compression practical. Training quality matters. A poor codebook creates poor approximations and recall loss.

Asymmetric Distance Computation

Asymmetric Distance Computation, or ADC, keeps the query vector uncompressed while database vectors stay compressed. The system precomputes distances from query sub-vectors to codebook centroids. This creates a lookup table for each sub-vector position.

ADC keeps the query sharp and lets the database stay compressed.

That is the asymmetry: query side exact, database side coded. The query is new and small in count. The database is large and repeated. Spend precision on the query. Save memory on the corpus.

Lookup Table Flow

The query is split into the same sub-vector layout used by the database codes. For each query block, the system computes distances to every centroid in the matching codebook. Those distances are stored in lookup tables, often called LUTs.

For each compressed database vector, the engine reads its code IDs. Each ID selects one LUT entry. The estimated distance is the sum of selected entries across all sub-vectors. Full float distance calculation is replaced by table reads and additions.

Recall Loss

PQ is lossy. Original vector positions are approximated by codebook centroids. Nearby vectors can be reordered because their compressed representations distort the exact geometry. Recall can drop, especially when candidates are close together.

More sub-vectors can improve quality because each slice is smaller. Larger codebooks can improve quality because each slice has more centroid choices. Both choices increase memory, training work, or lookup cost. PQ is not free accuracy. It is a controlled memory-for-precision trade.

Oversampling and Reranking Pipeline

  1. Use compressed vectors to retrieve many candidates.
  2. Oversample more than the final top-k.
  3. Fetch original vectors if available.
  4. Rerank candidates using exact distance.
  5. Return final top-k.

PQ is often used to cheaply narrow the candidate set, not always to make the final ranking decision. Oversampling protects recall by letting approximate search return a wider pool. Reranking then repairs the final ordering with original vectors when the system can afford it.

Complexity Table

Method Memory Cost Speed Recall Impact Best Use Case Failure Mode
Raw float32 vectors Highest Heavy memory reads Exact geometry Small or recall-critical corpora RAM and bandwidth cost
float16 compression About half of float32 Better bandwidth Low to moderate Simple precision reduction Numeric precision loss
int8 quantization Low Fast compact math Moderate Hardware-friendly compression Scale and clipping errors
Product Quantization Very low codes LUT reads and sums Can be significant Large ANN candidate scans Codebook approximation error
PQ + reranking Codes plus optional originals Fast first pass, slower final pass Improved final quality Production search with strict top-k quality Rerank fetch latency

Advantages of PQ

  • Drastic memory reduction. Codes can be much smaller than raw float arrays.
  • Better cache behavior. More candidate data fits near the CPU.
  • Lower storage cost. Large vector corpora become cheaper to persist and move.
  • Works well with IVF. IVF narrows lists; PQ makes list scans cheaper.

Disadvantages of PQ

  • Lossy compression. The original geometry is approximated.
  • Recall degradation. Close candidates can be reordered.
  • Codebook training required. Bad training data creates bad codes.
  • Codebooks can go stale. Distribution shifts can reduce quality.
  • Final ranking may need reranking. Original vectors may still be needed.

Production Failure Modes

A bad training sample can produce codebooks that miss important regions. Codebook collapse can happen when centroids fail to cover the diversity of a sub-vector position. Distribution drift changes the vector population after training, leaving old codebooks less representative.

Too few sub-vectors create coarse approximations. Too-small codebooks limit centroid choice. Rare concepts can lose recall because training data underrepresents them. Reranking can repair quality, but fetching too many original vectors can bring back latency pressure. The compression stage and rerank stage must be tuned together.

When to Use PQ

PQ is a good fit when memory is the bottleneck, the dataset is large, approximate results are acceptable, and the system can tolerate some recall loss. It fits IVF-style candidate retrieval especially well because the scan inside selected lists can run over compact codes.

It is also useful when final candidates can be reranked. Use PQ to reduce the search set, then rerank a smaller pool with original vectors. For memory sizing, connect the plan to the out-of-memory failure modes.

When Not to Use PQ

PQ may be weak when exact precision is required, the dataset is small enough to keep raw vectors, recall loss is unacceptable, embeddings update constantly, or original vector geometry must be preserved exactly. In those cases, raw vectors, float16, or a different ANN layout may be the cleaner choice.

Use BST and K-D Tree failure modes, IVF routing, and HNSW graph behavior to decide whether the bottleneck is pruning, routing, traversal, or memory.

Animated SVG Diagram

Product Quantization compression flow A 6144-byte float vector splits into 96 sub-vector blocks. Blocks map into codebook centroids, producing a 96-byte compressed code array. 6144-byte float32 vector 96-byte PQ code array 1536 float32 dimensions 96 sub-vector blocks, shown sampled Codebook Palette 0 1 2 3 4 5 6 7 96 centroid IDs, sampled 64x smaller code representation
PQ replaces raw float sub-vectors with compact codebook IDs. The code array is tiny; the trade is approximation error.

Technical FAQs

Is Product Quantization an index or a compression method?

Product Quantization is a compression method. It is often paired with indexes such as IVF or HNSW, but it is not a complete search index by itself.

Why does PQ reduce recall?

PQ replaces exact sub-vectors with nearest codebook centroids. That approximation can move vector positions and reorder close candidates.

What is Asymmetric Distance Computation?

ADC keeps the query uncompressed while database vectors stay compressed. Distances are estimated through per-query lookup tables against codebook centroids.

Why are lookup tables fast?

Lookup tables turn repeated floating-point sub-vector distance calculations into indexed reads and additions over compact code IDs.

Why is oversampling needed with PQ?

Oversampling gives the system more candidates before final ranking, which reduces the chance that compression error removes the true best result.

When should compressed candidates be reranked with original vectors?

Rerank when final ordering matters, recall targets are strict, or compressed distances are close enough that approximation error can change the top results.