June 10, 2026 · Andrew Mikhail
turboquant.cpp: near-optimal vector quantization in 400 lines of C++, no training required
Embeddings are the working memory of modern AI systems — and they're heavy. A single 1536-dimensional fp32 embedding is 6 KB. Ten million of them is 60 GB before you've built an index. If retrieval needs to run on a phone, a robot, or anything that isn't a rack of servers, that's your entire RAM budget gone.
Today we're open-sourcing turboquant.cpp: a small C++23 implementation of TurboQuant, an online vector quantization algorithm with near-optimal distortion guarantees. It compresses vectors to 1–4 bits per coordinate — up to 31x smaller than fp32 — with no training phase, no codebook learning, and no batching. Every vector is quantized independently, the moment it arrives.
Why "online" matters
The standard tools for vector compression — product quantization and its descendants — learn codebooks from a representative sample of your data. That works until it doesn't: your distribution shifts, your corpus updates continuously, or you're in a streaming setting (KV caches, on-device personalization, indexes that grow in real time) where there is no "representative sample" to train on, and no budget to retrain.
TurboQuant sidesteps all of it. There is nothing to train, ever.
How it works, in 60 seconds
-
Rotate. Multiply the vector by a random rotation matrix (shared across all vectors, generated from a seed). A classic concentration result says the rotated coordinates look like i.i.d. Gaussians — regardless of the input distribution. This is the trick that eliminates training: instead of learning your data's distribution, the rotation forces every vector into the same known distribution.
-
Quantize each coordinate. Since every coordinate is now approximately N(0,1), the optimal scalar quantizer is a universal constant — the Lloyd–Max codebook for the standard normal. We precompute these for bitwidths 1–4 and bake them into the code. Quantization is a table lookup.
-
Sketch the residual (optional). For unbiased inner-product estimates — what you actually want for retrieval — a second stage quantizes the reconstruction residual with a quantized Johnson–Lindenstrauss sketch at 1 bit per coordinate. That's the difference between the two quantizers in the library:
QuantizerMSEminimizes reconstruction error,QuantizerProdgives unbiased inner products with low variance.
The paper proves distortion within a small constant factor (~2.7x) of the information-theoretic lower bound — for an algorithm that never sees your data in advance, that's remarkably close to the best any method could possibly do.
Numbers
Single-threaded on an Apple M4 Max (bazel run -c opt //:turboquant_benchmark):
| Quantizer | d | bits/coord | quantize/s | inner product/s | compression |
|---|---|---|---|---|---|
| MSE | 1536 | 1 | 10.0k | — | 31.3x |
| MSE | 1536 | 2 | 9.8k | — | 15.8x |
| Prod | 1536 | 2 | 2.9k | 4.2k | 15.7x |
| MSE | 256 | 2 | 319k | — | 15.1x |
| Prod | 256 | 2 | 108k | 162k | 14.2x |
Concretely: a 1536-d OpenAI-style embedding goes from 6,144 bytes to 388 at 2 bits per coordinate, and reconstructed distances stay within the proven bounds.
What's in the repo
The core library is under 400 lines of C++23. One dependency (Eigen, header-only), built with Bazel (Eigen resolves from the Bazel Central Registry — clone and bazel build //..., nothing to install). MIT licensed. The example and benchmark are both single files you can read over coffee.
We're equally upfront about what it doesn't do yet, because the gaps are well-scoped contributions:
- The rotation is a dense d×d matrix, so throughput is O(d²). A randomized Hadamard transform would make it O(d log d) — this is the single biggest win available.
- The quantization loop is scalar; it vectorizes naturally with AVX-512 or NEON.
- Bitwidths above 4 need the Lloyd–Max optimization solved at higher precision.
If any of those sound fun, the issues are open.
Why we built it
EdgeAI's thesis is that models and retrieval should run where the data lives — on the device, not in someone else's datacenter. That only works if the primitives fit in device memory and don't drag a training pipeline behind them. A vector compressor with provable guarantees, zero training infrastructure, and a 400-line implementation is exactly the kind of primitive that makes on-device RAG practical.
Code: github.com/RunEdgeAI/turboquant.cpp Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate