Google Research published a paper on Tuesday that rattled global semiconductor markets within 24 hours of going live. Samsung shares dropped nearly 5% in South Korea. SK Hynix fell 6%. Japanese flash memory company Kioxia dropped close to 6%. Micron and Sandisk fell in US trading. Billions of dollars in market capitalisation evaporated across the memory chip sector in two days.
The cause was a compression algorithm called TurboQuant, and the reaction, while understandable, was based on a partial reading of what the technology actually does.
Here is the full picture.
What TurboQuant Is
TurboQuant is a vector quantization algorithm developed by researchers at Google Research, Google DeepMind, KAIST, and New York University. It will be formally presented at the International Conference on Learning Representations (ICLR 2026) in Rio de Janeiro next month.
The problem it solves is specific: the key-value (KV) cache bottleneck in large language model inference.
To understand why that matters, you need to understand what the KV cache is. When an AI model processes a conversation (your messages, its responses, the full context of everything said), it stores all of that context in a high-speed working memory called the KV cache. This cache grows with every token the model processes. For a model handling a long document, a complex coding task, or an extended multi-turn conversation, the KV cache can consume enormous amounts of GPU memory.
This is one of the primary reasons running frontier AI models is expensive. A single inference server running a large model needs vast amounts of High Bandwidth Memory (HBM) chips, not primarily for the model weights, but to maintain the KV cache for simultaneous users. The more users, the more concurrent caches, the more memory required.
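As a rough illustration, here is how quickly that working memory grows with context length and concurrent users. The model dimensions below are assumed for the sake of the example; they are not figures from the paper or any specific production model:

```python
# Back-of-the-envelope KV cache sizing for a hypothetical transformer.
# The dimensions are illustrative only; they are not taken from the
# TurboQuant paper or any specific production model.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Size of the key + value cache for one conversation, in bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # keys and values
    return per_token * context_len

# Assumed 70B-class configuration with grouped-query attention.
one_chat = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, context_len=128_000)
print(f"one 128k-token conversation: {one_chat / 1e9:.1f} GB")   # ~42 GB at 16 bits
print(f"32 concurrent conversations: {32 * one_chat / 1e9:.1f} GB")
```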
Traditional approaches to reducing the KV cache size use quantization, compressing the 16-bit floating point values in the cache down to smaller integers. But conventional quantization methods have a problem: they require storing additional "quantization constants" alongside the compressed data to enable accurate decompression. These constants typically add 1 to 2 extra bits per value, partially cancelling out the compression gains.
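A minimal sketch of the kind of conventional scheme being described, using a generic per-group scale and offset, shows where that overhead comes from. This is illustrative only, not the specific baseline the paper benchmarks against:

```python
import numpy as np

# Generic per-group integer quantization. Illustrative only; not the
# specific baseline scheme the TurboQuant paper compares against.

def quantize_group(x, bits=4):
    """Quantize one group of values, returning the codes plus the
    scale and offset that must also be stored to decompress them."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1) or 1.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return codes * scale + lo

x = np.random.randn(32).astype(np.float32)
codes, scale, lo = quantize_group(x)
# Payload: 32 values * 4 bits = 16 bytes of codes, plus two float32
# constants (8 bytes) for the group, i.e. 2 extra bits per value of overhead.
print("max reconstruction error:", np.abs(dequantize_group(codes, scale, lo) - x).max())
```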
TurboQuant eliminates this overhead through a two-stage process.
Stage 1 — PolarQuant: Instead of storing vectors in standard Cartesian coordinates (X, Y, Z), PolarQuant converts them into polar coordinates, a magnitude and a set of angles. This geometric transformation makes the data's structure highly predictable, enabling aggressive compression without the precision loss that normally accompanies it.
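To make the coordinate change concrete, here is a minimal two-dimensional sketch. The real algorithm works on high-dimensional KV vectors and quantizes the resulting representation; this only shows the transformation itself:

```python
import numpy as np

# Cartesian -> polar change of coordinates, reduced to 2D for readability.
# PolarQuant itself operates on high-dimensional KV vectors; this sketch
# only illustrates why the polar form is easier to quantize: the angle is
# confined to a fixed, bounded range.

def to_polar(v):
    r = np.linalg.norm(v)                 # magnitude
    theta = np.arctan2(v[1], v[0])        # angle, always in [-pi, pi]
    return r, theta

def from_polar(r, theta):
    return np.array([r * np.cos(theta), r * np.sin(theta)])

v = np.array([0.3, -1.2])
r, theta = to_polar(v)
print(from_polar(r, theta))               # recovers [0.3, -1.2]
```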
Stage 2 — QJL Error Correction: The tiny amount of residual error left after PolarQuant is corrected using a 1-bit Johnson-Lindenstrauss projection, a classical mathematical technique that preserves the relative distances between data points in high-dimensional space. By reducing each error value to a simple sign bit (+1 or -1), this stage eliminates bias from the compression without requiring any additional stored constants.
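The sign-bit idea can also be illustrated in a few lines. The sketch below uses the classical result that, for Gaussian random projections, the fraction of disagreeing sign bits estimates the angle between two vectors; it is a simplified illustration of why 1-bit sketches preserve geometry, not the exact estimator used in the paper:

```python
import numpy as np

# 1-bit Johnson-Lindenstrauss-style sketching: project with a shared random
# Gaussian matrix and keep only the signs. The fraction of disagreeing bits
# between two sketches estimates the angle between the original vectors,
# with no per-vector constants stored. Simplified illustration only.

rng = np.random.default_rng(0)
d, m = 128, 4096                       # original dimension, number of sign bits
S = rng.standard_normal((m, d))        # shared random projection

x = rng.standard_normal(d)
y = rng.standard_normal(d) + 0.5 * x   # a vector correlated with x

bits_x = np.sign(S @ x)                # +1/-1 per projection
bits_y = np.sign(S @ y)

angle_est = np.pi * np.mean(bits_x != bits_y)
angle_true = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(f"true angle: {angle_true:.3f} rad, estimate from sign bits: {angle_est:.3f} rad")
```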
The result: TurboQuant compresses KV cache values from 16 bits down to 3 bits, a 6x reduction in memory footprint, with no measurable loss in model accuracy across standard benchmarks including Needle-in-a-Haystack, LongBench, and RULER. On Nvidia H100 GPUs, 4-bit TurboQuant achieves up to 8x faster performance in computing attention scores compared to unquantized keys.
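Plugging the 16-bit and 3-bit figures into the hypothetical cache sizing from earlier gives a sense of the scale. Again, these are illustrative numbers, not results from the paper:

```python
# Same hypothetical 128k-token cache as above: number of cached values
# multiplied by bits per value. Illustrative arithmetic only.
values = 2 * 80 * 8 * 128 * 128_000                      # key and value entries
print(f"16-bit cache: {values * 16 / 8 / 1e9:.1f} GB")   # ~41.9 GB
print(f" 3-bit cache: {values * 3 / 8 / 1e9:.1f} GB")    # ~7.9 GB
```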
The algorithm is also training-free and data-oblivious: it requires no retraining of the model and no dataset-specific tuning. It can be applied to existing deployed models immediately.
Why the Market Reacted the Way It Did
The investor logic was straightforward: if AI models can run on 6x less memory, demand for the HBM chips that power AI data centres falls sharply. Memory chip manufacturers like Samsung, SK Hynix, and Micron had been riding a multi-year upcycle fuelled almost entirely by AI infrastructure buildout. TurboQuant introduced a credible question mark over how long that cycle continues.
Some drew immediate comparisons to DeepSeek, the Chinese AI model that showed frontier performance was achievable at a fraction of previously assumed training costs, triggering a similar market shock in January 2025. Cloudflare CEO Matthew Prince called TurboQuant explicitly "Google's DeepSeek moment." The internet, noting that TurboQuant is in effect a compression algorithm that exceeded expectations for efficiency while losing no measurable accuracy, compared it to Pied Piper, the fictional startup from HBO's Silicon Valley whose middle-out compression algorithm was the show's central MacGuffin.
The comparison is apt in one direction and imprecise in another.
Why the Market Reaction Was Partially Overstated
TurboQuant is genuinely significant. The market's two-day reaction was understandable but not fully calibrated to what the technology actually does and does not affect.
It targets inference, not training. TurboQuant compresses the KV cache, which is used during inference, the phase in which a trained model generates responses. It does not address training memory at all. Training large AI models still requires massive amounts of HBM, the memory type that dominates training hardware, to store model weights, gradients, and activations during the learning process, and that demand is untouched by TurboQuant. The memory most directly affected is the standard DRAM used in inference servers, a different market from the HBM that dominates analyst attention.
Jevons Paradox applies here. When efficiency improves, usage tends to expand rather than contract. A server that previously hosted one large model on its full memory allocation can now host six with TurboQuant compression, but the AI industry will not simply run the same models more cheaply. It will run larger and more complex models, process longer contexts, serve more simultaneous users, and build applications that were previously too expensive to contemplate. The efficiency gain unlocks new demand rather than eliminating existing demand. Analysts note this explicitly: AI infrastructure spending is growing at extraordinary rates regardless of TurboQuant, and a technology that reduces memory requirements by 6x does not reduce spending by 6x when memory is only one component of a data centre.
Adoption takes time. TurboQuant's paper will be formally peer-reviewed at ICLR next month. Large-scale deployment in production AI systems follows research publication by months to years. Memory orders for 2026 data centre buildouts are largely already locked in. The near-term market impact is sentiment, not procurement.
The stock bounce has already begun. The initial drops in Samsung, SK Hynix, and Micron shares were partly recovered within 48 hours as analysts pushed back on the "demand destruction" narrative. Wells Fargo analyst Andrew Rocha noted that TurboQuant "directly attacks the cost curve for memory in AI systems" while acknowledging that the demand picture for AI memory remains strong and that compression algorithms have existed for years without fundamentally altering procurement volumes.
What TurboQuant Actually Changes
Despite the nuance around the market reaction, the technology is a genuine advance with real practical consequences.
Lower AI inference costs. Running large language models is expensive primarily because of KV cache memory requirements. A 6x reduction in those requirements translates directly into lower cloud compute bills. For the startups and enterprises running AI inference at scale, this is a meaningful cost reduction event. A company spending Ksh 500,000 per month on AI inference compute might achieve similar throughput for under Ksh 100,000, a transformative shift in unit economics.
Larger models on existing hardware. Hardware that was previously insufficient for frontier-scale models becomes viable with TurboQuant compression. For developers in Kenya and across Africa running AI on constrained budgets (where renting US or European cloud compute in USD is already expensive relative to local revenues) the ability to run more capable models on less hardware is directly relevant.
Longer context windows become affordable. The KV cache grows with context length. TurboQuant's compression makes very long context processing (entire codebases, lengthy legal documents, extended research papers) economically feasible for a much wider range of applications.
Vector search improves at scale. Beyond LLMs, TurboQuant improves vector search, the similarity lookup technology that powers search engines, recommendation systems, and RAG (retrieval-augmented generation) pipelines. Google tested it on the GloVe benchmark and found superior recall ratios without the large codebooks or dataset-specific tuning that competing approaches require. For developers building RAG systems and semantic search tools, this matters immediately.
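For developers who want to sanity-check this kind of claim on their own data, the measurement itself is simple: run the same nearest-neighbour query on full-precision vectors and on compressed ones, and compare the results. The sketch below uses plain sign quantization as a stand-in compressor, since TurboQuant is not yet available as a packaged library, and random synthetic vectors, so the printed number only demonstrates the measurement, not the method:

```python
import numpy as np

# Recall@1 of a quantized vector index against exact full-precision search.
# Sign quantization is a stand-in compressor (not TurboQuant), and the data
# is random, so only the measurement procedure is meaningful here.

rng = np.random.default_rng(1)
corpus = rng.standard_normal((10_000, 256)).astype(np.float32)
queries = rng.standard_normal((100, 256)).astype(np.float32)

def top1(db, q):
    """Index of the highest cosine-similarity corpus vector for each query."""
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q, axis=1, keepdims=True)
    return (q_n @ db_n.T).argmax(axis=1)

exact = top1(corpus, queries)
approx = top1(np.sign(corpus), np.sign(queries))    # 1 bit per dimension
print(f"recall@1 of the quantized index: {np.mean(exact == approx):.2f}")
```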
Google's own products improve. Google Search, YouTube recommendations, and advertising targeting all run on vector search at Google's scale. TurboQuant is not just a research paper, it is production infrastructure for Google's core revenue engine.
The Bigger Picture
TurboQuant fits into a pattern of AI efficiency breakthroughs that are reshaping the economics of the industry faster than the hardware cycle can keep up. DeepSeek showed that training costs could be compressed dramatically through architectural innovation. TurboQuant shows that inference memory can be compressed dramatically through algorithmic innovation. Neither breakthrough reduces the absolute scale of AI infrastructure investment; both reduce the cost per unit of capability, which in turn expands the total amount of capability the market can afford to build and deploy.
The memory chip companies that investors worried about are probably fine. The developers, startups, and enterprises that have been priced out of frontier AI capabilities are definitely better off. The AI industry's resource consumption is not going to shrink; it is going to become significantly more efficient at consuming what it already has, and then use that efficiency to consume more.
TurboQuant will be formally presented at ICLR 2026 in Rio de Janeiro, April 23-27. The full paper is available at arxiv.org/abs/2504.19874.