Nvidia unveils Groq 3: company bets on dedicated chips for AI inference
AI-processed from IEEE Spectrum AI; edited by Hamidun News
Nvidia has unveiled Groq 3, the company's first chip designed specifically for AI inference rather than model training. It is an important signal: the market is shifting from a race for ever-larger models to a race over how quickly and cheaply those models can respond to users.
Why Nvidia Is Changing Course
At the GTC conference, Nvidia's CEO announced not only the Vera Rubin lineup but also a separate class of processors for model inference. The Groq 3 LPU is built on technology that Nvidia licensed from the startup Groq at the end of last year. That only about two and a half months passed between the licensing deal and the product announcement shows how quickly demand for inference in data centers is growing.
"Finally AI is capable of doing useful work, and the inference inflection point has already arrived."
Training and inference solve different problems, so they need different hardware. During training, the system processes massive amounts of data over weeks and updates the model weights. During inference, everything happens at the moment of a user's request, and for reasoning models a single session can include multiple internal passes before a human sees the answer. Here the critical factors are not peak FLOPS but latency, a steady flow of data, and predictable token generation time.
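To make the latency argument concrete, here is a minimal sketch of how per-token generation time adds up for a reasoning model. The function name, the parameters, and all the numbers in the usage note are hypothetical and chosen only for illustration; none of them come from Nvidia. The model assumes the first generated token arrives after a fixed startup delay and every later token takes the same amount of time.

```python
def response_timing(ttft_s, per_token_s, hidden_tokens, visible_tokens):
    """Estimate when the user sees output, under a simple constant-rate model.

    ttft_s         -- time until the first generated token (hypothetical value)
    per_token_s    -- time for each further token (hypothetical value)
    hidden_tokens  -- internal reasoning tokens the user never sees
    visible_tokens -- tokens of the final answer (must be at least 1)
    """
    if visible_tokens < 1 or hidden_tokens < 0:
        raise ValueError("need at least one visible token and no negative counts")
    # Token k (1-indexed) arrives at ttft_s + (k - 1) * per_token_s.
    first_visible_s = ttft_s + hidden_tokens * per_token_s
    last_token_s = ttft_s + (hidden_tokens + visible_tokens - 1) * per_token_s
    return {"first_visible_s": first_visible_s, "last_token_s": last_token_s}


if __name__ == "__main__":
    # Same request, two hypothetical per-token speeds (20 ms vs. 2 ms).
    for per_token_s in (0.02, 0.002):
        timing = response_timing(0.5, per_token_s, hidden_tokens=1000, visible_tokens=200)
        print(per_token_s, timing)
```

With these made-up inputs, the user waits roughly 20.5 seconds for the first visible token at 20 ms per token and roughly 2.5 seconds at 2 ms per token. The hidden reasoning tokens dominate the wait, which is why per-token latency matters more than peak compute in this setting.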
How Groq 3 Works
Groq's approach differs from the familiar GPU design. Instead of relying on separate high-speed HBM memory placed next to the graphics processor, the chip uses SRAM built directly into the compute block. This simplifies data movement: data flows through the processor linearly, without extra trips off the chip and back. The architecture gives up generality in exchange, but it wins where the fastest possible response is needed. For inference, where the model generates tokens sequentially rather than computing everything in one large batch, such a design is particularly useful.
The difference shows in the specifications too. The Rubin GPU remains a machine for heavy computation and large models, while Groq 3 was built for a different goal: minimal latency at the decode stage, when the answer is being assembled token by token. In raw compute and memory capacity the LPU is noticeably more modest, but it wins on memory bandwidth and is better suited to the final stage of inference. Nvidia is therefore not replacing the GPU with a new class of chip but complementing it with a specialized accelerator.
- Memory: the Rubin GPU has 288 GB of HBM, while Groq 3 has about 500 MB of built-in SRAM
- Compute: Rubin delivers up to 50 petaflops in 4-bit computations, while Groq 3 delivers 1.2 petaflops in 8-bit
- Memory bandwidth: Groq 3 reaches 150 TB/s, compared to 22 TB/s for Rubin
- Focus: Groq 3 targets fast token generation with low latency rather than generality
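The trade-off in the figures listed above is easier to see as ratios. The sketch below uses only the numbers quoted in the list; the dictionary layout and function names are hypothetical. It also computes how long each chip would need to read its entire on-chip or attached memory once at the quoted bandwidth, a rough proxy for how quickly resident weights can be streamed on each decode step.

```python
# Figures quoted in the article, normalized to bytes, bytes/s, and FLOP/s.
# Note: peak compute is quoted at different precisions (4-bit vs. 8-bit),
# so the compute ratio is only a rough indication.
SPECS = {
    "Rubin GPU": {"memory_bytes": 288e9, "bandwidth_bytes_per_s": 22e12, "peak_flops": 50e15},
    "Groq 3 LPU": {"memory_bytes": 500e6, "bandwidth_bytes_per_s": 150e12, "peak_flops": 1.2e15},
}


def ratio(field, numerator, denominator):
    """Ratio of one spec field between two chips."""
    return SPECS[numerator][field] / SPECS[denominator][field]


def full_memory_sweep_seconds(chip):
    """Time to read the chip's whole memory once at the quoted bandwidth."""
    spec = SPECS[chip]
    return spec["memory_bytes"] / spec["bandwidth_bytes_per_s"]


if __name__ == "__main__":
    print("memory, Rubin / Groq 3:", ratio("memory_bytes", "Rubin GPU", "Groq 3 LPU"))
    print("compute, Rubin / Groq 3:", ratio("peak_flops", "Rubin GPU", "Groq 3 LPU"))
    print("bandwidth, Groq 3 / Rubin:", ratio("bandwidth_bytes_per_s", "Groq 3 LPU", "Rubin GPU"))
    for chip in SPECS:
        print(chip, "full memory sweep (s):", full_memory_sweep_seconds(chip))
```

On these figures Rubin holds about 576 times more memory and Groq 3 has about 6.8 times the bandwidth. A full pass over memory takes about 13 milliseconds on Rubin and about 3.3 microseconds on Groq 3, although the LPU is sweeping far less data, so a large model would have to be spread across many chips.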
Market Shifts to Inference
Over the past couple of years, there has been a real explosion of startups building inference chips. D-Matrix focuses on digital in-memory compute, Etched on ASICs for transformers, RainAI on neuromorphic circuits, EnCharge on analog in-memory compute, and FuriosaAI on an architecture for tensor operations. With its announcement, Nvidia did not simply add another product. It effectively confirmed that the niche has become too large for the GPU empire to ignore.
At the same time, the bet is being placed not just on a separate chip, but on dividing inference into parts. AWS recently showed a system with Trainium and Cerebras CS-3, in which prefill and decode are performed by different types of hardware. Nvidia is heading in the same direction: the new Groq 3 LPX module will include eight LPUs and a Vera Rubin system.
Prefill and the heavier part of decode will remain on Rubin, while the final stage of inference will run on Groq 3. Such a hybrid makes it possible to use the strengths of each processor instead of settling for a compromise.
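The split described above can be sketched as a simple placement plan. This is a toy model, not Nvidia's scheduler: the stage names, the `PLACEMENT` table, and both functions are hypothetical, and the article does not say exactly which part of decode counts as the "heavier" one. The sketch assumes one prefill pass over the whole prompt, followed by two steps per output token, one on each type of hardware.

```python
from collections import Counter

# Hypothetical stage-to-hardware mapping that follows the article's description.
PLACEMENT = {
    "prefill": "Rubin GPU",
    "decode_heavy": "Rubin GPU",
    "decode_final": "Groq 3 LPU",
}


def plan_request(prompt_tokens, output_tokens):
    """Return the ordered (stage, device, tokens_processed) steps for one request."""
    if prompt_tokens < 1 or output_tokens < 0:
        raise ValueError("prompt_tokens must be >= 1 and output_tokens >= 0")
    # Prefill handles the whole prompt in a single pass.
    steps = [("prefill", PLACEMENT["prefill"], prompt_tokens)]
    # Decode produces one token at a time; each token visits both devices.
    for _ in range(output_tokens):
        steps.append(("decode_heavy", PLACEMENT["decode_heavy"], 1))
        steps.append(("decode_final", PLACEMENT["decode_final"], 1))
    return steps


def summarize(steps):
    """Count steps per device and the number of device-to-device handoffs."""
    per_device = Counter(device for _, device, _ in steps)
    handoffs = sum(1 for a, b in zip(steps, steps[1:]) if a[1] != b[1])
    return per_device, handoffs


if __name__ == "__main__":
    per_device, handoffs = summarize(plan_request(prompt_tokens=1000, output_tokens=3))
    print(dict(per_device), "handoffs:", handoffs)
```

In this toy plan every generated token crosses between the two processors, so the number of handoffs grows with the length of the answer. That suggests why the link between the chips inside such a module would matter as much as the chips themselves.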
What This Means
The main news is not that Nvidia released yet another accelerator, but that the largest player in the market has publicly recognized inference as a separate class of computing. For AI products, this is good news: if such architectures truly scale in production, model responses will become faster and the economics of mass use will become more predictable. The next stage of competition in AI will be fought not just over model quality, but over the cost of a million useful answers.