Stanford Showcases Onyx Chip for Sparse AI: 8x Faster Than CPU, 70x More Efficient
AI-processed from IEEE Spectrum AI; edited by Hamidun News
Researchers at Stanford have presented Onyx, an accelerator that exploits the zeros inside AI models. The idea is to avoid spending energy on multiplications and additions that are known in advance not to change the result, which speeds up computation without giving up large models.
Why zeros matter
In neural networks, data, weights, and activations are stored as arrays of numbers — vectors, matrices, and tensors. In many cases, a significant portion of these numbers are zero or so close to zero that they can be considered zeros without noticeable quality loss. This property is called sparsity.
If more than half of the values are zeros, the model can already benefit from specialized algorithms: instead of storing and processing the entire matrix, the system keeps only the non-zero values and skips the operations that would involve zeros. This matters to the industry because models are growing faster than infrastructure is getting cheaper. More parameters generally mean higher quality, but also a higher cost to run in time, energy, and carbon footprint.
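As a rough illustration of the idea above, the following sketch measures sparsity in a small weight matrix, both for exact zeros and for values close enough to zero to be treated as zeros. The function name `sparsity`, the matrix contents, and the 0.01 threshold are assumptions made for this example and do not come from the Stanford work.

```python
def sparsity(matrix, threshold=0.0):
    """Return the fraction of entries whose magnitude is <= threshold."""
    flat = [v for row in matrix for v in row]
    zeros = sum(1 for v in flat if abs(v) <= threshold)
    return zeros / len(flat)


# Hypothetical 3x4 weight matrix, invented for illustration.
weights = [
    [0.0, 0.9, 0.0, 0.001],
    [0.0, 0.0, -0.7, 0.0],
    [0.002, 0.0, 0.0, 0.4],
]

exact = sparsity(weights)                 # counts only exact zeros: 7 of 12
near = sparsity(weights, threshold=0.01)  # also counts near-zero values: 9 of 12

print(f"exact zeros: {exact:.2f}, with threshold: {near:.2f}")
```

Both figures are above one half, so by the rule of thumb in the text this toy matrix would already be a candidate for sparse storage.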
The article gives the example of Meta's Llama with 2 trillion parameters. The researchers also cite results from Cerebras: on Llama 7B, 70–80 percent of the parameters could be zeroed out without loss of accuracy. This means large models already contain a hidden reserve for acceleration; the challenge is learning how to use it.
Where efficiency is lost
The problem is that popular hardware was built from the start for dense computations, not sparse structures. When data is compressed, metadata must be stored alongside the non-zero values: row indices, column indices, and segments. Access to such data becomes indirect and unpredictable: the processor first has to look up the coordinates and only then the actual value. As a result, part of the time is spent not on math but on chasing references through memory and on bookkeeping operations.
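The metadata and the indirect access described above can be sketched with compressed sparse row (CSR), a widely used sparse format. The article does not say which format Onyx uses, so CSR is an assumption here, as are the function names `to_csr` and `csr_matvec`.

```python
def to_csr(dense):
    """Compress a dense matrix into CSR: values, column indices, row segments."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)      # metadata: where the value lives
        row_ptr.append(len(values))    # metadata: where each row's segment ends
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a dense vector x."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect access: the column index must be loaded first,
            # and only then can the matching element of x be fetched.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


dense = [
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 2.0, 0.0, 4.0],
]
values, col_idx, row_ptr = to_csr(dense)
y = csr_matvec(values, col_idx, row_ptr, [1.0, 2.0, 3.0, 4.0])
print(values, col_idx, row_ptr)  # [5.0, 3.0, 2.0, 4.0] [0, 2, 1, 3] [0, 1, 2, 4]
print(y)                         # [5.0, 9.0, 20.0]
```

Only four multiplications are performed instead of twelve, but every one of them needs an extra lookup in `col_idx`, which is the kind of overhead the paragraph above refers to.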
- GPUs excel at dense matrices, but with random (unstructured) sparsity they often end up parallelizing useless operations on zeros.
- Structured sparsity doesn't always help, because it requires a rigid zeroing pattern, for example two zeros in every group of four adjacent parameters.
- CPUs are more flexible, but often suffer from prefetcher misses and unpredictable memory accesses.
- Even sparse libraries don't eliminate all the overhead, because some resources go to managing the compressed data itself.
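The rigid pattern mentioned in the list above (two zeros in every four adjacent parameters) can be sketched as follows. Keeping the two largest magnitudes in each group is a common heuristic and is assumed here; the function name `prune_2_of_4` and the sample weights are hypothetical.

```python
def prune_2_of_4(weights):
    """Zero out the two smallest-magnitude values in each group of four."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.8, 0.05, 0.3, 0.9, 0.7, 0.6, 0.0]
print(prune_2_of_4(weights))
# [0.0, -0.8, 0.0, 0.3, 0.9, 0.7, 0.0, 0.0]
```

In the second group the value 0.6 is dropped even though it is larger than anything kept in the first group, which illustrates why a fixed pattern does not fit every model.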
Manufacturers are already looking for workarounds, but so far these are narrowly targeted. Apple accelerated indirect memory accesses in its A14 and M1 chips, Cerebras is pushing the sparse approach in its Wafer Scale Engine, and Meta is developing MTIA. But there are limitations here too: some solutions work only with weight sparsity, while others support only individual operations such as matrix multiplication. For real AI workloads this is insufficient, because models consist not of a single operation but of a long chain of different layers and transformations.
How Onyx is built
The Stanford team started from the ground up and created Onyx, a programmable accelerator that handles sparse and dense computations equally well. At its core is a coarse-grained reconfigurable array (CGRA), an intermediate option between a CPU and an FPGA: it is noticeably more flexible than a conventional processor, yet more efficient than fully bit-configurable circuits. Onyx consists of compute blocks and memory blocks; the memory blocks store compressed matrices, and the chip processes them directly in that form, without expanding them back to a dense format unless necessary.
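To give a feel for what processing data "in compressed form" can mean, here is a software analogy: a dot product of two sparse vectors computed directly on their sorted coordinate lists, without rebuilding the dense vectors. This is not a description of Onyx's actual hardware; the function name `sparse_dot` and the sample data are assumptions for illustration.

```python
def sparse_dot(a_idx, a_val, b_idx, b_val):
    """Dot product of two sparse vectors given as sorted (index, value) lists.

    Returns the result and the number of multiplications performed.
    """
    i = j = 0
    total = 0.0
    multiplies = 0
    while i < len(a_idx) and j < len(b_idx):
        if a_idx[i] == b_idx[j]:
            # Both vectors are non-zero at this coordinate: do real work.
            total += a_val[i] * b_val[j]
            multiplies += 1
            i += 1
            j += 1
        elif a_idx[i] < b_idx[j]:
            i += 1  # the other operand is zero here, so skip
        else:
            j += 1
    return total, multiplies


# Two vectors of logical length 10, stored as coordinates plus values.
a_idx, a_val = [1, 4, 7], [2.0, 3.0, 5.0]
b_idx, b_val = [0, 4, 7, 9], [1.0, 10.0, 2.0, 4.0]

result, multiplies = sparse_dot(a_idx, a_val, b_idx, b_val)
print(result, multiplies)  # 40.0 2
```

A dense dot product over ten positions would perform ten multiplications; the compressed version performs two, at the price of comparing coordinates along the way.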
The compiler is particularly important: it translates expressions such as multiplying a sparse matrix by a vector into a graph of memory and compute operations, then maps that graph onto the chip's blocks. According to Stanford's figures, Onyx on average consumed 70 times less energy than a CPU and performed computations approximately 8 times faster. By the energy-delay product metric, which multiplies energy by run time, the gain reached 565 times relative to a 12-core Intel Xeon running sparse libraries.
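The reported numbers can be sanity-checked with simple arithmetic. The sketch below uses normalized units in which Onyx equals 1.0 on both axes and the CPU is expressed relative to it; that normalization and the function name `energy_delay_product` are assumptions, and only the factors 70, 8, and 565 come from the article.

```python
def energy_delay_product(energy, delay):
    """EDP: energy consumed multiplied by time taken (lower is better)."""
    return energy * delay


# Normalized units: Onyx is 1.0 on both axes, the CPU is relative to it.
onyx_edp = energy_delay_product(energy=1.0, delay=1.0)
cpu_edp = energy_delay_product(energy=70.0, delay=8.0)

edp_gain = cpu_edp / onyx_edp
print(edp_gain)  # 560.0
```

Multiplying the two average factors gives 560, close to the reported 565. The small gap may come from rounding ("approximately 8 times") or from averaging the metric per benchmark rather than multiplying the averages; the article does not say which.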
The next generation of Onyx should add support for nonlinear layers, normalization, softmax, and more convenient switching between sparse and dense modes.
What this means
The main idea of the article is not that another AI chip has appeared, but that developers are beginning to optimize models not only by reducing precision or size, but also by the structure of the computations themselves. If the sparse approach takes hold, large models will be able to run cheaper and faster, which means the next leap in AI may come not only from new models, but also from a new class of hardware.