Taalas challenges GPUs: hardwired logic over flexibility for 17,000 tokens per second

Q: What is the source?

Originally published on MarkTechPost. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-02-23. Reading time: 3 min.

Toronto startup Taalas is developing specialized hardwired AI chips that replace programmable GPUs for inference workloads. The company claims it reaches 17,000

Hamidun News Editorial

For the past decade, the AI industry has rested on one unspoken axiom: silicon must be flexible. Models change every week, architectures evolve every quarter, and only programmable GPUs can keep pace with this race. Toronto-based startup Taalas believes this logic has led the industry into a dead end and proposes a radical alternative: chips with hardwired logic that can do nothing but inference, but do it at a claimed 17,000 tokens per second.

To grasp the scale of this claim, it's worth recalling the context. Modern GPUs, from NVIDIA's H100 to the latest Blackwell, are essentially supercomputers on a chip, capable of executing arbitrary computations. Their architecture reflects decades of graphics-processor development: thousands of programmable cores, complex memory hierarchies, flexible data buses. This universality allows the same hardware to handle large-scale model training, inference, and scientific simulations. But universality comes at a price in power consumption, latency, and cost. Every clock cycle spent decoding instructions and managing data streams is energy and time that doesn't go toward actual matrix multiplication.

Taalas attacks precisely this point. The company develops chips where computational paths are wired directly into the silicon, so-called hardwired logic. This means the chip doesn't interpret a program on the fly but physically embodies the specific operations of the transformer architecture: matrix multiplications, attention, normalization. Essentially, instead of a universal processor, you get an electronic circuit that does exactly one thing, but does it with minimal overhead.

The approach is not new in principle. ASICs (application-specific integrated circuits) have long been used in cryptocurrency mining, telecommunications, and video processing. Google unveiled its TPUs (Tensor Processing Units) back in 2016; they too are specialized for neural network computations, though they retain some degree of programmability. But Taalas, it seems, goes further, maximizing specialization for ultimate per-token performance.

The figure of 17,000 tokens per second deserves special attention. For comparison, typical large language model inference on a single H100-class GPU yields anywhere from a few hundred to a few thousand tokens per second, depending on model size and batch size. If Taalas truly achieves the claimed speed with comparable quality and model size, it could mean a dramatic reduction in inference costs, the primary expense for companies deploying AI services in production. It is inference cost, not training cost, that determines the economics of most AI products today: every ChatGPT query, every Copilot call, every image generation means dollars spent on GPU time.
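To make the cost argument concrete, here is a minimal Python sketch of the arithmetic. It is not a benchmark. The function name `cost_per_million_tokens`, the $3.00 hourly hardware cost, and the 1,000 tokens-per-second GPU figure are all illustrative assumptions (the last is simply picked from within the range quoted above). Only the 17,000 figure comes from Taalas's claim. The sketch also assumes both devices cost the same per hour, which is unlikely to hold in practice.

```python
def cost_per_million_tokens(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Return the USD cost of generating one million tokens
    at a sustained throughput, given an hourly hardware cost."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Assumption: identical hourly cost for both devices, chosen only to
# isolate the effect of throughput. Real prices are not given in the article.
HOURLY_COST_USD = 3.00

# Assumption: a value inside the article's "few hundred to a few thousand" range.
gpu_tokens_per_second = 1_000
# Taalas's claimed throughput, as reported in the article.
taalas_tokens_per_second = 17_000

gpu_cost = cost_per_million_tokens(gpu_tokens_per_second, HOURLY_COST_USD)
taalas_cost = cost_per_million_tokens(taalas_tokens_per_second, HOURLY_COST_USD)

print(f"GPU (assumed):    ${gpu_cost:.4f} per million tokens")
print(f"Taalas (claimed): ${taalas_cost:.4f} per million tokens")
print(f"Cost ratio:       {gpu_cost / taalas_cost:.1f}x")
```

With equal hourly cost, the ratio of the two results is simply the ratio of the throughputs. Any real comparison would also have to account for chip price, power draw, model size, and whether the quoted throughput is per request or aggregated across a batch, none of which the article specifies.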

However, the approach carries an obvious and serious risk. Hardwired logic means rigid binding to a specific model architecture. If the industry moves tomorrow from transformers to something fundamentally different, such as state-space models or hybrid approaches, Taalas chips risk becoming expensive paperweights. This is the classic specialization dilemma: you win in efficiency but lose in adaptability. Google can update the software of its TPUs, and NVIDIA can release new drivers and CUDA libraries, but Taalas will have to design a new chip.

That said, the startup has a strong counter-argument. The transformer architecture, introduced in 2017, has dominated for most of the years since and shows no signs of departing soon. Basic operations such as matrix multiplications and attention mechanisms remain fundamentally the same from GPT-2 to the latest models. Moreover, the trend toward "ubiquitous inference," which Taalas champions as its mantra, suggests that AI computation should become as cheap and accessible as electricity. That calls for specialized, energy-efficient chips, not expensive universal GPUs.

There is also market context. GPU shortages and NVIDIA's monopolistic position have created strong demand for alternatives. Major cloud providers — Amazon, Google, Microsoft — are already developing their own chips. Startups like Groq, Cerebras, and SambaNova offer unconventional architectures. Taalas fits this trend but occupies the most radical position on the flexibility-specialization spectrum.

The main question Taalas must answer is not technical but economic. Can the company manufacture and update its chips fast enough to keep pace with model evolution? Can it convince customers that betting on hardwired logic is justified? If so, we might see the beginning of a new era in which AI inference ceases to be a luxury and becomes an infrastructure norm. If not, it will be another lesson in why the industry clings so fiercely to flexibility.
