Lubomir Gorbatko Introduces Sessa — An Alternative to Transformers and Mamba

Habr has published a breakdown of the Sessa architecture, a new attempt to rethink decoders for long context. The idea is to combine the adaptivity of attention with the feedback found in state-space models. In theory, this gives more flexible memory than a Transformer or Mamba, especially when a model cannot focus sharply on a single important token and has to work in long, noisy context.

Khamidun Zhemal

AI monitoring · Habr AI

Apr 27, 2026· 2 min

AI-processed from Habr AI; edited by Hamidun News


A breakdown of the Sessa architecture has been published on Habr. It is an attempt to rethink decoder-only models and offer an alternative to the two familiar options, the Transformer and Mamba. The author does not promise a ready-made replacement for market leaders, but instead demonstrates something more fundamental: different decoders can be described through a common foundation and then compared on equal terms by how they store and retrieve information from long context. The article's logic progresses from simple to complex.

First, the author rederives the Transformer not as a set of familiar blocks, but as an evolution of ordinary convolution. The idea is that a fixed window with fixed coefficients quickly hits its limits: such a mixer sees only local context and adapts poorly to the task. If the weights are made to depend on the input and are then normalized with softmax, attention emerges naturally.
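The step from convolution to attention can be illustrated with a minimal sketch. It is not taken from the Habr article or the paper: tokens are plain scalars, there are no learned projections (each token acts as its own query, key, and value), and the names `fixed_conv` and `causal_attention` are hypothetical.

```python
import math

def fixed_conv(xs, kernel):
    """Causal convolution: a fixed window with fixed coefficients."""
    out = []
    for t in range(len(xs)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:          # only look back inside the window
                acc += w * xs[t - k]
        out.append(acc)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(xs, scale=1.0):
    """Same mixing idea, but the weights depend on the input.

    Assumption for illustration: score(t, j) = scale * x_t * x_j,
    normalized with softmax over the whole prefix j <= t.
    """
    out = []
    for t in range(len(xs)):
        scores = [scale * xs[t] * xs[j] for j in range(t + 1)]
        weights = softmax(scores)
        out.append(sum(w * xs[j] for j, w in enumerate(weights)))
    return out

xs = [1.0, 0.0, 2.0, -1.0]
conv_out = fixed_conv(xs, kernel=[0.5, 0.5])   # window of 2, same weights everywhere
attn_out = causal_attention(xs)                # window is the full prefix, weights vary per token
```

The convolution always averages the last two tokens, whatever they contain. The attention variant looks at the entire prefix and redistributes its weights for every position, which is the adaptivity the article describes.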

In this interpretation, the Transformer's strength is flexible comparison of the current token with previous ones, but the cost is well known: computation grows expensive as the sequence lengthens, and in a diffuse regime, where attention is spread across many tokens, it struggles to hold on to a specific distant element. The article then moves to S4D and Mamba. Here the author treats the problem as one of memory: instead of rereading the entire prefix each time, the model can accumulate the past in an internal state.
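Accumulating the past in a state can be sketched with a scalar linear recurrence. This is a simplified illustration rather than S4D or Mamba themselves: the coefficients `a` and `b` are fixed, hypothetical constants here, whereas Mamba makes them depend on the input.

```python
def recurrent_scan(xs, a, b):
    """h_t = a * h_{t-1} + b * x_t: one constant-cost state update per token."""
    h = 0.0
    states = []
    for x in xs:
        h = a * h + b * x
        states.append(h)
    return states

def unrolled(xs, a, b):
    """The same result written as a causal convolution with kernel b * a**k."""
    return [
        sum(b * a ** (t - j) * xs[j] for j in range(t + 1))
        for t in range(len(xs))
    ]

# A single impulse at position 0, followed by silence.
xs = [1.0, 0.0, 0.0, 0.0, 0.0]
states = recurrent_scan(xs, a=0.5, b=1.0)
same = unrolled(xs, a=0.5, b=1.0)
```

With `a = 0.5` the impulse is halved at every step, so the state never rereads the prefix but also remembers the first token less and less. The unrolled form shows that this recurrence is again a convolution, only with an infinitely long, exponentially shrinking kernel.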

This approach resolves some of attention's problems and makes working with long sequences cheaper. But it has its own limit. In the author's account, Mamba works well when the selective state-space mechanism can "freeze" the state at the right moment and hold the needed signal.

If the moment to freeze is poorly recognized, especially on noisy or very long sequences, the influence of old tokens begins to decay exponentially, and precise extraction of the needed information becomes less reliable. Against this backdrop, Sessa is presented as a hybrid variant. The author proposes combining two ideas: retaining attention-like adaptivity while adding a controlled feedback path through past states.

Inside the layer, two branches appear: a forward branch, which gathers information from the prefix, and a feedback branch, which reuses already accumulated states. The key point is that the coefficients of both branches depend on the current token and the sequence length, so the model gets a more flexible memory mechanism than a classical Transformer and more direct access to history than Mamba. Essentially, this is an attempt to embed attention within a recurrent circuit, rather than keeping the two approaches on opposite sides of a divide.
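The two-branch idea can be made concrete with a toy model. To be clear, this is not the Sessa layer from the paper or its code: the scoring rule, the `gain` parameter, and the function `toy_feedback_layer` are all assumptions invented for this illustration. It only shows what "attention over the prefix plus attention over past outputs" could look like in the simplest scalar case.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def toy_feedback_layer(xs, gain=0.5):
    """Toy mixer with a forward and a feedback branch (hypothetical, not Sessa itself).

    forward:  softmax attention of the current token over the input prefix
    feedback: softmax attention of the current token over earlier *outputs*,
              scaled by `gain` (kept below 1 so the recurrence stays bounded)
    """
    ys = []
    for t, x_t in enumerate(xs):
        fwd_w = softmax([x_t * xs[j] for j in range(t + 1)])
        forward = sum(w * xs[j] for j, w in enumerate(fwd_w))

        feedback = 0.0
        if t > 0:
            fb_w = softmax([x_t * ys[j] for j in range(t)])
            feedback = sum(w * ys[j] for j, w in enumerate(fb_w))

        ys.append(forward + gain * feedback)
    return ys

xs = [1.0, 0.0, 0.0, 0.0]                          # one informative token, then silence
with_feedback = toy_feedback_layer(xs, gain=0.5)
forward_only = toy_feedback_layer(xs, gain=0.0)    # reduces to plain causal attention
```

In this toy setting the forward-only output shrinks as the single informative token is diluted across a growing prefix, while the feedback branch keeps re-injecting earlier outputs and slows that shrinkage. Whether the real architecture behaves this way is a question for the paper, not for this sketch.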

The article's main emphasis is not on the slogan "we defeated transformers," but on comparing memory regimes. The author considers a controlled scenario in which models struggle to focus precisely on the one needed token. In that regime, the influence of a distant token decays roughly as the inverse of distance in a Transformer and exponentially in Mamba, while Sessa's tail decays more slowly, which in theory gives more stable retrieval over long distances.
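The first two decay profiles are easy to compare numerically. The sketch below makes two assumptions that are not from the article: diffuse attention is modeled as perfectly uniform weights over the prefix, and the recurrent model uses a fixed, arbitrary decay factor of 0.9. Sessa's profile is left out because this summary does not state its exact form.

```python
def diffuse_attention_influence(distance):
    """Uniform causal attention over (distance + 1) tokens gives each one weight 1 / (distance + 1)."""
    return 1.0 / (distance + 1)

def recurrent_influence(distance, decay=0.9):
    """A fixed-decay recurrence shrinks a token's contribution by `decay` at every step."""
    return decay ** distance

# Influence of one token as seen from `d` positions later.
for d in (1, 10, 100, 1000):
    print(d, diffuse_attention_influence(d), recurrent_influence(d))
```

At short range the recurrence can actually retain more than diffuse attention, but the exponential falls below the inverse-distance curve well before a distance of 100 with this decay factor and keeps dropping far faster from there.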

In a multi-layer Sessa configuration, according to the author, it may even support retrieval profiles that do not degrade with distance. Along with the Habr post, an arXiv paper and code have been published, and the research itself reports comparable experiments on long context. However, the author is explicit about the current limits of the result: for now this is primarily theory and an architectural hypothesis, and the next important step is training at the scale of several billion parameters and validation outside carefully controlled regimes.

In short, the material is interesting not only for Sessa itself, but also for how it is explained. It reduces the Transformer, Mamba, and the new architecture to a common scheme and shows exactly where their memory properties diverge. For those following the long-context model race, this is an important signal: a notable alternative to transformers may come not from a complete rejection of attention, but from combining it with more expressive recurrent memory.
