Lubomir Gorbatko Introduces Sessa — An Alternative to Transformers and Mamba
AI-processed from Habr AI; edited by Hamidun News
A breakdown of the Sessa architecture has been published on Habr. Sessa is an attempt to rethink decoder-only models and offer an alternative to the familiar pairing of Transformer and Mamba. The author does not promise a ready-made replacement for the market leaders, but instead demonstrates something more fundamental: different decoders can be described on a common foundation and then compared fairly by how they store and retrieve information from long context. The article's logic progresses from simple to complex.
First, the author rederives the Transformer not as a set of familiar blocks, but as an evolution of ordinary convolution. The idea is that a fixed window with fixed coefficients quickly hits its limits: such a mixer sees only local context and adapts poorly to the task. If the weights are made to depend on the input and are then normalized with softmax, attention emerges naturally.
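The step from a fixed convolution to attention can be illustrated with a small NumPy sketch. It is not the author's derivation: the function names are hypothetical, tokens are single scalars, and the per-token `q` and `k` values are passed in directly rather than produced by learned projections.

```python
import numpy as np

def causal_fixed_conv(x, kernel):
    """Mix each position with a fixed window of past values using fixed weights."""
    T = len(x)
    y = np.zeros(T)
    for t in range(T):
        for k, w in enumerate(kernel):      # kernel[0] weights the current token
            if t - k >= 0:
                y[t] += w * x[t - k]
    return y

def causal_softmax_mix(x, q, k):
    """Mix each position with its whole prefix using input-dependent weights."""
    T = len(x)
    scores = np.outer(q, k)                         # scores[t, j] = q[t] * k[j]
    mask = np.tril(np.ones((T, T), dtype=bool))     # position t sees only j <= t
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

x = np.array([1.0, 2.0, 3.0, 4.0])
conv_out = causal_fixed_conv(x, kernel=[0.5, 0.5])   # same two weights at every step
attn_out, attn_weights = causal_softmax_mix(x, q=x, k=x)  # weights change with the input
```

In the first function the window and the coefficients are the same for every input; in the second, each row of `attn_weights` is a probability distribution over the prefix that is recomputed from the data.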
In this interpretation, the Transformer's strength is flexible comparison of the current token with previous ones, but the price is well known: computation grows expensive as sequence length increases, and when attention is diffuse, that is, spread across many tokens, it struggles to hold on to a specific distant element. The article then moves to S4D and Mamba. Here the author treats the problem as a memory task: instead of rereading the entire prefix each time, the model can accumulate the past in an internal state.
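A minimal sketch of "accumulating the past in an internal state", assuming a single scalar channel with a fixed decay `a` and input weight `b` (both hypothetical values; S4D uses many such diagonal channels with learned, discretized parameters):

```python
import numpy as np

def linear_state_scan(x, a=0.5, b=1.0):
    """Run h_t = a * h_{t-1} + b * x_t and return the state after every step."""
    h = 0.0
    states = []
    for x_t in x:
        h = a * h + b * x_t     # constant work per token, no rereading of the prefix
        states.append(h)
    return np.array(states)

# Feed a single impulse and watch how long the state remembers it.
impulse = np.array([1.0, 0.0, 0.0, 0.0])
response = linear_state_scan(impulse, a=0.5)
```

Each step costs the same regardless of how long the prefix is, and the contribution of a token seen `k` steps ago is scaled by `a ** k`.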
This approach resolves some of attention's problems and makes working with long sequences cheaper. But it has its own limits. In the author's account, Mamba works well when the selective state space mechanism can "freeze" the state at the right moment and hold the needed signal.
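The "freeze" behaviour can be sketched with an input-dependent gate. This is a deliberately simplified stand-in, not Mamba's actual parameterization: the `keep` values are supplied by hand here, whereas a real selective model computes them from the tokens.

```python
import numpy as np

def gated_scan(x, keep):
    """h_t = keep_t * h_{t-1} + (1 - keep_t) * x_t, with keep_t in [0, 1].

    keep_t = 0 overwrites the state with x_t; keep_t = 1 freezes the state.
    """
    h = 0.0
    states = []
    for x_t, g in zip(x, keep):
        h = g * h + (1.0 - g) * x_t
        states.append(h)
    return np.array(states)

x = np.array([5.0, 1.0, 2.0, 3.0])

# Perfect freeze: store x[0], then ignore everything that follows.
frozen = gated_scan(x, keep=np.array([0.0, 1.0, 1.0, 1.0]))

# Imperfect freeze: the gate leaks, so the stored value is gradually overwritten.
leaky = gated_scan(x, keep=np.array([0.0, 0.9, 0.9, 0.9]))
```

With a perfect gate the stored value survives unchanged. With a gate of 0.9, the share of `x[0]` left in the state after `k` further steps is `0.9 ** k`.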
If the model fails to recognize when to enter this mode, especially on noisy or very long sequences, the influence of old tokens begins to decay exponentially, and precise extraction of the needed information becomes less reliable. Against this backdrop, Sessa is presented as a hybrid variant. The author proposes combining two ideas: retaining attention-like adaptivity while adding a feedback path, that is, controlled reuse of past states.
Inside the layer there are two branches: a forward branch, which gathers information from the prefix, and a feedback branch, which reuses already accumulated states. The key point is that the coefficients of both branches depend on the current token and the sequence length, so the model gets a more flexible memory mechanism than a classical Transformer and more direct access to history than Mamba. Essentially, this is an attempt to embed attention within a recurrent circuit, rather than treating the two approaches as opposing camps.
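The two-branch idea can be illustrated with a toy layer. Everything here is an assumption made for illustration, not the parameterization from the paper or its code: the function names, the scalar tokens, the score matrices passed in by the caller, and the single feedback gain `gamma` are all hypothetical.

```python
import numpy as np

def causal_softmax(scores, strict=False):
    """Row-wise softmax over allowed positions (j <= t, or j < t if strict)."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool), k=-1 if strict else 0)
    masked = np.where(mask, scores, -np.inf)
    row_max = np.max(masked, axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)  # row 0 is empty if strict
    w = np.exp(masked - row_max)
    denom = w.sum(axis=1, keepdims=True)
    return np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)

def toy_feedback_layer(v, fwd_scores, fb_scores, gamma=0.5):
    """Toy layer: s_t = f_t + gamma * sum_{j<t} a_fb[t, j] * s_j."""
    T = len(v)
    a_fwd = causal_softmax(fwd_scores)              # weights over the prefix, j <= t
    a_fb = causal_softmax(fb_scores, strict=True)   # weights over earlier outputs, j < t
    f = a_fwd @ v                                   # forward branch: attention-like read
    s = np.zeros(T)
    for t in range(T):
        # Feedback branch: reuse outputs that were already computed.
        s[t] = f[t] + gamma * (a_fb[t, :t] @ s[:t])
    return s

v = np.ones(3)
zeros = np.zeros((3, 3))                            # uniform weights in both branches
out = toy_feedback_layer(v, zeros, zeros, gamma=0.5)
forward_only = toy_feedback_layer(v, zeros, zeros, gamma=0.0)
```

Setting `gamma` to zero reduces the toy layer to plain causal attention. With a nonzero `gamma`, earlier outputs are fed back into later ones, so information can reach a position both directly from the prefix and through the recurrent path.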
The article's main emphasis is not on the slogan "we defeated transformers," but on comparing memory regimes. The author considers a controlled scenario in which models struggle to focus precisely on the one token they need. In that regime, the influence of distant tokens in a Transformer decays roughly as the inverse of distance, in Mamba it decays exponentially, and in Sessa the tail decays more slowly, which in theory makes retrieval over long distances more stable.
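To get a feel for how different these tails are, the sketch below tabulates three decay profiles. The inverse-distance and exponential forms follow the description above; the rate 0.9 is arbitrary, and the `d ** -0.5` curve is only a hypothetical stand-in for "slower than inverse distance", since the article does not give Sessa's exact profile.

```python
import numpy as np

distances = np.array([1, 10, 100, 1000], dtype=float)

inverse_tail = 1.0 / distances          # roughly inverse-distance decay
exponential_tail = 0.9 ** distances     # exponential decay; the rate 0.9 is arbitrary
slow_tail = distances ** -0.5           # assumed example of a slower-than-1/d tail

for d, inv, exp_, slow in zip(distances, inverse_tail, exponential_tail, slow_tail):
    print(f"d={int(d):5d}  1/d={inv:.1e}  0.9^d={exp_:.1e}  d^-0.5={slow:.1e}")
```

At short distances the three profiles are of similar magnitude, and the exponential one can even be the largest. The separation shows up at long range, where the exponential tail becomes negligible well before the other two.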
A multi-layer Sessa configuration, according to the author, may even support retrieval profiles that do not degrade with distance. Alongside the Habr post, an arXiv paper and code have been published, and the paper itself reports comparative experiments on long context. However, the author is explicit about the current limits of the result: for now this is primarily theory and an architectural hypothesis, and the next important step is training at a scale of several billion parameters and validation outside carefully controlled regimes.
In short, the material is interesting not only for Sessa itself, but also for how it explains things. It reduces the Transformer, Mamba, and the new architecture to a common scheme and shows exactly where their memory properties diverge. For those following the long-context model race, this is an important signal: a notable alternative to transformers may come not from abandoning attention altogether, but from combining it with a more expressive recurrent memory.