Microsoft veteran ran a transformer on a 6 MHz PDP-11 with 64 KB of memory

Originally published on 3DNews AI. Hamidun News processes and adapts the material with AI.

Published May 2, 2026. Reading time: 3 min.

Hamidun News Editorial

Microsoft veteran Dave Plummer has shown that a transformer can not only be explained in simple terms but also literally run on hardware from the late 1970s. His experiment on a PDP-11 minicomputer with a 6 MHz processor and 64 KB of RAM brings the conversation about AI down to earth: training is a lot of arithmetic, repetition, and careful optimization.

Old Computer, New Task

Plummer is known as a developer who helped create important Windows components. His new video is not a nostalgic stunt for likes but a demonstration of the basic principles behind modern models. At the center of the experiment is a 47-year-old PDP-11 system, a machine from an era when no one even dreamed of large language models. The contrast is what makes the project convincing: if a stripped-down transformer can be trained on such a device, then the core idea is much simpler than it appears against the backdrop of data centers and billion-dollar budgets.

The model that ran on the PDP-11 is Attention 11, written in PDP-11 assembly by developer Damien Bouré. Its task looks modest at first glance: take a sequence of eight numbers and output it in reverse order. The point is not to memorize a few examples but to grasp a rule that also works on new inputs. This is what Plummer emphasizes: even in such a toy scenario, the model must learn the structure rather than guess the next answer from a pattern.
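The reversal task and the difference between knowing the rule and merely producing output can be sketched in a few lines of Python. This is an illustration only: the value range (`VOCAB_SIZE`), the helper names, and the size of the held-out set are assumptions, since the article does not describe how Attention 11's data is generated or scored.

```python
import random

SEQ_LEN = 8       # from the article: sequences of eight numbers
VOCAB_SIZE = 10   # assumption: values 0-9; the article does not give the range

def make_example(rng):
    """Return (input, target), where target is the input reversed."""
    seq = [rng.randrange(VOCAB_SIZE) for _ in range(SEQ_LEN)]
    return seq, seq[::-1]

def exact_match_accuracy(predict, examples):
    """Fraction of examples where every output position is correct."""
    hits = sum(1 for x, y in examples if predict(x) == y)
    return hits / len(examples)

rng = random.Random(0)
# Fresh sequences the "model" has never seen: the test of rule discovery.
held_out = [make_example(rng) for _ in range(200)]

def rule(seq):
    """Stand-in for a model that has learned the reversal rule."""
    return list(reversed(seq))

def echo(seq):
    """Stand-in for a model that has learned nothing useful."""
    return list(seq)

rule_acc = exact_match_accuracy(rule, held_out)
echo_acc = exact_match_accuracy(echo, held_out)
```

Scoring on sequences generated after training is what separates learning the rule from memorizing examples: a lookup table of seen inputs would fail on `held_out`, while anything that has captured the rule scores the same on new data.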

How They Shrunk the Model

For the experiment to have any chance of working, the developers had to shrink the architecture drastically. Attention 11 is not a mini-copy of ChatGPT but a single-layer transformer with one attention mechanism, pared down to engineering minimalism. The model has only 1216 parameters. Instead of the large memory arrays and accelerators typical of modern AI projects, it uses fixed-point arithmetic, and the forward pass is reduced to 8-bit precision. In essence, it is an educational skeleton of a transformer that keeps only what is needed to demonstrate the actual training process.

1216 parameters instead of billions
fixed-point arithmetic
8-bit precision for the forward pass
optimization of nearly every processor cycle
task requires rule discovery, not memorization of examples
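Fixed-point arithmetic, as listed above, stores fractional values as scaled integers so that a processor without floating-point hardware can work with them using ordinary integer instructions. The sketch below assumes 8 fractional bits; the actual number format used in Attention 11 is not given in the article, and all function names are hypothetical.

```python
FRAC_BITS = 8            # assumption: 8 fractional bits
SCALE = 1 << FRAC_BITS   # 256: the integer that represents 1.0

def to_fixed(x):
    """Convert a float to its scaled-integer representation."""
    return int(round(x * SCALE))

def to_float(q):
    """Convert a scaled integer back to a float (for inspection only)."""
    return q / SCALE

def fixed_mul(a, b):
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits,
    so it is shifted right to return to FRAC_BITS.
    """
    return (a * b) >> FRAC_BITS

def fixed_dot(xs, ws):
    """Dot product: accumulate full-width products, rescale once at the end."""
    acc = 0
    for x, w in zip(xs, ws):
        acc += x * w
    return acc >> FRAC_BITS

product = to_float(fixed_mul(to_fixed(1.5), to_fixed(2.0)))
dot = to_float(fixed_dot([to_fixed(0.5), to_fixed(0.25)],
                         [to_fixed(2.0), to_fixed(4.0)]))
```

Python integers never overflow, so this sketch hides the hardest part of the real job: on a 16-bit machine such as the PDP-11, every intermediate product and accumulator has to be kept within the available word width.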

Yet even with these constraints, the result was far from decorative. Plummer reported that the model reached 100% accuracy in roughly 350 training steps. On a PDP-11/44 system with a cache board, this took about three and a half minutes. By the standards of modern GPUs, that is museum-grade speed. But for a 6 MHz machine with 64 KB of RAM, the fact that training completed successfully matters more than the absolute time: the experiment shows that transformer principles don't require magic, only resources and good engineering.
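A back-of-the-envelope calculation puts these figures in perspective. It uses only the rounded numbers quoted in the article (6 MHz, roughly 350 steps, about three and a half minutes, 1216 parameters), so the results are order-of-magnitude estimates, and clock cycles are not the same thing as executed instructions.

```python
CLOCK_HZ = 6_000_000     # 6 MHz, as quoted in the article
TRAIN_STEPS = 350        # "roughly 350 training steps"
TRAIN_SECONDS = 210      # "about three and a half minutes"
PARAMS = 1216            # parameter count of Attention 11

# Total clock cycles available over the whole training run.
total_cycles = CLOCK_HZ * TRAIN_SECONDS

# Average budget per training step, in seconds and in clock cycles.
seconds_per_step = TRAIN_SECONDS / TRAIN_STEPS
cycles_per_step = total_cycles // TRAIN_STEPS

# Average clock cycles spent per parameter in each step.
cycles_per_param = cycles_per_step / PARAMS

print(f"{seconds_per_step:.1f} s per step")
print(f"{cycles_per_step:,} clock cycles per step")
print(f"about {cycles_per_param:,.0f} cycles per parameter per step")
```

Under these assumptions each training step gets about 0.6 seconds, or 3.6 million clock cycles, which works out to a few thousand cycles per parameter per step for the forward pass, the backward pass, and the weight update combined.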

Not Magic, but Mathematics

The goal of the project was not to find a practical replacement for modern models. Plummer set out to show something less romantic: there is no sacred fire at the foundation of AI. There is a cycle of errors, corrections, and iterations in which the weights gradually adjust to the task. That is why his demonstration works as an antidote to the mystification of neural networks. It strips away the marketing layer and leaves bare mechanics that can be observed almost frame by frame.
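The cycle of error, correction, and repetition can be shown with something far simpler than a transformer. The sketch below trains a single 8x8 linear layer with plain stochastic gradient descent to reverse an 8-element vector. It is not the Attention 11 architecture or its training procedure; the learning rate, step count, and input range are arbitrary choices made for illustration.

```python
import random

N = 8          # sequence length, matching the article's task
LR = 0.1       # assumption: learning rate chosen for this illustration
STEPS = 1000   # assumption: number of training steps
rng = random.Random(0)

# Weights start at zero: the layer initially knows nothing.
W = [[0.0] * N for _ in range(N)]

def forward(W, x):
    """Linear layer: y[i] = sum_j W[i][j] * x[j]."""
    return [sum(W[i][j] * x[j] for j in range(N)) for i in range(N)]

def train_step(W, x, target):
    """One error-and-correction cycle; returns the mean squared error."""
    y = forward(W, x)
    err = [y[i] - target[i] for i in range(N)]
    # Gradient of 0.5 * sum(err^2) with respect to W[i][j] is err[i] * x[j].
    for i in range(N):
        for j in range(N):
            W[i][j] -= LR * err[i] * x[j]
    return sum(e * e for e in err) / N

losses = []
for _ in range(STEPS):
    x = [rng.uniform(-1.0, 1.0) for _ in range(N)]
    losses.append(train_step(W, x, x[::-1]))
```

After training, `W` approaches the anti-diagonal matrix that swaps position `i` with position `N - 1 - i`, and `losses` falls from its initial value toward zero. Nothing in the loop mentions reversal explicitly; the rule emerges from repeated small corrections to the weights.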

"From guessing to knowing."

That is how Plummer describes the moment when the model stops stumbling and begins to apply the rule it has discovered consistently. This is the most interesting effect of the experiment: the viewer sees not a ready-made smart answer but the birth of an ability through successive corrections. Against the backdrop of AGI discussions, this is sobering. Modern systems impress not because they violate the laws of computation, but because the same mechanism runs at a colossal scale, on incomparably larger data, models, and compute clusters.

What This Means

The PDP-11 experiment does not prove that ChatGPT can be ported to a retro computer. It demonstrates something else: the basic ideas of transformers are compact enough to be understood, reproduced, and trained even on ancient hardware. For the market, this is another argument in favor of efficient small models and careful optimization, especially now that computational cost is becoming a competitive factor in its own right.
