Habr AI showed how to build a Linear Layer in C++ and CUDA as part of the From MNIST to Transformer series
AI-processed from Habr AI; edited by Hamidun News
Habr AI has released the third part of its "From MNIST to Transformer" series. It covers the move from ordinary matrices to tensors and shows how to build a basic Linear Layer from scratch, without PyTorch. The material walks the reader through a simple neural network for MNIST digit recognition and focuses not on library APIs but on how these operations actually execute on the GPU.
About the third part
The new article continues the route from low-level CUDA code toward the architectures that underlie Transformers and modern LLMs. Instead of ready-made abstractions, the author looks directly at the computations: how data is laid out in memory, how the GPU launches operations, and why even an elementary neural network layer takes a substantial amount of engineering work. It is a good format for readers who are tired of treating ML frameworks as one-line magic.
The main shift in this part is the transition from matrices to tensors. For applied machine learning, this is a basic topic because real data rarely fits within two dimensions. The author shows how a developer's thinking changes when they start working not only with tables of numbers, but with multidimensional structures, from which batches, layer weights, intermediate representations, and other building blocks of the future model are assembled. It is at this level that preparation begins for understanding more complex blocks like attention and embeddings.
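To make the step from matrices to tensors concrete, here is a minimal C++ sketch of a tensor as a flat buffer plus a shape. The `Tensor` struct, its fields, and the row-major layout are assumptions chosen for illustration; the original article may organize its data differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical minimal tensor: one contiguous buffer described by a shape.
struct Tensor {
    std::vector<std::size_t> shape;    // e.g. {batch, rows, cols}
    std::vector<std::size_t> strides;  // elements skipped per step along each axis
    std::vector<float> data;           // contiguous row-major storage

    explicit Tensor(std::vector<std::size_t> s)
        : shape(std::move(s)), strides(shape.size(), 1) {
        std::size_t total = 1;
        // Row-major: the last axis has stride 1, and each earlier axis
        // spans the product of all later dimensions.
        for (std::size_t i = shape.size(); i-- > 0;) {
            strides[i] = total;
            total *= shape[i];
        }
        data.assign(total, 0.0f);
    }

    // Map a multi-dimensional index to an offset in the flat buffer.
    std::size_t offset(const std::vector<std::size_t>& idx) const {
        assert(idx.size() == shape.size());
        std::size_t off = 0;
        for (std::size_t i = 0; i < idx.size(); ++i) {
            assert(idx[i] < shape[i]);
            off += idx[i] * strides[i];
        }
        return off;
    }
};
```

With a shape of {2, 3, 4}, the strides come out as {12, 4, 1}, so element (1, 2, 3) sits at offset 23 in the buffer. A matrix is simply the two-axis case of the same structure.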
What they implement by hand
The practical part is built around custom C++ and CUDA code. It makes no attempt to hide complexity behind a convenient interface; the reader instead follows the path from mathematical formula to manual implementation. This approach is useful because it ties abstract linear algebra to concrete development steps: placing data in memory, launching kernels, checking tensor shapes, and locating where errors or performance losses appear. Without that link, it is hard to feel the difference between a textbook example and a real system.
- Implementation of tensor multiplication
- Creating the first Linear Layer, that is, a fully connected layer
- Working with memory and data placement on the GPU
- Linking mathematics and code for training on MNIST
- Building a simple network for recognizing handwritten digits
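The first item on the list, tensor multiplication, can be illustrated with a plain CPU reference in C++. This is only a sketch under stated assumptions: the function name `batched_matmul`, the row-major layout, and the [batch, m, k] by [batch, k, n] shapes are chosen here for illustration and are not taken from the article, which implements the operation in CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for a batched matrix product, all buffers row-major:
//   A: [batch, m, k], B: [batch, k, n], returns C: [batch, m, n]
//   C[b, i, j] = sum over p of A[b, i, p] * B[b, p, j]
std::vector<float> batched_matmul(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  std::size_t batch, std::size_t m,
                                  std::size_t k, std::size_t n) {
    // Shape checks: a mismatch here is the most common source of bugs.
    assert(A.size() == batch * m * k);
    assert(B.size() == batch * k * n);

    std::vector<float> C(batch * m * n, 0.0f);
    for (std::size_t b = 0; b < batch; ++b) {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < k; ++p) {
                    acc += A[(b * m + i) * k + p] * B[(b * k + p) * n + j];
                }
                C[(b * m + i) * n + j] = acc;
            }
        }
    }
    return C;
}
```

A reference like this is mainly useful for checking a GPU kernel's output. In a straightforward CUDA version, the three outer loops would typically turn into thread and block indices, with each thread computing one output element.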
"Only this way can you truly understand how LLMs work."
After implementing the basic operations, the article moves on to a custom fully connected layer and then to a small network for recognizing handwritten digits from MNIST. This matters: the material does not stop at individual primitives but shows how to assemble them into a working pipeline. It also makes clear that even a simple classifier rests on a fairly deep stack of knowledge, from mathematics to the organization of GPU memory and the specifics of parallel computing.
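A fully connected layer computes y = x·Wᵀ + b for each sample in a batch. The sketch below shows a forward pass on the CPU in C++. The `Linear` struct, its field names, and the [out_features, in_features] weight layout are assumptions made for this example and may differ from the article's CUDA implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fully connected layer, forward pass only.
struct Linear {
    std::size_t in_features;
    std::size_t out_features;
    std::vector<float> weight;  // [out_features, in_features], row-major
    std::vector<float> bias;    // [out_features]

    Linear(std::size_t in, std::size_t out)
        : in_features(in), out_features(out),
          weight(in * out, 0.0f), bias(out, 0.0f) {}

    // x: [batch, in_features] -> y: [batch, out_features]
    std::vector<float> forward(const std::vector<float>& x,
                               std::size_t batch) const {
        assert(x.size() == batch * in_features);
        std::vector<float> y(batch * out_features);
        for (std::size_t n = 0; n < batch; ++n) {
            for (std::size_t o = 0; o < out_features; ++o) {
                float acc = bias[o];  // start from the bias term
                for (std::size_t i = 0; i < in_features; ++i) {
                    acc += x[n * in_features + i] * weight[o * in_features + i];
                }
                y[n * out_features + o] = acc;
            }
        }
        return y;
    }
};
```

For MNIST, a layer such as `Linear(784, 10)` would map a flattened 28×28 image to ten class scores. The weights start at zero here only to keep the sketch short; a trainable network needs random initialization and a backward pass, which this example omits.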
Why this is important
The value of such breakdowns is that they dispel the illusion of ease that Python wrappers often create. A developer who sees only ready-made library calls has trouble estimating the cost of each operation and understanding why a model suddenly runs into a memory limit, a bandwidth ceiling, or a batch size constraint. Working at the level of tensors and layers makes it easier to read profiler output, design architectures with more care, and choose deliberately between speed, accuracy, and implementation complexity.
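As a rough illustration of what "the cost of each operation" can mean, the C++ sketch below estimates the arithmetic and memory traffic of one linear layer. The `linear_cost` helper is hypothetical, and the model is deliberately simplified: it assumes 4-byte floats by default and that every input, weight, bias, and output element moves through memory exactly once, which real hardware only approximates.

```cpp
#include <cassert>
#include <cstddef>

// Rough cost estimate for y = x * W^T + b (illustrative model only).
struct LayerCost {
    double flops;      // multiply-adds counted as 2 operations each
    double bytes;      // idealized memory traffic
    double intensity;  // FLOPs per byte moved
};

LayerCost linear_cost(std::size_t batch, std::size_t in_features,
                      std::size_t out_features,
                      std::size_t bytes_per_elem = 4) {
    assert(batch > 0 && in_features > 0 && out_features > 0);
    assert(bytes_per_elem > 0);
    const double b = static_cast<double>(batch);
    const double i = static_cast<double>(in_features);
    const double o = static_cast<double>(out_features);

    const double flops = 2.0 * b * i * o;
    // Elements touched: input + weights + bias + output.
    const double elems = b * i + o * i + o + b * o;
    const double bytes = elems * static_cast<double>(bytes_per_elem);
    return {flops, bytes, flops / bytes};
}
```

Under these assumptions, a 784-to-10 layer at batch size 1 performs 15,680 floating-point operations while moving 34,576 bytes, which is less than half an operation per byte. Increasing the batch size reuses the same weights across more samples and raises that ratio, which is one reason batch size shows up in performance discussions.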
The "From MNIST to Transformer" series is especially useful for those who want not just to run other people's models but to understand the mechanics of modern AI systems. It does not promise a quick start and warns up front that there will be plenty of code, CUDA, mathematics, and manual memory management. But this format builds a foundation that pays off when optimizing inference, when reading other people's CUDA implementations, and when trying to understand why some architectural decisions run faster than others.
What this means
For Russian-speaking audiences, this is a good sign: interest is shifting from superficial tutorials to engineering analysis of the AI stack's internals. The more such material appears, the less LLMs are perceived as black boxes and the easier it becomes for developers to move from using ready-made models to tuning and optimizing them deliberately. This is especially relevant now that demand for GPU optimization and for understanding model internals is growing rapidly among practicing engineers.