Mistral Unveils Small 4 — a Model Uniting Reasoning, Code, and Vision
Mistral released Small 4 — a unified model that replaces three specialized systems: Magistral for reasoning, Pixtral for vision, and Devstral for code. The…
AI-processed from Mistral AI News; edited by Hamidun News
Mistral AI presented Mistral Small 4 — a model that combines three separate specialized models in one system: Magistral for complex reasoning, Pixtral for image analysis, and Devstral for code. Previously, developers had to choose which model to use for a specific task. Now a single universal solution solves all problems without the need to switch between systems.
One Model Instead of Three
Mistral Small 4 is a hybrid architecture optimized for chat, coding, agent tasks, and complex reasoning. It supports both text and graphic inputs, opening a wide range of applications: from conversational interfaces and document processing to visual information analysis and autonomous agent creation. The company notes that the release of Small 4 confirms its commitment to open source — the model is distributed under the Apache 2.0 license. Mistral is proud to have joined NVIDIA Nemotron Coalition as a founding member, advancing collaboration and innovation in AI development. This is a sign that the industry is moving toward open, modular solutions that companies can adapt to their needs.
What's Inside the Model
The architecture is built on modern principles of scalability and efficiency:
- Mixture of Experts (MoE): 128 experts with 4 active simultaneously per token — efficient distribution of computations without loading all parameters
- Parameters: 119B total, 6B active per token (8B including embedding and output layers)
- Context: 256k tokens — support for long documents, multi-page reports, and analysis
- Multimodality: built-in support for text and images without adapter modules
- Flexible reasoning: the reasoning_effort parameter allows you to change the depth of analysis for the task
This design allows the model to scale without efficiency losses. Only 6B parameters are active per token, which reduces memory requirements and accelerates inference. Compared to traditional 120B models, Small 4 saves computational resources through expert routing — each token goes only to the necessary experts.
Reasoning on the Fly
The main innovation is the reasoning_effort parameter, which allows you to dynamically change the model's behavior for a specific task. If reasoning_effort="none", the model responds as fast as possible, like Mistral Small 3.2. If reasoning_effort="high", it switches to deep step-by-step analysis mode, equivalent to previous versions of Magistral for complex reasoning. Thanks to this, one model can work both as a fast chatbot for everyday tasks and as a research partner for complex analytical tasks. This is especially convenient for enterprise systems where not all requests require deep analysis, and excess computational power leads to unnecessary costs. Developers can even configure intermediate reasoning levels if the standard modes don't suit them.
Performance and Optimization
In speed-optimized mode (low-latency setup), Small 4 runs 40% faster than its predecessors — minimal response delays. In throughput-optimized mode, the system processes 3 times more requests per second than Mistral Small 3. Inference optimization was done jointly with NVIDIA. The model is fully optimized for vLLM and SGLang, guaranteeing efficient high-performance deployment in various infrastructure scenarios. Developers have access to vLLM, llama.cpp, SGLang, and Transformers, which simplifies integration into existing pipelines. Minimum infrastructure for deployment: 4 NVIDIA HGX H100, 2 NVIDIA HGX H200, or 1 NVIDIA DGX B200. For maximum performance, it is recommended to double these resources.
What This Means
Mistral Small 4 signals that the era of specialized models is coming to an end. In the future, one universal option with configurable parameters may replace an entire shelf of specialized tools. For developers, this is a simplification: no need to choose and switch between models. For companies, it reduces the complexity of architecture, deployment, and system maintenance.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.