StepFun unveils Step 3.7 Flash — a 198-billion-parameter Vision-Language model

StepFun has officially introduced Step 3.7 Flash, a new Vision-Language model aimed at specialized tasks in artificial intelligence. The model is built on a Mixture of Experts (MoE) architecture and contains 198 billion parameters, a design intended to deliver high performance while using computational resources efficiently.
Technical Parameters and Architecture
Step 3.7 Flash is distinguished by several key technical characteristics. The model uses an MoE architecture in which only the relevant parameter subnetworks (the "experts") are activated for each input, so the cost of processing an input depends on the experts that run rather than on the full 198 billion parameters. This makes it possible to balance model scale against operational efficiency. Built-in visual capabilities allow the model to analyze images as well as process text. An expanded context window of 256 thousand tokens makes it possible to work with long documents, complex codebases, and detailed visual materials without loss of context.
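To make the idea of activating only some experts concrete, here is a minimal sketch of top-k expert routing in plain Python. It illustrates the general MoE technique only: StepFun has not described Step 3.7 Flash's router in this article, so the number of experts, the value of k, the router scores, and the function names (route_top_k, moe_forward) are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs; the weights sum to 1.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    top = ranked[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and combine their outputs."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Toy setup (hypothetical): 8 experts, each a simple scalar function.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]

# Hypothetical router scores for one input; a real router computes these
# from the input with a learned layer.
router_scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(router_scores, k=2)
output = moe_forward(3.0, experts, router_scores, k=2)

print("experts used:", [i for i, _ in selected], "of", len(experts))
print("output:", output)
```

In this toy run only two of the eight experts are evaluated for the input, which is the source of the efficiency described above: the total parameter count grows with the number of experts, while the per-input compute grows only with k.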
The Advisor mode, which is integrated into the model's architecture, deserves special mention. This mode provides an additional level of control over the model's behavior and allows for more structured and predictable output. Such an approach is particularly useful in production environments, where reliable and consistent results are required.
Target Applications and Use Cases
StepFun positions Step 3.7 Flash as a solution for two main areas of application. The first is code development automation. The model can analyze source code, identify potential improvements, generate optimized versions, and take part in debugging. Built-in vision allows it to work with code screenshots and architecture diagrams.
The second area is integration into search systems. Visual capabilities make the model suitable for search platforms that handle both text queries and images, and the extended context lets it assemble answers to complex, multifaceted queries from large volumes of source material.
Additionally, the model can be applied in analytical tools: processing combined datasets, analyzing video materials with detailed content transcription, and providing structured recommendations based on the results.
Market Position
The release of Step 3.7 Flash continues a visible market trend toward specialization. Instead of creating universal models, companies are increasingly developing solutions optimized for specific tasks. StepFun demonstrates that effective specialization can be achieved not only through architectural choices but also through dedicated operating modes that adapt the model's behavior to specific requirements. MoE architecture is becoming the standard for large models, particularly given constraints on energy consumption and infrastructure costs. This opens the door to more accessible and environmentally friendly AI solutions.
What This Means
The emergence of Step 3.7 Flash points to an important shift in the strategy for developing large models. Instead of racing for size and universality, developers are focusing on deep optimization for specific applications. For developers and companies, this means more tools to choose from and the ability to select a solution that closely matches the needs of their project. MoE architecture, in turn, is becoming not just an engineering trick but a standard for efficient next-generation models. It offers a way to reduce infrastructure costs and operational expenses without compromising quality, which is critical for commercial deployment of AI.