NVIDIA Developer Blog→ original

StepFun Presents Step 3.7 Flash on NVIDIA GPU for Multimodal Work

StepFun launched Step 3.7 Flash on NVIDIA GPU, a multimodal model with 198 billion parameters. It processes text, images, videos, and documents in real time…

AI-processed from NVIDIA Developer Blog; edited by Hamidun News
StepFun Presents Step 3.7 Flash on NVIDIA GPU for Multimodal Work
Source: NVIDIA Developer Blog. Collage: Hamidun News.
◐ Listen to article

StepFun presented Step 3.7 Flash, a multimodal AI model capable of simultaneously analyzing texts, images, videos, and documents. The model is already available on NVIDIA accelerators and is designed for enterprise-scale applications.

What is

Step 3.7 Flash Step 3.7 Flash is a language model with 198 billion parameters that supports multimodality. Unlike text-only models, it processes multiple types of input data simultaneously: text queries, high-resolution images, video sequences, and document scans. This allows applications to work with real-world business scenarios where information arrives in multiple formats. The model is trained to process this data in real time without requiring preprocessing or input conversion. Integration with NVIDIA infrastructure means companies can use existing GPU clusters without migrating to new systems.

Multimodal

Capabilities Step 3.7 Flash covers key enterprise scenarios: Visual content search—finds needed information in photo and video archives Document analysis—extracts data from tables, contracts, reports, and receipts Video analysis—understands narrative and extracts details from camera recordings or video conferences Hybrid queries—answers questions requiring cross-source information matching This approach is useful for law firms (contract and correspondence analysis), manufacturing (video-based quality control), medicine (analysis of medical images and reports), and finance (processing multiple documents).

Scaling and

Performance StepFun emphasizes that Step 3.7 Flash is not a research project but a production-ready solution. The model is optimized for NVIDIA GPUs, including new architectures. This means predictable latency, batch processing support for high-load systems, and guaranteed compatibility with enterprise infrastructure. Availability on NVIDIA accelerators is critical for companies that have already invested in GPU clusters. They can add multimodality to existing applications without retraining engineers or rewriting pipelines.

What

This Means AI's transition from text analysis to full multimodality is not just adding features—it's a paradigm shift. When a model sees the screen as a human does (text + image + video simultaneously), new applications become possible: intelligent RPA, analysis of large volumes of unstructured data, document automation at a level that previously required human workers. Step 3.7 Flash demonstrates that this level is now available in production-ready form on standard hardware.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…