StepFun Presents Step 3.7 Flash on NVIDIA GPU for Multimodal Work
StepFun launched Step 3.7 Flash on NVIDIA GPU, a multimodal model with 198 billion parameters. It processes text, images, videos, and documents in real time…
AI-processed from NVIDIA Developer Blog; edited by Hamidun News
StepFun presented Step 3.7 Flash, a multimodal AI model capable of simultaneously analyzing texts, images, videos, and documents. The model is already available on NVIDIA accelerators and is designed for enterprise-scale applications.
What is
Step 3.7 Flash Step 3.7 Flash is a language model with 198 billion parameters that supports multimodality. Unlike text-only models, it processes multiple types of input data simultaneously: text queries, high-resolution images, video sequences, and document scans. This allows applications to work with real-world business scenarios where information arrives in multiple formats. The model is trained to process this data in real time without requiring preprocessing or input conversion. Integration with NVIDIA infrastructure means companies can use existing GPU clusters without migrating to new systems.
Multimodal
Capabilities Step 3.7 Flash covers key enterprise scenarios: Visual content search—finds needed information in photo and video archives Document analysis—extracts data from tables, contracts, reports, and receipts Video analysis—understands narrative and extracts details from camera recordings or video conferences Hybrid queries—answers questions requiring cross-source information matching This approach is useful for law firms (contract and correspondence analysis), manufacturing (video-based quality control), medicine (analysis of medical images and reports), and finance (processing multiple documents).
Scaling and
Performance StepFun emphasizes that Step 3.7 Flash is not a research project but a production-ready solution. The model is optimized for NVIDIA GPUs, including new architectures. This means predictable latency, batch processing support for high-load systems, and guaranteed compatibility with enterprise infrastructure. Availability on NVIDIA accelerators is critical for companies that have already invested in GPU clusters. They can add multimodality to existing applications without retraining engineers or rewriting pipelines.
What
This Means AI's transition from text analysis to full multimodality is not just adding features—it's a paradigm shift. When a model sees the screen as a human does (text + image + video simultaneously), new applications become possible: intelligent RPA, analysis of large volumes of unstructured data, document automation at a level that previously required human workers. Step 3.7 Flash demonstrates that this level is now available in production-ready form on standard hardware.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.