Together AI Launched Self-Service Instant Clusters on NVIDIA H100 and B200
AI-processed from Together AI Blog; edited by Hamidun News
Together AI has officially launched Instant Clusters: self-service GPU clusters that deploy in minutes and are ready for production without lengthy approvals or manual configuration.
What is it
Instant Clusters are GPU clusters built on NVIDIA H100 and B200 and delivered as a cloud service. You create a cluster through the web console, the CLI, or the API, and within minutes it is ready to handle workloads.
The architecture lets you start with a compact configuration — 8 GPUs on a single node — and scale to hundreds of GPUs in a distributed network configuration without changing application code. Clusters are flexible in orchestration choice: they support Kubernetes for containerized workloads and Slurm for traditional HPC. You can pin NVIDIA Driver and CUDA versions for each cluster, ensuring reproducibility across runs and experiments. Integration with infrastructure-as-code tools (Terraform, SkyPilot) makes deployment part of your CI/CD pipeline.
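The scaling and version-pinning ideas above can be illustrated with a small sketch. This is not Together AI's API: `ClusterSpec`, its field names, and the rule that clusters grow in whole 8-GPU nodes are assumptions made for illustration, and the driver and CUDA strings are placeholder values.

```python
from dataclasses import dataclass, replace

# Assumption: clusters grow in whole nodes of 8 GPUs, matching the
# article's smallest configuration (8 GPUs on a single node).
GPUS_PER_NODE = 8
SUPPORTED_GPUS = {"H100", "B200"}
SUPPORTED_ORCHESTRATORS = {"kubernetes", "slurm"}


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical cluster request; all field names are illustrative."""
    gpu_type: str
    gpu_count: int
    orchestrator: str
    driver_version: str  # pinned so repeated runs use the same driver
    cuda_version: str    # pinned so repeated runs use the same CUDA

    def __post_init__(self):
        if self.gpu_type not in SUPPORTED_GPUS:
            raise ValueError(f"unsupported GPU type: {self.gpu_type}")
        if self.orchestrator not in SUPPORTED_ORCHESTRATORS:
            raise ValueError(f"unsupported orchestrator: {self.orchestrator}")
        if self.gpu_count <= 0 or self.gpu_count % GPUS_PER_NODE != 0:
            raise ValueError("gpu_count must be a positive multiple of 8")

    @property
    def node_count(self) -> int:
        return self.gpu_count // GPUS_PER_NODE

    def scaled_to(self, gpu_count: int) -> "ClusterSpec":
        # Only the size changes; the pinned versions carry over unchanged.
        return replace(self, gpu_count=gpu_count)


small = ClusterSpec("H100", 8, "kubernetes", "550", "12.4")  # placeholder versions
large = small.scaled_to(256)
print(small.node_count, large.node_count)  # prints "1 32"
```

Keeping the spec immutable and deriving the larger cluster from the smaller one mirrors the article's point: the size changes while the pinned software stack stays the same.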
Full Stack Included
Building a GPU cluster typically requires days of engineering work: installing drivers on each node, configuring network fabrics, setting up HTTPS certificates, organizing storage and resource management. Instant Clusters solve this problem: all critical components are already built into the image and ready to run.
What's included:
- GPU Operator: automatic installation and management of NVIDIA drivers, including the CUDA and container runtimes
- Ingress Controller: routing of incoming traffic to the cluster, with load balancing and failover support
- NVIDIA Network Operator: management of high-speed networks (NVIDIA Quantum InfiniBand and Spectrum-X Ethernet with RoCE)
- Cert Manager: automatic creation and rotation of TLS certificates for HTTPS endpoints
- Storage: high-performance parallel storage located near the compute nodes for fast access
Result: clusters are production-ready out of the box, without weeks of post-deployment configuration.
Optimized for Large-Scale Training
Clusters are designed for distributed model training. Inter-node communication uses NVIDIA Quantum-2 InfiniBand with guaranteed low latency and high bandwidth. Within each node, GPUs are connected via NVLink and NVLink Switch, enabling ultra-fast communication.
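To make the two communication tiers concrete, here is a minimal sketch that maps a global training rank to its node and reports which interconnect a pair of ranks would use. It assumes 8 GPUs per node and ranks assigned contiguously node by node, which is a common convention but not something the source specifies; the function names are hypothetical.

```python
GPUS_PER_NODE = 8  # assumption: 8 GPUs per node, as in the smallest configuration


def locate(rank: int, gpus_per_node: int = GPUS_PER_NODE) -> tuple[int, int]:
    """Return (node index, local GPU index) for a global rank."""
    return divmod(rank, gpus_per_node)


def link_between(rank_a: int, rank_b: int,
                 gpus_per_node: int = GPUS_PER_NODE) -> str:
    """Name the interconnect tier two ranks would communicate over."""
    node_a, _ = locate(rank_a, gpus_per_node)
    node_b, _ = locate(rank_b, gpus_per_node)
    # Same node: intra-node fabric. Different nodes: inter-node fabric.
    return "NVLink" if node_a == node_b else "InfiniBand"


print(locate(13))          # prints "(1, 5)"
print(link_between(0, 7))  # prints "NVLink"
print(link_between(7, 8))  # prints "InfiniBand"
```

Ranks 7 and 8 are numerically adjacent but sit on different nodes, so their traffic crosses the inter-node network. This is why both tiers matter for collective operations in distributed training.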
This architecture is critical for reinforcement learning, large model pre-training, and multi-phase training schedules. A concrete example: Latent Health trains models that reason like clinicians, analyzing multimodal clinical data. Models must account for complex preferences (e.g., how to resolve conflicting diagnoses) and requirements from different insurers. With Instant Clusters, they can run large-scale reinforcement learning on full clinical datasets, experiment quickly, then distill results into small, efficient models that often outperform much larger foundation models.
"With Instant Clusters, we can start a full-scale experiment in hours instead of weeks of infrastructure preparation."
What It Means
GPU infrastructure finally feels like modern cloud: API-first, self-service, predictable scaling. Historically, GPU clusters were built manually—a long and complex process. Now it's a managed cloud service. For startups, this means a fast path to first inference without infrastructure engineering costs. For enterprises, it means quick response to demand: unexpected inference traffic growth or a new research project requires only an API call, not lengthy procurement.