CUDA
CUDA (Compute Unified Device Architecture) is NVIDIA's proprietary parallel computing platform and programming model that lets developers write C/C++ code executed directly on NVIDIA GPUs, turning them into general-purpose parallel processors for AI and scientific workloads.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface introduced by NVIDIA in 2006. It extends standard C/C++ with annotations and runtime libraries that let developers write kernels (functions that execute in parallel across thousands of GPU cores) without requiring any knowledge of graphics programming. CUDA turned NVIDIA GPUs from graphics-focused accelerators into general-purpose parallel processors usable for scientific simulation, image processing, cryptography, and eventually machine learning at scale.
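As a rough illustration of these annotations, the sketch below marks a function with `__global__` and launches it with the `<<<blocks, threads>>>` syntax. The kernel name `scale_add`, the helper `scale_add_matches_reference`, and the array size are illustrative assumptions, not part of any CUDA library. Managed memory is used here only to keep the example short.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// __global__ marks a function that runs on the GPU and is launched from the CPU.
__global__ void scale_add(int n, float a, const float *x, float *y) {
    // Each thread handles one element, selected from its block and thread index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may extend past the end of the array
        y[i] = a * x[i] + y[i];
    }
}

// Hypothetical helper: runs the kernel on n elements (n > 0) and checks every result.
bool scale_add_matches_reference(int n) {
    float *x = nullptr, *y = nullptr;
    // Managed (unified) memory is accessible from both the CPU and the GPU.
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return false;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return false;
    }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // ceiling division covers all n elements
    scale_add<<<blocks, threads>>>(n, 3.0f, x, y);

    // Launches are asynchronous: check the launch, then wait before reading results.
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    // Every element should now hold 3 * 1 + 2 = 5.
    for (int i = 0; ok && i < n; ++i) ok = (std::fabs(y[i] - 5.0f) < 1e-6f);

    cudaFree(x);
    cudaFree(y);
    return ok;
}

int main() {
    bool ok = scale_add_matches_reference(1 << 20);
    std::printf("%s\n", ok ? "results match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

Assuming the file is saved as `scale_add.cu` (a name chosen here for illustration), it would typically be compiled with `nvcc scale_add.cu -o scale_add` and run on a machine with an NVIDIA GPU and driver installed.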
In the CUDA programming model, the CPU (host) orchestrates execution and launches kernel functions that run on the GPU (device). Each kernel launch is organized as a grid of thread blocks; each block runs on a single streaming multiprocessor (SM), and its threads share fast on-chip shared memory. The CUDA runtime provides the API for allocating device memory, copying data between host DRAM and GPU memory (HBM on data-center GPUs, GDDR on most consumer cards), launching kernels, and synchronizing, while the GPU hardware schedules blocks onto SMs. Higher-level libraries built on CUDA, including cuBLAS for dense linear algebra, cuDNN for neural network primitives, cuSPARSE for sparse operations, and NCCL for multi-GPU collective communication, let frameworks such as PyTorch, TensorFlow, and JAX call optimized GPU routines without developers writing raw CUDA kernels.
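The host/device split, the grid of thread blocks, and per-block shared memory can be sketched with a small block-wise sum. This is a minimal illustration rather than a tuned reduction. The names `block_sum`, `gpu_sum_of_ones`, and `kThreads` are hypothetical. The block size is assumed to be a power of two, and the final pass over the per-block partial sums is done on the host for simplicity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; assumed to be a power of two

// Each block sums up to kThreads input elements into one partial result.
__global__ void block_sum(const int *in, int *partial, int n) {
    __shared__ int tile[kThreads];  // on-chip memory shared by this block's threads
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0;  // pad out-of-range slots with zero
    __syncthreads();  // wait until every thread in the block has written its slot

    // Tree reduction: halve the number of active threads at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums n ones (n > 0) on the GPU; returns -1 on any CUDA error.
long long gpu_sum_of_ones(int n) {
    int blocks = (n + kThreads - 1) / kThreads;  // grid size via ceiling division
    std::vector<int> h_in(n, 1), h_partial(blocks, 0);
    int *d_in = nullptr, *d_partial = nullptr;

    bool ok = cudaMalloc(&d_in, n * sizeof(int)) == cudaSuccess &&
              cudaMalloc(&d_partial, blocks * sizeof(int)) == cudaSuccess;
    // Host -> device copy of the input.
    ok = ok && cudaMemcpy(d_in, h_in.data(), n * sizeof(int),
                          cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        block_sum<<<blocks, kThreads>>>(d_in, d_partial, n);
        ok = cudaGetLastError() == cudaSuccess;
    }
    // Device -> host copy; this cudaMemcpy waits for the kernel to finish first.
    ok = ok && cudaMemcpy(h_partial.data(), d_partial, blocks * sizeof(int),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);       // cudaFree accepts a null pointer
    cudaFree(d_partial);
    if (!ok) return -1;

    long long total = 0;  // finish the reduction on the host
    for (int p : h_partial) total += p;
    return total;
}

int main() {
    const int n = 1000;
    long long total = gpu_sum_of_ones(n);
    std::printf("sum = %lld (expected %d)\n", total, n);
    return total == n ? 0 : 1;
}
```

In practice a reduction like this would more likely come from a library such as CUB or Thrust. The hand-written version is shown only to make the block, thread, and shared-memory roles visible.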
CUDA matters because it is the de facto standard programming interface for GPU-accelerated AI. PyTorch, TensorFlow, JAX, and virtually every other major AI framework ship CUDA backends as their primary GPU execution path. The research community's investment in CUDA-optimized code, built up over more than a decade and including FlashAttention and related fused attention kernels, cuDNN, and Triton (a Python-based kernel language that compiles to NVIDIA's PTX intermediate representation), creates substantial switching costs for anyone moving to alternative hardware platforms. NVIDIA's debugging and profiling tooling (Nsight Systems, Nsight Compute, Compute Sanitizer) is likewise built around CUDA.
As of 2026, CUDA's dominance in AI training remains largely intact, though it faces pressure from AMD's ROCm platform, on which PyTorch runs with minimal code changes, and from vendor-neutral compute APIs such as OpenCL, Vulkan Compute, and SYCL. The CUDA 12.x releases added support for the Hopper (H100) and Blackwell (B200) architectures, exposing their FP8 tensor core operations and improved distributed-memory primitives. The maturity gap between CUDA's ecosystem and its competitors remains wide for frontier model training, though inference deployments increasingly rely on higher-level runtimes such as TensorRT, vLLM, and ONNX Runtime, which abstract away direct CUDA programming.