العتاد

FLOPS

FLOPS (Floating-Point Operations Per Second) measures how many floating-point arithmetic operations a processor executes each second. It is the primary benchmark for comparing AI hardware, with modern high-end GPUs delivering thousands of teraFLOPS at the reduced-precision formats used in AI training.

FLOPS is a unit of computational throughput equal to one floating-point operation per second. Higher-order prefixes are standard: one teraFLOPS (TFLOPS) equals 10^12 operations per second, one petaFLOPS equals 10^15, and one exaFLOPS equals 10^18. The metric applies to CPUs, GPUs, and specialized AI accelerators alike, though AI workloads rely almost exclusively on GPUs and custom silicon where FLOPS figures are highest.

Floating-point operations involve arithmetic on numbers with decimal precision. Modern AI hardware frequently operates on reduced-precision formats — FP16, BF16, or FP8 — rather than full FP32, multiplying effective throughput while accepting minor precision trade-offs. Manufacturers therefore publish multiple FLOPS figures for the same chip depending on data type. NVIDIA's H100 SXM delivers roughly 2,000 TFLOPS at BF16 dense and approximately 67 TFLOPS at FP32; the Blackwell B200 roughly doubles BF16 throughput relative to the H100 and offers substantially higher throughput at FP8. Comparing chips without specifying the data type is misleading.

FLOPS figures directly determine the speed at which AI models can be trained and served. Larger language models require more arithmetic per training token, so researchers also express total training cost as an aggregate FLOPs count — a dimensionless total rather than a rate. Scaling laws documented by OpenAI and DeepMind in the early 2020s showed that model performance improves predictably with total training compute, making FLOPS a central planning variable for frontier AI labs. The estimated training cost of GPT-4 ran into tens of millions of dollars, driven primarily by GPU-hours consumed.

By 2026, large AI training clusters aggregate tens of thousands of NVIDIA B200 or H200 GPUs or equivalent custom accelerators from Google, Amazon, and Microsoft, reaching multi-exaFLOPS aggregate throughput. Training runs for leading frontier models routinely consume between 10^24 and 10^25 total FLOPs. Algorithmic improvements — including mixture-of-experts architectures and better data curation — have partially offset raw compute demand per unit of model quality, but total cluster-level hardware investment continues to grow.

مثال

When evaluating cloud GPU offerings for a distributed training job, an ML engineer checks that the reserved cluster delivers sufficient TFLOPS at BF16 precision to complete training within the project deadline and budget before committing to a capacity reservation.

مصطلحات مرتبطة

← المسرد