
NVIDIA Tesla V100 for local AI models: testing on real-world tasks

The 2017 Tesla V100 handles modern LLMs such as Qwen3.6-35B and GPT-OSS-20B. Generation speed ranges from 38 to 109 tokens per second depending on the model.

Source: Habr AI. Collage: Hamidun News.

The Tesla V100 16 GB is a server accelerator that NVIDIA released in 2017. Can it run modern large language models released in 2025–2026? The authors of the review decided to find out and tested five popular open models (from Qwen to Gemma) on five real-world scenarios, ranging from text generation to writing code and building a game.

Speed in Numbers

The Tesla V100 has HBM2 memory with a bandwidth of about 900 GB/s and a peak of 125 TFLOPS in FP16 (half precision) on its Tensor Cores. In practice, with the model held entirely in VRAM, this yields 38–109 tokens per second (t/s), depending on the model, its size, and its quantization (the degree of weight compression). The fastest model is GPT-OSS-20B at 109 t/s.

The slowest result under full load is Qwen3.6-35b-a3b in Q4 quantization at 19 t/s, the same figure the review reports when part of the model is offloaded to system RAM. There is an interesting twist, though: when the researchers enabled Multi-Token Prediction (MTP), a mode in which the model predicts several tokens at once, the same Qwen model jumped to 77 t/s.

That is roughly a fourfold increase (77 / 19 ≈ 4.05) from parallel prediction. There is a catch: MTP runs reliably on the Vulkan backend, whereas on CUDA, Qwen with MTP can be unstable. Keep this in mind when choosing a backend.
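The MTP effect can be put in rough numbers. The sketch below first computes the measured speedup from the two figures in the review. It then applies a simple model that is an assumption here, not something the review describes: if each extra predicted token is accepted independently with probability p, one forward pass yields 1 + p + p^2 + ... + p^k tokens on average. The function names and the sample acceptance rates are hypothetical.

```python
def measured_speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of generation speed with MTP to the speed without it."""
    return mtp_tps / baseline_tps


def expected_tokens_per_step(accept_prob: float, extra_tokens: int) -> float:
    """Average number of tokens produced per forward pass.

    Assumption (not from the review): each of the `extra_tokens` predicted
    tokens is accepted independently with probability `accept_prob`, and
    acceptance stops at the first rejected token.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if extra_tokens < 0:
        raise ValueError("extra_tokens must be non-negative")
    # One guaranteed token plus p + p^2 + ... + p^k for the extra ones.
    return sum(accept_prob ** i for i in range(extra_tokens + 1))


if __name__ == "__main__":
    # Figures from the review: 19 t/s without MTP, 77 t/s with MTP.
    print(f"measured speedup: {measured_speedup(19, 77):.2f}x")
    # Hypothetical acceptance rates, for illustration only.
    for p in (0.5, 0.8, 0.95):
        tokens = expected_tokens_per_step(p, extra_tokens=4)
        print(f"accept_prob={p}: {tokens:.2f} tokens per pass")
```

Under this model the tokens-per-pass figure is at best an upper bound on the speedup, because predicting and checking the extra tokens also costs time.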

Real Tasks

Which tasks does the V100 handle well in practice?

  • Single-page website: going from a text prompt to finished HTML+CSS+JavaScript takes 1 minute 45 seconds with GPT-OSS-20B or 7 minutes 24 seconds with Qwen without MTP. All five models generated valid code, embedded the media content, and structured the markup correctly.
  • Flappy Bird in JavaScript: the game is created in 1–7 minutes depending on the model. Implementation quality ranges from minimalist procedural code (basic pipe mechanics) to highly detailed graphics that approach the original game.
  • Document summarization: processing a 17-page scientific paper takes 17–180 seconds. GPT-OSS finishes in 17 seconds, Qwen without acceleration in 3 minutes, a difference of roughly tenfold. For comparison, a human needs 15–20 minutes to read and summarize such an article.
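The gaps between the fastest and slowest runs above are plain arithmetic on the quoted timings. Here is a minimal sketch that converts them to seconds and computes the ratios. The helper names are hypothetical, and only the two tasks with exact timings for both models are included.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + seconds


def slowdown(fast_s: float, slow_s: float) -> float:
    """How many times longer the slow run takes than the fast one."""
    return slow_s / fast_s


# Timings quoted in the article: (GPT-OSS-20B, Qwen without MTP).
TASKS = {
    "single-page website": (to_seconds(1, 45), to_seconds(7, 24)),
    "document summarization": (17, to_seconds(3)),
}

if __name__ == "__main__":
    for name, (fast_s, slow_s) in TASKS.items():
        ratio = slowdown(fast_s, slow_s)
        print(f"{name}: {fast_s} s vs {slow_s} s, {ratio:.1f}x")
```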

Critical Factor: GPU Load

The main enemy of local LLMs is the offloading of model layers to system RAM instead of VRAM. When the model fits entirely in VRAM, generation speed is stable at 38 t/s. When some layers are offloaded to RAM, speed drops to 19 t/s, half as fast. The cause is the gap in memory bandwidth: HBM2 delivers about 900 GB/s, while the DDR4 on the motherboard delivers only 50–100 GB/s.

Qwen3.6-35b in Q4 quantization needs 20–21 GB of VRAM, so 24 GB is the safe minimum for general use. 16 GB is suitable only for compact models of up to 20B parameters in aggressive quantization (Q2_K), which costs quality.
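Both effects can be approximated with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions, not the review's method. It assumes about 4.5 bits per weight for a 4-bit block quantization, a hypothetical 1.5 GB allowance for the KV cache and buffers, and a purely bandwidth-bound generation model in which time per token scales with bytes read divided by memory bandwidth. All function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus an assumed overhead fit in the given VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


def offload_slowdown(ram_fraction: float, vram_bw_gbs: float = 900.0,
                     ram_bw_gbs: float = 50.0) -> float:
    """Relative per-token time versus keeping every layer in VRAM.

    Assumption: generation is purely memory-bandwidth bound, so the time
    to read a share of the weights is that share divided by the bandwidth
    of the memory holding it.
    """
    if not 0.0 <= ram_fraction <= 1.0:
        raise ValueError("ram_fraction must be between 0 and 1")
    return (1.0 - ram_fraction) + ram_fraction * vram_bw_gbs / ram_bw_gbs


if __name__ == "__main__":
    size = weights_gb(35, 4.5)  # 35B parameters, assumed 4.5 bits each
    print(f"estimated weights: {size:.1f} GB")
    for vram in (16, 24):
        print(f"fits in {vram} GB: {fits_in_vram(35, 4.5, vram)}")
    for fraction in (0.0, 0.05, 0.10, 0.25):
        print(f"{fraction:.0%} in RAM: {offload_slowdown(fraction):.2f}x slower")
```

In this simplified model, with RAM at 50 GB/s, moving about 6% of the weights (1/17) out of VRAM is already enough to double the time per token. Real systems will differ, but it shows why even partial offloading hurts so much.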

Test PC configuration: ASRock A520M Phantom Gaming 4 motherboard, AMD Ryzen 7 5700GE processor, 64 GB DDR4-3600, Kingston KC3000 1 TB SSD, FSP Vita 750W power supply, Windows 11 Pro, NVIDIA driver 553.74, LM Studio v0.4.14.

What This Means

The Tesla V100 remains a practical accelerator for local LLMs in 2026. It is not the fastest option, but it is versatile and economical: on the secondary market it costs less than modern accelerators such as the H100 or B200. A version with at least 24 GB of memory (in practice the 32 GB variant, since the V100 shipped with either 16 GB or 32 GB) has enough VRAM for 35B-parameter models. With MTP configured on the Vulkan backend and current drivers installed, it becomes a capable local machine for developing, experimenting with, and prototyping LLM applications. For tasks such as code generation, document processing, and game creation it is a working, cost-effective solution. Production scenarios with strict latency requirements (below 100 ms) or high-throughput batch processing still call for modern accelerators such as the H100 or B200.
