
Gemma 4 and Qwen Coder vs. the cloud: local LLMs in production

Local LLMs like Gemma 4 and Qwen Coder are ready for real work: writing, refactoring, and parsing code. All you need is a GPU with 16 GB of VRAM and properly tuned parameters.

Source: Habr AI. Collage: Hamidun News.

Local models like Gemma 4 and Qwen Coder are in a strange position: on the one hand, they aren't taken seriously; on the other, few people have actually tested their capabilities on real work rather than on synthetic benchmarks.

The YouTube test problem

YouTube is full of tests of local LLMs.

But they're all similar: someone takes a large model, launches it somehow, and asks it to write bubble sort. Of course it handles that, and no one is impressed. The real question is different: can a local model write working code, refactor files that already contain bugs, and extract data from HTML, the way real projects demand?

Most tests also ignore parameters, and it's often the parameters that decide everything. Get the temperature, context window, or quantization scheme wrong, and the quality collapses. Getting a bad result from a local model is easy. Getting a good one takes time.

Gemma 4 and Qwen: which models, which conditions

Vyacheslav tested several models, choosing ones that actually fit in the 16 GB of VRAM of an ordinary graphics card:

  • Gemma 4 (Google) — a universal, well-balanced model
  • Qwen 3.6 (Alibaba) — balanced performance and speed
  • Qwen Coder — specialized for code generation and analysis

All of them ran through llama.cpp with tuned parameters: GPU offloading and a quantization scheme chosen to fit in memory. The first half of the problem is simply getting the llama.cpp API up and running. The second half is choosing the right parameters. Which quantization level? What temperature? How large a context window? These things have to be tuned for the specific task, not guessed.
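To make that concrete, here is a minimal sketch of such a setup using the llama-cpp-python bindings. The GGUF filename and every numeric value (context size, GPU layers, temperature) are illustrative assumptions; the article does not publish its exact configuration.

```python
# Minimal sketch of a tuned local setup via llama-cpp-python.
# ASSUMPTIONS: the model filename and all numeric values below are
# illustrative, not the article's actual settings.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen-coder-q4_k_m.gguf",  # hypothetical quantized model
    n_ctx=8192,       # context window: too small and multi-step tasks get truncated
    n_gpu_layers=-1,  # offload all layers to the GPU (a Q4 quant fits in 16 GB)
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Refactor this function: ..."}],
    temperature=0.2,  # low temperature: code tasks punish creative sampling
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```

The point of the sketch is that these knobs live in your launch code, not in the model: the same GGUF file behaves very differently at temperature 0.2 versus 1.0, or at a 2K versus 8K context.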

Results in an agentic environment

The author tested the models not on isolated examples, but in a real agentic environment — with chains of actions, where an error in one step breaks everything else.

  • Writing working code on the first attempt
  • Refactoring a codebase with existing logic and bugs
  • Extracting structured data from HTML (see the sketch below)
  • Following complex instructions in the context of a task
  • Adapting when requirements change within a session

The results showed that with correctly chosen parameters, local models handle typical tasks at the level of cloud solutions, and without network delays.
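As an illustration of the HTML extraction task, here is a rough sketch against llama.cpp's built-in server (llama-server), which exposes an OpenAI-compatible /v1/chat/completions endpoint, by default on port 8080. The HTML snippet, prompt, and expected schema are invented for this example; the author's actual test harness is not shown in the article.

```python
# Sketch: structured extraction from HTML via a local llama.cpp server.
# ASSUMPTIONS: the HTML, prompt, and {name, price} schema are invented;
# real code should also handle models that return non-JSON output.
import json
import requests

HTML = """<ul class="products">
  <li><span class="name">Widget</span><span class="price">9.99</span></li>
  <li><span class="name">Gadget</span><span class="price">24.50</span></li>
</ul>"""

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # llama-server default port
    json={
        "messages": [
            {"role": "system",
             "content": "Return ONLY a JSON array of {name, price} objects."},
            {"role": "user", "content": HTML},
        ],
        "temperature": 0.0,  # deterministic output for extraction tasks
    },
    timeout=120,
)
items = json.loads(resp.json()["choices"][0]["message"]["content"])
assert all({"name", "price"} <= item.keys() for item in items)
```

Pinning the temperature to zero makes extraction runs repeatable, which matters in agentic chains where a single malformed step breaks everything downstream.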

Why we need local LLMs

It might seem like an academic question.

But there are scenarios where cloud APIs are not an option: sensitive data, closed (air-gapped) networks, regulatory requirements, API costs at scale. Local models give you control: you know where the computation runs, and there are no surprises with data logging. This matters when you work with confidential information or in environments where cloud APIs are prohibited.

What this means

Local LLMs have moved out of the experimental stage.

They're ready for production work — if you're willing to spend time tuning parameters. For business, this means: an investment in a graphics card can replace cloud APIs for a whole class of problems, from coding to processing sensitive information.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.