Gemma 4 in Codex CLI: local execution works, but still lags behind cloud
Gemma 4 can now run locally in Codex CLI for real code tasks, but still falls short of cloud models. In a test generating Python functions and running tests…
AI-processed from Habr AI; edited by Hamidun News
Local Gemma 4 can already work in Codex CLI as an agent for everyday coding: reading files, writing patches, and running tests. But an experiment with two different setups showed that getting the model to run is only half the battle. In reliability, code precision, and first-attempt result quality, cloud-based GPT-5.4 remains noticeably ahead.
The test author wanted to check not abstract "local AI development" but a concrete scenario: can the model replace the cloud API in daily work with Codex CLI? The motivation is clear: token costs, privacy requirements, and dependence on external services.
To verify this, two configurations were assembled. The first was a MacBook Pro with an M4 Pro chip and 24 GB of memory, running Gemma 4 26B MoE in Q4_K_M quantization via llama.cpp.
The second was a Dell Pro Max GB10 with 128 GB of unified memory and NVIDIA Blackwell, running Gemma 4 31B Dense via Ollama 0.20.5.
In both cases, the model was connected to Codex CLI as a custom provider in Responses API mode. Setting up the local stack was not straightforward. On the Mac, the Ollama version broke on tool calling because of streaming bugs and hung on long prompts, which is critical for Codex CLI: its system prompt alone takes about 27 thousand tokens.
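To see why a system prompt of that size is so demanding for a local setup, here is a rough budget calculation. It uses the article's approximate 27 thousand token figure and the 32,768-token window of the working Mac configuration; the function name and the optional reply reserve are illustrative assumptions, not Codex CLI settings.

```python
# Rough context-budget arithmetic. Both figures come from the article;
# the system prompt size is approximate, and reply_reserve is hypothetical.
SYSTEM_PROMPT_TOKENS = 27_000   # approximate size of the Codex CLI system prompt
CONTEXT_WINDOW = 32_768         # context size used in the working Mac setup


def remaining_budget(context_window: int, system_prompt: int, reply_reserve: int = 0) -> int:
    """Tokens left for files, tool output, and history after the fixed costs."""
    return context_window - system_prompt - reply_reserve


free_tokens = remaining_budget(CONTEXT_WINDOW, SYSTEM_PROMPT_TOKENS)
share_used = SYSTEM_PROMPT_TOKENS / CONTEXT_WINDOW
print(f"free: {free_tokens} tokens, system prompt uses {share_used:.0%} of the window")
# prints: free: 5768 tokens, system prompt uses 82% of the window
```

Under these numbers, less than a fifth of the window is left for the actual task, which is consistent with long prompts being a weak point of the local stack.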
The working solution ultimately was llama.cpp with manually tuned flags, web_search disabled, and a 32,768-token context. On the GB10, nothing worked on the first try either: vLLM ran into an incompatibility between the PyTorch and CUDA builds for Blackwell, and a manually built llama.cpp mishandled certain types of tools. As a result, the most practical solution again turned out to be not the "ideal" stack but the one that simply worked: Ollama.
The benchmark was conducted on April 12, 2026 on Codex CLI v0.120.0. Through codex exec --full-auto, all three configurations were given the same task: write a Python function parse_csv_summary with error handling, then prepare tests and run them.
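The article does not reproduce the exact prompt or any of the generated code, so the following is only a hypothetical sketch of what a function matching this task description might look like. The return shape, the CsvSummaryError class, and the _infer_type helper are assumptions made for illustration; the sketch simply shows the qualities the comparison turns on: type hints, exception chaining, and recognition of boolean values.

```python
import csv
import tempfile
from pathlib import Path
from typing import Any, Union


class CsvSummaryError(Exception):
    """Hypothetical error type; the article does not name one."""


def _infer_type(values: list[str]) -> str:
    """Classify a column as 'empty', 'bool', 'number', or 'str'."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "bool"
    try:
        for v in non_empty:
            float(v)
    except ValueError:
        return "str"
    return "number"


def parse_csv_summary(path: Union[str, Path], encoding: str = "utf-8") -> dict[str, Any]:
    """Return row count, column names, and a guessed type per column."""
    try:
        with open(path, newline="", encoding=encoding) as handle:
            reader = csv.DictReader(handle)
            rows = list(reader)
            columns = list(reader.fieldnames or [])
    except (OSError, UnicodeDecodeError, csv.Error) as exc:
        # "raise ... from exc" keeps the original error as __cause__.
        raise CsvSummaryError(f"cannot read CSV file {path!s}") from exc
    if not columns:
        raise CsvSummaryError(f"CSV file {path!s} has no header row")
    return {
        "rows": len(rows),
        "columns": columns,
        "types": {c: _infer_type([row.get(c) or "" for row in rows]) for c in columns},
    }


# Small demo on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("name,active,score\nann,true,1.5\nbob,False,2\n", encoding="utf-8")
    summary = parse_csv_summary(sample)
print(summary)
```

A test suite for such a function would typically cover the happy path, a missing file, an empty file, and mixed column types; the article only states that five tests were written, not what they checked.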
Cloud-based GPT-5.4 with high reasoning effort performed best: it delivered neat code with type hints and proper exception chaining, and passed all five tests on the first attempt in 65 seconds. Local Gemma 4 31B on the GB10 also delivered a working result on the first pass, though of simpler quality: no type hints and no recognition of boolean values.
Still, all five of its tests passed immediately; the run took about seven minutes and three tool calls. The most problematic setup was the Mac with 26B MoE: the model left dead code, rewrote the test file several times, and made silly typos such as a broken variable name and an incorrect encoding string. In total, the task took 4 minutes 42 seconds but required 10 tool calls and five failed attempts to write tests.
Interestingly, the Mac unexpectedly outperformed the more powerful GB10 in "raw" generation speed. In llama-bench, 26B MoE on the Mac delivered about 52 tokens per second versus about 10 tokens per second for 31B Dense on the GB10, while on prompt processing with an 8K context the machines ran almost evenly: 531 versus 548 tokens per second. The explanation lies in the Mixture of Experts architecture: with MoE, only part of the parameters are activated at each step, so the amount of data that has to be read from memory per token drops drastically.
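The memory argument can be illustrated with a back-of-the-envelope estimate. The sketch below assumes token generation is limited purely by memory bandwidth and that every active weight is read once per token; the bandwidth, bytes per weight, and active-parameter count are placeholder values chosen for illustration, not measured or published figures for either machine or for Gemma 4.

```python
def decode_tokens_per_second(
    active_params_billion: float,
    bytes_per_param: float,
    bandwidth_gb_s: float,
) -> float:
    """Upper bound on decode speed if each active weight is read once per token."""
    gigabytes_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gigabytes_per_token


# All three constants below are hypothetical illustration values.
BANDWIDTH_GB_S = 270.0      # assumed memory bandwidth, same for both cases
BYTES_PER_PARAM = 0.6       # assumed average for a 4-bit-class quantization
MOE_ACTIVE_BILLION = 4.0    # assumed active parameters per token in the MoE model

dense_bound = decode_tokens_per_second(31.0, BYTES_PER_PARAM, BANDWIDTH_GB_S)
moe_bound = decode_tokens_per_second(MOE_ACTIVE_BILLION, BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"dense 31B bound: {dense_bound:.1f} tok/s")
print(f"MoE bound:       {moe_bound:.1f} tok/s")
print(f"ratio:           {moe_bound / dense_bound:.2f}x")
```

With equal bandwidth, the ratio reduces to the ratio of parameters read per token, which is why a MoE model can generate faster than a dense model of similar total size. Real throughput is lower than these bounds because of KV-cache reads, routing overhead, and compute limits.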
But this advantage barely helped in the real task, because most of the time was consumed not by computation but by model errors, repeated tool calls, and unnecessary fixes along the way.
The main conclusion is twofold. On one hand, Gemma 4 has indeed moved local agentic coding from the category "breaks almost always" to the category "you can actually live with this": the author notes that on tau2-bench, function calling performance was 6.6% for Gemma 3 and 86.4% for Gemma 4 31B. On the other hand, in practical development, reliability on the first attempt matters more than tokens-per-second records.
Therefore, local mode already looks realistic for private tasks, fast iterations, and work without constant API spending, but in complex scenarios cloud models remain stronger for now. The most reasonable conclusion from the test appears to be a hybrid mode: a local model for some tasks, and the cloud as the primary tool where the cost of an error outweighs speed or privacy concerns.