Habr AI→ original

LM Studio and Qwen: How Local LLMs Handle Coding on MacBook M4 Pro

Local LLMs for coding can now be used without the cloud if the task is quick chats and simple edits. In the MacBook M4 Pro experiment, models via LM Studio…

AI-processed from Habr AI; edited by Hamidun News
LM Studio and Qwen: How Local LLMs Handle Coding on MacBook M4 Pro
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

Local language models can already be used for writing and editing code without sending source code to the cloud, but the comfort of such work still heavily depends on the task and available hardware. An experiment on a MacBook Pro with M4 Pro and 48 GB of memory shows that the LM Studio and modern models with open weights already deliver tangible results in chat mode, but in full agent mode they quickly hit memory constraints, heat, and execution time limits. The starting point here is simple: cloud models are convenient, but they have limitations, network dependency, and the main downside for many developers — data, code, and prompts are sent to remote servers.

Local execution promises privacy and full control, but requires understanding how a model consumes RAM and VRAM, how much memory remains for context, and how formats like GGUF and MLX differ. Tests were conducted on a MacBook Pro with M4 Pro chip and 48 GB unified memory, where CPU and GPU share a common memory pool. This helps accommodate larger models but simultaneously means the model competes for resources with IDE, Docker, and dozens of browser tabs.

A separate part of the analysis focuses on choosing a model for the hardware. The author suggests not only looking at the size in billions of parameters, but also at specialization, quantization, function calling support, and architecture type. For coding, he used Qwen3-Coder 30B A3B Instruct in MLX and GGUF variants, and also compared it with Qwen3-Coder Next, Qwen3.

5, Nvidia Nemotron-3 Nano, and Gemma 4 26B A4B. The article explains well the practical meaning of abbreviations: for example, A3B indicates a MoE approach, where only part of the parameters from a large model are activated, which makes speed closer to small models while quality approaches larger ones. LM Studio was chosen as the runtime: through it, models are easily downloaded, a local server is set up, CORS is enabled, and agents like Claude Code, Open Code, Kilo Code, and Aider can be connected.

The performance forecast for Qwen3-Coder promised around 150 tokens per second, but the actual measurement in LM Studio turned out to be closer to 82 tokens per second, which immediately brings the conversation from theory to practice. The most interesting part begins with the measurements. In regular chat mode, local models no longer look like a toy but like a working compromise.

Qwen3-Coder 30B A3B Instruct in MLX 4bit fit approximately in 2 minutes 9 seconds for the entire three-stage scenario and reached a final score of 8.5 out of 10. Gemma 4 26B A4B in GGUF showed one of the best balances: around 2 minutes 23 seconds and a final score of 10 out of 10.

More thinking models gave better results but at the cost of time: Qwen3.5 35B A3B reached 10 out of 10 in approximately 5 minutes 43 seconds, while Qwen3.5 27B stretched almost to half an hour.

The conclusion from this part is sober: local models already sometimes match cloud models in response speed, especially without thinking mode, but over the same time often lag behind in quality. Meanwhile, fresh MoE models look noticeably more practical than dense variants. In agent mode, the picture changes dramatically.

Context grows, the number of calls increases, and seconds turn into minutes or even tens of minutes. Aider with the same Qwen3-Coder MLX 4bit completed the scenario in 2 minutes 50 seconds with a score of 9.5, Open Code in 7 minutes 33 seconds with a score of 9, but Kilo Code with the same model took 15 minutes 5 seconds and only reached 6 points.

With the heavier Qwen3.5 35B A3B, Kilo Code took 57 minutes 3 seconds, although the final quality improved to 9 out of 10. Claude Code with Gemma 4 26B completed the experiment with a maximum score of 10 out of 10, but spent a total of 21 minutes 14 seconds, and the Claude Code with Qwen3-Coder combination actually crashed due to insufficient memory for context.

In parallel, the laptop suffered noticeably: the GPU heated up to around 100 degrees, the fans barely stopped, and swap in some scenarios bloated up to 20 GB. Against this backdrop, cloud agents looked trivially more convenient: for example, Kilo Code with Qwen3.5 Plus gave 9 out of 10 in 6 minutes 53 seconds, and Claude Opus 4.

6 — 10 out of 10 in 12 minutes 15 seconds, albeit at a cost. The conclusion is simple: local LLMs can now be seriously considered for private chat, one-off refactoring tasks, and simple scenarios where control over data matters more than absolute speed. But if you need constant agent mode on a work laptop, especially alongside IDE, browser, and Docker, the local stack remains a compromise.

The most reasonable scenario from this experience is to use fresh MoE models, use simpler agents like Aider or Open Code, and when possible, run the local model on a separate machine like Mac mini.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…