Qwen 3.5 on MacBook Pro: Comparing Eight Local Servers for Team Workflows
Eight local MLX servers for Qwen 3.5 35B were compared on a MacBook Pro M2 Max with 64 GB memory. Under single load, leading solutions perform nearly…
AI-processed from Habr AI; edited by Hamidun News
Running large models locally on a Mac has long ceased to be a hobby for enthusiasts, but the story with Qwen 3.5 35B shows that there is a vast gap between "it runs" and "it works as a team API." The author took a MacBook Pro M2 Max with 64 GB of RAM and tested not the model itself but the infrastructure around it: which MLX server can handle a real workload rather than just produce nice numbers in logs, and which one does not collapse when two users arrive simultaneously.
For the test, the author built a separate Python harness and ran eight local servers, each positioned as a quick way to launch an API on top of MLX models on macOS. The validation was based not on a single convenient question but on a set of eight prompts of different types and lengths, including AIME-level tasks and long inputs of up to 52 thousand tokens. Each scenario ran five times to smooth out random spikes and give a more honest picture of latency, generation speed, and overall behavior under load.
Special emphasis was placed on assessing not laboratory peak speed, but system behavior in conditions close to real work: with streaming responses, network overhead, and repeatable measurement conditions.
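The repeated-run approach described above can be sketched in a few lines of Python. This is only an illustration, not the author's harness: the `send` callable, the function names, and the choice of the median as the summary statistic are assumptions made for the example. Only the figure of five runs per scenario comes from the article.

```python
import statistics
import time
from typing import Callable, Dict, List

RUNS_PER_PROMPT = 5  # the article reports five runs per scenario


def time_prompt(send: Callable[[str], str], prompt: str,
                runs: int = RUNS_PER_PROMPT) -> List[float]:
    """Return wall-clock seconds for each of `runs` calls to `send(prompt)`.

    `send` is a hypothetical stand-in for whatever function actually
    talks to the server under test.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        send(prompt)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Summarize repeated runs; the median resists one-off spikes."""
    return {
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }
```

Reporting the minimum and maximum next to the median keeps outliers visible instead of hiding them, which matters when a single slow run may point to a real problem rather than noise.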
In single-user mode there were few surprises: the top three showed similar results, and on short sessions the difference between them looked cosmetic. That is precisely why marketing promises in README files mislead so easily. If you look only at a single request, almost any modern MLX server seems good enough for everyday work. But this conclusion breaks down as soon as the local model turns from a personal tool into a service for a team, where requests start overlapping in time.
The most revealing stage of the test was parallel load from two requests. This is where a real gap between solutions emerged. Four frameworks out of six essentially fell back to queueing and handled requests almost sequentially, though they still appeared multithreaded on the surface. Another server maintained parallelism only formally and dropped to a coefficient of 0.85x, meaning the second request hindered rather than helped hardware utilization. Only one participant showed an honest speedup of 2.17x, which already looks like suitable behavior for a local team API, where it matters not just to answer one user quickly but to handle multiple requests without dramatic degradation.
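The article does not spell out how the coefficient was calculated. One plausible reading, assumed here purely for illustration, is the time needed to run the requests one after another divided by the wall-clock time when they are issued together. Under that definition a server that silently queues scores about 1.0x, and a value below 1.0x means concurrency made things worse. The function names and the thread-based driver are hypothetical.

```python
import threading
import time
from typing import Callable, Sequence


def wall_time(tasks: Sequence[Callable[[], None]], concurrent: bool) -> float:
    """Run all tasks and return the total wall-clock seconds taken."""
    start = time.perf_counter()
    if concurrent:
        threads = [threading.Thread(target=task) for task in tasks]
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    else:
        for task in tasks:
            task()
    return time.perf_counter() - start


def scaling_factor(sequential_s: float, concurrent_s: float) -> float:
    """Assumed definition: sequential time divided by concurrent time."""
    if concurrent_s <= 0:
        raise ValueError("concurrent_s must be positive")
    return sequential_s / concurrent_s
```

In a real measurement each task would send one streaming request to the server; threads are sufficient for that because the client spends its time waiting on the network rather than computing.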
Along the way, problems surfaced that matter more than dry numbers in a table. In one case, the author ran into quadratic attention, which in 2026 can still sharply degrade behavior on long contexts. In another, a phantom 14,000 tokens/sec appeared not from some magical optimization but from a single line in an SSE parser that distorted the measurement. Also worth mentioning is a zombie process that left around 20 GB of RAM occupied in its wake, a risk that READMEs prefer to stay silent about.
For those planning local production, these are not trifles: such bugs impact service predictability, monitoring, and support costs far more than differences of a few percent in raw speed.
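The article does not say which parser line produced the phantom throughput figure, so the sketch below does not reproduce that bug. It only shows one way a streaming measurement can be made more defensive: parse `data:` lines carefully and refuse to report a rate when the decode window is implausibly short, for example when a buffered response arrives in a single burst. The OpenAI-style chunk layout (`choices[0].delta.content`), the function names, and the 50 ms threshold are all assumptions.

```python
import json
from typing import Iterable, List, Optional


def parse_sse_content(lines: Iterable[str]) -> List[str]:
    """Extract text deltas from OpenAI-style SSE `data:` lines (format assumed)."""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        choices = event.get("choices") or []
        if not choices:
            continue  # e.g. a trailing chunk that carries no text
        delta = (choices[0].get("delta") or {}).get("content")
        if delta:
            chunks.append(delta)
    return chunks


def decode_rate(token_count: int, first_token_s: float, last_token_s: float,
                min_window_s: float = 0.05) -> Optional[float]:
    """Tokens/sec between first and last token; None if the window is too short."""
    window = last_token_s - first_token_s
    if window < min_window_s or token_count < 2:
        return None
    # The first token marks the start of the window, so it is not counted.
    return (token_count - 1) / window
```

Returning `None` instead of a number forces the caller to treat a suspicious measurement as missing data rather than as a record-breaking result.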
The practical value of this work lies in shifting focus from beautiful promises to actual use cases. If a model is needed by one developer for occasional requests, ease of deployment and basic speed are reasonable criteria. But if we're talking about a team API with parallelism, long contexts, and the need to recover quickly from failures, choosing a server based on its README alone is already dangerous.
This benchmark shows a simple thing: the local stack for Qwen 3.5 should be evaluated as infrastructure, not as a demo. Otherwise, you can end up with a system that looks "fast" on single-request tests but in real use turns a powerful MacBook into an expensive request queue.