Ollama accelerates local AI on Apple M5: a Mac with at least 32 GB of memory is required

AI-processed from 3DNews AI; edited by Hamidun News
Source: 3DNews AI. Collage: Hamidun News.

Ollama has added hardware acceleration for Apple M5, M5 Pro and M5 Max, so local AI models on macOS run noticeably faster. The new backend is available as a preview and requires at least 32 GB of unified memory.

What Changed

Ollama is one of the best-known tools for running large language models locally on Windows, Linux and macOS. In version 0.19, the developers moved inference on Apple Silicon to a new backend built on MLX, Apple's own machine learning framework, which makes better use of the chip's unified memory and compute units. For users, this means a shorter wait for the first token and faster generation, without going to the cloud or sending data to an external service.

The key point is that the acceleration is currently tied specifically to the Apple M5 family. According to Ollama, the application now uses the new GPU Neural Accelerators in the M5, M5 Pro and M5 Max chips. These are what deliver the improvement in both time to first token and overall output speed. This matters most in scenarios where the model does not just answer in a chat interface but is continuously fed long context, tools and action history.

Where the Improvement is Visible

Judging by the numbers, the update is of practical value. In its official test, Ollama compared version 0.19 with 0.18 on the Qwen3.5-35B-A3B model: prefill speed rose from 1154 to 1810 tokens per second, and decode speed from 58 to 112 tokens per second. With int4 quantization, the developers promise even higher figures: up to 1851 tokens per second for prefill and up to 134 for decode. That is a difference noticeable not only in benchmarks but also in everyday work.
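To put those figures in relative terms, the short sketch below divides the new throughput by the old one for each phase. It uses only the numbers quoted above; the variable and function names are illustrative.

```python
# Throughput figures quoted in the article, in tokens per second.
V018 = {"prefill": 1154, "decode": 58}
V019 = {"prefill": 1810, "decode": 112}
V019_INT4 = {"prefill": 1851, "decode": 134}


def speedup(new: dict, old: dict) -> dict:
    """Return the new/old throughput ratio per phase, rounded to two decimals."""
    return {phase: round(new[phase] / old[phase], 2) for phase in old}


if __name__ == "__main__":
    print("0.19 vs 0.18:     ", speedup(V019, V018))
    print("0.19 int4 vs 0.18:", speedup(V019_INT4, V018))
```

By this measure prefill is about 1.6 times faster and decode roughly doubles, so the larger gain is in generation speed rather than in prompt processing.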

"This is the fastest way to run Ollama on Apple Silicon," the developers write in the preview release announcement.

Faster performance is expected not only for regular local chats, but also for tools where the model constantly processes code, commands and long prompts:

  • personal assistants like OpenClaw
  • code agents like Claude Code, OpenCode and Codex
  • long sessions with shared system prompts and dialogue branching
  • local scenarios where privacy and low latency matter
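For scenarios like these, a rough way to see what the benchmark figures mean is a two-phase latency model: the prompt is prefilled first, then the answer is decoded token by token. The sketch below is a simplification under stated assumptions: constant throughput in each phase, no caching, and a hypothetical workload of a 16,000-token prompt with a 400-token answer, which does not come from Ollama's test.

```python
def response_time(prompt_tokens: int, output_tokens: int,
                  prefill_tps: float, decode_tps: float) -> tuple[float, float]:
    """Return (time_to_first_token, total_time) in seconds.

    Simplified model: the whole prompt is prefilled at a constant rate,
    then output tokens are decoded one by one at a constant rate.
    """
    ttft = prompt_tokens / prefill_tps
    total = ttft + output_tokens / decode_tps
    return ttft, total


if __name__ == "__main__":
    # Hypothetical agent-style request (assumption, not a measured workload).
    prompt, answer = 16_000, 400
    for label, prefill, decode in [("0.18", 1154, 58), ("0.19", 1810, 112)]:
        ttft, total = response_time(prompt, answer, prefill, decode)
        print(f"Ollama {label}: first token after {ttft:.1f} s, done after {total:.1f} s")
```

Under these assumptions the wait for the first token drops from about 14 seconds to about 9, and the full response from about 21 seconds to about 12. Real sessions will differ, since context length and cache hits change both phases.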

Ollama has also updated its caching. The application can now reuse the cache across different conversations, save cache snapshots at suitable points in the prompt, and keep shared prefixes in memory longer. For code and agent scenarios, this matters more than it seems: when a tool keeps returning to the same system context, cutting redundant prompt reprocessing directly speeds up responses.
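The effect of prefix reuse can be illustrated with a toy model. This is not Ollama's implementation: the class below only counts how many prompt tokens would still need prefill if everything up to the longest previously seen common prefix could be taken from cache. The names `PrefixCache` and `common_prefix_len` are hypothetical, and tokens are represented as plain integers.

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading tokens that two token sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Toy stand-in for a prompt cache shared across conversations."""

    def __init__(self) -> None:
        self._prompts: list[list[int]] = []

    def process(self, tokens: list[int]) -> int:
        """Record a prompt and return how many of its tokens need prefill."""
        reused = max(
            (common_prefix_len(tokens, seen) for seen in self._prompts),
            default=0,
        )
        self._prompts.append(list(tokens))
        return len(tokens) - reused


if __name__ == "__main__":
    system_prompt = [1, 2, 3, 4]      # shared system context
    cache = PrefixCache()
    print(cache.process(system_prompt + [10, 11]))  # first conversation: nothing cached yet
    print(cache.process(system_prompt + [20]))      # second conversation reuses the shared prefix
```

In this toy run the second conversation only has to prefill the tokens that follow the shared system prompt, which is the kind of saving the caching changes are aimed at.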

Limitations and Details

The main limitation is simple: you need a Mac with at least 32 GB of unified memory. For local AI, this is critical because on Apple Silicon memory is shared between the CPU, GPU and other accelerators, and large models quickly consume the available capacity. In other words, the news applies not to every M5 Mac but only to the more expensive configurations, where there is enough memory for the model itself, its cache and the rest of the workload.
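A back-of-the-envelope estimate shows why the memory floor is plausible. The sketch below assumes that "35B" in the name of the Qwen3.5-35B-A3B model used in Ollama's test means roughly 35 billion total parameters, and it counts only the raw weights; quantization metadata, the cache, the operating system and other applications all need memory on top of that.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # Assumption: about 35 billion total parameters, read off the model name.
    for bits in (16, 8, 4):
        print(f"{bits:>2}-bit weights: ~{weights_gb(35, bits):.1f} GB")
```

At 4 bits per weight the raw weights alone come to about 17.5 GB, which leaves limited headroom on a 32 GB machine and would not fit comfortably into smaller configurations.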

There is a second limitation: for now this is a preview with a fairly narrow initial scope. In the announcement, Ollama specifically notes that the release first accelerates the new Qwen3.5-35B-A3B model, tuned for coding tasks. Support for other architectures and easier import of custom models are still in progress. In other words, this is not an instant speedup of "everything at once" but a first step toward deeper optimization of local AI on new Macs.

Also worth noting is support for NVFP4, NVIDIA's 4-bit floating-point format, along with improvements that bring local execution closer to production environments. NVFP4 reduces memory and bandwidth requirements without significant quality loss, so users can get results closer to what modern inference providers offer. Combined with MLX, this turns Ollama from a convenient model wrapper into a more serious local platform for development and experimentation.

What This Means

For the local AI market, this is an important signal: the Mac is increasingly becoming a work machine not only for running small open-weight models but also for full-fledged agent scenarios. For developers and advanced users, the advantage is clear: lower latency, more privacy and less dependence on the cloud. But this will not go mainstream yet: the cost of entry remains high because of the Apple M5 and 32 GB memory requirements.

