Open WebUI and llama.cpp: how to set up local web search with Qwen and Docker on Windows
AI-processed from Habr AI; edited by Hamidun News
Open WebUI can work with local models via llama.cpp and even perform web search, but without manual configuration the results are often poor. The author of a practical guide showed how to assemble a working setup on Windows with Qwen, Docker and separate models for embedding and reranking, so that search starts returning meaningful answers.
How the Setup Works
The basic idea is simple: llama.cpp runs a local server with several GGUF models, and Open WebUI serves as the interface on top of it. For main generation, the author uses a single large model, Qwen3.5-27B-UD-Q4_K_XL, in two variants: with and without reasoning. Three more models are configured separately: a compact Qwen3.5-2B for auxiliary tasks, Qwen3-Embedding-4B for vectorization, and Qwen3-Reranker-4B for reranking search results.
This arrangement is necessary because web search in Open WebUI is essentially built on a RAG pipeline. With the default settings, the system does find pages, but it extracts their meaning poorly, loses relevant text chunks, and returns weak answers. According to the author, this is exactly what made search almost useless in the default configuration.
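To make the two-stage idea concrete, here is a minimal sketch of retrieve-then-rerank. The `embed` and `rerank_score` functions are toy stand-ins (word counts and word overlap), not the Qwen3 embedding or reranker models, and all names are illustrative rather than taken from the guide or from Open WebUI.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_score(query: str, chunk: str) -> float:
    """Toy stand-in for a reranker: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q) if q else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 3, top_n: int = 1) -> list[str]:
    """Stage 1: shortlist top_k chunks by embedding similarity.
    Stage 2: reorder the shortlist with the reranker and keep top_n."""
    q_vec = embed(query)
    shortlist = sorted(chunks, key=lambda c: cosine(q_vec, embed(c)), reverse=True)[:top_k]
    return sorted(shortlist, key=lambda c: rerank_score(query, c), reverse=True)[:top_n]


chunks = [
    "llama.cpp serves GGUF models over a local HTTP server",
    "Open WebUI is a browser interface for local models",
    "The weather today is sunny with light wind",
]
best = retrieve("which server serves GGUF models", chunks)
print(best)
```

In a real setup each stage is a separate model call, which is why a weak embedding or a missing reranker degrades the final answer even when the chat model is strong.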
The guide is oriented toward fairly specific hardware: Windows 10 22H2, RTX 3090 with 24 GB VRAM and 32 GB of RAM. Even on such a machine, the large model doesn't load instantly: the first response can take up to a minute, and any extra parallel loading quickly consumes video memory.
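A rough back-of-the-envelope estimate shows why memory runs out so quickly. The sketch below assumes nominal parameter counts taken from the model names, 16 bits per weight for the f16 and BF16 files, and about 4.8 bits per weight for the Q4_K-class quant; that last figure is an assumption, not a number from the guide, and the estimate ignores the KV cache and runtime overhead.

```python
VRAM_GB = 24  # RTX 3090, as stated in the guide

# (nominal parameters in billions, assumed bits per weight)
MODELS = {
    "Qwen3.5-27B-UD-Q4_K_XL": (27, 4.8),  # 4.8 bits/weight is an assumption
    "Qwen3.5-2B-BF16": (2, 16),
    "Qwen3-Embedding-4B-f16": (4, 16),
    "Qwen3-Reranker-4B-f16": (4, 16),
}


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters * bits / 8."""
    return params_billion * bits_per_weight / 8


sizes = {name: weights_gb(*spec) for name, spec in MODELS.items()}

# Models that web search would need at the same time if nothing were unloaded.
search_stack = (
    sizes["Qwen3.5-27B-UD-Q4_K_XL"]
    + sizes["Qwen3-Embedding-4B-f16"]
    + sizes["Qwen3-Reranker-4B-f16"]
)

for name, gb in sizes.items():
    print(f"{name}: ~{gb:.1f} GB")
print(f"main + embedding + reranker: ~{search_stack:.1f} GB vs {VRAM_GB} GB VRAM")
```

Under these assumptions the main model, the embedder, and the reranker cannot all sit in 24 GB at once, so they have to be swapped in and out.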
Key Settings
The most important idea in the article is that you need to configure not just the chat model but the entire infrastructure around it. The author runs the models in llama-server, and Open WebUI, running in Docker, connects to it through a local URL.
- Main model for answers: Qwen3.5-27B-UD-Q4_K_XL, set up separately in instruct and thinking modes.
- Auxiliary model: Qwen3.5-2B-BF16. It generates chat titles and performs minor background tasks faster than the large model.
- Web search uses two dedicated models: Qwen3-Embedding-4B-f16 and Qwen3-Reranker-4B-f16.
- Apache Tika is connected as the content extraction engine, and Open WebUI is run via Docker Compose.
- Brave Search is chosen as the search provider; the author also mentions Exa, Tavily, Serper, Linkup, and Valyu.
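The container side of this list can be pictured as two services. The sketch below describes them as a Python dictionary and prints it as JSON. The service names, host ports, and volume name are illustrative, the llama-server address assumes Docker Desktop's `host.docker.internal` alias and port 8080, and the environment variables reflect Open WebUI's documented settings as I understand them; the guide itself configures these values through the admin panel.

```python
import json

# Hypothetical two-service layout; names, ports, and the volume are illustrative.
compose = {
    "services": {
        "open-webui": {
            "image": "ghcr.io/open-webui/open-webui:main",
            "ports": ["3000:8080"],  # host port 3000 -> container port 8080
            "environment": {
                # llama-server runs on the Windows host, outside Docker (assumed port).
                "OPENAI_API_BASE_URL": "http://host.docker.internal:8080/v1",
                "CONTENT_EXTRACTION_ENGINE": "tika",
                "TIKA_SERVER_URL": "http://tika:9998",
            },
            "volumes": ["open-webui:/app/backend/data"],
            "depends_on": ["tika"],
        },
        "tika": {
            "image": "apache/tika:latest-full",
            "ports": ["9998:9998"],
        },
    },
    "volumes": {"open-webui": {}},
}

compose_json = json.dumps(compose, indent=2)
print(compose_json)
```

Because JSON is a subset of YAML, the printed output should be usable as a Compose file, although most setups write the same structure directly in YAML.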
The llama.cpp parameters deserve separate attention. In the config, the author disables mmap, limits the number of simultaneously loaded models with `--models-max 1`, sets models to unload from VRAM after a timeout, and divides the context between parallel requests. This isn't cosmetic: loading a second heavy model alongside the main one makes performance drop sharply.
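As an illustration of these settings, the sketch below assembles a llama-server argument list and computes how much context each request gets. Only `--models-max 1` is quoted from the guide; `--no-mmap`, `--ctx-size`, `--parallel`, and `--port` are standard llama-server options, while the port and the number of parallel slots are assumptions. Model paths and the unload timeout are left out because this digest does not show the author's config, and the even split of context across slots is a simplification.

```python
import shlex


def llama_server_args(ctx_size: int, parallel: int, port: int = 8080) -> list[str]:
    """Build a llama-server argument list reflecting the settings named in the guide."""
    return [
        "llama-server",
        "--port", str(port),          # assumed port
        "--models-max", "1",          # from the guide: one loaded model at a time
        "--no-mmap",                  # from the guide: mmap disabled
        "--ctx-size", str(ctx_size),  # total context
        "--parallel", str(parallel),  # number of request slots (assumed value)
    ]


def context_per_slot(ctx_size: int, parallel: int) -> int:
    """Context available to each request if the total is split evenly across slots."""
    return ctx_size // parallel


args = llama_server_args(ctx_size=32768, parallel=2)
print(shlex.join(args))
print("tokens per slot:", context_per_slot(32768, 2))
```

The practical consequence is that raising the number of parallel slots without raising the total context shrinks what each chat can hold.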
The Open WebUI interface also has several settings that are easy to miss. In the admin panel, you need to explicitly specify the URL of the local model server, then select the small model for both local and external interface tasks, and in the Documents section connect Tika, the embedding model, and an external rerank endpoint. Only after this does web search work as a complete chain rather than a thin wrapper around a browser query.
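Before entering these URLs in the admin panel, it can help to see what the embedding and rerank endpoints expect. The sketch below builds OpenAI-style request bodies for llama-server's `/v1/embeddings` and `/v1/rerank` routes. The base URL (Docker Desktop's `host.docker.internal` alias, port 8080) and the use of file names as model identifiers are assumptions, and `post_json` only works against a running server.

```python
import json
import urllib.request

# Assumed address: llama-server on the Windows host, seen from inside a container.
BASE_URL = "http://host.docker.internal:8080/v1"


def embeddings_request(model: str, texts: list[str]) -> tuple[str, dict]:
    """URL and JSON body for an OpenAI-style embeddings call."""
    return f"{BASE_URL}/embeddings", {"model": model, "input": texts}


def rerank_request(model: str, query: str, documents: list[str], top_n: int) -> tuple[str, dict]:
    """URL and JSON body for a rerank call."""
    body = {"model": model, "query": query, "documents": documents, "top_n": top_n}
    return f"{BASE_URL}/rerank", body


def post_json(url: str, body: dict, timeout: float = 120.0) -> dict:
    """Send the body as JSON and decode the JSON reply (needs a running server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Model identifier below is an assumption based on the file name in the guide.
url, body = rerank_request("Qwen3-Reranker-4B-f16", "local web search", ["doc a", "doc b"], top_n=1)
print(url)
print(json.dumps(body))
```

Calling `post_json(url, body)` by hand is a quick way to confirm that an endpoint answers before blaming Open WebUI for poor search results.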
Limitations and Nuances
Even after configuration, this setup does not turn the local stack into a replacement for paid cloud services. The author explicitly states that the result still falls short of proprietary solutions, but it can be sufficient when privacy, offline operation, or the ability to run unrestricted models without external APIs matters.
There are also practical trade-offs. If you assign a reasoning model to auxiliary interface tasks, Open WebUI spends too much time even on generating a chat title. With web search, the embedding and reranking models are loaded as well, so VRAM load and latency grow even more. To see what is currently loaded, the author recommends looking at the llama-server logs.
The article also describes several extensions for those who want to go further. Instead of Tika, you can try Docling for more complex documents. SearXNG can serve as a local search engine, although its backends can be temporarily banned. The Playwright-based web loader worked for the author but was too slow: latency was measured in minutes, not seconds. Later he also reduced the context of the main model to 32768 tokens. This eliminated hangs when the chat tab was open, because the model stopped occupying more VRAM than the system could stably sustain.
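The link between context length and VRAM can be sketched with the usual KV-cache estimate for standard transformer attention. The layer count, KV head count, and head dimension below are hypothetical, not the real Qwen3.5-27B configuration, and 65536 is an arbitrary comparison point, since the digest does not say what the context was before the reduction. Models with hybrid or linear-attention layers will deviate from this formula.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Keys plus values for every attention layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical architecture numbers, NOT the real Qwen3.5-27B configuration.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)

small = kv_cache_bytes(32768, **dims)
large = kv_cache_bytes(65536, **dims)  # arbitrary larger context for comparison

print(f"32768 tokens: {small / 2**30:.1f} GiB")
print(f"65536 tokens: {large / 2**30:.1f} GiB")
```

Whatever the real dimensions are, the cache grows linearly with context, so halving the context halves this part of the memory footprint.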
What This Means
The material shows clearly that a local AI stack can already be assembled from open components today, but its quality depends less on a single "best model" than on the right combination of generation, embedding, reranking, and content extraction. For developers, this is a practical recipe: if Open WebUI with web search works poorly for you, start not by replacing the model but by rebuilding the entire RAG configuration.