Ollama and LiteLLM: Habr shows how to run a local LLM chat on Python without cloud
A clear starter guide on local LLM development in Python has emerged. It walks through step-by-step installation of Ollama, running the qwen2.5 model…
AI-processed from Habr AI; edited by Hamidun News
On Habr, a detailed introductory guide to local LLM development on Python was published. The author suggests starting not with cloud APIs, but with the Ollama and LiteLLM combination: install a model on your own computer, set up the environment, and get your first response straight from main.py.
Why Local
Most starting materials on LLMs lead a beginner to the cloud on the very first step: sign up, get an API key, attach a payment card, monitor limits. For a developer who just wants to understand the basic mechanics, this is unnecessary noise. The new guide offers a different route: first set everything up locally so you can see the model's logic without billing, external services, and fear of accidentally spending money on tests.
This approach is also good because it makes the entire request flow transparent. In the article, they literally break down the chain link by link: Python code sends a message to LiteLLM, which passes it to Ollama, and Ollama talks to the local model and returns the response back to the program. This breakdown is useful not only for beginners.
It helps quickly figure out where to look for a problem if the model doesn't respond, the service isn't running, or the code is pointing to the wrong address.
"This isn't 'AI magic,' but a regular software flow."
What's in the Stack
The author immediately divides the roles of the tools, because they're easy to confuse. Ollama is responsible for running the local model and accessing it through a local server. LiteLLM is a Python library with a unified interface for calling models. Because of this, code that today works with a local model can later be relatively easily transferred to a cloud provider without rewriting the application from scratch. For a first introduction, this is a practical compromise between simplicity and future-proofing.
The first part of the series is structured as a route without unnecessary theory. Readers aren't asked right away to design agents, connect memory, or build a complex interface. The task is simpler and more useful: make sure the local model works at all, that Python can reach it, and that the response comes back to the code without external infrastructure. Because of this, the material reads like a working checklist for a first evening, not an abstract overview of technologies.
- install Ollama for Windows, macOS, or Linux;
- download the qwen2.5:3b model and check the response right in the terminal;
- if your hardware is weak, switch to qwen2.5:1.5b;
- create a Python virtual environment and install LiteLLM;
- write a minimal main.py that sends a request to http://localhost:11434.
A separate plus is the choice of model for starting. qwen2.5:3b is presented as a compact and sufficiently convenient option for a regular laptop, especially if you need Russian language support. If resources are limited, the author immediately provides a backup scenario with a lighter version. This makes the material not abstract, but grounded: the article doesn't promise miracles, but helps you actually reach the first working response without lengthy config fiddling right from the start.
First Call from Python
The key moment in the text is a minimal Python example. It imports the completion function from LiteLLM, specifies the model in the format ollama_chat/qwen2.5:3b, indicates the local api_base, and passes the user's question to the messages list.
This is an important detail: even a single request is formatted in the same structure as a future dialogue. Essentially, the author doesn't just show a one-off call, but immediately lays the foundation for a console chat with message history and context. It's also useful that the article doesn't end on the happy path.
At the end, typical failures are analyzed: Connection refused if Ollama isn't running; Model not found if the model name in the code doesn't match the installed one; a very long first response due to model loading into memory; ModuleNotFoundError if the package was installed in the wrong environment; encoding issues in PowerShell. For a beginner developer, such a section is often more valuable than theory, because it's these small details that most early experiments break on. The author has already outlined the continuation of the series: in the second part, they'll build a small console chat from a single request, then add message history and context.
That is, it's not a scattered snippet, but a careful entrance into a longer route — from local model execution to a full-fledged application. This format is especially useful for those who want not just to run a demo, but to gradually turn an LLM into a part of an ordinary Python project.
What This Means
Interest in local models is growing again, and such materials lower the barrier to entry better than any general overview. The Ollama and LiteLLM combination shows that a first working prototype can be assembled without the cloud and API keys, and then when desired, the same architecture can be scaled further. For Russian-speaking developers, this is a good bridge between curiosity about LLMs and real code. It's exactly these kinds of instructions that most often turn interest into practice.
Want to stop reading about AI and start using it?
AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.