Habr AI→ original

Ollama Cloud compared in a code review: DeepSeek v3.1 proved stronger than Qwen and GPT-OSS

Can a full-fledged code review be entrusted to an LLM? In a practical test via Ollama Cloud, three models — Qwen 3.5, GPT-OSS, and DeepSeek v3.1 — reviewed…

AI-processed from Habr AI; edited by Hamidun News
Ollama Cloud compared in a code review: DeepSeek v3.1 proved stronger than Qwen and GPT-OSS
Source: Habr AI. Collage: Hamidun News.
◐ Listen to article

A practical test showed that cloud models through Ollama are already capable of handling some code review tasks on real Pull Requests, not just on demonstration examples. In the comparison of Qwen 3.5, GPT-OSS, and DeepSeek v3.1, DeepSeek demonstrated the best analysis depth and most applicable recommendations, though there was an important caveat regarding configuration.

How the test was conducted

The article author tested the models not on abstract tasks, but on a Pull Request from his own legacy Python project that is about four years old. For each model, a separate PR was prepared, but conditions remained the same: the same prompt, identical access to project context, and RAG enabled, so the system could pull in additional files and not be limited to just the diff. This approach is important because lack of context often makes AI reviews superficial.

Configuration was also aligned as much as possible: temperature 0.2, limit of 4000 tokens, high level of comment criticality, enabled security, performance and style problem detection, as well as the ability to suggest fixes. The models analyzed not only the diff, but also related code context.

The test included Qwen 3.5, GPT-OSS, and DeepSeek v3.1 — three notable open-weight models that are often considered as alternatives to SaaS tools for developers.

Models were evaluated on a five-point scale.

  • accuracy of finding real problems in the code
  • understanding of security risks
  • tendency to hallucinate
  • depth of analysis and understanding of consequences of changes
  • practical usefulness of proposed fixes

The author separately looked at human acceptance rate — how likely it is that developers will actually accept the model's comments rather than ignore them as noise.

Results by model Qwen 3.5 was a pleasant surprise.

It received a final score of 3.8 and showed a confident balance between accuracy, low levels of hallucinations, and practical advice. According to the author's assessment, the model well-attached comments to specific lines, often suggested real fix options, and overall behaved like a useful first reviewer.

Weak point — limited depth of architectural analysis and not very active use of available tools for additional context. GPT-OSS, on the other hand, performed noticeably worse and scored 2.9.

The main complaint — comments that were too generic. The model found some real problems, but was worse at linking comments to specific PR changes, less often suggested applicable auto-fixes, and more often made assumptions without sufficient basis. A plus was the clear style of responses, but for practical code review, this proved insufficient: developers need not neat formulations, but precise and useful comments.

DeepSeek v3.1 showed the strongest technical result. Without penalty, its final score was 4.

25: the model better explained the reasons for problems, more often noticed security risks, offered engineering-correct fixes, and more deeply analyzed the consequences of changes. Formally, the author lowered the score to 3.25 because the model could not use the tool without enabled think mode.

But even with this caveat, DeepSeek is named the deepest and most practical option among those tested.

"Cloud models through

Ollama can really be used for code review tasks".

Where

Ollama is appropriate The main conclusion of the article is not that Ollama automatically replaces specialized services like CodeRabbit, Claude Review, or QoDo. Rather the opposite: the quality of AI reviews strongly depends on the chosen model, settings, and how much context was provided to it. If you pick an unsuccessful model or limit it to just the diff without access to project files, the result quickly turns into a set of superficial comments.

However, Ollama has a strong use case where control and flexibility matter to the team. The author particularly emphasizes that this approach is especially interesting for projects with sensitive code, NDA restrictions, and a desire not to send source code to external infrastructure. Plus, the platform allows quick switching between models, building custom pipelines on top of the API, and, if necessary, switching to local execution instead of the cloud.

If the team has no strict privacy requirements and budget is not critical, ready-made SaaS solutions can still provide more stable out-of-the-box results. They have stronger workflow integration, more ready-made automation, and less manual configuration. The experiment rather shows that open models are catching up to this product class faster than many expected.

What this means

For development teams, this is a signal that AI code review can already be used not as a toy, but as a working layer of preliminary Pull Request checking. It does not replace human review, but with the right model, good context, and access to tools, it is capable of reducing some routine work, finding real problems, and suggesting fixes before the PR reaches a colleague.

ZK
Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.

Want to stop reading about AI and start using it?

AI News is a curated feed of AI/tech news. Hamidun Academy teaches you to use AI systematically in your work.

What do you think?
Loading comments…