
Lemana Tech showed how it combined LLM, RAG, and traditional ML in tech support


AI-processed from Habr AI; edited by Hamidun News

Lemana Tech shared how it restructured its Service Desk automation after a surge in request volume. Rather than replacing all of support with a single large model, the company assembled a hybrid scheme: high-volume classification stays with classical ML, while an LLM with retrieval-augmented generation (RAG) is brought in only where it actually adds value.

Why Classical ML Alone Wasn't Enough

Lemana Tech's ecosystem includes more than 500 business systems, 2,500 service operations, and around 100,000 support requests per month. At that load, model quality matters, but so do the cost of errors, response speed, and compute cost. The baseline stack, boosting over TF-IDF features, worked well for a long time: with additional features such as job title, workplace, and request time, it reached an F1 of about 0.86 and covered a large share of typical routes. As the number of scenarios grew, however, this was no longer enough.
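To make the baseline concrete, here is a minimal pure-Python sketch of how ticket text and employee context could be merged into one sparse feature vector before it reaches a boosting model. The field names (`job_title`, `workplace`, `hour`) and the cyclic encoding of the request hour are assumptions for illustration, not details from the article; in practice this role is usually played by scikit-learn's `TfidfVectorizer` plus a one-hot encoder.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase whitespace tokenization; a production system would use a real tokenizer
    return text.lower().split()


def fit_idf(corpus: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency for every term seen in the corpus."""
    n_docs = len(corpus)
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    # Same smoothing as scikit-learn's default: idf = ln((1 + n) / (1 + df)) + 1
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def featurize(ticket: dict, idf: dict[str, float]) -> dict[str, float]:
    """Sparse feature dict: TF-IDF text features plus tabular context (hypothetical fields)."""
    counts = Counter(tokenize(ticket["text"]))
    # Terms never seen during fitting are dropped, as a fitted vectorizer would do
    text = {f"tfidf:{t}": c * idf[t] for t, c in counts.items() if t in idf}
    # L2-normalize the text part so long tickets do not dominate
    norm = math.sqrt(sum(v * v for v in text.values())) or 1.0
    features = {k: v / norm for k, v in text.items()}
    # One-hot categorical context
    features[f"job_title={ticket['job_title']}"] = 1.0
    features[f"workplace={ticket['workplace']}"] = 1.0
    # Request hour encoded on a circle so 23:00 and 00:00 end up close together
    angle = 2 * math.pi * ticket["hour"] / 24
    features["hour_sin"] = math.sin(angle)
    features["hour_cos"] = math.cos(angle)
    return features


if __name__ == "__main__":
    idf = fit_idf(["printer does not print", "vpn does not connect", "reset my password"])
    ticket = {"text": "VPN does not connect from home", "job_title": "cashier",
              "workplace": "store_12", "hour": 9}
    print(featurize(ticket, idf))
```

The resulting dicts can be fed to any learner that accepts sparse input; keeping text and context in one vector is what lets the model route the same wording differently for, say, a cashier and a warehouse worker.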

The team tested LSTM, GRU, BERT, RoBERTa, Electra, Yandex Foundation Models, and LoRA adapters for open LLMs. Some approaches scored below boosting; others proved too expensive to train. The best classification result ultimately came not from a "pure" LLM approach but from a transformer combined with additional tabular features and additive attention: this design raised macro F1 to 0.89 and better captured the context of each individual employee.
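The article does not describe the exact architecture, so the following is only a hypothetical sketch of one way additive (Bahdanau-style) attention can fuse tabular features with transformer output: the tabular vector acts as the query that decides which token embeddings matter, and the pooled text vector is then concatenated with the tabular features for the classifier head. The weight matrices `W`, `U` and vector `v` would be learned; here they are hand-set.

```python
import math

Vector = list[float]
Matrix = list[list[float]]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]


def softmax(scores: Vector) -> Vector:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention_pool(tokens: list[Vector], tabular: Vector,
                            W: Matrix, U: Matrix, v: Vector) -> tuple[Vector, Vector]:
    """Additive attention with the tabular vector q as the query:
    e_i = v . tanh(W h_i + U q), alpha = softmax(e), context = sum_i alpha_i * h_i.
    Returns (context + tabular features, attention weights)."""
    uq = matvec(U, tabular)
    scores = []
    for h in tokens:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W, h), uq)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(tokens[0])
    context = [sum(w * h[j] for w, h in zip(weights, tokens)) for j in range(dim)]
    # Concatenate pooled text with the raw tabular features for the classifier head
    return context + tabular, weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy token embeddings, d = 2
    W = [[1.0, 0.0], [0.0, 1.0]]
    U = [[0.5], [-0.5]]
    v = [1.0, 1.0]
    for q in ([1.0], [-1.0]):  # two different "employee contexts"
        _, weights = additive_attention_pool(tokens, q, W, U, v)
        print(q, [round(w, 3) for w in weights])
```

Running it with two different tabular vectors shows the attention shifting to different tokens, which is the intuition behind "better accounting for the context of a specific employee".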

Where RAG Is Enabled

The LLM in this architecture doesn't attempt to solve everything. It's only enabled for request classes where users need a meaningful answer from internal documentation, not just correct ticket routing. One example is support for the MLOps platform, where employees need answers about Kubeflow, Jenkins, and internal pipelines.

Here a request arrives in the chat, passes through the classifier, and enters a RAG pipeline built on Qwen2.5 8B with a custom embedder. If the answer is in the knowledge base, the user receives it in roughly 60 seconds.
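The flow can be sketched as a class-gated pipeline: only requests whose predicted class is on an allow-list go through retrieval and generation, and everything else stays on ordinary ticket routing. This is a minimal illustration under stated assumptions: the class name `mlops_platform`, the three knowledge-base snippets, and the prompt wording are invented, a bag-of-words cosine similarity stands in for the custom embedder, and `generate` is a hypothetical callable wrapping whatever serves the LLM.

```python
import math
from collections import Counter
from typing import Callable

# Hypothetical allow-list of request classes where the RAG answer path is enabled
RAG_CLASSES = {"mlops_platform"}

# Made-up knowledge-base snippets for illustration only
KNOWLEDGE_BASE = [
    "Kubeflow pipelines are launched from the team namespace in the MLOps portal",
    "Jenkins jobs for model builds are triggered by a merge to the main branch",
    "Access to the GPU cluster is requested through the infrastructure form",
]


def embed(text: str) -> Counter:
    # Toy bag-of-words vector standing in for the custom embedder
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 3) -> list[str]:
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, passages: list[str]) -> str:
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer using only the context below. If the context is not enough, say so.\n"
            f"Context:\n{context}\n\nQuestion: {question}")


def handle(request_text: str, predicted_class: str, generate: Callable[[str], str]) -> dict:
    # Classes outside the allow-list are routed as ordinary tickets, no LLM call
    if predicted_class not in RAG_CLASSES:
        return {"route": "ticket_queue", "class": predicted_class}
    passages = retrieve(request_text, k=3)
    return {"route": "rag_answer", "answer": generate(build_prompt(request_text, passages))}
```

Injecting `generate` as a plain function keeps the routing logic testable without a running model and makes it easy to swap the serving backend.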

If the model is not confident in its result, or the user issues the command to switch to a specialist, the ticket goes straight to a live expert instead of waiting in the standard SLA queue. This is an important point: the LLM does not put an extra barrier between the user and a human, but acts as a fast first layer that saves the time of expensive L4 specialists while keeping quality under control.
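The escalation rule reduces to a small decision function. Lemana Tech reports an escalation threshold of 0.7 confidence; how that score is computed is not described, so this sketch simply assumes the pipeline returns one in [0, 1]. The handoff command names and the return labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.7  # reported escalation threshold
HANDOFF_COMMANDS = {"/human", "/operator"}  # hypothetical command names


@dataclass
class BotReply:
    answer: Optional[str]  # None when nothing relevant was found
    confidence: float      # assumed to be in [0, 1]


def next_step(user_message: str, reply: BotReply) -> str:
    # An explicit request for a person always wins, regardless of model confidence
    if user_message.strip().lower() in HANDOFF_COMMANDS:
        return "escalate_priority"
    # No answer or low confidence: hand off to a live expert, bypassing the normal queue
    if reply.answer is None or reply.confidence < CONFIDENCE_THRESHOLD:
        return "escalate_priority"
    return "send_answer"
```

Checking the user's command first is the design choice that keeps the bot from ever standing between the user and a human.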

  • Qwen2.5 8B runs as a quantized build on CPU
  • The custom embedder was trained on 10,000 triplets
  • Knowledge base search reached 92% Hit@3
  • Escalation is triggered when the confidence score falls below 0.7
  • The user can switch to a human instantly
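The two embedder figures above correspond to a training objective and an evaluation metric. The article gives neither the loss function nor the distance, so this sketch assumes the common triplet margin loss with Euclidean distance and a margin of 0.2, and assumes each evaluation query has exactly one relevant document. Hit@3 is then the share of queries whose relevant document appears among the top three retrieved results.

```python
import math


def euclidean(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: list[float], positive: list[float],
                 negative: list[float], margin: float = 0.2) -> float:
    # Zero once the positive is closer to the anchor than the negative by at least `margin`
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def hit_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int = 3) -> float:
    """Share of queries whose single relevant document is in the top-k results."""
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids) if rel in ranked[:k])
    return hits / len(relevant_ids)
```

Each of the 10,000 triplets would be (question, matching article, non-matching article); minimizing the loss pulls questions toward their articles, which is what Hit@3 then measures.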

What Worked Best

A separate part of the case is auto-resolution. The team identified recurring request patterns that could be closed without involving support, but did not blindly automate every frequent response. To filter them, they used Qwen2.5 14B: the model judged whether the user could realistically solve the problem on their own by following instructions, or whether nothing would work without a specialist. This weeded out false patterns such as password resets, where the reply email is standard but the action itself still has to be performed by a specialist.
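The filtering step is an LLM-as-judge pass over candidate patterns. In this sketch the prompt wording, the YES/NO answer format, and the fail-closed parsing are assumptions, and `ask_llm` is a hypothetical callable that would wrap the Qwen2.5 14B endpoint; a fake judge is used in the usage example so the logic can be run without a model.

```python
from typing import Callable

JUDGE_PROMPT = (
    "A support request pattern and its standard reply are given below.\n"
    "Can the employee fully solve the problem on their own by following the reply, "
    "without any action from a support specialist? Answer YES or NO.\n\n"
    "Request: {request}\nStandard reply: {reply}\nAnswer:"
)


def is_self_service(pattern: dict, ask_llm: Callable[[str], str]) -> bool:
    prompt = JUDGE_PROMPT.format(request=pattern["request"], reply=pattern["reply"])
    verdict = ask_llm(prompt).strip().upper()
    # Anything other than a clear YES counts as "needs a specialist" (fail closed)
    return verdict.startswith("YES")


def filter_candidates(patterns: list[dict], ask_llm: Callable[[str], str]) -> list[dict]:
    return [p for p in patterns if is_self_service(p, ask_llm)]


if __name__ == "__main__":
    def fake_judge(prompt: str) -> str:
        # Stand-in for the real model: rejects anything mentioning a password
        return "NO" if "password" in prompt.lower() else "YES"

    patterns = [
        {"request": "How do I connect a network printer?",
         "reply": "Open Settings > Printers and add printer PRN-01."},
        {"request": "Reset my password",
         "reply": "Your request has been received; a specialist will reset the password."},
    ]
    print([p["request"] for p in filter_candidates(patterns, fake_judge)])
```

Failing closed means an ambiguous or malformed model answer keeps the pattern out of automation, which matches the goal of not automating responses that still need a human action.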

"Using LLM everywhere, as is fashionable now, isn't the right approach."

After this filtering, what runs in production is again not an LLM but a lightweight model: logistic regression. It trains quickly, costs almost nothing at inference, and can handle the request stream continuously. The results: Lemana Tech reports that the share of automatically classified requests grew from 55% to 76%, classification accuracy reached 92% with thresholds taken into account, and successful auto-resolutions and bot responses became 20 times faster. The LLM did not replace classical ML here; it took a narrow but valuable place in the chain.
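The article does not define how "accuracy with thresholds taken into account" is computed. One common reading, assumed here, is that only predictions whose confidence clears a threshold are routed automatically: coverage is the share of requests handled that way, and accuracy is measured on those requests alone. The sample predictions below are made up.

```python
def threshold_report(predictions: list[tuple[str, float, str]],
                     threshold: float) -> tuple[float, float]:
    """predictions: (predicted_label, confidence, true_label) triples.
    Returns (coverage, accuracy on auto-routed requests)."""
    auto = [(pred, true) for pred, conf, true in predictions if conf >= threshold]
    coverage = len(auto) / len(predictions)
    accuracy = sum(pred == true for pred, true in auto) / len(auto) if auto else 0.0
    return coverage, accuracy


if __name__ == "__main__":
    sample = [
        ("vpn", 0.95, "vpn"),
        ("printer", 0.90, "printer"),
        ("access", 0.80, "vpn"),
        ("vpn", 0.40, "printer"),
        ("printer", 0.30, "printer"),
    ]
    for t in (0.5, 0.85):
        print(t, threshold_report(sample, t))
```

Raising the threshold trades coverage for accuracy, so both numbers need to be reported together, as Lemana Tech does with its automation share and thresholded accuracy.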

What It Means

The Lemana Tech case is a good illustration of today's more mature approach to deploying generative AI in support: expensive LLMs do not have to be the core of the whole system. Often the best result comes from a hybrid in which classical ML quickly sorts the stream, RAG answers only in complex domain areas, and humans step in without friction when model confidence is insufficient. For corporate teams, this is probably a more realistic path than trying to move the entire Service Desk onto one universal model.

Hamidun News
AI news without noise. Daily editorial selection from 400+ sources. A product by Zhemal Khamidun, Head of AI at Alpina Digital.
