How BERTopic with a local LLM helps Rostelecom analyze large text collections
An NLP developer at Rostelecom presented a BERTopic pipeline for fully automating the analysis of large text collections: reviews, support requests, and comments.

Analyzing text at scale is one of the most labour-intensive and underestimated tasks in NLP. When a company receives tens of thousands of reviews, support requests, or comments per day, manual categorization becomes not just impossible, but also pointless from an ROI perspective. Anton, an NLP engineer at Rostelecom, proposed a solution: a pipeline based on BERTopic with an integrated local LLM for obtaining interpretable topic names.
Why text automation is needed
Large bodies of text are a goldmine of unstructured data for any company. Reviews hide complaints about specific bugs and shortcomings, support requests reveal systemic problems and weak points, and social media comments contain ideas for new features and products. But even a small team of analysts cannot sort through all of this manually in a day, or even a week.
The classical approach is to read each text manually, understand its essence, and sort it into categories. At volumes of 10,000 texts or more, this becomes economically unviable, painful for the specialist, and slow. Subjectivity compounds the problem: one analyst files a complaint under "network problems", another under "service quality", a third under "other". Consistency is lost and conclusions become unreliable.
How BERTopic solves the clustering problem
BERTopic is a framework that combines several machine learning techniques for automatic topic discovery in texts. The process works like this:
- Embeddings (BERT): each text is transformed into a vector of numbers (an embedding), where semantically similar texts lie nearby in multidimensional space. For Russian text, you can use ruBERT or other models.
- Clustering (HDBSCAN): a fast algorithm finds natural clusters of texts in this space without needing to know the number of topics in advance.
- Interpretation: BERTopic generates a name for each cluster based on TF-IDF — the most significant words in the group.
The result? From the chaos of 50,000 texts, you get, for example, 15 clear and natural topics: "internet problems", "billing questions", "bugs in the mobile app", "requests for benefits", and so on. However, there's a catch. The standard BERTopic interpretation often produces strange names like "subscriber_service_number" or "bug_bug_error", which are difficult to explain to business. This is where the language model comes in.
Integrating a local LLM for interpretation
Instead of mechanically selecting words from the cluster, a local language model (such as Mistral 7B or Llama 2) reads the top words and top documents of the cluster, and then generates a full description in Russian: "Clients complain about slow internet speed in rural areas, especially on weekends".
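In essence, the interpretation step feeds the cluster's top words and a few representative documents to the model as a prompt. The helper below is a hypothetical sketch of only the prompt-assembly part — the article does not show the actual prompt, and the model call itself (e.g. via llama.cpp or transformers) is deliberately left out:

```python
def build_topic_prompt(top_words, sample_docs, max_docs=3):
    """Assemble a topic-labeling prompt for a local LLM (e.g. Mistral 7B).

    Hypothetical helper: illustrates the idea of turning cluster
    keywords and sample documents into an instruction for the model.
    """
    keywords = ", ".join(top_words)
    examples = "\n".join(f"- {doc}" for doc in sample_docs[:max_docs])
    return (
        "You label topics found in customer feedback.\n"
        f"Top keywords of the cluster: {keywords}\n"
        f"Representative documents:\n{examples}\n"
        "Describe this topic in one short, business-friendly sentence."
    )

prompt = build_topic_prompt(
    ["internet", "slow", "speed"],
    ["internet speed is very slow tonight", "slow internet after the update"],
)
print(prompt)
```

BERTopic also ships hooks for plugging a text-generation model into exactly this step, so in practice the prompt construction and the model call can live inside the topic model itself rather than in a separate loop.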
"Integrating a local LLM protects data confidentiality: all data stays within the company and is never sent to OpenAI, the Claude API, or other cloud services. This is critical for companies working with sensitive information," Anton emphasizes.
Moreover, the local model responds faster than API requests and is completely independent of quotas, rate limits, and per-token costs. The pipeline runs without an internet connection, which reduces latency and increases system reliability.
Practical results and scaling
The Rostelecom pipeline does in a few hours what previously took several weeks of manual labour:
1. Load a set of texts into BERTopic (thousands or tens of thousands of records)
2. Get ready-made clusters with LLM-generated topic descriptions in language the business understands
3. Export the results to Excel, CSV, or a database for further work by analysts and product managers
Plus the ability to reuse: a new batch arrived in support? The pipeline retrains in minutes and again outputs a structured result.
What this means for the industry
NLP tools are actively moving out of the laboratory and scientific papers into real production. When a single engineer can assemble in a day a fully functional pipeline that previously required two to three weeks of manual labour and the expertise of an entire team, NLP has become a practical tool rather than a scientific experiment accessible only to large IT companies.