
Text clustering without labels: LLM embeddings and HDBSCAN from Machine Learning Mastery


AI-processed from Machine Learning Mastery; edited by Hamidun News

LLM embeddings extend language models to unstructured-text tasks well beyond chat interfaces. Machine Learning Mastery has published a practical guide on combining vector representations from language models with the HDBSCAN algorithm to find thematic groups in text datasets automatically, without manual annotation or prior knowledge of the data's structure.

Why embeddings change the rules of the game

Language models can transform text into high-dimensional vectors: numerical representations in which semantically similar fragments end up geometrically close. "Customer dissatisfied with price" and "too expensive for me" will be neighbors in this space, while "delivery problem" ends up in a completely different part of it. This is what makes embeddings well-suited input for clustering: texts are grouped by meaning rather than by keyword matches, so synonym dictionaries and hand-written rules are no longer needed.
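A minimal sketch of what "geometrically close" means in practice, using cosine similarity. The 4-dimensional vectors below are invented for illustration and are not output from any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1.0 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", made up for this example only.
toy_embeddings = {
    "Customer dissatisfied with price": [0.9, 0.1, 0.0, 0.2],
    "too expensive for me": [0.8, 0.2, 0.1, 0.1],
    "delivery problem": [0.1, 0.9, 0.3, 0.0],
}

price = toy_embeddings["Customer dissatisfied with price"]
expensive = toy_embeddings["too expensive for me"]
delivery = toy_embeddings["delivery problem"]

# The two price-related texts should score higher than the unrelated pair.
print("price vs expensive:", cosine_similarity(price, expensive))
print("price vs delivery: ", cosine_similarity(price, delivery))
```

With vectors from an actual model, the same comparison is what lets a clustering algorithm treat the two price complaints as neighbors.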

Popular embedding models include OpenAI `text-embedding-3-small`, Cohere Embed v3, and open-source sentence-transformers models, which run locally without API costs. A typical embedding has from several hundred to a few thousand dimensions (for example, 384 for compact sentence-transformers models and up to 3072 for OpenAI's largest embedding model), which is usually too many for clustering directly. UMAP is therefore typically applied before HDBSCAN to compress the space to 5–50 dimensions. Without this step, the algorithm runs into the "curse of dimensionality": in high-dimensional space, all points look roughly equally distant from each other, so density differences blur and clusters are hard to detect.
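The distance-concentration effect behind the "curse of dimensionality" can be illustrated with a small experiment. This is only a sketch on uniformly random points, which is an assumption made for simplicity: real embeddings are structured, so the effect there is weaker, but the trend is the same. The function name `relative_contrast` is hypothetical.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random point to the rest.

    A large value means near and far neighbors are easy to tell apart;
    a value close to 0 means all points look about equally distant.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, rest = points[0], points[1:]
    dists = [math.dist(query, p) for p in rest]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 50, 1000):
    print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.2f}")
```

Density-based methods such as HDBSCAN rely on some regions being noticeably denser than others, which is why reducing dimensionality first tends to help.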

HDBSCAN versus standard methods

Most clustering courses start with K-means. The problem is that the algorithm requires the number of clusters to be specified in advance, which is hard to do when the structure of the data is unknown. HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) takes a different approach:

  • Does not require specifying the number of clusters in advance
  • Automatically marks "noise" points — texts that don't fit into any group
  • Correctly handles clusters of different sizes and densities
  • Scales to tens of thousands of documents
  • Provides a hierarchy of clusters with adjustable granularity
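The properties listed above can be sketched with the `hdbscan` package. This is a hedged illustration rather than the guide's own code: `run_hdbscan` and `summarize_labels` are hypothetical helper names, and the default of 10 for `min_cluster_size` is an arbitrary starting point. Note that although the number of clusters is not specified, `min_cluster_size` is still a parameter that usually needs tuning.

```python
from collections import Counter


def run_hdbscan(points, min_cluster_size=10):
    """Cluster an array-like of shape (n_samples, n_features).

    Returns one integer label per point; the library uses -1 for noise.
    """
    import hdbscan  # third-party package, imported lazily: pip install hdbscan

    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(points).tolist()


def summarize_labels(labels):
    """Return (number of clusters, number of noise points, size of each cluster)."""
    sizes = Counter(label for label in labels if label != -1)
    noise = sum(1 for label in labels if label == -1)
    return len(sizes), noise, dict(sizes)


# Summarizing a hand-written label list (not real clustering output):
n_clusters, n_noise, sizes = summarize_labels([0, 0, 1, -1, 1, 1, -1])
print(n_clusters, n_noise, sizes)
```

A high share of noise points is often a hint that `min_cluster_size` is too large for the dataset or that the input dimensionality is still too high.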

The complete pipeline is: embeddings → dimensionality reduction via UMAP → HDBSCAN → cluster labels. The whole thing takes just a few dozen lines of Python using the libraries `sentence-transformers`, `umap-learn`, and `hdbscan`. To interpret the topics found, it's enough to pass a few examples from each group back to an LLM and ask it to come up with a name. The cycle closes: from language model to statistics and back.
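One way the pipeline described above might look is sketched below. The model name `all-MiniLM-L6-v2`, the parameter values, the prompt wording, and the helper names `cluster_texts`, `group_by_label`, and `build_naming_prompt` are all assumptions chosen for illustration, not the guide's actual code. The three third-party libraries are imported inside the function so the helpers can be used without them installed.

```python
from collections import defaultdict


def cluster_texts(texts, model_name="all-MiniLM-L6-v2",
                  n_components=5, min_cluster_size=10):
    """Embeddings -> UMAP -> HDBSCAN. Returns one label per text (-1 = noise)."""
    # Third-party packages: sentence-transformers, umap-learn, hdbscan.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # 1. Turn each text into a dense vector.
    embeddings = SentenceTransformer(model_name).encode(texts)
    # 2. Compress to a handful of dimensions before density-based clustering.
    reduced = umap.UMAP(n_components=n_components, metric="cosine",
                        random_state=42).fit_transform(embeddings)
    # 3. Cluster; the number of clusters is not specified anywhere.
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(reduced)
    return labels.tolist()


def group_by_label(texts, labels):
    """Collect texts per cluster label, skipping noise points (-1)."""
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        if label != -1:
            groups[label].append(text)
    return dict(groups)


def build_naming_prompt(examples, max_examples=5):
    """Build a prompt asking an LLM to name a cluster from a few examples."""
    lines = "\n".join(f"- {text}" for text in examples[:max_examples])
    return "Suggest a short topic name for these texts:\n" + lines


# Typical use (needs the third-party packages and enough texts to cluster):
#   labels = cluster_texts(my_texts)
#   for label, members in group_by_label(my_texts, labels).items():
#       prompt = build_naming_prompt(members)  # send to the LLM of your choice
```

UMAP's and HDBSCAN's defaults assume at least a few dozen documents, so very small datasets would need smaller `n_neighbors` and `min_cluster_size` values than this sketch uses.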

Application without training data

The main advantage of this combination is that it needs no annotation. There's no need to agree on categories, hire annotators, or assemble a training set. A single pipeline discovers the structure on its own.

"The current era of generative AI is focused on chat interfaces, but the capabilities of language models go far beyond that," write the authors of Machine Learning Mastery.

Typical scenarios include clustering thousands of support tickets, automatically categorizing news streams, grouping product reviews, analyzing answers to open-ended survey questions, and detecting anomalous patterns in logs. Results can appear in minutes, without prior annotation. The approach is especially valuable for rapidly changing data: new topics are discovered automatically, so there's no need to manually add classes to a classifier every time the subject area changes.

What this means

The combination of LLM embeddings with HDBSCAN is a ready-made tool for structuring large text datasets without supervision. Tasks that previously required weeks of manual work or expensive annotation can now be handled with a small script. For teams working with user feedback, media monitoring, or content analytics, this means direct resource savings and an opportunity to extract insights from data that previously sat unused.

