LLM Embeddings and HDBSCAN: A Revolution in Text Clustering

Key Takeaways

1. Large language models transform raw text into embeddings, facilitating clustering.

2. HDBSCAN, a density-based clustering method, uncovers hidden patterns in textual data.

3. The use of open-source models and modern Python libraries simplifies the clustering process.

Why it matters: This approach gives structure to unlabeled text data, opening the way to new analyses and insights.

Introduction

In the fast-moving world of generative artificial intelligence, attention often focuses on chat interfaces and prompting. However, large language models (LLMs) have much broader capabilities. One of the most useful is the ability to convert raw, often messy and unstructured text into dense numerical vectors known as embeddings, in which texts with similar meaning end up close together. These embeddings serve as the foundation for many machine learning use cases, with clustering being one of the most promising.
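
To make the idea of an embedding concrete, here is a minimal sketch using made-up three-dimensional vectors. Real embeddings have hundreds of dimensions; the values and the `cosine_similarity` helper below are illustrative assumptions, not output from any model. The point is only that related texts should map to vectors pointing in similar directions, which cosine similarity measures.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", invented for illustration only
rocket_launch = [0.9, 0.1, 0.0]
satellite_orbit = [0.8, 0.2, 0.1]
car_engine = [0.1, 0.9, 0.2]

sim_related = cosine_similarity(rocket_launch, satellite_orbit)
sim_unrelated = cosine_similarity(rocket_launch, car_engine)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

A clustering algorithm exploits exactly this geometry: documents whose vectors sit close together are candidates for the same group.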

When combined with advanced density-based clustering techniques such as HDBSCAN, embeddings allow for the discovery of hidden themes, patterns, or categories within collections of textual documents. This process does not require prior labeling, making it particularly powerful for analyzing unstructured data.

This article details the construction of a text clustering pipeline from scratch. We will use a publicly accessible dataset containing examples of text and an open-source embedding model to generate these representations. Additionally, we will leverage modern, free, and user-friendly Python libraries that offer implementations of clustering algorithms like HDBSCAN.

Steps to Follow

To get started, install the necessary Python libraries:

sentence-transformers: Loads a pre-trained embedding model from the Hugging Face Hub. The public model used in this article can be downloaded without an API key or access token; a token is only needed for gated or private models.
umap-learn: Provides UMAP, the dimensionality reduction algorithm applied to the embeddings.

If you are working in a local IDE rather than a cloud notebook environment, you may also need to install scikit-learn and pandas if you haven't already.

!pip install sentence-transformers umap-learn
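
One caveat: the `sklearn.cluster.HDBSCAN` class used later in this article was added in scikit-learn 1.3, so older installations will fail at the import. The check below is a small sketch; `parse_version` and `meets_minimum` are hypothetical helpers written for this example, and they only handle plain numeric versions such as `1.3.2`.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_version(text, width=3):
    """Turn '1.3.2' into (1, 3, 2); non-numeric parts and missing parts become 0."""
    parts = []
    for piece in text.split(".")[:width]:
        if not piece.isdigit():
            break
        parts.append(int(piece))
    parts += [0] * (width - len(parts))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)

try:
    installed = version("scikit-learn")
    ok = meets_minimum(installed, "1.3")
    status = "OK" if ok else "too old for sklearn.cluster.HDBSCAN"
    print(f"scikit-learn {installed}: {status}")
except PackageNotFoundError:
    print("scikit-learn is not installed")
```

If the check reports an old version, upgrading with `pip install -U scikit-learn` should resolve it.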

We now move on to the code, starting with the data. The fetch_20newsgroups function downloads the 20 Newsgroups dataset, a collection of Usenet posts labeled by topic. Although the dataset contains labels, we ignore them to simulate a situation where this information is unknown, and cluster the documents purely by similarity. To keep the example small, we restrict the data to three categories and sample 150 documents.

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Fetching a targeted subset: three categories from the training split
categories = ['sci.space', 'sci.med', 'rec.autos']
newsgroups = fetch_20newsgroups(subset='train', categories=categories, remove=('headers', 'footers', 'quotes'))

# Dropping near-empty posts (100 characters or fewer), then sampling 150 documents
df = pd.DataFrame({'text': newsgroups.data, 'true_label': newsgroups.target})
df = df[df['text'].str.strip().str.len() > 100].sample(150, random_state=42).reset_index(drop=True)
print(f"Loaded {len(df)} textual documents.")
print("\nExample Document:")
print(df['text'].iloc[0][:150] + "...")

Generating Embeddings

The next step is to generate embeddings from the raw texts. To do this, we use the all-MiniLM-L6-v2 model from the sentence-transformers library by Hugging Face. This model is lightweight yet effective for quickly obtaining embeddings.

from sentence_transformers import SentenceTransformer

# Loading the free open-source model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encoding textual documents into dense vector embeddings
print("Generating embeddings...")
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)
print(f"Shape of the embedding matrix: {embeddings.shape}")
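
The model maps each document to a 384-dimensional vector, so the matrix here should have shape (150, 384). UMAP and scikit-learn's HDBSCAN both default to Euclidean distance, whereas text embeddings are usually compared by cosine similarity. One common way to reconcile the two, sketched below on random stand-in data rather than the real embeddings, is to L2-normalize the rows: for unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity, so both measures rank neighbors identically. The `l2_normalize` helper is an assumption of this example, not part of the original pipeline.

```python
import numpy as np

def l2_normalize(matrix):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return matrix / norms

# Random stand-in for an embedding matrix: 5 "documents", 384 dimensions
rng = np.random.default_rng(42)
fake_embeddings = rng.normal(size=(5, 384))
unit = l2_normalize(fake_embeddings)

cosine = unit @ unit.T                        # pairwise cosine similarity
diff = unit[:, None, :] - unit[None, :, :]
sq_euclidean = (diff ** 2).sum(axis=-1)       # pairwise squared Euclidean distance

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(np.allclose(sq_euclidean, 2 - 2 * cosine))
```

Some sentence-transformers models already return unit-length vectors, in which case this step is a no-op; checking `np.linalg.norm(embeddings, axis=1)` is a cheap way to find out.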

The embeddings have 384 dimensions, and density-based clustering tends to degrade in high-dimensional spaces, where distances between points become less informative. We therefore reduce the dimensionality with the UMAP algorithm from the previously installed library:

import umap

# Reducing the dimensions of the embeddings to 5, to retain enough density information for clustering
reducer = umap.UMAP(n_neighbors=15, n_components=5, min_dist=0.0, random_state=42)
reduced_embeddings = reducer.fit_transform(embeddings)
print(f"Shape of the reduced matrix: {reduced_embeddings.shape}")
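
The motivation for reducing dimensionality before a density-based method can be illustrated with distance concentration: in high-dimensional spaces, the nearest and farthest neighbors of a point tend to be almost equally far away, which blurs the contrast between dense and sparse regions. The sketch below uses random Gaussian points, not the real embeddings, and `relative_contrast` is a hypothetical helper defined for this example.

```python
import numpy as np

def relative_contrast(points):
    """Mean over points of (farthest - nearest) / nearest neighbor distance."""
    sq_norms = (points ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * points @ points.T
    dist = np.sqrt(np.clip(sq_dist, 0, None))
    np.fill_diagonal(dist, np.nan)            # ignore each point's distance to itself
    nearest = np.nanmin(dist, axis=1)
    farthest = np.nanmax(dist, axis=1)
    return float(np.mean((farthest - nearest) / nearest))

rng = np.random.default_rng(0)
contrast = {
    dims: relative_contrast(rng.normal(size=(150, dims)))
    for dims in (5, 384)
}
for dims, value in contrast.items():
    print(f"{dims:>3} dimensions: relative contrast = {value:.2f}")
```

Random data is a worst case, since real embeddings carry structure, but the effect suggests why a density estimate is easier to form in 5 dimensions than in 384.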

Applying HDBSCAN

With our numerical embedding vectors reduced to five dimensions, we can now apply the HDBSCAN algorithm to see if this compact representation allows for meaningful clustering.

from sklearn.cluster import HDBSCAN

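# Initializing HDBSCAN
# min_cluster_size=8: each cluster must contain at least 8 documents
# min_samples=3: a lower value makes the algorithm less conservative, leaving fewer points as noise
# store_centers='centroid': keeps each cluster's centroid in clusterer.centroids_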
clusterer = HDBSCAN(min_cluster_size=8, min_samples=3, store_centers='centroid')
df['cluster'] = clusterer.fit_predict(reduced_embeddings)

# Counting instances by cluster
cluster_counts = df['cluster'].value_counts()
print("\nCluster Distribution:")
print(cluster_counts)

Results

It appears that HDBSCAN has identified two main clusters, each corresponding to a high-density region of the reduced space. Points that do not fall in any dense region are labeled -1 and treated as noise rather than being forced into a cluster. Let's take a closer look:

for cluster_id in sorted(df['cluster'].unique()):
    if cluster_id == -1:
        print("\n=== CLUSTER: NOISE / UNCLASSIFIED ===")
    else:
        print(f"\n=== CLUSTER: Discovered Topic #{cluster_id} ===")
    # Showing up to 3 example texts from this group (noise included)
    samples = df[df['cluster'] == cluster_id]['text'].head(3).tolist()
    for i, sample in enumerate(samples, 1):
        # Collapsing whitespace and truncating to 120 characters for display
        clean_sample = " ".join(sample.split())[:120]
        print(f"  {i}. {clean_sample}...")
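
Because this dataset happens to ship with ground-truth categories (stored in `true_label` and ignored during clustering), the clusters can be sanity-checked afterwards. The sketch below computes the noise fraction and a simple purity score on toy label lists; `cluster_purity` is a hypothetical helper, and the numbers are invented for illustration rather than taken from the run above.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels, noise_id=-1):
    """Return (purity over non-noise points, fraction of points marked noise)."""
    by_cluster = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        by_cluster[cid].append(label)
    noise = len(by_cluster.pop(noise_id, []))
    clustered = sum(len(members) for members in by_cluster.values())
    # For each cluster, count the members that carry its most common true label
    majority = sum(Counter(m).most_common(1)[0][1] for m in by_cluster.values())
    purity = majority / clustered if clustered else 0.0
    return purity, noise / len(cluster_ids)

# Toy data: two clusters of four points each, plus two noise points
toy_clusters = [0, 0, 0, 1, 1, 1, 1, -1, -1, 0]
toy_labels = [2, 2, 1, 0, 0, 0, 1, 1, 2, 2]
purity, noise_fraction = cluster_purity(toy_clusters, toy_labels)
print(f"purity={purity:.2f}, noise={noise_fraction:.2f}")
```

On the real data, the call would be `cluster_purity(df['cluster'], df['true_label'])`. Purity rewards many small clusters, so it is best read alongside the number of clusters and the noise fraction.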

Conclusion

The clustering results depend in part on the hyperparameters we set for HDBSCAN. It is worth experimenting with the minimum cluster size, min_samples, and other settings to see how they change the number of clusters and the share of points labeled as noise.