Brief IA

Scikit-LLM: Comprehensive Sentiment Analysis Pipeline

Research · Tom Levy

Key Takeaways
1. Traditional machine learning pipelines rely on structured numerical features for text classification.
2. Techniques such as TF-IDF frequencies and token embeddings are commonly used to feed predictive models.
3. Classic models include logistic regression, ensembles, and support vector machines.
Why it matters: These traditional methods are essential for improving the accuracy and efficiency of text analysis.

Full Analysis

Introduction

Traditional machine learning pipelines for predictive tasks such as text classification typically rely on extracting structured numerical features from raw text — for example, TF-IDF frequencies or token embeddings — to feed into classical models like logistic regression, ensembles, or support vector machines.
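
To make "structured numerical features" concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It uses the textbook formula tf × log(N / df) with whitespace tokenization, both of which are simplifying assumptions; scikit-learn's TfidfVectorizer applies a smoothed idf and normalizes each vector, so its numbers will differ. The function name tfidf and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy documents, invented for this example
docs = ["great movie great cast", "terrible movie", "great fun"]
features = tfidf(docs)
print(features[0])
```

In the first document, "cast" appears in only one of the three documents, so it receives a higher weight than "movie", which appears in two. A classical model such as logistic regression would be trained on vectors built from these weights.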

With the emergence of large language models (LLMs), another option is available: the zero-shot or few-shot reasoning of an existing pre-trained model can be applied to language tasks from within a machine learning framework. Scikit-LLM is a Python library built for this purpose: it wraps LLM API calls in scikit-learn-style estimators, bridging classical machine learning and modern LLMs. In this article, we use Scikit-LLM with open-source models hosted on Groq to build a complete sentiment analysis pipeline (sentiment analysis being a specific form of text classification), from preprocessing to inference, with reasonably fast inference. The data comes from a large, realistically sized dataset: the IMDB movie reviews dataset.

Prerequisites, Setup, and Dataset Acquisition

To run the code presented in this tutorial, you need to have the Scikit-LLM library installed:

pip install scikit-llm

Once installed, the first step is to configure it and set the API credentials. In other words, we need to "connect" Scikit-LLM to an endpoint, namely an LLM API provider such as Groq. Sign up on Groq and generate an API key at https://console.groq.com/keys, then paste it into the code below:

from skllm.config import SKLLMConfig

# 1. Pointing to a compatible Groq endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-WILL-GO-HERE")

Scikit-LLM's set_gpt_url function sets the base URL used for OpenAI-compatible requests. By default, requests go to OpenAI; here we redirect them to Groq's OpenAI-compatible endpoint, https://api.groq.com/openai/v1.
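
Pasting a key directly into source code makes it easy to leak through version control. One possible alternative, sketched below, is to read it from an environment variable. The helper load_groq_key and the variable name GROQ_API_KEY are assumptions made for this example, not part of Scikit-LLM.

```python
import os

def load_groq_key(env=os.environ, var_name="GROQ_API_KEY"):
    """Return the API key stored in an environment variable, or fail loudly."""
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key

# Usage with the configuration calls shown above (requires scikit-llm):
# from skllm.config import SKLLMConfig
# SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")
# SKLLMConfig.set_openai_key(load_groq_key())
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the first API request.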

The next step is to import the IMDB movie reviews dataset — which contains about 50,000 instances — and prepare it for the sentiment analysis pipeline we are going to build. The instances consist of a text review labeled with a sentiment, which can be positive or negative (this is a binary classification problem, solvable with models like logistic regression, for example).

For convenience, we will read the dataset from a public GitHub repository in CSV format:

import pandas as pd
from sklearn.model_selection import train_test_split

# Fetching a large realistically sized dataset (IMDB movie reviews - 50,000 rows)
# We will read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
print("Downloading the dataset...")
df = pd.read_csv(url)
print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free API, sending 50,000 requests
# will likely trigger quota limits. Thus, we will use 500 rows to demonstrate the execution of our pipeline.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]  # Labels are 'positive' or 'negative'

# Splitting into training (to initialize zero-shot labels) and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Note that we kept only 500 of the 50,000 rows for demonstration purposes: every review becomes an API request, so running the full dataset would take a long time and would likely exceed the quota of a free API key. You can change the sample size, n=500, to suit your needs.
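
A plain random sample of a few hundred rows is not guaranteed to contain the same number of positive and negative reviews. If an exactly balanced subset matters for your evaluation, one option (not what this article does) is to sample a fixed number of rows per label. The sketch below uses only the standard library; balanced_sample and the toy rows are hypothetical names and data introduced for illustration.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, label_key, n_per_label, seed=42):
    """Draw n_per_label rows at random from each label group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    rng = random.Random(seed)
    sample = []
    for label in sorted(groups):
        # random.sample raises ValueError if a group has fewer rows than requested
        sample.extend(rng.sample(groups[label], n_per_label))
    rng.shuffle(sample)
    return sample

# Toy rows standing in for the real data: 10 negative and 20 positive
rows = [
    {"review": f"review {i}", "sentiment": "negative" if i % 3 == 0 else "positive"}
    for i in range(30)
]
sample = balanced_sample(rows, "sentiment", n_per_label=5)
print(Counter(row["sentiment"] for row in sample))  # each label appears 5 times
```

With the real DataFrame, the rows could be obtained with df.to_dict("records") before calling the helper.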

Building the Sentiment Analysis Pipeline

Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model configuration or training, inference, and evaluation. For a text-based predictive scenario like ours, preprocessing typically involves cleaning and normalizing the text. Scikit-learn provides a neat class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:

from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and eliminating extra spaces."""
    series = pd.Series(texts).astype(str)
    # Remove HTML tags like <br />
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    # Remove extra spaces
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()

# Encapsulating the cleaning function to allow its use in a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)
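
To see what the two regular expressions do, here is a small standalone sketch that applies the same patterns to a single string with the standard re module. The helper clean_one and the sample strings are made up for this illustration.

```python
import re

def clean_one(text):
    """Apply the same two substitutions as clean_text_data to one string."""
    # Replace anything that looks like an HTML tag with a space
    no_tags = re.sub(r"<[^>]+>", " ", str(text))
    # Trim the ends, then collapse runs of whitespace into one space
    return re.sub(r"\s+", " ", no_tags.strip())

raw = "A great film.<br /><br />  Loved   it! "
print(clean_one(raw))  # A great film. Loved it!

# Caveat: the tag pattern also swallows text between a literal "<" and ">"
print(clean_one("x < y and y > z"))  # x z
```

The second call shows a limitation of this simple pattern: it cannot tell a real tag from angle brackets used in ordinary text. For movie reviews this is rarely a problem, but it is worth knowing before reusing the cleaner elsewhere.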

We now combine this preprocessing object with a model instance to create the pipeline. Once defined, the pipeline applies the cleaning step and then hands the cleaned text to the model, during both fitting and inference. Although we use the term "training," no weights are updated, because we are using a pre-trained model hosted on Groq for zero-shot classification. Fitting consists solely of passing the model the classification labels to use.

from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # Updated to use Groq's active Llama 3.1 8B model
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For Zero-Shot classification, fit() does not train the LLM.
# It simply records the unique labels present in 'y_train' (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)
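
The idea that fit() only records labels can be illustrated with a toy stand-in. The class below is a hypothetical sketch, not Scikit-LLM's implementation: the prompt wording, the fallback to the first label, and the ask_llm callable are all assumptions made for this example. A stub function replaces the real API call so the code runs offline.

```python
class ToyZeroShotClassifier:
    """Hypothetical stand-in showing the fit/predict contract, not Scikit-LLM's code."""

    def __init__(self, ask_llm):
        # ask_llm: any callable that takes a prompt string and returns text
        self.ask_llm = ask_llm

    def fit(self, X, y):
        # No weights are updated: only the candidate labels are remembered
        self.classes_ = sorted(set(y))
        return self

    def build_prompt(self, text):
        labels = ", ".join(self.classes_)
        return (
            f"Classify the review into one of: {labels}.\n"
            f"Review: {text}\n"
            "Answer with the label only."
        )

    def predict(self, X):
        predictions = []
        for text in X:
            answer = self.ask_llm(self.build_prompt(text)).strip().lower()
            # Fall back to the first label if the reply is not a known label
            predictions.append(answer if answer in self.classes_ else self.classes_[0])
        return predictions


def fake_llm(prompt):
    """Offline stub standing in for a real API call."""
    review = prompt.split("Review: ", 1)[1]
    return "positive" if "loved" in review.lower() else "negative"


clf = ToyZeroShotClassifier(fake_llm).fit(["text a", "text b"], ["positive", "negative"])
print(clf.classes_)                                  # ['negative', 'positive']
print(clf.predict(["I loved it", "Dull and slow"]))  # ['positive', 'negative']
```

Because the training texts are never used, the quality of the predictions depends entirely on the underlying model and the prompt, not on how many rows are passed to fit().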

Once we have executed the pipeline to "fit" the model, we use it again for inference. Both steps utilize familiar scikit-learn syntax. In addition to evaluating the pipeline's performance, we also display a few examples of predictions:

from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")
# Run predictions through the pipeline
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on realistic data
print("\n--- Classification Report ---")
print(classification_report(y_test, predictions))

# Display a few examples side by side
print("\n--- Prediction Examples ---")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate review for display purposes
    short_review = review[:100]
    print(f"Review: {short_review}...")
    print(f"Actual: {actual} | Predicted: {predicted}\n")
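
The classification report lists precision and recall for each label. As a reminder of what those numbers mean, here is a minimal sketch that computes them by hand for one label. The function name precision_recall is hypothetical, and the label lists are made-up toy values, not results from the pipeline.

```python
def precision_recall(y_true, y_pred, label):
    """Compute precision and recall for one label from paired lists."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for _, p in pairs if p == label)
    actual = sum(1 for t, _ in pairs if t == label)
    # Guard against division by zero when a label is never predicted or never present
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Toy labels, invented for this example
y_true = ["positive", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]
print(precision_recall(y_true, y_pred, "positive"))  # (1.0, 0.5)
```

Here every review predicted as positive really was positive (precision 1.0), but only one of the two positive reviews was found (recall 0.5).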
