Scikit-LLM: Comprehensive Sentiment Analysis Pipeline


Key Takeaways

1. Traditional machine learning pipelines rely on structured and numerical features for text classification.

2. Techniques such as TF-IDF frequencies and token embeddings are commonly used to feed predictive models.

3. Classic models include logistic regression, ensembles, and support vector machines.

💡 Why it matters: these traditional methods remain the reference point for text classification, and Scikit-LLM brings LLM-based classifiers into the same scikit-learn workflow.

Introduction

Traditional machine learning pipelines for predictive tasks such as text classification typically rely on extracting structured numerical features from raw text — for example, TF-IDF frequencies or token embeddings — to feed into classical models like logistic regression, ensembles, or support vector machines.
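
For contrast with the LLM-based approach that follows, here is a minimal sketch of the classical route described above: TF-IDF features feeding a logistic regression. The six toy reviews and their labels are invented purely for illustration and are far too few to train a useful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data invented for illustration only (not taken from IMDB)
toy_reviews = [
    "A wonderful film with a great cast",
    "Great story and wonderful acting",
    "I loved this wonderful movie",
    "A terrible film with a boring plot",
    "Boring story and terrible acting",
    "I hated this boring movie",
]
toy_labels = ["positive", "positive", "positive", "negative", "negative", "negative"]

# Step 1 turns raw text into TF-IDF vectors; step 2 learns weights on those vectors
baseline_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
baseline_pipeline.fit(toy_reviews, toy_labels)

toy_predictions = baseline_pipeline.predict(["what a wonderful, great film"])
print(toy_predictions)
```

Unlike the zero-shot setup used later, this model actually learns its weights from the labeled examples, so its quality depends directly on how much labeled data is available.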

With the emergence of large language models (LLMs), this picture has changed somewhat: it is now possible to apply the zero-shot or few-shot reasoning of existing pre-trained models to linguistic tasks within a machine learning framework. Scikit-LLM is a Python library that addresses this need by bridging classical machine learning and modern LLM API calls. In this article, we will use Scikit-LLM with models served by Groq to build a complete sentiment analysis pipeline (sentiment analysis being a specific form of text classification), obtaining reasonably fast inference with open-source models. From preprocessing to inference, we will work with a large, realistically sized dataset: the IMDB movie reviews dataset.

Prerequisites, Setup, and Dataset Acquisition

To run the code presented in this tutorial, you need to have the Scikit-LLM library installed:

pip install scikit-llm

Once installed, the first step is to configure the library and set the API credentials. In other words, we need to "connect" Scikit-LLM to an endpoint, in this case an LLM API provider like Groq. Sign up on Groq and generate an API key at https://console.groq.com/keys, then paste it into the code below:

from skllm.config import SKLLMConfig

# 1. Pointing to a compatible Groq endpoint
SKLLMConfig.set_gpt_url("https://api.groq.com/openai/v1")

# 2. Set your free Groq API key
# Get yours at https://console.groq.com/keys
SKLLMConfig.set_openai_key("YOUR-API-KEY-WILL-GO-HERE")

Scikit-LLM targets the OpenAI API by default. The set_gpt_url function registers a custom OpenAI-compatible endpoint instead, and here we point it at Groq's URL: https://api.groq.com/openai/v1. Because Groq accepts the same request format, the key is still supplied through set_openai_key.
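
Pasting a key directly into a script makes it easy to leak by accident. As a hedged alternative, the sketch below reads the key from an environment variable and then applies the same two configuration calls. The variable name GROQ_API_KEY and both helper functions are assumptions made for this example, not part of Scikit-LLM.

```python
import os

GROQ_OPENAI_COMPATIBLE_URL = "https://api.groq.com/openai/v1"

def read_api_key(env_var="GROQ_API_KEY", environ=None):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    # GROQ_API_KEY is a name chosen for this sketch; any variable name works
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the pipeline.")
    return key

def configure_scikit_llm(api_key):
    """Apply the same two configuration calls as above, using the supplied key."""
    # Imported lazily so read_api_key() can be used without Scikit-LLM installed
    from skllm.config import SKLLMConfig
    SKLLMConfig.set_gpt_url(GROQ_OPENAI_COMPATIBLE_URL)
    SKLLMConfig.set_openai_key(api_key)

# Usage (requires the environment variable to be set):
# configure_scikit_llm(read_api_key())
```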

The next step is to import the IMDB movie reviews dataset, which contains about 50,000 instances, and prepare it for the sentiment analysis pipeline we are going to build. Each instance is a text review labeled with a sentiment, either positive or negative. This makes the task a binary classification problem, the kind classically solved with models like logistic regression.

For convenience, we will read the dataset from a public GitHub repository in CSV format:

import pandas as pd
from sklearn.model_selection import train_test_split

# Fetching a large realistically sized dataset (IMDB movie reviews - 50,000 rows)
# We will read the data from a public raw CSV for convenience
url = "https://raw.githubusercontent.com/Ankit152/IMDB-sentiment-analysis/master/IMDB-Dataset.csv"
print("Downloading the dataset...")
df = pd.read_csv(url)
print(f"Total dataset size: {df.shape[0]} rows")

# In a realistic LLM pipeline using a free API, sending 50,000 requests
# will likely trigger quota limits. Thus, we will use 500 rows to demonstrate the execution of our pipeline.
# Feel free to use more data if you have paid API access.
df_sampled = df.sample(n=500, random_state=42)

# The IMDB dataset contains HTML tags and formatting noise: perfect for testing our cleaner
X = df_sampled["review"]
y = df_sampled["sentiment"]  # Labels are 'positive' or 'negative'

# Splitting into training (to initialize zero-shot labels) and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Note that we sampled only 500 rows for demonstration purposes: every review becomes an API request, so a larger sample takes longer to process and is more likely to hit the quota limits of a free API key. You can freely change the sample size, n=500, to suit your own needs.
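
One optional refinement, not used in the code above: with a small sample, a plain random split can leave the two sentiment classes slightly unbalanced between the training and test sets. Passing stratify to train_test_split preserves the class proportions. The sketch below uses ten invented placeholder reviews rather than the real data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data invented for illustration: five reviews per class
toy_reviews = [f"placeholder review {i}" for i in range(10)]
toy_labels = ["positive", "negative"] * 5

# stratify keeps the positive/negative ratio the same in both subsets
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)
print(Counter(toy_y_train), Counter(toy_y_test))
```

Applied to the tutorial's data, the equivalent change would be adding stratify=y to the existing train_test_split call.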

Building the Sentiment Analysis Pipeline

Here comes the most interesting part of the process! A data science pipeline boils down to a series of preprocessing, cleaning, and data preparation steps followed by model configuration or training, inference, and evaluation. For a text-based predictive scenario like ours, preprocessing typically involves cleaning and normalizing the text. Scikit-learn provides a neat class, FunctionTransformer, to define and encapsulate preprocessing steps based on a custom function:

from sklearn.preprocessing import FunctionTransformer

def clean_text_data(texts):
    """Cleans raw text inputs by removing HTML tags and eliminating extra spaces."""
    series = pd.Series(texts).astype(str)
    # Remove HTML tags like <br />
    cleaned = series.str.replace(r'<[^>]+>', ' ', regex=True)
    # Remove extra spaces
    cleaned = cleaned.str.strip().str.replace(r'\s+', ' ', regex=True)
    return cleaned.tolist()

# Encapsulating the cleaning function to allow its use in a Pipeline object
text_cleaner = FunctionTransformer(clean_text_data)
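
To see what the cleaning step does to a single review, the sketch below applies the same two regular expressions with the standard-library re module. The helper clean_review and the sample string are invented for this illustration.

```python
import re

def clean_review(text):
    """Apply the same two substitutions as clean_text_data to a single string."""
    # Replace each HTML tag with a space so neighbouring words do not fuse together
    without_tags = re.sub(r"<[^>]+>", " ", str(text))
    # Trim the ends, then collapse every run of whitespace into one space
    return re.sub(r"\s+", " ", without_tags.strip())

sample_review = "  I loved it!<br /><br />Great   cast. "
print(clean_review(sample_review))  # I loved it! Great cast.
```

Replacing tags with a space rather than an empty string matters here: removing the two line-break tags outright would glue "it!" and "Great" into a single token.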

We will now combine this preprocessing object with a model instance to create the pipeline. Once defined, the pipeline applies the data preparation steps and passes the result to the model during both training and inference. Although we use the term "training," no weights are updated, since we are using a pre-trained model served by Groq for zero-shot classification. Fitting the model consists solely of passing it the classification labels to use.

from sklearn.pipeline import Pipeline
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# Define the end-to-end pipeline
sentiment_pipeline = Pipeline([
    ("cleaner", text_cleaner),
    # Llama 3.1 8B model served by Groq, reached through the custom URL configured earlier
    ("llm_classifier", ZeroShotGPTClassifier(model="custom_url::llama-3.1-8b-instant"))
])

# Fit the pipeline
# Note: For Zero-Shot classification, fit() does not train the LLM.
# It simply records the unique labels present in 'y_train' (positive, negative).
print("Fitting the pipeline...")
sentiment_pipeline.fit(X_train, y_train)
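
Because every prediction consumes API quota, it can help to dry-run the pipeline wiring offline first. The sketch below swaps the LLM step for a hypothetical stand-in, KeywordStubClassifier, whose fit() only records the label set and whose predict() applies a crude keyword rule. Both the class and the placeholder cleaning function are invented for this example and say nothing about how Scikit-LLM is implemented internally.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

class KeywordStubClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical offline stand-in: fit() records labels, predict() uses a keyword rule."""

    def fit(self, X, y):
        # Nothing is learned from X; only the set of labels is kept
        self.classes_ = sorted(set(y))
        return self

    def predict(self, X):
        negative_words = ("bad", "boring", "terrible")
        return [
            "negative" if any(word in text.lower() for word in negative_words) else "positive"
            for text in X
        ]

def strip_whitespace(texts):
    """Placeholder cleaning step standing in for clean_text_data."""
    return [str(text).strip() for text in texts]

dry_run_pipeline = Pipeline([
    ("cleaner", FunctionTransformer(strip_whitespace)),
    ("llm_classifier", KeywordStubClassifier()),
])
dry_run_pipeline.fit(["  great film ", " boring plot "], ["positive", "negative"])
dry_run_predictions = dry_run_pipeline.predict(["  A terrible sequel ", "A lovely surprise"])
print(dry_run_predictions)
```

Once the wiring behaves as expected, the stub can be replaced by the real classifier without touching the rest of the pipeline.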

Once we have executed the pipeline to "fit" the model, we use it again for inference. Both steps utilize familiar scikit-learn syntax. In addition to evaluating the pipeline's performance, we also display a few examples of predictions:

from sklearn.metrics import classification_report

print(f"Running predictions on {len(X_test)} test samples...")
# Run predictions through the pipeline
predictions = sentiment_pipeline.predict(X_test)

# Evaluate the pipeline's performance on realistic data
print("\n--- Classification Report ---")
print(classification_report(y_test, predictions))

# Display a few examples side by side
print("\n--- Prediction Examples ---")
for review, actual, predicted in zip(X_test[:3], y_test[:3], predictions[:3]):
    # Truncate review for display purposes
    short_review = review[:100]
    print(f"Review: {short_review}...")
    print(f"Actual: {actual} | Predicted: {predicted}\n")
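
If predictions on a larger sample run into the quota limits mentioned earlier, one simple mitigation is to predict in small batches with a pause in between. The helper predict_in_batches below is a hypothetical sketch: the batch size and pause length are arbitrary placeholders, not values derived from Groq's actual limits, and the demonstration uses a fake predictor so that no API call is made.

```python
import time

def predict_in_batches(predict_fn, texts, batch_size=20, pause_seconds=2.0, sleep=time.sleep):
    """Call predict_fn on consecutive slices of texts, pausing between slices."""
    texts = list(texts)
    predictions = []
    for start in range(0, len(texts), batch_size):
        if start > 0:
            sleep(pause_seconds)  # crude throttle between batches
        batch = texts[start:start + batch_size]
        predictions.extend(predict_fn(batch))
    return predictions

# Offline demonstration with a fake predictor, so no API call is made
def fake_predict(batch):
    return ["positive" for _ in batch]

recorded_pauses = []
demo_predictions = predict_in_batches(
    fake_predict, ["a", "b", "c", "d", "e"], batch_size=2, pause_seconds=0.5,
    sleep=recorded_pauses.append,  # record the pauses instead of actually sleeping
)
print(len(demo_predictions), recorded_pauses)  # 5 [0.5, 0.5]
```

With the real pipeline, the call would look like predict_in_batches(sentiment_pipeline.predict, X_test), with batch_size and pause_seconds tuned to the limits of your own API plan.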
