LLM: Textstat and LangChain Tackle Verbose Responses

Key Takeaways

1. Large language models (LLMs) are often too verbose, increasing the risk of hallucinations.

2. Textstat, a Python library, helps measure readability and reduce the complexity of LLM responses.

3. An integrated LangChain pipeline in Google Colab uses Textstat to simplify responses and limit factual errors.

💡 Why it matters: Reducing the verbosity of LLMs is crucial to minimize errors and improve the reliability of AI-generated responses.

Large language models (LLMs) are often criticized for their tendency to produce excessively verbose responses. This characteristic, while a result of optimization to make the models as useful and conversational as possible, can lead to problems. Indeed, the longer a response is, the more likely it is to stray from established facts and generate "hallucinations."

To counter this phenomenon, it helps to put robust safeguards in place. One proposed tool is the Python library Textstat, which measures the readability of text generated by LLMs. Textstat implements indices such as the Automated Readability Index (ARI), which estimates the US school grade level required to understand a text. If this level exceeds a chosen threshold, for example an ARI score of 10.0, a feedback loop can be triggered to simplify the response.

Defining a Complexity Budget with Textstat

The Textstat library can calculate scores such as the Automated Readability Index (ARI) for any text, including a model's response. If this complexity metric exceeds a budget or threshold (such as 10.0, roughly a 10th-grade reading level), a feedback loop can be triggered automatically to request a more concise and simpler response. This strategy not only cuts flowery language but can also help reduce the risk of hallucination, since the model is pushed to stick more closely to the core facts.
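
To make the budget check concrete, here is a minimal sketch that computes the ARI directly from its standard formula, 4.71 × (characters / words) + 0.5 × (words / sentences) - 21.43, and compares the result against a budget. The helper names `ari_score` and `exceeds_budget` are illustrative and not part of Textstat, and the tokenization is deliberately naive, so the scores may differ slightly from what `textstat.automated_readability_index` returns.

```python
import re


def ari_score(text: str) -> float:
    """Approximate the Automated Readability Index of `text`."""
    words = text.split()
    # Naive sentence split on '.', '!' or '?' (an assumption; Textstat is more careful)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    # ARI counts letters and digits only, not punctuation or spaces
    characters = sum(ch.isalnum() for ch in text)
    return 4.71 * characters / len(words) + 0.5 * len(words) / len(sentences) - 21.43


def exceeds_budget(text: str, complexity_budget: float = 10.0) -> bool:
    """Return True when the text is more complex than the budget allows."""
    return ari_score(text) > complexity_budget


plain = "The cat sat on the mat. It was warm."
ornate = "This propensity for circumlocution frequently obfuscates the fundamental semantic load."

for label, sample in (("plain", plain), ("ornate", ornate)):
    print(f"{label}: ARI={ari_score(sample):.2f}, over budget: {exceeds_budget(sample)}")
```

Long words and long sentences both push the score up, which is why a single dense sentence can exceed a budget of 10.0 while short everyday sentences stay far below it.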

Implementing the LangChain Pipeline

To implement this strategy, a pipeline using LangChain can be set up in a Google Colab environment. You will need a Hugging Face API token, obtainable for free at https://huggingface.co/settings/tokens. Create a new "secret" named HF_TOKEN in the left menu of Colab by clicking on the "Secrets" icon (which looks like a key). Paste the generated API token into the "Value" field, and you are ready!

To get started, install the necessary libraries:

!pip install textstat langchain_huggingface langchain_community

The following code is specific to Google Colab, and you may need to adjust it based on your environment. It focuses on retrieving the stored API token:

from google.colab import userdata

# Retrieve the Hugging Face API token stored in your Colab session Secrets
HF_TOKEN = userdata.get('HF_TOKEN')

# Check if the token was retrieved
if not HF_TOKEN:
    print("WARNING: The 'HF_TOKEN' was not found. This may cause errors.")
else:
    print("Hugging Face token loaded successfully.")
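
Outside Colab, the `google.colab` module is not available, so the snippet above will fail on import. One possible adjustment, sketched here under the assumption that the token is exported as an environment variable named `HF_TOKEN`, is to fall back to `os.environ`. The function name `load_hf_token` is hypothetical.

```python
import os


def load_hf_token(name: str = "HF_TOKEN"):
    """Return the token from Colab Secrets if possible, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        return os.environ.get(name)
    try:
        # userdata.get raises if the secret is missing or access is not granted
        return userdata.get(name) or os.environ.get(name)
    except Exception:
        return os.environ.get(name)


HF_TOKEN = load_hf_token()
print("Token found." if HF_TOKEN else "WARNING: no token found.")
```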

The following code does two things. First, it sets up the components for local text generation with a pre-trained Hugging Face model, specifically distilgpt2. Then it wraps the model in a LangChain-compatible object so it can be used in a chain.

import textstat
import torch
from langchain_core.prompts import PromptTemplate
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_huggingface import HuggingFacePipeline

# Initialize a compatible, local, and free LLM for text generation
model_id = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,  # the model loaded above must be passed along with its tokenizer
    tokenizer=tokenizer,
    max_new_tokens=100,
    device=0 if torch.cuda.is_available() else -1,  # GPU 0 if available, otherwise CPU
)

# Wrap the pipeline in HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipe)

Next comes the main mechanism for measuring and managing verbosity. The following function generates a summary of the text passed to it (presumed to be a response from an LLM) and tries to keep that summary under a threshold complexity level. Note that with an appropriate prompt template, generation models like distilgpt2 can be used to obtain text summaries, although the quality of these summaries may not match that of heavier models built for summarization. We chose this model for its reliability when running locally in a constrained environment.

def safe_summarize(text_input, complexity_budget=10.0):
    print("\n--- Starting summary process ---")
    print(f"Input text length: {len(text_input)} characters")
    print(f"Target complexity budget (ARI score): {complexity_budget}")

    # Step 1: Generate the initial summary
    print("Generating initial complete summary...")
    base_prompt = PromptTemplate.from_template(
        "Provide a complete summary of the following text: {text}"
    )
    chain = base_prompt | llm
    summary = chain.invoke({"text": text_input})
    print("Initial summary generated:")
    print("-------------------------")
    print(summary)
    print("-------------------------")

    # Step 2: Measure readability
    ari_score = textstat.automated_readability_index(summary)
    print(f"Initial ARI score: {ari_score:.2f}")

    # Step 3: Apply the complexity budget
    if ari_score > complexity_budget:
        print("Budget exceeded! The initial summary is too complex.")
        print("Triggering simplification safeguard...")
        simplification_prompt = PromptTemplate.from_template(
            "The following text is too verbose. Rewrite it concisely "
            "using simple vocabulary, eliminating flowery language:\n\n{text}"
        )
        simplify_chain = simplification_prompt | llm
        simplified_summary = simplify_chain.invoke({"text": summary})
        new_ari = textstat.automated_readability_index(simplified_summary)
        print("Simplified summary generated:")
        print("-------------------------")
        print(simplified_summary)
        print("-------------------------")
        print(f"New ARI score: {new_ari:.2f}")
        summary = simplified_summary
    else:
        print("The initial summary is within the complexity budget. No simplification needed.")

    print("--- Summary process completed ---")
    return summary

Note that the code above calculates the ARI score on the initial summary and, when the budget is exceeded, again on the simplified version, so the effect of the safeguard can be compared.
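
The function above applies the simplification step at most once. If a true feedback loop is wanted, one option is to repeat the step until the score fits the budget, with a cap on the number of rounds and an early exit when a rewrite does not help. The sketch below is an assumption about how that could look, not the article's implementation: `enforce_budget`, `score_fn`, and `simplify_fn` are hypothetical names, and the toy scorer and simplifier only stand in for `textstat.automated_readability_index` and an LLM call so the example runs without a model.

```python
from typing import Callable, List, Tuple


def enforce_budget(
    text: str,
    score_fn: Callable[[str], float],
    simplify_fn: Callable[[str], str],
    complexity_budget: float = 10.0,
    max_rounds: int = 3,
) -> Tuple[str, List[float]]:
    """Simplify `text` until its score fits the budget or progress stalls."""
    current = score_fn(text)
    history = [current]
    for _ in range(max_rounds):
        if current <= complexity_budget:
            break  # within budget, nothing left to do
        candidate = simplify_fn(text)
        candidate_score = score_fn(candidate)
        if candidate_score >= current:
            break  # the rewrite did not help; keep the best version so far
        text, current = candidate, candidate_score
        history.append(current)
    return text, history


# Toy stand-ins so the sketch runs without a model (assumptions, not real metrics)
def toy_score(t: str) -> float:
    return float(len(t.split()))  # "complexity" = word count


def toy_simplify(t: str) -> str:
    return " ".join(t.split()[:-1])  # drop the last word


result, scores = enforce_budget(
    "one two three four five", toy_score, toy_simplify, complexity_budget=3.0
)
print(result, scores)
```

Returning the score history alongside the text makes it easy to see whether each round actually reduced complexity, which matters with a small model whose rewrites may not improve the score.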

Example Text and Function Test

The last part of the example code tests the previously defined function by passing in a sample text and a complexity budget of 10.0, displaying the final results.

Provide a very verbose and complex sample text:

sample_text = """
The inextricably linked permutations of cognitive spreadsheets in the
realm of Large Language Models often lead to a cascade of unnecessarily
labyrinthine lexical structures. This propensity for circumlocution, while
seemingly indicating profound erudition, frequently obfuscates the fundamental
semantic load, thereby rendering the generated discourse significantly less
accessible to the average citizen.
"""

Call the function:

print("Executing summary pipeline...\n")
final_output = safe_summarize(sample_text, complexity_budget=10.0)

Display the final result:

print("\n--- Final summary with safeguard ---")
print(final_output)

The resulting printed messages may be quite lengthy, but you may notice a slight decrease in the ARI score after the second call to the pre-trained model. Do not expect miraculous results: the chosen model, while lightweight, is not well suited to summarizing text, so any reduction in ARI score is likely to be modest. You might try other models such as google/flan-t5-small to see how they perform for text summarization, but be cautious, as such models may be heavier and harder to run. Note also that FLAN-T5 is an encoder-decoder model, so it needs the text2text-generation pipeline task rather than text-generation.

This article demonstrates how to set up a mechanism that measures and controls overly verbose LLM responses by calling an auxiliary model to summarize them and then checking their complexity level. In many scenarios, hallucinations are a byproduct of high verbosity. While the implementation presented here focuses on evaluating verbosity, there are also dedicated checks for measuring hallucinations, such as semantic consistency checks, cross-encoders for natural language inference (NLI), and LLM-as-a-judge approaches.

Iván Palomares Carrascosa is a leader, writer, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.
