Python: 5 Essential Decorators to Enhance LLM Applications


Key Takeaways

1. Python decorators simplify the management of LLM API calls, which are often costly and slow.

2. lru_cache and diskcache improve performance by serving repeated requests from a cache instead of calling the API again.

3. Tenacity and ratelimit add resilience to transient network failures and keep call rates within limits.

💡 Why it matters: These tools improve the efficiency and reliability of applications that use language models, which is crucial for developers.

Optimizing LLM Applications with Python

Python decorators are functions that wrap other functions to add behavior without changing their code. This makes them a good fit for managing complex logic in applications built on large language models (LLMs). Such applications usually depend on third-party APIs, which can be unpredictable, slow, and costly. Decorators make these interactions more efficient and robust by wrapping API calls with caching, retry, and throttling logic.
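As a minimal sketch of the wrapping idea, the decorator below times each call to a function. The names timed and fake_llm_call are hypothetical and used only for illustration, and the sleep stands in for a real API request.

```python
import functools
import time


def timed(func):
    """Report how long each call to the wrapped function takes."""
    @functools.wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.3f}s")
    return wrapper


@timed
def fake_llm_call(prompt: str) -> str:
    # Hypothetical stand-in for a real API request
    time.sleep(0.05)
    return f"Response to: {prompt}"


result = fake_llm_call("Hello")
print(result)
```

The decorators in the following sections follow the same pattern: the function body stays focused on the API call, while the wrapper adds the cross-cutting behavior.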

In-Memory Caching

The functools module in Python's standard library offers the lru_cache decorator, which is particularly useful for costly functions involving LLMs. Wrapping an LLM API call in this LRU (Least Recently Used) cache prevents redundant requests with identical inputs within the same process: a repeated call returns the stored result instead of hitting the API again. This removes the latency and cost of the duplicate call. Note that the cache only matches exactly equal arguments, and that all arguments must be hashable.

Here’s an example illustrating its use:

import time
from functools import lru_cache

@lru_cache(maxsize=100)
def summarize_text(text: str) -> str:
    print("Sending text to LLM...")
    time.sleep(1)  # Simulating network delay
    return f"Summary of {len(text)} characters."

print(summarize_text("The quick brown fox."))  # Takes one second
print(summarize_text("The quick brown fox."))  # Instant
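To check that repeated prompts are really served from memory, a function decorated with lru_cache exposes cache_info() and cache_clear(). The short sketch below redefines summarize_text without the simulated delay so that it runs instantly.

```python
from functools import lru_cache


@lru_cache(maxsize=100)
def summarize_text(text: str) -> str:
    # Hypothetical stand-in for a real LLM request
    return f"Summary of {len(text)} characters."


summarize_text("The quick brown fox.")  # miss: the function body runs
summarize_text("The quick brown fox.")  # hit: served from the cache
summarize_text("The quick brown fox!")  # miss: the text differs by one character

info = summarize_text.cache_info()  # named tuple: hits, misses, maxsize, currsize
print(info)

summarize_text.cache_clear()  # drop all cached entries
```

The third call shows a practical limit of this approach: prompts that differ by a single character count as different inputs, so the cache only helps when the text is identical.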

Persistent Disk Caching

The external library diskcache goes a step further by implementing a persistent cache on disk, backed by an SQLite database. This is especially useful for storing the results of time-consuming functions such as LLM API calls, so they can be retrieved quickly on subsequent calls. Consider this decorator when in-memory caching is insufficient because cached results must survive the end of a script or an application restart.

Here’s how it works:

import time
from diskcache import Cache

# Create a local cache directory (diskcache keeps its SQLite database inside it)
cache = Cache(".local_llm_cache")

@cache.memoize(expire=86400)  # Cached for 24 hours
def fetch_llm_response(prompt: str) -> str:
    print("Calling costly LLM API...")  # Replace this with an actual LLM API call
    time.sleep(2)  # Simulating API latency
    return f"Response to: {prompt}"

print(fetch_llm_response("What is quantum computing?"))  # 1st function call
print(fetch_llm_response("What is quantum computing?"))  # Instant load from disk here!
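To make the idea of a persistent cache concrete, here is a standard-library sketch of what such a decorator does conceptually. It is not diskcache's implementation: the disk_memoize decorator, the table layout, and the temporary file path are assumptions made for illustration, and it handles neither expiry nor functions with more than one argument.

```python
import functools
import os
import sqlite3
import tempfile


def disk_memoize(db_path: str):
    """Cache string results of a one-argument function in a SQLite file."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(prompt: str) -> str:
            conn = sqlite3.connect(db_path)
            try:
                conn.execute(
                    "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
                )
                row = conn.execute(
                    "SELECT value FROM cache WHERE key = ?", (prompt,)
                ).fetchone()
                if row is not None:
                    return row[0]  # cache hit: skip the expensive call
                value = func(prompt)
                conn.execute(
                    "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
                    (prompt, value),
                )
                conn.commit()
                return value
            finally:
                conn.close()
        return wrapper
    return decorator


calls = []  # records every time the "API" is really called
db_file = os.path.join(tempfile.mkdtemp(), "llm_cache.sqlite")


@disk_memoize(db_file)
def fetch_llm_response(prompt: str) -> str:
    calls.append(prompt)
    return f"Response to: {prompt}"


first = fetch_llm_response("What is quantum computing?")
second = fetch_llm_response("What is quantum computing?")
print(first == second, len(calls))
```

Because the results live in a file rather than in process memory, a second run of the script pointed at the same file would find the stored answer without calling the function at all.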

Network-Resilient Applications

Calls to LLM APIs often fail because of transient errors such as timeouts, rate limiting, and "502 Bad Gateway" responses. A retry library like tenacity, with its @retry decorator, can catch these failures and retry the call automatically.

The example below illustrates this resilient behavior by simulating a 70% chance of failure on each attempt. Because all four attempts fail about 24% of the time (0.7^4), running the script several times will eventually end in an error (by default tenacity raises a RetryError once it gives up): this is expected and intentional.

import random

from tenacity import retry, wait_exponential, stop_after_attempt, retry_if_exception_type

class RateLimitError(Exception): pass

# Up to 4 attempts in total, with exponentially growing waits (roughly 2, 4, and 8 seconds) between them
@retry(wait=wait_exponential(multiplier=2, min=2, max=10),
       stop=stop_after_attempt(4),
       retry=retry_if_exception_type(RateLimitError))
def call_flaky_llm_api(prompt: str):
    print("Attempting to call the API...")
    if random.random() < 0.7:  # Simulating a 70% chance of API failure
        raise RateLimitError("Rate limit exceeded! Downgrading.")
    return "Text generated successfully!"

print(call_flaky_llm_api("Write a haiku"))
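For readers curious about what a retry decorator does under the hood, the following standard-library sketch implements exponential backoff by hand. It is not tenacity's code: retry_with_backoff and its parameters are hypothetical, the failure pattern is made deterministic, and the sleep function is replaced by a list append so that the example runs instantly and the requested waits can be inspected.

```python
import functools
import time


def retry_with_backoff(max_attempts=4, base_delay=2.0, max_delay=10.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function, doubling the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    delay = min(base_delay * 2 ** (attempt - 1), max_delay)
                    sleep(delay)
        return wrapper
    return decorator


class RateLimitError(Exception):
    pass


waits = []     # delays requested between attempts
attempts = []  # one entry per real call


@retry_with_backoff(exceptions=(RateLimitError,), sleep=waits.append)
def call_flaky_llm_api(prompt: str) -> str:
    attempts.append(prompt)
    if len(attempts) < 3:  # fail twice, then succeed (deterministic for the demo)
        raise RateLimitError("Rate limit exceeded")
    return "Text generated successfully!"


result = call_flaky_llm_api("Write a haiku")
print(result, waits)
```

In production code a maintained library such as tenacity is usually the better choice, since it also covers jitter, logging hooks, and async functions.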

Client-Side Rate Limiting

The ratelimit library provides decorators that control how often a function can be called. This is useful for staying under the rate limits that external API providers impose, often expressed in requests per minute (RPM): when a client sends too many requests in a short window, the provider rejects the excess. The following example enforces a limit of 3 calls per 10 seconds on the client side, so the application pauses instead of hitting the provider's limit.

import time
from ratelimit import limits, sleep_and_retry

# Strictly enforcing a limit of 3 calls per 10-second window
@sleep_and_retry
@limits(calls=3, period=10)
def generate_text(prompt: str) -> str:
    print(f"[{time.strftime('%X')}] Processing: {prompt}")
    return f"Processed: {prompt}"

# The first 3 display immediately, the 4th pauses, thus respecting the limit
for i in range(5):
    generate_text(f"Prompt {i}")
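The pausing behavior can be sketched with a simple fixed-window counter. This is an illustrative standard-library version, not the ratelimit library's implementation: rate_limited, fake_now, and fake_sleep are hypothetical names, the clock and sleep functions are injected so the demo runs without really waiting, and the sketch is not thread-safe.

```python
import functools
import time


def rate_limited(calls: int, period: float, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `calls` invocations per `period` seconds, sleeping when over."""
    def decorator(func):
        window_start = None
        count = 0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal window_start, count
            now = clock()
            if window_start is None or now - window_start >= period:
                window_start, count = now, 0  # start a fresh window
            if count >= calls:
                sleep(period - (now - window_start))  # wait out the current window
                window_start, count = clock(), 0
            count += 1
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Fake clock for the demo: "sleeping" just advances the simulated time
fake_now = [0.0]
pauses = []


def fake_sleep(seconds: float) -> None:
    pauses.append(seconds)
    fake_now[0] += seconds


@rate_limited(calls=3, period=10, clock=lambda: fake_now[0], sleep=fake_sleep)
def generate_text(prompt: str) -> str:
    return f"Processed: {prompt}"


outputs = [generate_text(f"Prompt {i}") for i in range(5)]
print(outputs)
print(pauses)
```

With the default clock and sleep arguments, the same decorator would block for real, which matches the behavior described for the fourth call in the example above.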

Structured Output Binding

The fifth decorator on the list comes from the magentic library, which works together with Pydantic to obtain structured responses from LLM APIs. You declare the expected result as a Pydantic model in the function's return annotation, and the decorator handles the underlying prompting and the Pydantic-driven parsing. This matters because getting an LLM to return well-formed data, such as JSON objects, reliably is otherwise tedious and error-prone, and the decorator removes hand-written parsing and validation code from the application.
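The parsing half of this idea can be sketched with the standard library alone. The example below is not magentic's API and does not use Pydantic: structured_output, Sentiment, and classify are hypothetical names, a dataclass stands in for a Pydantic model, and the LLM response is a hard-coded JSON string.

```python
import functools
import json
from dataclasses import dataclass, fields


@dataclass
class Sentiment:
    label: str
    score: float


def structured_output(model):
    """Parse the JSON string returned by the wrapped function into `model`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            raw = func(*args, **kwargs)
            data = json.loads(raw)  # raises ValueError on malformed JSON
            expected = {f.name for f in fields(model)}
            missing = expected - set(data)
            if missing:
                raise ValueError(f"Missing fields in LLM output: {sorted(missing)}")
            return model(**{name: data[name] for name in expected})
        return wrapper
    return decorator


@structured_output(Sentiment)
def classify(text: str) -> str:
    # Hypothetical stand-in: a real call would ask the LLM to answer in JSON
    return '{"label": "positive", "score": 0.93}'


result = classify("I love this library")
print(result.label, result.score)
```

A library built on Pydantic would additionally validate and coerce field types, which this sketch leaves out.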

These Python decorators are essential for optimizing LLM-based applications, enhancing their efficiency and reliability.