AI Agentic: Efficiently Reducing Token Costs

⚡

Key Takeaways

1The costs of AI in production are rising rapidly, with prompts reaching up to 24,000 tokens.

2Without optimization, using Gemini 3.1 Pro can cost nearly $996 per month for 100 daily messages.

3Techniques like prompt caching and semantic caching help reduce costs by avoiding the reprocessing of the same data.

💡Why it matters — Optimizing token usage is crucial to making AI more accessible and economically viable for businesses.

Introduction to AI Costs

The use of artificial intelligence in production comes with significant costs. Providers are constantly seeking to reduce these expenses. This article examines design strategies to optimize AI agents and achieve savings.

Rising Costs of Agents

Initial agents may start with a system prompt of 500 tokens and two tools, but these numbers increase rapidly. For example, Claude's system prompt reaches around 24,000 tokens, while GPT-5's is about 15,000 tokens. Users of OpenClaw have reported sending over 150,000 tokens of input to Gemini 3.1 Pro for just 29 tokens of output during the first round.

Without optimization, sending 100 messages per day with 166,000 tokens of input costs approximately $996 per month on Gemini 3.1 Pro and about $2,490 on Claude Opus 4.6.

Four Design Principles for Savings

The article presents four principles for optimizing costs, each accompanied by an interactive calculator:

Token Reuse: Use prompt caching and semantic caching to avoid reprocessing the same data.
Minimization of Added Tokens: Reduce stable additions like memory and tool definitions.
Routing to Suitable Models: Choose between smaller or larger models based on needs.
Maintaining a Clean Context: Improve performance and reduce costs by compacting data.

Token Reuse

The cost of language models arises not only from frequent calls but also from the repetitive processing of the same tokens.

K/V Caching and Prefix Caching

Before a model generates a response, it must process the prompt, a step called prefill. This process consumes resources, leading to latencies and costs. To be efficient, it is crucial not to reprocess the same content.

When using a language model, the prompt is first tokenized, then transformed into vectors, which are projected into K/V tensors in each attention layer. Instead of discarding this cache at the end of the response, it can be stored for future use. When a new request arrives, it checks if part of the prompt matches something already in memory, thus avoiding reprocessing.

Prefix Caching for Self-Hosted Inference

For those hosting an open-source model, a service framework like vLLM is recommended. This framework divides the prompt into blocks, hashes each block based on its tokens, and stores the associated K/V tensors. To enable caching in vLLM, use the --enable-prefix-caching flag. Other options allow for adjusting the block size and K/V cache per GPU.

Prompt Caching via API Providers

When using API providers, structuring prompts to hit the cache is crucial. For example, for OpenAI, an exact prefix match is required for caching to work. This involves placing stable instructions, examples, and tools first, followed by variable content.

Semantic Caching

Semantic caching involves associating similar queries to return the cached result. This works well if many people ask nearly identical questions and the data does not become outdated quickly. However, it is important to manage similarity thresholds, the validity duration of responses, and multi-turn questions.

Conclusion

Prompt caching is an effective strategy for saving tokens, especially with long and stable system prompts. Semantic caching, while presenting challenges, can also offer significant savings in certain cases.