08 - Caching & Cost Optimization
What Is Prompt Caching?#
When you send a request containing a large amount of static context (such as a system prompt or knowledge base), the model caches that content. Subsequent requests that reuse the same context can read from the cache instead of reprocessing it from scratch.

Cache hit rate = cached tokens read / total input tokens × 100%

- Lower latency: skips processing of cached portions, resulting in faster time-to-first-token
- Lower cost: cached tokens are billed at a reduced rate (typically 10%-25% of the standard price)
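The hit-rate formula above can be sketched as a small helper. Note that the parameter names here are illustrative, not actual fields from any provider's usage response:

```python
def cache_hit_rate(cached_tokens_read: int, total_input_tokens: int) -> float:
    """Cache hit rate = cached tokens read / total input tokens x 100%."""
    if total_input_tokens <= 0:
        return 0.0
    return cached_tokens_read / total_input_tokens * 100

# Example: 8,000 of 10,000 input tokens served from cache
print(cache_hit_rate(8000, 10000))  # -> 80.0
```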
Which Models Support Prompt Caching?#
Support for Prompt Caching varies across models. We recommend verifying through the console, actual response data, and upstream model documentation.

| Model Family | Recommendation |
|---|---|
| Claude | Most mature support; ideal for long system prompts, knowledge-base Q&A, and similar scenarios |
| GPT / Gemini | Support depends on the specific model and upstream capabilities; we recommend validating with low traffic first |
How to Improve Cache Hit Rates#
1. Keep Prefixes Fixed; Place Dynamic Content at the End#
```json
{
  "messages": [
    {"role": "system", "content": "[Your fixed system prompt, ~2000 words...]"},
    {"role": "user", "content": "[User's dynamic question]"}
  ]
}
```
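This structure can be enforced with a small builder so the cacheable prefix never drifts between requests (a minimal sketch; the function name and constant are illustrative):

```python
FIXED_SYSTEM_PROMPT = "[Your fixed system prompt, ~2000 words...]"

def build_messages(user_question: str) -> list[dict]:
    """Keep the static system prompt first (cacheable prefix);
    append the dynamic user question at the end."""
    return [
        {"role": "system", "content": FIXED_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

msgs = build_messages("What is prompt caching?")
print(msgs[0]["role"], msgs[1]["role"])  # -> system user
```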
The system prompt stays constant (cache hit); the user message varies each time (billed normally).

2. Set Cache Breakpoints Strategically#
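One concrete form of an explicit breakpoint is Anthropic's cache_control field on a content block; other providers generally cache by prefix automatically, so treat this request shape as a Claude-specific sketch and verify it against upstream documentation:

```python
# Sketch of an Anthropic-style request body with an explicit cache
# breakpoint after the large static block. Everything up to the block
# marked with cache_control is cached; the messages below it vary freely.
request_body = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "[Large static knowledge base...]",
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache up to here
        }
    ],
    "messages": [
        {"role": "user", "content": "[User's dynamic question]"}
    ],
}
```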
Place large blocks of static content in the first few messages, with dynamic content at the end. The model matches the cache from the beginning and stops at the first point of difference.

3. Manage Request Intervals#
Caches have a TTL (typically 5-10 minutes). If too much time elapses between requests, the cache may expire. High-frequency workloads naturally achieve higher hit rates.

4. Standardize Templates; Avoid Minor Variations#
These two requests will not share a cache:

- "You are a professional assistant."
- "You are a professional assistant " (trailing space)
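A character-level comparison makes the mismatch visible, and a simple normalization step avoids it (a sketch; normalize_prompt is an illustrative helper, and the second string here is the trailing-space variant):

```python
def normalize_prompt(text: str) -> str:
    """Strip stray leading/trailing whitespace so near-identical
    prompts collapse to one canonical, cacheable form."""
    return text.strip()

a = "You are a professional assistant."
b = "You are a professional assistant. "  # same text with a trailing space

print(a == b)                                    # -> False: different prefixes, cache miss
print(normalize_prompt(a) == normalize_prompt(b))  # -> True: one shared prefix
```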
Even near-identical content will cause a cache miss if there is any character-level difference. Use a standardized prompt template.

5. Reuse the Same Model and API Key#
Caches are not shared between different models. Same model + same prefix = highest hit rate.

Additional Cost Optimization Tips#
Choose the Right Model#
Not every task requires the most powerful model:

| Task Type | Recommended Models | Cost Tier |
|---|---|---|
| Simple Q&A, classification | claude-3-5-haiku-20241022 / gpt-5-nano / gemini-2.5-flash-lite | Low |
| General conversation, summarization | claude-sonnet-4-20250514 / gpt-5-mini / gemini-2.5-flash | Low-Medium |
| Code generation, analysis | claude-sonnet-4-20250514 / claude-3-7-sonnet-20250219 / gpt-5.2 | Medium |
| Complex reasoning, creative work | claude-opus-4-1-20250805 / claude-opus-4-20250514 / gpt-5.4 / gemini-3.1-pro | High |
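The table above can be turned into a simple routing map that always picks the cheapest adequate model. The task labels and function name are illustrative; the model IDs are taken from the table:

```python
# Cheapest adequate model per task type, following the table above.
# Unknown task types fall back to a mid-tier default.
MODEL_BY_TASK = {
    "classification": "claude-3-5-haiku-20241022",  # Low
    "summarization": "claude-sonnet-4-20250514",    # Low-Medium
    "code": "claude-sonnet-4-20250514",             # Medium
    "reasoning": "claude-opus-4-1-20250805",        # High
}

def pick_model(task_type: str) -> str:
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-20250514")

print(pick_model("classification"))  # -> claude-3-5-haiku-20241022
```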
Control max_tokens#
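For instance, a yes/no judgment needs only a few output tokens, so the request can cap them explicitly (a sketch using OpenAI-compatible Chat Completions field names; the model and prompts are placeholders):

```python
# Request body for a yes/no judgment: max_tokens caps the output so the
# model cannot generate past the short answer we actually need.
body = {
    "model": "gpt-5-mini",
    "max_tokens": 10,
    "messages": [
        {"role": "system", "content": "Answer strictly with yes or no."},
        {"role": "user", "content": "Is 17 a prime number?"},
    ],
}
```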
Set a reasonable max_tokens value to prevent the model from generating unnecessarily long output. For example, if you only need a "yes/no" judgment, max_tokens: 10 is sufficient.

Keep System Prompts Concise#
An overly long system prompt increases the input token cost for every request. Keep prompts lean and remove unnecessary descriptions.

Modified at 2026-04-04 16:03:00