08 - Caching & Cost Optimization
What Is Prompt Caching?#
When you send a request containing a large amount of static context (such as a system prompt or knowledge base), the model caches that content. Subsequent requests that reuse the same context can read from the cache instead of reprocessing it from scratch.

Cache hit rate = cached tokens read / total input tokens × 100%

- Lower latency: skips processing of cached portions, resulting in faster time-to-first-token
- Lower cost: cached tokens are billed at a reduced rate (typically 10%-25% of the standard price)
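The hit-rate formula above can be sketched as a small helper. Note that the parameter names here are illustrative, not actual fields from any provider's usage response:

```python
def cache_hit_rate(cached_tokens_read: int, total_input_tokens: int) -> float:
    """Cache hit rate = cached tokens read / total input tokens x 100%."""
    if total_input_tokens <= 0:
        return 0.0
    return cached_tokens_read / total_input_tokens * 100

# Example: 8,000 of 10,000 input tokens served from cache
print(cache_hit_rate(8000, 10000))  # -> 80.0
```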
Which Models Support Prompt Caching?#
Support for Prompt Caching varies across models. We recommend verifying through the console, actual response data, and upstream model documentation.

| Model Family | Recommendation |
|---|---|
| Claude | Most mature support; ideal for long system prompts, knowledge-base Q&A, and similar scenarios |
| GPT / Gemini | Support depends on the specific model and upstream capabilities; we recommend validating with low traffic first |
How to Improve Cache Hit Rates#
1. Keep Prefixes Fixed; Place Dynamic Content at the End#
```json
{
  "messages": [
    {"role": "system", "content": "[Your fixed system prompt, ~2000 words...]"},
    {"role": "user", "content": "[User's dynamic question]"}
  ]
}
```
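This structure can be enforced with a small builder so the cacheable prefix never drifts between requests (a minimal sketch; the function name and constant are illustrative):

```python
FIXED_SYSTEM_PROMPT = "[Your fixed system prompt, ~2000 words...]"

def build_messages(user_question: str) -> list[dict]:
    """Keep the static system prompt first (cacheable prefix);
    append the dynamic user question at the end."""
    return [
        {"role": "system", "content": FIXED_SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]

msgs = build_messages("What is prompt caching?")
print(msgs[0]["role"], msgs[1]["role"])  # -> system user
```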
The system prompt stays constant (cache hit); the user message varies each time (billed normally).

2. Set Cache Breakpoints Strategically#
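One concrete form of an explicit breakpoint is Anthropic's cache_control field on a content block; other providers generally cache by prefix automatically, so treat this request shape as a Claude-specific sketch and verify it against upstream documentation:

```python
# Sketch of an Anthropic-style request body with an explicit cache
# breakpoint after the large static block. Everything up to the block
# marked with cache_control is cached; the messages below it vary freely.
request_body = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "[Large static knowledge base...]",
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache up to here
        }
    ],
    "messages": [
        {"role": "user", "content": "[User's dynamic question]"}
    ],
}
```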
Place large blocks of static content in the first few messages, with dynamic content at the end. The model matches the cache from the beginning and stops at the first point of difference.

3. Manage Request Intervals#
Caches have a TTL (typically 5-10 minutes). If too much time elapses between requests, the cache may expire. High-frequency workloads naturally achieve higher hit rates.

4. Standardize Templates; Avoid Minor Variations#
These two requests will not share a cache:

- "You are a professional assistant."
- "You are a professional assistant " (trailing space)
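A character-level comparison makes the mismatch visible, and a simple normalization step avoids it (a sketch; normalize_prompt is an illustrative helper, and the second string here is the trailing-space variant):

```python
def normalize_prompt(text: str) -> str:
    """Strip stray leading/trailing whitespace so near-identical
    prompts collapse to one canonical, cacheable form."""
    return text.strip()

a = "You are a professional assistant."
b = "You are a professional assistant. "  # same text with a trailing space

print(a == b)                                    # -> False: different prefixes, cache miss
print(normalize_prompt(a) == normalize_prompt(b))  # -> True: one shared prefix
```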
Even near-identical content will cause a cache miss if there is any character-level difference. Use a standardized prompt template.

5. Reuse the Same Model and API Key#
Caches are not shared between different models. Same model + same prefix = highest hit rate.

Additional Cost Optimization Tips#
Choose the Right Model#
Not every task requires the most powerful model:

| Task Type | Recommended Models | Cost Tier |
|---|---|---|
| Simple Q&A, classification | claude-3-5-haiku-20241022 / gpt-5-nano / gemini-2.5-flash-lite | Low |
| General conversation, summarization | claude-sonnet-4-20250514 / gpt-5-mini / gemini-2.5-flash | Low-Medium |
| Code generation, analysis | claude-sonnet-4-20250514 / claude-3-7-sonnet-20250219 / gpt-5.2 | Medium |
| Complex reasoning, creative work | claude-opus-4-1-20250805 / claude-opus-4-20250514 / gpt-5.4 / gemini-3.1-pro | High |
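The table above can be turned into a simple routing map that always picks the cheapest adequate model. The task labels and function name are illustrative; the model IDs are taken from the table:

```python
# Cheapest adequate model per task type, following the table above.
# Unknown task types fall back to a mid-tier default.
MODEL_BY_TASK = {
    "classification": "claude-3-5-haiku-20241022",  # Low
    "summarization": "claude-sonnet-4-20250514",    # Low-Medium
    "code": "claude-sonnet-4-20250514",             # Medium
    "reasoning": "claude-opus-4-1-20250805",        # High
}

def pick_model(task_type: str) -> str:
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-20250514")

print(pick_model("classification"))  # -> claude-3-5-haiku-20241022
```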
Control max_tokens#
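For instance, a yes/no judgment needs only a few output tokens, so the request can cap them explicitly (a sketch using OpenAI-compatible Chat Completions field names; the model and prompts are placeholders):

```python
# Request body for a yes/no judgment: max_tokens caps the output so the
# model cannot generate past the short answer we actually need.
body = {
    "model": "gpt-5-mini",
    "max_tokens": 10,
    "messages": [
        {"role": "system", "content": "Answer strictly with yes or no."},
        {"role": "user", "content": "Is 17 a prime number?"},
    ],
}
```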
Set a reasonable max_tokens value to prevent the model from generating unnecessarily long output. For example, if you only need a "yes/no" judgment, max_tokens: 10 is sufficient.

Keep System Prompts Concise#
An overly long system prompt increases the input token cost for every request. Keep prompts lean and remove unnecessary descriptions.

Modified at 2026-04-04 16:03:00