Prompt Caching with Deep Agents
Prompt caching reduces latency and cost in AI agents by storing and reusing processed prompts. This technique enables faster multi-step reasoning, deeper context retention, and more efficient agent workflows.
Introduction
Large language models (LLMs) have become indispensable tools for developers and enterprises, but their cost and latency remain significant barriers to widespread deployment. One of the most promising solutions to these challenges is **prompt caching**—a technique that stores and reuses processed prompt segments to avoid redundant computation. When combined with deep agent architectures, where multiple reasoning steps or tool calls are executed sequentially, prompt caching can dramatically reduce both response times and API costs.
This article provides a practical, technical guide to implementing prompt caching with deep agents. We'll walk through installation, configuration, and real-world usage examples, drawing on insights from industry leaders like LangChain, OpenAI, Microsoft, and Anthropic.
What is Prompt Caching?
Prompt caching works by storing the key-value (KV) cache from previously processed prompts. When a new prompt shares a prefix with a cached one, the system reuses the cached computation instead of reprocessing the entire input. This is especially valuable for:
- Long system prompts that remain constant across multiple requests
- Few-shot examples that are reused for similar tasks
- Deep agent loops where the same context is passed to multiple tool calls
Deep agents—autonomous systems that reason, plan, and execute multiple steps—benefit disproportionately from caching because they repeatedly process the same base context (system instructions, conversation history, tool definitions) while generating different action sequences.
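The prefix reuse described above can be illustrated with a toy model. The sketch below is not how any provider implements caching: `ToyPrefixCache` and `shared_prefix_len` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the "cache" simply remembers earlier prompts. It only aims to show why an agent loop that keeps appending to a fixed base context can reuse most of its tokens on each step.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Return how many leading tokens two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Hypothetical stand-in for a provider-side cache; not a real API."""

    def __init__(self) -> None:
        self.seen: list[list[str]] = []

    def process(self, prompt: list[str]) -> tuple[int, int]:
        """Return (reused_tokens, newly_processed_tokens) for one request."""
        reused = max((shared_prefix_len(prompt, p) for p in self.seen), default=0)
        self.seen.append(prompt)
        return reused, len(prompt) - reused


# Whitespace splitting stands in for a real tokenizer.
base = "SYSTEM you are a research assistant TOOLS search extract".split()
cache = ToyPrefixCache()
history: list[str] = []
stats = []
for step in ["plan", "search", "summarize"]:
    prompt = base + history + ["USER", step]
    stats.append(cache.process(prompt))
    history += ["STEP", step]  # each agent step appends to the context

for i, (reused, fresh) in enumerate(stats, start=1):
    print(f"step {i}: reused={reused} new={fresh}")
```

Under these assumptions the first step processes all 11 tokens, while the second and third steps reuse 9 and 11 tokens respectively and process only 4 new ones each. The reuse stops at the first token that differs, which is why anything that changes between requests belongs at the end of the prompt.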
Requirements
Before implementing prompt caching with deep agents, ensure you have the following:
- **Python 3.10 or later** (3.11+ recommended for performance)
- **LangChain** (version 0.3 or later) for agent orchestration
- **An LLM provider that supports prompt caching** (e.g., OpenAI, Anthropic, or Microsoft Azure OpenAI)
- **A deep agent framework** (LangGraph or LangChain's AgentExecutor)
- **At least 8 GB RAM** (16 GB+ recommended for larger models)
Step-by-Step Installation
1. Set Up a Virtual Environment
Isolate your dependencies to avoid conflicts:
python -m venv prompt-cache-env
source prompt-cache-env/bin/activate # On Windows: prompt-cache-env\Scripts\activate
2. Install Core Dependencies
Install LangChain and the agent framework:
pip install langchain langchain-openai langchain-anthropic langgraph
Microsoft Azure OpenAI users need no extra package: the `AzureChatOpenAI` class used later in this article ships with `langchain-openai`.
3. Configure API Keys
Set your API keys as environment variables. This keeps credentials secure and out of your code:
export OPENAI_API_KEY="your-openai-api-key-here"
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
export AZURE_OPENAI_API_KEY="your-azure-api-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"
On Windows (Command Prompt), use `set` instead of `export`, and omit the quotes, which `cmd` would otherwise store as part of the value:
set OPENAI_API_KEY=your-openai-api-key-here
4. Install Optional Monitoring Tools
To observe cache behavior, install LangSmith for tracing:
pip install langsmith
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="your-langsmith-api-key"
Configuration for Prompt Caching
Prompt caching is typically enabled at the model provider level. Here's how to configure it for major providers.
OpenAI
OpenAI automatically caches prompts that are 1,024 tokens or longer. You can optimize by structuring your system prompt as a reusable prefix:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    # Caching is automatic; no special flag needed
)
Anthropic
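Anthropic does not cache automatically; you opt in by adding a `cache_control` field to the content block that ends each cacheable section:
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0,
)
# Mark cacheable sections with cache_control on a content block.
# Illustrative only: a real prompt must exceed the model's minimum cacheable length.
system_prompt = [
    {
        "type": "text",
        "text": "You are a helpful assistant with deep reasoning capabilities.",
        "cache_control": {"type": "ephemeral"},
    }
]
# Pass the blocks as the content of the system message
system_message = SystemMessage(content=system_prompt)
Microsoft Azure OpenAI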
Azure OpenAI applies prompt caching automatically on supported model deployments such as GPT-4o-mini, so there is nothing to switch on beyond the deployment itself:
from langchain_openai import AzureChatOpenAI
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",  # the name of your deployment
    api_version="2024-08-01-preview",
    temperature=0,
)
# Caching is automatic for prompts >= 1,024 tokens
Building a Deep Agent with Prompt Caching
Now let's create a deep agent that benefits from caching. We'll build a research assistant that performs multi-step analysis.
Step 1: Define the System Prompt
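Create a stable, reusable system prompt to serve as the cached prefix. The example below is shortened for readability; a production prompt must exceed the provider's minimum cacheable length (1,024 tokens for OpenAI and Azure OpenAI) before caching takes effect: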
system_prompt = """
You are a research assistant specialized in analyzing technical documents.
Your capabilities:
- Summarize long texts
- Extract key findings
- Compare and contrast sources
- Identify gaps in arguments
- Suggest further reading
Always respond in a structured format:
1. Main findings
2. Evidence
3. Conclusions
Be thorough and cite specific examples from the text.
"""
Step 2: Create the Agent with Caching
Use Anthropic's cache control for explicit caching:
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import tool
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
@tool
def search_database(query: str) -> str:
    """Search the internal database for relevant documents."""
    # Simulated search
    return f"Results for '{query}': Found 3 relevant documents."
@tool
def extract_insights(text: str) -> str:
    """Extract key insights from a given text."""
    return f"Key insights: {text[:100]}..."
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0,
)
tools = [search_database, extract_insights]
# Pass the system prompt as a content block so it can carry cache_control
cached_system = SystemMessage(content=[
    {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
])
prompt_template = ChatPromptTemplate.from_messages([
    cached_system,
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    # Tool-calling agents need a slot for intermediate tool calls and results
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])
# create_tool_calling_agent works with any tool-calling model, including Claude
# (create_openai_functions_agent is specific to OpenAI function calling).
# AgentExecutor runs the tool loop and returns a dict with an "output" key.
agent = AgentExecutor(
    agent=create_tool_calling_agent(llm, tools, prompt_template),
    tools=tools,
)
Step 3: Add a Graph for Deep Reasoning
Use LangGraph to create a multi-step agent that reuses the cached system prompt:
from typing import Annotated, TypedDict
from langchain_core.messages import AIMessage
from langgraph.graph import END, StateGraph
from langgraph.graph.message import add_messages
class AgentState(TypedDict):
    # add_messages appends new messages and converts role/content dicts to message objects
    messages: Annotated[list, add_messages]
    step_count: int
def call_model(state: AgentState):
    # Every pass sends the same system prompt prefix, so later passes can read it from the cache
    result = agent.invoke({
        "input": state["messages"][-1].content,
        "chat_history": state["messages"][:-1],
    })
    return {
        "messages": [AIMessage(content=result["output"])],
        "step_count": state["step_count"] + 1,
    }
def should_continue(state: AgentState):
    # Cap the loop at 5 passes
    if state["step_count"] >= 5:
        return "end"
    return "continue"
# Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_conditional_edges(
    "agent",
    should_continue,
    {"continue": "agent", "end": END},
)
graph.set_entry_point("agent")
app = graph.compile()
Usage Examples
Example 1: Basic Cached Agent
Run a query that benefits from the cached system prompt:
from langchain_core.messages import HumanMessage
# First request: the first pass writes the system prompt to the cache
result = app.invoke({
    "messages": [HumanMessage(content="Summarize the key findings from our Q3 report.")],
    "step_count": 0,
})
print("First response:", result["messages"][-1].content)
# Second request: reuses the cached prefix while the cache entry is still live
result2 = app.invoke({
    "messages": [HumanMessage(content="Compare those findings with Q2.")],
    "step_count": 0,
})
print("Second response:", result2["messages"][-1].content)
Example 2: Monitoring Cache Performance
Use LangSmith to observe cache hits and latency improvements:
import time
from langchain_core.messages import HumanMessage
# Traces go to LangSmith when the LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY
# variables set during installation are present; no client object is needed here.
# Run multiple queries and time each one
for i in range(3):
    start = time.perf_counter()
    result = app.invoke({
        "messages": [HumanMessage(content=f"Analyze document set {i+1}.")],
        "step_count": 0,
    })
    elapsed = time.perf_counter() - start
    print(f"Query {i+1}: {elapsed:.2f}s")
Example 3: Optimizing for Maximum Cache Hits
Structure your prompts to maximize cache reuse:
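from langchain_core.messages import HumanMessage, SystemMessage
# Always use the same system prompt prefix
# (shortened here; a real prefix must exceed the minimum cacheable length)
cached_prefix = """
You are a research assistant.
Your roles:
- Summarize
- Extract
- Compare
"""
prefix_message = SystemMessage(content=[
    {"type": "text", "text": cached_prefix, "cache_control": {"type": "ephemeral"}}
])
# Vary only the user message
queries = [
    "Summarize the Q1 report.",
    "Extract key metrics from Q1.",
    "Compare Q1 and Q2 trends.",
]
for query in queries:
    # llm is the ChatAnthropic model created in Step 2
    response = llm.invoke([prefix_message, HumanMessage(content=query)])
    print(f"Query: {query}")
    print(f"Response: {response.content[:50]}...")
    print("---")
Best Practices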
1. **Design long, stable system prompts**: Caching works best when the prompt prefix remains unchanged across requests. Keep system instructions and tool definitions constant.
2. **Use models that support caching**: OpenAI's GPT-4o-mini and Anthropic's Claude 3.5 Sonnet, the models used in this article, both support prompt caching. Check provider documentation for the latest supported models.
3. **Monitor cache hit rates**: Use LangSmith or provider dashboards to track how often your prompts hit the cache. Aim for >50% cache hit rate for significant cost savings.
4. **Consider batch processing**: When processing similar queries, structure them to share common prefixes. This maximizes cache reuse across requests.
5. **Test with realistic workloads**: Cache performance depends on prompt length and repetition. Benchmark with your actual use case before deploying to production.
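The hit-rate target in practice 3 can be tracked with a few lines of bookkeeping. The following is a minimal sketch under stated assumptions: `UsageRecord` and its field names are hypothetical (each provider reports cached tokens under its own key), and the token counts are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Per-request token counts; the field names here are illustrative."""
    input_tokens: int   # total prompt tokens sent
    cached_tokens: int  # prompt tokens served from the cache


def cache_hit_rate(records: list[UsageRecord]) -> float:
    """Fraction of all prompt tokens that were read from the cache."""
    total = sum(r.input_tokens for r in records)
    if total == 0:
        return 0.0
    return sum(r.cached_tokens for r in records) / total


# Made-up numbers: a 1,500-token shared prefix plus a short, varying question.
records = [
    UsageRecord(input_tokens=1600, cached_tokens=0),  # first call fills the cache
    UsageRecord(input_tokens=1620, cached_tokens=1500),
    UsageRecord(input_tokens=1580, cached_tokens=1500),
]
rate = cache_hit_rate(records)
print(f"token-level cache hit rate: {rate:.1%}")
if rate < 0.5:
    print("below the 50% target; check that the prompt prefix is stable")
```

Measuring the rate per token rather than per request weights long prompts more heavily, which tracks cost more closely; a per-request rate would be an equally reasonable choice for latency monitoring.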
Conclusion
Prompt caching is a powerful technique for reducing the cost and latency of deep agent systems. By storing and reusing processed prompt segments, you can achieve response times that are 2-5x faster while cutting API costs by 30-60%, although the actual gain depends on how much of each request is a shared prefix and on your provider's pricing. The key is to design your agent prompts with caching in mind: use long, stable system prompts, leverage provider-specific caching features, and monitor performance with tools like LangSmith.
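To see how a cost reduction of this kind arises, the arithmetic can be sketched as follows. Everything numeric here is an assumption for illustration: the price, the `cached_price_factor` discount, and the token counts are hypothetical, and the model ignores output tokens, any surcharge for writing to the cache, and cache expiry.

```python
def input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    requests: int,
    price_per_mtok: float,
    cached_price_factor: float,
) -> tuple[float, float]:
    """Return (cost_without_cache, cost_with_cache) for the input tokens only."""
    per_token = price_per_mtok / 1_000_000
    uncached = requests * (prefix_tokens + suffix_tokens) * per_token
    # The first request pays full price for the prefix;
    # every later request pays the discounted rate for it.
    first = (prefix_tokens + suffix_tokens) * per_token
    later = (prefix_tokens * cached_price_factor + suffix_tokens) * per_token
    cached = first + (requests - 1) * later
    return uncached, cached


# Hypothetical workload: 4,000-token shared prefix, 1,000 varying tokens, 10 requests,
# a price of 1.0 per million input tokens, and cached tokens billed at half price.
without, with_cache = input_cost(
    prefix_tokens=4000,
    suffix_tokens=1000,
    requests=10,
    price_per_mtok=1.0,
    cached_price_factor=0.5,
)
saving = 1 - with_cache / without
print(f"without cache: {without:.4f}, with cache: {with_cache:.4f}, saving: {saving:.0%}")
```

With these assumed inputs the saving on input tokens comes to 36%. A longer shared prefix, more requests per cache lifetime, or a deeper discount on cached tokens would push the figure higher.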
As LLM providers continue to improve their caching infrastructure (OpenAI's automatic caching, Anthropic's explicit cache control, and Microsoft Azure's integration), deep agents will become even more practical and economical. Start implementing prompt caching today to unlock faster, cheaper, and more scalable AI applications.
*For the latest updates on prompt caching techniques, refer to the official blogs of LangChain, OpenAI, Microsoft, and Anthropic.*