Prompt Caching with Deep Agents

Prompt caching reduces latency and cost in AI agents by storing and reusing processed prompts. This technique enables faster multi-step reasoning, deeper context retention, and more efficient agent workflows.

By Nexus AI Editorial Team | Published: June 27, 2026 | 7 min read | Last updated: June 27, 2026

Introduction

Large language models (LLMs) have become indispensable tools for developers and enterprises, but their cost and latency remain significant barriers to widespread deployment. One of the most promising solutions to these challenges is **prompt caching**—a technique that stores and reuses processed prompt segments to avoid redundant computation. When combined with deep agent architectures, where multiple reasoning steps or tool calls are executed sequentially, prompt caching can dramatically reduce both response times and API costs.

This article provides a practical, technical guide to implementing prompt caching with deep agents. We'll walk through installation, configuration, and real-world usage examples, drawing on insights from industry leaders like LangChain, OpenAI, Microsoft, and Anthropic.

What is Prompt Caching?

Prompt caching works by storing the key-value (KV) cache from previously processed prompts. When a new prompt shares a prefix with a cached one, the system reuses the cached computation instead of reprocessing the entire input. This is especially valuable for:

Long system prompts that remain constant across multiple requests
Few-shot examples that are reused for similar tasks
Deep agent loops where the same context is passed to multiple tool calls

Deep agents—autonomous systems that reason, plan, and execute multiple steps—benefit disproportionately from caching because they repeatedly process the same base context (system instructions, conversation history, tool definitions) while generating different action sequences.
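The prefix-matching idea can be illustrated with a small sketch. It is not how any provider implements caching internally; whitespace splitting stands in for a real tokenizer, and `shared_prefix_len` is a hypothetical helper introduced only for this example.

```python
from itertools import takewhile


def shared_prefix_len(cached: list[str], incoming: list[str]) -> int:
    """Count the leading tokens that two prompts have in common."""
    matching = takewhile(lambda pair: pair[0] == pair[1], zip(cached, incoming))
    return sum(1 for _ in matching)


# Whitespace splitting is an assumption made for illustration only.
system = "You are a research assistant . Answer in three sections .".split()
first = system + "Summarize the Q3 report".split()
second = system + "Compare Q3 with Q2".split()

reused = shared_prefix_len(first, second)
print(f"{reused} of {len(second)} tokens share a prefix with the earlier prompt")
```

Only the leading run of identical tokens counts: a change near the start of the prompt invalidates everything after it, which is why stable content belongs at the front.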

Requirements

Before implementing prompt caching with deep agents, ensure you have the following:

**Python 3.10 or later** (3.11+ recommended for performance)
**LangChain** (version 0.3 or later) for agent orchestration
**An LLM provider that supports prompt caching** (e.g., OpenAI, Anthropic, or Microsoft Azure OpenAI)
**A deep agent framework** (LangGraph or LangChain's AgentExecutor)
**No special hardware** (the models run on the provider's servers, so any machine that runs Python comfortably is sufficient)

Step-by-Step Installation

1. Set Up a Virtual Environment

Isolate your dependencies to avoid conflicts:

python -m venv prompt-cache-env
source prompt-cache-env/bin/activate  # On Windows: prompt-cache-env\Scripts\activate

2. Install Core Dependencies

Install LangChain and the agent framework:

pip install langchain langchain-openai langchain-anthropic langgraph

Microsoft Azure OpenAI users do not need a separate package: the `AzureChatOpenAI` class used later in this guide ships with `langchain-openai`, which the command above already installs.

3. Configure API Keys

Set your API keys as environment variables. This keeps credentials secure and out of your code:

export OPENAI_API_KEY="your-openai-api-key-here"
export ANTHROPIC_API_KEY="your-anthropic-api-key-here"
export AZURE_OPENAI_API_KEY="your-azure-api-key-here"
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/"

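On Windows Command Prompt, use `set` instead of `export` and omit the quotes, which would otherwise become part of the value:

set OPENAI_API_KEY=your-openai-api-key-here

In PowerShell, use `$env:OPENAI_API_KEY = "your-openai-api-key-here"` instead.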
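Before running any agent code, it can help to confirm that the variables are actually visible to Python. The following is a minimal sketch; `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the list should be trimmed to the providers you use.

```python
import os

# Assumed list for illustration; keep only the providers you actually call.
REQUIRED_KEYS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY"]


def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required keys are set.")
```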

4. Install Optional Monitoring Tools

To observe cache behavior, install LangSmith for tracing:

pip install langsmith
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="your-langsmith-api-key"

Configuration for Prompt Caching

Prompt caching is typically enabled at the model provider level. Here's how to configure it for major providers.

OpenAI

OpenAI automatically caches prompts that are 1,024 tokens or longer. You can optimize by structuring your system prompt as a reusable prefix:

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    # Caching is automatic; no special flag needed
)
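To get a feel for how much of a prompt is eligible, the sketch below estimates the cacheable portion of a shared prefix. It assumes the behavior described in OpenAI's prompt caching guide, where cache hits start at 1,024 tokens and grow in 128-token increments; treat both numbers as assumptions to verify against current documentation. `openai_cacheable_tokens` is a hypothetical helper, not part of any SDK.

```python
def openai_cacheable_tokens(shared_prefix_tokens: int,
                            min_tokens: int = 1024,
                            increment: int = 128) -> int:
    """Estimate how many tokens of a shared prefix can be served from cache."""
    if shared_prefix_tokens < min_tokens:
        return 0  # below the threshold nothing is cached
    extra_increments = (shared_prefix_tokens - min_tokens) // increment
    return min_tokens + extra_increments * increment


for prefix in (1000, 1024, 1200, 1300):
    print(prefix, "->", openai_cacheable_tokens(prefix))
```

Under these assumptions, a system prompt just under the threshold gains nothing, so padding a 1,000-token prefix with useful guidance can be worthwhile.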

Anthropic

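Anthropic requires an explicit opt-in rather than a header. Add a `cache_control` field to the content block that ends the section you want cached; everything up to and including that block becomes the cached prefix. Prompts shorter than the model's minimum cacheable length (1,024 tokens for Claude 3.5 Sonnet) are processed normally, without caching: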

from langchain_anthropic import ChatAnthropic

llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0,
)

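# Mark the end of the cacheable section with cache_control and wrap the
# blocks in a SystemMessage so they can be passed to the model.
from langchain_core.messages import SystemMessage

system_message = SystemMessage(content=[
    {
        "type": "text",
        "text": "You are a helpful assistant with deep reasoning capabilities.",
        "cache_control": {"type": "ephemeral"},
    }
])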
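When a system prompt is made of several blocks, the marker only needs to go on the last stable one. A minimal sketch of that step follows; `with_cache_breakpoint` is a hypothetical helper, and the second block's text is a placeholder.

```python
import copy


def with_cache_breakpoint(blocks: list[dict]) -> list[dict]:
    """Return a copy of `blocks` with an ephemeral cache marker on the last one."""
    if not blocks:
        return []
    marked = copy.deepcopy(blocks)  # leave the caller's list untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


original_blocks = [
    {"type": "text", "text": "You are a helpful assistant with deep reasoning capabilities."},
    {"type": "text", "text": "<long reference material goes here>"},
]
system_blocks = with_cache_breakpoint(original_blocks)
print(system_blocks[-1]["cache_control"])
```

Placing a single marker at the end keeps the number of cache breakpoints low while still covering the whole prefix.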

Microsoft Azure OpenAI

Azure OpenAI applies prompt caching automatically on supported model deployments, including GPT-4o-mini. There is no flag to enable; point the client at your deployment:

from langchain_openai import AzureChatOpenAI

llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",
    api_version="2024-08-01-preview",
    temperature=0,
)
# Caching is automatic for prompts >= 1,024 tokens

Building a Deep Agent with Prompt Caching

Now let's create a deep agent that benefits from caching. We'll build a research assistant that performs multi-step analysis.

Step 1: Define the System Prompt

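Create a reusable system prompt that stays identical across requests. The version below is abbreviated for readability; to be cached it must meet the provider's minimum length (1,024 tokens for the models used here), so a production prompt would typically also carry detailed guidelines, examples, or reference material: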

system_prompt = """
You are a research assistant specialized in analyzing technical documents.
Your capabilities:
- Summarize long texts
- Extract key findings
- Compare and contrast sources
- Identify gaps in arguments
- Suggest further reading

Always respond in a structured format:
1. Main findings
2. Evidence
3. Conclusions

Be thorough and cite specific examples from the text.
"""

Step 2: Create the Agent with Caching

Use Anthropic's cache control for explicit caching:

from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain.tools import tool
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

@tool
def search_database(query: str) -> str:
    """Search the internal database for relevant documents."""
    # Simulated search
    return f"Results for '{query}': Found 3 relevant documents."

@tool
def extract_insights(text: str) -> str:
    """Extract key insights from a given text."""
    return f"Key insights: {text[:100]}..."

llm = ChatAnthropic(
    model="claude-3-5-sonnet-20241022",
    temperature=0,
)

tools = [search_database, extract_insights]

# Pass the system prompt as a content block so it can carry cache_control.
# A SystemMessage is used because the text is static and a ("system", ...)
# tuple has no slot for the cache marker.
prompt_template = ChatPromptTemplate.from_messages([
    SystemMessage(content=[
        {"type": "text", "text": system_prompt, "cache_control": {"type": "ephemeral"}}
    ]),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    # Required by tool-calling agents: holds intermediate tool calls and results
    MessagesPlaceholder(variable_name="agent_scratchpad"),
])

# create_tool_calling_agent works with any tool-calling model, including Claude.
# AgentExecutor runs the tool loop and returns a dict with an "output" key.
agent = AgentExecutor(
    agent=create_tool_calling_agent(llm, tools, prompt_template),
    tools=tools,
)

Step 3: Add a Graph for Deep Reasoning

Use LangGraph to create a multi-step agent that reuses the cached system prompt:

from typing import Annotated, TypedDict

from langchain_core.messages import AIMessage
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages

class AgentState(TypedDict):
    # add_messages appends new messages and converts role/content dicts to message objects
    messages: Annotated[list, add_messages]
    step_count: int

def call_model(state: AgentState):
    # Every pass resends the same system prompt, so once it is cached the
    # provider can skip reprocessing it (provided it meets the minimum length)
    result = agent.invoke({
        "input": state["messages"][-1].content,
        "chat_history": state["messages"][:-1],
    })
    return {
        "messages": [AIMessage(content=result["output"])],
        "step_count": state["step_count"] + 1,
    }

def should_continue(state: AgentState):
    # Stop after 5 passes; each pass feeds the previous answer back in as input
    if state["step_count"] >= 5:
        return "end"
    return "continue"

# Build the graph
graph = StateGraph(AgentState)
graph.add_node("agent", call_model)
graph.add_conditional_edges(
    "agent",
    should_continue,
    {"continue": "agent", "end": END}
)
graph.set_entry_point("agent")

app = graph.compile()
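To see why a multi-step loop benefits, the following sketch works through the arithmetic for a fixed prefix and a history that grows on every pass. The token counts are invented for illustration, and the model assumes only the system prefix is cached and that every pass after the first arrives before the cache expires.

```python
def cached_fraction_per_step(prefix_tokens: int, tokens_added_per_step: int, steps: int):
    """Return (step, prompt_tokens, cached_fraction) for an append-only agent loop."""
    rows = []
    for step in range(1, steps + 1):
        prompt_tokens = prefix_tokens + (step - 1) * tokens_added_per_step
        # The first pass writes the cache; later passes can read the prefix from it.
        cached_tokens = 0 if step == 1 else prefix_tokens
        rows.append((step, prompt_tokens, cached_tokens / prompt_tokens))
    return rows


# Hypothetical sizes: a 2,000-token prefix and 500 new tokens per pass.
for step, prompt_tokens, fraction in cached_fraction_per_step(2000, 500, 5):
    print(f"pass {step}: {prompt_tokens} prompt tokens, {fraction:.0%} served from cache")
```

The cached share shrinks as the history grows, which is one reason to keep the stable prefix large relative to what each pass appends.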

Usage Examples

Example 1: Basic Cached Agent

Run a query that benefits from the cached system prompt:

from langchain_core.messages import HumanMessage

# First request: the first model call writes the system prompt to the cache
result = app.invoke({
    "messages": [HumanMessage(content="Summarize the key findings from our Q3 report.")],
    "step_count": 0
})
print("First response:", result["messages"][-1].content)

# Second request: reuses the cached prefix if it arrives before the cache expires.
# The earlier messages are passed along so "those findings" has a referent.
result2 = app.invoke({
    "messages": result["messages"] + [HumanMessage(content="Compare those findings with Q2.")],
    "step_count": 0
})
print("Second response:", result2["messages"][-1].content)

Example 2: Monitoring Cache Performance

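Time each request locally, and use the LangSmith traces to inspect token usage for the individual model calls:

import time

from langchain_core.messages import HumanMessage
from langsmith import Client

# Tracing is switched on by the LANGCHAIN_TRACING_V2 variable set earlier;
# the client is only needed to query recorded runs afterwards
client = Client()

# Run several queries and time each one; later runs can reuse the cached prefix
for i in range(3):
    start = time.perf_counter()
    result = app.invoke({
        "messages": [HumanMessage(content=f"Analyze document set {i+1}.")],
        "step_count": 0
    })
    elapsed = time.perf_counter() - start
    print(f"Query {i+1}: {elapsed:.2f}s")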
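Wall-clock time is a noisy signal, so it is worth also looking at token counts. The sketch below computes a cache-read ratio from a usage dictionary. The field names (`input_tokens`, `input_token_details`, `cache_read`) follow the shape LangChain exposes as `usage_metadata` on model responses, and the sketch assumes `input_tokens` includes cached tokens; verify both against your installed version. The sample dictionary is hand-written, not real output.

```python
def cache_read_ratio(usage_metadata: dict) -> float:
    """Fraction of input tokens that were read from the provider's cache."""
    input_tokens = usage_metadata.get("input_tokens", 0)
    details = usage_metadata.get("input_token_details") or {}
    cache_read = details.get("cache_read", 0)
    return cache_read / input_tokens if input_tokens else 0.0


# Hand-written sample in the assumed shape.
sample_usage = {
    "input_tokens": 2000,
    "output_tokens": 150,
    "input_token_details": {"cache_read": 1500, "cache_creation": 0},
}
print(f"cache read ratio: {cache_read_ratio(sample_usage):.0%}")
```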

Example 3: Optimizing for Maximum Cache Hits

Structure your prompts to maximize cache reuse:

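# The agent built in Step 2 sends the same system prompt on every call, so all
# requests share one cacheable prefix. The prefix below shows the shape to aim
# for: fixed wording, with no timestamps or per-request values.
cached_prefix = """
You are a research assistant.
Your roles:
- Summarize
- Extract
- Compare

"""

# Vary only the user message
queries = [
    "Summarize the Q1 report.",
    "Extract key metrics from Q1.",
    "Compare Q1 and Q2 trends."
]

for query in queries:
    # The executor returns a dict; the final answer is under "output"
    response = agent.invoke({
        "input": query,
        "chat_history": []
    })
    print(f"Query: {query}")
    print(f"Response: {str(response['output'])[:50]}...")
    print("---")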
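The effect of ordering can be checked without calling a model. This sketch compares two ways of assembling the same prompt sections and measures how many leading characters the resulting prompts share. `build_prompt`, the request-id line, and the sample strings are all hypothetical, and character counts stand in for tokens.

```python
import os


def build_prompt(leading_parts, trailing_parts):
    """Join prompt sections in order, one per line."""
    return "\n".join([*leading_parts, *trailing_parts])


instructions = "You are a research assistant."
tool_notes = "Tools: search_database, extract_insights"
sample_queries = ["Summarize the Q1 report.", "Extract key metrics from Q1."]

# Stable sections first, per-request values last.
stable_first = [
    build_prompt([instructions, tool_notes], [f"Request id: {i}", q])
    for i, q in enumerate(sample_queries)
]
# A per-request value at the top breaks the shared prefix almost immediately.
volatile_first = [
    build_prompt([f"Request id: {i}", instructions, tool_notes], [q])
    for i, q in enumerate(sample_queries)
]

shared_stable = len(os.path.commonprefix(stable_first))
shared_volatile = len(os.path.commonprefix(volatile_first))
print(f"stable-first shares {shared_stable} characters, volatile-first {shared_volatile}")
```

The same reasoning applies to timestamps, user names, or session identifiers: anything that changes per request belongs after the content you want cached.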

Best Practices

1. **Design long, stable system prompts**: Caching works best when the prompt prefix remains unchanged across requests. Keep system instructions and tool definitions constant.

2. **Use appropriate model versions**: Not every model supports prompt caching, and minimum prompt lengths differ between models. The examples in this guide use OpenAI's GPT-4o-mini and Anthropic's Claude 3.5 Sonnet; check provider documentation for the latest supported models.

3. **Monitor cache hit rates**: Use LangSmith or provider dashboards to track how often your prompts hit the cache. As a rough target, aim for a cache hit rate above 50%; below that, the savings may not justify the effort of restructuring prompts.

4. **Consider batch processing**: When processing similar queries, structure them to share common prefixes. This maximizes cache reuse across requests.

5. **Test with realistic workloads**: Cache performance depends on prompt length and repetition. Benchmark with your actual use case before deploying to production.
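A back-of-the-envelope cost model can help decide whether caching is worth it for a given workload. The sketch below compares input cost with and without caching for a series of requests that share one prefix. The default multipliers (1.25x for a cache write, 0.10x for a cache read) mirror Anthropic's published pricing for its default cache lifetime at the time of writing and are assumptions to verify; other providers price differently. `relative_input_cost` is a hypothetical helper.

```python
def relative_input_cost(prefix_tokens: int, suffix_tokens: int, requests: int,
                        write_multiplier: float = 1.25,
                        read_multiplier: float = 0.10) -> float:
    """Input cost with caching divided by input cost without caching."""
    if requests < 1:
        raise ValueError("requests must be at least 1")
    uncached = requests * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens * write_multiplier                      # first call writes the prefix
        + (requests - 1) * prefix_tokens * read_multiplier    # later calls read it
        + requests * suffix_tokens                            # new tokens are always full price
    )
    return cached / uncached


# Hypothetical workload: 2,000-token prefix, 200 new tokens per request.
for n in (1, 2, 10):
    print(f"{n} request(s): {relative_input_cost(2000, 200, n):.2f}x the uncached input cost")
```

Under these assumptions a single request costs more with caching than without, because of the write premium, so caching pays off only when the prefix is reused before it expires.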

Conclusion

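Prompt caching can substantially reduce the cost and latency of deep agent systems. Because the provider reuses the processed prefix instead of recomputing it, each step in an agent loop pays the full input rate only for tokens that are new. How much you save depends on how large the stable prefix is relative to the rest of each request and on how often requests arrive before the cache expires, so measure on your own workload rather than relying on headline figures. The key is to design your agent prompts with caching in mind: use long, stable system prompts, leverage provider-specific caching features, and monitor performance with tools like LangSmith.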

As LLM providers continue to improve their caching infrastructure (OpenAI's automatic caching, Anthropic's explicit cache control, and Microsoft Azure's integration), deep agents will become even more practical and economical. Start implementing prompt caching today to unlock faster, cheaper, and more scalable AI applications.

*For the latest updates on prompt caching techniques, refer to the official blogs of LangChain, OpenAI, Microsoft, and Anthropic.*

Sources

Prompt Caching with Deep Agents (LangChain Blog)
OpenAI News
Microsoft AI Blog
Anthropic News
