We Built a Routing Layer to Cut Our AI Costs. It Broke the Product.
A team built a custom AI routing layer to reduce API costs, but it introduced latency, errors, and unpredictable behavior that degraded the user experience, ultimately breaking the product.
Cost optimization in AI is a double-edged sword. Every engineering team I know has faced the same pressure: reduce API bills without sacrificing quality. We thought we had found the perfect solution—a smart routing layer that dynamically chose the cheapest model for each request. It worked brilliantly in testing. Then it broke the product.
This is the story of what went wrong, the technical details of the routing layer we built, and the hard lessons we learned about the hidden costs of optimization.
The Problem We Tried to Solve
Our AI-powered product relied on multiple large language models (LLMs) from providers like OpenAI and Google. We used GPT-4 for complex reasoning, GPT-3.5 for simple tasks, and Google's PaLM 2 for specific domain queries. Our monthly bill was spiraling toward six figures.
The obvious fix: route each request to the cheapest model capable of handling it. We wanted a system that would analyze each prompt, estimate its complexity, and send it to the most cost-efficient model. If a user asked "What's the weather?" we'd route to a cheap model like GPT-3.5. If they asked "Write a detailed legal analysis," we'd route to GPT-4.
Requirements
Before building, we defined what the routing layer needed to do:
- **Cost optimization**: Reduce average per-request cost by at least 40%.
- **Latency control**: Keep response times under 2 seconds for 95% of requests.
- **Quality preservation**: Maintain user satisfaction scores within 5% of baseline.
- **Fallback logic**: Automatically escalate to more capable models if the cheap one fails.
- **Observability**: Log every routing decision with model, cost, and latency.
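The observability requirement can be made concrete as one structured record per routing decision. The following is a minimal sketch, not the logging code we shipped: the `RoutingDecision` class, its field names, and the `to_log_line` helper are hypothetical, chosen only to cover the model, cost, and latency fields listed above.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class RoutingDecision:
    """One routing decision (hypothetical schema for illustration)."""
    user_id: str
    model: str
    complexity: int
    cost_usd: float
    latency_ms: int
    fell_back: bool
    timestamp: float


def to_log_line(decision: RoutingDecision) -> str:
    """Serialize a decision as a single JSON line for a log file or a Redis list."""
    return json.dumps(asdict(decision), sort_keys=True)


if __name__ == "__main__":
    decision = RoutingDecision(
        user_id="test1",
        model="palm-2",
        complexity=1,
        cost_usd=0.00003,
        latency_ms=340,
        fell_back=False,
        timestamp=time.time(),
    )
    print(to_log_line(decision))
```

A JSON line per decision keeps cost and latency queryable later, which a colon-separated string of model, complexity, and user ID does not.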
Step-by-Step Installation
We built the routing layer as a Python middleware service using FastAPI and Redis. Here's how we set it up.
1. Install Dependencies
First, create a virtual environment and install the required packages.

    python -m venv routing-env
    source routing-env/bin/activate
    # openai is pinned below 1.0 because router.py uses openai.ChatCompletion
    pip install fastapi uvicorn redis "openai<1.0" google-generativeai pydantic python-dotenv

2. Environment Configuration
Create a `.env` file with your API keys and model pricing.

    # .env
    OPENAI_API_KEY=sk-your-key-here
    GOOGLE_API_KEY=your-google-key
    REDIS_URL=redis://localhost:6379
    # Model pricing per 1K tokens (USD)
    GPT4_PRICE=0.03
    GPT35_PRICE=0.0015
    PALM2_PRICE=0.0005

3. Core Routing Logic
The heart of the system is a complexity estimator that scores prompts on a scale of 1-10.

    # router.py
    import os
    import time

    import google.generativeai as palm
    import openai  # pre-1.0 SDK interface (openai.ChatCompletion)
    import redis
    from dotenv import load_dotenv
    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel

    load_dotenv()

    # Configure the provider clients once at startup, not on every request
    openai.api_key = os.getenv("OPENAI_API_KEY")
    palm.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    app = FastAPI()
    r = redis.from_url(os.getenv("REDIS_URL"))

    # Price per 1K tokens (USD)
    PRICES = {"gpt-4": 0.03, "gpt-3.5-turbo": 0.0015, "palm-2": 0.0005}


    class Request(BaseModel):
        prompt: str
        user_id: str


    class Response(BaseModel):
        model: str
        content: str
        cost: float
        latency_ms: int


    def estimate_complexity(prompt: str) -> int:
        """Score prompt complexity 1-10 based on length, keywords, and structure."""
        score = 1
        if len(prompt) > 200:
            score += 2
        if len(prompt) > 500:
            score += 2
        if any(word in prompt.lower() for word in ["legal", "medical", "code", "math"]):
            score += 3
        if "?" not in prompt:
            score += 1
        return min(score, 10)


    def select_model(complexity: int) -> str:
        """Route to cheapest model that can handle the complexity."""
        if complexity <= 3:
            return "palm-2"  # $0.0005
        elif complexity <= 6:
            return "gpt-3.5-turbo"  # $0.0015
        else:
            return "gpt-4"  # $0.03


    def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
        """Estimate cost in USD from per-1K-token prices."""
        return PRICES[model] * (input_tokens + output_tokens) / 1000


    def call_model(model: str, prompt: str) -> str:
        """Call the selected provider and return the response text."""
        if model in ("gpt-4", "gpt-3.5-turbo"):
            response = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content
        if model == "palm-2":
            response = palm.generate_text(prompt=prompt)
            if response.result is None:  # blocked or empty completion
                raise RuntimeError("PaLM 2 returned no result")
            return response.result
        raise ValueError(f"Unknown model: {model}")


    def fallback(prompt: str) -> str:
        """Assumed shape of the fallback helper: retry the prompt once on GPT-4."""
        return call_model("gpt-4", prompt)


    # Declared with def (not async def) so FastAPI runs the blocking SDK calls
    # in its thread pool instead of stalling the event loop.
    @app.post("/route", response_model=Response)
    def route_request(request: Request):
        complexity = estimate_complexity(request.prompt)
        model = select_model(complexity)
        # Log decision to Redis for monitoring
        r.rpush("routing_log", f"{model}:{complexity}:{request.user_id}")
        # Call the selected model
        start = time.time()
        try:
            content = call_model(model, request.prompt)
        except Exception:
            if model == "gpt-4":
                # Nothing more capable to escalate to
                raise HTTPException(status_code=502, detail="Model call failed")
            # Fallback to GPT-4 on failure
            model = "gpt-4"
            content = fallback(request.prompt)
        # Latency covers the failed attempt plus the fallback call
        latency = int((time.time() - start) * 1000)
        # Character counts stand in for token counts here, so this overestimates
        cost = calculate_cost(model, len(request.prompt), len(content))
        return Response(model=model, content=content, cost=cost, latency_ms=latency)

4. Start the Service
Run the routing layer on port 8000.

    uvicorn router:app --host 0.0.0.0 --port 8000 --reload

Usage Examples
Once the service is running, you can test it with curl.
Simple Query (Routes to PaLM 2)

    curl -X POST http://localhost:8000/route \
      -H "Content-Type: application/json" \
      -d '{"prompt": "What is the capital of France?", "user_id": "test1"}'

Expected response (exact content and latency will vary):

    {
      "model": "palm-2",
      "content": "The capital of France is Paris.",
      "cost": 0.0000305,
      "latency_ms": 340
    }

Complex Query (Routes to GPT-4)

    curl -X POST http://localhost:8000/route \
      -H "Content-Type: application/json" \
      -d '{"prompt": "Write a detailed legal analysis of the Fourth Amendment implications of warrantless cell phone tracking by law enforcement agencies, including relevant Supreme Court precedents and dissenting opinions.", "user_id": "test2"}'

Expected response (exact content, cost, and latency will vary):

    {
      "model": "gpt-4",
      "content": "...",
      "cost": 0.12,
      "latency_ms": 2100
    }

What Broke the Product
The routing layer worked perfectly in isolation. Our cost per request dropped by 52% within the first week. But users started complaining about inconsistent quality, random failures, and strange behavior.
The Fragility of Complexity Estimation
Our `estimate_complexity` function was too simplistic. A user asking "What's the difference between a cat and a dog?" scored a 1 (simple) and got routed to PaLM 2. PaLM 2 returned a short, technically correct answer. But the user expected a detailed, conversational response like GPT-4 would give. The routing layer optimized for cost, not user experience.
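Worse, keyword matching inflated scores. A prompt like "Help me understand this code" scored 5 on the word "code" and the missing question mark alone, which skipped the cheapest tier. Paste in a snippet longer than 200 characters and the score rose to at least 7, which routed to GPT-4 no matter how simple the question was. We were burning money on simple requests.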
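Running the heuristic on a few prompts shows how little it measures. The sketch below copies `estimate_complexity` and `select_model` from router.py; the sample prompts other than the cat-and-dog question are made up for illustration. Because the keyword check matches substrings, "decode" and "aftermath" trigger the "code" and "math" keywords, while a genuinely hard proof request with no keyword lands on the cheapest tier.

```python
def estimate_complexity(prompt: str) -> int:
    """Same scoring rules as router.py: length, keywords, and question mark."""
    score = 1
    if len(prompt) > 200:
        score += 2
    if len(prompt) > 500:
        score += 2
    if any(word in prompt.lower() for word in ["legal", "medical", "code", "math"]):
        score += 3
    if "?" not in prompt:
        score += 1
    return min(score, 10)


def select_model(complexity: int) -> str:
    """Same thresholds as router.py."""
    if complexity <= 3:
        return "palm-2"
    elif complexity <= 6:
        return "gpt-3.5-turbo"
    return "gpt-4"


PROMPTS = [
    "What's the difference between a cat and a dog?",        # wants a rich answer
    "How do I decode this message?",                         # "decode" contains "code"
    "What happened in the aftermath of the storm?",          # "aftermath" contains "math"
    "Prove that every bounded monotone sequence converges",  # hard, but no keyword
]

if __name__ == "__main__":
    for prompt in PROMPTS:
        score = estimate_complexity(prompt)
        print(f"{score:>2}  {select_model(score):<14} {prompt}")
```

The scores come out as 1, 4, 4, and 2: the two false keyword hits are bumped up a tier, and the proof request is treated as nearly trivial.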
The Fallback Nightmare
When PaLM 2 failed on a request (which happened ~8% of the time due to rate limits), our fallback to GPT-4 worked—but it doubled latency. Users saw spinning loaders for 4+ seconds. Our 95th percentile latency jumped from 1.2s to 4.8s.
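The size of that jump follows from the failure rate. The 95th percentile ignores only the slowest 5% of requests, so once more than 5% of traffic takes the fallback path, the fallback latency becomes the p95. The sketch below illustrates this with a deliberately simplified model: every normal request takes 1.2s and every fallback request takes 4.8s. Those two values are borrowed from the figures above as stand-ins, and the `simulate` and `percentile` helpers are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def simulate(fallback_rate, n=1000, normal_s=1.2, fallback_s=4.8):
    """Build n latencies where a fixed share of requests took the slow fallback path."""
    n_fallback = round(n * fallback_rate)
    return [fallback_s] * n_fallback + [normal_s] * (n - n_fallback)


if __name__ == "__main__":
    for rate in (0.02, 0.04, 0.08):
        p95 = percentile(simulate(rate), 95)
        print(f"fallback rate {rate:.0%}: p95 = {p95}s")
```

At a 2% or 4% fallback rate the p95 stays at the normal latency. At 8% it is the fallback latency, even though 92% of requests are unaffected.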
Model Behavior Divergence
Each model has a different personality. GPT-4 is verbose and cautious. GPT-3.5 is concise and direct. PaLM 2 tends to be more creative. Users noticed that the same prompt returned wildly different tones depending on the routing decision. Our product lost its consistent voice.
What We Learned
1. Cost Optimization Must Include User Experience
We measured cost per request but ignored cost per satisfied user. A cheap response that makes a user leave is infinitely more expensive than a premium response that retains them.
2. Routing Needs Per-User Personalization
Some users prefer short answers. Others want deep analysis. Our routing layer should have learned user preferences and adjusted thresholds accordingly.
3. Fallbacks Need Time Budgets
Instead of blindly falling back to GPT-4, we should have set a maximum latency budget. If the cheap model takes more than 500ms, escalate immediately—don't wait for it to fail.
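A time-budgeted fallback can be sketched with `asyncio.wait_for`. This is an illustration under stated assumptions, not production code: the three model functions are simulated with `asyncio.sleep`, and `call_with_budget` and its parameter names are hypothetical. The 0.5s cheap-model budget comes from the paragraph above and the 2s total from our latency requirement.

```python
import asyncio
import time


async def call_with_budget(cheap_call, strong_call, cheap_budget_s=0.5, total_budget_s=2.0):
    """Try the cheap model within its budget, then escalate with whatever time is left."""
    start = time.monotonic()
    try:
        answer = await asyncio.wait_for(cheap_call(), timeout=cheap_budget_s)
        return "cheap", answer
    except Exception:  # includes asyncio.TimeoutError and provider errors
        remaining = max(total_budget_s - (time.monotonic() - start), 0.0)
        answer = await asyncio.wait_for(strong_call(), timeout=remaining)
        return "strong", answer


# Simulated models (stand-ins for real provider calls)
async def slow_cheap_model():
    await asyncio.sleep(5)  # stalled request, e.g. queued behind a rate limit
    return "cheap answer"


async def fast_cheap_model():
    await asyncio.sleep(0.01)
    return "cheap answer"


async def strong_model():
    await asyncio.sleep(0.05)
    return "strong answer"


async def main():
    for cheap in (fast_cheap_model, slow_cheap_model):
        start = time.monotonic()
        tier, answer = await call_with_budget(cheap, strong_model)
        elapsed = time.monotonic() - start
        print(f"{cheap.__name__}: served by {tier} tier in {elapsed:.2f}s")


if __name__ == "__main__":
    asyncio.run(main())
```

The stalled cheap call is cancelled at the 0.5s mark instead of being left to fail on its own, so the user waits roughly 0.5s plus the strong model's latency instead of the full stall.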
4. Monitor Quality, Not Just Cost
We tracked cost per request and latency. We should have tracked:
- User satisfaction scores per model
- Task completion rates per model
- Session abandonment rates
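These three metrics are cheap to compute once each session records which model served it. A minimal sketch follows; the `SessionEvent` schema, the 1-5 satisfaction scale, and the sample events are all hypothetical, invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionEvent:
    """One served session (hypothetical schema)."""
    model: str
    satisfaction: int      # assumed 1-5 user rating
    task_completed: bool
    abandoned: bool


def quality_by_model(events):
    """Aggregate satisfaction, completion, and abandonment per model."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event.model].append(event)
    report = {}
    for model, rows in grouped.items():
        n = len(rows)
        report[model] = {
            "requests": n,
            "avg_satisfaction": sum(r.satisfaction for r in rows) / n,
            "completion_rate": sum(r.task_completed for r in rows) / n,
            "abandonment_rate": sum(r.abandoned for r in rows) / n,
        }
    return report


if __name__ == "__main__":
    sample = [
        SessionEvent("gpt-4", 5, True, False),
        SessionEvent("gpt-4", 4, True, False),
        SessionEvent("palm-2", 2, False, True),
        SessionEvent("palm-2", 4, True, False),
    ]
    for model, stats in quality_by_model(sample).items():
        print(model, stats)
```

Breaking the numbers out per model is the point: a blended average can look healthy while one tier quietly drives abandonment.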
The Fixed Routing Layer
After the incident, we rebuilt the routing layer with these lessons in mind. Here's the improved version.
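
    # improved_router.py
    import asyncio
    from typing import Optional


    class AdaptiveRouter:
        def __init__(self, user_preferences: Optional[dict] = None):
            self.user_prefs = user_preferences or {}
            # ms allowed for the first attempt before escalating;
            # lesson 3 suggests tightening this (e.g. 500)
            self.latency_budget = 2000

        async def route_with_timeout(self, prompt: str, user_id: str):
            complexity = self.estimate_complexity(prompt)
            preferred_style = self.user_prefs.get(user_id, "balanced")
            # Adjust threshold based on user preference
            if preferred_style == "concise":
                threshold = 4
            elif preferred_style == "detailed":
                threshold = 6
            else:
                threshold = 5
            # Try cheapest model first, with timeout
            model = self.select_model(complexity, threshold)
            if model == "gpt-4":
                # Already on the most capable model, so there is nothing to escalate to
                return await self.call_model(model, prompt)
            try:
                return await asyncio.wait_for(
                    self.call_model(model, prompt),
                    timeout=self.latency_budget / 1000,
                )
            except asyncio.TimeoutError:
                # Escalate immediately if cheap model is slow
                return await self.call_model("gpt-4", prompt)

        # Placeholders: the three helpers below are not part of this listing.
        def estimate_complexity(self, prompt: str) -> int:
            raise NotImplementedError

        def select_model(self, complexity: int, threshold: int) -> str:
            raise NotImplementedError

        async def call_model(self, model: str, prompt: str) -> str:
            raise NotImplementedError

Conclusion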
Building a routing layer to cut AI costs is tempting. The math looks great on paper: route 70% of requests to cheap models, save 50% on API bills. But the hidden costs—inconsistent user experience, increased latency, and quality degradation—can destroy your product.
Our experience taught us that AI cost optimization isn't just about choosing the cheapest model. It's about understanding your users, measuring the right metrics, and building adaptive systems that balance cost with quality. The best routing layer is invisible to users. Ours was anything but.
If you're building a similar system, start with a simple rule: never let a user see that you switched models. If they notice, you've already broken the product.



