We Built a Routing Layer to Cut Our AI Costs. It Broke the Product.

A team built a custom AI routing layer to reduce API costs, but it introduced latency, errors, and unpredictable behavior that degraded the user experience, ultimately breaking the product.

By Nexus AI Editorial Team. Published: June 30, 2026. Last updated: June 30, 2026.

Cost optimization in AI is a double-edged sword. Every engineering team I know has faced the same pressure: reduce API bills without sacrificing quality. We thought we had found the perfect solution—a smart routing layer that dynamically chose the cheapest model for each request. It worked brilliantly in testing. Then it broke the product.

This is the story of what went wrong, the technical details of the routing layer we built, and the hard lessons we learned about the hidden costs of optimization.

The Problem We Tried to Solve

Our AI-powered product relied on multiple large language models (LLMs) from providers like OpenAI and Google. We used GPT-4 for complex reasoning, GPT-3.5 for simple tasks, and Google's PaLM 2 for specific domain queries. Our monthly bill was spiraling toward six figures.

The obvious fix: route each request to the cheapest model capable of handling it. We wanted a system that would analyze each prompt, estimate its complexity, and send it to the most cost-efficient model. If a user asked "What's the weather?" we'd route to one of the cheap models, GPT-3.5 or PaLM 2. If they asked "Write a detailed legal analysis," we'd route to GPT-4.

Requirements

Before building, we defined what the routing layer needed to do:

**Cost optimization**: Reduce average per-request cost by at least 40%.
**Latency control**: Keep response times under 2 seconds for 95% of requests.
**Quality preservation**: Maintain user satisfaction scores within 5% of baseline.
**Fallback logic**: Automatically escalate to more capable models if the cheap one fails.
**Observability**: Log every routing decision with model, cost, and latency.
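
The first three targets are numeric, so they can be checked automatically. The sketch below is one way to do that, assuming you already collect per-request cost, latency, and satisfaction figures. The names `Targets`, `percentile`, and `meets_targets` are hypothetical and not part of the service described in this article.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    min_cost_reduction: float = 0.40     # at least 40% cheaper per request
    p95_latency_ms: float = 2000.0       # 95% of requests under 2 seconds
    max_satisfaction_drop: float = 0.05  # within 5% of baseline


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_targets(baseline_cost, routed_cost, latencies_ms,
                  baseline_satisfaction, routed_satisfaction,
                  targets=Targets()):
    """Return a dict mapping each requirement to True (met) or False."""
    cost_reduction = 1 - routed_cost / baseline_cost
    satisfaction_drop = (
        (baseline_satisfaction - routed_satisfaction) / baseline_satisfaction
    )
    return {
        "cost": cost_reduction >= targets.min_cost_reduction,
        "latency": percentile(latencies_ms, 95) < targets.p95_latency_ms,
        "quality": satisfaction_drop <= targets.max_satisfaction_drop,
    }
```

A check like this only helps if all three results are read together: a rollout can pass on cost while failing on latency or quality.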

Step-by-Step Installation

We built the routing layer as a Python middleware service using FastAPI and Redis. Here's how we set it up.

1. Install Dependencies

First, create a virtual environment and install the required packages.

python -m venv routing-env
source routing-env/bin/activate
# router.py uses the legacy openai.ChatCompletion interface, which was removed in openai 1.0
pip install fastapi uvicorn redis "openai<1.0" google-generativeai pydantic python-dotenv

2. Environment Configuration

Create a `.env` file with your API keys and model pricing.

# .env
OPENAI_API_KEY=sk-your-key-here
GOOGLE_API_KEY=your-google-key
REDIS_URL=redis://localhost:6379

# Model pricing per 1K tokens (USD)
GPT4_PRICE=0.03
GPT35_PRICE=0.0015
PALM2_PRICE=0.0005
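
Note that `calculate_cost` in the routing code keeps its own hardcoded copy of these prices, so the `.env` values are never actually read. A minimal sketch of loading them instead, assuming the variable names above; `PRICE_ENV_VARS` and `load_prices` are hypothetical helpers, not part of the original service:

```python
import os

# Maps model name -> environment variable holding its USD price per 1K tokens.
PRICE_ENV_VARS = {
    "gpt-4": "GPT4_PRICE",
    "gpt-3.5-turbo": "GPT35_PRICE",
    "palm-2": "PALM2_PRICE",
}


def load_prices(env=os.environ):
    """Read per-1K-token prices from the environment, failing loudly on gaps."""
    prices = {}
    for model, var in PRICE_ENV_VARS.items():
        raw = env.get(var)
        if raw is None:
            raise KeyError(f"missing environment variable {var}")
        prices[model] = float(raw)
    return prices
```

Keeping a single source of truth for prices avoids the cost log drifting out of date when a provider changes its rates.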

3. Core Routing Logic

The heart of the system is a complexity estimator that scores prompts on a scale of 1-10 (with the rules below, the highest reachable score is 9).

# router.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import openai
import google.generativeai as palm
import redis
import os
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()
r = redis.from_url(os.getenv("REDIS_URL"))

class Request(BaseModel):
    prompt: str
    user_id: str

class Response(BaseModel):
    model: str
    content: str
    cost: float
    latency_ms: int

def estimate_complexity(prompt: str) -> int:
    """Score prompt complexity 1-10 based on length, keywords, and structure."""
    score = 1
    if len(prompt) > 200:
        score += 2
    if len(prompt) > 500:
        score += 2
    if any(word in prompt.lower() for word in ["legal", "medical", "code", "math"]):
        score += 3
    if "?" not in prompt:
        score += 1
    return min(score, 10)

def select_model(complexity: int) -> str:
    """Route to cheapest model that can handle the complexity."""
    if complexity <= 3:
        return "palm-2"  # $0.0005
    elif complexity <= 6:
        return "gpt-3.5-turbo"  # $0.0015
    else:
        return "gpt-4"  # $0.03

@app.post("/route")
async def route_request(request: Request):
    complexity = estimate_complexity(request.prompt)
    model = select_model(complexity)
    
    # Log decision to Redis for monitoring
    r.rpush("routing_log", f"{model}:{complexity}:{request.user_id}")
    
    # Call the selected model
    import time
    start = time.time()
    
    try:
        if model == "gpt-4" or model == "gpt-3.5-turbo":
            openai.api_key = os.getenv("OPENAI_API_KEY")
            response = openai.ChatCompletion.create(
                model=model,
                messages=[{"role": "user", "content": request.prompt}]
            )
            content = response.choices[0].message.content
        elif model == "palm-2":
            palm.configure(api_key=os.getenv("GOOGLE_API_KEY"))
            response = palm.generate_text(prompt=request.prompt)
            content = response.result
        
        latency = int((time.time() - start) * 1000)
        cost = calculate_cost(model, len(request.prompt), len(content))
        
        return Response(model=model, content=content, cost=cost, latency_ms=latency)
    
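    except Exception:
        # Fallback to GPT-4 on failure. The fallback() helper this handler
        # originally called is not defined anywhere in the listing, so the
        # escalation is inlined here.
        if model == "gpt-4":
            # GPT-4 itself failed: there is nothing left to escalate to
            raise HTTPException(status_code=502, detail="Upstream model call failed")
        openai.api_key = os.getenv("OPENAI_API_KEY")
        response = openai.ChatCompletion.create(
            model="gpt-4",
            messages=[{"role": "user", "content": request.prompt}]
        )
        content = response.choices[0].message.content
        # Latency includes the failed first attempt, which is what the user waited for
        latency = int((time.time() - start) * 1000)
        cost = calculate_cost("gpt-4", len(request.prompt), len(content))
        return Response(model="gpt-4", content=content, cost=cost, latency_ms=latency)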

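def calculate_cost(model: str, input_chars: int, output_chars: int) -> float:
    """Estimate cost in USD from character counts.

    route_request passes len() of the prompt and the response, which are
    characters, not tokens. Roughly 4 characters per token is a common rule
    of thumb for English text; this is an approximation only.
    """
    prices = {  # USD per 1K tokens
        "gpt-4": 0.03,
        "gpt-3.5-turbo": 0.0015,
        "palm-2": 0.0005
    }
    approx_tokens = (input_chars + output_chars) / 4
    return prices[model] * approx_tokens / 1000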
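
The router writes each decision to the `routing_log` Redis list as `model:complexity:user_id`. A small sketch of turning those entries into a per-model summary follows. `summarize_routing_log` is a hypothetical helper, and it assumes the entries have already been fetched (for example with redis-py's `lrange`), so it accepts either bytes or strings.

```python
from collections import Counter


def summarize_routing_log(entries):
    """Summarize 'model:complexity:user_id' entries per model."""
    per_model = Counter()
    complexity_sum = Counter()
    for entry in entries:
        if isinstance(entry, bytes):  # redis-py returns bytes by default
            entry = entry.decode("utf-8")
        # maxsplit=2 keeps user IDs that themselves contain ':' intact
        model, complexity, _user_id = entry.split(":", 2)
        per_model[model] += 1
        complexity_sum[model] += int(complexity)
    total = sum(per_model.values())
    return {
        model: {
            "share": count / total,
            "avg_complexity": complexity_sum[model] / count,
        }
        for model, count in per_model.items()
    }
```

Watching the share of traffic per model over time is a cheap way to notice when the estimator starts sending an unexpected fraction of requests to one tier.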

4. Start the Service

Run the routing layer on port 8000.

uvicorn router:app --host 0.0.0.0 --port 8000 --reload

Usage Examples

Once the service is running, you can test it with curl.

Simple Query (Routes to PaLM 2)

curl -X POST http://localhost:8000/route \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "user_id": "test1"}'

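Expected response shape (the values are illustrative: 0.0005 is the price of a full 1K tokens, so the computed `cost` for an exchange this short will be far smaller):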

{
  "model": "palm-2",
  "content": "The capital of France is Paris.",
  "cost": 0.0005,
  "latency_ms": 340
}

Complex Query (Routes to GPT-4)

curl -X POST http://localhost:8000/route \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a detailed legal analysis of the Fourth Amendment implications of warrantless cell phone tracking by law enforcement agencies, including relevant Supreme Court precedents and dissenting opinions.", "user_id": "test2"}'

Expected response:

{
  "model": "gpt-4",
  "content": "...",
  "cost": 0.12,
  "latency_ms": 2100
}

What Broke the Product

The routing layer worked perfectly in isolation. Our cost per request dropped by 52% within the first week. But users started complaining about inconsistent quality, random failures, and strange behavior.

The Fragility of Complexity Estimation

Our `estimate_complexity` function was too simplistic. A user asking "What's the difference between a cat and a dog?" scored a 1 (simple) and got routed to PaLM 2. PaLM 2 returned a short, technically correct answer. But the user expected a detailed, conversational response like GPT-4 would give. The routing layer optimized for cost, not user experience.

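Worse, the keyword bonus misfired on ambiguous prompts. "Help me understand this code" scored 5 on the word "code" and the missing question mark alone. Once the user pasted a snippet longer than 200 characters, the score could reach 7 and the request went to GPT-4, however trivial the actual question. We were burning money on simple requests.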
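
The keyword check has a second weakness: `word in prompt.lower()` is a substring test, so it also fires when a keyword appears inside another word. The sketch below contrasts it with a whole-word match. `substring_hit` reproduces the check from `estimate_complexity`; `whole_word_hit` is a hypothetical alternative, not something the original service used.

```python
import re

KEYWORDS = ["legal", "medical", "code", "math"]

# Whole-word match instead of the substring test used in estimate_complexity.
KEYWORD_RE = re.compile(r"\b(?:" + "|".join(KEYWORDS) + r")\b", re.IGNORECASE)


def substring_hit(prompt: str) -> bool:
    """The original check: true if a keyword appears anywhere, even inside a word."""
    return any(word in prompt.lower() for word in KEYWORDS)


def whole_word_hit(prompt: str) -> bool:
    """Stricter variant: the keyword must stand alone as a word."""
    return KEYWORD_RE.search(prompt) is not None
```

With these definitions, "What is my postcode?" and "Is this illegal?" both trigger the substring check but not the whole-word one. The stricter match has its own trade-off: it no longer catches "mathematics" or "codes". Neither variant measures how hard the request actually is.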

The Fallback Nightmare

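When PaLM 2 failed on a request (which happened ~8% of the time due to rate limits), our fallback to GPT-4 worked, but it at least doubled latency for those requests, because the user waited for the failed attempt and then for the GPT-4 call. Users saw spinning loaders for 4+ seconds. Because the failure rate was above 5%, the slow path landed inside our 95th percentile, which jumped from 1.2s to 4.8s, well past the 2-second target.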
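
The percentile effect is easy to verify with a toy sample. The sketch below is a deliberate simplification: it assumes every request takes either a fixed fast latency or a fixed fallback latency, using the 1.2s and 4.8s figures above as stand-ins. `p95` and `simulate` are hypothetical helpers.

```python
import math


def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def simulate(n_requests, failure_rate, fast_s, slow_s):
    """Build a sample where a fixed fraction of requests take the fallback path."""
    n_slow = round(n_requests * failure_rate)
    return [fast_s] * (n_requests - n_slow) + [slow_s] * n_slow


at_8_percent = p95(simulate(1000, 0.08, 1.2, 4.8))  # fallback path inside p95
at_4_percent = p95(simulate(1000, 0.04, 1.2, 4.8))  # fallback path outside p95
```

In this model `at_8_percent` is 4.8 and `at_4_percent` is 1.2. Any failure rate above 5% puts the fallback latency directly into the p95 figure, no matter how fast the other requests are.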

Model Behavior Divergence

Each model has a different personality. GPT-4 is verbose and cautious. GPT-3.5 is concise and direct. PaLM 2 tends to be more creative. Users noticed that the same prompt returned wildly different tones depending on the routing decision. Our product lost its consistent voice.

What We Learned

1. Cost Optimization Must Include User Experience

We measured cost per request but ignored cost per satisfied user. A cheap response that makes a user leave is infinitely more expensive than a premium response that retains them.
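
To make the metric concrete, here is a small sketch. The function name and every number in the example are hypothetical, chosen only to show how a cheaper bill can still be the more expensive outcome.

```python
def cost_per_satisfied_user(total_cost_usd: float, users: int,
                            satisfaction_rate: float) -> float:
    """Total spend divided by the number of users who were satisfied."""
    satisfied = users * satisfaction_rate
    if satisfied <= 0:
        raise ValueError("no satisfied users; the metric is undefined")
    return total_cost_usd / satisfied


# Hypothetical figures: 1,000 users in both scenarios.
premium_only = cost_per_satisfied_user(1000.0, 1000, 0.90)  # about $1.11
routed = cost_per_satisfied_user(480.0, 1000, 0.40)         # $1.20
```

In this made-up scenario the routed setup cuts the bill by 52% yet costs more per satisfied user, because satisfaction fell faster than spend.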

2. Routing Needs Per-User Personalization

Some users prefer short answers. Others want deep analysis. Our routing layer should have learned user preferences and adjusted thresholds accordingly.

3. Fallbacks Need Time Budgets

Instead of blindly falling back to GPT-4, we should have set a maximum latency budget. If the cheap model takes more than 500ms, escalate immediately—don't wait for it to fail.

4. Monitor Quality, Not Just Cost

We tracked cost per request and latency. We should have tracked:

User satisfaction scores per model
Task completion rates per model
Session abandonment rates
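
A minimal sketch of how these signals could be aggregated, assuming each session is recorded with the model that served it. `SessionRecord` and `quality_by_model` are hypothetical names, and the satisfaction scale is an assumption. Abandonment is broken down per model here as well, which goes slightly beyond the list above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SessionRecord:
    model: str
    satisfaction: float   # assumed to be a survey score, e.g. 1-5
    task_completed: bool
    abandoned: bool


def quality_by_model(records):
    """Aggregate the three quality signals for each model."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec.model].append(rec)
    report = {}
    for model, recs in grouped.items():
        n = len(recs)
        report[model] = {
            "avg_satisfaction": sum(r.satisfaction for r in recs) / n,
            "completion_rate": sum(r.task_completed for r in recs) / n,
            "abandonment_rate": sum(r.abandoned for r in recs) / n,
        }
    return report
```

Putting this report next to cost per request shows which model tier is responsible when savings come at the expense of quality.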

The Fixed Routing Layer

After the incident, we rebuilt the routing layer with these lessons in mind. Here's the improved version.

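# improved_router.py
import asyncio
import time
from typing import Optional

class AdaptiveRouter:
    # estimate_complexity, select_model, and call_model are not shown in this
    # listing; they are assumed to wrap the logic from router.py.
    def __init__(self, user_preferences: Optional[dict] = None):
        self.user_prefs = user_preferences or {}
        self.latency_budget = 2000  # max 2 seconds total, in ms
        self.cheap_timeout = 500    # ms to wait on a cheap model before escalating

    async def route_with_timeout(self, prompt: str, user_id: str):
        start = time.monotonic()
        complexity = self.estimate_complexity(prompt)
        preferred_style = self.user_prefs.get(user_id, "balanced")

        # Adjust threshold based on user preference. select_model is not
        # shown, so how it interprets the threshold is an open assumption:
        # these values only make sense if a higher threshold pushes requests
        # toward more capable models.
        if preferred_style == "concise":
            threshold = 4
        elif preferred_style == "detailed":
            threshold = 6
        else:
            threshold = 5

        # Try the selected model first, with timeout. GPT-4 has nothing to
        # escalate to, so it gets the whole budget.
        model = self.select_model(complexity, threshold)
        first_timeout = self.latency_budget if model == "gpt-4" else self.cheap_timeout
        try:
            return await asyncio.wait_for(
                self.call_model(model, prompt),
                timeout=first_timeout / 1000
            )
        except Exception:
            # Covers asyncio.TimeoutError as well as provider errors
            if model == "gpt-4":
                raise
            # Escalate immediately if the cheap model is slow or fails, and
            # give GPT-4 only what is left of the total budget
            remaining_ms = self.latency_budget - (time.monotonic() - start) * 1000
            return await asyncio.wait_for(
                self.call_model("gpt-4", prompt),
                timeout=max(remaining_ms, 0) / 1000
            )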
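
The class above depends on helpers that are not shown, so it cannot be run on its own. The sketch below isolates just the escalation pattern with stand-in coroutines so the behavior can be observed without any API keys. `fake_model`, `call_with_escalation`, and `demo` are hypothetical names, and the delays are arbitrary.

```python
import asyncio


async def fake_model(name: str, delay_s: float, fail: bool = False) -> str:
    """Stand-in for a provider call; sleeps instead of hitting an API."""
    await asyncio.sleep(delay_s)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} answer"


async def call_with_escalation(cheap, premium, cheap_timeout_s: float) -> str:
    """Try the cheap call under a time budget; escalate on timeout or error.

    `cheap` and `premium` are zero-argument callables returning coroutines.
    Real code would catch the provider's own exception types instead of
    RuntimeError.
    """
    try:
        return await asyncio.wait_for(cheap(), timeout=cheap_timeout_s)
    except (asyncio.TimeoutError, RuntimeError):
        return await premium()


async def demo():
    premium = lambda: fake_model("premium", 0.0)
    fast = await call_with_escalation(
        lambda: fake_model("cheap", 0.0), premium, cheap_timeout_s=0.2)
    slow = await call_with_escalation(
        lambda: fake_model("cheap", 5.0), premium, cheap_timeout_s=0.2)
    failed = await call_with_escalation(
        lambda: fake_model("cheap", 0.0, fail=True), premium, cheap_timeout_s=0.2)
    return fast, slow, failed
```

Running `asyncio.run(demo())` returns the cheap answer for the fast call and the premium answer for the slow and failing ones. `asyncio.wait_for` cancels the slow cheap call when the timeout expires, so the demo does not wait the full five seconds.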

Conclusion

Building a routing layer to cut AI costs is tempting. The math looks great on paper: route 70% of requests to cheap models, save 50% on API bills. But the hidden costs—inconsistent user experience, increased latency, and quality degradation—can destroy your product.

Our experience taught us that AI cost optimization isn't just about choosing the cheapest model. It's about understanding your users, measuring the right metrics, and building adaptive systems that balance cost with quality. The best routing layer is invisible to users. Ours was anything but.

If you're building a similar system, start with a simple rule: never let a user see that you switched models. If they notice, you've already broken the product.
