Symptom
Every request — from “classify this as spam or not” to “extract the user’s name from this sentence” — goes to claude-opus-4-6 or claude-sonnet-4-6, costing 10-50× more than necessary. At scale, a system handling 100,000 classification requests per day on Opus costs roughly $15,000/month when Haiku would cost $300/month for the same work.
# BROKEN: Opus for a yes/no classification
response = client.messages.create(
    model="claude-opus-4-6",  # $15/MTok input, overkill for "is this spam?"
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Is this message spam? Answer yes or no.\n\nMessage: {user_message}"
    }]
)
Common tasks incorrectly sent to large models:
- Binary or multi-class text classification
- Intent detection (is this a question / complaint / request?)
- Entity extraction (names, dates, numbers)
- Language detection
- Sentiment analysis (positive / negative / neutral)
- Simple format conversion (JSON to CSV, markdown to text)
- Short factual lookups from provided context
Root Cause
The default model in most agent setups is the most capable available, and engineers never revisit the choice as the system evolves. Simple sub-tasks that were once part of a complex pipeline get called with the same model as the pipeline itself. There’s no routing logic — every call uses the same model regardless of complexity.
The fix is tiered routing: classify the task first (cheaply), then dispatch to the appropriate model tier.
Fix
Option 1 — Explicit Task Routing by Task Type
Define task categories and map them to appropriate models.
import anthropic
from enum import Enum
from typing import Optional

client = anthropic.Anthropic()

class TaskComplexity(Enum):
    SIMPLE = "simple"      # classification, extraction, short lookups
    STANDARD = "standard"  # summarization, Q&A, code explanation
    COMPLEX = "complex"    # multi-step reasoning, code generation, analysis

MODEL_FOR_COMPLEXITY = {
    TaskComplexity.SIMPLE: "claude-haiku-4-5-20251001",  # ~$0.25/MTok
    TaskComplexity.STANDARD: "claude-sonnet-4-6",        # ~$3/MTok
    TaskComplexity.COMPLEX: "claude-opus-4-6",           # ~$15/MTok
}

MAX_TOKENS_FOR_COMPLEXITY = {
    TaskComplexity.SIMPLE: 50,
    TaskComplexity.STANDARD: 1024,
    TaskComplexity.COMPLEX: 4096,
}

def classify_task(task_description: str) -> TaskComplexity:
    """
    Classify task complexity using keyword heuristics.
    In production: use a lightweight classifier or small LLM.
    """
    task_lower = task_description.lower()
    simple_patterns = [
        "classify", "is this", "yes or no", "true or false",
        "extract the", "what is the sentiment", "detect the language",
        "which category", "spam or not", "positive or negative",
        "label this", "tag this", "is it a",
    ]
    complex_patterns = [
        "implement", "build", "design", "architect", "analyze in depth",
        "multi-step", "comprehensive", "research", "compare and contrast",
        "write a complete", "full implementation", "debug this complex",
    ]
    # Simple patterns are checked first, so they win when both lists match
    if any(p in task_lower for p in simple_patterns):
        return TaskComplexity.SIMPLE
    if any(p in task_lower for p in complex_patterns):
        return TaskComplexity.COMPLEX
    return TaskComplexity.STANDARD

def routed_call(
    task_description: str,
    prompt: str,
    force_complexity: Optional[TaskComplexity] = None,
) -> dict:
    """Make an API call routed to the appropriate model tier."""
    complexity = force_complexity or classify_task(task_description)
    model = MODEL_FOR_COMPLEXITY[complexity]
    max_tokens = MAX_TOKENS_FOR_COMPLEXITY[complexity]
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "result": response.content[0].text,
        "model_used": model,
        "complexity": complexity.value,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
    }
# Test routing
tasks = [
    ("classify this email as spam or not spam", "Is this spam? Reply SPAM or NOT_SPAM only.\n\nEmail: Win $1000 now!!! Click here!!!"),
    ("summarize this article in 3 bullet points", "Summarize: The new API allows developers to build agents that can use tools, remember context, and collaborate with other agents."),
    ("implement a complete authentication system with JWT, refresh tokens, and rate limiting", "Design and implement a production-ready auth system..."),
]

# USD per million tokens; the input rate is applied to all tokens as a rough estimate
RATE_PER_MTOK = {
    "claude-haiku-4-5-20251001": 0.25,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-6": 15.00,
}

total_cost_estimate = 0.0
for task_desc, prompt in tasks:
    result = routed_call(task_desc, prompt)
    # Rough cost estimate (input + output)
    rate = RATE_PER_MTOK[result["model_used"]]
    cost = (result["input_tokens"] + result["output_tokens"]) / 1_000_000 * rate
    total_cost_estimate += cost
    print(f"Task: {task_desc[:50]}")
    print(f" Model: {result['model_used']} | Complexity: {result['complexity']}")
    print(f" Tokens: {result['input_tokens']}+{result['output_tokens']} | ~${cost:.6f}")
    print(f" Result: {result['result'][:80]}\n")
print(f"Total estimated cost: ${total_cost_estimate:.4f}")
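Expected Token Savings: roughly 98% cost reduction on simple tasks at the per-token rates above. For 100K daily classifications at about 2,500 tokens each (the volume these figures imply), Haiku costs about $62.50/day versus $3,750/day on Opus.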
Environment: Python 3.9+, anthropic>=0.40.0.
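The daily figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not part of the routing code: `daily_cost` is a hypothetical helper, the rates are the per-million-token input prices quoted in this article (check them against current pricing), and the 2,500 tokens per request is an assumption chosen because it is the volume those figures imply.

```python
# Per-million-token input rates quoted in this article (USD). Assumed, not authoritative.
RATE_PER_MTOK = {"haiku": 0.25, "sonnet": 3.00, "opus": 15.00}

def daily_cost(requests_per_day: int, tokens_per_request: int, rate_per_mtok: float) -> float:
    """Daily spend in USD for a fixed token volume at a flat per-MTok rate."""
    total_tokens = requests_per_day * tokens_per_request
    return total_tokens / 1_000_000 * rate_per_mtok

# Assumption: ~2,500 tokens per classification request.
haiku = daily_cost(100_000, 2_500, RATE_PER_MTOK["haiku"])
opus = daily_cost(100_000, 2_500, RATE_PER_MTOK["opus"])
reduction = 1 - haiku / opus

print(f"Haiku ${haiku:,.2f}/day vs Opus ${opus:,.2f}/day ({reduction:.1%} lower)")
# prints: Haiku $62.50/day vs Opus $3,750.00/day (98.3% lower)
```

Because the rate is flat, the percentage saved depends only on the ratio of the two rates; the token count changes the dollar amounts but not the percentage.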
Option 2 — Input Length and Output Size Heuristics
Route based on measurable signals: input length, expected output size, and task structure.
import anthropic
import re
from typing import Optional

client = anthropic.Anthropic()

def estimate_complexity(
    prompt: str,
    expected_output_tokens: Optional[int] = None,
    has_code: Optional[bool] = None,
    has_tools: bool = False,
) -> tuple[str, str]:
    """
    Pick a model from measurable prompt signals.
    Returns (model_id, reason).
    """
    input_tokens_estimate = len(prompt.split()) * 1.3  # rough token estimate
    # Detect code in prompt
    if has_code is None:
        has_code = bool(re.search(r'```|def |class |function |import |<\w+>', prompt))
    # Output size signals
    output_is_small = expected_output_tokens is not None and expected_output_tokens <= 100
    output_is_large = expected_output_tokens is not None and expected_output_tokens > 1000
    # Route logic
    if (
        input_tokens_estimate < 500
        and output_is_small
        and not has_code
        and not has_tools
    ):
        return "claude-haiku-4-5-20251001", "short input, small output, no code"
    if (
        input_tokens_estimate > 3000
        or output_is_large
        or has_tools
    ):
        return "claude-sonnet-4-6", "long input, large output, or tools required"
    # Default: standard tasks
    return "claude-sonnet-4-6", "standard complexity"

def smart_call(
    prompt: str,
    expected_output_words: Optional[int] = None,
    has_tools: bool = False,
    tools: Optional[list] = None,
) -> dict:
    expected_tokens = int(expected_output_words * 1.3) if expected_output_words else None
    # Passing a tools list implies tool use even if has_tools was left at False
    model, reason = estimate_complexity(
        prompt, expected_tokens, has_tools=has_tools or bool(tools)
    )
    max_tokens = min(max(expected_tokens or 256, 50), 4096)
    kwargs = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}]
    }
    if tools:
        kwargs["tools"] = tools
    response = client.messages.create(**kwargs)
    return {
        "text": response.content[0].text if response.content and hasattr(response.content[0], "text") else "",
        "model": model,
        "reason": reason,
        "tokens": response.usage.input_tokens + response.usage.output_tokens,
    }

# Test signal-based routing
examples = [
    {
        "prompt": "Classify sentiment: 'I love this product!' Answer: POSITIVE, NEGATIVE, or NEUTRAL",
        "expected_output_words": 1,
        "desc": "Sentiment (1-word output)",
    },
    {
        "prompt": "Translate this to French: 'Hello, how are you today?'",
        "expected_output_words": 8,
        "desc": "Short translation",
    },
    {
        "prompt": "Write a comprehensive 1000-word technical blog post about transformer attention mechanisms, covering self-attention, multi-head attention, and positional encoding with examples.",
        "expected_output_words": 1000,
        "desc": "Long blog post",
    },
    {
        "prompt": "Extract all email addresses from: 'Contact us at info@example.com or support@company.org'",
        "expected_output_words": 5,
        "desc": "Entity extraction",
    },
]
for ex in examples:
    result = smart_call(ex["prompt"], ex.get("expected_output_words"))
    print(f"{ex['desc']}: {result['model'].split('-')[1]} ({result['reason']}), {result['tokens']} tokens")
Expected Token Savings: Signal-based routing requires no extra LLM calls. Routes ~60% of typical workloads to Haiku, saving ~80% cost on those calls.
Environment: Python 3.9+, anthropic>=0.40.0.
Option 3 — Haiku-First with Sonnet Escalation
Try Haiku first; if the response is insufficient, escalate to Sonnet automatically.
import anthropic

client = anthropic.Anthropic()

QUALITY_CHECK_PROMPT = """Rate this response quality on a scale of 1-5:
1 = Wrong/incoherent
2 = Partially correct or incomplete
3 = Adequate
4 = Good
5 = Excellent
Task: {task}
Response: {response}
Reply with just a single digit (1-5)."""

def haiku_first(
    prompt: str,
    task_description: str = "",
    quality_threshold: int = 3,
    max_tokens: int = 512,
) -> dict:
    """
    Try Haiku first. Escalate to Sonnet if quality check fails.
    """
    # Attempt 1: Haiku
    haiku_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    haiku_text = haiku_resp.content[0].text
    # Quality check using Haiku (cheap)
    quality_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": QUALITY_CHECK_PROMPT.format(
                task=task_description or prompt[:100],
                response=haiku_text[:300]
            )
        }]
    )
    try:
        quality_score = int(quality_resp.content[0].text.strip()[0])
    except (ValueError, IndexError):
        quality_score = 3  # default to adequate if parse fails
    if quality_score >= quality_threshold:
        return {
            "result": haiku_text,
            "model_used": "claude-haiku-4-5-20251001",
            "escalated": False,
            "quality_score": quality_score,
            "haiku_tokens": haiku_resp.usage.input_tokens + haiku_resp.usage.output_tokens,
            "sonnet_tokens": 0,
        }
    # Escalate to Sonnet
    print(f" Haiku quality {quality_score}/5, escalating to Sonnet")
    sonnet_resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}]
    )
    return {
        "result": sonnet_resp.content[0].text,
        "model_used": "claude-sonnet-4-6",
        "escalated": True,
        "quality_score": quality_score,
        "haiku_tokens": haiku_resp.usage.input_tokens + haiku_resp.usage.output_tokens,
        "sonnet_tokens": sonnet_resp.usage.input_tokens + sonnet_resp.usage.output_tokens,
    }

# Test: simple tasks stay on Haiku; complex ones escalate
test_cases = [
    ("Is 'numpy' a Python or JavaScript library?", "Identify which language a library belongs to"),
    ("What is 7 × 8?", "Basic arithmetic"),
    ("Explain quantum entanglement and its implications for quantum computing", "Complex physics explanation"),
    ("Extract all phone numbers from: 'Call us at 555-1234 or 555-5678'", "Phone number extraction"),
]
for prompt, task_desc in test_cases:
    result = haiku_first(prompt, task_desc, quality_threshold=3)
    print(f"\nTask: {task_desc}")
    print(f" Model: {result['model_used']} | Escalated: {result['escalated']} | Quality: {result['quality_score']}/5")
    print(f" Result: {result['result'][:100]}")
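Expected Token Savings: 70-80% of simple tasks complete on Haiku, which is about 12× cheaper than Sonnet at the rates listed in this article. The quality check is a second Haiku call (the rubric plus up to 100 characters of task and 300 characters of response), so every request pays for two Haiku calls and escalated requests also pay for the Sonnet call. In exchange, Haiku failures are caught before they reach users.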
Environment: Python 3.9+, anthropic>=0.40.0.
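Whether Haiku-first pays off depends on how often requests escalate, since an escalated request pays for the Haiku attempt, the quality check, and the Sonnet call. The sketch below works through that arithmetic. The function names and the per-call dollar amounts are hypothetical placeholders, not measured values; substitute costs observed in your own workload.

```python
def expected_cost_per_request(
    haiku_cost: float, check_cost: float, sonnet_cost: float, escalation_rate: float
) -> float:
    """Average cost: every request pays Haiku + check; a fraction also pays Sonnet."""
    return haiku_cost + check_cost + escalation_rate * sonnet_cost

def break_even_escalation_rate(
    haiku_cost: float, check_cost: float, sonnet_cost: float
) -> float:
    """Escalation rate above which Haiku-first costs more than always calling Sonnet.

    Solves haiku + check + p * sonnet = sonnet for p.
    """
    return 1 - (haiku_cost + check_cost) / sonnet_cost

# Hypothetical per-call costs in USD (illustrative only).
HAIKU, CHECK, SONNET = 0.0001, 0.00005, 0.0012

for rate in (0.10, 0.25, 0.50):
    cost = expected_cost_per_request(HAIKU, CHECK, SONNET, rate)
    print(f"escalation {rate:.0%}: ${cost:.5f} per request vs ${SONNET:.5f} Sonnet-only")

print(f"break-even escalation rate: {break_even_escalation_rate(HAIKU, CHECK, SONNET):.1%}")
```

With these placeholder numbers the approach stays cheaper than Sonnet-only until most requests escalate, but the margin shrinks as the quality-check prompt grows.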
Option 4 — Batch Classification with Haiku
Batch multiple simple classifications into a single Haiku call instead of one Sonnet call per item.
import anthropic
import json
from collections import Counter

client = anthropic.Anthropic()

def batch_classify(
    items: list[str],
    labels: list[str],
    instruction: str,
    batch_size: int = 20,
) -> list[str]:
    """
    Classify multiple items in batches using Haiku.
    Much cheaper than one Sonnet call per item.
    """
    all_results = []
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        numbered = "\n".join(f"{j+1}. {item}" for j, item in enumerate(batch))
        prompt = (
            f"{instruction}\n\n"
            f"Valid labels: {', '.join(labels)}\n\n"
            f"Items to classify:\n{numbered}\n\n"
            f"Reply with a JSON array of labels, one per item, in order. "
            f"Example: [\"label1\", \"label2\", ...]\n"
            f"Reply with ONLY the JSON array, no other text."
        )
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=len(batch) * 20,  # ~20 tokens per label
            messages=[{"role": "user", "content": prompt}]
        )
        text = response.content[0].text.strip()
        # Handle possible markdown fences
        if "```" in text:
            text = text.split("```")[1].strip()
            if text.startswith("json"):
                text = text[4:].strip()
        # Extract JSON array
        try:
            batch_labels = json.loads(text)
            if not isinstance(batch_labels, list):
                raise ValueError("expected a JSON array")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            # Fallback: one label per line
            batch_labels = [l.strip().strip('"').strip("'") for l in text.split("\n") if l.strip()]
        # Validate and normalize: unknown labels fall back to the first label
        batch_labels = [
            label if label in labels else labels[0]
            for label in batch_labels[:len(batch)]
        ]
        # Pad short replies so results stay aligned with the input items
        batch_labels += [labels[0]] * (len(batch) - len(batch_labels))
        all_results.extend(batch_labels)
        print(f" Batch {i//batch_size + 1}: classified {len(batch)} items")
    return all_results

# Test: classify 50 support tickets in bulk
ticket_samples = [
    "My order hasn't arrived after 2 weeks",
    "How do I change my password?",
    "The product is broken, I want a refund",
    "Do you ship to Canada?",
    "I was charged twice for the same order",
    "What are your business hours?",
    "I love this product, thank you!",
    "The website keeps crashing on mobile",
    "Can I return something I bought 6 months ago?",
    "You guys are the worst, I'm cancelling",
] * 5  # 50 tickets
labels = ["billing", "shipping", "returns", "technical", "general_inquiry", "complaint", "compliment"]
print(f"Classifying {len(ticket_samples)} tickets...")
results = batch_classify(
    ticket_samples, labels,
    "Classify each customer support ticket into exactly one category."
)
distribution = Counter(results)
print(f"\nClassification distribution:")
for label, count in distribution.most_common():
    print(f" {label}: {count}")
print(f"\nTotal classified: {len(results)} tickets")
# 3 Haiku calls (batches of 20) replace 50 per-item Sonnet calls
Expected Token Savings: Batching 20 items per call turns 50 items into 3 Haiku calls instead of 50 Sonnet calls. At the per-call estimates used here (3 × $0.0001 vs 50 × $0.003, that is $0.0003 vs $0.15), the cost reduction is over 99%.
Environment: Python 3.9+, anthropic>=0.40.0.
Option 5 — Pre-computed Classification Cache
Cache classification results so repeated identical inputs use no API calls at all.
import anthropic
import hashlib
import sqlite3
import time
from typing import Optional

client = anthropic.Anthropic()

class ClassificationCache:
    """SQLite-backed cache for classification results."""

    def __init__(self, db_path: str = ":memory:", ttl_seconds: int = 86400):
        self.db = sqlite3.connect(db_path, check_same_thread=False)
        self.ttl = ttl_seconds
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS cache (
                key TEXT PRIMARY KEY,
                result TEXT NOT NULL,
                model TEXT NOT NULL,
                cached_at REAL NOT NULL,
                hit_count INTEGER DEFAULT 0
            );
        """)
        self.db.commit()
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str, model: str) -> str:
        return hashlib.sha256(f"{model}:{prompt}".encode()).hexdigest()

    # Optional[str] rather than "str | None" so the code runs on Python 3.9
    def get(self, prompt: str, model: str) -> Optional[str]:
        key = self._key(prompt, model)
        row = self.db.execute(
            "SELECT result, cached_at FROM cache WHERE key=?", (key,)
        ).fetchone()
        if row:
            result, cached_at = row
            if time.time() - cached_at < self.ttl:
                self.db.execute(
                    "UPDATE cache SET hit_count=hit_count+1 WHERE key=?", (key,)
                )
                self.db.commit()
                self.hits += 1
                return result
        # Missing or expired entries both count as misses
        self.misses += 1
        return None

    def set(self, prompt: str, model: str, result: str):
        key = self._key(prompt, model)
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, result, model, cached_at) VALUES (?,?,?,?)",
            (key, result, model, time.time())
        )
        self.db.commit()

    def stats(self) -> dict:
        total = self.hits + self.misses
        row = self.db.execute("SELECT COUNT(*), SUM(hit_count) FROM cache").fetchone()
        return {
            "cache_entries": row[0],
            "total_hits": self.hits,
            "total_misses": self.misses,
            "hit_rate": self.hits / max(total, 1),
            "api_calls_saved": self.hits,
        }

cache = ClassificationCache(ttl_seconds=3600)

def cached_classify(text: str, labels: list[str], instruction: str) -> dict:
    """Classify with caching: repeated inputs never hit the API."""
    prompt = f"{instruction}\nLabels: {', '.join(labels)}\nText: {text}\nReply with label only."
    model = "claude-haiku-4-5-20251001"
    cached = cache.get(prompt, model)
    if cached is not None:  # an empty cached string is still a hit
        return {"result": cached, "from_cache": True, "model": model}
    response = client.messages.create(
        model=model,
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}]
    )
    result = response.content[0].text.strip()
    cache.set(prompt, model, result)
    return {"result": result, "from_cache": False, "model": model}

# Simulate production traffic with many repeated classifications
LABELS = ["positive", "negative", "neutral"]
INSTRUCTION = "Classify the sentiment of this text."
sample_texts = [
    "I love this product!",
    "This is terrible",
    "It's okay I guess",
    "I love this product!",  # repeat: cache hit
    "This is terrible",      # repeat: cache hit
    "Amazing quality!",
    "I love this product!",  # repeat: cache hit
    "Worst purchase ever",
    "I love this product!",  # repeat: cache hit
]
print("Processing with cache:\n")
for text in sample_texts:
    result = cached_classify(text, LABELS, INSTRUCTION)
    cache_marker = "[CACHE]" if result["from_cache"] else "[API] "
    print(f" {cache_marker} '{text[:40]}' → {result['result']}")
stats = cache.stats()
print(f"\nCache stats: {stats['hit_rate']:.0%} hit rate | {stats['api_calls_saved']} API calls saved")
Expected Token Savings: Cache eliminates API costs entirely on repeated inputs. Support bot handling 10,000 daily queries with 40% repeat rate: saves 4,000 API calls/day.
Environment: Python 3.9+, sqlite3, anthropic>=0.40.0.
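The cache above only hits on byte-identical prompts, so inputs that differ in casing or spacing are treated as new. One possible extension is to normalize the text before hashing. The sketch below is an assumption layered on top of the original design: `normalize_text` and `cache_key` are hypothetical helpers, and lowercasing is only safe when casing carries no signal for the classification task.

```python
import hashlib
import re

def normalize_text(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace to a single space."""
    return re.sub(r"\s+", " ", text.strip().lower())

def cache_key(text: str, model: str, labels: list[str], instruction: str) -> str:
    """Hash the model, instruction, label set, and normalized text.

    Labels are sorted so the same label set in a different order shares a key.
    The unit separator (\\x1f) keeps the joined fields from running together.
    """
    payload = "\x1f".join([model, instruction, ",".join(sorted(labels)), normalize_text(text)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

MODEL = "claude-haiku-4-5-20251001"
LABELS = ["positive", "negative", "neutral"]
INSTRUCTION = "Classify the sentiment of this text."

a = cache_key("I love this product!", MODEL, LABELS, INSTRUCTION)
b = cache_key("  i LOVE   this product!  ", MODEL, LABELS, INSTRUCTION)
print("same key:", a == b)
```

Keeping the model in the key, as the original `_key` method does, prevents a result produced by one model from being served for another.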
Option 6 — Cost Budget Enforcer with Model Downgrade
Track token costs per session and automatically downgrade models when budget thresholds are reached.
import anthropic
from dataclasses import dataclass, field

client = anthropic.Anthropic()

# Token costs per million tokens (input/output)
MODEL_COSTS = {
    "claude-haiku-4-5-20251001": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
    "claude-opus-4-6": {"input": 15.00, "output": 75.00},
}

@dataclass
class BudgetTracker:
    session_budget_usd: float
    spent_usd: float = 0.0
    call_count: int = 0
    downgrade_log: list[str] = field(default_factory=list)

    def record_call(self, model: str, input_tokens: int, output_tokens: int) -> float:
        costs = MODEL_COSTS.get(model, MODEL_COSTS["claude-sonnet-4-6"])
        cost = (input_tokens / 1_000_000 * costs["input"] +
                output_tokens / 1_000_000 * costs["output"])
        self.spent_usd += cost
        self.call_count += 1
        return cost

    def remaining_budget(self) -> float:
        return max(0.0, self.session_budget_usd - self.spent_usd)

    def budget_pct_used(self) -> float:
        return self.spent_usd / self.session_budget_usd

    def recommended_model(
        self,
        preferred: str,
        task_critical: bool = False,
    ) -> str:
        """
        Return the recommended model given current budget state.
        Downgrades non-critical tasks when budget is tight.
        """
        pct = self.budget_pct_used()
        if task_critical:
            return preferred  # never downgrade critical tasks
        if pct > 0.90:
            # Budget nearly exhausted: use Haiku for everything
            if preferred != "claude-haiku-4-5-20251001":
                self.downgrade_log.append(f"Downgraded {preferred} → haiku (budget {pct:.0%} used)")
            return "claude-haiku-4-5-20251001"
        if pct > 0.70 and preferred == "claude-opus-4-6":
            # Budget stressed: downgrade Opus to Sonnet
            self.downgrade_log.append(f"Downgraded opus → sonnet (budget {pct:.0%} used)")
            return "claude-sonnet-4-6"
        return preferred

budget = BudgetTracker(session_budget_usd=0.10)  # $0.10 session budget

def budget_aware_call(
    messages: list[dict],
    preferred_model: str = "claude-sonnet-4-6",
    task_critical: bool = False,
    max_tokens: int = 512,
) -> dict:
    """Make API call with automatic model downgrade when budget is tight."""
    effective_model = budget.recommended_model(preferred_model, task_critical)
    if effective_model != preferred_model:
        print(f" [budget] Downgraded {preferred_model} → {effective_model} "
              f"(spent ${budget.spent_usd:.4f}/${budget.session_budget_usd:.2f})")
    response = client.messages.create(
        model=effective_model,
        max_tokens=max_tokens,
        messages=messages,
    )
    cost = budget.record_call(
        effective_model,
        response.usage.input_tokens,
        response.usage.output_tokens,
    )
    return {
        "text": response.content[0].text,
        "model": effective_model,
        "cost_usd": cost,
        "budget_remaining": budget.remaining_budget(),
    }

# Simulate a session with many calls
queries = [
    ("Is this spam? 'Win $1000 NOW!'", False),
    ("Explain the CAP theorem in distributed systems", False),
    ("URGENT: Is this a security vulnerability? " + "A"*200, True),  # critical
    ("Summarize this in 2 sentences: " + "B"*500, False),
    ("What is 2 + 2?", False),
    ("Translate to Spanish: 'Good morning'", False),
]
for query, is_critical in queries:
    result = budget_aware_call(
        [{"role": "user", "content": query}],
        preferred_model="claude-sonnet-4-6",
        task_critical=is_critical,
    )
    print(f" Model: {result['model']} | Cost: ${result['cost_usd']:.5f} | "
          f"Remaining: ${result['budget_remaining']:.4f}")
    print(f" Response: {result['text'][:80]}\n")
print(f"\nTotal calls: {budget.call_count}")
print(f"Total spent: ${budget.spent_usd:.4f}")
print(f"Downgrades: {budget.downgrade_log}")
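Expected Token Savings: Budget-based downgrading slows spending as the session budget is consumed while preserving quality for critical tasks. It is not a hard cap: critical tasks are never downgraded, and non-critical calls continue on Haiku after the budget is exhausted. Typical savings: 40-60% on mixed-criticality workloads.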
Environment: Python 3.9+, anthropic>=0.40.0.
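The sample session above is small enough that it may never cross the 70% or 90% thresholds, so the downgrade path can go unexercised. The sketch below isolates the same threshold rules in a pure function that can be checked without API calls. `pick_model` is a hypothetical stand-in for `BudgetTracker.recommended_model`, with the budget fraction passed in directly rather than read from tracked spend.

```python
HAIKU = "claude-haiku-4-5-20251001"
SONNET = "claude-sonnet-4-6"
OPUS = "claude-opus-4-6"

def pick_model(preferred: str, pct_used: float, task_critical: bool = False) -> str:
    """Mirror of the downgrade rules: critical tasks keep their model,
    above 90% everything goes to Haiku, above 70% Opus drops to Sonnet."""
    if task_critical:
        return preferred
    if pct_used > 0.90:
        return HAIKU
    if pct_used > 0.70 and preferred == OPUS:
        return SONNET
    return preferred

# Walk one preferred model through increasing budget pressure.
for pct in (0.50, 0.75, 0.95):
    print(f"{pct:.0%} used: opus -> {pick_model(OPUS, pct)}, "
          f"critical opus -> {pick_model(OPUS, pct, task_critical=True)}")
```

Keeping the decision separate from the API call makes the thresholds easy to unit-test and to tune.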
Comparison
| Option | Routing Method | Extra LLM Calls | Adapts at Runtime | Batch Support |
|---|---|---|---|---|
| 1 — Task Type Routing | Keyword heuristic | None | No | No |
| 2 — Signal-Based | Input/output metrics | None | No | No |
| 3 — Haiku-First Escalation | Quality check | +1 Haiku/call | No | No |
| 4 — Batch Classification | None | None | No | Yes |
| 5 — Classification Cache | None | None | Yes (cache) | No |
| 6 — Budget Enforcer | Spend tracking | None | Yes | No |
Start with Option 1 (task type routing) for the largest immediate cost reduction on classification workloads. Add Option 5 (caching) for any repeated classification patterns. Use Option 6 (budget enforcer) in production to limit runaway costs during traffic spikes.
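One way the three recommended options could fit together is cache lookup first, then routing by task type, then a budget check before the call. The sketch below is an illustration under stated assumptions rather than a reference design: `Pipeline`, `call_model`, and the flat `cost_per_call` table are hypothetical, and the model call is injected as a plain function so the flow can be exercised without the API.

```python
from typing import Callable, Dict, Tuple

MODEL_FOR_TASK = {
    "simple": "claude-haiku-4-5-20251001",
    "standard": "claude-sonnet-4-6",
    "complex": "claude-opus-4-6",
}
CHEAPEST = MODEL_FOR_TASK["simple"]

class Pipeline:
    """Cache (Option 5) -> task routing (Option 1) -> budget downgrade (Option 6)."""

    def __init__(self, call_model: Callable[[str, str], str],
                 cost_per_call: Dict[str, float], budget_usd: float):
        self.call_model = call_model        # hypothetical: (model, prompt) -> text
        self.cost_per_call = cost_per_call  # assumed flat USD cost per call
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.cache: Dict[Tuple[str, str], str] = {}

    def handle(self, task_type: str, prompt: str) -> dict:
        key = (task_type, prompt)
        if key in self.cache:  # repeated input: no call, no spend
            return {"result": self.cache[key], "model": None, "from_cache": True}
        model = MODEL_FOR_TASK[task_type]
        if self.spent_usd / self.budget_usd > 0.90:  # budget nearly exhausted
            model = CHEAPEST
        result = self.call_model(model, prompt)
        self.spent_usd += self.cost_per_call[model]
        self.cache[key] = result
        return {"result": result, "model": model, "from_cache": False}

# Stub in place of a real API call, for a dry run.
def stub_call(model: str, prompt: str) -> str:
    return f"[{model}] ok"

costs = {MODEL_FOR_TASK["simple"]: 0.001, MODEL_FOR_TASK["standard"]: 0.01,
         MODEL_FOR_TASK["complex"]: 0.05}
pipe = Pipeline(stub_call, costs, budget_usd=0.05)
for task_type, prompt in [("simple", "Is this spam?"), ("simple", "Is this spam?"),
                          ("complex", "Design an auth system"),
                          ("complex", "Design a billing system")]:
    print(pipe.handle(task_type, prompt))
```

Checking the cache before routing means repeated inputs skip both the routing decision and the budget accounting.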