Agent Sends Full Conversation History on Every API Call — Ballooning Input Costs
Symptom
- API costs grow quadratically as conversation length increases
- A 100-turn conversation costs roughly 90x more than a 10-turn conversation, not 10x
- Input token count keeps growing even when the agent is answering simple questions
- Session costs spike for long-running autonomous agents
- History from 2 hours ago is still being sent on every call — most of it irrelevant
- Context window fills up before the agent finishes a long task — not from new content but from accumulated history
Root Cause
Conversation history accumulates linearly, but the cumulative bill grows quadratically. If a conversation has N turns averaging T tokens each, the call at turn N resends roughly N×T tokens of history, so the whole session bills about T×N(N+1)/2 input tokens. For example, a 100-turn conversation at 1,000 tokens per turn produces 100,000 tokens of content but bills 1,000 + 2,000 + … + 100,000 ≈ 5,050,000 input tokens, about 50x the content actually produced. The fix is to trim, summarize, or window the history so the per-call input token count stays bounded regardless of conversation length.
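The arithmetic above can be checked with a short sketch. It assumes every turn has the same size and that each call resends all retained turns; the function name `cumulative_input_tokens` and its `window` parameter are illustrative, not part of any SDK.

```python
from typing import Optional


def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            window: Optional[int] = None) -> int:
    """Total input tokens billed over a session.

    Call k resends k turns of history, or min(k, window) turns when a
    window is set. Assumes uniform turn sizes (a simplification).
    """
    total = 0
    for k in range(1, turns + 1):
        turns_sent = k if window is None else min(k, window)
        total += turns_sent * tokens_per_turn
    return total


unbounded_100 = cumulative_input_tokens(100, 1_000)
unbounded_200 = cumulative_input_tokens(200, 1_000)
windowed_100 = cumulative_input_tokens(100, 1_000, window=15)
windowed_200 = cumulative_input_tokens(200, 1_000, window=15)

print(unbounded_100)                            # 5050000
print(round(unbounded_100 / (100 * 1_000), 1))  # 50.5 (times the content produced)
print(round(unbounded_200 / unbounded_100, 2))  # 3.98 (doubling turns ~quadruples cost)
print(round(windowed_200 / windowed_100, 2))    # 2.08 (with a window, cost grows ~linearly)
```

Real sessions have uneven turn sizes, so treat the outputs as order-of-magnitude figures rather than a billing forecast.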
Fix
Option 1: Sliding window — keep only the most recent N turns
import anthropic
from dataclasses import dataclass, field

client = anthropic.Anthropic()


@dataclass
class SlidingWindowSession:
    """
    Keeps only the most recent N turns in the active context.
    Simple and effective: prevents unbounded history growth.
    """
    system: str
    window_size: int = 20  # Number of turns to keep (user+assistant pairs)
    _full_history: list = field(default_factory=list)

    @property
    def windowed_history(self) -> list[dict]:
        """Return only the most recent turns, starting on a user message"""
        if len(self._full_history) <= self.window_size * 2:
            return self._full_history
        window = self._full_history[-(self.window_size * 2):]
        # When called from send(), the history ends with the new user message,
        # so an even-sized slice begins with an assistant reply. Drop it so
        # the request opens with a user message.
        while window and window[0]["role"] != "user":
            window = window[1:]
        return window

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._full_history.append({"role": "user", "content": user_message})
        active = self.windowed_history
        dropped = len(self._full_history) - len(active)
        if dropped > 0:
            print(f"History: keeping {len(active)} messages, dropped {dropped} oldest")
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=self.system,
            messages=active
        )
        text = response.content[0].text
        self._full_history.append({"role": "assistant", "content": text})
        return text

    @property
    def token_estimate(self) -> dict:
        """Rough token estimate for the current window (about 4 characters per token)"""
        chars = sum(len(str(m.get("content", ""))) for m in self.windowed_history)
        total_chars = sum(len(str(m.get("content", ""))) for m in self._full_history)
        return {
            "window_tokens": chars // 4,
            "full_history_tokens": total_chars // 4,
            "savings_pct": f"{(1 - chars / max(total_chars, 1)) * 100:.0f}%"
        }


session = SlidingWindowSession(system="You are a helpful assistant.", window_size=15)
Option 2: Summarize-then-compress — replace old turns with a summary
import anthropic
from dataclasses import dataclass, field

client = anthropic.Anthropic()

SUMMARY_PROMPT = """Summarize the conversation below into a compact paragraph.
Preserve: decisions made, facts established, user preferences expressed, task progress.
Omit: pleasantries, repeated questions, verbose explanations.
Target: under 300 words.
Conversation:
{conversation}"""


@dataclass
class SummarizingSession:
    """
    When history exceeds a threshold, summarize the oldest portion.
    The summary is injected as context at the start of the active window.
    """
    system: str
    max_active_turns: int = 15    # Keep this many recent turns verbatim
    compress_when_over: int = 25  # Trigger compaction at this many turns
    _history: list = field(default_factory=list)
    _summary: str = ""
    _total_turns_ever: int = 0

    def _build_messages(self) -> list[dict]:
        """Build message list: optional summary injection + recent turns"""
        messages = []
        if self._summary:
            # Inject summary as first exchange
            messages.append({"role": "user", "content": "[Earlier conversation summary requested]"})
            messages.append({"role": "assistant", "content": f"[Summary of earlier conversation]\n{self._summary}"})
        messages.extend(self._history)
        return messages

    def _compress_history(self):
        """Compress oldest turns into summary, keep recent turns verbatim"""
        keep = self.max_active_turns * 2
        if len(self._history) <= keep:
            return
        split = len(self._history) - keep
        # Move the split forward to a user message so the kept window
        # alternates correctly after the injected summary exchange.
        while split < len(self._history) and self._history[split]["role"] != "user":
            split += 1
        to_compress = self._history[:split]
        to_keep = self._history[split:]
        # Build conversation text to summarize (each message capped at 500 chars)
        conv_text = "\n".join(
            f"{m['role'].upper()}: {m['content'][:500]}"
            for m in to_compress
        )
        # Include existing summary if present
        if self._summary:
            conv_text = f"[Previous summary]\n{self._summary}\n\n[New turns to add]\n{conv_text}"
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=400,
            messages=[{"role": "user", "content": SUMMARY_PROMPT.format(conversation=conv_text)}]
        )
        self._summary = response.content[0].text
        self._history = to_keep
        compressed_turns = len(to_compress) // 2
        print(f"Compressed {compressed_turns} turns into summary ({len(self._summary)} chars)")

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._history.append({"role": "user", "content": user_message})
        self._total_turns_ever += 1
        # Trigger compression if history is too long
        if len(self._history) > self.compress_when_over * 2:
            self._compress_history()
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=self.system,
            messages=self._build_messages()
        )
        text = response.content[0].text
        self._history.append({"role": "assistant", "content": text})
        return text

    @property
    def stats(self) -> dict:
        active_chars = sum(len(str(m.get("content", ""))) for m in self._history)
        return {
            "total_turns_ever": self._total_turns_ever,
            "active_turns": len(self._history) // 2,
            "summary_present": bool(self._summary),
            "active_tokens_est": active_chars // 4,
            "summary_tokens_est": len(self._summary) // 4
        }


summarizing_session = SummarizingSession(
    system="You are a helpful assistant.",
    max_active_turns=10,
    compress_when_over=20
)
Option 3: Token-budget-aware trimming — trim to fit within a budget
import anthropic

client = anthropic.Anthropic()

CHARS_PER_TOKEN = 4  # Rough heuristic, not an exact tokenizer


def estimate_tokens(messages: list[dict], system: str = "") -> int:
    """Estimate total input tokens for a message list"""
    total_chars = len(system)
    for m in messages:
        content = m.get("content", "")
        if isinstance(content, list):
            content = " ".join(str(c) for c in content)
        total_chars += len(str(content)) + 10  # +10 for role overhead
    return total_chars // CHARS_PER_TOKEN


def _start_on_user(messages: list[dict]) -> list[dict]:
    """Drop leading assistant messages so the request opens with a user message"""
    while messages and messages[0]["role"] != "user":
        messages = messages[1:]
    return messages


def trim_history_to_budget(
    history: list[dict],
    token_budget: int,
    system_tokens: int = 0,
    always_keep_last_n_turns: int = 3
) -> tuple[list[dict], int]:
    """
    Trim history to fit within a token budget.
    Always keeps the most recent messages (about the last N turns).
    Removes oldest messages first, then drops any leading assistant message.
    Returns (trimmed_history, estimated_tokens_used).
    """
    keep = always_keep_last_n_turns * 2
    # Guard keep == 0: history[-0:] would return the whole list
    protected = history[-keep:] if keep else []
    candidates = history[:-keep] if keep else list(history)
    remaining_budget = token_budget - system_tokens
    protected_tokens = estimate_tokens(protected)
    if protected_tokens > remaining_budget:
        print(f"WARNING: Protected turns ({protected_tokens} tokens) exceed budget ({remaining_budget})")
        protected = _start_on_user(protected)
        return protected, estimate_tokens(protected) + system_tokens
    # Add candidates from newest to oldest until budget is hit
    available = remaining_budget - protected_tokens
    included = []
    for turn in reversed(candidates):
        turn_tokens = estimate_tokens([turn])
        if turn_tokens <= available:
            included.insert(0, turn)
            available -= turn_tokens
        else:
            break  # Budget hit: stop adding older turns
    final = _start_on_user(included + protected)
    used_tokens = estimate_tokens(final) + system_tokens
    dropped = len(history) - len(final)
    if dropped > 0:
        print(f"Trimmed {dropped} old messages to fit {token_budget} token budget")
    return final, used_tokens


class BudgetedSession:
    """Session that trims history to stay within a per-call token budget"""

    def __init__(
        self,
        system: str,
        input_token_budget: int = 50_000,
        always_keep_last_n: int = 5
    ):
        self.system = system
        self.budget = input_token_budget
        self.keep_last = always_keep_last_n
        self._history: list[dict] = []
        self._system_tokens = len(system) // CHARS_PER_TOKEN

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._history.append({"role": "user", "content": user_message})
        trimmed, tokens_used = trim_history_to_budget(
            self._history,
            token_budget=self.budget,
            system_tokens=self._system_tokens,
            always_keep_last_n_turns=self.keep_last
        )
        print(f"Input: ~{tokens_used} tokens ({len(trimmed)} of {len(self._history)} messages)")
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=self.system,
            messages=trimmed
        )
        text = response.content[0].text
        self._history.append({"role": "assistant", "content": text})
        return text


budgeted = BudgetedSession(
    system="You are a helpful assistant.",
    input_token_budget=40_000,
    always_keep_last_n=5
)
Option 4: Importance-weighted history — keep high-value turns, drop filler
import anthropic
import json
import re
from dataclasses import dataclass

client = anthropic.Anthropic()

# Literal braces are doubled so str.format() only substitutes {turns}
IMPORTANCE_SCORER_PROMPT = """Score the importance of keeping each conversation turn for future reference.
Score 1-5: 5=critical (decisions, facts, errors), 3=useful (context), 1=disposable (pleasantries)
Return JSON: [{{"index": N, "score": N, "reason": "brief reason"}}, ...]
Turns to score:
{turns}"""


@dataclass
class ImportanceScoredTurn:
    role: str
    content: str
    score: float = 3.0    # Default to medium importance
    turn_index: int = 0
    scored: bool = False  # True once the scorer has rated this turn


class ImportanceFilteredSession:
    """
    Score the importance of each turn and drop low-importance ones when trimming.
    High-importance turns (decisions, errors, key facts) are kept in preference
    to low-scoring ones; the last 3 turns are always kept.
    """

    def __init__(
        self,
        system: str,
        max_turns: int = 30,
        score_every_n_turns: int = 10,
        min_score_to_keep: float = 2.5
    ):
        self.system = system
        self.max_turns = max_turns
        self.score_interval = score_every_n_turns
        self.min_score = min_score_to_keep
        self._turns: list[ImportanceScoredTurn] = []
        self._next_index = 0  # Monotonic, so ordering survives trimming
        self._sends = 0

    def _add_turn(self, role: str, content: str):
        self._turns.append(ImportanceScoredTurn(role, content, turn_index=self._next_index))
        self._next_index += 1

    def _score_turns(self, unscored: list[ImportanceScoredTurn]):
        """Score a batch of turns using Haiku"""
        if not unscored:
            return
        turns_text = "\n".join(
            f"{i}. {t.role}: {t.content[:200]}"
            for i, t in enumerate(unscored)
        )
        try:
            response = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=500,
                messages=[{"role": "user", "content": IMPORTANCE_SCORER_PROMPT.format(turns=turns_text)}]
            )
            text = response.content[0].text
            # Greedy match: capture the whole array even if a reason contains "]"
            match = re.search(r'\[.*\]', text, re.DOTALL)
            if match:
                scores = json.loads(match.group())
                for item in scores:
                    idx = item.get("index", 0)
                    if 0 <= idx < len(unscored):
                        unscored[idx].score = float(item.get("score", 3))
                        unscored[idx].scored = True
        except Exception as e:
            # On failure, turns keep the default score and are retried later
            print(f"Scoring error: {e}")

    def _trim_low_importance(self):
        """Remove low-importance old turns when over limit"""
        if len(self._turns) <= self.max_turns * 2:
            return
        before = len(self._turns)
        # Always keep last 6 messages (3 pairs) regardless of score
        protected = self._turns[-6:]
        candidates = self._turns[:-6]
        # Score candidates that have not been rated yet
        unscored = [t for t in candidates if not t.scored]
        if unscored:
            self._score_turns(unscored)
        # Keep highest-scoring candidates up to max_turns, above the minimum score
        keep_count = (self.max_turns * 2) - len(protected)
        ranked = sorted(candidates, key=lambda t: t.score, reverse=True)
        kept = sorted(
            (t for t in ranked[:keep_count] if t.score >= self.min_score),
            key=lambda t: t.turn_index
        )
        turns = kept + protected
        # Start on a user message. Dropping individual messages can still leave
        # consecutive same-role messages; verify your API version accepts them.
        while turns and turns[0].role != "user":
            turns = turns[1:]
        self._turns = turns
        dropped = before - len(self._turns)
        print(f"Importance trim: dropped {dropped} low-importance messages (min_score={self.min_score})")

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._add_turn("user", user_message)
        self._sends += 1
        # Count calls rather than messages: the message count is always odd
        # here, so a modulo test on it would never fire for an even interval.
        if self._sends % self.score_interval == 0:
            self._trim_low_importance()
        messages = [{"role": t.role, "content": t.content} for t in self._turns]
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=self.system,
            messages=messages
        )
        text = response.content[0].text
        self._add_turn("assistant", text)
        return text
Option 5: Cost monitor — alert when history cost exceeds threshold
import anthropic
from dataclasses import dataclass, field

client = anthropic.Anthropic()


@dataclass
class CostMonitoredSession:
    """
    Tracks actual token usage per call.
    Alerts when a call is expensive and recommends compaction when the
    history grows too large. It only reports; it does not trim history itself.
    """
    system: str
    cost_per_million_input: float = 3.00    # Sonnet input price
    cost_per_million_output: float = 15.00  # Sonnet output price
    alert_threshold_usd: float = 0.10       # Alert per call above $0.10
    compact_threshold_input_tokens: int = 50_000
    _history: list = field(default_factory=list)
    _total_cost: float = 0.0
    _calls: int = 0

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._history.append({"role": "user", "content": user_message})
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=self.system,
            messages=self._history
        )
        usage = response.usage
        input_cost = usage.input_tokens * self.cost_per_million_input / 1_000_000
        output_cost = usage.output_tokens * self.cost_per_million_output / 1_000_000
        call_cost = input_cost + output_cost
        self._total_cost += call_cost
        self._calls += 1
        if call_cost > self.alert_threshold_usd:
            print(
                f"COST ALERT: Call #{self._calls} cost ${call_cost:.4f} "
                f"({usage.input_tokens} input, {usage.output_tokens} output tokens). "
                f"Session total: ${self._total_cost:.4f}"
            )
        if usage.input_tokens > self.compact_threshold_input_tokens:
            print(
                f"HISTORY TOO LARGE: {usage.input_tokens} input tokens. "
                f"Consider enabling summarization or sliding window."
            )
        text = response.content[0].text
        self._history.append({"role": "assistant", "content": text})
        print(
            f"Call #{self._calls}: ${call_cost:.4f} "
            f"({usage.input_tokens}in + {usage.output_tokens}out tokens)"
        )
        return text

    @property
    def cost_report(self) -> dict:
        return {
            "total_cost_usd": round(self._total_cost, 4),
            "total_calls": self._calls,
            "avg_cost_per_call": round(self._total_cost / max(self._calls, 1), 4),
            "history_turns": len(self._history) // 2
        }
Option 6: Prompt caching — avoid re-billing static history with Anthropic cache
import anthropic

client = anthropic.Anthropic()


def build_cached_messages(
    static_context: list[dict],
    recent_turns: list[dict]
) -> list[dict]:
    """
    Use Anthropic prompt caching to avoid re-billing static conversation history.
    Static context (older turns) gets a cache_control breakpoint.
    Recent turns are always sent fresh.
    Cost model with caching:
    - First call: cached prefix is written, billed at a premium over the
      base input price (1.25x for the default 5-minute cache)
    - Subsequent calls within the cache lifetime: pay 10% for cached tokens
      + full price for recent tokens
    Prefixes shorter than the model's minimum cacheable length are not cached.
    """
    if not static_context:
        return recent_turns
    # Add cache breakpoint after the last static turn
    cached = []
    for i, turn in enumerate(static_context):
        if i == len(static_context) - 1:
            # Last static turn: add cache control
            content = turn["content"]
            if isinstance(content, str):
                content = [{"type": "text", "text": content, "cache_control": {"type": "ephemeral"}}]
            cached.append({"role": turn["role"], "content": content})
        else:
            cached.append(turn)
    return cached + recent_turns


class CacheOptimizedSession:
    """
    Session that uses prompt caching for older conversation history.
    Older turns are marked for caching and billed at 10% on repeat calls.
    Recent turns (last 5) are always fresh.
    """

    def __init__(self, system: str, cache_after_n_turns: int = 10):
        self.system = system
        self.cache_threshold = cache_after_n_turns
        self._history: list[dict] = []

    def send(self, user_message: str, model: str = "claude-sonnet-4-6") -> str:
        self._history.append({"role": "user", "content": user_message})
        # Split into cacheable (old) and fresh (recent) turns once the session
        # is longer than the threshold; keep the last 10 messages (5 turns) fresh
        if len(self._history) > self.cache_threshold * 2:
            split_point = max(0, len(self._history) - 10)
        else:
            split_point = 0
        static = self._history[:split_point]
        recent = self._history[split_point:]
        messages = build_cached_messages(static, recent)
        # Prompt caching is generally available, so no beta header is passed;
        # older API versions needed "anthropic-beta: prompt-caching-2024-07-31".
        response = client.messages.create(
            model=model,
            max_tokens=2048,
            system=[{"type": "text", "text": self.system, "cache_control": {"type": "ephemeral"}}],
            messages=messages
        )
        usage = response.usage
        # These fields can be missing or None, so normalize to 0
        cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
        cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
        if cache_read > 0:
            savings = cache_read * 0.90 * 3.00 / 1_000_000  # 90% discount at $3/M input
            print(f"Cache: {cache_read} tokens read (saved ~${savings:.4f}), {cache_write} tokens written")
        text = response.content[0].text
        self._history.append({"role": "assistant", "content": text})
        return text


cached_session = CacheOptimizedSession(
    system="You are a helpful assistant.",
    cache_after_n_turns=10
)
History Cost Comparison
| Strategy | Input Tokens per Call (at turn 50) | Relative Cost | Best For |
|---|---|---|---|
| No trimming (baseline) | ~250,000 | 100% | Never |
| Sliding window (15 turns) | ~75,000 | 30% | Most agents |
| Summarize + window | ~40,000 | 16% | Long sessions |
| Budget-based trim | ≤50,000 | ≤20% | Cost-sensitive |
| Prompt caching | 250,000 (10% rate) | ~15% | Stable history |
| Importance filtering | ~60,000 | 24% | Complex tasks |
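The prompt caching row can be sanity-checked with a small sketch. It expresses one call's input cost in full-price token equivalents. The function name and the split of tokens are illustrative assumptions, and the multipliers (0.10 for cache reads, 1.25 for 5-minute cache writes) are assumed from Anthropic's published pricing, so confirm them against the current price list.

```python
def effective_input_tokens(cache_read: int, cache_write: int, uncached: int,
                           read_multiplier: float = 0.10,
                           write_multiplier: float = 1.25) -> float:
    """Input cost of one call, in full-price token equivalents.

    Multipliers are assumptions: reads at 10% of the base input price,
    writes at 125% (default 5-minute cache).
    """
    return (cache_read * read_multiplier
            + cache_write * write_multiplier
            + uncached)


baseline = 250_000  # uncached history at turn 50, from the table above

# Assumed split: 240,000 tokens already cached, newest 10,000 written to cache
cost = effective_input_tokens(cache_read=240_000, cache_write=10_000, uncached=0)

print(round(cost))                        # 36500
print(round(cost / baseline * 100, 1))    # 14.6 (percent of the uncached cost)
```

Under this split the result lands near the table's ~15%. Keeping more recent turns uncached, as `CacheOptimizedSession` does, raises the figure, and a cache that expires between calls is billed as a fresh write.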
Expected Token Savings
- 100-turn session without trimming: ~5M input tokens billed
- Sliding window (15 turns): ~750K input tokens, an 85% cost reduction
- Summarize + window: ~400K input tokens, a 92% cost reduction
Environment
- Any agent running multi-turn conversations or long autonomous sessions; input cost growth is quadratic without trimming — implement at session start, not after costs spike in production
- Source: direct experience; unbounded history is the most common cause of unexpectedly high API bills in the second month of production operation, after sessions grow long