Symptom
User asks: “What is 2 + 2?”
Agent responds:
“Great question! Mathematics is a fascinating field that has been studied for thousands of years. The operation you are asking about is called addition, which is one of the four basic arithmetic operations. When we add the number 2 to itself, we apply the commutative property of addition… The answer is 4.”
The answer is buried under roughly 200 tokens of preamble. In high-throughput applications, that padding can multiply output cost and latency by 5–10x.
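A rough way to see where a 5–10x figure can come from is to compare output-token counts directly. The sketch below uses hypothetical numbers (a 200-token verbose reply against a 20 to 40 token direct reply) and a hypothetical helper, `output_multiplier`; real counts depend on the model and its tokenizer.

```python
def output_multiplier(verbose_tokens: int, concise_tokens: int) -> float:
    """How many times more output tokens the verbose reply uses."""
    if concise_tokens <= 0:
        raise ValueError("concise_tokens must be positive")
    return verbose_tokens / concise_tokens


# Hypothetical counts, for illustration only.
print(output_multiplier(200, 40))  # 5.0
print(output_multiplier(200, 20))  # 10.0
```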
Root Cause
By default, Claude aims to be thorough and helpful. Without an explicit length constraint, it optimises for completeness rather than brevity. The model has no signal that the user wants a short answer.
Fix
Option 1 — Explicit Brevity Instruction in System Prompt
Add a clear, unambiguous brevity directive to the system prompt. Use specific constraints (“one sentence”, “under 20 words”) rather than vague guidance (“be brief”).
import anthropic

client = anthropic.Anthropic()

CONCISE_SYSTEM_PROMPT = """You are a concise assistant.
RESPONSE RULES:
- Answer in ONE sentence unless the question genuinely requires more
- Maximum 30 words for factual questions
- No preamble ("Great question!", "Of course!", "Certainly!")
- No postamble ("I hope that helps!", "Let me know if you need more!")
- No restating the question
- Never apologise for being brief
If the question requires a list: use a maximum of 5 bullet points, one line each.
If the question requires code: provide the code with a one-line explanation only."""


def ask_concise(question: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        system=CONCISE_SYSTEM_PROMPT,
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text


# Test conciseness
questions = [
    "What is the capital of France?",
    "What is 2 + 2?",
    "What does HTTP stand for?",
    "What is the speed of light?",
]
for q in questions:
    answer = ask_concise(q)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print(f" ({len(answer.split())} words)")
    print()
Expected Token Savings: ~70% reduction in output tokens for factual Q&A
Environment: pip install anthropic
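The ~70% figure is worth checking against your own traffic. A minimal sketch, assuming you record the `usage.output_tokens` value that the Messages API returns for each response, once without and once with the concise system prompt. `reduction_pct` is a hypothetical helper and the sample counts are made up.

```python
def reduction_pct(baseline_tokens: list[int], concise_tokens: list[int]) -> float:
    """Percentage drop in total output tokens between two runs of the same questions."""
    baseline_total = sum(baseline_tokens)
    if baseline_total == 0:
        raise ValueError("baseline run produced no output tokens")
    return (1 - sum(concise_tokens) / baseline_total) * 100


# Made-up counts; in practice collect response.usage.output_tokens per call.
baseline = [120, 95, 140, 110]
concise = [12, 9, 15, 14]
print(f"{reduction_pct(baseline, concise):.1f}% fewer output tokens")
```

Measuring tokens rather than words matters here, because billing and latency both follow token counts.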
Option 2 — Question Complexity Classifier + Adaptive Max Tokens
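Before sending the question to the main model, classify its complexity with a cheap Haiku call. Use the resulting label to set max_tokens dynamically, so that simple questions get a strict token budget. The classifier call adds its own input tokens and a round trip, so the net saving depends on how much output it avoids.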
import json

import anthropic

client = anthropic.Anthropic()


def classify_complexity(question: str) -> dict:
    """Returns {complexity: 'simple'|'moderate'|'complex', max_tokens: int}"""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=64,
        system=(
            'Classify the question complexity. '
            'Respond with ONLY valid JSON: {"complexity": "simple"|"moderate"|"complex"}'
        ),
        messages=[{"role": "user", "content": question}],
    )
    try:
        result = json.loads(response.content[0].text.strip())
        complexity = result.get("complexity", "moderate")
    except (json.JSONDecodeError, AttributeError):
        # Non-JSON output, or JSON that is not an object
        complexity = "moderate"
    token_budgets = {
        "simple": 64,
        "moderate": 512,
        "complex": 2048,
    }
    if complexity not in token_budgets:
        # Guard against labels outside the three expected values
        complexity = "moderate"
    return {
        "complexity": complexity,
        "max_tokens": token_budgets[complexity],
    }


SYSTEM_BY_COMPLEXITY = {
    "simple": "Answer in one short sentence. No preamble. No explanation unless asked.",
    "moderate": "Be clear and concise. Answer the question directly in 2-4 sentences.",
    "complex": "Provide a thorough, well-structured answer. Use headers and lists as needed.",
}


def adaptive_answer(question: str) -> tuple[str, str, int]:
    classification = classify_complexity(question)
    complexity = classification["complexity"]
    max_tokens = classification["max_tokens"]
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=max_tokens,
        system=SYSTEM_BY_COMPLEXITY[complexity],
        messages=[{"role": "user", "content": question}],
    )
    answer = response.content[0].text
    return answer, complexity, len(answer.split())


test_questions = [
    "What year did World War II end?",
    "How does HTTPS work?",
    "Explain the CAP theorem and its implications for distributed database design.",
]
for q in test_questions:
    answer, complexity, word_count = adaptive_answer(q)
    print(f"Q: {q}")
    print(f"Complexity: {complexity} | Words: {word_count}")
    print(f"A: {answer}")
    print()
Expected Token Savings: ~80% for simple questions, ~50% for moderate vs unconstrained
Environment: pip install anthropic
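One caveat with tight budgets: `max_tokens` is a hard cut-off, not a length target, so a misclassified question can come back truncated mid-sentence. The Messages API reports this case as `stop_reason == "max_tokens"`. A minimal sketch of a check, where `was_truncated` is a hypothetical helper and the stand-in response objects exist only so the example runs without an API call:

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """True when generation stopped because the max_tokens budget ran out."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in objects that mimic the one attribute read from a real response.
cut_off = SimpleNamespace(stop_reason="max_tokens")
finished = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off))   # True
print(was_truncated(finished))  # False
```

A caller could use this to retry a truncated reply with the next-larger budget instead of returning a clipped answer.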
Option 3 — Response Length Validator with Regeneration
After receiving a response, check if it exceeds a word count threshold. If so, request a shorter version with the specific word target.
import anthropic

client = anthropic.Anthropic()

WORD_LIMITS = {
    "factual": 15,
    "explanation": 80,
    "tutorial": 300,
}


def get_word_limit(question: str) -> tuple[str, int]:
    q_lower = question.lower()
    if any(w in q_lower for w in ["how to", "explain how", "step by step", "tutorial"]):
        return "tutorial", WORD_LIMITS["tutorial"]
    if any(w in q_lower for w in ["explain", "why", "what is", "describe"]):
        return "explanation", WORD_LIMITS["explanation"]
    return "factual", WORD_LIMITS["factual"]


def ask_with_length_enforcement(question: str, max_retries: int = 2) -> str:
    question_type, word_limit = get_word_limit(question)
    messages = [{"role": "user", "content": question}]
    for attempt in range(max_retries + 1):
        response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            # tokens ≈ words * 1.3; the 3x multiplier leaves headroom so a
            # slightly long answer is measured rather than cut off
            max_tokens=max(word_limit * 3, 64),
            system=f"Answer concisely. Target: under {word_limit} words.",
            messages=messages,
        )
        answer = response.content[0].text
        word_count = len(answer.split())
        # Accept answers up to 50% over the target
        if word_count <= word_limit * 1.5:
            return answer
        if attempt < max_retries:
            messages.append({"role": "assistant", "content": answer})
            messages.append({
                "role": "user",
                "content": (
                    f"Too long ({word_count} words). Please answer in under {word_limit} words. "
                    "Remove all preamble and explanation. Give only the core answer."
                ),
            })
    # Return last attempt even if still long
    return answer


examples = [
    "What is the boiling point of water in Celsius?",
    "What is machine learning?",
    "How do I reverse a list in Python?",
]
for q in examples:
    answer = ask_with_length_enforcement(q)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print(f" ({len(answer.split())} words)")
    print()
Expected Token Savings: ~65% output reduction; slight overhead on retry turns
Environment: pip install anthropic
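The `max_tokens` expression in this option mixes two ideas: the rule of thumb that one English word is roughly 1.3 tokens, and extra headroom so that a slightly long answer is not cut off before it can be measured. A sketch that makes both explicit. `token_budget`, the 1.3 ratio, and the headroom factor are assumptions to tune, not API constants.

```python
import math


def token_budget(word_limit: int, tokens_per_word: float = 1.3,
                 headroom: float = 1.5, floor: int = 64) -> int:
    """Convert a word limit into a max_tokens value with some slack."""
    estimate = math.ceil(word_limit * tokens_per_word * headroom)
    # Never go below the floor, so very short limits still get a usable budget.
    return max(estimate, floor)


for limit in (15, 80, 300):
    print(limit, "words ->", token_budget(limit), "max_tokens")
```

Matching the headroom factor to the validator's acceptance tolerance keeps the two thresholds from disagreeing.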
Option 4 — Prefill Anchoring for Direct Answers
Use the assistant prefill to force the model into answer-first mode. Starting the response with the key fact leaves no room for preamble.
import anthropic

client = anthropic.Anthropic()

# Prefills must not end in whitespace: the API rejects a final assistant
# message that has trailing whitespace.
PREFILLS = {
    "what_is": "",      # Let model start directly
    "yes_no": "",       # Do not force "Yes"/"No": a forced prefix can make the answer wrong
    "number": "",       # Let model state the number
    "list": "-",        # Force bullet list
    "definition": "A",  # Force article-first definition
}


def detect_prefill(question: str) -> str:
    q = question.strip().lower()
    if q.startswith(("is ", "does ", "can ", "should ", "will ", "are ")):
        # The system prompt, not the prefill, asks for a Yes/No opening
        return PREFILLS["yes_no"]
    if q.startswith(("list ", "give me a list", "what are")):
        return PREFILLS["list"]
    if q.startswith("define "):
        return PREFILLS["definition"]
    return ""


def ask_direct(question: str) -> str:
    prefill = detect_prefill(question)
    messages = [{"role": "user", "content": question}]
    if prefill:
        messages.append({"role": "assistant", "content": prefill})
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=128,
        system=(
            "You are a direct assistant. Answer immediately without preamble. "
            "One sentence for facts. Bullet points for lists. "
            "Start yes/no answers with Yes or No."
        ),
        messages=messages,
    )
    raw = response.content[0].text
    # The response continues from the prefill, so stitch the two together
    return (prefill + raw).strip()


test_cases = [
    "Is Python dynamically typed?",
    "What is the capital of Japan?",
    "List the primary colors.",
    "Define REST API.",
    "What is 15% of 200?",
]
for q in test_cases:
    answer = ask_direct(q)
    print(f"Q: {q}")
    print(f"A: {answer}")
    print()
Expected Token Savings: ~75% reduction; the prefill itself adds only a few input tokens
Environment: pip install anthropic
Option 5 — User-Controlled Verbosity Level
Let users set a verbosity preference (terse / normal / detailed) that persists across the session. Inject the matching instruction and token budget on every request.
import anthropic
from enum import Enum


class Verbosity(str, Enum):
    TERSE = "terse"
    NORMAL = "normal"
    DETAILED = "detailed"


VERBOSITY_INSTRUCTIONS = {
    Verbosity.TERSE: (
        "VERBOSITY: TERSE\n"
        "- Maximum 15 words per answer\n"
        "- Facts only: no context, no explanation\n"
        "- No punctuation beyond commas and periods\n"
        "- Single bullet per list item, max 5 items"
    ),
    Verbosity.NORMAL: (
        "VERBOSITY: NORMAL\n"
        "- 1-3 sentences for most answers\n"
        "- Include brief context only when essential\n"
        "- No preamble or filler phrases"
    ),
    Verbosity.DETAILED: (
        "VERBOSITY: DETAILED\n"
        "- Full explanation with examples\n"
        "- Use headers and structured formatting\n"
        "- Cover edge cases and nuances"
    ),
}

MAX_TOKENS_BY_VERBOSITY = {
    Verbosity.TERSE: 64,
    Verbosity.NORMAL: 256,
    Verbosity.DETAILED: 2048,
}


class VerbosityAwareAgent:
    def __init__(self, verbosity: Verbosity = Verbosity.NORMAL):
        self.client = anthropic.Anthropic()
        self.verbosity = verbosity
        self.messages: list[dict] = []

    def set_verbosity(self, level: Verbosity):
        self.verbosity = level
        print(f"[Verbosity set to: {level.value}]")

    def chat(self, user_message: str) -> str:
        # Check for verbosity command
        lower = user_message.lower().strip()
        if lower in ("terse", "brief", "short"):
            self.set_verbosity(Verbosity.TERSE)
            return "Got it, I'll keep answers very short."
        if lower in ("normal", "default"):
            self.set_verbosity(Verbosity.NORMAL)
            return "Switched to normal response length."
        if lower in ("detailed", "verbose", "explain more"):
            self.set_verbosity(Verbosity.DETAILED)
            return "I'll give detailed explanations from now on."

        self.messages.append({"role": "user", "content": user_message})
        response = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=MAX_TOKENS_BY_VERBOSITY[self.verbosity],
            system=VERBOSITY_INSTRUCTIONS[self.verbosity],
            messages=self.messages,
        )
        reply = response.content[0].text
        self.messages.append({"role": "assistant", "content": reply})
        return reply


agent = VerbosityAwareAgent(verbosity=Verbosity.TERSE)
questions = [
    "What is the speed of light?",
    "What is Python?",
    "detailed",  # Switch to detailed
    "Explain how TCP/IP works.",
]
for q in questions:
    print(f"User: {q}")
    print(f"Agent [{agent.verbosity.value}]: {agent.chat(q)}")
    print()
Expected Token Savings: ~80% in TERSE mode vs unconstrained; user controls the trade-off
Environment: pip install anthropic
Option 6 — Post-Processing Trimmer for API Applications
For machine-to-machine contexts where responses are processed programmatically, post-process the output to strip known verbosity patterns before returning to the caller.
import re

import anthropic

client = anthropic.Anthropic()

# Patterns that indicate unnecessary verbosity.
# The first pattern requires trailing punctuation so that answers such as
# "Great Britain is..." or "Surely..." are left intact.
PREAMBLE_PATTERNS = [
    r"^((Great|Good|Excellent) question|Sure|Of course|Certainly|Absolutely|Happy to help)[!,.]\s*",
    r"^(That'?s? a (great|good|excellent|interesting) question[!.]?\s*)",
    r"^(I'?d be (happy|glad|delighted) to (help|answer|explain)[.!]?\s*)",
    r"^(Thank you for (asking|your question)[.!]?\s*)",
]
POSTAMBLE_PATTERNS = [
    r"\s*(I hope (that|this) (helps?|answers? your question)[.!]?\s*)$",
    r"\s*(Let me know if you (need|have) (any )?(more|other|additional) (questions?|help)[.!]?\s*)$",
    r"\s*(Feel free to ask if you need (more|further) (information|clarification|details?)[.!]?\s*)$",
    r"\s*(Is there anything else I can help (you with)?[?!]?\s*)$",
    r"\s*(Please (let me know|feel free to ask) if (you have|there are) (any )?more (questions?|queries?)[.!]?\s*)$",
]


def strip_verbosity(text: str) -> str:
    # Strip preamble
    for pattern in PREAMBLE_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Strip postamble
    for pattern in POSTAMBLE_PATTERNS:
        text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse multiple blank lines
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def ask_trimmed(question: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    raw = response.content[0].text
    trimmed = strip_verbosity(raw)
    # Character-based estimate of how much text was removed
    savings_pct = (1 - len(trimmed) / len(raw)) * 100 if raw else 0
    print(f"[Trimmed {savings_pct:.0f}% of output]")
    return trimmed


# Demo: compare raw vs trimmed (reuses the client created above)
questions = [
    "What is the tallest mountain in the world?",
    "What does JSON stand for?",
]
for q in questions:
    raw_resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": q}],
    )
    raw = raw_resp.content[0].text
    trimmed = strip_verbosity(raw)
    print(f"Q: {q}")
    print(f"RAW ({len(raw.split())}w): {raw[:120]}...")
    print(f"TRIMMED ({len(trimmed.split())}w): {trimmed}")
    print()
Expected Token Savings: None on the API bill, because output tokens are billed when they are generated; trimming afterwards only reduces downstream processing and UI noise
Environment: pip install anthropic
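Because regex trimming can also eat legitimate content, it helps to pin the behaviour down with a few offline cases before pointing it at live output. A minimal sketch with a deliberately reduced pattern set: `OPENERS`, `CLOSERS`, and `trim` are illustrative names, not the full lists from this option.

```python
import re

# Reduced, illustrative patterns. Requiring punctuation after the opener
# keeps answers such as "Great Britain is an island." intact.
OPENERS = re.compile(
    r"^(?:(?:great|good) question|sure|of course|certainly)[!,.]\s*", re.I
)
CLOSERS = re.compile(
    r"\s*(?:i hope (?:that|this) helps|let me know if you need more)[.!]?\s*$", re.I
)


def trim(text: str) -> str:
    """Remove one leading filler phrase and one trailing filler phrase."""
    text = OPENERS.sub("", text, count=1)
    text = CLOSERS.sub("", text, count=1)
    return text.strip()


cases = {
    "Great question! The answer is 4.": "The answer is 4.",
    "Great Britain is an island.": "Great Britain is an island.",
    "Paris. I hope that helps!": "Paris.",
}
for raw, expected in cases.items():
    assert trim(raw) == expected, (raw, trim(raw))
print("all cases pass")
```

Adding a case each time a pattern misfires in production keeps the trimmer from regressing.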
Comparison
| Option | Implementation | Token Reduction | User Control | Best For |
|---|---|---|---|---|
| Explicit System Prompt | Trivial | ~70% | No | All agents |
| Complexity Classifier | Medium | ~80% simple | No | Mixed Q&A apps |
| Length Validator | Medium | ~65% | No | Compliance-critical |
| Prefill Anchoring | Low | ~75% | No | Single-answer bots |
| User Verbosity Control | Medium | ~80% terse | Yes | Consumer chat apps |
| Post-Processing Trimmer | Low | Cosmetic | No | API pipeline cleanup |
Recommended starting point: Option 1 (Explicit System Prompt) — apply to every agent immediately. Add Option 2 (Complexity Classifier) for high-throughput applications where output token cost matters.