High Time to First Token — Agent Waits for Full Response Before Displaying Anything

Symptom

Agent says nothing for 15–30 seconds, then displays the full response instantly
Users think the agent is broken or frozen
Nothing appears incrementally: a 5-word reply and a 500-word reply both show up all at once, only after generation has finished
Chat interface shows loading spinner for the entire generation time
Switching to a different model doesn’t help — the delay is the same regardless

Root Cause

Streaming is disabled. The agent waits for the API to complete the entire response before returning it, even if the first tokens were ready 100ms after the request was made. With streaming enabled, first tokens typically appear in <1 second.
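To make the difference concrete, here is a minimal simulation that makes no API calls. `fake_model` is a hypothetical stand-in for a model that emits tokens at a fixed pace, and the delay value is illustrative rather than measured. It only shows why buffering hides tokens that already exist.

```python
import time
from typing import Iterator


def fake_model(n_tokens: int, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for an LLM: emits one token every delay_s seconds."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i} "


def time_to_first_output(n_tokens: int, streaming: bool) -> float:
    """Seconds until the caller has something it could display."""
    start = time.perf_counter()
    tokens = fake_model(n_tokens)
    if streaming:
        next(tokens)      # show the first token as soon as it exists
    else:
        "".join(tokens)   # buffer the whole response before showing anything
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"buffered: {time_to_first_output(50, streaming=False):.2f}s")
    print(f"streamed: {time_to_first_output(50, streaming=True):.2f}s")
```

In the buffered case the wait grows with the number of tokens; in the streamed case it stays at roughly one token's delay regardless of response length.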

Fix

Option 1: Anthropic SDK — enable streaming

from anthropic import Anthropic

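client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
user_message = "Explain streaming in one paragraph."  # placeholder prompt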

# SLOW — waits for full response
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_message}]
)
print(response.content[0].text)

# FAST — streams as generated
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_message}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

Option 2: Async streaming for web applications

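from anthropic import AsyncAnthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

client = AsyncAnthropic()
app = FastAPI()

def sse_event(text: str) -> str:
    # One "data:" line per line of text, so a newline inside a chunk
    # cannot end the event early
    return "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"

@app.post("/chat")
async def chat(message: str):  # a bare str parameter is read from the query string
    async def generate():
        async with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[{"role": "user", "content": message}]
        ) as stream:
            async for text in stream.text_stream:
                yield sse_event(text)  # Server-Sent Events format
        yield sse_event("[DONE]")  # app-level end marker, not part of SSE itself

    return StreamingResponse(generate(), media_type="text/event-stream")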
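On the receiving side, the client has to split the byte stream back into events. Because the endpoint is a POST, the browser's `EventSource` (which only issues GET requests) cannot consume it directly, so a client typically reads the response body line by line. The sketch below is one way to do that parsing in Python; `parse_sse` is a hypothetical helper, not part of any SDK, and it assumes the `[DONE]` end marker used by the endpoint above.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str], done_marker: str = "[DONE]") -> Iterator[str]:
    """Yield the data payload of each SSE event until the done marker arrives."""
    data_lines = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                       # a blank line terminates an event
            if data_lines:
                payload = "\n".join(data_lines)
                data_lines = []
                if payload == done_marker:
                    return
                yield payload
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # In the SSE format, one leading space after the colon is dropped.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and comments are ignored here.


if __name__ == "__main__":
    wire = ["data: Hello\n", "\n", "data:  world\n", "\n", "data: [DONE]\n", "\n"]
    print("".join(parse_sse(wire)))
```

Any line iterator works as input, for example the line iterator of a streaming HTTP response.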

Option 3: OpenClaw config — enable streaming

# openclaw.config.yaml
providers:
  anthropic:
    streaming: true
    stream_buffer_ms: 0    # Flush immediately (don't buffer)

Option 4: Telegram streaming simulation

Telegram doesn’t support true streaming, but you can approximate it by sending a placeholder message and editing it as text arrives:

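import time

from anthropic import AsyncAnthropic
from telegram import Bot

client = AsyncAnthropic()
MIN_EDIT_INTERVAL = 0.5  # seconds between edits, to stay under Telegram rate limits

async def stream_to_telegram(chat_id, message, bot: Bot):
    """Show the response as it's generated by editing one message in place"""
    msg = await bot.send_message(chat_id, "▋")  # Typing cursor

    accumulated = ""
    last_edit = time.monotonic()

    # "..." stands for the same model / max_tokens / messages arguments as in
    # Option 2, with `message` as the user content
    async with client.messages.stream(...) as stream:
        async for chunk in stream.text_stream:
            accumulated += chunk

            # Edit at most every 500ms, without pausing the stream itself
            now = time.monotonic()
            if now - last_edit >= MIN_EDIT_INTERVAL:
                await msg.edit_text(accumulated + "▋")
                last_edit = now

    await msg.edit_text(accumulated)  # Final message without cursor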
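The rate-limiting part of Option 4 can be isolated into a small time-based helper, which makes it easy to test without Telegram or the API. This is a sketch under assumptions: `EditThrottle` and its `ready` method are hypothetical names, and the 0.5s default simply mirrors the 500ms interval mentioned in the code comment above.

```python
import time
from typing import Callable, Optional


class EditThrottle:
    """Allow an action at most once every min_interval_s seconds."""

    def __init__(self, min_interval_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so tests can control time
        self._last: Optional[float] = None  # time of the last allowed action

    def ready(self) -> bool:
        """Return True (and record the time) if enough time has passed."""
        now = self.clock()
        if self._last is None or now - self._last >= self.min_interval_s:
            self._last = now
            return True
        return False
```

Inside the streaming loop, the edit would then be guarded with `if throttle.ready():` before calling `msg.edit_text(...)`. Unlike sleeping after each edit, this never delays reading the next chunk from the stream.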

Performance Comparison

Mode	Short response (50 tokens)	Long response (500 tokens)
No streaming	3s wait → instant display	15s wait → instant display
Streaming	<1s first token → 2s total	<1s first token → 12s total

Perceived latency drops from 3–15s to <1s with streaming.
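The arithmetic behind the table can be sketched with a simple model: without streaming, the user waits for the first token plus the entire generation; with streaming, only for the first token. The time-to-first-token (0.8s) and generation speed (40 tokens/s) below are assumed values chosen to land near the table's figures, not measurements.

```python
def wait_before_first_text(n_tokens: int, ttft_s: float,
                           tokens_per_s: float, streaming: bool) -> float:
    """Seconds the user stares at a spinner before any text is visible."""
    generation_s = n_tokens / tokens_per_s
    return ttft_s if streaming else ttft_s + generation_s


if __name__ == "__main__":
    for n in (50, 500):
        buffered = wait_before_first_text(n, 0.8, 40.0, streaming=False)
        streamed = wait_before_first_text(n, 0.8, 40.0, streaming=True)
        print(f"{n} tokens: buffered {buffered:.1f}s, streamed {streamed:.1f}s")
```

Total generation time is the same in both modes under this model; only the point at which the user first sees text changes.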

Expected Token Savings

Token cost is identical with or without streaming, so there are no token savings. The gain is in user experience: text becomes visible within about a second, which removes the “is it broken?” problem.

Environment

Any web-facing agent interface
Telegram bots (use edit-based partial updates)
Source: Anthropic API documentation, direct experience
