First Agent Response Is 10x Slower Than Subsequent Responses — Cold Start

Symptom

First request after startup: 8–15 seconds
Second and subsequent requests: <1 second
Restart agent → first request is slow again
Users complain about the “first message delay”
Load testing shows P99 latency is dominated by cold-start requests
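The P99 point follows from simple arithmetic. The sketch below uses made-up numbers taken from the ranges above: 99 warm requests at 0.8 s and one cold request at 10 s. It runs them through the standard library's `statistics.quantiles` to show how a single cold request per hundred sets the P99.

```python
import statistics

# Hypothetical load-test sample in seconds: 99 warm requests, 1 cold-start request
latencies = [0.8] * 99 + [10.0]

# quantiles(n=100) returns the 99 cut points P1..P99 (default "exclusive" method)
cuts = statistics.quantiles(latencies, n=100)
p50, p99 = cuts[49], cuts[98]

print(f"P50 = {p50:.2f}s, P99 = {p99:.2f}s")
```

The median stays at 0.8 s while P99 lands near 9.9 s. That is why cold starts show up mainly in tail-latency metrics.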

Root Cause

Several factors compound on the first request:

Connection pool empty — TCP + TLS handshake to Anthropic API (~200–500ms)
DNS resolution — first lookup isn’t cached (~50–200ms)
Session initialization — system prompt, tool schemas, and session state built (~100–500ms)
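Python/Node import overhead (if the agent process is spawned by the first request, interpreter startup and module imports are added to that request's latency)

The first three items add up to roughly 0.35–1.2 s. When the first request takes 8–15 s, the rest has to come from the last item or from other lazy initialization on the request path.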
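You can time the network phases separately to see which one dominates in your environment. The sketch below is a minimal standard-library example. It assumes direct access to api.anthropic.com on port 443 with no proxy. `time_phases` is a hypothetical helper, and it does not measure session initialization or import time. The OS resolver may cache lookups, so repeated runs will usually report a smaller DNS figure.

```python
import socket
import ssl
import time

def time_phases(host="api.anthropic.com", port=443, use_tls=True):
    """Return DNS, TCP-connect and (optionally) TLS-handshake times in milliseconds."""
    timings = {}

    t0 = time.perf_counter()
    # DNS: take the first stream-socket address the resolver returns
    family, socktype, proto, _, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    )[0]
    timings["dns"] = (time.perf_counter() - t0) * 1000

    sock = socket.socket(family, socktype, proto)
    sock.settimeout(5)
    try:
        t0 = time.perf_counter()
        sock.connect(addr)  # TCP three-way handshake
        timings["tcp"] = (time.perf_counter() - t0) * 1000

        if use_tls:
            ctx = ssl.create_default_context()  # built outside the timed section
            t0 = time.perf_counter()
            # wrap_socket() performs the TLS handshake immediately by default
            with ctx.wrap_socket(sock, server_hostname=host):
                timings["tls"] = (time.perf_counter() - t0) * 1000
    finally:
        sock.close()  # no-op if wrap_socket already took ownership of the socket
    return timings
```

Calling `print(time_phases())` twice in a row gives a rough cold-versus-cached comparison for DNS. The TCP and TLS figures stay similar because each call opens a fresh connection.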

Fix

Option 1: Connection pool prewarm on startup

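import asyncio
import httpx
from anthropic import AsyncAnthropic

# Pre-initialize HTTP client with a keep-alive connection pool.
# Idle connections close after keepalive_expiry seconds, so the warm-up
# only helps requests that arrive within that window.
_client = httpx.AsyncClient(
    limits=httpx.Limits(max_connections=10, max_keepalive_connections=10, keepalive_expiry=30),
    timeout=httpx.Timeout(30.0),
)

# Pass the same client to the SDK; otherwise it builds its own pool and the prewarm is wasted
anthropic_client = AsyncAnthropic(http_client=_client)

async def prewarm_connections():
    """Make a minimal request on startup to resolve DNS and open a TCP + TLS connection"""
    try:
        # Any response, even a 404, leaves the connection in the pool
        await _client.get("https://api.anthropic.com/", timeout=5)
    except httpx.HTTPError:
        pass  # Network failure: the first real request simply pays the cold cost

_warmup_task = None

async def on_startup():
    """Call from your framework's startup hook (create_task needs a running event loop)"""
    global _warmup_task
    # Keep a reference so the task is not garbage-collected before it finishes
    _warmup_task = asyncio.create_task(prewarm_connections())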

Option 2: Send a ping request on startup

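import asyncio
import logging

from anthropic import APIError, AsyncAnthropic
from fastapi import FastAPI  # assuming a FastAPI app; adapt to your framework

app = FastAPI()
client = AsyncAnthropic()
_background_tasks = set()

async def prewarm_model(model="claude-haiku-4-5-20251001"):
    """Send a minimal request on startup to warm DNS, TCP and TLS (costs 1 output token)"""
    try:
        await client.messages.create(
            model=model,
            max_tokens=1,
            messages=[{"role": "user", "content": "ping"}],
        )
    except APIError as exc:
        # A failed warm-up should not crash startup; the first real request pays instead
        logging.warning("prewarm failed: %r", exc)

# In your startup handler (on_event still works, but FastAPI now prefers lifespan handlers)
@app.on_event("startup")
async def startup():
    task = asyncio.create_task(prewarm_model())
    # Hold a reference so the task is not garbage-collected before it completes
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)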

Option 3: OpenClaw keepalive config

# openclaw.config.yaml
http:
  connection_pool:
    max_connections: 10
    keep_alive: true
    keep_alive_timeout_ms: 30000
    prewarm_on_startup: true
  dns_cache:
    enabled: true
    ttl_seconds: 300
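The `dns_cache` block above is described only by its keys. The sketch below is a rough illustration of what a TTL-bounded lookup cache does. It is not OpenClaw's implementation, which isn't shown here, and `TTLResolver` is a hypothetical helper. A fixed 300 s TTL ignores the DNS record's own TTL, so it can keep using a stale address after the provider changes IPs.

```python
import socket
import time

class TTLResolver:
    """Cache address lookups for ttl_seconds (illustrative helper, not a library API)."""

    def __init__(self, ttl_seconds=300, resolve=socket.getaddrinfo, clock=time.monotonic):
        self.ttl = ttl_seconds
        self._resolve = resolve   # injectable so it can be tested without a network
        self._clock = clock
        self._cache = {}          # (host, port) -> (fetched_at, result)

    def lookup(self, host, port):
        key = (host, port)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]         # fresh entry: skip the resolver entirely
        result = self._resolve(host, port)
        self._cache[key] = (now, result)
        return result
```

Only the first lookup in each TTL window reaches the resolver. Later connections to the same host skip that step.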

Option 4: Separate slow initialization from first request

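# build_system_prompt(), load_tool_schemas() and AgentSession are placeholders
# for your own code; prewarm_model() is the function from Option 2.

async def initialize_agent():
    """Run all slow init at startup, not on first request"""
    # Build system prompt (may involve file reads)
    system_prompt = await build_system_prompt()
    # Load tool schemas
    tool_schemas = await load_tool_schemas()
    # Warm DNS + TCP + TLS to the API
    await prewarm_model()

    return AgentSession(system_prompt=system_prompt, tools=tool_schemas)

# Application startup: `await` is not allowed at module top level,
# so run this from your startup hook and keep the result
agent = None

async def startup():
    global agent
    agent = await initialize_agent()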
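Fire-and-forget warm-up leaves one gap: a request that arrives before the warm-up finishes runs its own cold initialization. The sketch below shows one way to close it. Initialization starts once at startup, and request handlers await the same task instead of starting their own. `WarmupGate` is a hypothetical helper.

```python
import asyncio

class WarmupGate:
    """Run an async init function once; every caller awaits the same result."""

    def __init__(self, init_fn):
        self._init_fn = init_fn
        self._task = None  # strong reference keeps the task alive

    def start(self):
        # Must be called inside a running event loop (e.g. a startup hook)
        if self._task is None:
            self._task = asyncio.create_task(self._init_fn())
        return self._task

    async def get(self, timeout=None):
        # Early requests wait here; once init is done this returns immediately.
        # shield() stops a caller's timeout from cancelling the shared init task.
        return await asyncio.wait_for(asyncio.shield(self.start()), timeout)
```

Usage: at startup, call `gate = WarmupGate(initialize_agent)` and then `gate.start()`. In each request handler, use `agent = await gate.get(timeout=30)`.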

Option 5: Keep-alive health check (for Docker/k8s)

# docker-compose.yml
services:
  agent:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 20s        # keep below the HTTP client's keep-alive expiry (30s in Option 1)
      timeout: 5s
      retries: 3
      start_period: 10s

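The health check keeps the outbound connection pool alive only if the /health handler itself sends a lightweight request to the API host through the same shared client. A handler that just returns 200 exercises nothing but the local web server. The check also assumes curl is installed in the container image.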
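The sketch below is a minimal health-handler body that touches the upstream only when the pooled connection has been idle long enough to be near expiry, so the checks don't add an API call on every interval. `health`, `mark_outbound`, `KEEPALIVE_EXPIRY_S` and `REFRESH_MARGIN_S` are hypothetical names. `touch_upstream` stands for a cheap async call through the shared client, such as the GET in Option 1's `prewarm_connections`. You would wire `health` into your framework's `/health` route and call `mark_outbound()` after real API calls too.

```python
import asyncio
import time

KEEPALIVE_EXPIRY_S = 30.0  # assumed to match keepalive_expiry on the HTTP client
REFRESH_MARGIN_S = 15.0    # refresh once idle time gets this close to expiry

_last_outbound = None      # monotonic timestamp of the last request to the API host

def mark_outbound():
    """Record that the shared client just talked to the API host."""
    global _last_outbound
    _last_outbound = time.monotonic()

async def health(touch_upstream):
    """Body of a /health handler; touch_upstream is an async zero-arg callable."""
    idle = None if _last_outbound is None else time.monotonic() - _last_outbound
    touched = False
    if idle is None or idle > KEEPALIVE_EXPIRY_S - REFRESH_MARGIN_S:
        try:
            await asyncio.wait_for(touch_upstream(), timeout=5)
            mark_outbound()
            touched = True
        except Exception:
            # An upstream hiccup should not mark the container itself unhealthy
            pass
    return {"status": "ok", "upstream_touched": touched}
```

With real traffic flowing, `mark_outbound()` keeps the idle time low and the health check stays purely local.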

Measurement

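import time

async def measure_cold_vs_warm(agent):
    # Only meaningful as the FIRST call after a fresh process start;
    # agent.complete() stands for your own agent's request method.
    # perf_counter() is monotonic and higher-resolution than time.time().
    start = time.perf_counter()
    await agent.complete("hello")   # cold: DNS + TCP + TLS + lazy init
    cold_time = time.perf_counter() - start

    # Second request (warm: reuses the pooled connection)
    start = time.perf_counter()
    await agent.complete("hello")
    warm_time = time.perf_counter() - start

    print(f"Cold start: {cold_time:.2f}s | Warm: {warm_time:.2f}s")
    print(f"Cold start overhead: {cold_time - warm_time:.2f}s")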

Expected Token Savings

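Cold start doesn't waste tokens, but it adds roughly 8–15 s to the first request after every agent restart. Prewarming moves that cost to startup, so users don't see it. This holds only if the warm-up finishes before the first request arrives and the pooled connection hasn't expired from idleness. The Option 2 ping itself costs a few input tokens and one output token.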

Environment

Python/Node.js async agent backends
Any deployment that restarts agents between sessions
Source: direct measurement, Anthropic API connection profiling
