Tool Call Times Out and Agent Gives Up — No Retry Logic
Symptom
- Agent calls an external API, gets TimeoutError after 10s, and reports failure
- Tool call fails on first attempt; agent moves on without retrying
- Sporadic failures in tools that usually work — always on first try
- Agent says “I was unable to fetch the data” for transient network issues
- Logs show single timeout with no retry attempts
Root Cause
No retry logic in the tool execution layer. A single timeout is treated as permanent failure. Most timeouts are transient — the service was momentarily busy, a connection dropped, or a cold-start delay occurred. A simple retry with backoff resolves the majority of these failures.
Fix
Option 1: Retry decorator with exponential backoff
import functools
import random
import time
from typing import Callable, TypeVar

import httpx

T = TypeVar("T")

def retry(max_attempts=3, base_delay=1.0, max_delay=30.0, exceptions=(Exception,)):
    """Decorator: retry on the specified exceptions with exponential backoff."""
    def decorator(func: Callable[..., T]) -> Callable[..., T]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> T:
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions as e:
                    if attempt == max_attempts:
                        raise  # Last attempt: propagate the original exception
                    # Up to +/-10% jitter so concurrent callers don't retry in lockstep
                    jitter = delay * 0.1 * (2 * random.random() - 1)
                    sleep_time = min(delay + jitter, max_delay)
                    print(f"Attempt {attempt} failed: {e}. Retrying in {sleep_time:.1f}s...")
                    time.sleep(sleep_time)
                    delay *= 2  # Exponential backoff
        return wrapper
    return decorator

# Usage
# httpx raises its own exception types, which do not subclass the built-in
# TimeoutError / ConnectionError, so the httpx types are the ones to retry on.
@retry(max_attempts=3, base_delay=1.0, exceptions=(httpx.TimeoutException, httpx.ConnectError))
def fetch_data(url: str) -> dict:
    response = httpx.get(url, timeout=10.0)
    response.raise_for_status()
    return response.json()
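To see what wait times a given set of decorator parameters implies, the schedule can be computed without sleeping. The following is a minimal sketch: `backoff_schedule` is a hypothetical helper name, not part of the original code, and it leaves out jitter so the result is deterministic. It mirrors the doubling and the `max_delay` cap used above.

```python
from typing import List


def backoff_schedule(max_attempts: int = 3, base_delay: float = 1.0,
                     max_delay: float = 30.0) -> List[float]:
    """Return the sleep before each retry (jitter omitted).

    There is one fewer sleep than attempts, because nothing is
    slept after the final attempt fails.
    """
    delays = []
    delay = base_delay
    for _ in range(max_attempts - 1):
        delays.append(min(delay, max_delay))  # cap each individual sleep
        delay *= 2                            # exponential growth
    return delays


print(backoff_schedule())                # [1.0, 2.0]
print(backoff_schedule(max_attempts=7))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

With the defaults, three attempts add at most about 3 seconds of waiting on top of the per-request timeouts, which is worth checking against whatever overall deadline the agent's tool call has.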
Option 2: Async retry with tenacity
import logging

import httpx
from tenacity import (
    retry, stop_after_attempt, wait_exponential,
    retry_if_exception_type, before_sleep_log
)

logger = logging.getLogger(__name__)

# Note: without reraise=True, tenacity raises tenacity.RetryError
# once all attempts are exhausted, wrapping the last exception.
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=10),
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.ConnectError)),
    before_sleep=before_sleep_log(logger, logging.WARNING)
)
async def call_external_api(endpoint: str, payload: dict) -> dict:
    async with httpx.AsyncClient(timeout=15.0) as client:
        response = await client.post(endpoint, json=payload)
        response.raise_for_status()
        return response.json()
Option 3: Retry inside the tool wrapper
import asyncio

# dispatch_tool is assumed to be the agent framework's own async tool dispatcher.
async def execute_tool_with_retry(
    tool_name: str,
    tool_input: dict,
    max_attempts: int = 3,
    timeout_seconds: float = 30.0
) -> dict:
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await asyncio.wait_for(
                dispatch_tool(tool_name, tool_input),
                timeout=timeout_seconds
            )
            if attempt > 1:
                print(f"Tool {tool_name} succeeded on attempt {attempt}")
            return result
        except asyncio.TimeoutError:
            last_error = f"Timeout after {timeout_seconds}s"
            print(f"Tool {tool_name} attempt {attempt}/{max_attempts}: {last_error}")
        except Exception as e:
            if not is_retryable(e):
                raise  # Don't retry permanent errors (404, auth, validation)
            last_error = str(e)
            print(f"Tool {tool_name} attempt {attempt}/{max_attempts}: {e}")
        if attempt < max_attempts:
            # 1s, 2s, 4s, ... (no sleep after the final attempt)
            await asyncio.sleep(2 ** (attempt - 1))
    raise RuntimeError(f"Tool {tool_name} failed after {max_attempts} attempts: {last_error}")

def is_retryable(error: Exception) -> bool:
    """Determine if an error is worth retrying, based on its message text."""
    error_str = str(error).lower()
    # Retry: transient network/server issues
    retryable = ["timeout", "connection", "503", "429", "500", "502", "504"]
    # Don't retry: client errors, auth, not found
    non_retryable = ["401", "403", "404", "400", "validation", "invalid"]
    # Non-retryable markers take precedence when both kinds appear
    if any(s in error_str for s in non_retryable):
        return False
    return any(s in error_str for s in retryable)
Option 4: Retry guidance in system prompt
System prompt:
"When a tool call fails with a timeout or transient error:
1. Retry the same tool call up to 2 more times before reporting failure
2. Wait 2 seconds between retries
3. Only report failure after 3 total attempts
4. Distinguish transient errors (timeout, 503, 429) from permanent errors (404, 401, 400)
5. For permanent errors, report immediately — retrying won't help"
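Prompt guidance like this only works if the tool result tells the model which kind of failure occurred. One way to do that, sketched below under assumptions, is to return a structured error with an explicit `retryable` flag. The helper name `format_tool_error` and the field names are hypothetical, not part of any particular agent framework.

```python
import json
from typing import Optional

# Status codes treated as transient in this sketch
TRANSIENT_STATUS = {429, 500, 502, 503, 504}


def format_tool_error(tool_name: str, message: str,
                      status_code: Optional[int] = None,
                      timed_out: bool = False) -> str:
    """Build a JSON tool result stating whether the failure looks transient."""
    retryable = timed_out or status_code in TRANSIENT_STATUS
    return json.dumps({
        "tool": tool_name,
        "ok": False,
        "error": message,
        "status_code": status_code,
        "retryable": retryable,  # the model can apply the prompt rules to this flag
    })


print(format_tool_error("fetch_data", "request timed out", timed_out=True))
print(format_tool_error("fetch_data", "not found", status_code=404))
```

The model then follows the retry rules against a field it can read directly, instead of guessing from free-form error text.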
Option 5: Circuit breaker to avoid retry storms
from datetime import datetime, timedelta

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=60):
        self.failures = 0
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open before a trial call
        self.opened_at = None
        self.state = "closed"  # closed, open, half-open

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if datetime.now() > self.opened_at + timedelta(seconds=self.reset_timeout):
                self.state = "half-open"  # let one trial call through
            else:
                raise RuntimeError("Circuit breaker open: tool temporarily disabled")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.state = "open"
            self.opened_at = datetime.now()
            print(f"Circuit breaker opened after {self.failures} failures")
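The state transitions are easier to verify when time can be controlled. The sketch below is a compact variant of the breaker, not a drop-in replacement: it swaps `datetime.now()` for an injectable `clock` callable (an assumption added here, defaulting to `time.monotonic`) so the example can advance time without sleeping. The class name `ClockedBreaker` is hypothetical.

```python
import time
from typing import Callable


class ClockedBreaker:
    """Compact circuit breaker variant with an injectable clock."""

    def __init__(self, failure_threshold: int = 2, reset_timeout: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = 0.0
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("Circuit breaker open")
            self.state = "half-open"  # allow one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result


def always_fails():
    raise TimeoutError("simulated timeout")


now = [0.0]  # fake clock that the example advances manually
breaker = ClockedBreaker(failure_threshold=2, reset_timeout=60.0,
                         clock=lambda: now[0])

for _ in range(2):  # two consecutive failures trip the breaker
    try:
        breaker.call(always_fails)
    except TimeoutError:
        pass
print(breaker.state)  # open

now[0] = 61.0  # past reset_timeout: the next call is a half-open trial
print(breaker.call(lambda: "ok"))  # ok
print(breaker.state)  # closed
```

While the breaker is open, calls fail immediately with `RuntimeError` instead of waiting out another timeout, which is what keeps a retry loop from piling load onto a service that is already down.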
Retry Decision Matrix
| Error type | Retry? | Notes |
|---|---|---|
| TimeoutError | Yes | Transient; retry with longer timeout |
| ConnectionError | Yes | Network blip; retry after delay |
| HTTP 429 (rate limit) | Yes | Retry after Retry-After header |
| HTTP 503 (unavailable) | Yes | Server busy; retry with backoff |
| HTTP 500 (server error) | Yes (limited) | Maybe transient; max 2 retries |
| HTTP 401 (auth) | No | Key is wrong; retrying won’t help |
| HTTP 404 (not found) | No | Resource doesn’t exist |
| HTTP 400 (bad request) | No | Fix the request |
| ValidationError | No | Input is malformed |
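The matrix can also be applied to exception types and status codes directly, rather than by matching substrings in error messages. The following is a sketch under assumptions: `should_retry` and `retry_after_seconds` are hypothetical helper names, 502 and 504 are included alongside the codes in the table, and only the numeric (seconds) form of `Retry-After` is handled.

```python
from typing import Mapping, Optional

RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def should_retry(error: Optional[BaseException] = None,
                 status_code: Optional[int] = None) -> bool:
    """Apply the decision matrix to an HTTP status code or an exception."""
    if status_code is not None:
        return status_code in RETRYABLE_STATUS
    # Built-in TimeoutError and ConnectionError cover socket-level failures;
    # anything else (validation errors, etc.) is treated as permanent.
    return isinstance(error, (TimeoutError, ConnectionError))


def retry_after_seconds(headers: Mapping[str, str], default: float) -> float:
    """Use a numeric Retry-After header if present, else the default delay.

    Assumes a plain mapping with the exact key "Retry-After".
    """
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        # Header missing, or in HTTP-date form, which this sketch does not parse
        return default


print(should_retry(status_code=503))                   # True
print(should_retry(status_code=404))                   # False
print(should_retry(error=ConnectionResetError()))      # True
print(retry_after_seconds({"Retry-After": "5"}, 1.0))  # 5.0
```

Libraries such as httpx raise their own exception classes, so in practice the `isinstance` check would need to list those types as well.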
Expected Token Savings
- Failed task + agent explanation + user retry: ~3,000 tokens
- Built-in retry resolves silently: 0 extra tokens
Environment
- Any agent calling external APIs, databases, or slow services
- Source: direct experience, industry standard resilience pattern