AI Agent Loop and Stuck Error Guide

An agent stuck in an infinite loop is one of the most expensive failure modes in agentic AI systems. A crashed agent stops billing; a looping agent keeps generating tokens, often burning $10–50 before a human notices.

Why Agents Get Stuck

Agents loop for three main reasons:

Retry without backoff: Tool call fails → agent retries immediately → fails again → retries → infinite loop
Undetected task completion: Agent completes the task but doesn’t recognize its own success condition
Dependency deadlock: Agent waits for output from a subtask that is itself waiting

All three are preventable with the right configuration.
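
The first cause, retrying without backoff, is usually addressed by capping the number of attempts and waiting longer between each one. The following is a minimal Python sketch of that idea, not an OpenClaw API: `retry_with_backoff`, `call_tool`, and all parameter names are hypothetical and chosen only for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call_tool: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call_tool`, retrying with exponential backoff and jitter.

    Re-raises the last exception once `max_attempts` is exhausted, so the
    caller sees a hard failure instead of an endless retry loop.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call_tool()
        except Exception:
            if attempt == max_attempts:
                raise
            # Upper bound doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Random jitter keeps parallel agents from retrying in lockstep.
            sleep(random.uniform(0, delay))
    raise RuntimeError("unreachable: the loop always returns or raises")
```

The `sleep` parameter is injectable so the function can be exercised without real delays. Backoff alone only slows a loop down; the hard cap on attempts is what ends it.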

The Token Cost of a Stuck Agent

Loop Duration	Typical Token Burn	Cost at Standard Rates
5 minutes	15,000–30,000	$4.50–9.00
15 minutes	50,000–100,000	$15–30
1 hour	200,000–400,000	$60–120

Most stuck agents go unnoticed for 10–30 minutes. At the rate the table implies (about $0.30 per 1,000 tokens), that is roughly $10–60 per incident.
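
The dollar figures in the table follow from a single price: $4.50 for 15,000 tokens works out to about $0.30 per 1,000 tokens. The Python sketch below reproduces that arithmetic. The rate constant is inferred from the table rather than taken from any published price list, and `estimate_cost` is a hypothetical helper name.

```python
# Rate inferred from the table above ($4.50 for 15,000 tokens). This is an
# assumption, not a published price; substitute your provider's actual rate.
USD_PER_1K_TOKENS = 0.30


def estimate_cost(tokens: int, usd_per_1k: float = USD_PER_1K_TOKENS) -> float:
    """Return the estimated cost in USD for a given number of tokens."""
    return round(tokens / 1000 * usd_per_1k, 2)


if __name__ == "__main__":
    for label, low, high in [
        ("5 minutes", 15_000, 30_000),
        ("15 minutes", 50_000, 100_000),
        ("1 hour", 200_000, 400_000),
    ]:
        print(f"{label}: ${estimate_cost(low):.2f} to ${estimate_cost(high):.2f}")
```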

Fix 1: Circuit Breaker (Required)

A circuit breaker is the single most effective fix. It stops retrying after N consecutive failures:

# openclaw.config.yaml
agent:
  circuit_breaker:
    failure_threshold: 3        # Open after 3 consecutive failures
    reset_timeout_ms: 30000     # Try again after 30 seconds
    half_open_attempts: 1       # Allow 1 trial request before fully closing the circuit
    on_open: log_and_notify     # What to do when circuit opens

With a circuit breaker, a stuck tool call fails fast instead of looping forever.
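
To make the closed, open, and half-open states concrete, here is a minimal single-threaded Python sketch of the pattern. It is an illustration of the general technique, not OpenClaw's implementation; the class and method names are hypothetical, and the defaults simply mirror the configuration values above.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # timeout elapsed: one trial call is allowed
        return "open"

    def call(self, fn: Callable[[], object]) -> object:
        if self.state == "open":
            raise CircuitOpenError("circuit open; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed half-open trial, or too many consecutive failures,
            # (re)opens the circuit and restarts the reset timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Wrapping each tool invocation in `breaker.call(...)` means the fourth consecutive attempt raises `CircuitOpenError` immediately instead of reaching the failing tool. A concurrent agent would additionally need a lock around the state changes.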

Fix 2: Per-Session Token Budget

Cap total tokens per session to bound worst-case cost:

limits:
  max_tokens_per_session: 50000
  max_tokens_per_task: 10000
  on_limit_reached: pause_and_report  # Not: silently_fail

When the limit is reached, the agent pauses, reports its current state, and waits for human input rather than being silently killed.
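
If your runtime does not enforce budgets for you, the same idea can be approximated in application code. The Python sketch below tracks per-task and per-session usage and raises an exception that carries enough state to report to a human. `TokenBudget` and `BudgetExceeded` are hypothetical names, and the sketch assumes the caller can obtain a token count for each model call.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Signals that the agent should pause and report, not keep generating."""

    def __init__(self, scope: str, used: int, limit: int):
        super().__init__(f"{scope} token budget exceeded: {used} > {limit}")
        self.scope, self.used, self.limit = scope, used, limit


@dataclass
class TokenBudget:
    max_tokens_per_session: int = 50_000
    max_tokens_per_task: int = 10_000
    session_used: int = 0
    task_used: int = 0

    def start_task(self) -> None:
        """Reset the per-task counter; the session counter keeps accumulating."""
        self.task_used = 0

    def charge(self, tokens: int) -> None:
        """Record tokens spent by one model call and enforce both limits."""
        self.session_used += tokens
        self.task_used += tokens
        if self.task_used > self.max_tokens_per_task:
            raise BudgetExceeded("task", self.task_used, self.max_tokens_per_task)
        if self.session_used > self.max_tokens_per_session:
            raise BudgetExceeded("session", self.session_used, self.max_tokens_per_session)
```

The agent loop would catch `BudgetExceeded`, persist its current state, and notify an operator, which corresponds to the pause-and-report behavior described above.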

Fix 3: Loop Detection

Detect when the agent is repeating the same actions:

agent:
  loop_detection:
    enabled: true
    window_size: 10           # Check last 10 actions
    similarity_threshold: 0.85 # Flag if 85% similar to a previous action
    on_detected: break_and_report

This catches the “try same thing 10 times” pattern before it burns significant tokens.
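
One simple way to implement this kind of check is to compare each new action, serialized as a string, against a sliding window of recent actions. The Python sketch below uses the standard library's `difflib.SequenceMatcher` as the similarity measure. That choice is an assumption made for illustration; how OpenClaw actually scores similarity is not stated here, and `LoopDetector` is a hypothetical name.

```python
from collections import deque
from difflib import SequenceMatcher


class LoopDetector:
    def __init__(self, window_size: int = 10, similarity_threshold: float = 0.85):
        self.similarity_threshold = similarity_threshold
        # Only the most recent `window_size` actions are kept for comparison.
        self._recent = deque(maxlen=window_size)

    def observe(self, action: str) -> bool:
        """Record an action; return True if it closely matches a recent one."""
        looping = any(
            SequenceMatcher(None, action, previous).ratio() >= self.similarity_threshold
            for previous in self._recent
        )
        self._recent.append(action)
        return looping


if __name__ == "__main__":
    detector = LoopDetector()
    for step in ["read_file path=/tmp/a.txt", "read_file path=/tmp/a.txt"]:
        print(step, "->", "loop suspected" if detector.observe(step) else "ok")
```

Flagging on a single near-duplicate can be too sensitive for agents that legitimately poll; counting matches within the window before breaking is a reasonable refinement.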

Fix 4: Explicit Success Conditions

Agents loop when they don’t know they’re done. Define explicit success conditions:

task:
  success_conditions:
    - type: file_exists
      path: ./output/result.json
    - type: api_response
      status: 200
      endpoint: /health
  failure_conditions:
    - type: max_attempts
      count: 5
    - type: elapsed_time_ms
      max: 60000

Without explicit conditions, the agent relies on the model’s judgment, which can loop on “not quite right” indefinitely.
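
The value of explicit conditions is that they can be checked by ordinary code rather than by the model. As a rough Python sketch of the file, attempt-count, and elapsed-time checks from the configuration above (the HTTP check is omitted to keep the example offline, and `task_status` is a hypothetical helper):

```python
import time
from pathlib import Path
from typing import Callable


def task_status(
    result_path: Path,
    attempts: int,
    started_at: float,
    max_attempts: int = 5,
    max_elapsed_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'success', 'failed', or 'continue' from deterministic checks."""
    # Success condition: the expected output file exists.
    if result_path.exists():
        return "success"
    # Failure conditions: too many attempts or too much elapsed time.
    if attempts >= max_attempts:
        return "failed"
    if clock() - started_at >= max_elapsed_s:
        return "failed"
    return "continue"
```

Calling this after every agent step gives the loop an exit on both success and failure that does not depend on the model deciding it is finished.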

Fix 5: Heartbeat Monitoring

For long-running agents, monitor progress, not just whether the process is alive:

# Check if agent is making progress (not just running)
openclaw agent status --session $SESSION_ID --check-progress

# Output:
# last_action: 47s ago
# last_output_token: 12s ago
# loop_score: 0.23 (low = good, high = looping)

Add a watchdog that kills the session if loop_score > 0.7 for more than 60 seconds.
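
The watchdog rule has two parts: a threshold and a duration, so a brief spike should not trigger a kill. A minimal Python sketch of that logic follows. It assumes the caller polls the loop score periodically and handles the actual kill; how the score is obtained is left out, and `LoopScoreWatchdog` is a hypothetical name.

```python
from typing import Optional


class LoopScoreWatchdog:
    def __init__(self, threshold: float = 0.7, grace_s: float = 60.0):
        self.threshold = threshold
        self.grace_s = grace_s
        # Timestamp at which the score first went above the threshold, if any.
        self._high_since: Optional[float] = None

    def should_kill(self, loop_score: float, now: float) -> bool:
        """Feed one sample; return True once the score has stayed above the
        threshold for more than `grace_s` seconds without dropping back."""
        if loop_score <= self.threshold:
            self._high_since = None  # recovered: reset the timer
            return False
        if self._high_since is None:
            self._high_since = now
        return now - self._high_since > self.grace_s
```

A polling loop would call `should_kill(score, time.monotonic())` every few seconds and terminate the session on the first `True`.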

The Exec Storm Pattern (High Cost)

This is the most expensive stuck pattern, and it is invisible on the output channel:

What happens:

Agent tool call fails with an error not surfaced to the output channel (e.g., Telegram)
Agent retries the tool call in an internal loop
No output to Telegram → user sees silence
Agent burns 3–5x normal token rate
After 10–30 minutes, the agent either times out or exhausts the gateway’s memory (OOM)

Symptoms:

Telegram channel: no output
Gateway CPU: elevated (30–80%)
openclaw logs --session $ID: rapid tool call failures
Token counter: growing faster than expected

Fix:

error_handling:
  mirror_tool_errors_to_channels: [telegram, slack]
  circuit_breaker:
    failure_threshold: 3
    reset_timeout_ms: 30000
channels:
  telegram:
    on_agent_silent_for_ms: 120000  # Alert after 2 min of silence
    action: send_status_update      # What to send when the silence alert fires
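
The core of the fix is that a tool failure must reach the same channel the user is watching. As an illustration of error mirroring in application code, the Python sketch below wraps a tool call and forwards any failure to a list of notifier callables before re-raising. The notifiers stand in for real channel clients such as a Telegram or Slack sender; `run_tool_mirrored` and its parameters are hypothetical names.

```python
from typing import Callable, Iterable

Notifier = Callable[[str], None]


def run_tool_mirrored(
    tool: Callable[[], object],
    tool_name: str,
    notifiers: Iterable[Notifier],
) -> object:
    """Run a tool call; on failure, mirror the error to every channel and re-raise."""
    try:
        return tool()
    except Exception as exc:
        message = f"tool '{tool_name}' failed: {type(exc).__name__}: {exc}"
        for notify in notifiers:
            try:
                notify(message)
            except Exception:
                # A broken channel must not mask the original tool error.
                pass
        raise
```

Because the original exception is re-raised, this wrapper composes with a circuit breaker: the user sees the error, and the breaker still counts the failure.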


The Unresponsive-But-Running Pattern

What happens: Agent process is alive. Health endpoint returns 200. But agent is not processing any new requests.

Symptom: openclaw status shows “running”, channels show “connected”, but messages go unanswered.

Root cause: Agent thread is blocked on a synchronous operation (database read, network call) with no timeout configured.

Fix:

agent:
  operation_timeout_ms: 10000    # Time out any single operation at 10s
  watchdog:
    enabled: true
    heartbeat_interval_ms: 5000
    max_missed_heartbeats: 3     # Restart after 3 missed (15s)
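
When you control the code making the blocking call, a per-operation timeout can also be applied directly. The Python sketch below runs a synchronous operation in a worker thread and stops waiting after a deadline, using the standard library's `concurrent.futures`. `run_with_timeout` is a hypothetical helper, and the approach has a known limit: the worker thread is abandoned rather than killed, so the underlying call should still have its own socket or query timeout.

```python
import concurrent.futures
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_timeout(operation: Callable[[], T], timeout_s: float = 10.0) -> T:
    """Run a blocking operation, raising concurrent.futures.TimeoutError
    if it has not finished within `timeout_s` seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a hung worker; the thread keeps running in the
        # background until the operation itself returns.
        executor.shutdown(wait=False)
```

The caller treats the timeout error like any other tool failure, which lets the agent thread move on to the next request instead of blocking indefinitely.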


Recovery After a Stuck Agent

When you discover a stuck agent:

# 1. Save what you can from current state
openclaw session export $SESSION_ID > session-backup.json

# 2. Kill the stuck session
openclaw session kill $SESSION_ID

# 3. Check what was actually completed
openclaw logs --session $SESSION_ID --filter "completed|success" --last 100

# 4. Start fresh from last known good state
openclaw session restore session-backup.json --from-checkpoint last_success

Detection Checklist

Before deploying any autonomous agent:

Circuit breaker configured with failure_threshold ≤ 5
Per-session token budget set
Loop detection enabled
Explicit success and failure conditions defined
Output channel error mirroring enabled
Watchdog heartbeat monitoring active
Human notification on circuit open or budget limit

