Agent Doesn't Respect API Quota Across Multiple Instances

Symptom

A single agent instance respects its rate limit perfectly. But when the system scales to 5, 10, or 50 instances, each instance tracks its own independent counter. The result: 10 instances each allowing 100 RPM = 1,000 RPM against an API that allows 500 RPM. Half the requests fail with 429.

# BROKEN: each instance has its own in-memory counter
class RateLimiter:
    def __init__(self, max_rpm: int):
        self.max_rpm = max_rpm
        self.count = 0          # local to this process — other instances don't see it
        self.window_start = time.time()

    def allow(self) -> bool:
        now = time.time()
        if now - self.window_start > 60:
            self.count = 0
            self.window_start = now
        if self.count >= self.max_rpm:
            return False
        self.count += 1
        return True

# Instance A allows 100 requests. Instance B allows 100 requests.
# Total: 200 requests against a 100 RPM API → 100 get 429'd.

Root causes:

Rate limit state is held in process memory, not shared storage
No central coordination between agent instances
Horizontal scaling is applied without updating rate limiting architecture
Multiple API keys reduce cost-per-call visibility but share an org-level quota

Root Cause

Process-local rate limiting is correct for single-instance deployments but fundamentally broken at scale. Each process sees only its own traffic, so N instances each allow max_rate requests, for a combined N × max_rate — which exceeds the actual API quota by N×.

The solution requires shared state. Redis is the standard choice because it supports atomic increment with expiry (INCR + EXPIRE) and sorted sets for sliding window tracking. Alternatives include PostgreSQL advisory locks, Memcached, or a dedicated rate-limiting service (Kong, Nginx, AWS API Gateway).

Fix

Option 1 — Redis Fixed-Window Counter (Simplest)

Store the request count in Redis with a TTL — all instances read and increment the same counter.

import anthropic
import redis
import time
import json
from typing import Optional

client = anthropic.Anthropic()

class DistributedFixedWindowLimiter:
    """
    Fixed-window rate limiter using Redis.
    All instances share the same counter — total rate is enforced globally.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        key_prefix: str,
        max_requests: int,
        window_seconds: int = 60,
    ):
        self.redis = redis_client
        self.key_prefix = key_prefix
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def _window_key(self) -> str:
        """Key changes every window — naturally resets the counter."""
        window_id = int(time.time()) // self.window_seconds
        return f"{self.key_prefix}:{window_id}"

    def check_and_increment(self) -> tuple[bool, int, int]:
        """
        Atomically check and increment.
        Returns (allowed, current_count, remaining).
        """
        key = self._window_key()

        pipe = self.redis.pipeline()
        pipe.incr(key)
        pipe.expire(key, self.window_seconds * 2)  # 2× TTL for safety
        results = pipe.execute()

        current = results[0]
        allowed = current <= self.max_requests
        remaining = max(0, self.max_requests - current)

        return allowed, current, remaining

    def wait_until_allowed(self, poll_interval: float = 0.5) -> int:
        """Block until a request slot is available. Returns wait time in ms."""
        start = time.monotonic()
        while True:
            allowed, count, remaining = self.check_and_increment()
            if allowed:
                wait_ms = int((time.monotonic() - start) * 1000)
                return wait_ms

            # Decrement since we're not using the slot
            self.redis.decr(self._window_key())

            # Calculate time until next window
            window_id = int(time.time()) // self.window_seconds
            next_window = (window_id + 1) * self.window_seconds
            sleep_time = min(poll_interval, next_window - time.time() + 0.01)
            time.sleep(sleep_time)

# Setup
redis_client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

# All agent instances share this limiter — same Redis key, same quota
limiter = DistributedFixedWindowLimiter(
    redis_client=redis_client,
    key_prefix="anthropic:rpm",
    max_requests=500,      # actual API quota
    window_seconds=60,
)

def make_api_call_with_global_limit(messages: list[dict]) -> Optional[str]:
    """Make an Anthropic API call respecting the global rate limit."""
    allowed, count, remaining = limiter.check_and_increment()

    if not allowed:
        # Don't consume a slot we can't use
        redis_client.decr(f"anthropic:rpm:{int(time.time()) // 60}")
        print(f"  Rate limit: {count}/{limiter.max_requests} RPM used globally. Waiting...")
        wait_ms = limiter.wait_until_allowed()
        print(f"  Waited {wait_ms}ms for slot")

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        messages=messages,
    )
    return response.content[0].text

# Simulate multiple "instances" sharing the same limit
def simulate_instance(instance_id: int, request_count: int):
    print(f"Instance {instance_id}: starting {request_count} requests")
    for i in range(request_count):
        result = make_api_call_with_global_limit([
            {"role": "user", "content": f"Instance {instance_id}, request {i}: ping"}
        ])
        print(f"  Instance {instance_id}, req {i}: ok")

# In real deployment: each process creates its own limiter pointing to the same Redis
simulate_instance(1, 3)

Expected Token Savings: Eliminates 429-triggered retries. Each retry wastes ~200 tokens of context re-sending.

Environment: Python 3.9+, redis>=4.0, anthropic>=0.40.0. Requires Redis 6+ for proper atomic operations.

Option 2 — Redis Sliding Window with Sorted Set

More accurate than fixed window — prevents bursting at window boundaries.

import anthropic
import redis
import time
from typing import Optional

client = anthropic.Anthropic()
redis_client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

class DistributedSlidingWindowLimiter:
    """
    Sliding window rate limiter using Redis sorted sets.
    Each request is stored with its timestamp as score.
    Old entries are pruned on each check.
    """

    def __init__(
        self,
        redis_client: redis.Redis,
        key: str,
        max_requests: int,
        window_seconds: int = 60,
    ):
        self.redis = redis_client
        self.key = key
        self.max_requests = max_requests
        self.window_seconds = window_seconds

    def allow_request(self) -> tuple[bool, int]:
        """
        Atomically add request and check if within limit.
        Returns (allowed, current_count_in_window).
        Uses a Lua script for atomicity — no race conditions between instances.
        """
        now = time.time()
        window_start = now - self.window_seconds

        lua_script = """
        local key = KEYS[1]
        local now = tonumber(ARGV[1])
        local window_start = tonumber(ARGV[2])
        local max_requests = tonumber(ARGV[3])
        local request_id = ARGV[4]
        local window_seconds = tonumber(ARGV[5])

        -- Remove requests outside the window
        redis.call('ZREMRANGEBYSCORE', key, '-inf', window_start)

        -- Count current requests in window
        local count = redis.call('ZCARD', key)

        if count < max_requests then
            -- Add this request
            redis.call('ZADD', key, now, request_id)
            redis.call('EXPIRE', key, window_seconds * 2)
            return {1, count + 1}  -- allowed, new count
        else
            return {0, count}  -- denied, current count
        end
        """

        request_id = f"{time.time():.6f}-{id(self)}"
        result = self.redis.eval(
            lua_script,
            1,  # number of keys
            self.key,
            now, window_start, self.max_requests, request_id, self.window_seconds
        )

        allowed = bool(result[0])
        count = int(result[1])
        return allowed, count

    def retry_after_seconds(self) -> float:
        """Estimate seconds until a slot opens in the sliding window."""
        now = time.time()
        window_start = now - self.window_seconds

        # Get the oldest request in the window
        oldest = self.redis.zrange(self.key, 0, 0, withscores=True)
        if not oldest:
            return 0.0

        oldest_time = oldest[0][1]
        return max(0.0, oldest_time + self.window_seconds - now + 0.1)

# All agent instances use same Redis key
limiter = DistributedSlidingWindowLimiter(
    redis_client=redis_client,
    key="anthropic:sliding:rpm",
    max_requests=500,
    window_seconds=60,
)

def throttled_anthropic_call(
    messages: list[dict],
    max_retries: int = 3,
) -> Optional[str]:
    """Make API call with global sliding-window rate limiting."""

    for attempt in range(max_retries):
        allowed, count = limiter.allow_request()

        if allowed:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=256,
                messages=messages,
            )
            return response.content[0].text

        # Calculate precise wait time
        wait = limiter.retry_after_seconds()
        print(f"  Global limit reached ({count}/500 RPM). Waiting {wait:.1f}s...")
        time.sleep(wait)

    return None  # exhausted retries

result = throttled_anthropic_call([{"role": "user", "content": "Hello"}])
print(result)

Expected Token Savings: Sliding window prevents burst-at-boundary failures, eliminating retry loops that each cost 100-300 tokens.

Environment: Python 3.9+, redis>=4.0 with Lua scripting support, anthropic>=0.40.0.

Option 3 — Token Bucket with Redis and Atomic Refill

Implements a token bucket in Redis — allows bursting up to bucket capacity while enforcing long-term rate.

import anthropic
import redis
import time
import json

client = anthropic.Anthropic()
redis_client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

TOKEN_BUCKET_LUA = """
local key_tokens = KEYS[1]
local key_last_refill = KEYS[2]

local capacity = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])   -- tokens per second
local tokens_needed = tonumber(ARGV[3])
local now = tonumber(ARGV[4])

-- Get current state
local tokens = tonumber(redis.call('GET', key_tokens) or capacity)
local last_refill = tonumber(redis.call('GET', key_last_refill) or now)

-- Refill tokens based on elapsed time
local elapsed = now - last_refill
local refill = elapsed * refill_rate
tokens = math.min(capacity, tokens + refill)

-- Try to consume tokens
if tokens >= tokens_needed then
    tokens = tokens - tokens_needed
    redis.call('SET', key_tokens, tokens)
    redis.call('SET', key_last_refill, now)
    redis.call('EXPIRE', key_tokens, 3600)
    redis.call('EXPIRE', key_last_refill, 3600)
    return {1, math.floor(tokens)}  -- allowed, remaining tokens
else
    redis.call('SET', key_tokens, tokens)
    redis.call('SET', key_last_refill, now)
    redis.call('EXPIRE', key_tokens, 3600)
    redis.call('EXPIRE', key_last_refill, 3600)
    return {0, math.floor(tokens)}  -- denied, remaining tokens
end
"""

class DistributedTokenBucket:
    def __init__(
        self,
        redis_client: redis.Redis,
        key: str,
        capacity: int,        # max burst size
        refill_rate: float,   # tokens per second
        tokens_per_request: int = 1,
    ):
        self.redis = redis_client
        self.key_tokens = f"{key}:tokens"
        self.key_last_refill = f"{key}:last_refill"
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens_per_request = tokens_per_request
        self._script = self.redis.register_script(TOKEN_BUCKET_LUA)

    def consume(self) -> tuple[bool, int]:
        """Consume tokens. Returns (allowed, remaining_tokens)."""
        result = self._script(
            keys=[self.key_tokens, self.key_last_refill],
            args=[self.capacity, self.refill_rate, self.tokens_per_request, time.time()]
        )
        return bool(result[0]), int(result[1])

    def wait_time_seconds(self, tokens_needed: int = 1) -> float:
        """Seconds until `tokens_needed` tokens are available."""
        current = float(self.redis.get(self.key_tokens) or self.capacity)
        deficit = max(0, tokens_needed - current)
        return deficit / self.refill_rate

# 500 RPM = ~8.33 requests/second, burst up to 50
bucket = DistributedTokenBucket(
    redis_client=redis_client,
    key="anthropic:token_bucket",
    capacity=50,            # allow bursts of 50
    refill_rate=500 / 60,   # 8.33 tokens/second
)

def api_call_with_token_bucket(messages: list[dict]) -> str:
    """Make API call, waiting for a token slot if needed."""
    while True:
        allowed, remaining = bucket.consume()
        if allowed:
            response = client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=256,
                messages=messages,
            )
            print(f"  Request allowed. {remaining} tokens remaining in bucket.")
            return response.content[0].text

        wait = bucket.wait_time_seconds()
        print(f"  Token bucket empty. Waiting {wait:.2f}s for refill...")
        time.sleep(wait + 0.01)

result = api_call_with_token_bucket([{"role": "user", "content": "Test request"}])
print(result[:100])

Expected Token Savings: Token bucket enables legitimate bursting (faster task completion) while preventing quota exhaustion — eliminates unnecessary waits on bursty workloads.

Environment: Python 3.9+, redis>=4.0 with Lua scripting, anthropic>=0.40.0.

Option 4 — Quota Coordinator Service with HTTP API

Use a dedicated lightweight service to coordinate quota across instances — useful when Redis is not available.

import anthropic
import asyncio
import time
import json
from collections import deque
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

client = anthropic.Anthropic()

# ─── Quota Coordinator (run as a separate microservice) ───────────────────────

class QuotaState:
    def __init__(self, max_rpm: int):
        self.max_rpm = max_rpm
        self.requests = deque()
        self.lock = threading.Lock()

    def allow(self) -> tuple[bool, int, float]:
        now = time.time()
        window_start = now - 60.0

        with self.lock:
            # Prune old requests
            while self.requests and self.requests[0] < window_start:
                self.requests.popleft()

            count = len(self.requests)
            if count < self.max_rpm:
                self.requests.append(now)
                return True, count + 1, 0.0
            else:
                # Calculate retry_after
                oldest = self.requests[0]
                retry_after = oldest + 60.0 - now + 0.1
                return False, count, retry_after

quota = QuotaState(max_rpm=500)

class QuotaHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        parsed = urlparse(self.path)
        if parsed.path == "/allow":
            allowed, count, retry_after = quota.allow()
            response = json.dumps({
                "allowed": allowed,
                "count": count,
                "max": quota.max_rpm,
                "retry_after": retry_after,
            }).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(response)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, format, *args):
        pass  # suppress access logs

def start_coordinator(port: int = 8765):
    """Start the quota coordinator in a background thread."""
    server = HTTPServer(("localhost", port), QuotaHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    print(f"Quota coordinator running on port {port}")
    return server

# ─── Agent Client (each instance uses this) ───────────────────────────────────

import urllib.request

def request_quota_slot(coordinator_url: str = "http://localhost:8765") -> tuple[bool, float]:
    """Ask the coordinator for permission to make one API call."""
    try:
        req = urllib.request.Request(
            f"{coordinator_url}/allow",
            method="POST",
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=2) as resp:
            data = json.loads(resp.read())
            return data["allowed"], data.get("retry_after", 0.0)
    except Exception as e:
        # If coordinator is down, fail open (allow the request)
        print(f"  Coordinator unavailable ({e}), allowing request")
        return True, 0.0

def make_rate_coordinated_call(messages: list[dict]) -> str:
    """Make API call, coordinating quota with all other instances."""
    max_wait = 30.0
    waited = 0.0

    while waited < max_wait:
        allowed, retry_after = request_quota_slot()
        if allowed:
            return client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=256,
                messages=messages,
            ).content[0].text

        wait = min(retry_after, max_wait - waited)
        print(f"  Coordinator denied request. Retry after {wait:.1f}s")
        time.sleep(wait)
        waited += wait

    raise TimeoutError("Could not obtain quota slot within 30s")

# Demo
server = start_coordinator(port=8765)
time.sleep(0.1)  # let server start

result = make_rate_coordinated_call([{"role": "user", "content": "Hello"}])
print(f"Response: {result[:100]}")

Expected Token Savings: Centralized coordination prevents thundering-herd 429s that waste 3-5 retry attempts per request per instance.

Environment: Python 3.9+, anthropic>=0.40.0. No external dependencies — uses stdlib http.server. Replace with FastAPI for production.

Option 5 — Per-Model Per-Tier Quota Tracking

Track quotas separately per model and tier — different models have different rate limits.

import anthropic
import redis
import time
from dataclasses import dataclass
from typing import Optional

client = anthropic.Anthropic()
redis_client = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

@dataclass
class ModelQuota:
    rpm: int      # requests per minute
    tpm: int      # tokens per minute
    tpd: int      # tokens per day

# Anthropic rate limits vary by model and tier — adjust to your actual plan
MODEL_QUOTAS = {
    "claude-haiku-4-5-20251001": ModelQuota(rpm=2000, tpm=200_000, tpd=5_000_000),
    "claude-sonnet-4-6":          ModelQuota(rpm=1000, tpm=80_000,  tpd=2_000_000),
    "claude-opus-4-6":            ModelQuota(rpm=500,  tpm=40_000,  tpd=1_000_000),
}

class MultiModelQuotaManager:
    """Track and enforce quotas independently per model."""

    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def _rpm_key(self, model: str) -> str:
        window = int(time.time()) // 60
        return f"quota:{model}:rpm:{window}"

    def _tpm_key(self, model: str) -> str:
        window = int(time.time()) // 60
        return f"quota:{model}:tpm:{window}"

    def _tpd_key(self, model: str) -> str:
        day = int(time.time()) // 86400
        return f"quota:{model}:tpd:{day}"

    def check_and_reserve(
        self, model: str, estimated_tokens: int
    ) -> tuple[bool, dict]:
        """
        Check if request fits within model's quota.
        Returns (allowed, quota_status).
        """
        quota = MODEL_QUOTAS.get(model)
        if not quota:
            return True, {"error": f"Unknown model: {model}"}

        pipe = self.redis.pipeline()
        pipe.incr(self._rpm_key(model))
        pipe.expire(self._rpm_key(model), 120)
        pipe.incrby(self._tpm_key(model), estimated_tokens)
        pipe.expire(self._tpm_key(model), 120)
        pipe.incrby(self._tpd_key(model), estimated_tokens)
        pipe.expire(self._tpd_key(model), 172800)  # 2 days
        results = pipe.execute()

        rpm_used = results[0]
        tpm_used = results[2]
        tpd_used = results[4]

        status = {
            "model": model,
            "rpm": f"{rpm_used}/{quota.rpm}",
            "tpm": f"{tpm_used}/{quota.tpm}",
            "tpd": f"{tpd_used}/{quota.tpd}",
        }

        # Check all limits
        if rpm_used > quota.rpm:
            self._rollback(model, estimated_tokens)
            return False, {**status, "blocked_by": "rpm"}
        if tpm_used > quota.tpm:
            self._rollback(model, estimated_tokens)
            return False, {**status, "blocked_by": "tpm"}
        if tpd_used > quota.tpd:
            self._rollback(model, estimated_tokens)
            return False, {**status, "blocked_by": "tpd"}

        return True, status

    def _rollback(self, model: str, tokens: int):
        """Roll back a reservation that was denied."""
        pipe = self.redis.pipeline()
        pipe.decr(self._rpm_key(model))
        pipe.decrby(self._tpm_key(model), tokens)
        pipe.decrby(self._tpd_key(model), tokens)
        pipe.execute()

manager = MultiModelQuotaManager(redis_client)

def smart_model_call(
    messages: list[dict],
    preferred_model: str = "claude-sonnet-4-6",
    fallback_model: str = "claude-haiku-4-5-20251001",
    estimated_tokens: int = 500,
) -> Optional[str]:
    """
    Call preferred model, fall back to cheaper model if quota is hit.
    """
    for model in [preferred_model, fallback_model]:
        allowed, status = manager.check_and_reserve(model, estimated_tokens)

        if allowed:
            print(f"  Using {model} | {status['rpm']} RPM | {status['tpm']} TPM")
            response = client.messages.create(
                model=model,
                max_tokens=min(estimated_tokens, 1024),
                messages=messages,
            )
            return response.content[0].text

        print(f"  {model} quota exceeded ({status.get('blocked_by')}): {status}")

    print("  All models at quota — request dropped")
    return None

# Test: automatically falls back from Sonnet to Haiku under quota pressure
result = smart_model_call(
    messages=[{"role": "user", "content": "Summarize this briefly: AI is transforming software."}],
    preferred_model="claude-sonnet-4-6",
    fallback_model="claude-haiku-4-5-20251001",
    estimated_tokens=200,
)
print(f"Result: {result[:100] if result else 'None'}")

Expected Token Savings: Smart model fallback uses cheaper models when quota is tight, potentially cutting token costs 3-5× while maintaining throughput.

Environment: Python 3.9+, redis>=4.0, anthropic>=0.40.0.

Option 6 — Quota-Aware Request Queue with Priority Lanes

Buffer requests in a priority queue — high-priority requests get quota slots first; low-priority ones wait.

import anthropic
import asyncio
import heapq
import time
import json
from dataclasses import dataclass, field
from typing import Any, Optional
from enum import IntEnum

client = anthropic.AsyncAnthropic()

class Priority(IntEnum):
    CRITICAL = 0   # user-facing, interactive
    HIGH = 1       # background tasks, SLA-bound
    NORMAL = 2     # batch processing
    LOW = 3        # background analytics

@dataclass(order=True)
class QueuedRequest:
    priority: int
    enqueued_at: float = field(compare=False)
    request_id: str = field(compare=False)
    messages: list = field(compare=False)
    future: asyncio.Future = field(compare=False)
    model: str = field(compare=False, default="claude-sonnet-4-6")

class PriorityQuotaQueue:
    """
    Async priority queue with rate limiting.
    High-priority requests get quota slots first.
    """

    def __init__(self, rpm: int, burst: int = 20):
        self.rpm = rpm
        self.burst = burst
        self._queue: list[QueuedRequest] = []
        self._tokens: float = burst
        self._last_refill: float = time.monotonic()
        self._lock = asyncio.Lock()
        self._worker_task: Optional[asyncio.Task] = None
        self._processed = 0
        self._dropped = 0

    async def start(self):
        self._worker_task = asyncio.create_task(self._process_loop())

    async def stop(self):
        if self._worker_task:
            self._worker_task.cancel()
            try:
                await self._worker_task
            except asyncio.CancelledError:
                pass

    async def enqueue(
        self,
        messages: list,
        priority: Priority = Priority.NORMAL,
        timeout: float = 30.0,
        model: str = "claude-sonnet-4-6",
    ) -> Any:
        """Add request to queue. Blocks until result is ready or timeout."""
        future: asyncio.Future = asyncio.get_event_loop().create_future()
        request = QueuedRequest(
            priority=int(priority),
            enqueued_at=time.monotonic(),
            request_id=f"{time.time():.6f}",
            messages=messages,
            future=future,
            model=model,
        )

        async with self._lock:
            heapq.heappush(self._queue, request)

        try:
            return await asyncio.wait_for(future, timeout=timeout)
        except asyncio.TimeoutError:
            self._dropped += 1
            raise TimeoutError(f"Request timed out after {timeout}s in queue")

    def _refill_tokens(self):
        now = time.monotonic()
        elapsed = now - self._last_refill
        self._tokens = min(self.burst, self._tokens + elapsed * (self.rpm / 60))
        self._last_refill = now

    async def _process_loop(self):
        while True:
            async with self._lock:
                self._refill_tokens()

                if self._queue and self._tokens >= 1.0:
                    # Pop highest priority (lowest number) request
                    request = heapq.heappop(self._queue)
                    self._tokens -= 1.0
                else:
                    request = None

            if request is None:
                await asyncio.sleep(0.05)  # wait for tokens to refill
                continue

            # Execute the request
            try:
                wait_time = time.monotonic() - request.enqueued_at
                print(
                    f"  [queue] Processing priority={request.priority} "
                    f"after {wait_time:.2f}s wait | queue_size={len(self._queue)}"
                )
                response = await client.messages.create(
                    model=request.model,
                    max_tokens=256,
                    messages=request.messages,
                )
                request.future.set_result(response.content[0].text)
                self._processed += 1
            except Exception as e:
                request.future.set_exception(e)

queue = PriorityQuotaQueue(rpm=500, burst=20)

async def demo_priority_queue():
    await queue.start()

    # Simulate mixed-priority traffic from multiple instances
    tasks = [
        queue.enqueue(
            [{"role": "user", "content": "CRITICAL: Check system health"}],
            priority=Priority.CRITICAL
        ),
        queue.enqueue(
            [{"role": "user", "content": "LOW: Generate analytics report"}],
            priority=Priority.LOW
        ),
        queue.enqueue(
            [{"role": "user", "content": "NORMAL: Answer user question"}],
            priority=Priority.NORMAL
        ),
        queue.enqueue(
            [{"role": "user", "content": "HIGH: Process customer order"}],
            priority=Priority.HIGH
        ),
    ]

    results = await asyncio.gather(*tasks, return_exceptions=True)

    for i, result in enumerate(results):
        if isinstance(result, Exception):
            print(f"Task {i}: ERROR — {result}")
        else:
            print(f"Task {i}: {str(result)[:80]}...")

    await queue.stop()
    print(f"\nProcessed: {queue._processed} | Dropped: {queue._dropped}")

asyncio.run(demo_priority_queue())

Expected Token Savings: Priority lanes ensure interactive users never wait — low-priority batch work absorbs quota pressure, preventing SLA violations that require expensive retries.

Environment: Python 3.9+, asyncio, anthropic>=0.40.0.

Comparison

Option	State Store	Latency Overhead	Burst Support	Priority
1 — Redis Fixed Window	Redis	~1ms	No	No
2 — Redis Sliding Window	Redis	~2ms	No	No
3 — Redis Token Bucket	Redis	~2ms	Yes	No
4 — HTTP Coordinator	In-memory service	~5ms	No	No
5 — Per-Model Tracking	Redis	~3ms	No	No
6 — Priority Queue	In-process	~0ms	Yes	Yes

Start with Option 2 (sliding window in Redis) for most deployments — it’s accurate, atomic, and handles multiple instances correctly. Add Option 5 (per-model tracking) when you use multiple model tiers. Use Option 6 (priority queue) when different workloads compete for the same quota.

Wasting tokens on this error?

Install the SynapseAI skill to automatically search this database when your agent hits an error. Average savings: $2–5 per error incident.

clawhub install synapse-ai

Solved an error that's not here?

Share it and earn MoltCoin rewards.

Contribute a solution →