Agent Reverts to Default Behavior After the First Few Turns

Symptom

Agent uses correct persona for 2 turns, then starts responding as generic Claude
Output format (JSON, markdown table, numbered list) is respected initially, then abandoned
Topic restriction (“only discuss cooking”) is followed at first, then violated by turn 5
Tone (formal, terse, empathetic) drifts after a few exchanges
A custom sign-off or greeting is dropped after the first response
The agent was told “never use bullet points” — by turn 4, bullet points appear

Root Cause

Claude's instruction-following appears to degrade for two reasons. First, instructions in the system prompt compete with the growing conversation for the model's attention: as the conversation lengthens, the system prompt makes up a smaller share of the context and recent turns carry more influence. Second, ambiguous or sparse instructions are easily overridden by patterns in the user's own messages. The fix has four parts: (1) make instructions explicit and unambiguous, (2) reinject them periodically, (3) enforce format with tool schemas, and (4) catch drift before the user sees it.
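
The dilution described above can be made concrete with a rough calculation. This is only a sketch: it uses character counts as a crude stand-in for tokens, it says nothing about the model's actual attention, and `system_prompt_share` is a hypothetical helper rather than part of the Anthropic SDK.

```python
def system_prompt_share(system: str, messages: list[dict]) -> float:
    """Fraction of all context characters that belong to the system prompt.

    Character counts are a rough proxy for tokens (assumption).
    """
    system_chars = len(system)
    history_chars = sum(len(m["content"]) for m in messages)
    total = system_chars + history_chars
    return system_chars / total if total else 0.0


system = "You are Zara. Respond in <=3 sentences. No bullet points."
# One illustrative turn: a 200-character question and a 300-character answer.
turn = [
    {"role": "user", "content": "x" * 200},
    {"role": "assistant", "content": "y" * 300},
]

for n_turns in (1, 5, 20):
    share = system_prompt_share(system, turn * n_turns)
    print(f"{n_turns:>2} turns: system prompt is {share:.1%} of the context")
```

The share falls steadily as turns accumulate, which is the motivation for reinjection (Option 1) and trimming (Option 6).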

Fix

Option 1: Instruction reinforcement — periodic system prompt reinjection

import anthropic

client = anthropic.Anthropic()

# The core rule: re-state critical instructions every N turns,
# not just once at the start.

SYSTEM = """
You are Zara, a terse technical support agent.
Rules:
- Respond in ≤3 sentences.
- Never use bullet points or lists.
- End every response with "Ticket updated."
- Speak only about software issues. Deflect all off-topic requests.
"""

def build_messages_with_reinforcement(
    history: list[dict],
    new_user_message: str,
    reinforce_every_n: int = 3
) -> list[dict]:
    """
    Inject a reminder every N assistant turns to prevent instruction drift.
    The reminder is injected as a user turn, followed by a short assistant
    acknowledgment, so both land in recent context just before the new message.
    """
    messages = list(history) + [{"role": "user", "content": new_user_message}]

    # Count completed assistant turns
    assistant_turns = sum(1 for m in history if m["role"] == "assistant")

    # Inject reminder before the new user message every N turns
    if assistant_turns > 0 and assistant_turns % reinforce_every_n == 0:
        reminder = {
            "role": "user",
            "content": (
                "[System reminder: You are Zara. Respond in ≤3 sentences. "
                "No bullet points. End with 'Ticket updated.' Software topics only.]"
            )
        }
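        # Insert the reminder, then the acknowledgment, just before the new user
        # message. Order becomes: ..., reminder (user), ack (assistant), new user message.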
        messages.insert(-1, reminder)
        messages.insert(-1, {"role": "assistant", "content": "Understood. Continuing as instructed."})

    return messages


def chat(history: list[dict], user_message: str) -> tuple[str, list[dict]]:
    messages = build_messages_with_reinforcement(history, user_message)
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=SYSTEM,
        messages=messages
    )
    reply = response.content[0].text
    updated_history = list(history) + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": reply}
    ]
    return reply, updated_history


# Simulate a 6-turn conversation
history = []
queries = [
    "My app crashes on startup.",
    "Nothing in the logs — what next?",
    "Still failing. What's the weather like?",   # off-topic attempt
    "OK fine. Reset didn't work.",
    "How do I escalate?",
    "Can you write me a poem?",                  # off-topic attempt at turn 6
]
for q in queries:
    reply, history = chat(history, q)
    print(f"User: {q}")
    print(f"Zara: {reply}\n")

Option 2: Format enforcement via tool use — structural guarantees, not just text instructions

import anthropic
import json

client = anthropic.Anthropic()

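# If the agent must always return a specific structure, set
# tool_choice={"type": "any"} to force a tool call on every turn. The structure
# then comes from the tool's input_schema rather than from prose instructions.
# Forcing a tool call does not by itself validate the input against the schema,
# so check required fields before relying on the result.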

SUPPORT_RESPONSE_TOOL = {
    "name": "support_response",
    "description": "Format a support response",
    "input_schema": {
        "type": "object",
        "properties": {
            "diagnosis": {
                "type": "string",
                "description": "One-sentence diagnosis of the issue"
            },
            "next_step": {
                "type": "string",
                "description": "One concrete action for the user to take"
            },
            "ticket_status": {
                "type": "string",
                "enum": ["open", "pending_user", "escalated", "resolved"]
            },
            "is_in_scope": {
                "type": "boolean",
                "description": "False if the user's message is off-topic for software support"
            },
            "deflection_message": {
                "type": "string",
                "description": "If is_in_scope is false, a polite refusal. Otherwise empty string."
            }
        },
        "required": ["diagnosis", "next_step", "ticket_status", "is_in_scope", "deflection_message"]
    }
}

SYSTEM = """
You are Zara, a terse software support agent.
You ALWAYS call the support_response tool. Never respond in plain text.
Only address software/technical issues. For off-topic requests, set is_in_scope=false.
"""

def zara_respond(history: list[dict], user_message: str) -> dict:
    messages = list(history) + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        tools=[SUPPORT_RESPONSE_TOOL],
        tool_choice={"type": "any"},     # ← forces tool use every turn
        messages=messages
    )
    # Extract the tool input: a dict that should follow input_schema
    for block in response.content:
        if block.type == "tool_use" and block.name == "support_response":
            return block.input
    raise ValueError("Model did not call support_response tool")


# The structure is set by the schema, so it does not depend on conversation length.
# To continue a conversation, follow each tool_use block with a matching
# tool_result block in the next user message.
history = []
result = zara_respond(history, "App crashes on startup")
print(json.dumps(result, indent=2))
# Example output (wording will vary):
# {
#   "diagnosis": "Application fails during initialization.",
#   "next_step": "Check startup logs for the exact exception.",
#   "ticket_status": "open",
#   "is_in_scope": true,
#   "deflection_message": ""
# }

result = zara_respond(history, "What's the weather like today?")
print(result["is_in_scope"])          # expected: False
print(result["deflection_message"])   # e.g. "I can only assist with software issues."
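
Forcing a tool call guarantees that a tool_use block comes back, but a cheap local check on its input is still worth running before the payload reaches downstream code. A minimal sketch, assuming the same field names as SUPPORT_RESPONSE_TOOL; `validate_support_response` is a hypothetical helper that uses only the standard library.

```python
ALLOWED_STATUSES = {"open", "pending_user", "escalated", "resolved"}
EXPECTED_TYPES = {
    "diagnosis": str,
    "next_step": str,
    "ticket_status": str,
    "is_in_scope": bool,
    "deflection_message": str,
}


def validate_support_response(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    problems = []
    for field, expected_type in EXPECTED_TYPES.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")

    status = payload.get("ticket_status")
    if isinstance(status, str) and status not in ALLOWED_STATUSES:
        problems.append(f"unknown ticket_status: {status}")

    # An out-of-scope reply should carry a non-empty deflection message.
    if payload.get("is_in_scope") is False and not payload.get("deflection_message"):
        problems.append("out-of-scope reply has no deflection_message")
    return problems


good = {
    "diagnosis": "Application fails during initialization.",
    "next_step": "Check startup logs for the exact exception.",
    "ticket_status": "open",
    "is_in_scope": True,
    "deflection_message": "",
}
print(validate_support_response(good))  # []
```

A non-empty result can be treated the same way as a drift violation: retry the request or fall back to a fixed response.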

Option 3: Instruction density — pack constraints into the first assistant turn

import anthropic

client = anthropic.Anthropic()

# Technique: prime the conversation by having the assistant's FIRST response
# explicitly echo back the key constraints. This anchors the behavior early
# and makes the rules present in the recent context window.

SYSTEM = """
You are Dr. Chen, a medical information assistant.
Strict rules:
1. Never provide diagnoses — only general health information.
2. Always recommend consulting a doctor for personal symptoms.
3. Respond in plain language, no medical jargon.
4. Keep responses under 100 words.
5. Never discuss medications by brand name.
"""

def initialize_conversation() -> list[dict]:
    """
    Seed the conversation with an explicit acknowledgment of constraints.
    This puts the rules in the recent context, not just the system prompt.
    """
    return [
        {
            "role": "user",
            "content": "Hello, what can you help me with?"
        },
        {
            "role": "assistant",
            "content": (
                "Hi! I'm Dr. Chen, a general health information assistant. "
                "I can share general health education, but I can't diagnose conditions "
                "or recommend specific medications — for that, please see a doctor. "
                "What health topic would you like to learn about?"
            )
        }
    ]


def chat_with_primed_history(history: list[dict], user_message: str) -> tuple[str, list[dict]]:
    messages = list(history) + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=SYSTEM,
        messages=messages
    )
    reply = response.content[0].text
    return reply, messages + [{"role": "assistant", "content": reply}]


# Start with a primed history instead of an empty one:
history = initialize_conversation()
# The constraints are now part of the visible conversation, which should reduce drift
reply, history = chat_with_primed_history(history, "I have a headache, what's wrong with me?")
print(reply)  # Expected: no diagnosis, a recommendation to see a doctor, persona intact

Option 4: Drift detection — catch and correct behavioral drift automatically

import anthropic
import re

client = anthropic.Anthropic()

SYSTEM = """
You are Aria, a formal enterprise assistant for Acme Corp.
Rules:
- Use formal language only. No contractions (don't, can't, won't → do not, cannot, will not).
- Never use emojis.
- Refer to users as "you" not "buddy", "mate", etc.
- Respond only about Acme Corp products. Deflect off-topic questions.
"""

class DriftDetector:
    """
    Checks agent responses for signs of behavioral drift.
    Flags violations before returning the response to the user.
    """
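    # Match both straight (') and curly (’) apostrophes; model output often uses the curly form.
    INFORMAL_CONTRACTIONS = re.compile(
        r"\b(?:(?:don|can|won|isn|aren)['’]t|I['’](?:m|ve|ll)|you['’]re|it['’]s)\b",
        re.IGNORECASE,
    )
    # Common emoji blocks only (flags, pictographs, emoticons, misc symbols, dingbats).
    # Not exhaustive, but it avoids flagging non-emoji characters such as rare CJK ideographs.
    EMOJI_PATTERN = re.compile(r"[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]")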
    INFORMAL_ADDRESS = re.compile(r"\b(buddy|mate|pal|friend|hey there|yo)\b", re.IGNORECASE)

    def check(self, response: str) -> list[str]:
        violations = []
        if self.INFORMAL_CONTRACTIONS.search(response):
            violations.append("informal_contraction")
        if self.EMOJI_PATTERN.search(response):
            violations.append("emoji_used")
        if self.INFORMAL_ADDRESS.search(response):
            violations.append("informal_address")
        return violations


def aria_chat(history: list[dict], user_message: str) -> tuple[str, list[dict], list[str]]:
    messages = list(history) + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=SYSTEM,
        messages=messages
    )
    reply = response.content[0].text
    violations = DriftDetector().check(reply)

    if violations:
        # Correction pass: ask the model to fix its own output
        correction_messages = messages + [
            {"role": "assistant", "content": reply},
            {
                "role": "user",
                "content": (
                    f"[Internal: Your response violated these rules: {violations}. "
                    "Please rewrite it following all formatting rules: formal language, "
                    "no contractions, no emojis, no informal address.]"
                )
            }
        ]
        correction = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=SYSTEM,
            messages=correction_messages
        )
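        reply = correction.content[0].text
        # Re-check the rewrite: a single correction pass is not guaranteed to succeed
        remaining = DriftDetector().check(reply)
        print(f"[drift-corrected] violations={violations} remaining={remaining}")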

    updated_history = messages + [{"role": "assistant", "content": reply}]
    return reply, updated_history, violations


history = []
reply, history, v = aria_chat(history, "What products does Acme offer?")
print(reply)
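
A single correction pass can itself return a violating reply. One way to handle this is to re-check every rewrite and stop after a fixed number of attempts. The sketch below is an assumption about how that loop could look, not part of the code above: `correct_with_retries` is hypothetical, and the checker and rewriter are passed in as plain callables so the loop can be exercised without an API call.

```python
from typing import Callable


def correct_with_retries(
    first_reply: str,
    check: Callable[[str], list[str]],
    rewrite: Callable[[str, list[str]], str],
    max_attempts: int = 2,
) -> tuple[str, list[str]]:
    """Re-check every rewrite; stop when clean or when attempts run out.

    Returns the final reply and any violations still present in it.
    """
    reply = first_reply
    violations = check(reply)
    attempts = 0
    while violations and attempts < max_attempts:
        reply = rewrite(reply, violations)
        violations = check(reply)
        attempts += 1
    return reply, violations


# Toy stand-ins for DriftDetector().check and the correction API call (assumptions).
def toy_check(text: str) -> list[str]:
    return ["informal_contraction"] if "can't" in text else []


def toy_rewrite(text: str, violations: list[str]) -> str:
    return text.replace("can't", "cannot")


final, remaining = correct_with_retries("We can't ship today.", toy_check, toy_rewrite)
print(final)      # We cannot ship today.
print(remaining)  # []
```

If violations remain after the last attempt, the caller can fall back to a fixed safe response instead of returning the drifted text.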

Option 5: Explicit constraint checklist in system prompt — structured instructions resist drift better

import anthropic

client = anthropic.Anthropic()

# Technique: structure system prompt as a numbered checklist rather than prose.
# Numbered rules are easier for the model to track across long contexts than
# dense paragraphs.

# WRONG — prose instruction that fades:
BAD_SYSTEM = """
You are a cooking assistant named Chef Marco. You should be friendly and enthusiastic
about cooking. You should keep your responses focused on culinary topics and avoid
talking about things that aren't related to food or cooking. Try to keep responses
concise and practical, around 2-3 sentences if possible.
"""

# RIGHT — numbered checklist with explicit constraints:
GOOD_SYSTEM = """
You are Chef Marco, a cooking assistant.

## Mandatory Rules (apply to EVERY response):
1. PERSONA: Always be Chef Marco. Never break character.
2. SCOPE: Only discuss food, cooking, recipes, and kitchen techniques. Refuse off-topic requests.
3. LENGTH: Maximum 3 sentences per response, except for recipes.
4. TONE: Enthusiastic but not sycophantic. No "Great question!"
5. FORMAT: No bullet points unless listing ingredients. No markdown headers.
6. SIGN-OFF: End every response with "Buon appetito!"

## Off-topic handling:
If user asks about anything not food-related: "I'm only able to help with cooking questions. Buon appetito!"
"""

def chef_marco_chat(history: list[dict], user_message: str) -> tuple[str, list[dict]]:
    messages = list(history) + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=GOOD_SYSTEM,
        messages=messages
    )
    reply = response.content[0].text
    return reply, messages + [{"role": "assistant", "content": reply}]


# Test drift resistance across 10 turns:
history = []
test_messages = [
    "How do I make pasta?",
    "What sauce goes with it?",
    "Tell me about the stock market",   # off-topic
    "How much salt should I use?",
    "Can you recommend a movie?",       # off-topic
    "What's the best way to sauté garlic?",
    "What's your opinion on politics?", # off-topic
    "How do I know when onions are caramelized?",
    "Tell me a joke",                   # off-topic
    "What temperature for roasting chicken?",
]
for msg in test_messages:
    reply, history = chef_marco_chat(history, msg)
    ends_correctly = reply.strip().endswith("Buon appetito!")
    print(f"[{'✓' if ends_correctly else '✗'}] {msg[:40]!r}")
    print(f"    {reply[:80]}...")
    print()
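
The loop above checks only the sign-off. Several other checklist rules can be tested mechanically as well, which gives a single compliance number to compare prompt variants. A rough sketch under stated assumptions: `check_chef_marco` and `compliance_rate` are hypothetical helpers, and they cover only the rules that simple string tests can verify (sign-off, no markdown headers, no "Great question!").

```python
import re

SIGN_OFF = "Buon appetito!"
# A markdown header: up to three spaces, one to six '#', then whitespace.
HEADER_LINE = re.compile(r"^[ ]{0,3}#{1,6}\s", re.MULTILINE)


def check_chef_marco(reply: str) -> dict[str, bool]:
    """Map each mechanically checkable rule to True (passed) or False (violated)."""
    return {
        "sign_off": reply.strip().endswith(SIGN_OFF),
        "no_markdown_headers": HEADER_LINE.search(reply) is None,
        "no_great_question": "great question" not in reply.lower(),
    }


def compliance_rate(replies: list[str]) -> float:
    """Share of replies that pass every rule; 0.0 for an empty list."""
    if not replies:
        return 0.0
    passed = sum(all(check_chef_marco(r).values()) for r in replies)
    return passed / len(replies)


samples = [
    "Salt the water generously before the pasta goes in. Buon appetito!",
    "Great question! Use a simple pomodoro.",
]
print(compliance_rate(samples))  # 0.5
```

Running the same test messages against BAD_SYSTEM and GOOD_SYSTEM and comparing the two rates is one way to check whether the checklist format actually helps for a given agent.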

Option 6: Context window management — trim distant history to keep system prompt dominant

import json

import anthropic

client = anthropic.Anthropic()

SYSTEM = """
You are a terse JSON-only API assistant.
CRITICAL: You ONLY output valid JSON. No prose, no markdown, no explanation.
Every response must be a JSON object with keys: "answer" (string) and "confidence" (0.0-1.0).
"""

def trim_to_recent(
    history: list[dict],
    max_turns: int = 6,
    always_keep_first_n: int = 2
) -> list[dict]:
    """
    Cap the history at max_turns turns in total: the first always_keep_first_n
    turns (which establish behavior) plus the most recent turns.
    Prevents the system prompt from being drowned out by a long conversation.
    """
    if len(history) <= max_turns * 2:
        return history

    # Always keep the first few turns (they establish the pattern)
    anchor = history[:always_keep_first_n * 2]
    # Plus the most recent turns (for continuity)
    recent_count = max_turns * 2 - len(anchor)
    if recent_count <= 0:
        # history[-0:] would return the whole list, so handle this case explicitly
        return anchor[:max_turns * 2]
    recent = history[-recent_count:]

    if anchor and anchor[-1]["role"] == recent[0]["role"]:
        # Avoid two consecutive messages of the same role
        recent = recent[1:]

    return anchor + recent


def json_api_chat(history: list[dict], user_message: str) -> tuple[dict, list[dict]]:
    trimmed = trim_to_recent(history)
    messages = trimmed + [{"role": "user", "content": user_message}]
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=256,
        system=SYSTEM,
        messages=messages
    )
    raw = response.content[0].text.strip()
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # If drift caused non-JSON output, ask a smaller model to convert it.
        # The second json.loads can still raise if the conversion is not valid JSON.
        fix_response = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=128,
            messages=[{
                "role": "user",
                "content": (
                    'Convert this to a JSON object with "answer" and "confidence" keys. '
                    f"Output only the JSON: {raw}"
                )
            }]
        )
        parsed = json.loads(fix_response.content[0].text.strip())

    # Store the parsed JSON rather than the raw reply, so a drifted response
    # does not become a pattern for later turns to imitate
    full_history = list(history) + [
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": json.dumps(parsed)}
    ]
    return parsed, full_history


# Simulate long conversation — format stays consistent even at turn 20
history = []
questions = ["What is 2+2?", "Capital of France?", "Speed of light?"] * 7
for q in questions:
    result, history = json_api_chat(history, q)
    assert "answer" in result and "confidence" in result, f"Drift at turn {len(history)//2}"
print(f"Format held for {len(history)//2} turns")
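
A reply that wraps valid JSON in a Markdown code fence fails json.loads even though the content itself is fine, so stripping the fence first can avoid an unnecessary correction call. A minimal sketch of a tolerant parser; `parse_json_reply` is a hypothetical helper, not part of the code above or of the Anthropic SDK.

```python
import json

TICKS = "`" * 3  # a Markdown fence marker, built programmatically


def parse_json_reply(raw: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding code fence."""
    text = raw.strip()
    if len(text) >= 2 * len(TICKS) and text.startswith(TICKS) and text.endswith(TICKS):
        # Drop the opening and closing fence markers.
        text = text[len(TICKS):-len(TICKS)].strip()
        # Drop an optional "json" language tag after the opening marker.
        if text.lower().startswith("json"):
            text = text[len("json"):].strip()
    parsed = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed


fenced = f'{TICKS}json\n{{"answer": "4", "confidence": 0.99}}\n{TICKS}'
print(parse_json_reply(fenced))
```

Because json.JSONDecodeError is a subclass of ValueError, a caller can catch ValueError to handle both invalid JSON and a non-object result.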

Instruction Drift — Cause and Countermeasure

Drift Cause	Countermeasure	Option
System prompt attention fades with long context	Periodic reinjection reminder	Option 1
Prose instructions are ambiguous	Tool schema enforces structure	Option 2
No early behavioral anchor	Prime first assistant turn	Option 3
Drift goes undetected	Drift detector + correction pass	Option 4
Dense prose is hard to track	Numbered checklist format	Option 5
Long history drowns system prompt	Trim to recent + anchor turns	Option 6

Expected Token Savings

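Drift correction (Option 4) adds one extra API call per detected violation. That call resends the whole conversation as input and generates up to 512 output tokens, so ~500 tokens is a lower bound that rises with conversation length; in exchange, detected violations never reach users. Trimming (Option 6) saves 20-60% of context tokens in long conversations. Tool enforcement (Option 2) removes the need for a format-correction pass, although tone and scope rules still depend on the model following instructions.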
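
How much trimming saves depends on conversation length, and a back-of-the-envelope count shows why. The sketch below counts messages rather than tokens and assumes every message is the same size; the cap of 12 history messages mirrors the default `max_turns=6` in `trim_to_recent`. Both function names are hypothetical.

```python
from typing import Optional


def cumulative_messages_sent(n_turns: int, max_history_messages: Optional[int] = None) -> int:
    """Total messages sent across n_turns requests (history plus the new user message)."""
    total = 0
    for turn in range(1, n_turns + 1):
        history_len = 2 * (turn - 1)  # one user and one assistant message per completed turn
        if max_history_messages is not None:
            history_len = min(history_len, max_history_messages)
        total += history_len + 1  # +1 for the new user message
    return total


def trimming_savings(n_turns: int, max_history_messages: int = 12) -> float:
    """Fraction of sent messages avoided by capping the history."""
    full = cumulative_messages_sent(n_turns)
    if full == 0:
        return 0.0
    trimmed = cumulative_messages_sent(n_turns, max_history_messages)
    return 1 - trimmed / full


for n in (10, 20, 40):
    print(f"{n} turns: {trimming_savings(n):.0%} fewer messages sent")
```

Under these assumptions the saving is small for short conversations and grows as the conversation gets longer, since every request past the cap sends a fixed-size history instead of an ever-growing one.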

Environment

Applies to any agent with a custom persona, output format, or topic restrictions
Drift is most severe in conversations longer than 10 turns
Mitigation is mandatory for customer-facing agents where persona consistency is a product requirement
Numbered checklist system prompts (Option 5) are the highest-leverage change and have zero runtime cost, so do this first
Tool enforcement (Option 2) is the most reliable option for format constraints
Combine Options 5 and 1 (checklist plus periodic reinjection) as a baseline for all production agents
