Agent Fails on Non-UTF-8 Files — UnicodeDecodeError Reading Source Files

Symptom

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x92 in position 45
Agent crashes reading CSV files exported from Excel
Legacy codebase files open fine in editors but fail in agent
Works on some files, fails on others that look identical
Error mentions specific bytes like 0x92, 0xe9, 0xa0

Root Cause

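Python 3 decodes text files with the locale's preferred encoding unless encoding= is passed explicitly. That is UTF-8 on most Linux and macOS systems, and agent tooling often passes encoding="utf-8" explicitly as well. Files saved by Windows applications (Excel, Notepad, older IDEs) often use cp1252 (Windows-1252) or latin-1. Files from East Asian systems may use gbk, shift-jis, or euc-kr. Decoding any of these as UTF-8 fails as soon as the file contains a non-ASCII byte sequence that is not valid UTF-8, which is why files that contain only ASCII work while files that look identical do not.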
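To make the failure concrete, here is a minimal sketch. The sample bytes are an illustrative assumption (the text "It’s café" as a Windows application would typically save it), not taken from a real file.

```python
# Hypothetical sample: "It’s café" saved as cp1252 by a Windows application.
raw = b"It\x92s caf\xe9"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    # 0x92 is a UTF-8 continuation byte, so it cannot start a character.
    utf8_error = exc

# The same bytes are valid cp1252: 0x92 is a right single quote, 0xe9 is é.
text = raw.decode("cp1252")

print(utf8_error)
print(text)
```

The byte values named in the Symptom list (0x92, 0xe9, 0xa0) are all ordinary printable characters in cp1252, which is a strong hint that the file came from a Windows application.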

Fix

Option 1: Detect encoding with chardet/charset-normalizer

# pip install chardet
import chardet

def read_file_autodetect(path: str) -> str:
    """Read file with automatic encoding detection"""
    with open(path, "rb") as f:
        raw = f.read()
    detected = chardet.detect(raw)
    encoding = detected.get("encoding") or "utf-8"
    confidence = detected.get("confidence", 0)

    print(f"Detected encoding: {encoding} (confidence: {confidence:.0%})")

    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Fallback to latin-1 (accepts any byte value)
        print(f"Decoding with {encoding} failed, falling back to latin-1")
        return raw.decode("latin-1")

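# pip install charset-normalizer (alternative to chardet; its README reports better accuracy)
from charset_normalizer import from_bytes

def read_file_normalized(path: str) -> str:
    with open(path, "rb") as f:
        raw = f.read()
    result = from_bytes(raw).best()
    if result is None:
        # No plausible encoding found; latin-1 accepts any byte value
        return raw.decode("latin-1")
    return str(result)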

Option 2: Try encodings in priority order

ENCODING_PRIORITY = [
    "utf-8-sig",   # UTF-8 with or without BOM (BOM is common from Excel)
    "cp1252",      # Windows Western European (rejects only 5 byte values)
    "gbk",         # Chinese Simplified
    "shift-jis",   # Japanese
    "euc-kr",      # Korean
    "latin-1",     # ISO 8859-1: accepts all byte values, so it must come last
]
# Note: cp1252 accepts almost any input, so the CJK entries are rarely reached.
# Move them above cp1252 if your files are mostly CJK.

def read_file_tolerant(path: str) -> tuple[str, str]:
    """Returns (content, encoding_used)"""
    with open(path, "rb") as f:
        raw = f.read()

    for encoding in ENCODING_PRIORITY:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue

    # Last resort: replace undecodable bytes.
    # Unreachable while latin-1 is in the list; kept in case the list is edited.
    return raw.decode("utf-8", errors="replace"), "utf-8 (with replacements)"

content, enc = read_file_tolerant("legacy_file.csv")
print(f"Read {len(content)} chars using {enc}")
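One caveat with trial decoding: a decode that succeeds is not necessarily correct. The sketch below uses an assumed sample (the GBK bytes for 你好) to show cp1252 and latin-1 accepting bytes that were never Western text, so the position of permissive encodings in a priority list matters.

```python
# Assumed sample: two Chinese characters encoded as GBK.
original = "你好"
raw = original.encode("gbk")

# Neither call raises, but both produce mojibake instead of the original text.
as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("latin-1")

print(repr(as_cp1252), as_cp1252 == original)
print(repr(as_latin1), as_latin1 == original)

# latin-1 maps every byte 0x00-0xFF to a code point, so it can never fail.
assert len(bytes(range(256)).decode("latin-1")) == 256
```

If the result of a tolerant read looks garbled, treat the reported encoding as a guess and confirm it with a detector or with knowledge of where the file came from.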

Option 3: Safe file reader with errors parameter

def read_file_safe(path: str, encoding: str = "utf-8") -> str:
    """Read file, replacing undecodable characters instead of crashing"""
    with open(path, encoding=encoding, errors="replace") as f:
        content = f.read()

    # Check if replacement characters appeared (indicates wrong encoding)
    if "\ufffd" in content:
        replacement_count = content.count("\ufffd")
        print(f"Warning: {replacement_count} undecodable characters replaced with \ufffd")
        print("Try detecting encoding with: chardet.detect(open(path, 'rb').read())")

    return content
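The errors parameter accepts handlers other than "replace". As a rough comparison, the sketch below decodes one assumed sample (b"caf\xe9!", which is "café!" in cp1252 and invalid as UTF-8) with four of the standard handlers.

```python
raw = b"caf\xe9!"  # assumed sample: valid cp1252, invalid UTF-8

replaced = raw.decode("utf-8", errors="replace")           # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")             # bad byte is dropped silently
escaped = raw.decode("utf-8", errors="backslashreplace")   # bad byte becomes the text \xe9
smuggled = raw.decode("utf-8", errors="surrogateescape")   # bad byte becomes a lone surrogate

# Only surrogateescape keeps enough information to rebuild the original bytes.
restored = smuggled.encode("utf-8", errors="surrogateescape")

for name, value in [("replace", replaced), ("ignore", ignored),
                    ("backslashreplace", escaped), ("surrogateescape", smuggled)]:
    print(f"{name:17} {value!r}")
```

"ignore" loses data without leaving a trace, so "replace" or "backslashreplace" is usually the safer choice when the content will be shown to a model or a person.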

Option 4: Handle CSV files specifically

import csv
import chardet

def read_csv_autodetect(path: str) -> list[dict]:
    """Read CSV with automatic encoding detection"""
    with open(path, "rb") as f:
        raw = f.read()
    # chardet returns {"encoding": None} when it cannot decide, so use `or`
    encoding = chardet.detect(raw).get("encoding") or "utf-8"

    with open(path, encoding=encoding, newline="", errors="replace") as f:
        reader = csv.DictReader(f)
        return list(reader)

# For pandas
import pandas as pd

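def read_csv_pandas(path: str) -> pd.DataFrame:
    """Read CSV trying multiple encodings"""
    # utf-8-sig also reads plain UTF-8, and strips the BOM when one is present
    for encoding in ["utf-8-sig", "cp1252"]:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError:
            continue
    # Final fallback: latin-1 accepts every byte value, so decoding cannot fail.
    # (read_csv has no errors= keyword; pandas 1.3+ uses encoding_errors= instead.)
    return pd.read_csv(path, encoding="latin-1")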

Option 5: Convert file to UTF-8 before processing

# Convert file to UTF-8 using iconv
iconv -f cp1252 -t utf-8 input.csv > output_utf8.csv

# Detect encoding first
file -i legacy_file.txt       # Linux: shows charset
enca legacy_file.txt          # More detailed detection

# Python one-liner to convert
python3 -c "
import chardet
raw = open('input.csv', 'rb').read()
enc = chardet.detect(raw)['encoding'] or 'latin-1'  # detection can return None
open('output.csv', 'w', encoding='utf-8').write(raw.decode(enc))
print(f'Converted from {enc} to UTF-8')
"

Option 6: System prompt guidance for agent

System prompt:
"When reading files:
Always open files in binary mode first to detect encoding
Use chardet.detect() if the file may not be UTF-8
Never assume all files are UTF-8 — especially CSV, legacy code files, and Windows files
If you encounter UnicodeDecodeError, try cp1252 or latin-1 before giving up
latin-1 accepts any single byte — use as absolute fallback (content may have garbled chars)"
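The latin-1 fallback is safe to use because it is lossless at the byte level: every byte maps to exactly one code point, so garbled text can be repaired later once the real encoding is known. A minimal sketch, with an assumed sample string standing in for a file of unknown encoding:

```python
# Assumed sample: UTF-8 bytes whose encoding the reader does not know yet.
raw = "naïve façade".encode("utf-8")

# Fallback read: never raises, but accented characters come out garbled.
fallback = raw.decode("latin-1")

# Re-encoding with latin-1 returns the exact original bytes...
recovered_bytes = fallback.encode("latin-1")

# ...which can then be decoded properly once the real encoding is identified.
repaired = recovered_bytes.decode("utf-8")

print(repr(fallback))
print(repr(repaired))
```

This only holds if the garbled string is kept intact. Once it has been edited or normalized, the original bytes may no longer be recoverable.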

Encoding by File Source

Source	Likely encoding
Modern web/API	UTF-8
Excel CSV export (Western)	cp1252 / UTF-8 with BOM
Windows Notepad (old)	cp1252
Linux files	UTF-8
macOS files	UTF-8
Japanese text files	shift-jis or euc-jp
Chinese text files	gbk or big5
Old Python 2 source	latin-1 or ascii
SQL dumps (MySQL)	latin-1 or utf-8
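The "UTF-8 with BOM" case in the table can be checked cheaply before running any statistical detection, because a byte order mark is an explicit signal rather than a guess. The helper below is a hypothetical sketch (sniff_bom is not part of any library) built on the BOM constants in the standard codecs module.

```python
import codecs
from typing import Optional

def sniff_bom(raw: bytes) -> Optional[str]:
    """Return a codec name if the data starts with a known BOM, else None."""
    # Check UTF-32 before UTF-16: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if raw.startswith(bom):
            return name
    return None

# Usage: the returned codecs consume the BOM while decoding.
sample = "id,name\n1,José\n".encode("utf-8-sig")
codec = sniff_bom(sample)
print(codec, repr(sample.decode(codec)))
```

When sniff_bom returns None, the file has no BOM and one of the detection or trial-decoding options above still applies.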

Expected Token Savings

Debugging encoding errors + retrying: ~3,000 tokens
Auto-detect encoding upfront: 0 wasted

Environment

Any agent reading files from mixed sources; most common with CSV ingestion
Source: direct experience with legacy codebases and Excel exports
