2 Commits

Author SHA1 Message Date
Mohamed Hegazy
6a63e35e75
security-guidance: lenient UTF-8 decode in 6 git-subprocess helpers (#2056)
Fixes anthropics/claude-plugins-official#2056 — on Windows, when the
worktree contains an untracked file whose name has a character undefined
in cp1252 (accented capitals like Á Í Ï Ð Ý, most CJK, emoji), the
UserPromptSubmit hook crashes:

  Exception in thread Thread-5 (_readerthread):
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x81
  Traceback (most recent call last):
    File diffstate.py, line 338, in _list_untracked
      for p in r.stdout.split('\\0'):
  AttributeError: 'NoneType' object has no attribute 'split'

Non-blocking (UPS failures still let the prompt through) but the
baseline-untracked snapshot is silently lost, so the Stop-hook review
mis-handles pre-existing untracked files.

Root cause (reporter's diagnosis, verified):

1. core.quotePath=false makes git emit raw UTF-8 for non-ASCII filenames.
2. subprocess.run(..., text=True) decodes via
   locale.getpreferredencoding(False) in strict mode — on Windows that
   is cp1252, in which 0x81 / 0x8D / 0x8F / 0x90 / 0x9D are undefined.
   Those bytes appear in the UTF-8 encodings of Á (C3 81), Í (C3 8D),
   Ï (C3 8F), Ð (C3 90), Ý (C3 9D), and a large fraction of CJK / emoji
   codepoints.
3. The decode runs in the subprocess reader thread. The thread raises
   UnicodeDecodeError, threading prints 'Exception in thread Thread-N',
   subprocess.run returns with stdout=None. The handler then does
   None.split('\\0') -> AttributeError, which is NOT in the narrow
   except (TimeoutExpired, FileNotFoundError, OSError) tuple, so it
   escapes the helper, propagates out of UserPromptSubmit's
   ThreadPoolExecutor.result(), and exits the hook non-zero.

This is internally inconsistent: gitutil._git_diff_range,
security_reminder_hook._reflog_amend_lookup (line ~540), and the commit
diff loop (line ~1115) already do bytes + decode utf-8/replace, with
comments explicitly noting that text=True would crash. The fix below
extends that established pattern to the helpers that were holdouts.

Affected helpers (6 total):

  - diffstate._list_untracked            <- reporter, hot path, CRITICAL
  - diffstate.capture_git_baseline       <- reporter, latent
  - diffstate.get_baseline_file_content  <- audit, file content read, HIGH
  - gitutil._git_name_only                <- reporter, latent
  - gitutil._git_status_porcelain         <- reporter, latent
  - gitutil._git_reflog_recent_commits    <- audit, embeds %gs commit msg, HIGH

For each one:

  - Drop text=True from subprocess.run.
  - Decode r.stdout / r.stderr as .decode('utf-8', errors='replace').
  - Add ValueError to the except tuple as defense against any future
    strict-decode regression (UnicodeDecodeError is a ValueError
    subclass; including it explicitly degrades the helper to its
    empty/None return instead of escaping out of the hook).

Verified locally on macOS Python 3.13:

  - py_compile clean on both files.
  - 45 existing smoke + extensibility tests still pass.
  - 21 new internal tests (not in this PR — added to the team's local
    test suite at staging/tests/test_unicode_decode.py):
      * 18 static-shape parametrized: each of the 6 fixed helpers has
        no text=True in its subprocess calls, contains errors='replace',
        and lists ValueError in its except.
      * Deterministic end-to-end: create real git repo + Ávila_report.txt
        untracked, call _list_untracked, verify it returns
        {'Ávila_report.txt': <mtime>} without crashing.
      * Deterministic end-to-end: same for capture_git_baseline (verifies
        the latent stderr-warning case stays valid).
      * Deterministic end-to-end: get_baseline_file_content on a file
        whose content has 山田太郎 + 🎉; verify the bytes round-trip
        through the decode.
  - 66/66 tests pass total (45 existing + 21 new).

NOT verified end-to-end on Windows — would need actual cp1252 strict
decode to fire. Reporter has the deterministic repro and will
re-verify on their Win11 / Python 3.14.x setup before merge.

Not in this PR (defense-in-depth, lower risk):

  - 3 git rev-parse calls returning path output (gitutil._find_git_index,
    _git_toplevel, _git_dir) could fail on Windows if cwd is in a
    non-ASCII install directory. Same fix shape but unreported and
    much lower probability — worth a separate follow-up if anyone
    actually hits it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 23:15:16 -07:00
Mohamed Hegazy
0bde168648
Update security-guidance plugin 2026-05-26 14:06:52 -07:00