mirror of
https://github.com/anthropics/claude-plugins-official.git
synced 2026-06-13 22:26:03 -03:00
security-guidance: lenient UTF-8 decode in 6 git-subprocess helpers (#2056)
Fixes anthropics/claude-plugins-official#2056 — on Windows, when the worktree contains an untracked file whose name has a character undefined in cp1252 (accented capitals like Á Í Ï Ð Ý, most CJK, emoji), the UserPromptSubmit hook crashes: Exception in thread Thread-5 (_readerthread): UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 Traceback (most recent call last): File diffstate.py, line 338, in _list_untracked for p in r.stdout.split('\\0'): AttributeError: 'NoneType' object has no attribute 'split' Non-blocking (UPS failures still let the prompt through) but the baseline-untracked snapshot is silently lost, so the Stop-hook review mis-handles pre-existing untracked files. Root cause (reporter's diagnosis, verified): 1. core.quotePath=false makes git emit raw UTF-8 for non-ASCII filenames. 2. subprocess.run(..., text=True) decodes via locale.getpreferredencoding(False) in strict mode — on Windows that is cp1252, in which 0x81 / 0x8D / 0x8F / 0x90 / 0x9D are undefined. Those bytes appear in the UTF-8 encodings of Á (C3 81), Í (C3 8D), Ï (C3 8F), Ð (C3 90), Ý (C3 9D), and a large fraction of CJK / emoji codepoints. 3. The decode runs in the subprocess reader thread. The thread raises UnicodeDecodeError, threading prints 'Exception in thread Thread-N', subprocess.run returns with stdout=None. The handler then does None.split('\\0') -> AttributeError, which is NOT in the narrow except (TimeoutExpired, FileNotFoundError, OSError) tuple, so it escapes the helper, propagates out of UserPromptSubmit's ThreadPoolExecutor.result(), and exits the hook non-zero. This is internally inconsistent: gitutil._git_diff_range, security_reminder_hook._reflog_amend_lookup (line ~540), and the commit diff loop (line ~1115) already do bytes + decode utf-8/replace, with comments explicitly noting that text=True would crash. The fix below extends that established pattern to the helpers that were holdouts. Affected helpers (6 total): - diffstate._list_untracked <- reporter, hot path, CRITICAL - diffstate.capture_git_baseline <- reporter, latent - diffstate.get_baseline_file_content <- audit, file content read, HIGH - gitutil._git_name_only <- reporter, latent - gitutil._git_status_porcelain <- reporter, latent - gitutil._git_reflog_recent_commits <- audit, embeds %gs commit msg, HIGH For each one: - Drop text=True from subprocess.run. - Decode r.stdout / r.stderr as .decode('utf-8', errors='replace'). - Add ValueError to the except tuple as defense against any future strict-decode regression (UnicodeDecodeError is a ValueError subclass; including it explicitly degrades the helper to its empty/None return instead of escaping out of the hook). Verified locally on macOS Python 3.13: - py_compile clean on both files. - 45 existing smoke + extensibility tests still pass. - 21 new internal tests (not in this PR — added to the team's local test suite at staging/tests/test_unicode_decode.py): * 18 static-shape parametrized: each of the 6 fixed helpers has no text=True in its subprocess calls, contains errors='replace', and lists ValueError in its except. * Deterministic end-to-end: create real git repo + Ávila_report.txt untracked, call _list_untracked, verify it returns {'Ávila_report.txt': <mtime>} without crashing. * Deterministic end-to-end: same for capture_git_baseline (verifies the latent stderr-warning case stays valid). * Deterministic end-to-end: get_baseline_file_content on a file whose content has 山田太郎 + 🎉; verify the bytes round-trip through the decode. - 66/66 tests pass total (45 existing + 21 new). NOT verified end-to-end on Windows — would need actual cp1252 strict decode to fire. Reporter has the deterministic repro and will re-verify on their Win11 / Python 3.14.x setup before merge. Not in this PR (defense-in-depth, lower risk): - 3 git rev-parse calls returning path output (gitutil._find_git_index, _git_toplevel, _git_dir) could fail on Windows if cwd is in a non-ASCII install directory. Same fix shape but unreported and much lower probability — worth a separate follow-up if anyone actually hits it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
502de97746
commit
6a63e35e75
@ -138,7 +138,17 @@ def restore_unreviewed_stop_state(session_id, paths, baseline_sha):
|
|||||||
|
|
||||||
|
|
||||||
def get_baseline_file_content(session_id, file_path, cwd):
|
def get_baseline_file_content(session_id, file_path, cwd):
|
||||||
"""Get the content of a file at the baseline SHA. Returns None if unavailable."""
|
"""Get the content of a file at the baseline SHA. Returns None if unavailable.
|
||||||
|
|
||||||
|
Decode the file content as UTF-8 with errors="replace" rather than using
|
||||||
|
text=True: source files in user repos can be latin-1 / cp1252 / shift-jis
|
||||||
|
/ etc., and on Windows text=True would decode via locale.getpreferredencoding()
|
||||||
|
in strict mode and raise UnicodeDecodeError in the subprocess reader
|
||||||
|
thread — leaving result.stdout=None and propagating AttributeError when
|
||||||
|
the caller tries to use it. Same class as the existing migrations at
|
||||||
|
security_reminder_hook.py:540 (reflog subjects) and :1115 (commit
|
||||||
|
diffs); this helper was missed in that pass. See
|
||||||
|
anthropics/claude-plugins-official#2056."""
|
||||||
baseline_sha = load_baseline_sha(session_id)
|
baseline_sha = load_baseline_sha(session_id)
|
||||||
if not baseline_sha:
|
if not baseline_sha:
|
||||||
return None
|
return None
|
||||||
@ -151,12 +161,12 @@ def get_baseline_file_content(session_id, file_path, cwd):
|
|||||||
return None
|
return None
|
||||||
result = subprocess.run(
|
result = subprocess.run(
|
||||||
[*GIT_CMD, "show", f"{baseline_sha}:{rel_path}"],
|
[*GIT_CMD, "show", f"{baseline_sha}:{rel_path}"],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=5
|
cwd=cwd, capture_output=True, timeout=5
|
||||||
)
|
)
|
||||||
if result.returncode == 0:
|
if result.returncode == 0:
|
||||||
return result.stdout
|
return (result.stdout or b"").decode("utf-8", errors="replace")
|
||||||
return None
|
return None
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError):
|
||||||
return None
|
return None
|
||||||
|
|
||||||
|
|
||||||
@ -173,11 +183,16 @@ def capture_git_baseline(cwd):
|
|||||||
and `compute_v2_review_set` subtracts that set so pre-existing untracked
|
and `compute_v2_review_set` subtracts that set so pre-existing untracked
|
||||||
files are not reviewed as Claude-authored.
|
files are not reviewed as Claude-authored.
|
||||||
"""
|
"""
|
||||||
|
# stdout is a SHA so text=True is safe on stdout, but a non-ASCII
|
||||||
|
# filename in `git stash create`'s STDERR warning (e.g. a worktree
|
||||||
|
# with `Ávila_report.txt` triggers a quotePath/locale warning) would
|
||||||
|
# trip the stderr reader thread on Windows cp1252. Decode both streams
|
||||||
|
# leniently for symmetry with _list_untracked. See #2056.
|
||||||
try:
|
try:
|
||||||
# Check if HEAD exists (i.e., repo has at least one commit)
|
# Check if HEAD exists (i.e., repo has at least one commit)
|
||||||
head_check = subprocess.run(
|
head_check = subprocess.run(
|
||||||
[*GIT_CMD, "rev-parse", "HEAD"],
|
[*GIT_CMD, "rev-parse", "HEAD"],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=5
|
cwd=cwd, capture_output=True, timeout=5
|
||||||
)
|
)
|
||||||
if head_check.returncode != 0:
|
if head_check.returncode != 0:
|
||||||
# No commits yet — skip review rather than creating commits in the user's repo
|
# No commits yet — skip review rather than creating commits in the user's repo
|
||||||
@ -186,20 +201,20 @@ def capture_git_baseline(cwd):
|
|||||||
|
|
||||||
result = subprocess.run(
|
result = subprocess.run(
|
||||||
[*GIT_CMD, "stash", "create"],
|
[*GIT_CMD, "stash", "create"],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=15
|
cwd=cwd, capture_output=True, timeout=15
|
||||||
)
|
)
|
||||||
sha = result.stdout.strip()
|
sha = (result.stdout or b"").decode("utf-8", errors="replace").strip()
|
||||||
if sha:
|
if sha:
|
||||||
return sha
|
return sha
|
||||||
|
|
||||||
# Working tree is clean — stash create returns empty. Use HEAD.
|
# Working tree is clean — stash create returns empty. Use HEAD.
|
||||||
result = subprocess.run(
|
result = subprocess.run(
|
||||||
[*GIT_CMD, "rev-parse", "HEAD"],
|
[*GIT_CMD, "rev-parse", "HEAD"],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=5
|
cwd=cwd, capture_output=True, timeout=5
|
||||||
)
|
)
|
||||||
sha = result.stdout.strip()
|
sha = (result.stdout or b"").decode("utf-8", errors="replace").strip()
|
||||||
return sha if sha else None
|
return sha if sha else None
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError) as e:
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError) as e:
|
||||||
debug_log(f"Failed to capture git baseline: {e}")
|
debug_log(f"Failed to capture git baseline: {e}")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
@ -323,19 +338,35 @@ def _list_untracked(cwd):
|
|||||||
mtime is captured so an in-place edit during the turn is still reviewed.
|
mtime is captured so an in-place edit during the turn is still reviewed.
|
||||||
|
|
||||||
Uses ls-files (not status) for the UPS path: the index diff isn't needed,
|
Uses ls-files (not status) for the UPS path: the index diff isn't needed,
|
||||||
and ls-files --others only walks the worktree against .gitignore."""
|
and ls-files --others only walks the worktree against .gitignore.
|
||||||
|
|
||||||
|
Decodes stdout/stderr as UTF-8 with errors="replace" instead of using
|
||||||
|
text=True. With core.quotePath=false git emits raw UTF-8 bytes for
|
||||||
|
non-ASCII filenames; text=True decodes via locale.getpreferredencoding()
|
||||||
|
in strict mode — on Windows that's cp1252 with several undefined bytes
|
||||||
|
(0x81/0x8D/0x8F/0x90/0x9D), all of which appear in UTF-8 encodings of
|
||||||
|
common accented capitals (Á Í Ï Ð Ý) and most CJK/emoji codepoints.
|
||||||
|
A non-ASCII filename in the worktree crashed the subprocess reader
|
||||||
|
thread, left r.stdout=None, and propagated AttributeError out of the
|
||||||
|
helper — silently losing the baseline snapshot every UserPromptSubmit.
|
||||||
|
See anthropics/claude-plugins-official#2056. The sibling helpers in
|
||||||
|
gitutil.py already follow the lenient pattern; this function and
|
||||||
|
capture_git_baseline / _git_name_only / _git_status_porcelain were
|
||||||
|
the holdouts."""
|
||||||
try:
|
try:
|
||||||
repo = _git_toplevel(cwd) or cwd
|
repo = _git_toplevel(cwd) or cwd
|
||||||
r = subprocess.run(
|
r = subprocess.run(
|
||||||
[*GIT_CMD, "-c", "core.quotePath=false", "ls-files",
|
[*GIT_CMD, "-c", "core.quotePath=false", "ls-files",
|
||||||
"--others", "--exclude-standard", "-z"],
|
"--others", "--exclude-standard", "-z"],
|
||||||
cwd=repo, capture_output=True, text=True, timeout=15,
|
cwd=repo, capture_output=True, timeout=15,
|
||||||
)
|
)
|
||||||
if r.returncode != 0:
|
if r.returncode != 0:
|
||||||
debug_log(f"_list_untracked rc={r.returncode}: {r.stderr[:200]}")
|
stderr_str = (r.stderr or b"").decode("utf-8", errors="replace")
|
||||||
|
debug_log(f"_list_untracked rc={r.returncode}: {stderr_str[:200]}")
|
||||||
return {}
|
return {}
|
||||||
|
stdout = (r.stdout or b"").decode("utf-8", errors="replace")
|
||||||
out = {}
|
out = {}
|
||||||
for p in r.stdout.split("\0"):
|
for p in stdout.split("\0"):
|
||||||
if not p:
|
if not p:
|
||||||
continue
|
continue
|
||||||
try:
|
try:
|
||||||
@ -346,7 +377,9 @@ def _list_untracked(cwd):
|
|||||||
debug_log(f"_list_untracked: capped at {UNTRACKED_BASELINE_CAP}")
|
debug_log(f"_list_untracked: capped at {UNTRACKED_BASELINE_CAP}")
|
||||||
break
|
break
|
||||||
return out
|
return out
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError) as e:
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError) as e:
|
||||||
|
# ValueError guards against any future strict-decode regression
|
||||||
|
# so the helper degrades to {} instead of crashing the hook.
|
||||||
debug_log(f"_list_untracked error: {e}")
|
debug_log(f"_list_untracked error: {e}")
|
||||||
return {}
|
return {}
|
||||||
|
|
||||||
|
|||||||
@ -259,19 +259,29 @@ def _git_reflog_recent_commits(repo_root, max_age_s=120, max_n=5):
|
|||||||
# %gs (the reflog subject) is `commit: <commit-msg first line>` and can
|
# %gs (the reflog subject) is `commit: <commit-msg first line>` and can
|
||||||
# contain `|`; put it LAST so split("|", 2) leaves it intact. %H is
|
# contain `|`; put it LAST so split("|", 2) leaves it intact. %H is
|
||||||
# hex and %ct is integer, so the first two fields are delimiter-safe.
|
# hex and %ct is integer, so the first two fields are delimiter-safe.
|
||||||
|
#
|
||||||
|
# Bytes + decode utf-8/replace: %gs embeds commit-message subjects
|
||||||
|
# which git stores as raw bytes — commits can be authored in
|
||||||
|
# latin-1 / cp1252 / shift-jis etc., and text=True would raise
|
||||||
|
# UnicodeDecodeError in the subprocess reader thread on Windows
|
||||||
|
# cp1252 (subprocess.run returns r.stdout=None, then
|
||||||
|
# r.stdout.splitlines() AttributeErrors). Mirrors the existing
|
||||||
|
# migration at security_reminder_hook.py:540 — same pattern was
|
||||||
|
# missed here. See anthropics/claude-plugins-official#2056.
|
||||||
r = subprocess.run(
|
r = subprocess.run(
|
||||||
[*GIT_CMD, "log", "-g", "-n", str(max_n),
|
[*GIT_CMD, "log", "-g", "-n", str(max_n),
|
||||||
"--format=%H|%ct|%gs", "HEAD"],
|
"--format=%H|%ct|%gs", "HEAD"],
|
||||||
cwd=repo_root, capture_output=True, text=True, timeout=5,
|
cwd=repo_root, capture_output=True, timeout=5,
|
||||||
)
|
)
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError):
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError):
|
||||||
return [], 0
|
return [], 0
|
||||||
if r.returncode != 0:
|
if r.returncode != 0:
|
||||||
return [], 0
|
return [], 0
|
||||||
|
stdout = (r.stdout or b"").decode("utf-8", errors="replace")
|
||||||
import time as _time
|
import time as _time
|
||||||
now = int(_time.time())
|
now = int(_time.time())
|
||||||
fresh, stale = [], 0
|
fresh, stale = [], 0
|
||||||
for idx, line in enumerate(r.stdout.splitlines()):
|
for idx, line in enumerate(stdout.splitlines()):
|
||||||
parts = line.split("|", 2)
|
parts = line.split("|", 2)
|
||||||
if len(parts) != 3:
|
if len(parts) != 3:
|
||||||
continue
|
continue
|
||||||
@ -306,23 +316,31 @@ def _git_name_only(cwd, base, include_untracked=False):
|
|||||||
must distinguish None (error → don't trust as a filter) from set()
|
must distinguish None (error → don't trust as a filter) from set()
|
||||||
(genuinely nothing changed). `-c core.quotePath=false -z` keeps non-ASCII
|
(genuinely nothing changed). `-c core.quotePath=false -z` keeps non-ASCII
|
||||||
and space-containing paths intact."""
|
and space-containing paths intact."""
|
||||||
|
# Decode stdout/stderr as UTF-8 with errors="replace" instead of using
|
||||||
|
# text=True. core.quotePath=false makes git emit raw UTF-8 for non-ASCII
|
||||||
|
# paths, and text=True on Windows decodes via cp1252 strict — a non-ASCII
|
||||||
|
# changed path would crash the subprocess reader thread, leave
|
||||||
|
# result.stdout=None, and propagate AttributeError out of the helper.
|
||||||
|
# Same fix shape as diffstate._list_untracked. See #2056.
|
||||||
def _run(env):
|
def _run(env):
|
||||||
result = subprocess.run(
|
result = subprocess.run(
|
||||||
[*GIT_CMD, "-c", "core.quotePath=false", "diff", "--name-only", "-z", base],
|
[*GIT_CMD, "-c", "core.quotePath=false", "diff", "--name-only", "-z", base],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=30,
|
cwd=cwd, capture_output=True, timeout=30,
|
||||||
env=env,
|
env=env,
|
||||||
)
|
)
|
||||||
if result.returncode != 0:
|
if result.returncode != 0:
|
||||||
debug_log(f"_git_name_only({base!r}) rc={result.returncode}: {result.stderr[:200]}")
|
stderr_str = (result.stderr or b"").decode("utf-8", errors="replace")
|
||||||
|
debug_log(f"_git_name_only({base!r}) rc={result.returncode}: {stderr_str[:200]}")
|
||||||
return None
|
return None
|
||||||
return {p for p in result.stdout.split("\0") if p}
|
stdout = (result.stdout or b"").decode("utf-8", errors="replace")
|
||||||
|
return {p for p in stdout.split("\0") if p}
|
||||||
|
|
||||||
try:
|
try:
|
||||||
if not include_untracked:
|
if not include_untracked:
|
||||||
return _run(None)
|
return _run(None)
|
||||||
with _temp_index(cwd) as env:
|
with _temp_index(cwd) as env:
|
||||||
return _run(env)
|
return _run(env)
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError) as e:
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError) as e:
|
||||||
debug_log(f"_git_name_only({base!r}) error: {e}")
|
debug_log(f"_git_name_only({base!r}) error: {e}")
|
||||||
return None
|
return None
|
||||||
|
|
||||||
@ -339,17 +357,22 @@ def _git_status_porcelain(cwd):
|
|||||||
collapses to `dir/`). Required so the untracked set subtracts cleanly
|
collapses to `dir/`). Required so the untracked set subtracts cleanly
|
||||||
against the UPS-time `_list_untracked` snapshot, which uses ls-files and
|
against the UPS-time `_list_untracked` snapshot, which uses ls-files and
|
||||||
therefore always lists individual files."""
|
therefore always lists individual files."""
|
||||||
|
# Lenient decode: same UTF-8 + errors="replace" pattern as the
|
||||||
|
# sibling helpers — a non-ASCII path in the worktree would otherwise
|
||||||
|
# crash the cp1252 reader thread on Windows. See #2056.
|
||||||
try:
|
try:
|
||||||
r = subprocess.run(
|
r = subprocess.run(
|
||||||
[*GIT_CMD, "-c", "core.quotePath=false", "status",
|
[*GIT_CMD, "-c", "core.quotePath=false", "status",
|
||||||
"--porcelain=v1", "-uall", "-z"],
|
"--porcelain=v1", "-uall", "-z"],
|
||||||
cwd=cwd, capture_output=True, text=True, timeout=30,
|
cwd=cwd, capture_output=True, timeout=30,
|
||||||
)
|
)
|
||||||
if r.returncode != 0:
|
if r.returncode != 0:
|
||||||
debug_log(f"_git_status_porcelain rc={r.returncode}: {r.stderr[:200]}")
|
stderr_str = (r.stderr or b"").decode("utf-8", errors="replace")
|
||||||
|
debug_log(f"_git_status_porcelain rc={r.returncode}: {stderr_str[:200]}")
|
||||||
return None, None
|
return None, None
|
||||||
tracked, untracked = set(), set()
|
tracked, untracked = set(), set()
|
||||||
entries = r.stdout.split("\0")
|
stdout = (r.stdout or b"").decode("utf-8", errors="replace")
|
||||||
|
entries = stdout.split("\0")
|
||||||
i = 0
|
i = 0
|
||||||
while i < len(entries):
|
while i < len(entries):
|
||||||
e = entries[i]
|
e = entries[i]
|
||||||
@ -368,7 +391,9 @@ def _git_status_porcelain(cwd):
|
|||||||
i += 1
|
i += 1
|
||||||
i += 1
|
i += 1
|
||||||
return tracked, untracked
|
return tracked, untracked
|
||||||
except (subprocess.TimeoutExpired, FileNotFoundError, OSError) as e:
|
except (subprocess.TimeoutExpired, FileNotFoundError, OSError, ValueError) as e:
|
||||||
|
# ValueError guards against any future strict-decode regression
|
||||||
|
# so the helper degrades to (None, None) instead of crashing.
|
||||||
debug_log(f"_git_status_porcelain error: {e}")
|
debug_log(f"_git_status_porcelain error: {e}")
|
||||||
return None, None
|
return None, None
|
||||||
|
|
||||||
|
|||||||
Loading…
x
Reference in New Issue
Block a user