mirror of
https://github.com/anthropics/claude-plugins-official.git
synced 2026-06-13 14:15:56 -03:00
The real dominant Linux failure, identified by a CCR Linux repro.
A CCR container reproduced the production signature — non-zero exit +
EMPTY stdout + EMPTY stderr (~60k fires/day, 4,485 Linux users on 2.0.4):
running `python -m venv` under a tight memory limit (ulimit -v) kills the
memory-heavy venv+ensurepip/pip subprocess with SIGSEGV (-11, RLIMIT_AS)
or SIGKILL (-9, kernel OOM-killer) BEFORE it writes anything. This is
NOT the ensurepip/packaging case (that always writes to stderr, code 11)
and NOT fixable by --target (a --target pip install is also memory-heavy
and gets killed too). Three earlier hypotheses (stdout, packaging,
Option A fixes Linux) were wrong — the repro corrected them.
Changes:
- Detect the signal kill (rc<0, or 128+sig: 134/137/139) in the venv/pip
and --target paths → err_kind "signal_killed:<rc>" (new code 16). The
return code is reported in a new sdk_bootstrap_rc metric so production
data can confirm which signal dominates (-9 OOM-killer vs -11
RLIMIT_AS).
- Cooldown: on a signal kill, write a marker and return the new
SKIP_COOLDOWN outcome (9) on subsequent sessions for 24h. This stops the
retry storm, in which every session re-attempted a build that was only
killed again, burning the user's memory and CPU. The build is retried
once per window so a machine that frees memory still recovers.
- --no-cache-dir on both pip installs (venv + --target) trims pip's peak
memory; may get marginal machines under the OOM threshold.
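For reviewers, a minimal sketch of the signal-kill classification
described in the first bullet. This is not the code in this change: the
helper names below are hypothetical and only the rc conventions (rc<0,
or 128+sig for 134/137/139) and the "signal_killed:<rc>" string come
from this message.

```python
from typing import Optional

# Shell-style exit codes for death by signal: 128 + signal number.
# 134 = SIGABRT (6), 137 = SIGKILL (9), 139 = SIGSEGV (11).
SHELL_SIGNAL_CODES = {134, 137, 139}


def is_signal_kill(rc: int) -> bool:
    """Return True if rc looks like the process was killed by a signal.

    subprocess reports a signal death as a negative returncode (-signum);
    a shell wrapper reports it as 128 + signum.
    """
    return rc < 0 or rc in SHELL_SIGNAL_CODES


def decode_signal(rc: int) -> Optional[int]:
    """Map a returncode to the signal number, or None if not a signal kill."""
    if rc < 0:
        return -rc
    if rc in SHELL_SIGNAL_CODES:
        return rc - 128
    return None


def classify(rc: int) -> Optional[str]:
    """Build the err_kind string for a signal kill, else None."""
    return f"signal_killed:{rc}" if is_signal_kill(rc) else None
```

Under these assumptions classify(-9) gives "signal_killed:-9" and
classify(11) gives None, so the ensurepip/packaging failure (code 11,
always with stderr output) is not mistaken for a signal kill.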
No happy-path change: signal detection is at the top of the existing
failure handler; cooldown is checked only after all no-op probes
(NOOP_SYSTEM/VENV/TARGET short-circuit first).
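A rough sketch of the cooldown marker lifecycle (none, write, expire),
assuming a timestamp file. The marker format, function names, and the
caller-supplied path are illustrative assumptions; only the 24h window
is taken from this message. A caller would consult in_cooldown() only
after the no-op probes have had their chance to short-circuit.

```python
import time
from pathlib import Path
from typing import Optional

COOLDOWN_SECONDS = 24 * 60 * 60  # 24h window


def write_cooldown(marker: Path, rc: int, now: Optional[float] = None) -> None:
    """Record when the build was signal-killed, plus the rc for debugging."""
    marker.parent.mkdir(parents=True, exist_ok=True)
    ts = time.time() if now is None else now
    marker.write_text(f"{ts}\n{rc}\n")


def in_cooldown(marker: Path, now: Optional[float] = None) -> bool:
    """Return True while the marker is younger than the cooldown window."""
    try:
        written_at = float(marker.read_text().splitlines()[0])
    except (OSError, ValueError, IndexError):
        # Missing or corrupt marker: never block the build on it.
        return False
    current = time.time() if now is None else now
    return 0 <= current - written_at < COOLDOWN_SECONDS
```

Treating an unreadable marker as "no cooldown" is a deliberate choice in
this sketch: the worst case is one extra build attempt, never a machine
that is locked out of recovery.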
Verified locally on macOS Python 3.13:
- py_compile clean.
- 35 new tests (test_signal_kill_cooldown.py): _is_signal_kill across
signals/exit-codes, rc decode, signal_killed→code 16, cooldown
lifecycle (none→write→expire), and an integration flow — simulated
SIGKILL'd venv → BUILD_FAILED/signal_killed:-9 + cooldown written →
2nd run SKIP_COOLDOWN without re-attempting → retry after window;
non-signal failure does NOT cool down; --no-cache-dir present on both
pip paths; sdk_bootstrap_rc emitted conditionally.
- End-to-end harness: the full kill→categorize→cooldown→skip→retry
chain confirmed in-process.
The original CCR repro (ulimit -v ≤7000 KB → rc=-11, empty streams) is
the ground truth this fix is built on. It can be re-validated on CCR with
the same ulimit approach.
Version 2.0.5 -> 2.0.6 per the per-PR-bump policy (#2114).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>