Mohamed Hegazy 4e56d19dd8
security-guidance: handle signal-killed venv builds (memory) + cooldown (2.0.5 → 2.0.6)
The real dominant Linux failure, identified by a CCR Linux repro.

A CCR container reproduced the production signature — non-zero exit +
EMPTY stdout + EMPTY stderr (~60k fires/day, 4,485 Linux users on 2.0.4):
running `python -m venv` under a tight memory limit (ulimit -v) kills the
memory-heavy venv+ensurepip/pip subprocess with SIGSEGV (-11, RLIMIT_AS)
or SIGKILL (-9, kernel OOM-killer) BEFORE it writes anything. This is
NOT the ensurepip/packaging case (that always writes to stderr, code 11)
and NOT fixable by --target (a --target pip install is also memory-heavy
and gets killed too). Three earlier hypotheses (stdout, packaging,
Option A fixes Linux) were wrong — the repro corrected them.

Changes:
  - Detect the signal kill (rc<0, or 128+sig: 134/137/139) in the venv/pip
    and --target paths → err_kind "signal_killed:<rc>" (new code 16). The
    returncode rides in a new sdk_bootstrap_rc metric so prod confirms
    which signal dominates (-9 OOM-killer vs -11 RLIMIT_AS).
  - Cooldown: on a signal kill, write a marker and return the new
    SKIP_COOLDOWN outcome (9) on subsequent sessions for 24h — stops the
    retry storm (every session was re-attempting a build that just gets
    re-killed, burning the user's memory/CPU). Retries once per window so a
    machine that frees memory still recovers.
  - --no-cache-dir on both pip installs (venv + --target) trims pip's peak
    memory; may get marginal machines under the OOM threshold.

No happy-path change: signal detection is at the top of the existing
failure handler; cooldown is checked only after all no-op probes
(NOOP_SYSTEM/VENV/TARGET short-circuit first).

Verified locally on macOS Python 3.13:
  - py_compile clean.
  - 35 new tests (test_signal_kill_cooldown.py): _is_signal_kill across
    signals/exit-codes, rc decode, signal_killed→code 16, cooldown
    lifecycle (none→write→expire), and an integration flow — simulated
    SIGKILL'd venv → BUILD_FAILED/signal_killed:-9 + cooldown written →
    2nd run SKIP_COOLDOWN without re-attempting → retry after window;
    non-signal failure does NOT cool down; --no-cache-dir present on both
    pip paths; sdk_bootstrap_rc emitted conditionally.
  - End-to-end harness: the full kill→categorize→cooldown→skip→retry
    chain confirmed in-process.

The original CCR repro (ulimit -v ≤7000 KB → rc=-11, empty streams) is
the ground truth this fix is built on. Can be re-validated on CCR with the
same ulimit approach.

Version 2.0.5 -> 2.0.6 per the per-PR-bump policy (#2114).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-06-12 00:53:02 -07:00
..