The real dominant Linux failure, identified by a CCR Linux repro.
A CCR container reproduced the production signature — non-zero exit +
EMPTY stdout + EMPTY stderr (~60k fires/day, 4,485 Linux users on 2.0.4):
running `python -m venv` under a tight memory limit (ulimit -v) kills the
memory-heavy venv+ensurepip/pip subprocess with SIGSEGV (-11, RLIMIT_AS)
or SIGKILL (-9, kernel OOM-killer) BEFORE it writes anything. This is
NOT the ensurepip/packaging case (that always writes to stderr, code 11)
and NOT fixable by --target (a --target pip install is also memory-heavy
and gets killed too). Three earlier hypotheses (stdout, packaging,
Option A fixes Linux) were wrong — the repro corrected them.
Changes:
- Detect the signal kill in the venv/pip and --target paths: rc<0, or the
shell-style 128+sig codes 134 (SIGABRT), 137 (SIGKILL), 139 (SIGSEGV).
These map to err_kind "signal_killed:<rc>" (new code 16). The return code
is reported in a new sdk_bootstrap_rc metric so production telemetry can
confirm which signal dominates (-9 OOM-killer vs -11 RLIMIT_AS).
- Cooldown: on a signal kill, write a marker; subsequent sessions return
the new SKIP_COOLDOWN outcome (9) for 24h. This stops the retry storm
(every session was re-attempting a build that just gets re-killed,
burning the user's memory/CPU). One retry is allowed per window, so a
machine that frees memory still recovers.
- --no-cache-dir on both pip installs (venv + --target) trims pip's peak
memory, which may bring marginal machines under the OOM threshold.
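The detection rule in the first bullet can be sketched as follows. This is a minimal illustration, not the shipped code: the function names are hypothetical, and only the rc<0 / 134 / 137 / 139 rule and the "signal_killed:<rc>" format come from this change.

```python
import signal

# Shell-style exit codes (128 + signal number) named in the change list:
# 134 = SIGABRT, 137 = SIGKILL, 139 = SIGSEGV.
SHELL_SIGNAL_CODES = frozenset({134, 137, 139})


def is_signal_kill(rc):
    """Return True when a return code says the child died from a signal."""
    return rc < 0 or rc in SHELL_SIGNAL_CODES


def signal_number(rc):
    """Decode either convention (-N or 128+N) to a signal number, else None."""
    if rc < 0:
        return -rc
    if rc in SHELL_SIGNAL_CODES:
        return rc - 128
    return None


def signal_name(rc):
    """Best-effort symbolic name, for reading the rc metric; None if unknown."""
    num = signal_number(rc)
    if num is None:
        return None
    try:
        return signal.Signals(num).name
    except ValueError:
        return None


def err_kind_for(rc):
    """Build the err_kind string for a signal kill, or None for other failures."""
    return f"signal_killed:{rc}" if is_signal_kill(rc) else None
```

An ordinary failure such as rc=1 returns None here, so it would continue through the existing categorizer unchanged.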
No happy-path change: signal detection is at the top of the existing
failure handler; cooldown is checked only after all no-op probes
(NOOP_SYSTEM/VENV/TARGET short-circuit first).
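The cooldown lifecycle (none, write, expire) could look roughly like this. It is a sketch under assumptions: the marker is assumed to be a file holding a Unix timestamp, and the function names and the injectable `now` parameter are hypothetical; only the 24h window is taken from this change.

```python
import time

COOLDOWN_SECONDS = 24 * 60 * 60  # the 24h window from the change list


def write_cooldown(marker_path, now=None):
    """Record the time of the signal kill (assumed format: a float timestamp)."""
    now = time.time() if now is None else now
    with open(marker_path, "w", encoding="utf-8") as fh:
        fh.write(str(now))


def in_cooldown(marker_path, now=None):
    """Return True while the marker is younger than the cooldown window."""
    now = time.time() if now is None else now
    try:
        with open(marker_path, encoding="utf-8") as fh:
            written_at = float(fh.read().strip())
    except (OSError, ValueError):
        # Missing or unreadable marker: never block a build attempt.
        return False
    return 0 <= now - written_at < COOLDOWN_SECONDS
```

Treating an unreadable marker as "no cooldown" keeps the failure mode on the side of retrying, which matches the goal that a recovered machine is not locked out.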
Verified locally on macOS Python 3.13:
- py_compile clean.
- 35 new tests (test_signal_kill_cooldown.py): _is_signal_kill across
signals/exit-codes, rc decode, signal_killed→code 16, cooldown
lifecycle (none→write→expire), and an integration flow — simulated
SIGKILL'd venv → BUILD_FAILED/signal_killed:-9 + cooldown written →
2nd run SKIP_COOLDOWN without re-attempting → retry after window;
non-signal failure does NOT cool down; --no-cache-dir present on both
pip paths; sdk_bootstrap_rc emitted conditionally.
- End-to-end harness: the full kill→categorize→cooldown→skip→retry
chain confirmed in-process.
The original CCR repro (ulimit -v ≤7000 KB → rc=-11, empty streams) is
the ground truth this fix is built on. It can be re-validated on CCR with
the same ulimit approach.
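For re-validation from Python rather than a shell, a rough equivalent of `ulimit -v` is to set RLIMIT_AS in the child before exec. This is a sketch only: the helper name is hypothetical, it is Unix-only, and the exact return code observed depends on the platform and interpreter build.

```python
import resource
import subprocess


def run_under_as_limit(cmd, limit_kb):
    """Run cmd with its address space capped at limit_kb (like `ulimit -v`).

    Returns (returncode, stdout, stderr). A negative returncode means the
    child was terminated by that signal number.
    """
    limit_bytes = limit_kb * 1024

    def _apply_limit():
        # Runs in the child between fork and exec, so only the child is capped.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    proc = subprocess.run(cmd, capture_output=True, preexec_fn=_apply_limit)
    return proc.returncode, proc.stdout, proc.stderr
```

Calling it with `[sys.executable, "-m", "venv", some_dir]` and a limit of 7000 would be the direct analogue of the repro above; whether it yields the same rc=-11 with empty streams should be checked on the CCR container rather than assumed.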
Version 2.0.5 -> 2.0.6 per the per-PR-bump policy (#2114).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Option A, the data-gated fix for venv_ensurepip_fail (#2154 follow-up).
v2.0.4 telemetry made the call: of the venv_ensurepip_fail cohort, ~95%
HAVE pip (sdk_has_pip=true) and run Python 3.11–3.14 — so it's not the
Apple-3.9 problem; it's modern interpreters where `python -m venv` can't
bootstrap pip (Debian python3-venv absent, or python.org/pyenv builds
without ensurepip) but pip itself works. `pip install --target` needs only
pip, so it recovers the agentic reviewer for them instead of degrading to
pattern + single-shot review.
Producer (ensure_agent_sdk.py):
- New outcomes BUILT_TARGET=7, NOOP_TARGET=8; new phase pip_target=5.
- _build_via_target(): `pip install --target <state>/agent-sdk-libs
--upgrade --prefer-binary claude-agent-sdk`. Failures categorized via
_pip_err_from_stderr (sibling of main()'s pip chain — kept separate to
avoid disturbing the working venv categorizer); errno embedded for
OSError-family exceptions.
- _target_sdk_importable(): probes a prior target install → NOOP_TARGET.
The dir-check short-circuits before any subprocess, and the probe is only
reached when there's no working venv, so the 81% NOOP_VENV cohort never
pays for it.
- main() falls through to the target build ONLY on venv_ensurepip_fail;
every other venv/pip failure stays terminal BUILD_FAILED. The sentinel
is released before the target build so a retry isn't seen as SKIP_SENTINEL.
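The producer-side pieces above can be sketched as three small helpers. All names here are hypothetical, and invoking pip as `<python> -m pip` is an assumption; the flags, the package name, the claude_agent_sdk directory and the "only on venv_ensurepip_fail" rule come from this change.

```python
import os
import sys

SDK_PACKAGE = "claude-agent-sdk"


def target_install_cmd(target_dir, python=sys.executable):
    """Build the pip --target command described in the producer notes."""
    return [
        python, "-m", "pip", "install",
        "--target", target_dir,
        "--upgrade", "--prefer-binary",
        SDK_PACKAGE,
    ]


def should_fall_back_to_target(err_kind):
    """Only the ensurepip failure falls through; everything else is terminal."""
    return err_kind == "venv_ensurepip_fail"


def target_dir_has_sdk(target_dir):
    """Cheap directory check that runs before any import subprocess."""
    return os.path.isdir(os.path.join(target_dir, "claude_agent_sdk"))
```

Keeping the fallback predicate this narrow is what leaves every other venv/pip failure as a terminal BUILD_FAILED.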
Consumer (llm.py):
- _inject_agent_sdk_venv_into_syspath() adds the flat agent-sdk-libs dir
(packages sit directly in it, not under site-packages). The existing
pywin32 .pth bootstrap applies (target installs don't run .pth either).
No change to the happy path — the new branch is taken only on the
ensurepip failure, and the extra candidate dir is a no-op when absent.
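The consumer-side behaviour (add the flat dir, no-op when absent) can be illustrated like this. The helper name and the choice to append rather than prepend are assumptions; this is not the body of _inject_agent_sdk_venv_into_syspath().

```python
import os
import sys


def add_flat_libs_dir(libs_dir, path=None):
    """Add a flat --target directory to the import path if it exists.

    Packages sit directly inside libs_dir, so the directory itself is added
    rather than a site-packages subdirectory. Returns True if the path list
    was modified.
    """
    path = sys.path if path is None else path
    if not os.path.isdir(libs_dir):
        return False  # absent dir: nothing to do, happy path untouched
    if libs_dir in path:
        return False  # already present: stay idempotent across calls
    path.append(libs_dir)
    return True
```

Passing an explicit list as `path` keeps the helper testable without mutating the real `sys.path`.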
Verified locally on macOS Python 3.13:
- py_compile clean.
- 30 new tests (test_venv_target_fallback.py): outcome/phase codes
(append-only, 4 stays retired), _pip_err_from_stderr categories,
_build_via_target success/CalledProcessError/timeout/exc+errno (mocked
subprocess), _target_sdk_importable dir-short-circuit, main() wiring
(ensurepip→target fallthrough + NOOP_TARGET probe + sentinel release),
consumer adds the flat dir. Full suite 533/533 pass + 2 skipped.
- END-TO-END harness (real install, simulated ensurepip failure):
main() → BUILT_TARGET, target dir has claude_agent_sdk; 2nd run →
NOOP_TARGET; consumer _inject → `import claude_agent_sdk` resolves
FROM the --target dir. Full chain proven without needing a
broken-ensurepip box.
- Real `pip install --target` + import confirmed independently (exit 0,
SDK imports from the flat layout).
NOT validated in tmux: the ensurepip failure can't be reproduced on macOS
(working ensurepip), so the fallback was proven via the real-install
harness above instead. The happy path (NOOP_VENV / normal agentic review)
is unchanged and covered by the existing hook-smoke suite.
Version 2.0.4 -> 2.0.5 per the per-PR-bump policy (#2114).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>