mirror of https://github.com/anthropics/claude-plugins-official.git synced 2026-05-11 14:05:52 -03:00

math-olympiad: add LICENSE, marketplace entry, and prettier formatting

- Add Apache 2.0 LICENSE file
- Register plugin in marketplace.json
- Run prettier (prose-wrap=always, 80 cols) over all plugin markdown
- Simplify model tier naming in reference docs

🏠 Remote-Dev: homespace

2026-03-30 19:22:03 +00:00

2.9 KiB

Model Tier Defaults

Parameters scale with model capability. Budget is not the constraint — the constraints are diminishing returns (more voters stop helping past a point) and the asymmetric noise floor (Haiku verifiers are individually less reliable, so the right response is width not depth).

Haiku

Width compensates for per-sample noise. Scaffolding is where the leverage is.

Parallel solvers: 12 (wide fan — each individual solve is weaker, so cast a wider net)
Vote budget: 7 verifiers, need 5-confirm / 3-refute (pigeonhole exit: stop when outcome decided)
Abstain threshold: 3 consecutive revise cycles fail
Pattern sweep: all 12 patterns — Haiku can follow a checklist, the patterns are the scaffold
Presentation pass: yes, 3 drafts, comparator picks cleanest. Haiku's raw output is rougher, so this matters MORE not less.
Rationale: The skill's value is highest where the base model is weakest. Give Haiku the full harness. The 3-refute threshold (higher than Sonnet's 2) accounts for Haiku verifiers being individually noisier — don't let 2 confused Haikus kill a correct proof.

Sonnet

Balanced.

Parallel solvers: 6
Vote budget: 5 verifiers, need 4-confirm / 2-refute
Abstain threshold: 3 consecutive revise cycles fail
Pattern sweep: all 12
Presentation pass: 2 drafts, comparator picks cleaner
Rationale: 4-of-5 tolerates one flake. 2 dissents is signal.

Opus

Depth. Each sample is strong, so invest in making the adversarial pass harder.

Parallel solvers: 4
Vote budget: 5 general verifiers (4-confirm / 2-refute) PLUS one dedicated verifier per pattern in verifier_patterns.md (12 targeted attacks). Any pattern-specific HOLE FOUND counts toward refute.
Abstain threshold: 5 consecutive revise cycles fail (trust the model's ability to eventually fix)
Pattern sweep: all 12, each with its own dedicated agent
Presentation pass: 3 drafts with different instructions ("most elegant," "most elementary," "shortest"), comparator picks the best. Strong models can genuinely produce different styles of proof.
Rationale: Opus can execute the deep patterns (#19 base-vs-derived, #22 mean-first) that need real mathematical judgment. The 12 dedicated pattern passes are where the model's capability is best spent — it's the difference between "be skeptical" and "check THIS specific thing."
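
The three tiers above can be collected into one lookup table. This is a minimal sketch, not the plugin's actual configuration format: the `TierDefaults` class, its field names, and the `TIER_DEFAULTS` mapping are hypothetical names introduced for illustration. Only the numbers come from the tier descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierDefaults:
    """Per-tier parameters (hypothetical field names, values from the text)."""
    parallel_solvers: int
    verifiers: int                 # general vote budget
    confirm_needed: int
    refute_needed: int
    abstain_after_failed_revisions: int
    presentation_drafts: int
    dedicated_pattern_verifiers: int  # 0 when patterns share the general sweep


TIER_DEFAULTS = {
    "haiku": TierDefaults(
        parallel_solvers=12, verifiers=7, confirm_needed=5, refute_needed=3,
        abstain_after_failed_revisions=3, presentation_drafts=3,
        dedicated_pattern_verifiers=0,
    ),
    "sonnet": TierDefaults(
        parallel_solvers=6, verifiers=5, confirm_needed=4, refute_needed=2,
        abstain_after_failed_revisions=3, presentation_drafts=2,
        dedicated_pattern_verifiers=0,
    ),
    "opus": TierDefaults(
        parallel_solvers=4, verifiers=5, confirm_needed=4, refute_needed=2,
        abstain_after_failed_revisions=5, presentation_drafts=3,
        dedicated_pattern_verifiers=12,
    ),
}


def budget_forces_decision(tier: TierDefaults) -> bool:
    """True when the general vote budget equals confirm + refute - 1,
    so a full round of votes always reaches one of the two thresholds."""
    return tier.verifiers == tier.confirm_needed + tier.refute_needed - 1
```

In this sketch `budget_forces_decision` returns `True` for all three tiers, since 7 = 5 + 3 - 1 and 5 = 4 + 2 - 1.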

On the pigeonhole exit

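Kept at all tiers, not because of cost, but because of how the vote budgets are sized. Each general budget equals confirm_needed + refute_needed - 1 (7 = 5 + 3 - 1 for Haiku, 5 = 4 + 2 - 1 for Sonnet and Opus), so one of the two thresholds is always reached by the time the last vote lands. Once either threshold is reached, the remaining votes carry no information regardless of how they land. Launching or waiting on them anyway is pure latency.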
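
One way to picture the exit rule is a tally that stops as soon as either threshold is hit. This is a sketch under assumptions: votes are modeled as a sequence of booleans arriving one at a time (real verifiers may run in parallel), and the function names are hypothetical rather than taken from the plugin.

```python
from typing import Iterable, Optional, Tuple


def vote_outcome(confirms: int, refutes: int,
                 confirm_needed: int, refute_needed: int) -> Optional[str]:
    """Return the decided outcome, or None while the vote is still open."""
    if confirms >= confirm_needed:
        return "confirmed"
    if refutes >= refute_needed:
        return "refuted"
    return None


def tally_with_early_exit(votes: Iterable[bool], confirm_needed: int,
                          refute_needed: int) -> Tuple[Optional[str], int]:
    """Consume votes (True = confirm, False = refute) until the outcome
    is decided. Returns (outcome, number_of_votes_consumed)."""
    confirms = refutes = used = 0
    for vote in votes:
        used += 1
        if vote:
            confirms += 1
        else:
            refutes += 1
        outcome = vote_outcome(confirms, refutes, confirm_needed, refute_needed)
        if outcome is not None:
            return outcome, used  # later votes cannot change the result
    return None, used  # ran out of votes before either threshold
```

With the Haiku thresholds (5-confirm / 3-refute), three refutes in a row end the round after three votes, and any sequence of 7 votes is guaranteed to produce a decision.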

Identifying the tier

If the orchestrating session doesn't know which model it is, default to Sonnet configuration. A reasonable heuristic: ask the model to self-identify in its first response and match against haiku/sonnet/opus in the output.
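
The matching step of that heuristic might look like the sketch below. It assumes the self-identification text has already been obtained from the model. The function name is hypothetical, and falling back to Sonnet when more than one tier name appears is an added assumption, not something the text specifies.

```python
from typing import Optional

TIER_NAMES = ("haiku", "sonnet", "opus")


def identify_tier(self_description: Optional[str], default: str = "sonnet") -> str:
    """Match a model's self-description against the known tier names.

    Falls back to `default` when no tier name appears, or (assumption)
    when several appear and the answer is ambiguous.
    """
    text = (self_description or "").lower()
    matches = [name for name in TIER_NAMES if name in text]
    if len(matches) == 1:
        return matches[0]
    return default
```

For example, `identify_tier("I am Claude Opus.")` returns `"opus"`, while an empty or uninformative reply yields the Sonnet default.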
