Training stack · v0.6 + v0.7 results

CHINI-train

Teaching small models to design systems that survive the bench. Simulator as reward.

TL;DR

Two RL runs on top of a Qwen2.5 SFT base, scored against the 20-problem CHINI-bench held-out split. Both moved mean composite score; neither cracked the pass@1 ceiling.

  • v0.6 RL (1.5B, flat reward): mean 78.05 (+0.6 over base 77.43), pass@1 10%.
  • v0.7 RL (7B, four-term shaped reward, 200 steps): mean 82.75 (+8.7 over base 74.05), pass@1 still 10%.
  • Per-tier on v0.7: dp1 = 100%, dp2 = 33%, dp3 to dp6 = 0%. Mean scores on dp3-dp6 sit at 78-80, ~5-7 points short of pass.

Reward shaping fixes canvas quality. It can't manufacture latent capability the warm-start doesn't already have. The remaining lever is data: a tier-weighted curriculum that puts gradient signal directly on the dp3+ cliff. v0.7c is queued.

What CHINI-train is

CHINI-train is an open RL training pipeline for graph-valued outputs. The training target is a CanvasState JSON document, the kind a Chinilla user would build by hand: nodes, edges, behaviors, parameters. The reward signal is the CHINI-bench simulator, called over a local HTTP endpoint, returning a deterministic 0 to 100 score and a breakdown across stability, delivery, cost, constraints, and design quality.
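
A minimal sketch of that loop, with a hypothetical canvas, endpoint URL, and field names standing in for the real CanvasState schema and simulator API:

```python
import requests

# Hypothetical CanvasState: field names are illustrative, not the repo's schema.
canvas = {
    "nodes": [
        {"id": "lb", "type": "load_balancer", "params": {"algorithm": "round_robin"}},
        {"id": "api", "type": "service", "params": {"replicas": 3}},
        {"id": "db", "type": "database", "params": {"engine": "postgres"}},
    ],
    "edges": [{"from": "lb", "to": "api"}, {"from": "api", "to": "db"}],
    "behaviors": [{"node": "api", "on": "overload", "action": "autoscale"}],
}

# Assumed local simulator endpoint and payload shape.
resp = requests.post(
    "http://localhost:8000/simulate",
    json={"canvas": canvas, "problem_id": "dp3-017"},
)
result = resp.json()
# e.g. {"composite": 81.5, "breakdown": {"stability": ..., "delivery": ...,
#       "cost": ..., "constraints": ..., "design_quality": ...}}
score = result["composite"]  # deterministic 0-100 composite used as the reward signal
```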

The stack is small on purpose. ~2,000 lines of Python. SFT followed by GRPO (Group Relative Policy Optimization). Trained on a single A10G via Modal for under $15 per full run. The base model is Qwen2.5-1.5B. No frontier-scale training, no closed reward model, no human preference data.
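
The GRPO core is small: sample K canvases per problem, score each with the simulator, and weight the policy gradient by each rollout's advantage relative to its own group. A minimal sketch of that advantage computation (illustrative, not the repo's code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages: center and scale each rollout's reward by its group's stats.

    rewards: shape (K,) -- simulator rewards for K canvases sampled for the same problem.
    """
    baseline = rewards.mean()
    scale = rewards.std() + eps
    return (rewards - baseline) / scale

# Example: a K=4 group where one canvas clearly beat its siblings.
adv = group_relative_advantages(np.array([0.78, 0.80, 0.79, 0.86]))
```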

The reason the parameter count is small is the same reason the simulator is real: we are testing a recipe, not chasing a leaderboard. If a 1.5B model can learn to design systems that survive a deterministic simulator, that recipe transfers to anything else with executable ground truth: code that compiles, tool-use DAGs that return 2xx, queries that return correct rows. System design is the testbed because we already built the simulator.

v0.6 results in detail

Held-out evaluation on 20 problems the model never trained on. All numbers are mean composite score, scale 0 to 100.

| Configuration | Score | vs base greedy |
|---|---|---|
| Base, greedy | 77.43 | baseline |
| Base, best-of-8 | 79.80 | +2.37 |
| RL, greedy | 78.05 | +0.62 |
| RL, best-of-8 | 81.00 | +3.57 |
| RL, K=1 + structural repair | 82.00 | +4.57 |
| RL, oracle ceiling (best of 8) | 86.05 | +8.62 |

Pass@1 (score ≥ 85): 10% for both base and RL. Bench score moved; capability ceiling did not.
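
For reference, the three numbers reported throughout, computed from a per-problem score matrix (a sketch; the greedy-in-column-0 convention is an assumption, the 85 threshold is the one stated above):

```python
import numpy as np

def bench_metrics(scores_k: np.ndarray, pass_threshold: float = 85.0) -> dict:
    """scores_k: shape (num_problems, K) -- composite scores for K samples per held-out problem."""
    greedy = scores_k[:, 0]                                   # assumed: column 0 is the greedy decode
    return {
        "mean_greedy": greedy.mean(),                          # mean composite, greedy decoding
        "mean_best_of_k": scores_k.max(axis=1).mean(),         # best-of-K ceiling
        "pass_at_1": (greedy >= pass_threshold).mean(),        # fraction of problems passed at K=1
    }
```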

What RL actually learned

The bench numbers above tell one story. The distributional analysis tells another, and it is the more important one.

The RL-trained model produces canvases that are systematically smaller than the base model. Mean component count dropped from 8.6 to 6.6. Mean cost utilization dropped from 96% of budget to 77%. Mean capacity utilization from 83% to 65%. The policy learned to stay safely under every constraint by underbuilding.

It also collapsed in distribution. The base model produced 7.0 distinct canvas topologies per K=8 sample. The RL model produced 3.65. Per-problem standard deviation of component count fell from 1.84 (base) to 0.16 (RL). Effectively deterministic given a problem.

Both pathologies trace to the same root cause: the bench's cost subscore is asymmetric. Going over budget hurts. Coming in under budget caps at full credit the moment you are below the line. The reward function was the raw bench composite divided by 100, with no shaping. Once the policy figured out that minimal canvases reliably score above the median, gradient descent did the rest.
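
A toy illustration of that asymmetry, not the bench's actual scoring code: anything at or under budget earns full cost credit, so a flat composite reward makes the smallest canvas that clears the other subscores the safest bet.

```python
def toy_cost_subscore(spend: float, budget: float) -> float:
    """Toy asymmetric cost term: full credit anywhere under budget, penalized over it."""
    if spend <= budget:
        return 100.0                              # 77% utilization scores the same as 96%
    overrun = (spend - budget) / budget
    return max(0.0, 100.0 - 200.0 * overrun)      # arbitrary illustrative penalty slope

# Under a flat reward (composite / 100), trimming components never costs anything on
# this term but removes over-budget risk -- the gradient points toward underbuilding.
```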

This is a clean, replicable, mechanistically-explained failure of a flat reward. It is also fixable, which is what v0.7 is about.

v0.7 results

v0.7 changed the reward function and the base model. Same problem set, same held-out split. Four additive shaping terms (utilization, size, coverage, diversity), bounded so the bench composite stayed the dominant signal. Bumped the warm-start to a 7B SFT adapter (Qwen2.5-7B-Instruct + fmtA). Two runs on Modal A10G.
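
A sketch of the reward shape under those constraints, using the alpha and per-term bound quoted in the mechanism finding below; term names and details are illustrative, the real specification is in the repo:

```python
def shaped_reward(composite: float, util_term: float, size_term: float,
                  coverage_term: float, alpha: float = 0.10) -> float:
    """v0.7-style shaped reward (illustrative). Each shaping term is clipped to
    [-0.05, +0.05] so the bench composite stays the dominant signal."""
    bench = composite / 100.0
    shaping = sum(max(-0.05, min(0.05, t)) for t in (util_term, size_term, coverage_term))
    return bench + alpha * shaping

# The fourth term (diversity) is not added here: it enters at the GRPO advantage
# level, across the K rollouts of a group -- see the mechanism finding below.
```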

| Configuration | Mean | Pass@1 | Distinct topologies / K=8 |
|---|---|---|---|
| Base 7B (greedy) | 74.05 | 0% | 7.0 |
| fmtA SFT warm-start | 79.95 | 10% | |
| v0.7 pilot, 50 steps | 80.55 | 10% | 5.10 |
| v0.7 full, 200 steps | 82.75 | 10% | |

Distinct-topology counts are the only direct evidence anti-collapse worked: v0.6 RL camped at 3.65; v0.7 lifted it to 5.10 in 50 steps and held there.

The shaping landed where it could: mean composite climbed +2.8 over the warm-start (+8.7 over the cold base), distinct topologies per K=8 went from 3.65 (the v0.6 RL collapse) to 5.10, and the quality distribution of canvases the model already knows how to produce got better.

Pass@1 did not move. 2/20 problems passed before training, 2/20 passed after. The diagnosis is in the per-tier breakdown: dp1 = 100%, dp2 = 33%, dp3 through dp6 = 0%. Mean score on dp3-dp6 sits at 78-80, uniformly. The model produces near-miss canvases on hard problems but never crosses the line.

Mechanism finding from the pilot. Of the four shaping terms, only one had bite. The diversity bonus (added to the GRPO advantage at the group level) was ~10x more effective than the per-rollout reward shaping (utilization, size, coverage) at the same alpha. The reason is gradient amplitude: 0.10 alpha × a ±0.05 bound gives a ~0.005 reward delta per canvas, dwarfed by the 0.1-0.3 bench reward variance within a K=4 group. Advantage-level shaping is a much stronger gradient signal because it isn't fighting that noise.
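
A sketch of the distinction, with a made-up topology signature and bonus value: per-rollout shaping perturbs the reward before group normalization, where it competes with the bench variance; the diversity bonus lands after normalization, directly on the advantage.

```python
import numpy as np

def topology_signature(canvas: dict) -> tuple:
    """Assumed signature: the sorted multiset of (from_type, to_type) edge pairs."""
    types = {n["id"]: n["type"] for n in canvas["nodes"]}
    return tuple(sorted((types[e["from"]], types[e["to"]]) for e in canvas["edges"]))

def advantages_with_diversity(rewards: np.ndarray, canvases: list, beta: float = 0.10,
                              eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages plus a group-level diversity bonus (illustrative).

    A per-rollout shaping delta of ~0.005 is normalized away against the 0.1-0.3 reward
    variance inside a K=4 group; a bonus added after normalization is not.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    sigs = [topology_signature(c) for c in canvases]
    for i, s in enumerate(sigs):
        if sigs.count(s) == 1:          # reward rollouts whose topology is unique in the group
            adv[i] += beta
    return adv
```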

What v0.7c is doing about it

Pass@1 is stuck because the model never gets gradient signal on the cliff zone. The current train pool is dp1=28, dp2=58, dp3=114, dp4=114, dp5=58, dp6=28, but under uniform sampling nearly half of every batch lands outside the dp3/dp4 cliff, much of it on easy tiers the model already solves. v0.7c routes ~76% of training samples to dp3+dp4 via a tier-weighted curriculum sampler.
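
A minimal sketch of such a sampler; the preset weights are illustrative, chosen so dp3+dp4 get roughly the 76% share above (the shipped version is src/chini_train/data/curriculum.py):

```python
import random

# Tier -> problem count in the v0.7 train pool (from the text above).
POOL = {"dp1": 28, "dp2": 58, "dp3": 114, "dp4": 114, "dp5": 58, "dp6": 28}

# Hypothetical preset weights; "near_miss" concentrates sampling on the dp3/dp4 cliff.
PRESETS = {
    "uniform":   {t: float(n) for t, n in POOL.items()},   # proportional to pool size
    "near_miss": {"dp1": 0.02, "dp2": 0.05, "dp3": 0.38, "dp4": 0.38, "dp5": 0.12, "dp6": 0.05},
    "hard_only": {"dp1": 0.0, "dp2": 0.0, "dp3": 0.25, "dp4": 0.25, "dp5": 0.25, "dp6": 0.25},
}

def sample_tier(preset: str = "near_miss") -> str:
    """Draw a difficulty tier according to the chosen preset's weights."""
    weights = PRESETS[preset]
    tiers = list(weights)
    return random.choices(tiers, weights=[weights[t] for t in tiers], k=1)[0]

# With the near_miss preset, ~76% of sampled problems come from dp3 + dp4.
```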

  • Curriculum sampler shipped. src/chini_train/data/curriculum.py with three presets (uniform / near_miss / hard_only). Wired into the training script via --curriculum near_miss.
  • Same alphas as v0.7-full. Holding the reward function constant isolates the curriculum effect. If pass@1 moves on dp3 specifically, curriculum is the lever and we'll know.
  • Falsifiable success criterion. If pass@1 stays at 10% after curriculum, the diagnosis becomes "base model is the limit, train data is the limit, or the bench is the limit," not "reward function is the limit." That's a meaningful narrowing.

v0.7c is the last cheap experiment in the v0.7 budget. Next step after that is a scale-up call: bigger base, bigger train pool, or refactor the bench-difficulty curve.

Why this is interesting

Most published RL-on-LLM work uses RLHF with a learned reward model. The reward model is itself a noisy LLM trained on preference data. You cannot tell whether the policy is gaming the reward model or actually getting better. We have a reward signal that is a deterministic discrete-event simulator. There is no reward model to game; either the simulated traffic survives or it does not.

That makes CHINI-train a useful environment for studying reward-shaping mechanics directly. The v0.6 collapse-into-minimalism result above is a clean example: a textbook reward-hacking failure caught and explained in a setting where you can actually see what happened, because the grader has no opinions.

The recipe is also small enough to reproduce. Total v0.6 training cost was under $15 on a single A10G. Inference replays are free. The full v0.6 inference autopsy and v0.7 reward specification are in the repo.

What this is not

  • Not a frontier-model result. The base models are 1.5B (v0.6) and 7B (v0.7). The point is the recipe.
  • Not a claim that RL solved system design. v0.6 moved bench score by less than a point at greedy. v0.7 is the first serious attempt at fixing the reward.
  • Not a claim of safety relevance, yet. The "deterministic ground truth removes reward-hacking ambiguity" framing is a hypothesis we are testing, not a finished argument.
  • Not pretending the held-out split is large. 20 problems is a pilot, not a paper. Held-out scaling is on the v0.7 followup list.

Read more

  • Source: github.com/collapseindex/chini-train (private through v0.7c; opening once curriculum results land)
  • v0.6 inference autopsy: 600+ line writeup of every ablation, picker, repair, and oracle-ceiling experiment
  • v0.7 reward specification + pilot session log: the four-term shaping design, the dry-run band-tuning saga, and the tier-cliff finding from the K=8 heldout
  • v0.7-full result: 200-step run on Modal A10G, ~7.9h, $8. Mean composite +2.8 over warm-start, pass@1 unchanged.
  • Currently fundraising for v0.7c + scale-up on Manifund

If you are working on small-model RL, structured-output evaluation, or simulator-based training, get in touch.

Contact →