Research
Chinilla started as a system design simulator. Once the simulator was real, two research threads fell out of it.
Both threads share one premise: a deterministic simulator with executable ground truth is a rare thing in AI evaluation, and almost everyone is using LLM-as-judge or human raters instead. We have the simulator. So we built a benchmark on top of it, and then we built a training pipeline that uses the same simulator as the reward signal. Same engine, two ends of the pipeline.
CHINI-bench
A deterministic, simulator-graded benchmark for AI system design. Models emit a Chinilla architecture; the simulator runs it through stress scenarios; pass or fail is mechanical. No LLM judge.
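A minimal sketch of that grading contract, in Python. The `Simulator` below is a stand-in, not the real Chinilla engine; the scenario names, architecture fields, and thresholds are all illustrative assumptions.

```python
"""Illustrative sketch of the CHINI-bench grading contract.
Simulator is a stand-in, not the real Chinilla engine; scenario
names, fields, and thresholds are made up for illustration."""
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    scenario: str
    passed: bool  # mechanical verdict: no judge, no rubric

class Simulator:
    """Deterministic by construction: same architecture, same
    scenario, same seed -> same verdict, every run."""
    def __init__(self, architecture: dict, seed: int = 0):
        self.architecture = architecture
        self.seed = seed

    def run(self, scenario: str) -> Result:
        # The real engine would execute the design under the
        # scenario's load profile; this only shows the shape.
        capacity = self.architecture.get("queue_capacity", 0)
        peak = {"burst_load": 900, "node_failure": 300}.get(scenario, 100)
        return Result(scenario, passed=capacity >= peak)

def grade(architecture: dict, scenarios: list[str]) -> bool:
    """A design passes only if it survives every stress scenario."""
    return all(Simulator(architecture).run(s).passed for s in scenarios)
```

The point of the sketch is the verdict type: a boolean computed from the run, with no free parameter left for a judge to interpret.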
CHINI-train
An open RL training stack that uses the CHINI-bench simulator as the reward signal. Teaching a small model (1–3B parameters) to design systems that survive the failure modes the bench cares about.
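As a hedged sketch of how the verdict becomes a reward, here is a wrapper that turns any deterministic pass/fail grader into an RL reward function. The names are hypothetical, and per-scenario partial credit is our assumption for illustration, not a documented feature of CHINI-train.

```python
"""Illustrative reward wiring for simulator-graded RL rollouts.
make_reward and run_scenario are hypothetical names; the partial
credit scheme is an assumption, shown only to illustrate shaping."""
import json
from typing import Callable

def make_reward(run_scenario: Callable[[dict, str], bool],
                scenarios: list[str]) -> Callable[[str], float]:
    """Build a reward function from a deterministic grader."""
    def reward(model_output: str) -> float:
        try:
            arch = json.loads(model_output)  # model emits the design as JSON
        except json.JSONDecodeError:
            return 0.0  # malformed output is a failed episode, not a crash
        survived = sum(run_scenario(arch, s) for s in scenarios)
        return survived / len(scenarios)  # fraction of scenarios survived
    return reward

# Usage with the stand-in Simulator sketched under CHINI-bench:
# reward_fn = make_reward(
#     lambda arch, s: Simulator(arch).run(s).passed,
#     ["burst_load", "node_failure"],
# )
```

Because the grader is deterministic, the reward is too: the same rollout always scores the same, which removes one common source of noise in RL on model outputs.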
Why we publish this
AI evaluation is having a credibility problem. Most benchmarks are multiple-choice trivia, LLM-judged subjective scoring, or static datasets that leak into pretraining. None of those tell you whether a model can do a real engineering task and have its output survive contact with reality.
A simulator solves that. Either the design holds under load or it does not. Either the queue overflows or it does not. The grader has no opinions, no preferences, and no memory of training data.
We made the bench public for the same reason we made the simulator public. If the methodology survives external scrutiny, the result is worth something. If it does not, we want to know.
Working on something adjacent?
Reach out if you are running model evals, training small models on graph-valued outputs, working on AI safety measurement, or designing curricula for AI literacy. We are interested in collaborations.
Get in touch →

CHINI-train v0.7 is currently fundraising on Manifund. View the application →