Private preview is open to a few more founding customers.Apply
Engineering · Apr 07, 2026

Haiku is the reviewer. Here’s why a smaller model does this job.

Every entry in your context gets a quality dot — empty, thin, or sharp — scored by a small model running constantly in the background. Why Haiku, and why not the frontier model, is the entire trick.

Every field in your Sempleo context carries a small dot next to it — empty, thin, or sharp. That dot is how we tell a team, at a glance, where their context is strong and where an agent is about to guess.

The dot is not set by hand and it is not set by the writing agent. It is set by a small model running constantly in the background, reading every entry as it changes. The model is Claude Haiku. Choosing Haiku — not Sonnet, not Opus — for this job is one of the more deliberate technical choices in the system.

What the job actually is. When a field is written or updated, a short job fires. The job hands Haiku the field’s type (e.g. client.red_lines), the field’s current value, the values of a few neighbouring fields for context, and a rubric. The rubric is different for each field type — “red lines” needs to be actionable and specific, “brand voice” needs to be characteristic, and so on. Haiku returns a score, a one-sentence reason, and a suggested improvement when the score is below sharp. That is the entire interaction.

It is a small, bounded, repetitive task. It runs thousands of times a day per tenant. The inputs are short. The outputs are shorter. The correctness standard is fuzzy — a score within one dot of what a careful human would give is perfectly fine. That is the exact shape of task a small model is built for.

Why not Opus. You could run the scoring through a frontier model. It would be marginally better. It would also cost twenty or thirty times more per call and take three or four times longer. A frontier model on a small task is not a feature, it is a category error. The moment you use a frontier model to judge a one-sentence field, you have turned a background system into a budget item your CFO will see.

There is a deeper reason too. The scoring model is a second-order system — it judges other things, and other things are then acted upon based on the judgement. Second-order systems should be cheap, predictable, and well-characterised. A frontier model in that seat will drift as it is updated; a small model will drift less. You want the judge to be boring.

Why not a classifier. The alternative at the other extreme is a dedicated classifier — a small fine-tuned model that scores quality. We looked at this seriously. The problem is that the rubric is different per field type, and new field types are added by customers in their team layers every week. A classifier would need to be retrained every time we introduced a new field type. Haiku handles that variation in the prompt, correctly, for cents. The flexibility is worth the small loss in raw accuracy.

What the dots actually change. The dot is not decoration. When an agent resolves the layers it needs for a task, it sees the dots alongside the values. Sharp fields are used with confidence. Thin fields are used, but the agent is instructed to flag its own uncertainty in the output — “drafted based on a thin client preference; worth a skim.” Empty fields are escalated: the agent pauses, proposes that the gap be filled, and files a review queue item. No guessing into a silent void.

That is the whole loop. Haiku scores. Agents read the scores. The team sees, for every field, whether their context is carrying weight or leaking it. When a field sits on “thin” for a week, someone gets a nudge. When a field gets flagged empty by three separate agent runs, it shows up at the top of the owner’s queue.

The bigger pattern. I think the right way to build agent systems is to use small models for the small, repetitive, well-defined tasks, and the frontier model only for the place where judgement actually matters — the user-facing draft, the decision, the synthesis. Most products in this category invert that. They use the best model for everything, which is expensive, slow, and — counter-intuitively — less reliable, because the best model brings unpredictable creativity to a place that needed a boring judge.

Haiku as the reviewer is the concrete form of that belief. If you are designing an agent system and you find yourself reaching for the frontier model for a task that runs ten thousand times a day, stop and ask whether a smaller model would do. It usually will.

More on the data model on the context page. The founding-customer cohort is open to five teams in 2026.

Shape the team-context
layer with us.

We're onboarding a small cohort of founding customers to deploy Sempleo on real workflows. A 45-minute call with the founder — you leave with a plan; we leave with the shape of how your team actually works.