The chasm between a working demo and a working agent.

Every enterprise AI project I have sat in on over the last two years has the same shape. The demo is electric. The pilot is promising. The retro is polite. Nothing ships.

The gap between “this demo is extraordinary” and “this agent is running against real work every day” is where the category actually lives, and almost nobody is writing about it. It is more interesting than benchmarks. It is more interesting than model releases. It is, as far as I can tell, the single most important thing in enterprise AI that is not being talked about seriously.

I want to walk through what actually happens in the chasm, because once you have seen the pattern you cannot un-see it.

Week one: the demo. Someone on the team has been asked to evaluate tools. They open the vendor’s sandbox, pick a handful of representative prompts, and watch the model hit them out of the park. The output is thoughtful. The reasoning is sound. A screenshot goes into the deck. A slot is booked on the exec calendar. The tool is approved for a pilot.

Weeks two to four: the ramp. The tool is wired into real data: a Drive folder, a CRM export, a Slack channel. A few real users are given access. Early outputs still look good, because the first few questions tend to be ones the evaluators have already tried and liked. An enthusiastic champion emails the team. A second screenshot goes into the deck.

Weeks four to eight: the quiet failure. This is where the chasm opens. Users start asking questions the evaluators never tried. The model answers confidently. About a third of the answers are correct, about a third are plausibly wrong, and about a third are correct in a way that is useless: technically accurate, operationally meaningless. Users start double-checking the model. Once double-checking costs more time than the tool saves, the tool is net-negative. Usage quietly flatlines.
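The net-negative point is just arithmetic, and a minimal sketch makes it concrete. The function name, its parameters, and the numbers in the usage note are all illustrative assumptions, not measurements from any real pilot.

```python
def net_minutes_saved(
    answers: int,
    minutes_saved_per_answer: float,
    check_rate: float,
    minutes_per_check: float,
) -> float:
    """Time the tool saves minus time users spend double-checking it.

    All inputs are hypothetical: `check_rate` is the fraction of answers
    a user feels they must verify by hand (0.0 to 1.0).
    """
    saved = answers * minutes_saved_per_answer
    spent_checking = answers * check_rate * minutes_per_check
    return saved - spent_checking


# Made-up week-six numbers: 30 answers, 4 minutes saved each,
# 90% of answers double-checked at 6 minutes per check.
week_six = net_minutes_saved(30, 4.0, 0.9, 6.0)
print(week_six < 0)  # True: checking costs more than the tool saves
```

With those made-up inputs the result is negative. The same function returns a positive number as soon as users trust the answers enough to check only a small fraction of them.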

Weeks eight to ten: the retro. The champion writes a thoughtful doc. Accuracy is cited. Adoption is cited. Integrations are cited. A few changes are suggested. The tool is rolled back to a smaller group, or turned off, or left in a permanent “beta” state that everyone understands is a polite shelf.

Four or five months later, a new vendor runs the same demo, and the cycle begins again.

What is actually breaking. In every one of those week-four-to-eight failures I have looked into, the root cause was not the model. It was not even the integration, usually. It was a thing the model did not know and had no way to learn.

The model did not know that the client the user was drafting to had a standing red line on certain language. It did not know that the reviewer for that kind of draft was not the manager listed in the org chart but the senior colleague two desks over. It did not know that the team uses the word “brief” to mean one thing and the firm uses it to mean another. It did not know that the last time this client got the standard summary, they called in furious, and the summary style was changed.

All of that is context. None of it lives in a place a model can read. Some of it lives in people’s heads. Some of it lives in Slack threads nobody will search again. Some of it lives in a document that has not been updated since last April. And so the model does its best with what it has, which is not enough, and the chasm is the distance between enough-for-a-demo and enough-for-real-work.

What closes the chasm. You close the chasm by making the implicit explicit. You give the team a place to write down the things the model needs to know, structured so an agent can read it, owned so a human will keep it fresh, authority-aware so it reflects the real shape of the team rather than a flattened pile. You make the editing of that context someone’s job — usually the person who was already the informal keeper of it. You close the feedback loop, so that when an agent guesses and a human corrects, the correction lands back in the context, not in an email thread nobody reads.
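To make that less abstract, here is a minimal sketch of what a structured, owned, authority-aware context store with a correction loop could look like. It is not Sempleo's implementation: every name here (`ContextEntry`, `TeamContext`, the integer `authority` rank) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ContextEntry:
    """One piece of team knowledge, written where an agent can read it."""
    topic: str        # e.g. "client summary style"
    statement: str    # the fact the agent needs to know
    owner: str        # the human responsible for keeping it fresh
    authority: int    # hypothetical rank: a higher number wins a conflict
    updated: date
    history: list[str] = field(default_factory=list)  # superseded statements


class TeamContext:
    """Hypothetical in-memory store; a real one would persist and audit."""

    def __init__(self) -> None:
        self._entries: dict[str, ContextEntry] = {}

    def write(self, entry: ContextEntry) -> bool:
        """Store an entry unless a higher-authority one already covers the topic."""
        current = self._entries.get(entry.topic)
        if current is not None:
            if entry.authority < current.authority:
                return False  # lower authority cannot overwrite
            entry.history = current.history + [current.statement]
        self._entries[entry.topic] = entry
        return True

    def read(self, topic: str) -> Optional[str]:
        """What the agent sees for a topic, or None if nothing is written down."""
        entry = self._entries.get(topic)
        return entry.statement if entry else None

    def correct(self, topic: str, statement: str, by: str,
                authority: int, on: date) -> bool:
        """Feedback loop: a human correction lands back in the context."""
        return self.write(ContextEntry(topic, statement, by, authority, on))


ctx = TeamContext()
ctx.write(ContextEntry("summary style", "Send the standard summary.",
                       "manager", 1, date(2024, 1, 8)))
# The senior colleague who actually reviews these drafts corrects the agent.
ctx.correct("summary style", "Send the revised summary style.",
            "senior colleague", 2, date(2024, 2, 5))
print(ctx.read("summary style"))
```

The design choice worth noticing is that a correction is an ordinary write. It carries an owner and an authority rank, so the real reviewer's change sticks and a lower-ranked edit cannot silently undo it.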

That is the work. It is unglamorous. It does not demo. A sharpened context layer does not produce a screenshot that will go into a deck. It produces a week eight where the usage curve does not flatline.

Sempleo exists because I am convinced that the team that solves the chasm, at an infrastructure level, for the category, becomes the system of record for how teams collaborate with AI. That is a ten-year position. The sooner we stop evaluating AI tools by their week-one demo and start evaluating them by their week-eight retention, the sooner the category grows up.

The thesis post has more on the capability-vs-context framing. Applications for the founding cohort are open to five teams.

Five founding customers. One founder. The opening.

Sempleo speaks MCP both ways. Here’s why that matters.

Five layers are the shape of a team. Here’s why Sempleo modelled them.

Shape the team-context layer with us.
