Every business application vendor in 2025 advertises AI features. The
features range from genuine productivity layers to button-and-popup
window dressing that does not change the workflow it claims to
improve. As a firm that builds operational software for a living, we
have an obligation to be careful about what we ship and why.
This essay is about the test we apply when deciding whether AI belongs
in a given module, and what AI looks like when it earns its presence
versus when it is being added because the marketing function asked
for it.
The test
AI earns its presence in operational software when three conditions
are true.
First, there is a specific manual cost it removes. Not "saves time
broadly" but a measurable workflow that currently takes the team
minutes or hours, where the AI cuts that time substantially.
Second, the manual cost is in translation, classification, or
extraction work, not in judgement. AI is good at structural translation
between formats (an unstructured customer request to a structured
quote line, a photograph of a part to a specification record, a
handwritten document to a search index). AI is less good at the kind
of judgement an experienced practitioner exercises, and pretending
otherwise is where the trouble starts.
Third, the human reviews the AI output before it commits to the
system. The reviewer is the practitioner whose work the AI is
assisting. The audit trail records what came from the AI versus what
came from human entry. The relationship between AI and human is
clear: the AI accelerates the slow, error-prone work; the human
remains responsible for the decision.
When these three conditions are true, AI belongs in the module. When
any one of them is missing, AI is being shoehorned in for reasons
other than the work.
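The test can be written down as a small checklist, which is roughly how
we use it in scoping conversations. The sketch below is illustrative
only: the class, the field names, and the two example assessments are
names we made up for this essay, not part of any shipped system. The
email-drafting entry refers to the counter-example discussed later in
this essay.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureAssessment:
    """One proposed AI feature, scored against the three conditions.

    All names here are hypothetical and exist only for illustration.
    """
    name: str
    removes_measurable_manual_cost: bool
    work_is_translation_not_judgement: bool
    human_reviews_before_commit: bool


def failed_conditions(feature: FeatureAssessment) -> List[str]:
    """Return the conditions the feature fails, in the order the essay lists them."""
    checks = {
        "removes a specific manual cost": feature.removes_measurable_manual_cost,
        "work is translation, not judgement": feature.work_is_translation_not_judgement,
        "human reviews before commit": feature.human_reviews_before_commit,
    }
    return [label for label, passed in checks.items() if not passed]


def earns_its_presence(feature: FeatureAssessment) -> bool:
    """AI belongs in the module only when all three conditions hold."""
    return not failed_conditions(feature)


# Two assessments, scored the way this essay scores them.
spec_translation = FeatureAssessment(
    name="spec translation",
    removes_measurable_manual_cost=True,
    work_is_translation_not_judgement=True,
    human_reviews_before_commit=True,
)
email_drafting = FeatureAssessment(
    name="follow-up email drafting",
    removes_measurable_manual_cost=True,       # our reading: the time saving was real
    work_is_translation_not_judgement=False,   # relationship maintenance is judgement work
    human_reviews_before_commit=False,         # "just press send" is not a review
)
```

Returning the list of failed conditions rather than a bare yes or no is
deliberate: the useful part of the conversation with a client is which
condition failed, not the verdict.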
A working example
In the CFX system we built for three cutting-tool manufacturers in
Mumbai, the field sales team uses an Android app that integrates
Gemini AI at one specific layer: spec translation.
A field sales engineer at a customer factory captures a request in
natural language ("they need 50 of the M16 bolts, length 80mm,
stainless 316, with the standard collar fitting"). They optionally
attach a photo of an existing part. The model processes both, extracts
the structured specification fields the quote system expects, and
presents the extracted record to the engineer.
The engineer reviews. They correct any field the model got wrong. They
commit. The record lands in the headquarters CRM with the audit trail
attached: which fields came from the AI extraction, which were
corrected by hand.
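A minimal sketch of what a field-level audit trail of this kind can
look like follows. It is not the CFX code: the class names, the field
names, the engineer identifier, and the example values (including the
model's wrong material grade) are assumptions made for illustration.
The two ideas it shows are that every field carries its own source,
and that nothing commits without a sign-off.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class Source(Enum):
    AI = "ai"
    HUMAN = "human"


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: Source
    ai_original: Optional[str] = None  # what the model proposed, if a human overrode it


@dataclass
class SpecDraft:
    """An extracted specification awaiting engineer review (hypothetical shape)."""
    fields: Dict[str, FieldValue]
    reviewed_by: Optional[str] = None

    def correct(self, name: str, new_value: str) -> None:
        """Record a hand correction while keeping the model's original proposal."""
        old = self.fields[name]
        original = old.value if old.source is Source.AI else old.ai_original
        self.fields[name] = FieldValue(new_value, Source.HUMAN, ai_original=original)

    def sign_off(self, engineer: str) -> None:
        self.reviewed_by = engineer


def commit(draft: SpecDraft) -> dict:
    """Build the record sent onward; refuse drafts nobody has signed off."""
    if draft.reviewed_by is None:
        raise PermissionError("draft must be reviewed by an engineer before commit")
    return {
        "reviewed_by": draft.reviewed_by,
        "fields": {
            name: {
                "value": fv.value,
                "source": fv.source.value,
                "ai_original": fv.ai_original,
            }
            for name, fv in draft.fields.items()
        },
    }


# Illustrative run: the model extracts four fields and gets the material wrong.
draft = SpecDraft(fields={
    "quantity": FieldValue("50", Source.AI),
    "thread": FieldValue("M16", Source.AI),
    "length_mm": FieldValue("80", Source.AI),
    "material": FieldValue("stainless 304", Source.AI),
})
draft.correct("material", "stainless 316")
draft.sign_off("field-engineer-01")
record = commit(draft)
```

In this sketch the committed record marks the material field as human
entered and keeps the model's original proposal beside it, so a later
reader of the audit trail can see both what the AI suggested and what
the engineer decided.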
The AI is not writing the quote. It is doing the structural translation
between conversation and form. The engineer remains responsible for
the specification's accuracy. The headquarters team that processes
the request never sees the AI; they see the structured record the
engineer signed off on.
This module passes all three tests. The manual cost being removed is
the clipboard-and-translate work that field sales has always done
slowly and with errors. The work is classification and extraction,
not judgement. The human reviews before commit, with a clear audit
trail.
The module ships in production. It works.
A counter-example we deliberately did not build
The same client asked, during scoping, whether the AI could draft the
follow-up email to the customer after the spec was captured. The
field sales engineer would just press send.
We declined to build that feature.
The reason is the third condition. An AI-generated customer email
that the engineer just presses send on is not a human-reviewed
artefact in any meaningful sense. The engineer skims it at best. The
customer receives a message that the engineer did not really write.
The relationship language drifts toward generic. The customer notices
over time. The trust the engineer has built with the customer erodes.
The work the AI was being asked to do here was not translation. It
was relationship maintenance, which is judgement work. The right
answer was not to add AI; it was to leave the email writing to the
engineer.
This is the harder call. The client wanted the feature. The case for
it was plausible. The math of saving the engineer two minutes per
email looked positive. We said no because the long-term cost was
larger than the short-term saving.
What this means for buyers of operational software
A founder evaluating operational software with AI features should
apply the three tests at the demo. For each AI feature the vendor
shows, ask three questions.
What specific manual cost does this remove? If the answer is vague,
the feature is window dressing.
Is the work translation or judgement? If the vendor claims AI handles
judgement well, the vendor is selling something other than what they
have.
Does a human review before commit, with a clear audit trail? If the
answer is no, the system is making decisions on the firm's behalf
that the firm cannot inspect.
A system that passes all three tests for each AI feature is the kind
of system worth running. A system that fails one or more is the kind
that looks good on marketing slides and disappoints in daily
operation.
The category will sort itself out
The current state of AI in business applications is messy because
the technology shifted faster than vendors could thoughtfully
integrate it. Most current AI features will be replaced or removed
over the next two years as the practitioners using them discover
what works and what does not.
The features that survive will be the ones that pass the three
tests. The features that do not will quietly disappear from the
marketing decks, and the vendors will pretend they were never there.
For now, the right posture for a buyer is scepticism toward AI claims
and a willingness to ask the three questions. For us, the right
posture as a builder is to apply the tests rigorously and ship only
the AI that earns its presence.