"How's it working?" — the first question leadership asks after any AI project delivery.
Most projects answer in one of three ways:
- Demo-style: "Watch this — I type a question, AI answers instantly. Impressive, right?"
- Model-style: "Our model hit 92% accuracy, 15 points above industry average."
- Deck-style: "Employee satisfaction is 4.3/5 based on feedback forms."
None of these pass. They get a smile from leadership, but they don't answer the finance team's next question: "so how much did we earn or save this year?"
This piece lays out how to actually validate AI project ROI — six business metrics that matter, plus the standard 30-day revisit protocol.
1. Six business metrics — pick 2-3 by project type
Not every metric fits every project. Pick 2-3 that match what your AI solved.
Metric 1: Efficiency (person-days / person-hours)
Most common, most direct.
Definition: person-hours a process consumed before AI launch → person-hours after.
Fits: content production (email, proposal, report), data processing (reconciliation, form entry, input), decision support (filtering, classification, recommendation).
Method:
- Before launch, pick 10-20 typical tasks; record person-hours per task
- At day 30 post-launch, pick 10-20 same-type tasks; record again
- Compare averages and medians
Caveats:
- Tasks must be comparable (don't compare simple before to complex after)
- Sample size matters (under 10 is unreliable)
- Don't cherry-pick (use real business distribution)
Typical result: at one manufacturer, customer quoting dropped from 4h/person to 40min/person. Annualized (≈2,000 quotes/year across a 4-person team) ≈ 5,000 person-hours saved per year.
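The method above can be sketched as a short script. A minimal sketch: the 12 before/after task samples are hypothetical, and the function enforces the minimum-sample caveat.

```python
from statistics import mean, median

def efficiency_report(before_hours, after_hours):
    """Compare person-hours per task before vs. after launch.
    Expects comparable, randomly sampled tasks; refuses samples
    below the 10-task floor from the caveats above."""
    if min(len(before_hours), len(after_hours)) < 10:
        raise ValueError("need at least 10 samples per period")
    return {
        "mean_before": mean(before_hours),
        "mean_after": mean(after_hours),
        "median_before": median(before_hours),
        "median_after": median(after_hours),
        "mean_saving_pct": 100 * (1 - mean(after_hours) / mean(before_hours)),
    }

# Hypothetical per-quote person-hours for 12 tasks in each period
before = [4.0, 3.5, 4.5, 4.2, 3.8, 4.1, 4.4, 3.9, 4.0, 4.3, 3.7, 4.6]
after = [0.7, 0.6, 0.8, 0.7, 0.5, 0.9, 0.6, 0.7, 0.8, 0.6, 0.7, 0.5]
report = efficiency_report(before, after)
```

Reporting both mean and median matters: a few unusually complex tasks can drag the mean without moving the median.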
Metric 2: Error rate
Top choice for quality-focused AI projects.
Definition: error-type frequency before launch → after.
Fits: quality inspection (industrial vision AI replacing human inspectors), data accuracy (invoice recognition, form entry), compliance review (contract risk, financial audit).
Method:
- Fix a window (say, 3 months before launch); count errors / total samples
- At day 30 post-launch, same count
Caveats:
- "Error" must be defined upfront; if the definition can be adjusted after the fact, the numbers become arbitrary
- Discovery mechanism must be consistent (errors found by AI vs. humans should both be counted)
- Rare error types have high variance. If errors occur ~5 times/month, 30 days isn't enough sample
Typical result: at an auto-parts factory, visual QC missed-defect rate dropped from 0.8% to 0.1%. Annualized (500k units/year × 0.7% fewer misses × ¥50/unit) ≈ ¥175k in recall costs saved.
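One way to put a number on the rare-error variance caveat is a Wilson score interval around each rate: a wide interval means the window produced too few samples to trust the comparison. This is a sketch under that assumption; the inspection counts are hypothetical.

```python
import math

def error_rate_with_interval(errors, total, z=1.96):
    """Observed error rate plus a 95% Wilson score interval.
    A wide interval means the observation window was too short
    for this error's frequency."""
    p = errors / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total
                                   + z ** 2 / (4 * total ** 2))
    return p, center - half, center + half

# Hypothetical counts: 40 missed defects in 5,000 units before launch,
# 5 in 5,000 units after
before = error_rate_with_interval(40, 5000)
after = error_rate_with_interval(5, 5000)
```

If the after-interval's upper bound sits below the before rate, the improvement is larger than sampling noise; if the intervals overlap heavily, extend the window beyond 30 days before claiming a win.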
Metric 3: Response time
Top choice for CX and decision-focused projects.
Definition: mean (or P95/P99) response time change before/after.
Fits: customer service (reply time), decision chains (time to conclusion), cross-department coordination (handling duration).
Method:
- Before: extract response-time distribution from system logs
- After: same extraction; compare mean, median, P95
Caveats:
- Look at distribution, not just mean. Slow P95 means long-tail problems remain
- Compare the same time windows (e.g., weekday daytime only; don't mix in weekends)
- Filter extreme outliers (a few anomalies skew the mean heavily)
Typical result: at a logistics company, customer inquiry response time dropped from 2h to 15min. Satisfaction lift drove repeat-buy rate from 18% to 27%.
Metric 4: Training completion / capability transfer
Core metric for adoption-focused projects.
Definition: among the target employees in scope, how many completed training AND actually used the tool X times.
Fits: AI training projects, tool rollout, knowledge base projects.
Method:
- Define "completion" (e.g., completed 2h training + used ≥5 times within 30 days)
- Measure at day 30, 60, 90
Caveats:
- "Actual use" must come from system logs, not self-report
- Different roles should have different thresholds (frontline vs management)
- Low usage triggers investigation — tool UX? wrong scenario? resistance?
Typical result: one corporate training program set the target "300 sales reps, 30 days, training + ≥10 uses" and hit 82%. The 18% who missed cited "travel-heavy, no time", which led to a mobile-only optimization.
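The completion definition above translates directly to log data. A sketch under hypothetical assumptions: employee ids, dates, and the 5-use threshold are illustrative.

```python
from datetime import date

def completion_rate(in_scope, trained, usage_log, rollout_start,
                    min_uses=5, window_days=30):
    """Share of in-scope employees who completed training AND logged
    at least min_uses tool uses within window_days of rollout.
    Usage must come from system logs, not self-report."""
    cutoff = rollout_start.toordinal() + window_days
    uses = {}
    for emp, d in usage_log:
        if emp in in_scope and rollout_start.toordinal() <= d.toordinal() <= cutoff:
            uses[emp] = uses.get(emp, 0) + 1
    completed = {e for e in in_scope
                 if e in trained and uses.get(e, 0) >= min_uses}
    return len(completed) / len(in_scope)

# Hypothetical logs: "a" and "c" complete; "b" used the tool too
# little; "d" used it a lot but never finished training
log = ([("a", date(2025, 1, i)) for i in range(2, 7)]
       + [("b", date(2025, 1, 3)), ("b", date(2025, 1, 4))]
       + [("c", date(2025, 1, i)) for i in range(2, 8)]
       + [("d", date(2025, 1, i)) for i in range(2, 12)])
rate = completion_rate({"a", "b", "c", "d"}, {"a", "b", "c"}, log,
                       date(2025, 1, 1))
# 2 of 4 in scope completed -> 0.5
```

Per-role thresholds (frontline vs. management) are just different min_uses values over role-filtered populations.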
Metric 5: Process replacement rate
Terminal metric for org-level transformation projects.
Definition: of the N steps in a formerly manual process, what fraction is automated (or semi-automated) after launch.
Fits: process automation (approval, dispatch, handling), cross-system coordination (order fulfillment, customer service), knowledge work (analysis, reporting, decision support).
Method:
- Before: list 10-20 manual steps of the process
- After: which still require human, which are automated
Caveats:
- Semi-automated steps (AI output + human review) count, but track them separately
- 100% replacement isn't always good (some steps shouldn't be fully automated, e.g., customer complaints)
- Quality of replaced process must be tracked too ("automated" doesn't mean "outcome correct")
Typical result: at a retailer, the assortment-planning process dropped from 8 manual steps to 3 (AI does data analysis + initial filtering; buyer makes final call). 62.5% replacement.
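The before/after step audit reduces to counting statuses, with full and semi-automation tracked separately per the caveat above. A sketch: the step names and statuses are hypothetical.

```python
def replacement_rate(steps):
    """steps maps step name -> 'manual', 'semi', or 'auto'.
    Returns full-automation and semi-automation rates separately,
    per the caveat above."""
    n = len(steps)
    auto = sum(1 for s in steps.values() if s == "auto")
    semi = sum(1 for s in steps.values() if s == "semi")
    return auto / n, semi / n

# Hypothetical statuses for an 8-step assortment-planning process
steps = {
    "pull sales data": "auto",
    "clean data": "auto",
    "initial category filter": "auto",
    "draft assortment list": "auto",
    "flag anomalies": "semi",
    "review with suppliers": "manual",
    "final selection": "manual",
    "sign-off": "manual",
}
auto_rate, semi_rate = replacement_rate(steps)
```

Reporting the two rates side by side keeps "AI drafts, human approves" from being passed off as full automation.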
Metric 6: Cost (hard money number)
The metric leadership and finance care about most. Also the easiest to massage.
Definition: actual cost reduction (or revenue increase) post-launch.
Fits: all projects.
Method:
Simplest formula:
Savings = (pre-launch labor + other cost) - (post-launch labor + other cost + AI system run cost)
Key: AI system run cost must be included. Many reports deliberately omit this, only counting "labor saved" — but private deployment depreciation, API fees, ops headcount are all real costs.
Caveats:
- Account over a full period, not just the first month
- Be conservative (e.g., discount labor savings to 70%, since employees usually aren't actually laid off)
- Attribute revenue gains carefully (AI is one contributor, not the sole cause)
Typical result: Guangdong hardware factory reduced order-tracking team from 5 to 2; shipment delay rate dropped 65%. ¥100k/person/year × 3 + ¥200k in delay penalty reduction = ~¥500k/year savings. Minus ¥210k annual run cost = ¥290k net ROI/year.
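The savings formula can be written directly as a function with the run cost and the conservative labor discount built in. The example below reuses the hardware-factory figures from the paragraph above; no discount is applied there because headcount genuinely dropped.

```python
def net_savings(labor_before, other_before, labor_after, other_after,
                ai_run_cost, labor_discount=0.7):
    """Annual net savings = (pre-launch costs) - (post-launch costs
    + AI run cost). Labor savings are discounted (default 70%) to stay
    conservative when headcount isn't actually reduced."""
    labor_saved = (labor_before - labor_after) * labor_discount
    other_saved = other_before - other_after
    return labor_saved + other_saved - ai_run_cost

# Hardware-factory figures from above: labor 5 -> 2 people at
# ¥100k/person/year, ¥200k less in delay penalties, ¥210k run cost.
# Headcount genuinely dropped, so labor_discount is set to 1.0.
roi = net_savings(500_000, 200_000, 200_000, 0, 210_000,
                  labor_discount=1.0)
# -> 290_000, matching the ¥290k net ROI above
```

Making the run cost a required argument is the point: the formula simply cannot be computed without it.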
2. 30-day data revisit — standard protocol
Metrics alone aren't enough. There has to be a review rhythm. Our standard with all clients: 30-day revisit + 90-day validation.
30-day revisit process
Step 1: Sign the baseline at kickoff (~2-hour session)
- Pick 2-3 core metrics (from the six above)
- Freeze the calculation formula (so different people can't compute it differently)
- Freeze the target number (e.g., "order-tracking person-hours: 4h → target 1h")
- Freeze the validation date (day 30 and day 90 post-signing)
These four items go into the contract annex. No signature, no project kickoff.
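The four frozen items map naturally onto a single structured record per metric. A sketch only; the metric, formula wording, values, and dates below are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)  # frozen: the baseline must not change after signing
class Baseline:
    metric: str
    formula: str        # the frozen calculation, written out in words
    baseline_value: float
    target_value: float
    day30: date
    day90: date

# Hypothetical signed baseline for one core metric
baselines = [
    Baseline(
        metric="order-tracking person-hours per order",
        formula="total tracking hours / orders handled, weekday samples only",
        baseline_value=4.0,
        target_value=1.0,
        day30=date(2025, 2, 14),
        day90=date(2025, 4, 15),
    ),
]
```

Whether this lives in code, a spreadsheet, or the contract annex itself matters less than the frozen property: any change after signing is a renegotiation, not an edit.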
Step 2: Weekly data review (15-minute standups)
- Pull data every Wednesday
- Compare trend vs baseline
- Escalate anomalies immediately
Adoption owner (business-side appointee) runs this. It's business watching business metrics — not IT reporting.
Step 3: Day-30 review meeting (1-2 hours)
- Core metrics vs targets (met / missed / partial)
- Cause analysis for misses (tech, usage, data, process)
- Remediation plan + timeline
If core metrics miss, the consultancy bears the remediation cost through day 90. This is the backbone of the validation clause.
Step 4: Day-90 validation meeting (2-3 hours)
- Final validation report: baseline / day 30 / day 90 for each core metric
- Signed by leadership and finance
- Project formally closed
- Next-phase arrangements (renewal / client takeover / termination)
Why 30 days and 90 days
Day 30 is the minimum honesty window — novelty effect peaks in month one, starts fading at 30 days, so you see near-steady-state numbers.
Day 90 is the full validation window — long enough to see:
- Whether usage habits stabilized (not three-minute enthusiasm)
- Whether edge cases have surfaced
- Whether data-quality issues got patched
- Whether second-order effects of process redesign show up
Beyond 90 days, the business environment itself shifts too much — AI's effect gets mixed with other factors.
3. A counter-example: ROI validation as deck-craft
The most typical failure mode we've seen:
A group-IT department delivered a "smart customer service" project, then presented 6 months later. Their data:
- CSAT: 4.1 (was 4.0) — sample size went from 200 to 500, not comparable
- CS ticket volume: down 30% — that period was sales low-season
- AI usage: "average 3.2 times/week per employee" — no baseline
- "Positive employee feedback" — no quantification
The deck looked great. Executives were pleased. Eighteen months after launch, this "smart customer service" was quietly decommissioned — cost too high, actual usage declining, CSAT not sustainably improved. Total investment ¥3M.
Lesson: ROI validation rigor = probability of sustained operation. Validation that gets waved through is seeding a future shutdown.
4. Closing
AI ROI validation isn't mystical. Six business metrics + 30/90-day review + pre-signed baseline.
The hard part isn't method. It's willingness to commit — the vendor willing to put ROI in the contract, the client willing to validate against it. Most projects don't get there, not for technical reasons — for fuzzy accountability.
Every engagement we run requires an ROI clause at kickoff, a 30-day revisit, and vendor-funded remediation on a miss. The cap of 20 clients per year exists because this validation process is expensive.
If your current AI project has no ROI validation, add baseline + target now — even mid-flight. Starting today is better than no data at all. A free AI audit includes an assessment of your project's ROI measurability — where the gaps are and how to close them.