"How's it working?" — the first question leadership asks after any AI project delivery.

Most projects answer in one of three ways:

  1. Demo-style: "Watch this — I type a question, AI answers instantly. Impressive, right?"
  2. Model-style: "Our model hit 92% accuracy, 15 points above industry average."
  3. Deck-style: "Employee satisfaction is 4.3/5 based on feedback forms."

None of these pass. They get a smile from leadership, but they don't answer the finance team's next question: "so how much did we earn or save this year?"

This piece lays out how to actually validate AI project ROI — six business metrics that matter, plus the standard 30-day revisit protocol.

1. Six business metrics — pick 2-3 by project type

Not every metric fits every project. Pick 2-3 that match what your AI solved.

Metric 1: Efficiency (person-days / person-hours)

Most common, most direct.

Definition: person-hours a process consumed before AI launch → person-hours after.

Fits: content production (email, proposal, report), data processing (reconciliation, form entry, input), decision support (filtering, classification, recommendation).

Method:

  1. Before launch, pick 10-20 typical tasks; record person-hours per task
  2. At day 30 post-launch, pick 10-20 same-type tasks; record again
  3. Compare averages and medians

Caveats:

  • Tasks must be comparable (don't compare simple before to complex after)
  • Sample size matters (under 10 is unreliable)
  • Don't cherry-pick (use real business distribution)

Typical result: at one manufacturer, customer quoting dropped from 4h/person to 40min/person. Annualized (2000 quotes × 4 people) ≈ 5000 person-hours saved per year.
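
A minimal sketch of the record-and-compare step in Python (the task timings and the 2,000-tasks-per-year volume are illustrative, not the quoting case above):

    import statistics

    # Person-hours per task, sampled before and 30 days after launch
    # (hypothetical values, 10 tasks each).
    before_hours = [4.2, 3.8, 4.5, 4.0, 3.9, 4.3, 4.1, 4.6, 3.7, 4.4]
    after_hours = [0.7, 0.6, 0.8, 0.7, 0.9, 0.6, 0.7, 0.8, 0.6, 0.7]

    def summarize(label, hours):
        # Report mean and median; the median resists a few unusually slow tasks.
        print(f"{label}: mean={statistics.mean(hours):.2f}h, "
              f"median={statistics.median(hours):.2f}h, n={len(hours)}")

    summarize("before", before_hours)
    summarize("after", after_hours)

    saved_per_task = statistics.mean(before_hours) - statistics.mean(after_hours)
    tasks_per_year = 2000  # assumed annual volume, purely for illustration
    print(f"~{saved_per_task * tasks_per_year:,.0f} person-hours saved per year")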

Metric 2: Error rate

Top choice for quality-focused AI projects.

Definition: error-type frequency before launch → after.

Fits: quality inspection (industrial vision AI replacing human inspectors), data accuracy (invoice recognition, form entry), compliance review (contract risk, financial audit).

Method:

  1. Fix a window (say, 3 months before launch); count errors / total samples
  2. At day 30 post-launch, same count

Caveats:

  • "Error" must be defined upfront. Without it, adjusting the definition makes numbers arbitrary
  • Discovery mechanism must be consistent (errors found by AI vs. humans should both be counted)
  • Rare error types have high variance. If errors occur ~5 times/month, 30 days isn't enough sample

Typical result: at an auto-parts factory, visual QC missed-defect rate dropped from 0.8% to 0.1%. Annualized (500k units/year × ¥50/unit) ≈ ¥1.75M recall cost saved.
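
A minimal sketch of the rate comparison with a rough 95% confidence interval, which makes the sample-size caveat concrete (the counts and window sizes are illustrative; only the 0.8% → 0.1% rates mirror the case above):

    import math

    def wilson_interval(errors, total, z=1.96):
        # 95% Wilson score interval for an error rate; a wide interval means
        # the window is too short to call the change real.
        p = errors / total
        denom = 1 + z**2 / total
        center = (p + z**2 / (2 * total)) / denom
        margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
        return center - margin, center + margin

    # Hypothetical counts: 3-month pre-launch window vs. 30-day post-launch window.
    before_errors, before_total = 120, 15000
    after_errors, after_total = 5, 5000

    for label, e, n in [("before", before_errors, before_total),
                        ("after", after_errors, after_total)]:
        low, high = wilson_interval(e, n)
        print(f"{label}: rate={e / n:.3%}, 95% CI [{low:.3%}, {high:.3%}]")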

Metric 3: Response time

Top choice for CX and decision-focused projects.

Definition: mean (or P95/P99) response time change before/after.

Fits: customer service (reply time), decision chains (time to conclusion), cross-department coordination (handling duration).

Method:

  1. Before: extract response-time distribution from system logs
  2. After: same extraction; compare mean, median, P95

Caveats:

  • Look at distribution, not just mean. Slow P95 means long-tail problems remain
  • Compare same time windows (weekday daytime only, don't mix with weekends)
  • Filter extreme outliers (a few anomalies skew the mean heavily)

Typical result: at a logistics company, customer inquiry response time dropped from 2h to 15min. Satisfaction lift drove repeat-buy rate from 18% to 27%.
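
A minimal sketch of the log-based comparison (the response times, the 480-minute outlier cap, and the nearest-rank percentile helper are assumptions for illustration):

    import math
    import statistics

    def percentile(values, q):
        # Nearest-rank percentile; good enough for a review meeting.
        ordered = sorted(values)
        idx = max(0, math.ceil(q / 100 * len(ordered)) - 1)
        return ordered[idx]

    def summarize(label, minutes, outlier_cap=480):
        # Drop extreme outliers (e.g., tickets left open overnight) before comparing.
        kept = [m for m in minutes if m <= outlier_cap]
        print(f"{label}: mean={statistics.mean(kept):.1f}min, "
              f"median={statistics.median(kept):.1f}min, "
              f"p95={percentile(kept, 95):.1f}min, n={len(kept)}")

    # Response times in minutes pulled from system logs (hypothetical values).
    before = [95, 130, 110, 180, 75, 240, 160, 125, 90, 600]
    after = [12, 18, 9, 25, 14, 11, 30, 16, 13, 90]

    summarize("before", before)
    summarize("after", after)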

Metric 4: Training completion / capability transfer

Core metric for adoption-focused projects.

Definition: among the employees in scope, how many completed the training AND actually used the tool at least X times.

Fits: AI training projects, tool rollout, knowledge base projects.

Method:

  1. Define "completion" (e.g., completed 2h training + used ≥5 times within 30 days)
  2. Measure at day 30, 60, 90

Caveats:

  • "Actual use" must come from system logs, not self-report
  • Different roles should have different thresholds (frontline vs management)
  • Low usage triggers investigation — tool UX? wrong scenario? resistance?

Typical result: one corporate training program set the target "300 sales reps, 30 days, training + ≥10 uses". 82% hit it. The remaining 18% reported "travel-heavy, no time", which led to a mobile-only optimization.
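
A minimal sketch of the completion calculation driven by usage logs rather than self-report (employee IDs, dates, and the record layout are illustrative; the thresholds match the ≥5-uses example above):

    from collections import Counter
    from datetime import date

    # One row per tool invocation pulled from system logs, not self-report
    # (hypothetical records: employee id, date of use).
    usage_log = [
        ("e001", date(2024, 3, 2)), ("e001", date(2024, 3, 4)),
        ("e001", date(2024, 3, 9)), ("e001", date(2024, 3, 15)),
        ("e001", date(2024, 3, 21)), ("e002", date(2024, 3, 5)),
    ]
    trained = {"e001", "e002", "e003"}           # completed the 2h training
    in_scope = {"e001", "e002", "e003", "e004"}  # target employee population

    launch = date(2024, 3, 1)
    window_days, min_uses = 30, 5

    uses = Counter(emp for emp, d in usage_log
                   if 0 <= (d - launch).days < window_days)
    completed = {emp for emp in in_scope
                 if emp in trained and uses[emp] >= min_uses}

    print(f"completion: {len(completed)}/{len(in_scope)} "
          f"= {len(completed) / len(in_scope):.0%}")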

Metric 5: Process replacement rate

Terminal metric for org-level transformation projects.

Definition: of an N-step manual process, what fraction of the steps is automated (or semi-automated) after AI.

Fits: process automation (approval, dispatch, handling), cross-system coordination (order fulfillment, customer service), knowledge work (analysis, reporting, decision support).

Method:

  1. Before: list 10-20 manual steps of the process
  2. After: which still require human, which are automated

Caveats:

  • Semi-automated steps (AI + human review) count, but track them separately
  • 100% replacement isn't always good (some steps shouldn't be fully automated, e.g., customer complaints)
  • Quality of replaced process must be tracked too ("automated" doesn't mean "outcome correct")

Typical result: at a retailer, the assortment-planning process dropped from 8 manual steps to 3 (AI does data analysis + initial filtering; buyer makes final call). 62.5% replacement.
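
A minimal sketch of the replacement-rate calculation, tracking semi-automated steps separately per the caveat above (the step names and statuses are hypothetical, arranged to mirror the 8-step assortment example):

    # Status of each step in the target process after launch:
    # "auto" = fully automated, "semi" = AI output + human review, "manual" = unchanged.
    steps = {
        "pull sales data": "auto",
        "clean and join sources": "auto",
        "rank candidate SKUs": "semi",
        "draft assortment proposal": "semi",
        "check supplier constraints": "auto",
        "negotiate with suppliers": "manual",
        "final buyer sign-off": "manual",
        "publish plan to stores": "manual",
    }

    total = len(steps)
    auto = sum(1 for s in steps.values() if s == "auto")
    semi = sum(1 for s in steps.values() if s == "semi")

    print(f"fully automated: {auto}/{total} = {auto / total:.1%}")
    print(f"semi-automated:  {semi}/{total} = {semi / total:.1%}")
    print(f"combined:        {auto + semi}/{total} = {(auto + semi) / total:.1%}")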

Metric 6: Cost (hard money number)

The metric leadership and finance care most about. Also the easiest to dress up.

Definition: actual cost reduction (or revenue increase) post-launch.

Fits: all projects.

Method:

Simplest formula:

Savings = (pre-launch labor + other cost) - (post-launch labor + other cost + AI system run cost)

Key: AI system run cost must be included. Many reports deliberately omit this, only counting "labor saved" — but private deployment depreciation, API fees, ops headcount are all real costs.

Caveats:

  • Full-period accounting, not month one
  • Be conservative (e.g., discount labor savings to 70%, since employees aren't actually laid off)
  • Attribute revenue carefully (AI is a contributor, not the sole cause)

Typical result: a Guangdong hardware factory cut its order-tracking team from 5 to 2; shipment delay rate dropped 65%. ¥100k/person/year × 3 + ¥200k in delay penalty reduction = ~¥500k/year savings. Minus ¥210k annual run cost = ¥290k net ROI/year.
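
A minimal sketch of the savings formula with the hardware-factory numbers plugged in; the net_savings helper and its labor_discount parameter are illustrative names, and the optional discount applies the conservative 70% haircut from the caveats:

    def net_savings(labor_saved, other_saved, ai_run_cost, labor_discount=1.0):
        # labor_discount < 1.0 applies the conservative haircut: headcount
        # freed up is not the same as headcount removed from payroll.
        return labor_saved * labor_discount + other_saved - ai_run_cost

    # Hardware-factory case from above: 3 fewer people on order tracking,
    # lower delay penalties, minus the system's annual run cost (all in ¥).
    labor_saved = 100_000 * 3  # ¥100k/person/year × 3 people
    other_saved = 200_000      # delay-penalty reduction
    ai_run_cost = 210_000      # hosting, API fees, ops

    print(f"net ROI/year: ¥{net_savings(labor_saved, other_saved, ai_run_cost):,.0f}")
    print(f"conservative: ¥{net_savings(labor_saved, other_saved, ai_run_cost, 0.7):,.0f}")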

2. 30-day data revisit — standard protocol

Metrics alone aren't enough. There has to be a review rhythm. Our standard with all clients: 30-day revisit + 90-day validation.

30-day revisit process

Step 1: Sign the baseline at kickoff (~2-hour session)

  • Pick 2-3 core metrics (from the six above)
  • Freeze the calculation formula (shouldn't differ across people)
  • Freeze the target number (e.g., "order-tracking person-hours: 4h → target 1h")
  • Freeze the validation date (day 30 and day 90 post-signing)

These four items go into the contract annex. No signature, no project kickoff.
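
A minimal sketch of the signed baseline captured as a machine-readable record, so the weekly pulls always compare against the same frozen formula and target (the field names, dates, and values are illustrative):

    from dataclasses import dataclass
    from datetime import date

    @dataclass(frozen=True)  # frozen: the baseline is signed, not editable
    class MetricBaseline:
        name: str
        formula: str  # the frozen calculation, written out in words
        baseline_value: float
        target_value: float
        unit: str

    PROJECT_BASELINE = {
        "signed_on": date(2024, 3, 1),
        "validation_dates": [date(2024, 3, 31), date(2024, 5, 30)],  # day 30, day 90
        "metrics": [
            MetricBaseline(
                name="order-tracking person-hours",
                formula="mean person-hours per order, sampled over 20 typical orders",
                baseline_value=4.0,
                target_value=1.0,
                unit="hours/order",
            ),
        ],
    }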

Step 2: Weekly data review (15-minute standups)

  • Pull data every Wednesday
  • Compare trend vs baseline
  • Escalate anomalies immediately

The adoption owner (a business-side appointee) runs this. It's the business watching business metrics, not IT reporting.

Step 3: Day-30 review meeting (1-2 hours)

  • Core metrics vs targets (met / missed / partial)
  • Cause analysis for misses (tech, usage, data, process)
  • Remediation plan + timeline

If core metrics miss, the consultancy bears the remediation cost through day 90. This is the backbone of the validation clause.

Step 4: Day-90 validation meeting (2-3 hours)

  • Final validation report: baseline / day 30 / day 90 for each core metric
  • Signed by leadership and finance
  • Project formally closed
  • Next-phase arrangements (renewal / client takeover / termination)

Why 30 days and 90 days

Day 30 is the minimum honesty window: the novelty effect peaks in the first weeks and is fading by day 30, so you start to see near-steady-state numbers.

Day 90 is the full validation window — long enough to see:

  • Whether usage habits stabilized (not three-minute enthusiasm)
  • Whether edge cases have surfaced
  • Whether data-quality issues got patched
  • Whether second-order effects of process redesign show up

Beyond 90 days, the business environment itself shifts too much — AI's effect gets mixed with other factors.

3. A counter-example: ROI validation as deck-craft

The most typical failure mode we've seen:

A group-IT department delivered a "smart customer service" project, then presented 6 months later. Their data:

  • CSAT: 4.1 (was 4.0) — sample size went from 200 to 500, not comparable
  • CS ticket volume: down 30% — that period was sales low-season
  • AI usage: "average 3.2 times/week per employee" — no baseline
  • "Positive employee feedback" — no quantification

The deck looked great. Executives were pleased. Eighteen months after launch, this "smart customer service" was quietly decommissioned — cost too high, actual usage declining, CSAT not sustainably improved. Total investment ¥3M.

Lesson: ROI validation rigor = probability of sustained operation. A validation that gets waved through is seeding a future shutdown.

4. Closing

AI ROI validation isn't mystical. Six business metrics + 30/90-day review + pre-signed baseline.

The hard part isn't method. It's willingness to commit — the vendor willing to put ROI in the contract, the client willing to validate against it. Most projects don't get there, not for technical reasons — for fuzzy accountability.

Every engagement we run requires an ROI clause at kickoff, a 30-day revisit, and vendor-funded remediation on a miss. The cap of 20 clients per year exists because this validation process is expensive.

If your current AI project has no ROI validation, add a baseline and a target now, even mid-flight. Data that starts today beats no data at all. A free AI audit includes an assessment of your project's ROI measurability: where the gaps are and how to close them.