A defensible AI agent ROI report has four parts: a documented baseline measured before the agent shipped, a controlled pilot with both treatment and control cohorts, a measured rollout that captures incremental gains, and a report that ties dollars to specific operational metrics with statistical confidence. This playbook lays out the 90-day cadence — what to do in each two-week sprint — to produce a CFO-ready report at the end.
Why most ROI claims fail audit
Three patterns blow up in audit. First, no real baseline: the team measured "after" but never wrote down "before," so the savings number compares the new world to a hazy memory. Second, no control: the agent shipped during a quarter when other things changed (new tools, new processes, new hires), and the savings get attributed to the agent when other factors mattered too. Third, soft metrics: "time saved per task" multiplied by "number of tasks" multiplied by "loaded labor cost" is the right shape, but if any of those numbers is estimated rather than measured, the report falls apart in a finance review. The fix is a structured 90 days that locks in measurement before the agent goes live.
Days 1–14: Baseline
Pick the workflow and scope it tightly: which team, what is in scope, and what is not. Pull at least 30 days of historical data. Measure: throughput (units of work per week), time per unit (minutes from start to done), error rate (percent rejected, returned, or escalated), cost per unit (loaded labor cost ÷ units per period), and customer-facing SLAs hit (percentage). Document the methodology — exactly how each metric is measured, exactly what counts as a unit, exactly what counts as an error. The baseline document is the artifact you will refer back to in week thirteen, so write it as if you are handing it to a finance auditor.
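To make that concrete, here is a minimal sketch of a baseline computation, assuming the historical records can be exported to a flat file; the file name, column names, and loaded hourly cost are illustrative placeholders, not a prescribed schema.

```python
# Baseline metrics from ~30 days of historical workflow data.
# File name, column names, and the loaded hourly cost are illustrative.
import pandas as pd

LOADED_HOURLY_COST = 65.00  # assumed loaded labor cost per hour

df = pd.read_csv("workflow_history.csv", parse_dates=["started_at", "completed_at"])
df["minutes_per_unit"] = (df["completed_at"] - df["started_at"]).dt.total_seconds() / 60
weeks_observed = df["completed_at"].dt.to_period("W").nunique()

baseline = {
    "throughput_units_per_week": len(df) / weeks_observed,
    "median_minutes_per_unit": df["minutes_per_unit"].median(),
    "error_rate": df["was_rejected_or_escalated"].mean(),   # boolean column
    "cost_per_unit": df["minutes_per_unit"].mean() / 60 * LOADED_HOURLY_COST,
    "sla_hit_rate": df["met_sla"].mean(),                   # boolean column
}
print(baseline)
```

Whatever form this takes, the point is that each baseline number is computed from records, not recalled, and the computation itself is part of the methodology document.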
Days 15–28: Pilot Setup
Define the cohorts. The simplest design is a parallel cohort: half the new work gets routed to the agent, half stays with the human team. Randomize at the unit level — by ticket, by lead, by call — not by user, to avoid selection bias. Define success metrics that match the baseline metrics exactly. Define the stop conditions: at what error rate or what customer-facing SLA miss do you pause and investigate? Lock the agent's prompt, tools, and version for the pilot period. Stand up the trace and metrics dashboards that the operations team will watch every day.
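One lightweight way to randomize at the unit level is to hash each unit's ID into a bucket at intake; the sketch below assumes a string ticket ID and a 50/50 split, both stand-ins for whatever your routing layer actually uses.

```python
# Deterministic unit-level cohort assignment by hashing the unit's ID.
# The function name, ID format, and 50/50 split are illustrative.
import hashlib

def assign_cohort(unit_id: str, agent_share: float = 0.5) -> str:
    """Assign a ticket/lead/call to a cohort based on a hash of its ID.

    Hashing the unit ID keeps the assignment stable across retries and
    randomizes at the unit level rather than the user level.
    """
    digest = hashlib.sha256(unit_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "agent" if bucket < agent_share else "human"

print(assign_cohort("TICKET-10432"))  # e.g. "agent" or "human"
```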
Days 29–56: Pilot Run
Run the pilot for at least four weeks. Two weeks is too short — the variance in throughput week-to-week swamps the signal. Six weeks is better if you can afford it. Watch the dashboards. Tag every escalation, error, or anomaly. Hold a weekly review with the operations team — not to change the agent, which would invalidate the measurement, but to log every observation that will inform the rollout. Resist the urge to "improve" the agent mid-pilot; if you change it, you are running a new pilot.
Days 57–70: Analysis
Compute the differences. For each metric, calculate the agent cohort's mean and the human cohort's mean. Run a basic statistical test — for binary metrics like error rate, a chi-squared test is fine; for continuous metrics like time-per-unit, a t-test is fine; for skewed distributions, use a non-parametric test. Report the effect size (the absolute difference and the percentage difference) and the p-value. Compute the dollar impact: (cost-per-unit savings) × (expected annual volume routed to the agent), then subtract the agent's runtime cost and amortized build cost. The result is your annualized net savings, with statistical confidence intervals attached.
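The sketch below shows what that analysis could look like in Python with SciPy; the counts, samples, volumes, and costs are placeholders standing in for the pilot's measured values.

```python
# Pilot analysis sketch: significance tests, effect size, and annualized net
# savings. Every number here is a placeholder for the pilot's measured values.
import numpy as np
from scipy import stats

# Error rate (binary metric): chi-squared test on a 2x2 contingency table.
# Rows = cohort (agent, human); columns = (errors, successes).
contingency = np.array([[42, 958],
                        [61, 939]])
chi2, p_error, _, _ = stats.chi2_contingency(contingency)

# Time per unit (continuous metric): Welch's t-test; swap in Mann-Whitney U
# (stats.mannwhitneyu) if the distribution is heavily skewed.
rng = np.random.default_rng(0)
agent_minutes = rng.lognormal(mean=2.5, sigma=0.4, size=1000)  # placeholder samples
human_minutes = rng.lognormal(mean=2.9, sigma=0.4, size=1000)
t_stat, p_time = stats.ttest_ind(agent_minutes, human_minutes, equal_var=False)

effect_abs = human_minutes.mean() - agent_minutes.mean()
effect_pct = effect_abs / human_minutes.mean()

# Dollar impact: per-unit savings times expected annual volume, minus the
# agent's runtime cost and amortized build cost.
cost_per_unit_savings = 3.10       # dollars saved per unit, from the pilot
expected_annual_volume = 120_000   # units per year expected to route to the agent
agent_runtime_cost = 45_000        # annual inference / tooling cost, dollars
amortized_build_cost = 60_000      # build cost spread over its useful life, dollars

annualized_net_savings = (cost_per_unit_savings * expected_annual_volume
                          - agent_runtime_cost - amortized_build_cost)

print(f"error rate: p={p_error:.3f} | time per unit: p={p_time:.3f}, "
      f"effect={effect_abs:.1f} min ({effect_pct:.0%})")
print(f"annualized net savings: ${annualized_net_savings:,.0f}")
```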
Days 71–84: Rollout
If the pilot meets the bar, ramp the agent's traffic share to 50%, then 80%, then 100% across two weeks. At each step, check the dashboards against the pilot baseline. If the metrics drift more than your tolerance — say, error rate climbs more than 20% from the pilot value — pause and investigate. Common reasons for drift: the broader case mix is harder than the pilot's randomized sample (because the team had been triaging), or volume is high enough to hit a tool's rate limit, or a model upgrade landed during the rollout. The rollout dashboards must be in place before the ramp; standing them up under pressure is where outages live.
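A drift check can be as simple as comparing each day's dashboard values against the pilot's numbers; the sketch below assumes a 20% relative tolerance and placeholder pilot values.

```python
# Rollout drift check: compare today's dashboard values against the pilot's.
# Pilot values and the 20% relative tolerance below are placeholders.
PILOT_ERROR_RATE = 0.042        # errors per unit from the pilot
PILOT_MEDIAN_MINUTES = 11.8     # median handle time from the pilot

def check_drift(error_rate: float, median_minutes: float,
                tolerance: float = 0.20) -> list[str]:
    """Return drift alerts; an empty list means the ramp can continue."""
    alerts = []
    if error_rate > PILOT_ERROR_RATE * (1 + tolerance):
        alerts.append(f"error rate {error_rate:.1%} is more than "
                      f"{tolerance:.0%} above the pilot value {PILOT_ERROR_RATE:.1%}")
    if median_minutes > PILOT_MEDIAN_MINUTES * (1 + tolerance):
        alerts.append(f"median handle time {median_minutes:.1f} min is more than "
                      f"{tolerance:.0%} above the pilot value {PILOT_MEDIAN_MINUTES:.1f} min")
    return alerts

for alert in check_drift(error_rate=0.055, median_minutes=12.3):
    print("PAUSE RAMP:", alert)
```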
Days 85–90: Report
The report has six sections. Executive summary: one number — annualized net savings, with confidence interval — and the answer to "should we do this for the next workflow?" Baseline: the document from days 1–14. Pilot design: cohort design, metrics, stop conditions. Pilot results: the differences, the statistics, the dollar impact. Rollout: the ramp, the metrics during ramp, any incidents and how they resolved. Forward plan: the next workflow to attack, the next savings target in dollars, and the operating cadence to maintain.
Stakeholders and cadence
Three roles, three cadences. Operations lead owns the dashboards and a daily standup with the agent's eval and ops metrics. Finance partner is briefed weekly with the running savings number and a flag on any methodological concern. Executive sponsor is briefed at days 14, 56, 84, and 90 — the boundary of each phase. If any of these is missing or unengaged, the project has a measurement risk that no amount of engineering will fix. The 90-day playbook is as much about stakeholder management as it is about agents.
Mistakes to avoid
Cherry-picking the pilot's task mix to make the agent look good. Not capturing baseline error rates, only baseline throughput. Letting the team adjust the agent during the pilot. Reporting "time saved" without a per-unit baseline. Counting team headcount changes as savings when those changes were planned anyway. Ignoring runtime cost. Skipping the statistical test because "the numbers look obvious."
What good looks like
A 90-day playbook executed cleanly produces a single number — annualized net savings with a confidence interval — that survives a finance audit and survives a CFO who has been burned by AI ROI claims before. Teams that operate this way compound: the second workflow they automate is faster because the measurement infrastructure is reusable; the third is faster still. By the fourth or fifth workflow, the team has a repeatable agentic-automation muscle that is itself a strategic advantage.