The AI reporting problem
Executives want to know how AI investments are performing. Most organizations cannot tell them. The metrics being tracked measure activity, not value, and the reporting structures that work for traditional software do not transfer to AI. Here is what better AI reporting looks like.
By Ramiro Enriquez
A common pattern in organizations that have invested in AI: the executive team wants a status report on AI performance. The team responsible for AI delivery produces a report. The report shows adoption rates, usage volumes, number of features shipped, model accuracy metrics, and uptime statistics. The executives read it and still do not know whether the AI investments are producing business value.
This is the AI reporting problem. The metrics being tracked are real and not meaningless, but they answer the wrong question. Executives asking about AI performance are asking: are we getting a return on this investment? Are we better at [specific business activity] because of AI? What would we have done differently if we had known what we know now? The activity metrics that fill most AI reports do not answer these questions.
The reporting gap is not malicious. The teams producing these reports are tracking what is measurable. Usage is measurable. Uptime is measurable. Accuracy on a benchmark dataset is measurable. The business outcomes that AI is supposed to produce are harder to attribute, slower to manifest, and require a different measurement approach than most organizations have built.
Why activity metrics are not outcome metrics
The distinction between activity metrics and outcome metrics matters more for AI than for most other technology investments.
For traditional software, there is often a plausible proxy chain from activity to outcome. More users using a CRM system means more sales activity being tracked, which means better sales management, which means more revenue. The chain is imperfect, but it is defensible enough to justify tracking usage as a leading indicator of business value.
For AI, the proxy chain breaks down in a specific way. AI usage does not automatically produce value. An AI tool that is used heavily but produces low-quality outputs, requires substantial human review and correction, or is used for low-stakes tasks that did not need AI produces activity without value. An AI tool that is used occasionally but consistently reduces the time to complete a high-value task or improves the quality of a critical decision produces value without impressive activity metrics.
The difference between these scenarios is not detectable in usage data. Distinguishing them requires measuring what happened to the business activity the AI was supposed to improve, not just whether the AI was used.
What outcome metrics for AI look like
Outcome metrics for AI answer: is the business activity the AI was supposed to improve actually improving?
This requires starting with a specific business activity, not with the AI system. The question is not “how is our AI performing” but “how is our [customer support resolution time / sales cycle length / contract review throughput / code deployment frequency / whatever the AI was applied to] performing, compared to before we had AI and compared to similar activities where AI is not involved.”
Several elements make this measurement tractable.
A specific, measurable activity. AI outcome measurement works best when the business activity the AI is supposed to improve is already measured or can be measured. Customer support resolution time, time-to-first-response, and first-contact resolution rate are measurable. “Improved customer experience” is not, until it is operationalized into specific metrics like these.
A baseline from before AI deployment. Measuring the impact of AI requires knowing what performance looked like before AI. Organizations that deploy AI without documenting current performance have no basis for measuring improvement. The baseline measurement is often the single most important investment in AI outcome measurement, and it needs to happen before deployment.
A comparison group. Where possible, having a group of similar activities conducted without AI allows for controlled comparison. Not all AI applications support this, but when it is possible, the comparison group is the most reliable way to isolate the AI’s contribution from other changes happening simultaneously.
Lagged measurement. AI impact on business outcomes often takes time to manifest. A coding assistant might improve developer productivity, but the improvement in delivery throughput shows up over months, not days. Measurement cadences that look for immediate results often miss real impact that develops more slowly.
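The baseline and lagged-measurement elements above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the names `ActivityBaseline` and `change_vs_baseline` are hypothetical, and the resolution-time figures are invented purely to show the shape of the calculation.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ActivityBaseline:
    """Pre-deployment measurements of one business activity (hypothetical structure)."""
    activity: str              # e.g. "customer support"
    metric: str                # e.g. "resolution time (hours)"
    observations: list[float]  # values recorded BEFORE the AI was deployed

    @property
    def value(self) -> float:
        # The baseline is the average of the pre-deployment observations.
        return mean(self.observations)


def change_vs_baseline(baseline: ActivityBaseline,
                       values_by_period: dict[str, float]) -> dict[str, float]:
    """Percent change from the baseline for each later measurement period."""
    return {
        period: (value - baseline.value) / baseline.value * 100.0
        for period, value in values_by_period.items()
    }


# Illustrative numbers only, not real data.
baseline = ActivityBaseline(
    activity="customer support",
    metric="resolution time (hours)",
    observations=[10.0, 12.0, 8.0],
)
after_deployment = {"month 1": 10.0, "month 3": 9.0, "month 6": 8.0}

for period, change in change_vs_baseline(baseline, after_deployment).items():
    print(f"{period}: {change:+.1f}% vs baseline")
```

In this invented series the first month shows no change and the improvement only appears by months three and six, which is the pattern a short measurement window would miss.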
The attribution problem
The hardest part of AI outcome measurement is attribution: when a business metric improves after AI deployment, how much of the improvement is due to AI versus other factors?
This is not a problem unique to AI. It is the general challenge of measuring the impact of any intervention in a complex system where many things are changing simultaneously. The approaches that work for other interventions also work for AI.
Controlled experiments where possible. If it is practical to run the AI for some users or use cases and not others, the comparison directly attributes the difference to AI. This is the most reliable method and should be used when organizational and technical constraints allow it.
Difference-in-differences analysis. When controlled experiments are not possible, comparing the trend in the treated metric before and after AI deployment to the trend in a comparable untreated metric can isolate the AI’s contribution from general trends. If customer support resolution time improved 15% for the team using AI and 3% for a comparable team without AI over the same period, the 12-percentage-point gap is the estimated contribution of AI, provided the two teams would otherwise have followed similar trends.
Leading and lagging indicator chains. When the ultimate outcome metric is too distant from AI deployment to measure directly, document the assumed causal chain and measure each link. If the theory is that AI improves code quality, which reduces bug rates, which reduces support volume, which improves customer satisfaction: measure code quality, bug rates, and support volume, not just customer satisfaction. The chain creates visibility into where the AI’s impact is and is not materializing.
Honest uncertainty quantification. Attribution is rarely definitive outside of controlled experiments. Outcome reports that acknowledge uncertainty are more useful than reports that overclaim. “We estimate the AI contributed X% of the improvement with a range of Y% to Z%, based on the following reasoning” is more honest and more actionable than “AI improved [metric] by X%.”
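The difference-in-differences arithmetic described above can be written out as a short Python sketch. The function names are hypothetical, and the hours used here are invented so that they reproduce the 15% and 3% improvements from the example. The estimate is only as good as the assumption that both teams would have followed the same trend without AI.

```python
def percent_change(before: float, after: float) -> float:
    """Percent change from `before` to `after` (negative means the metric fell)."""
    return (after - before) / before * 100.0


def diff_in_diff(treated_before: float, treated_after: float,
                 control_before: float, control_after: float) -> float:
    """Change in the treated group minus change in the comparison group,
    in percentage points. Assumes both groups would otherwise have
    followed the same trend."""
    treated_change = percent_change(treated_before, treated_after)
    control_change = percent_change(control_before, control_after)
    return treated_change - control_change


# Illustrative resolution times in hours (lower is better), not real data.
# Team using AI:      20.0 -> 17.0  (15% faster)
# Comparison team:    20.0 -> 19.4  (3% faster)
estimate = diff_in_diff(20.0, 17.0, 20.0, 19.4)
print(f"Estimated AI effect on resolution time: {estimate:.1f} percentage points")
```

The result is a point estimate. Reporting it alongside a range and the reasoning behind the comparison group keeps the claim in line with the uncertainty discussed above.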
What good AI reporting looks like
An AI report that answers the executive question has a different structure than one that reports activity.
It leads with business outcomes. For each AI initiative, what business activity was targeted? What was the baseline? What is the current performance? What is the estimated contribution of AI versus other factors?
It distinguishes between initiatives at different maturity stages. A new AI deployment where outcome data is not yet available is in a different category from a mature deployment where outcome data is clear. Reporting that treats all initiatives the same obscures the portfolio’s actual state.
It includes the initiatives where AI did not produce the expected outcome. Selective reporting that highlights successes without acknowledging failures makes the overall portfolio look better than it is and prevents learning. The initiatives that underdelivered contain information about what to do differently.
It connects to forward-looking decisions. The report should end with: given these outcomes, what are the investment recommendations? Which initiatives should be expanded? Which should be restructured? Which should be discontinued? AI reporting that does not inform decisions is just documentation.
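One way to make this structure concrete is to give every initiative the same record in the report. The Python sketch below is an assumption about how such a record might look, not a standard schema: the class names, field names, and sample initiatives are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Maturity(Enum):
    EARLY = "early"    # deployed, outcome data not yet available
    MATURE = "mature"  # outcome data available


class Recommendation(Enum):
    EXPAND = "expand"
    RESTRUCTURE = "restructure"
    DISCONTINUE = "discontinue"
    KEEP_MEASURING = "keep measuring"


@dataclass
class InitiativeReport:
    initiative: str
    target_activity: str          # the business activity the AI was meant to improve
    metric: str
    maturity: Maturity
    baseline: Optional[float]     # value before deployment
    current: Optional[float]      # latest value
    ai_contribution_range: Optional[tuple[float, float]]  # (low, high), percentage points
    recommendation: Recommendation


def summary_line(r: InitiativeReport) -> str:
    """One report line that leads with the business outcome, not with usage."""
    if r.maturity is Maturity.EARLY or r.baseline is None or r.current is None:
        outcome = "outcome data not yet available"
    else:
        outcome = f"{r.metric}: {r.baseline} -> {r.current}"
        if r.ai_contribution_range is not None:
            low, high = r.ai_contribution_range
            outcome += f" (estimated AI contribution {low} to {high} points)"
    return (f"{r.initiative} [{r.target_activity}]: {outcome}; "
            f"recommendation: {r.recommendation.value}")


# Hypothetical portfolio, including an initiative that underdelivered.
portfolio = [
    InitiativeReport("Support assistant", "customer support", "resolution time (hours)",
                     Maturity.MATURE, 20.0, 17.0, (8.0, 14.0), Recommendation.EXPAND),
    InitiativeReport("Contract triage", "contract review", "contracts reviewed per week",
                     Maturity.MATURE, 40.0, 40.0, (0.0, 0.0), Recommendation.RESTRUCTURE),
    InitiativeReport("Coding assistant", "software delivery", "deployments per month",
                     Maturity.EARLY, 12.0, None, None, Recommendation.KEEP_MEASURING),
]

for item in portfolio:
    print(summary_line(item))
```

Keeping early-stage and underperforming initiatives in the same list as the successes is the point of the design: the report shows the whole portfolio and ends each line with a decision.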
Building the reporting infrastructure
Most organizations find that building good AI outcome reporting requires infrastructure work that was not prioritized alongside the AI work itself.
Baseline measurement systems. The business activities that AI is applied to need to be measured before and during AI deployment. In many cases, this measurement infrastructure does not exist and needs to be built. It is not glamorous work and it often takes longer than anticipated, but it is the prerequisite for outcome measurement.
Attribution methodology documentation. The team needs to agree on how AI impact will be attributed before data is available. Choosing a methodology once the answer is already known invites motivated reasoning. Agreeing on it in advance is what makes subsequent reporting credible.
Regular review cadences. AI outcome metrics do not change as frequently as operational metrics. Monthly or quarterly reviews that connect AI investment to business outcome changes are more appropriate than weekly dashboards. The cadence should match the rate at which the underlying business activities change.
The investment in this infrastructure is not optional if the organization wants to make rational decisions about its AI portfolio. Without it, AI investment decisions are made on the basis of enthusiasm, activity metrics, and anecdotes. With it, they are made on the basis of evidence about what is actually producing value and what is not. That difference in decision quality compounds over time in ways that determine whether an organization’s AI investment produces durable capability or a legacy of underperforming systems.