How to think about testing AI systems

Software testing is a mature discipline. The unit-test-to-integration-test-to-end-to-end-test pyramid, the red-green-refactor cycle, the practice of writing tests before code: these patterns have been refined over decades and produce reliable results for deterministic software. Engineers who have internalized them are appropriately skeptical of code that lacks test coverage.

AI systems break this discipline in specific ways, and the engineers who try to apply traditional testing patterns to AI without modification discover the breakage through experience. The unit test that passes 100% of the time is not a reliable quality signal for AI outputs. The integration test that verifies the system produces a response does not verify that the response is useful. The end-to-end test that checks a happy path gives false confidence about a system whose failure modes are mostly in the long tail.

Understanding where traditional testing breaks down, and what to do instead, is a prerequisite for building AI systems that are actually reliable in production.

Where traditional testing breaks down

Traditional testing assumes determinism. Given the same inputs, a function produces the same output. A test that verifies the output for a given input will continue to verify it correctly as long as the function is not changed. This assumption is so fundamental that testing frameworks are built on it.

AI system outputs are not deterministic in this sense. A language model given the same prompt may produce different outputs on different calls. A retrieval system may return results in different orders depending on indexing state. The output that is correct on one run may not be produced on the next run. Tests that verify specific outputs fail non-deterministically, creating a false-negative problem: the tests are failing but the system is not broken.

The response many teams reach for is to make tests non-assertive: check that the system produces an output, not what the output is. This solves the false-negative problem but creates a false-positive problem: a test that only verifies the system produced some output will pass when the system produces a wrong, harmful, or nonsensical output. You have test coverage with no quality signal.

The deeper issue is that AI system quality is not a binary that a test can verify. It is a distribution over many possible inputs. A system that produces excellent outputs 95% of the time and terrible outputs 5% of the time may pass a test suite that covers a hundred happy-path cases, while in production the 5% case is what users encounter and complain about. The test suite is measuring the wrong thing.

What useful AI testing looks like

Useful AI testing works at three levels that map loosely onto the traditional pyramid but with different mechanics at each level.

Structural tests at the component level. The parts of an AI system that are deterministic should be tested with deterministic tests. Prompt templates that insert variables into fixed text can be tested to verify the insertion works correctly, the template produces valid syntax, and required fields are populated. Schema validation for structured outputs can be tested to verify the schema is correctly defined and the validator behaves as expected on known-good and known-bad inputs. Retrieval systems can be tested with queries where the correct result is known, to verify the retrieval logic is working.

These tests do not verify AI output quality, but they verify that the scaffolding around the AI is functioning correctly. Scaffolding bugs (broken prompt templates, misconfigured parsers, retrieval that returns empty results silently) are deterministic and testable, and they are a significant source of AI system failures in practice.

Evaluation sets for output quality. Output quality is not a binary but it is not unmeasurable. Evaluation sets, collections of inputs paired with explicit quality criteria, provide a way to measure output quality in a way that is more rigorous than eyeballing but more practical than comprehensive unit tests.

A useful evaluation set has three components: a set of inputs that covers the important cases, including edge cases and known-hard cases; a set of quality criteria that can be checked against outputs (correctness, format, tone, absence of specific failure modes); and a way to run the criteria against outputs and aggregate the results. For some quality criteria, automated checking is possible: schema validation, presence or absence of specific strings, response length within bounds. For others, human review of a sample is the checking mechanism.

The key practice is running evaluations on every significant change and tracking the results over time. A system that maintains 92% pass rate on the evaluation set across multiple model updates is behaving reliably; a system that drops to 76% after a model update has a detectable quality regression that can be investigated before the change reaches production.

Behavioral tests for specific failure modes. Known failure modes deserve explicit tests that specifically probe for them. If the system is supposed to never output certain content types, have tests that prompt it in ways likely to produce those content types and verify it does not. If the system is supposed to handle malformed inputs gracefully, have tests with malformed inputs. If the system is supposed to stay within a topic domain, have out-of-domain tests.

Behavioral tests of this kind are more informative than happy-path tests because they probe the failure modes rather than the normal case. A system that passes only happy-path tests might fail in obvious ways on common edge cases. A system that passes behavioral tests for its known failure modes is providing actual evidence about the robustness of the system.

Evaluation design is the hard problem

The limiting factor in AI testing is not running evaluations; it is designing evaluations that measure what actually matters.

The instinct when building an evaluation set is to use inputs that the system handles well: cases that are representative of the intended use, phrased clearly, without edge cases. This produces an evaluation set that the system passes easily and that does not surface quality problems. The evaluation is measuring the system’s best case, not its typical or worst case.

A better evaluation set is designed adversarially. It includes:

Distribution-covering inputs. Not just typical inputs but inputs from across the distribution the system will encounter in production, including rare but important cases. If the system is a customer support assistant, the evaluation should include angry customers, unusual edge cases, requests for information the system does not have, and off-topic requests, not just clear, well-formed questions about common topics.

Known-failure probes. Inputs specifically designed to trigger known failure modes. If language models tend to confabulate when asked about recent events, the evaluation should include questions about recent events and check for confabulation. If the system has a known tendency to be overly verbose, the evaluation should include cases where verbosity is a problem and check for it.

Regression cases. Inputs that previously caused quality problems, added to the evaluation set when the problem is identified. This prevents the same failure from recurring without detection.

Designing this kind of evaluation set requires understanding how the system fails, which requires having seen it fail. New AI systems benefit from an adversarial exploration phase before formal evaluation design: deliberately trying to break the system, cataloging the failure modes observed, and using those failure modes to guide evaluation design.

Monitoring as the extension of testing

Even well-designed evaluation sets have gaps. They cover the cases that were thought of when the evaluation was designed; production reveals cases that were not thought of. Testing for a deployed AI system continues in production through monitoring.

The monitoring that extends testing tracks the signals that indicate quality problems: user correction rates, escalation rates, explicit negative feedback, output length anomalies, latency outliers that often correlate with model confusion. These signals do not identify the specific failure mode, but they identify that something has changed in output quality, prompting investigation and evaluation set expansion.

The organizations that build AI systems with durable quality over time treat testing and monitoring as a continuous loop: test before deployment, monitor after deployment, expand the test set based on what monitoring surfaces, and use the expanded test set to prevent regression on the newly discovered failure modes.

This loop is not cheap to run. It requires investing in evaluation infrastructure, maintaining evaluation sets as the system and its requirements evolve, and staffing the monitoring and response function. The alternative is discovering that the system is producing poor outputs through user complaints and production incidents, which is more expensive still.

AI systems that are reliable in production are not reliable because they were built correctly once. They are reliable because they have been tested, monitored, and corrected continuously. The testing discipline for AI is less about getting coverage on a function and more about maintaining a living model of where the system fails and ensuring those failures are caught before users encounter them.

How to think about testing AI systems

Where traditional testing breaks down

What useful AI testing looks like

Evaluation design is the hard problem

Monitoring as the extension of testing

More from Zylver

What your board needs to know about AI

How AI is changing customer service

How to scale AI adoption from one team to the whole organization