The AI vendor due diligence checklist

Traditional software procurement has well-developed evaluation criteria: does it integrate with our stack, can it handle our data volume, what does the SLA look like, how is pricing structured. These questions matter for AI vendors too. They are also insufficient, because AI systems have failure modes and vendor risks that traditional software does not.

Organizations that evaluate AI vendors with the same criteria they apply to traditional software tend to discover the gaps after they have committed. The contract is signed, the integration is built, and the production system is behaving in ways the evaluation process did not predict. The following categories of questions are the ones that traditional procurement frameworks miss.

Production reliability versus demo reliability

The most common evaluation mistake is testing a vendor’s system in conditions that do not represent production. A demo or pilot typically involves well-formed, representative inputs, a small volume, and close attention from vendor personnel who want the evaluation to go well. Production involves edge cases, unusual inputs, high volume, and no vendor attention.

The questions that reveal production reliability are about failure behavior, not success behavior. What happens when the input is malformed? What does the system return when it has low confidence in its output? How does the system behave when downstream dependencies are slow or unavailable? What is the failure rate on inputs that fall outside the core use case?
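One way to put these questions to a vendor during a pilot is a small probe harness that sends deliberately awkward inputs and records how the system responds. The sketch below is illustrative only: `fake_vendor` is a hypothetical stand-in for a real client, and the `confidence` field, the `ValueError` convention for rejected input, and the 0.5 threshold are all assumptions rather than any vendor's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ProbeResult:
    name: str
    outcome: str  # "ok", "low_confidence", "rejected", or "crashed"
    detail: str


def fake_vendor(payload: str) -> dict:
    """Hypothetical stand-in for a vendor client; swap in the real SDK call."""
    if not payload.strip():
        raise ValueError("empty input")
    if len(payload) > 1000:
        raise RuntimeError("internal error")
    confidence = 0.35 if "??" in payload else 0.92
    return {"label": "invoice", "confidence": confidence}


def probe_failure_behavior(
    call_vendor: Callable[[str], dict],
    cases: Dict[str, str],
    min_confidence: float = 0.5,
) -> List[ProbeResult]:
    """Send each probe input and classify how the system responded."""
    results = []
    for name, payload in cases.items():
        try:
            response = call_vendor(payload)
        except ValueError as exc:
            # Assumed convention: a clean validation error is a graceful rejection.
            results.append(ProbeResult(name, "rejected", str(exc)))
        except Exception as exc:
            # Anything else is treated as an unhandled failure.
            results.append(ProbeResult(name, "crashed", f"{type(exc).__name__}: {exc}"))
        else:
            confident = response.get("confidence", 1.0) >= min_confidence
            outcome = "ok" if confident else "low_confidence"
            results.append(ProbeResult(name, outcome, repr(response)))
    return results


CASES = {
    "empty": "",
    "oversized": "x" * 5000,
    "ambiguous": "total ?? due",
    "typical": "Invoice 123, total due 40 EUR",
}

for result in probe_failure_behavior(fake_vendor, CASES):
    print(f"{result.name:10} {result.outcome}")
```

The useful output is the distribution of outcomes across probe categories, not a single pass rate: a system that rejects bad input cleanly is behaving very differently from one that crashes or answers with low confidence and no warning.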

Ask for production incident data from comparable deployments, not sanitized case studies. Ask what the observed error rate is in production for inputs similar to yours. Ask how the system behaves under load: does latency increase linearly, or does it degrade suddenly above a threshold?
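If the vendor shares load-test figures, or you collect your own, the linear-versus-sudden question can be checked mechanically. The following is a rough sketch under stated assumptions: measurements are (requests per second, p95 latency in milliseconds) pairs, and a step is flagged when latency grows more than 1.5 times faster than load. Both the data shape and the factor are illustrative choices, and the sample numbers are made up.

```python
from typing import List, Optional, Tuple


def find_degradation_threshold(
    measurements: List[Tuple[float, float]],
    superlinear_factor: float = 1.5,
) -> Optional[float]:
    """Return the first load level where latency jumps faster than load grew.

    Each measurement is (requests_per_second, p95_latency_ms).
    Returns None when latency scales no worse than roughly linearly.
    """
    ordered = sorted(measurements)
    for (prev_load, prev_latency), (load, latency) in zip(ordered, ordered[1:]):
        if prev_load <= 0 or prev_latency <= 0:
            continue  # cannot compute a growth ratio from a zero baseline
        load_growth = load / prev_load
        latency_growth = latency / prev_latency
        if latency_growth > superlinear_factor * load_growth:
            return load
    return None


# Made-up sample: latency is flat until load reaches 160 requests per second.
SAMPLE = [(10, 120.0), (20, 130.0), (40, 150.0), (80, 190.0), (160, 900.0)]
print(find_degradation_threshold(SAMPLE))
```

A threshold that sits close to your expected peak volume is worth raising with the vendor before the contract is signed.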

Vendors who cannot answer these questions with data are either not running comparable production deployments or not monitoring them well enough to know. Either possibility is relevant information.

Data handling and training boundaries

The data handling questions for AI vendors are more complex than for traditional SaaS because of the possibility that your data is used to train or improve the model. The answers to these questions have both privacy and competitive implications.

What data is retained after an API call completes? For how long, under what conditions, and with access by whom? Is retained data used for model training, fine-tuning, or evaluation? If so, can you opt out, and what is the actual scope of that opt-out?

The answers to these questions are often buried in terms of service rather than surfaced in sales conversations. Read the data processing agreements, not just the pitch deck. Specifically look for clauses about model improvement, anonymization (and what anonymization actually means in their definition), and what happens to your data after contract termination.

For sensitive industries, the follow-up questions are about data residency. Where are models running? Where is inference data processed? Where is it stored if retained? Some vendors offer regional deployments or private deployments that address these concerns; others do not.

Model versioning and silent behavior changes

This is the due diligence category that most organizations miss entirely, and it is the one most likely to cause production problems.

AI models are not software in the traditional sense: a model update can change behavior on every input, not just on inputs that touch the changed code. A vendor that updates a model is effectively updating behavior across your entire production workload, often without notice and sometimes without a versioning scheme that lets you roll back.

Ask whether model versions are fixed or whether the model behind a given API endpoint updates automatically. Ask how version transitions are communicated. Ask whether previous model versions remain accessible after updates, and for how long. Ask whether the vendor’s SLA covers behavioral consistency or only uptime.

The vendors who take this seriously offer pinned model versions, change logs that describe behavioral differences between versions, and a deprecation timeline for older versions that gives customers time to evaluate and migrate. The vendors who do not take it seriously describe model updates as improvements and expect customers to be grateful rather than notified.

If you are building a production system where consistency matters, pinned model versions are not optional. If the vendor does not offer them, that is a product risk that needs to be weighed explicitly, not assumed away.
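Whether or not versions are pinned, a small regression check can show when behavior has changed underneath you. The sketch below assumes you have stored baseline outputs for a fixed set of prompts and that outputs can be compared by exact match, which only suits deterministic or classification-style responses. `model_v1` and `model_v2` are hypothetical stand-ins for the same endpoint before and after an update.

```python
from typing import Callable, Dict, List, Tuple


def behavior_drift(
    baseline: Dict[str, str],
    call_model: Callable[[str], str],
) -> Tuple[float, List[str]]:
    """Compare current outputs to stored baseline outputs.

    Returns the fraction of prompts whose output changed, plus those prompts.
    """
    changed = [
        prompt for prompt, expected in baseline.items()
        if call_model(prompt) != expected
    ]
    rate = len(changed) / len(baseline) if baseline else 0.0
    return rate, changed


def model_v1(prompt: str) -> str:
    """Hypothetical earlier model version."""
    return "refund" if "money back" in prompt else "other"


def model_v2(prompt: str) -> str:
    """Hypothetical updated version that classifies one case differently."""
    return "refund" if "money back" in prompt or "return" in prompt else "other"


PROMPTS = ["I want my money back", "How do I return this?", "Where is my order?"]
BASELINE = {prompt: model_v1(prompt) for prompt in PROMPTS}

rate, changed = behavior_drift(BASELINE, model_v2)
print(f"{rate:.0%} of baseline prompts changed: {changed}")
```

Run on a schedule, a check like this turns a silent update into a visible event, even when the vendor sends no notice.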

Observability and debugging

When something goes wrong in a traditional software system, you typically have access to logs, traces, and error messages that let you reconstruct what happened. When something goes wrong in an AI system, the vendor controls much of the observability.

Ask what logging is available for API calls. Do you get the inputs, outputs, and latency for every call? Are tokens logged so you can reconstruct exactly what the model received? Is there a way to retrieve historical call data for debugging, or does the logging only surface in aggregate metrics?

Ask whether there is a prompt inspection or debugging interface that lets you understand why the model produced a particular output. For RAG systems, ask whether you can see what documents were retrieved for a given query and how they were scored.

The observability gap matters because debugging AI systems in production requires understanding specific inputs and outputs. Without that, when the system produces a bad output you can observe that it happened but not why, which makes systematic improvement difficult.
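Where the vendor's logging falls short, part of the gap can be closed on your side of the API boundary. The following is a minimal sketch of a wrapper that records input, output, error, and latency for every call as one JSON line. The record fields and the `call_model` callable are assumptions for illustration; a real deployment would also need to consider redaction and retention for the logged data.

```python
import io
import json
import time
import uuid
from typing import Callable, TextIO


def logged_call(call_model: Callable[[str], str], prompt: str, sink: TextIO) -> str:
    """Call the model and append one JSON record per call to `sink`."""
    record = {"call_id": str(uuid.uuid4()), "input": prompt, "output": None, "error": None}
    start = time.perf_counter()
    try:
        output = call_model(prompt)
        record["output"] = output
        return output
    except Exception as exc:
        # Record the failure, then re-raise so callers still see it.
        record["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        # Runs on both success and failure, so every call leaves a record.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 3)
        sink.write(json.dumps(record) + "\n")


# Usage with an in-memory sink; str.upper stands in for a real model call.
sink = io.StringIO()
logged_call(str.upper, "hello", sink)
record = json.loads(sink.getvalue())
print(record["input"], record["output"], record["error"])
```

This captures what you sent and what you received, but not what happened inside the vendor's system, so it complements the questions above rather than replacing them.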

Evaluation and quality measurement support

A vendor whose system you cannot evaluate rigorously is a vendor you cannot hold accountable. Ask what the vendor provides to support systematic quality measurement.

This includes test sets representative of production inputs, benchmark data that reflects real-world performance rather than curated examples, and tooling for running evaluations against your own labeled data. It also includes documentation of known failure modes: what categories of inputs does the system handle poorly, and what inputs are explicitly out of scope?

Vendors who have done this work can tell you specifically what their system is bad at. Vendors who have not done it will give you benchmark scores and case studies but not a candid assessment of limitations. Knowing the limitations in advance lets you design mitigations into your integration; discovering them in production costs more.
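Even without vendor tooling, a per-category breakdown on your own labeled data will surface the weak spots that a single headline score hides. This sketch assumes each labeled example carries `input`, `expected`, and `category` fields and that correctness is an exact match. Those field names, the `toy_predict` stand-in, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def accuracy_by_category(
    examples: List[dict],
    predict: Callable[[str], str],
) -> Dict[str, float]:
    """Compute exact-match accuracy separately for each input category."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for example in examples:
        category = example["category"]
        totals[category] += 1
        if predict(example["input"]) == example["expected"]:
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}


def toy_predict(text: str) -> str:
    """Hypothetical model that only recognizes English refund requests."""
    return "refund" if "refund" in text.lower() else "other"


EXAMPLES = [
    {"input": "Please refund my order", "expected": "refund", "category": "english"},
    {"input": "Where is my parcel?", "expected": "other", "category": "english"},
    {"input": "Quiero un reembolso", "expected": "refund", "category": "spanish"},
    {"input": "¿Dónde está mi pedido?", "expected": "other", "category": "spanish"},
]

for category, accuracy in accuracy_by_category(EXAMPLES, toy_predict).items():
    print(f"{category}: {accuracy:.0%}")
```

In this made-up data the overall accuracy is 75 percent, which looks tolerable until the breakdown shows that one category is being handled only half the time.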

Also ask about support for ongoing monitoring. Can the vendor’s system emit quality signals that you can incorporate into your monitoring stack? Is there a feedback mechanism that lets production errors inform future evaluation?

Lock-in and portability

AI vendor lock-in operates differently from traditional SaaS lock-in. The integration cost is real: you have built prompts, evaluation sets, and monitoring infrastructure around a specific vendor’s behavior. Switching means rebuilding all of that, not just changing an API endpoint.

Assess the switching cost honestly before committing. If the vendor uses a proprietary fine-tuned model, your investment in prompt engineering for that model may not transfer. If the vendor’s RAG architecture uses proprietary chunking and retrieval, your indexed documents are not portable. If the vendor’s evaluation tooling is proprietary, your labeled test sets may not be usable elsewhere.

The mitigation is not avoiding vendors with proprietary components. It is building your integration with a clear-eyed view of what is portable and what is not, so that the switching cost is a known quantity rather than a surprise. Evaluation test sets built on open formats and labeled independently of the vendor’s tooling are portable. Prompt libraries that are heavily optimized for one model’s quirks are less portable.
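As an illustration of what an open format can look like, a test set can be kept as JSON Lines, one labeled example per line, using nothing beyond the standard library. The field names here are an assumption and not a standard; the point is only that any vendor's tooling, or your own, can read the file back.

```python
import io
import json
from typing import Iterable, List, TextIO


def dump_testset(examples: Iterable[dict], stream: TextIO) -> None:
    """Write one JSON object per line; keys are sorted for stable diffs."""
    for example in examples:
        stream.write(json.dumps(example, ensure_ascii=False, sort_keys=True) + "\n")


def load_testset(stream: TextIO) -> List[dict]:
    """Read examples back, skipping blank lines."""
    return [json.loads(line) for line in stream if line.strip()]


TESTSET = [
    {"id": "t1", "input": "Cancel my subscription", "expected": "cancel"},
    {"id": "t2", "input": "¿Puedo cambiar mi plan?", "expected": "change_plan"},
]

# Round trip through an in-memory buffer; a real test set would live in a file.
buffer = io.StringIO()
dump_testset(TESTSET, buffer)
buffer.seek(0)
print(load_testset(buffer) == TESTSET)
```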

For high-stakes or long-term integrations, ask the vendor directly how customers have migrated away in the past and what the process looked like. Vendors who have supported migrations and can describe the process are the ones who take portability seriously; vendors who have not, or who deflect the question, are telling you something about their relationship with customer lock-in.

Putting it together

No vendor will score well on every dimension. The point of this evaluation is not to find a perfect vendor but to understand the tradeoffs explicitly before committing. A vendor with excellent production reliability but weak observability may be acceptable if the use case is low-stakes. A vendor with model versioning risks may be acceptable if you are in an early exploration phase rather than building a long-lived production system.

The due diligence that matters is matching the vendor’s actual profile against your actual requirements, with enough specificity that the gaps are known and managed rather than discovered in production.
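One lightweight way to make that matching explicit is to score each dimension for the vendor and for your own requirements, then list the shortfalls. The sketch below assumes a 0 to 5 scale and treats any dimension the vendor was not scored on as 0. The scale, the dimension names, and the example numbers are illustrative only.

```python
from typing import Dict


def find_gaps(requirements: Dict[str, int], vendor_scores: Dict[str, int]) -> Dict[str, int]:
    """Return each dimension where the vendor falls short, with the shortfall.

    Dimensions missing from vendor_scores are treated as unscored (0).
    """
    gaps = {}
    for dimension, required in requirements.items():
        actual = vendor_scores.get(dimension, 0)
        if actual < required:
            gaps[dimension] = required - actual
    return gaps


# Hypothetical requirements for a long-lived production system.
REQUIREMENTS = {"reliability": 4, "observability": 3, "versioning": 4, "portability": 2}
# Hypothetical vendor profile; versioning was never assessed.
VENDOR = {"reliability": 5, "observability": 2, "portability": 3}

for dimension, shortfall in find_gaps(REQUIREMENTS, VENDOR).items():
    print(f"{dimension}: short by {shortfall}")
```

The numbers matter less than the resulting list: each entry is a gap that needs either a mitigation in your integration or an explicit decision to accept the risk.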
