What to ask before buying an AI platform

Most AI platform evaluations focus on benchmark scores and feature checklists. The questions that predict whether a platform will work in production are different ones.

By Ramiro Enriquez

Enterprise AI platform evaluations tend to follow a pattern. The vendor gives a demo. The demo works well. The procurement team asks about pricing tiers, SLA guarantees, and compliance certifications. The technical team asks about API access and model options. A contract gets signed.

Six months later, the platform is live on a narrow use case, the integration was harder than expected, the cost per call is higher than the estimates suggested, and the team is discovering limitations that were not surfaced in the evaluation. The contract is in year one of a three-year deal.

This is not a failure of due diligence in the conventional sense. The procurement team asked the right questions for buying enterprise software. They did not ask the right questions for buying an AI platform, because AI platforms have failure modes that traditional enterprise software does not.

What is different about evaluating AI platforms

Traditional enterprise software either works or it does not. A database either returns the correct record or it throws an error. A workflow tool either completes the step or it fails. The failure modes are binary and visible.

AI platforms fail probabilistically. The model returns a response for every input. Some responses are correct. Some are subtly wrong. Some are confidently wrong in ways that look correct. The failure rate is not visible in the dashboard unless you have built evaluation infrastructure before you need it. And the failure rate changes over time as the model is updated, as your input distribution shifts, and as edge cases accumulate that were not in the evaluation set.

This changes what you need to evaluate before buying.

The questions that predict production success

What happens when the model is wrong?

Every AI platform will have cases where the model produces incorrect output. The question is not whether this happens but how the platform handles it. Does it surface confidence scores you can act on? Does it provide mechanisms to route low-confidence outputs to human review? Does it log the inputs that produced incorrect outputs so you can analyze them? A platform that cannot answer this question clearly is one where you will discover the failure handling yourself, after go-live, without the tooling to do it systematically.
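
As an illustration of the routing idea, here is a minimal sketch in Python. It assumes the platform returns a numeric confidence score between 0 and 1 with each output; the `ModelOutput` type, the `route` function, and the 0.8 threshold are hypothetical names and values chosen for the example, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class ModelOutput:
    input_id: str
    text: str
    confidence: float  # assumed: a score in [0, 1] reported by the platform


def route(output: ModelOutput, threshold: float = 0.8) -> str:
    """Send trusted outputs onward and everything else to a person."""
    return "auto" if output.confidence >= threshold else "human_review"


# Hypothetical outputs, for illustration only.
outputs = [
    ModelOutput("a1", "Refund approved", 0.95),
    ModelOutput("a2", "Refund denied", 0.55),
]

# Keep the ids of routed items so the failing inputs can be analyzed later.
review_queue = [o.input_id for o in outputs if route(o) == "human_review"]
print(review_queue)  # ['a2']
```

The threshold is only meaningful if the scores are reasonably well calibrated, which is worth checking against your own labeled data before relying on it.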

How does the platform handle model updates?

AI model providers update their underlying models on a schedule that is not always announced in advance. When a model is updated, behavior changes. Some changes improve outputs. Some introduce regressions on cases that were previously handled well. Ask the vendor how they notify customers of model updates, whether you can pin to a specific model version, and what their rollback process looks like. A platform where you have no control over model versioning is one where your production behavior can change without your knowledge or consent.
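
One way to make a model update visible is to replay a fixed set of inputs against both the pinned version and the new one and list the cases that got worse. The sketch below assumes you can call each version separately; `run_pinned` and `run_candidate` are hypothetical stand-ins for those calls, and the golden set is made up for the example.

```python
def find_regressions(golden, run_pinned, run_candidate):
    """Return inputs the pinned version gets right and the candidate gets wrong."""
    return [
        text
        for text, expected in golden
        if run_pinned(text) == expected and run_candidate(text) != expected
    ]


# Hypothetical golden set: (input, expected output).
golden = [
    ("Where is my invoice?", "billing"),
    ("Reset my password", "account"),
]


def run_pinned(text: str) -> str:
    """Stand-in for a call to the pinned model version."""
    return "billing" if "invoice" in text else "account"


def run_candidate(text: str) -> str:
    """Stand-in for a call to the updated model version."""
    return "billing"


regressions = find_regressions(golden, run_pinned, run_candidate)
print(regressions)  # ['Reset my password']
```

A check like this only helps if the vendor lets you address versions individually, which is what the pinning question is meant to establish.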

What does the cost structure look like at five times your initial volume?

AI platform costs scale with usage in ways that compound quickly: token costs per call, plus embedding costs, plus retrieval infrastructure costs, plus fine-tuning costs if your use case requires it. Run the numbers at your expected production volume, then at five times that volume, then at ten times. Ask the vendor whether their pricing changes at scale and whether there are caps or soft limits you should know about. Many platforms have introductory pricing that looks favorable and enterprise pricing that is significantly different.
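
Running the numbers can be as simple as the sketch below. Every figure in it is a placeholder assumption (call volume, tokens per call, price per 1,000 tokens, a fixed monthly charge standing in for retrieval or embedding infrastructure); substitute the values from the vendor's actual quote.

```python
def monthly_cost(calls: int, tokens_per_call: int, price_per_1k_tokens: float,
                 fixed_monthly: float = 0.0) -> float:
    """Usage-based token cost plus any fixed monthly charges."""
    return calls * tokens_per_call / 1000 * price_per_1k_tokens + fixed_monthly


# Placeholder assumptions, not real pricing.
baseline_calls = 100_000
tokens_per_call = 1_500
price_per_1k_tokens = 0.01
fixed_monthly = 500.0

for multiplier in (1, 5, 10):
    cost = monthly_cost(baseline_calls * multiplier, tokens_per_call,
                        price_per_1k_tokens, fixed_monthly)
    print(f"{multiplier}x volume: ${cost:,.2f} per month")
```

This model is linear on purpose. If the vendor's answer includes tiers, minimum commitments, or overage rates, those belong in the function too, because they are usually where the estimate and the invoice diverge.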

Can you export your data and your configurations?

AI platforms often accumulate proprietary value: fine-tuned models trained on your data, prompt libraries developed through iteration, evaluation datasets that represent months of work. Ask what happens to that value if you switch vendors. Can you export fine-tuned weights? Can you export the evaluation datasets? Can you export prompt configurations in a format usable with another provider? A platform where the answer to these questions is no is one where switching costs will compound over time, whether or not that is the vendor’s intent.

What does the observability layer look like in practice?

Ask to see the actual monitoring dashboard, not a mockup. Then ask how you would answer specific questions: which inputs drove the highest cost last week? What is the p95 latency for your specific use case over the last 30 days? What percentage of outputs passed your validation criteria? If the platform cannot answer these questions from its native dashboard, you will be building that observability layer yourself. That work takes engineering time you may not have budgeted.
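
For a sense of what building it yourself involves, here is a rough sketch that answers the three questions above from raw request logs. The log shape (use case, latency, cost, validation result) is an assumption made for the example, and the percentile uses the simple nearest-rank method.

```python
import math
from collections import defaultdict

# Hypothetical log rows: (use_case, latency_ms, cost_usd, passed_validation)
logs = [
    ("summarize", 420, 0.012, True),
    ("summarize", 610, 0.015, True),
    ("summarize", 1900, 0.040, False),
    ("classify", 120, 0.002, True),
]


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def build_report(rows):
    """Group rows by use case and compute latency, cost, and pass rate."""
    by_case = defaultdict(list)
    for use_case, latency_ms, cost_usd, passed in rows:
        by_case[use_case].append((latency_ms, cost_usd, passed))
    report = {}
    for use_case, items in by_case.items():
        report[use_case] = {
            "p95_latency_ms": p95([i[0] for i in items]),
            "total_cost_usd": round(sum(i[1] for i in items), 4),
            "pass_rate": sum(1 for i in items if i[2]) / len(items),
        }
    return report


report = build_report(logs)
for use_case, stats in report.items():
    print(use_case, stats)
```

The aggregation is the easy part. Capturing every request with its cost and validation result, storing it, and keeping it queryable over 30 days is where the unbudgeted engineering time tends to go.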

Who owns the inputs you send through the platform?

AI platforms vary significantly in how they handle training on customer inputs. Some providers use API calls to improve their models. Some require explicit opt-out. Some offer data isolation by default. This is a legal and compliance question for regulated industries, and it is a competitive sensitivity question for everyone else. The answer belongs in the contract, not in a verbal assurance during the sales call.

Questions about the integration path

What does a production integration actually involve?

Ask the vendor to walk you through the last three customer integrations with a use case similar to yours. Not the easy ones: the ones that hit complications. What were the complications? How long did the integration take from contract to production? What was the team size on the customer side? Vendors who cannot produce these specifics either are not tracking them or have integrations that did not go as smoothly as the sales narrative suggests.

What is the rate limit structure?

AI platforms impose rate limits that are not always visible in early testing because the evaluation load is lower than production load. Ask for the specific rate limits at your expected volume and at peak. Ask what happens when you hit a limit: do calls queue, do they fail, or does the system degrade gracefully? This matters for user-facing applications where a rate limit that drops calls becomes a user-visible outage.
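
If the answer is that calls fail, the client has to absorb the limit. A common pattern is retrying with exponential backoff and jitter, sketched below. `RateLimitError` is a hypothetical exception standing in for whatever your client raises on an HTTP 429 response, and the attempt count and delays are illustrative defaults.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when the platform rejects a call as rate limited."""


def call_with_backoff(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a rate-limited call, waiting longer (with jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Full jitter: wait a random time up to base_delay * 2^attempt seconds.
            sleep(random.uniform(0, base_delay * 2 ** attempt))


# Demo with a fake call that is rate limited twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError()
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=delays.append)
print(result, len(delays))  # ok 2
```

Retries trade failures for latency. In a user-facing application the added wait is itself visible, so the vendor's stated limits still need to fit your peak load.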

What does the SLA actually cover?

Many AI platform SLAs cover API availability but not output quality or model behavior consistency. A platform that is 99.9% available but returns subtly degraded outputs for a week after a model update has technically honored its SLA while creating a production problem for you. Ask specifically what the SLA covers, what it does not cover, and what remedies exist for each category.
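
To put the availability figure in concrete terms, the short calculation below converts an availability percentage into permitted downtime, assuming a 30-day month. The function name is made up for the example.

```python
def allowed_downtime_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime an availability SLA permits over the period."""
    return (1 - availability_pct / 100) * days * 24 * 60


for pct in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% availability allows about {minutes:.1f} minutes down per 30 days")
```

At 99.9% that is roughly 43 minutes a month. None of that arithmetic says anything about whether the responses returned during the available time were any good.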

The evaluation structure that surfaces real risk

The standard evaluation process gives vendors an advantage: they control the demo, they can cherry-pick examples, and the evaluation timeline is short enough that edge cases do not surface.

A more reliable evaluation runs your own production data, not the vendor’s curated examples. Take a representative sample of real inputs from your current workflow (with appropriate privacy handling), define evaluation criteria in advance, and run the platform against them. Score the outputs against your criteria before you see the vendor’s pricing. This gives you a ground-truth performance number that is specific to your use case rather than a generic benchmark.
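
A harness for this does not need to be elaborate. The sketch below assumes a classification-style task where each real input has an expected output agreed in advance; `fake_platform` is a stand-in for the vendor API call, and the samples and the exact-match criterion are placeholders for your own.

```python
from typing import Callable, Iterable

Sample = tuple[str, str]  # (real input, expected output defined in advance)


def evaluate(predict: Callable[[str], str], samples: Iterable[Sample],
             passes: Callable[[str, str], bool]) -> dict:
    """Score a platform against pre-agreed criteria and keep the failing inputs."""
    results = [(text, passes(predict(text), expected)) for text, expected in samples]
    if not results:
        raise ValueError("evaluation set is empty")
    failures = [text for text, ok in results if not ok]
    return {
        "pass_rate": (len(results) - len(failures)) / len(results),
        "failures": failures,
    }


def fake_platform(text: str) -> str:
    """Stand-in for the vendor API call."""
    return "billing" if "invoice" in text else "other"


# Hypothetical sample of real inputs with expected labels.
samples = [
    ("Where is my invoice?", "billing"),
    ("Reset my password", "account"),
    ("My invoice was charged twice", "billing"),
    ("Cancel my plan", "account"),
]

result = evaluate(fake_platform, samples, passes=lambda got, want: got == want)
print(result["pass_rate"])  # 0.5
print(result["failures"])   # ['Reset my password', 'Cancel my plan']
```

Keeping the list of failures matters as much as the score: those inputs are the concrete cases to put in front of the vendor.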

Run the evaluation over several days, not several hours. Model behavior can vary across time windows. A model that performs well on Tuesday morning may behave differently on Thursday afternoon, not because the model changed but because you are seeing natural variance that is invisible in a short evaluation.
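
When comparing windows, it helps to know how much of the difference could be sampling noise. The sketch below attaches an approximate 95% margin to each pass rate using the normal approximation to a binomial proportion; the window names and counts are hypothetical.

```python
import math


def pass_rate_with_margin(passed: int, total: int, z: float = 1.96):
    """Pass rate and an approximate 95% margin (normal approximation)."""
    rate = passed / total
    margin = z * math.sqrt(rate * (1 - rate) / total)
    return rate, margin


# Hypothetical counts per evaluation window: (passed, total)
windows = {"Tuesday morning": (182, 200), "Thursday afternoon": (171, 200)}

for name, (passed, total) in windows.items():
    rate, margin = pass_rate_with_margin(passed, total)
    print(f"{name}: {rate:.1%} +/- {margin:.1%}")
```

With these made-up counts the two ranges overlap, which suggests the gap may be ordinary variance rather than a change in the model. Treat that as a rough guide only: the approximation is weakest for small samples and for rates near 0% or 100%.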

Ask a dissenting question in the final stages: assume this deployment fails 12 months in. What would the most likely cause be? A vendor who can engage seriously with this question and give you a specific, plausible answer understands the failure modes of their platform. A vendor who deflects it does not.

The goal of the evaluation is not to confirm that the platform can do the demo. It almost certainly can. The goal is to discover what happens outside the demo conditions: in production, at scale, with real inputs, over time. Those are the conditions that determine whether the platform is a good fit. They are also the conditions the vendor is least motivated to surface in the sales process.

The questions above are not adversarial. They are the information you need to make a sound purchasing decision. Vendors who cannot answer them clearly are telling you something important.
