The case for structured outputs in production AI

A support ticket routing system processes 4,000 tickets per day. The LLM reads each ticket and returns a response in prose: “This ticket appears to be a billing issue, likely related to the customer’s recent subscription change, with medium priority.” A parser extracts the category, reason, and priority from that sentence.

It works fine for two months. Then the model provider updates the model. The new version writes “This seems to be a billing question (medium priority) related to a recent subscription event.” The parser breaks. Tickets pile up in an unrouted queue for six hours before someone notices.

The fix is fifteen minutes of work. But it should never have been necessary.

The problem with prose outputs

Most LLM integrations are built around a simple pattern: send a prompt, receive text, process the text. This works in prototypes because prose is flexible. It fails in production because prose is unpredictable.

Every time a model is updated, fine-tuned, or replaced, its output format shifts. Capitalization changes. Sentence structure varies. Fields appear in different order. Systems that parse prose are brittle by design.
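The ticket-routing failure is easy to reproduce. The sketch below assumes a regex-based parser written against the original phrasing; `OLD_FORMAT` and `parse_prose` are hypothetical names, and the real parser in such a system could look quite different.

```python
import re

# Hypothetical parser written against the model's original phrasing.
OLD_FORMAT = re.compile(
    r"appears to be an? (?P<category>\w+) issue, "
    r"likely related to (?P<reason>.+?), "
    r"with (?P<priority>\w+) priority"
)

def parse_prose(text):
    """Return the extracted fields, or None when the phrasing is unrecognized."""
    match = OLD_FORMAT.search(text)
    return match.groupdict() if match else None

before = ("This ticket appears to be a billing issue, likely related to the "
          "customer's recent subscription change, with medium priority.")
after = ("This seems to be a billing question (medium priority) related to a "
         "recent subscription event.")

print(parse_prose(before))  # a dict with category, reason, and priority
print(parse_prose(after))   # None: same information, different sentence
```

Nothing about the second sentence is wrong. The parser simply encoded one phrasing as if it were a contract.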

There is also a cost problem. A prose response to “categorize this ticket” might be 60 tokens. The same information expressed as {"category": "billing", "priority": "medium", "reason": "subscription change"} is 18 tokens. At scale, that difference compounds.

And there is an observability problem. Prose is hard to aggregate. If you want to know the distribution of ticket categories over the past week, you cannot run a GROUP BY on prose. You have to run another LLM call to extract the data you already asked for once.

Structured outputs solve all three problems simultaneously.

What structured outputs are

Structured outputs enforce that an LLM returns data in a specific schema rather than free text. Instead of asking “What category is this ticket?”, you define a JSON schema and tell the model to return an object that matches it:

{
  "type": "object",
  "properties": {
    "category": {
      "type": "string",
      "enum": ["billing", "technical", "account", "general"]
    },
    "priority": {
      "type": "string",
      "enum": ["high", "medium", "low"]
    },
    "reason": {
      "type": "string",
      "maxLength": 200
    }
  },
  "required": ["category", "priority", "reason"]
}

The model is constrained to return output that validates against this schema. No prose parsing required: you decode the JSON and read fields by name. No brittle extraction logic. No surprise failures when the model changes how it phrases things.
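Even with enforcement upstream, a check on the receiving side is cheap insurance. The sketch below hand-rolls that check for this one schema using only the standard library. `validate_ticket` is a hypothetical name, and this is not a general JSON Schema validator; a dedicated validation library would normally do this job.

```python
import json

CATEGORIES = {"billing", "technical", "account", "general"}
PRIORITIES = {"high", "medium", "low"}
MAX_REASON_LENGTH = 200

def validate_ticket(raw):
    """Parse a JSON string and check it against the ticket schema.

    Returns the parsed dict, or raises ValueError naming the first problem.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in ("category", "priority", "reason"):
        if field not in data:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(data[field], str):
            raise ValueError(f"field must be a string: {field}")
    if data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']}")
    if data["priority"] not in PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']}")
    if len(data["reason"]) > MAX_REASON_LENGTH:
        raise ValueError("reason exceeds 200 characters")
    return data

ticket = validate_ticket(
    '{"category": "billing", "priority": "medium", "reason": "subscription change"}'
)
print(ticket["category"], ticket["priority"])
```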

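The two main implementation paths are tool use (function calling) and native structured output mode. When the provider offers strict schema enforcement, both constrain the response at the API layer, so it is checked before it reaches your code. Support varies by provider and by schema keyword (length limits such as maxLength are not always enforced), so check the documentation for the mode you use.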

The cost difference is real

Structured outputs reduce token consumption in two ways.

First, the response itself is shorter. A classification that takes 60-80 tokens in prose takes 15-25 tokens as JSON, a saving of 35-65 tokens per request. On a system processing 4,000 requests per day, at $0.015 per thousand output tokens, that is 140,000-260,000 fewer output tokens and roughly $2-4 per day saved on outputs alone, or about $750-1,400 per year from one classification step.
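The arithmetic behind the first saving is easy to rerun with your own numbers. This sketch uses the request volume, token ranges, and price quoted above; the function and constant names are illustrative.

```python
REQUESTS_PER_DAY = 4_000
PRICE_PER_1K_OUTPUT_TOKENS = 0.015  # dollars, as quoted above

def daily_saving(prose_tokens, json_tokens):
    """Dollars saved per day by shortening each response."""
    tokens_saved = (prose_tokens - json_tokens) * REQUESTS_PER_DAY
    return tokens_saved / 1_000 * PRICE_PER_1K_OUTPUT_TOKENS

low = daily_saving(prose_tokens=60, json_tokens=25)   # smallest gap: 35 tokens
high = daily_saving(prose_tokens=80, json_tokens=15)  # largest gap: 65 tokens

print(f"per day:  ${low:.2f} to ${high:.2f}")
print(f"per year: ${low * 365:,.2f} to ${high * 365:,.2f}")
```

Swapping in a different price or volume shows how quickly the figure scales, and how small it stays for a single low-volume step.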

Second, you eliminate the second-pass calls. Many teams extract data from prose outputs with a follow-up LLM call: “Given this response, extract the category and priority as JSON.” This doubles the per-request cost for the extraction step. Structured outputs remove the second call entirely.

At low volume, neither saving is interesting. At production volume across multiple AI operations, the aggregate is significant. If you have ten AI-assisted operations in your system and five of them currently return prose that gets parsed, switching those five to structured outputs will measurably reduce your monthly AI spend.

The reliability difference is larger than the cost difference

Structured outputs do not just reduce costs. They change the failure mode of your system.

With prose outputs, failures are silent and downstream. A parser fails. An exception is swallowed. A queue fills up. A metric looks wrong. By the time someone notices, the problem is an hour or more old. Debugging requires reconstructing what the LLM said, what the parser did, and where it broke.

With structured outputs, the failure is immediate and explicit. If a response does not conform to the schema, it is rejected at the boundary, either by the API's own enforcement or by a validation step at the call site, before the rest of your code sees it. You can catch that error at the exact call site, log the input, and handle it cleanly. The failure surface shrinks from “anywhere the parsed value is used” to “the single place where you call the LLM.”
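Catching the failure at the call site might look like the sketch below. `call_model` is a hypothetical callable that sends the ticket to the LLM and returns the raw JSON string; the priority check stands in for a full schema validation, and returning None as the manual-triage signal is an illustrative choice.

```python
import json
import logging

logger = logging.getLogger("ticket_router")

class SchemaError(Exception):
    """Raised when a response does not match the expected shape."""

def classify(ticket_text, call_model):
    """Classify one ticket; return None when the response is unusable."""
    try:
        raw = call_model(ticket_text)
        data = json.loads(raw)
        if not isinstance(data, dict) or data.get("priority") not in {"high", "medium", "low"}:
            raise SchemaError(f"unexpected response: {raw!r}")
        return data
    except (json.JSONDecodeError, SchemaError) as exc:
        # The failure surfaces here, with the original input still in hand.
        logger.error("classification failed for %r: %s", ticket_text, exc)
        return None  # caller sends the ticket to manual triage

good = classify(
    "I was charged twice",
    lambda _: '{"category": "billing", "priority": "high", "reason": "double charge"}',
)
bad = classify("I was charged twice", lambda _: "This seems to be a billing question.")
```

Here `good` is the parsed object and `bad` is None, with the offending input already logged at the point of failure.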

This is the same principle as type safety in programming. A type error caught at compile time is easier to fix than a null pointer exception caught at runtime. Schema enforcement at the API layer is type safety for LLM responses.

Structured outputs improve observability by default

When every AI operation returns a validated JSON object, your logs become queryable data rather than unstructured text.

Consider what you can measure when ticket classifications are structured:

Distribution of categories over time (detect shifts when customer behavior changes)
Accuracy of priority assignments (compare against human review samples)
Reason length distribution (proxy for confidence: long reasons often indicate ambiguous inputs)
Category correlations with resolution time (identify which ticket types take longest to resolve)

None of this is possible without structure. You can extract it from prose retrospectively, but that requires another set of LLM calls and introduces its own parsing fragility.

Structured outputs make observability free. The data you need for analysis is already there, in the right format, every time.
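As a small illustration, the GROUP BY that prose rules out becomes a single query once classifications are logged as JSON. The log lines and table name below are made up for the example, and an in-memory SQLite database stands in for whatever store you actually use.

```python
import json
import sqlite3

# Hypothetical log lines: one validated classification per ticket.
log_lines = [
    '{"category": "billing", "priority": "medium", "reason": "subscription change"}',
    '{"category": "billing", "priority": "high", "reason": "double charge"}',
    '{"category": "technical", "priority": "low", "reason": "login page typo"}',
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE classifications (category TEXT, priority TEXT, reason TEXT)")
conn.executemany(
    "INSERT INTO classifications VALUES (:category, :priority, :reason)",
    [json.loads(line) for line in log_lines],
)

rows = conn.execute(
    "SELECT category, COUNT(*) FROM classifications "
    "GROUP BY category ORDER BY COUNT(*) DESC, category"
).fetchall()
print(rows)  # [('billing', 2), ('technical', 1)]
```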

Implementation patterns

Use schema-constrained tool calls for classification. When the AI operation is fundamentally a decision, define an enum of valid decisions and enforce it. Do not let the model free-form its way to a classification.

Keep schemas flat where possible. Nested structures are harder to validate and harder to query. A flat object with five fields is more reliable than a tree with three levels of nesting.

Include a confidence field when decisions have downstream consequences. Add a confidence: "high" | "medium" | "low" field to any schema where low-confidence outputs should trigger human review. Name the field clearly, and treat the value as a rough signal rather than a calibrated probability: spot-check it against human review samples before you rely on it.

Use maxLength constraints on string fields. Without them, the model may return unexpectedly long values for fields like “reason” or “summary.” Set reasonable limits that match what your downstream system can handle.

Log the full structured output, not just the fields you use. The fields you ignore today may be what you need for debugging six months from now.
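Taken together, these patterns might look like the sketch below: a flat schema with enums for the decisions, a length cap on free text, a confidence field, and a routing step that logs the whole object. The queue names and the `route` function are hypothetical.

```python
import json
import logging

logger = logging.getLogger("ticket_router")

# Flat schema: enums for decisions, maxLength on free text, explicit confidence.
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string", "enum": ["billing", "technical", "account", "general"]},
        "priority": {"type": "string", "enum": ["high", "medium", "low"]},
        "confidence": {"type": "string", "enum": ["high", "medium", "low"]},
        "reason": {"type": "string", "maxLength": 200},
    },
    "required": ["category", "priority", "confidence", "reason"],
}

def route(result):
    """Log the full object, then pick a queue (queue names are hypothetical)."""
    logger.info("classification %s", json.dumps(result, sort_keys=True))
    if result["confidence"] == "low":
        return "human_review"
    return f"{result['category']}_queue"

sure = {"category": "billing", "priority": "medium",
        "confidence": "high", "reason": "subscription change"}
unsure = {"category": "general", "priority": "low",
          "confidence": "low", "reason": "message is a single emoji"}
print(route(sure), route(unsure))  # billing_queue human_review
```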

When structured outputs are not appropriate

Structured outputs are the right default for classification, extraction, routing, and decision-support tasks. They are not appropriate for everything.

Creative and generative tasks. Writing a blog post draft, summarizing a free-form document, generating alternative phrasings: these are tasks where the output is the prose itself. Constraining them to a schema makes no sense.

Open-ended reasoning. If you need the model to work through a problem it has not seen before, constraining output format prematurely limits reasoning depth. Let the model think in prose, then extract a structured decision at the end.

User-facing copy. Any response the user will read directly should not be constrained to JSON. The user does not want to see your schema.

The decision is simple: if a human would not want to read the output, it should probably be structured. If a human will read it, it should be prose.
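For the open-ended reasoning case, the "think in prose, then extract" approach can be sketched as two steps. Both callables are hypothetical stand-ins for LLM calls, stubbed here with lambdas so the example runs on its own.

```python
def decide(problem, generate_text, generate_structured):
    """Reason in free text first, then extract a structured decision."""
    reasoning = generate_text(f"Think through this problem step by step:\n{problem}")
    decision = generate_structured(
        f"Based on this analysis, return the final decision.\n\n{reasoning}"
    )
    # Keep the reasoning alongside the decision so it can be logged and reviewed.
    return {"reasoning": reasoning, "decision": decision}

result = decide(
    "Should this refund request be escalated?",
    generate_text=lambda prompt: "The customer was charged twice, so escalation is warranted.",
    generate_structured=lambda prompt: {"escalate": True},
)
```

Only the final decision is schema-constrained. The reasoning stays free-form and travels with it.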

The pattern to follow

Start by auditing your existing LLM calls. For each one, answer two questions: Does this call return data my code needs to process? Does a human read this output directly?

If the answer to the first question is yes and the second is no, the call should return structured output.
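The audit rule is simple enough to write down as a predicate. The call names below are invented for illustration; the two flags correspond to the two questions above.

```python
from dataclasses import dataclass

@dataclass
class LLMCall:
    name: str
    code_processes_output: bool  # question 1
    human_reads_output: bool     # question 2

def should_be_structured(call):
    """Yes to the first question and no to the second."""
    return call.code_processes_output and not call.human_reads_output

calls = [
    LLMCall("ticket_classifier", code_processes_output=True, human_reads_output=False),
    LLMCall("reply_drafter", code_processes_output=False, human_reads_output=True),
    LLMCall("invoice_field_extractor", code_processes_output=True, human_reads_output=False),
]
to_migrate = [c.name for c in calls if should_be_structured(c)]
print(to_migrate)  # ['ticket_classifier', 'invoice_field_extractor']
```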

In most production AI systems, this applies to 40-60% of all LLM calls. Switching those calls to structured outputs reduces token costs, eliminates parsing fragility, and turns opaque prose logs into queryable operational data.

The support ticket system that broke when the model updated should have been requesting structured output from the start. The model can change its phrasing however it wants. The schema does not care.

Related: Reading an LLM bill covers how to analyze where your AI spend is actually going. The AI observability gap covers what to measure once your systems are producing data worth measuring.
