Why Your AI Works in Dev and Breaks in Production


WHAT BREAKS

Marcia Coulter

5/1/2026
3 min read


You get something working locally.

You test a few cases.
Tweak the prompt.
Add a couple of examples.

It looks solid.

So you wire it into your app, push it through staging, and ship.

Then production starts doing things you didn’t see before.

What Changes?

At first, nothing obvious.

The prompt is the same.
The model is the same.

But the behavior isn’t.

You start seeing things like:

  • Outputs that don’t match what you tested

  • Edge cases appearing more often than expected

  • Previously “fixed” issues showing up again

  • Slight wording changes causing different results

And the most frustrating one:

You can’t reliably reproduce the problem.

A Simple Case

Say you’re extracting structured data from support emails.

You test locally with something like:

“Extract name, issue type, and urgency from this message.”

It works.

So you add a bit more structure:

“Return JSON with fields: name, issue_type, urgency. If missing, use null.”

Still works.
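In code, the local loop is about this small. A minimal sketch in Python, assuming a call_model() stand-in for whatever model client you actually use; the helper names and test emails are illustrative, not from the original setup:

import json

PROMPT = (
    "Return JSON with fields: name, issue_type, urgency. "
    "If missing, use null."
)

def call_model(system: str, user: str) -> str:
    # Stand-in for your provider's chat call (OpenAI, Anthropic, a local model).
    # Wire this up to whichever client you use.
    raise NotImplementedError

def extract(message: str) -> dict:
    raw = call_model(system=PROMPT, user=message)
    return json.loads(raw)  # assumes the model returns bare JSON, no prose around it

# The dev "test suite": a handful of clean, clearly worded emails.
TEST_EMAILS = [
    "Hi, this is Dana. I can't log in to my account. It's urgent.",
    "Hello, I'm Priya. I was charged twice on my last invoice.",
]

for email in TEST_EMAILS:
    print(extract(email))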

You test a handful of emails:

  • clean formatting

  • clear intent

  • predictable phrasing

Everything looks stable.

Production Is Messier

Then real inputs start coming in:

“Hey, this is John—my account’s acting weird again, same issue as last week, kind of urgent but not sure if I picked the right category.”

Now you get:

{
  "name": "John",
  "issue_type": "account",
  "urgency": "urgent"
}

That’s fine.

Then another message:

“It’s me again. Still locked out. Billing said it wasn’t them.”

Now you get:

{
  "name": null,
  "issue_type": "billing",
  "urgency": "high"
}

That’s… less fine.

So you refine the prompt:

“If a user references a prior issue, infer context from the message. Prioritize current problem over historical mentions.”

That helps.

Until it doesn’t.

The Drift

A week later:

  • Similar messages produce different JSON structures

  • “urgency” sometimes becomes “priority”

  • “issue_type” flips between categories for similar inputs

Nothing changed in your code.

But the behavior shifted.
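The structural half of that drift is at least detectable mechanically. A small sketch; it only checks that the keys match the set you tested against, and check_shape is a hypothetical helper:

EXPECTED_KEYS = {"name", "issue_type", "urgency"}

def check_shape(output: dict) -> list[str]:
    problems = []
    missing = EXPECTED_KEYS - output.keys()
    extra = output.keys() - EXPECTED_KEYS
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")   # e.g. "urgency" gone
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")  # e.g. "priority" showing up instead
    return problems

But that only tells you the shape moved. It says nothing about why, or which category issue_type should have been.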

You Try to Debug It

You log:

  • the raw input

  • the model output
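Even minimal logging is enough to line those up. A sketch, reusing the extract() helper from the earlier sketch; the log file path is arbitrary:

import json
from datetime import datetime, timezone

def extract_and_log(message: str, log_path: str = "extractions.jsonl") -> dict:
    result = extract(message)  # the extract() stub sketched earlier
    with open(log_path, "a") as f:
        f.write(json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "input": message,   # the raw input
            "output": result,   # the model output
        }) + "\n")
    return result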

You compare failing cases to working ones.

They look almost identical.

You tweak the prompt again.

You add more examples.

You tighten the wording.

Some things improve.

Other things break.

At some point you realize:

You’re not debugging a function.

You’re chasing a moving target.

What’s Actually Different in Production

It’s not just “more data.”

It’s different conditions:

  • wider variation in phrasing

  • longer or shorter inputs

  • partial context

  • repeated interactions

  • multiple developers touching the system

Each of these slightly changes how the model interprets the same prompt.

And none of that is captured anywhere.

The Real Problem

In dev, you’re testing a handful of cases.

In production, you’re relying on consistency across thousands.

But the system doesn’t retain:

  • how it resolved earlier edge cases

  • what rules you introduced along the way

  • which fixes were meant to stabilize behavior

Every request is effectively starting fresh.
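You can see that in the handler itself. A sketch, using the same hypothetical extract() helper as before:

def handle_request(message: str) -> dict:
    # The only context is the fixed prompt plus this one message.
    # No resolved edge cases, no earlier rule changes, no record of
    # why past outputs looked the way they did.
    return extract(message)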

Why This Feels Worse Than Normal Bugs

In a typical system:

  • you can trace logic

  • you can reproduce behavior

  • you can isolate changes

Here, you can’t easily do any of that.

The logic isn’t stored.
The reasoning isn’t visible.

So when something breaks, you don’t have a path back.

Where This Leads

You end up compensating in ways that don’t scale:

  • adding more prompt rules

  • adding more examples

  • narrowing inputs

  • hoping it stabilizes

Sometimes it does.

But it doesn’t hold.
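Concretely, the compensation is a prompt that keeps growing. An illustrative sketch; the strings are the rules introduced in the walkthrough above:

PROMPT = "\n".join([
    "Return JSON with fields: name, issue_type, urgency. If missing, use null.",
    "If a user references a prior issue, infer context from the message.",
    "Prioritize current problem over historical mentions.",
    # ...each new incident adds another line, and none of them record
    # why they were added or which behavior they were meant to pin down.
])

Each rule made sense when it was added. But the reason it was added lives nowhere except in whoever added it.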

The Shift

At some point, the question changes.

Not:

“Why does this prompt behave differently in production?”

But:

“What would it take for this system to behave the same way every time?”

What’s Missing

To get there, you’d need to carry forward things like:

  • the rules you introduced

  • how edge cases were resolved

  • what “worked” and why

  • how decisions were made

Across requests.

Across sessions.

Across environments.

Right now, none of that persists.
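Writing down what one of those records would even contain makes the gap concrete. A rough, purely illustrative sketch, not a design:

from dataclasses import dataclass, field

@dataclass
class ReasoningRecord:
    rule: str                 # e.g. "prioritize the current problem over historical mentions"
    trigger: str              # the edge case or failure that prompted the rule
    resolution: str           # how it was resolved, and why that choice was made
    environment: str          # where it was learned: dev, staging, or production
    examples: list[str] = field(default_factory=list)  # inputs it was checked against

Nothing in a plain prompt-plus-model setup stores anything like this, let alone applies it consistently across requests, sessions, and environments.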

Where This Points

This isn’t just a testing gap.

It’s an architectural one.

What’s missing is a layer where reasoning doesn’t reset between interactions—
where it can be carried from development into production,
and remain consistent over time.

In other words:

a Durable Reasoning Layer™.
