Why Your AI Works in Dev and Breaks in Production
Marcia Coulter
5/1/2026 · 3 min read
You get something working locally.
You test a few cases.
Tweak the prompt.
Add a couple of examples.
It looks solid.
So you wire it into your app, push it through staging, and ship.
Then production starts doing things you didn’t see before.
What Changes?
At first, nothing obvious.
The prompt is the same.
The model is the same.
But the behavior isn’t.
You start seeing things like:
Outputs that don’t match what you tested
Edge cases appearing more often than expected
Previously “fixed” issues showing up again
Slight wording changes causing different results
And the most frustrating one:
You can’t reliably reproduce the problem.
A Simple Case
Say you’re extracting structured data from support emails.
You test locally with something like:

“Extract name, issue type, and urgency from this message.”

It works.
So you add a bit more structure:

"Return JSON with fields: name, issue_type, urgency. If missing, use null."

Still works.
You test a handful of emails:
clean formatting
clear intent
predictable phrasing
Everything looks stable.
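That local test loop might look something like the sketch below. The `call_model` function is a stand-in for whatever LLM client you actually use — it is stubbed here so the example is self-contained, and the email text and field names are illustrative.

```python
import json

PROMPT = "Return JSON with fields: name, issue_type, urgency. If missing, use null."

def call_model(prompt: str, email: str) -> str:
    # Stand-in for a real LLM call; swap in your client here.
    return '{"name": "Dana", "issue_type": "login", "urgency": "low"}'

def extract(email: str) -> dict:
    raw = call_model(PROMPT, email)
    return json.loads(raw)

# A handful of clean, predictable test emails — the kind dev testing sees.
test_emails = [
    "Hi, I'm Dana. I can't log in. Not urgent.",
]

for email in test_emails:
    result = extract(email)
    assert set(result) == {"name", "issue_type", "urgency"}
```

With inputs this tidy, every run passes — which is exactly why it looks stable.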
Production Is Messier
Then real inputs start coming in:
“Hey, this is John—my account’s acting weird again, same issue as last week, kind of urgent but not sure if I picked the right category.”
Now you get:

{
  "name": "John",
  "issue_type": "account",
  "urgency": "urgent"
}

That’s fine.
Then another message:
“It’s me again. Still locked out. Billing said it wasn’t them.”
Now you get:

{
  "name": null,
  "issue_type": "billing",
  "urgency": "high"
}

That’s… less fine.
So you refine the prompt:
“If a user references a prior issue, infer context from the message. Prioritize current problem over historical mentions.”
That helps.
Until it doesn’t.
The Drift
A week later:
Similar messages produce different JSON structures
“urgency” sometimes becomes “priority”
“issue_type” flips between categories for similar inputs
Nothing changed in your code.
But the behavior shifted.
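One way to absorb this kind of key drift before it reaches your app is to normalize aliases and enforce the schema on the way out. This is a minimal sketch — the alias table and field names are assumptions for this example, not part of any library:

```python
import json

# Canonical schema, plus the drifted key names observed in production.
ALIASES = {"priority": "urgency", "category": "issue_type"}
REQUIRED = {"name", "issue_type", "urgency"}

def normalize(raw_json: str) -> dict:
    data = json.loads(raw_json)
    # Rename known aliases back to canonical field names.
    data = {ALIASES.get(k, k): v for k, v in data.items()}
    # Fill anything missing with null instead of failing downstream.
    for field in REQUIRED:
        data.setdefault(field, None)
    # Drop unexpected extras so the shape stays stable.
    return {k: data[k] for k in REQUIRED}
```

This papers over the symptom — the structure stops drifting at the boundary — but it does nothing about why the model drifted in the first place.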
You Try to Debug It
You log:
the raw input
the model output
You compare failing cases to working ones.
They look almost identical.
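If you are going to compare failing cases against working ones at all, the log needs more than input and output. A sketch of the minimum that makes later comparison possible — record structure and file name are illustrative assumptions:

```python
import datetime
import hashlib
import json

def log_case(path: str, prompt_version: str, raw_input: str, output: str) -> None:
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        # Tag which prompt revision produced this output.
        "prompt_version": prompt_version,
        # Hash the input so near-duplicates can be grouped cheaply.
        "input_hash": hashlib.sha256(raw_input.encode()).hexdigest()[:12],
        "input": raw_input,
        "output": output,
    }
    # Append as JSON lines so the log stays trivially parseable.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Even this only tells you *that* two near-identical inputs diverged — not why.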
You tweak the prompt again.
You add more examples.
You tighten the wording.
Some things improve.
Other things break.
At some point you realize:
You’re not debugging a function.
You’re chasing a moving target.
What’s Actually Different in Production
It’s not just “more data.”
It’s different conditions:
wider variation in phrasing
longer or shorter inputs
partial context
repeated interactions
multiple developers touching the system
Each of these slightly changes how the model interprets the same prompt.
And none of that is captured anywhere.
The Real Problem
In dev, you’re testing a handful of cases.
In production, you’re relying on consistency across thousands.
But the system doesn’t retain:
how it resolved earlier edge cases
what rules you introduced along the way
which fixes were meant to stabilize behavior
Every request is effectively starting fresh.
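A deliberately naive sketch of what "not starting fresh" could mean: persist the rules you introduce and prepend them to every prompt. The file format and function names here are illustrative assumptions, not an established pattern:

```python
import json
import os

RULES_PATH = "prompt_rules.json"

def load_rules() -> list:
    if not os.path.exists(RULES_PATH):
        return []
    with open(RULES_PATH) as f:
        return json.load(f)

def add_rule(rule: str) -> None:
    # Each fix you introduce is recorded once, durably.
    rules = load_rules()
    if rule not in rules:
        rules.append(rule)
    with open(RULES_PATH, "w") as f:
        json.dump(rules, f)

def build_prompt(base: str) -> str:
    # Every request sees every rule introduced so far.
    rules = load_rules()
    if not rules:
        return base
    return base + "\nRules:\n" + "\n".join(f"- {r}" for r in rules)
```

This is still just prompt text — it does not capture *why* a rule exists or which edge case it resolved — but it makes the gap concrete: everything in that file is state the system otherwise forgets.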
Why This Feels Worse Than Normal Bugs
In a typical system:
you can trace logic
you can reproduce behavior
you can isolate changes
Here, you can’t easily do any of that.
The logic isn’t stored.
The reasoning isn’t visible.
So when something breaks, you don’t have a path back.
Where This Leads
You end up compensating in ways that don’t scale:
adding more prompt rules
adding more examples
narrowing inputs
hoping it stabilizes
Sometimes it does.
But it doesn’t hold.
The Shift
At some point, the question changes.
Not:
“Why does this prompt behave differently in production?”
But:
“What would it take for this system to behave the same way every time?”
What’s Missing
To get there, you’d need to carry forward things like:
the rules you introduced
how edge cases were resolved
what “worked” and why
how decisions were made
Across requests.
Across sessions.
Across environments.
Right now, none of that persists.
Where This Points
This isn’t just a testing gap.
It’s an architectural one.
What’s missing is a layer where reasoning doesn’t reset between interactions—
where it can be carried from development into production,
and remain consistent over time.