
What Broke First When We Let AI Agents Write Code

Appskill Team · Feb 1, 2026 · 4 min read

Two months into our experiment with autonomous coding agents, we had a codebase that compiled, passed tests, and looked professional. We also had no idea why half of it existed.

The Setup

We started simple. An AI agent would handle routine refactoring, implement well-specified features, and keep dependencies updated. The kind of work that's straightforward but time-consuming.

Early results were promising. PRs came in clean. Code reviews found fewer issues than we expected. Velocity increased.

Then we tried to add a new authentication system.

The First Crack

The agent implemented OAuth perfectly. It also broke assumptions buried in three other parts of the codebase that nobody had documented.

Not with errors—the tests still passed. It broke them semantically. A session management pattern we'd been using implicitly was now half-replaced, half-intact. The code worked, but the mental model was gone.

We traced it back. Every individual change made sense. The agent had been asked to "modernize auth patterns" two weeks earlier, then "standardize session handling" the week after. Both tasks were completed successfully.

The problem was that these weren't independent tasks. We knew that. The agent didn't.

What We Observed

Over the next few weeks, we started seeing a pattern. The agent would complete a task, then a related task, then another task that touched the same domain. Each one was correct in isolation.

But the cumulative effect was architectural drift. Not technical debt—something stranger. The codebase was evolving without a coherent direction. Like watching a city built by people who never talked to each other: streets connected, but districts evolved in different directions.

We called it "contextless accumulation." Each change optimized for the immediate request while slowly eroding the structure we'd built up over months.

The first failure mode we observed was not incorrect code, but loss of architectural continuity across tasks.


Why Normal Safeguards Failed

What made this difficult to diagnose was that none of our usual safeguards were designed to catch it.

We had code review. We had tests. We had clear specifications.

None of it caught the problem because the problem wasn't in any single change. It was in the relationship between changes over time.

A human developer carries forward implicit knowledge: "We tried this approach last month and hit scaling issues," or "This pattern exists because of a constraint in the legacy system." They make decisions informed by history they don't need to document.

The agent had no such memory. Every prompt was a clean slate.


What We Changed

We stopped trying to make the agent smarter and started making our system more explicit.

First, we created decision records for architectural choices that would affect future work. Not for everything. Only for decisions we didn’t want re-litigated later.
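For illustration only, here is a minimal sketch of how records like these can be surfaced when a task is scoped, so past decisions aren't silently re-litigated. It assumes each record is a small Markdown file under docs/decisions/ with an "Applies-to:" line of glob patterns; the layout, field name, and helper functions are hypothetical, not a description of our exact tooling.

```python
# Hypothetical sketch: find the decision records that apply to a set of paths,
# so they can be attached to a task before an agent (or a human) starts work.
# Assumes records live in docs/decisions/*.md and contain a line like:
#   Applies-to: src/auth/**, src/sessions/**
from fnmatch import fnmatch
from pathlib import Path

DECISIONS_DIR = Path("docs/decisions")


def applies_to(record: Path) -> list[str]:
    """Extract the 'Applies-to:' glob patterns from a decision record."""
    for line in record.read_text().splitlines():
        if line.lower().startswith("applies-to:"):
            return [p.strip() for p in line.split(":", 1)[1].split(",") if p.strip()]
    return []


def relevant_records(task_paths: list[str]) -> list[Path]:
    """Return decision records whose patterns match any path the task touches."""
    hits = []
    for record in sorted(DECISIONS_DIR.glob("*.md")):
        patterns = applies_to(record)
        if any(fnmatch(path, pat) for path in task_paths for pat in patterns):
            hits.append(record)
    return hits


# Example: before handing out a session-handling task, include these records
# in the task context so the relevant history travels with the work.
print(relevant_records(["src/auth/session.py"]))
```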

Second, we limited scope. The agent could modify anything within a feature boundary, but cross-cutting changes required human approval. Not because the agent couldn't do them well, but because those changes depended on global context.
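As a rough sketch of what that boundary check can look like, assuming features live under src/&lt;feature&gt;/ and that the diff against the main branch describes a change; the paths and helper names here are illustrative rather than our actual pipeline.

```python
# Illustrative scope guard for agent-authored changes.
# Flags any change that touches more than one feature directory (or files
# outside src/) so it drops out of the automated lane and waits for a human.
import subprocess
import sys


def changed_files(base: str = "origin/main") -> list[str]:
    """List files touched relative to the base branch."""
    out = subprocess.run(
        ["git", "diff", "--name-only", base, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]


def feature_of(path: str) -> str | None:
    """Map a path like src/auth/session.py to its feature name ('auth')."""
    parts = path.split("/")
    if len(parts) >= 2 and parts[0] == "src":
        return parts[1]
    return None  # anything outside src/ is treated as cross-cutting


def main() -> int:
    features = {feature_of(p) for p in changed_files()}
    if len(features) > 1 or None in features:
        print("Cross-cutting change: route to a human reviewer.")
        return 1  # non-zero exit blocks the automated merge path
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

The point of a check like this isn't that the agent can't make cross-cutting changes; it's that those changes get a human in the loop by default.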

Third, we separated exploration from implementation. The agent could propose refactoring patterns, but we'd validate the direction before applying them broadly.

What Improved

The agent didn't get better at coding. Our system got better at maintaining coherence.

Drift slowed. We started seeing patterns hold across changes. Code reviews shifted from reconstructing context to evaluating intent.

Interestingly, the agent's output quality stayed the same. What changed was our ability to integrate that output without losing the thread of what we were building.

What This Meant for Other Teams

We started comparing notes. Turns out, a lot of teams hit similar issues around the same time—just in different ways.

One team found their API design fragmenting.

Another noticed their error handling becoming inconsistent.

A third watched their database access patterns slowly diverge from their performance requirements.

Same root cause:

Local optimization by autonomous agents tends to surface as architectural drift rather than immediate defects.

The teams that adapted fastest weren't the ones with the best AI tools. They were the ones who'd already built systems to maintain architectural clarity—design docs, ADRs, strong conventions. The AI just exposed how much other teams had been relying on implicit knowledge.

Where We Are Now

We still use AI agents for code. They're legitimately useful for a lot of tasks.

But we're clearer about what they're good at and what they're not. They're excellent execution engines with no institutional memory. That's not a limitation to fix—it's a characteristic to design around.

The first thing that broke wasn't our code. It was our assumption that intelligence alone creates coherence. Turns out, coherence requires continuity. And continuity requires something agents don't have: persistent context that survives between tasks.

We're still figuring out the right balance. But at least now we know what we're balancing.
