Daniel Kosbab · 4 min read

The thing coding agents still can't do

The story about coding agents in 2025 is that they write most of the code.

The reality is more specific. They generate code matching common patterns. They don't yet reason about constraints that live only in your codebase.

The gap is small in typical cases. It is catastrophic in the specific cases that matter.

What they do well

Boilerplate with a well-known shape: standard CRUD endpoints, common component patterns, migrations that follow familiar templates, tests that look like other tests. Anything where the right answer is "do the usual thing."

This is a large fraction of engineering work, and the win is real: the productivity gain from shipping that fraction faster is genuine.

What they still miss

Things that live only in the history and context of your specific codebase.

  • The "we used to do it that way and it caused an outage" knowledge. Not in the code. In the postmortem. In the engineer's head.
  • Subtle invariants not encoded in types. "This function assumes the caller has already validated X." Tests don't always catch these. Agents don't always read them.
  • Cross-module consistency. A change that looks right in one file but breaks a convention the team has been maintaining across twenty files.
  • The stuff that looks redundant but isn't. A double-check. An extra assertion. A duplicated guard. Agents aggressively remove things that look like dead code. A fraction of the time, that code was load-bearing in a way the tests don't cover.
  • Decisions whose rationale isn't in the repository. "We don't use ORM here because of a performance issue we hit two years ago." Not in the README. Not in the code. In the tribal knowledge.
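The "subtle invariant" case can be sketched concretely. The function below is hypothetical; the point is that nothing in its signature encodes the assumption the comment carries, so an agent reading this file in isolation cannot know the invariant exists:

```python
def apply_discount(price_cents: int, percent: int) -> int:
    # Invariant (not in the types): the caller has already clamped
    # percent to the range 0..100. Nothing here enforces it.
    return price_cents - price_cents * percent // 100

# Honors the invariant: 10% off 1000 cents.
apply_discount(1000, 10)   # 900

# Violates it: a valid-looking call that yields a negative price.
apply_discount(1000, 150)  # -500
```

The second call type-checks, compiles, and looks reasonable in review; only the out-of-band invariant marks it as wrong.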

Each is addressable with careful context-passing. Most projects don't do it.

The specific failure mode

Agents are confident. They produce code that compiles, passes tests, and looks right. The confidence is warranted on most changes. It is not warranted on the changes that touch the project's specific history.

Typical shape of the bug: an agent removes a defensive check that was added after an incident. The check wasn't in the test suite because it was paranoia, not behavior. It was there because a specific rare input had once reached that code path and caused corruption.

Agent sees the check, sees it is never hit by tests, removes it as dead code. Reviewer misses it because the diff looks like a cleanup. Two months later, the rare input comes back.
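A minimal sketch of that shape, with hypothetical names. The guard is never exercised by the test suite, so it reads as dead code; in fact it is the only thing standing between a rare malformed input and corrupted state:

```python
from typing import Optional

def merge_update(record: dict, payload: Optional[dict]) -> dict:
    # Added after an incident: a malformed upstream message once arrived
    # with an empty payload and wiped fields from stored records. No test
    # covers this branch, so it looks like dead code. It is not.
    if not payload:
        return record
    return {**record, **payload}
```

Delete the `if not payload` branch and every existing test still passes; the failure only returns when the rare input does.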

This has a name in the literature now: "context erosion." It is the same kind of mistake a new engineer would make in their first week, except scaled across every refactor in a codebase.

Why this won't be fixed by bigger models

The information the agent needs is not in the codebase. It is in the team.

A bigger model that reads the same code better still has no access to the incident postmortem, the team Slack, or the architectural debate from last year that produced the decision. The knowledge lives in places the agent cannot reach.

Closing the gap is a context-passing problem, not a model problem. Some of the context can be made machine-readable. A lot of it is in human channels and will stay there until the cost of migrating it is lower than the cost of losing it.

What this means for how you use them

Use agents with full awareness of the gap.

  1. Keep rationale in the code, not in the ticket. A comment that says "do not remove, see incident #4712" costs five seconds and saves an outage.
  2. Review agent diffs with "what did it delete" as the primary question. The additions are usually fine. The deletions are where project-specific knowledge gets lost.
  3. Don't let an agent refactor at scale without a human who knows the codebase reviewing each file. "Clean up the codebase" is the highest-risk prompt you can give.
  4. Treat agents as fast junior engineers with excellent training but no tenure. That framing matches their actual capabilities pretty well. It also sets the right review depth.
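Point 1 can be this cheap in practice. A hypothetical sketch, reusing the incident number above as a placeholder:

```python
def normalize_path(path: str) -> str:
    # DO NOT REMOVE: see incident #4712. A percent-encoded dot segment
    # once slipped past the router's normalization and reached the
    # filesystem layer. This check is deliberate, not dead code.
    if "%2e" in path.lower():
        raise ValueError("refusing path with encoded dot segments")
    return path.replace("//", "/")
```

The comment costs one line and gives both the agent and the reviewer the exact context the diff would otherwise erase.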

The position

Coding agents are already useful. Their useful range is well-understood patterns, in codebases where the context is machine-readable or the changes are shallow enough not to depend on it.

What they still can't do: reason about the parts of your system that exist only because of decisions made in rooms they were not in. Until the context-passing problem is solved, that gap defines where the human in the loop still matters.

© 2026 Daniel Kosbab
