Working Frontier

Cognitive Debt

Andrei-Mihai Nicolae — Tue, 28 Apr 2026 06:01:05 GMT

Before agents, a natural governor kept software complexity in check. Humans wrote the code. Humans are slow. Because we were slow, we kept up with what we built. Building speed and comprehension speed moved roughly together.

That coupling is gone.

An agent produces in minutes what a developer once needed days to write. It writes correct code, passes the tests, ships. But the developer who triggered the task understands that code no better than if a stranger had written it. The codebase grew. The team’s comprehension did not.

This is cognitive debt. Not technical debt—the code works. Cognitive debt is the widening gap between what your system does and what your team understands about how it does it.

Technical debt has a sibling

Technical debt is familiar. You cut a corner, you know you cut it, you plan to fix it later. The debt is visible.

Cognitive debt is invisible. The code works. Nothing looks wrong. But nobody on the team can explain why the service handles retries the way it does, or what happens when the cache layer fails. The system is correct and opaque.

Cognitive debt is worse than technical debt: you discover it only when you need the understanding you lack. A production incident hits a module an agent built three months ago. Nobody remembers the design—it passed CI and shipped. Now you are debugging a system you do not understand, under pressure, with no working picture to guide you.

Technical debt slows you down. Cognitive debt leaves you lost.

The compounding problem

Cognitive debt compounds faster than technical debt.

Every module an agent builds that you do not understand makes your next decision worse. Not because the code is bad—because your mental model of the system is incomplete. You approve an architectural choice that conflicts with a service an agent built last month. You make integration decisions from an understanding that is months out of date.

When humans built every module, they understood it. The team’s collective understanding stayed current with the codebase. At agent speed, the codebase outruns your mental model within weeks. Catching up means reading code you did not write, for a design you did not choose. Most teams will not do it. The debt accumulates in silence.

Existing tools fall short

Code review shares knowledge, but it was built for human-speed delivery. When agents produce changes faster than reviewers can read them, review becomes a bottleneck or a rubber stamp.

Documentation is always stale. It is worse now because the rate of change has accelerated while documentation habits stayed frozen.

Tests prove behavior, not understanding. A passing test suite does not tell you why the system works, how the pieces connect, or what assumptions underpin the design. Full test coverage and full cognitive debt coexist easily.

AI-generated summaries help, but only at the surface. An agent can walk you through a module, answer questions, diagram dependencies. That aids comprehension. It does not aid navigation. When a production incident strikes, the problem is rarely “I cannot understand this code.” The problem is “I do not know which of forty services to look at first.” You can ask an agent to explain a module, but you cannot ask it which of three modules you have never read is causing your outage. The navigational problem precedes the explanatory one.

If full comprehension is impossible, the question becomes: what replaces it?

Nobody knows every street

You will not solve cognitive debt by understanding everything. Not at agent speed. The volume is too high, the rate of change too fast.

Think about a city. Nobody knows every street—not the mayor, not the taxi drivers, not the lifelong residents. But cities work. People navigate them every day, not because they memorized the layout, but because cities are navigable. Street signs, grid systems, neighborhoods with distinct character.

Your codebase must become a city, not a maze. A maze is complex and unsigned—you navigate it only by memorizing the path. A city is complex and signed everywhere. The complexity remains, but it is legible. Most codebases built at agent speed will become mazes by default, because agents build for correctness, not human navigation.

Making the codebase navigable

Forget full comprehension. Navigability becomes the goal—making a system findable and orientable for a human who did not build it and lacks time to read all of it.

Explicit architecture. A top-level document describes the major components, their responsibilities, and how they connect. Each component carries its own doc: the data model it owns, the contracts it exposes, the assumptions it depends on. When an agent adds a service, someone updates both layers. Without them, every exploration starts from scratch.
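One way to keep both layers from drifting is a small check that fails when a component has no doc. This is a minimal sketch, not a prescribed tool: the `ARCHITECTURE.md` filename, the `services/` directory, and the per-component `README.md` are all assumptions made for illustration.

```python
from pathlib import Path


def missing_docs(repo_root, components_dir="services", doc_name="README.md"):
    """Return the doc paths that should exist but do not.

    The layout is hypothetical: one top-level ARCHITECTURE.md plus one
    doc file inside every directory under `components_dir`.
    """
    root = Path(repo_root)
    missing = []

    # Layer 1: the top-level map of the system.
    if not (root / "ARCHITECTURE.md").is_file():
        missing.append("ARCHITECTURE.md")

    # Layer 2: one local doc per component.
    base = root / components_dir
    if base.is_dir():
        for component in sorted(p for p in base.iterdir() if p.is_dir()):
            if not (component / doc_name).is_file():
                missing.append(f"{components_dir}/{component.name}/{doc_name}")

    return missing


if __name__ == "__main__":
    gaps = missing_docs(".")
    for path in gaps:
        print(f"missing: {path}")
    raise SystemExit(1 if gaps else 0)
```

Run in CI, a check like this only proves the docs exist, not that they are accurate. It does make "someone updates both layers" harder to forget when an agent adds a service.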

Consistent patterns and clear boundaries. When every service handles errors, retries, and logging the same way, a developer who understands one service can reason about all of them. Pair that consistency with explicit module interfaces, and you get a neighborhood structure: you need not know the whole city, just the block you are working on and where it connects. Agents follow patterns well when told to. Enforce them.
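To make "the same way" concrete, here is a sketch of a single shared retry helper that every service could call instead of inventing its own loop. The function name, the backoff policy, and the log wording are assumptions for illustration, not a pattern taken from any particular codebase.

```python
import logging
import time

logger = logging.getLogger("retry")


def call_with_retries(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on failure, retry with exponential backoff.

    Re-raises the last exception once `attempts` is exhausted. The `sleep`
    parameter is injectable so tests do not have to wait.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                logger.error("giving up after %d attempts: %s", attempt, exc)
                raise
            # Delay doubles each time: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning(
                "attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay
            )
            sleep(delay)
```

If every service routes retries through one helper like this, a developer who has read it once knows how retries behave everywhere. An agent can also be told to use it rather than improvise.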

Decision records. When someone makes a significant design choice—human or agent—write down what was chosen and why. Not a novel. A paragraph. Not in a wiki. Not in a Slack thread. In the repo, next to the code it affects. This history lets future developers understand intent, not just implementation.

Orientation before implementation. Before building, read. Before an agent writes a feature, trace the relevant paths in the system. Not to review the code—to build the understanding that makes future decisions sound. Five minutes of orientation saves hours of debugging.

The discipline is new

Technical debt has decades of shared language—linters, refactoring sprints, architecture reviews. Cognitive debt has none. Most teams feel the symptoms—slow debugging, bad decisions, fragile changes—but never name the cause.

Name it. Then build a city, not a maze.

Technical Debt Is Dead

Andrei-Mihai Nicolae — Tue, 21 Apr 2026 06:02:02 GMT

Technical debt was always an economic problem. Not a laziness problem, not a skill problem — an economics problem. Good code took time. The right abstraction, the clean interface, the proper error handling — all of it cost hours or days of a developer’s attention. And developer attention was the scarcest resource on any team.

So teams cut corners. They shipped the quick version because the right version cost too much. They hardcoded the config, skipped the validation, duplicated the logic, and wrote a TODO. Not because they wanted to — because the alternative was missing the deadline. The tradeoff was real: ship now with debt, or ship later with quality. Every team chose ship. This applied most directly to implementation-level debt — code quality, test coverage, validation, naming — the kind that accumulates line by line under deadline pressure.

That tradeoff no longer exists.

The price of good code collapsed

An AI agent writes the clean version as fast as the messy version. The proper abstraction costs the same as the shortcut — minutes, not days. Ask an agent for a well-structured service with proper error handling, clear interfaces, and comprehensive tests. It delivers. Ask it for a quick hack. It delivers that too, just as fast. Agents still need review and still make mistakes — but a clean first draft that needs minor correction is a different problem than a mess that needs a rewrite.

When the clean solution and the shortcut cost the same, choosing the shortcut is not pragmatism. It is habit.

Teams still carry the old instinct. Decades of expensive code trained the instinct to cut corners. Developers learned that “do it right” meant “do it slow.” That association runs deep. But the underlying economics have changed, and the instinct has not caught up.

The old excuses

Every excuse for technical debt traces back to the same root: we did not have time.

We did not have time to write the abstraction. We did not have time to add the validation. We did not have time to write the tests. We shipped what worked and promised to come back later. We rarely came back.

Time was the constraint, and cutting corners was the release valve. Remove the constraint and the valve is useless.

An agent does not get tired on a Friday afternoon and skip the edge cases. It does not get bored writing the third integration test. It does not decide that “good enough” is good enough because the sprint ends tomorrow. It applies the same rigor to the last test as the first. The discipline failures that created technical debt were human failures rooted in fatigue, boredom, and deadlines. For an agent, those constraints are gone.

What remains

Not all technical debt came from time pressure. Some came from ignorance — the team did not know the right approach when they built it. Some came from changing requirements — the right approach last year is the wrong approach now. Some came from dependencies — a library forced an awkward integration.

These sources still exist. An agent cannot solve a problem the team has not yet understood. It cannot predict next year’s requirements. It cannot fix a bad dependency API.

But these are qualitatively different from time-pressure debt and require different solutions — better discovery, better planning, better vendor choices. Most technical debt, in most codebases, is the accumulated residue of a simple calculation: we could not afford the right solution at the time. That calculation no longer holds.

The new standard

If good code is cheap, the standard changes. Technical debt stops being the cost of shipping and becomes a choice — a bad one.

This means teams need to stop treating debt as normal. Stop planning “tech debt sprints” to clean up messes that agents could have avoided. Stop accepting PRs that take shortcuts when the shortcut saves no time.

The conversation shifts from “when will we pay down the debt?” to “why did we take on debt at all?”

The discipline is different now

The old discipline was prioritization. Which debt do we pay down first? How much time do we allocate? What is the interest rate on this shortcut?

The new discipline is specification. Tell the agent what good looks like. Define the patterns. Enforce the standards. Make the right way the default way, and agents will follow it consistently.

This is easier and cheaper than paying down debt. Instead of writing clean code yourself — which is what made it expensive — you describe what clean means and the agent writes it. The cost of quality shifted from execution to definition.

Teams that still accumulate technical debt in 2026 are not under pressure. They are under-specified. They left “good” undefined, so the agents produce whatever works. The debt is not a tradeoff — it is a configuration error.

The irony

For decades, the industry built tools, processes, and entire careers around managing technical debt. Linters to catch it. Refactoring tools to fix it. Sprint ceremonies to prioritize it.

Now the same technology that made code cheap enough to write well makes most of those tools obsolete. You do not need a tech debt sprint if you never take on the debt. You do not need a refactoring tool if the agent writes it clean the first time.

The irony: AI agents kill technical debt not by paying it down but by making it pointless to accumulate.

Good code used to be a luxury. Now it is the default — if you ask for it.

Ask for it.

Not Every Session Needs a Plan

Andrei-Mihai Nicolae — Tue, 14 Apr 2026 06:00:37 GMT

Most coding agents ship with a plan mode. Some have already removed it — Amp dropped theirs, deciding it added ceremony without adding value. Others still treat it as the recommended starting point: outline the approach, list the files, describe the changes, get approval, then execute. It sounds disciplined. It also sounds familiar. It is waterfall with a different font.

The instinct comes from a reasonable place. You want the session to go well. You want to avoid wasted work. So you front-load specification. But the problem is the same one that sank waterfall in teams: you cannot specify what you do not yet understand. A plan written before you have touched the code is a guess dressed as a decision. It feels productive. It is often wrong.

Planning is not the problem

Planning itself is fine. Boris Cherny, one of the creators of Claude Code, starts most sessions in plan mode and iterates on the plan until he likes it. That works because he treats the plan as a conversation, not a contract. His tasks are pull-request-sized features with known scope. The plan aligns direction. Then he moves.

The problem starts when planning becomes a default for every session. When you reach for plan mode before a bug fix, before a spike, before an exploration — before you even know what the task demands. That is not discipline. That is comfort. You are writing a PRD because it feels productive, not because the task needs one.

Match the workflow to the task

Not every task needs the same starting point.

If you are building a feature with clear scope — a new API endpoint, a settings page, a migration — plan mode earns its keep. You know roughly what the result looks like. A brief plan aligns direction before the work begins.

If you are fixing a bug where the behavior is already clear, a plan is overhead. A failing test is a better starting point. Pin the contract, confirm the failure, then fix it. I wrote about this approach recently: make the test fail first, then make the narrowest change that turns it green.

If you are exploring — spiking an idea, trying a library, feeling out an architecture — you do not know enough to plan. Start coding. Let the shape emerge. Planning an exploration is a contradiction.

The session is a conversation

The deeper issue is how you think about the session itself.

Plan mode tempts you into a handoff mentality: specify, then execute. But the best sessions are not handoffs. They are conversations. You start with a direction, the model makes something, you react, it adjusts, you push further. The work improves because both sides contribute judgment along the way.

That is the agile insight applied to a different scale. You do not need a two-week sprint to benefit from iteration. A thirty-minute session works the same way. Start with enough direction to move. Adjust as you learn. Trust that course correction is cheaper than perfect specification.

Pick the right tool

Plan mode is a tool. Red-green TDD is a tool. Freeform exploration is a tool. The mistake is picking one and applying it everywhere.

Plan when the scope is known. Test-first when the contract is clear. Explore when it is not. And in every case, stay in the loop. The session is not a spec you hand off. It is a conversation you steer.

Software Feels Free Again

Andrei-Mihai Nicolae — Tue, 07 Apr 2026 06:02:11 GMT

I installed a popular read-later app last month. It worked well. I saved articles, organized links, built a small library. Nothing told me I was on a trial.

Then the trial ended. The app locked me out. Not “you can still read what you saved, but you need to upgrade to add more.” Locked out entirely. My data, behind a paywall, with no warning that the clock had been running.

I did not pay. I opened my editor and started building a replacement.

The reflex that changed

Two years ago, that impulse would have been a fantasy. Building a read-later app from scratch meant weeks of work: storage, syncing, parsing, a decent interface. The rational move was to pay, grumble, and move on.

Now the rational move is different. With an AI agent and a weekend, I had a working app. It did exactly what I needed, nothing more. No pricing tiers, no dark patterns, no trial clock ticking in the background. Mine.

That is the shift. The cost of building a simple tool for yourself has collapsed. What used to take weeks takes hours. What used to require a team requires one person with a clear idea and good tools.

The subscription trap loses its leverage

Subscription software depends on a gap between what users want and what they can build. The wider the gap, the more leverage the vendor holds. You pay monthly because the alternative—building it yourself—is too expensive, too slow, or too hard.

That gap is closing fast.

When building a basic replacement takes a day, the threat changes direction. The vendor no longer holds your data hostage. You hold the option to leave. Not in theory—users could always leave in theory. In practice, with current tools, they actually can.

Some subscriptions will survive. Some products carry genuine ongoing costs: server infrastructure, real-time data, large-scale collaboration. A monthly price makes sense when the service costs money to run every month. But most subscription software is not infrastructure. Most of it is static functionality repackaged as a recurring charge—a tax on the gap between wanting and building.

That tax just got much harder to collect.

Software feels free again

For a long time, software felt like something imposed on you.

You picked from the options available, accepted the pricing, tolerated the dark patterns, and worked around the limitations. If the vendor changed the terms, you adapted or left. That was the relationship: you rented, they decided.

Once you know you can build, that posture changes.

You open a new app and notice the pricing page before the features. You calculate how long it would take to build the parts you actually need. The question shifts from “can I afford this?” to “does this earn my use over what I would build myself?” That makes you a peer, not a captive.

Software stops feeling like renting and starts feeling like making. You run into a problem and reach for your editor before you reach for the App Store. If someone already built it well, you pay for the craft. If not, you build your own. Either way, the choice is yours.

Software feels free again. Not free as in price. Free as in agency. Free as in unlocked.

What is left to sell

If anyone can build a basic version, what justifies charging others?

The answer is polish. When you build for yourself, you solve your own problem. You skip the edge cases that do not affect you, ignore the platforms you do not use, and tolerate rough spots you understand. That is fine for a personal tool. It is not enough for a product.

A product means someone else solved the edge cases, tested on devices you do not own, and made it reliable for people whose workflows differ from yours. That work has real value. It deserves to be paid for.

But it deserves to be paid for once. Polish does have ongoing costs: platform updates, security patches, new devices. Yet the same shift that collapsed build costs collapsed maintenance costs too. A one-time payment, the price of a couple of coffees, is a fair exchange for someone else’s craft. It says: I could build this myself, but you already did, and you did it well. That is worth a few dollars.

A monthly subscription that holds your data hostage says something different. It says: you need me, and I intend to keep collecting.

The new deal

The old deal was simple: you pay monthly because you cannot build it yourself. The new deal is simpler: you pay once for craft, or you build your own.

One-time payments for polished tools. Subscriptions only where there is genuine ongoing cost. Everything else, build it yourself. Not every case is clear-cut. But when you can build the alternative, the vendor has to earn each month. The tools are good enough. The time is short enough. The ceiling on what one person can attempt has moved.

That is where software is heading. Not back to some nostalgic era of shareware and hobbyist code. Forward, to a place where building is cheap, agency is real, and the subscription trap has lost the only leverage it ever had.

The Cheap Subscription Is Not Enough

Andrei-Mihai Nicolae — Tue, 31 Mar 2026 06:01:02 GMT

I used to think a cheap AI subscription was enough to stay current. Pay for a basic plan, use the models when needed, let work cover the rest, and move on. That is enough to sample the tools. It is not enough to build the tacit knowledge that makes them truly useful.

Reps build judgment

Heavy AI use gives you something harder to describe than a prompt library: a feel for the system.

You learn when a model will handle a task cleanly and when it will bluff, how much context it can carry, when a vague instruction is fine, when the task needs tighter scaffolding, when to retry, when to switch models, and when to stop. After enough use, you can often predict the shape of the reply before it arrives.

That judgment does not come from benchmarks or one impressive demo. It comes from volume. You get it by pushing models into real work, watching them fail, trying again, and seeing enough patterns that they stop feeling mystical. They become legible. That is how trust forms. Not blind trust. Working trust.

Volume costs money

This is the part I underestimated.

A cheap plan, even with some work usage on top, is enough to keep you curious. It is not enough to make you fluent.

Fluency requires more usage than most people admit. You need long sessions, failed attempts, exploration, side projects, repeated prompting, and enough room to compare approaches until you understand why one worked and another did not.

That is where the more expensive plans stop looking indulgent. They start looking like the price of serious practice.

Cheap models are useful. Open models matter. But if you want tacit knowledge of what frontier AI can actually do, you have to spend real time at the frontier. Otherwise, you build your intuitions around the limitations of weaker systems and mistake that for realism.

A rare moment to lean in

This is also an extraordinary moment to be alive and to build.

I do not mean that in a childish sense. It is simply rare to live through a tooling shift that expands what one motivated person can attempt this much. You can prototype faster, explore more directions, and try projects that would have been too tedious or too lonely to begin a few years ago. The tools are uneven. They still fail in stupid ways. The larger fact remains: the ceiling has moved.

That is why I think engineers who can afford serious usage should stop treating it as optional software spend. This is not a moral argument, and it is not aimed at people who genuinely cannot afford a higher subscription. It is aimed at people who can afford it and still file it under “nice to have.”

For many engineers, a high-usage frontier subscription is one of the best career investments available right now. Not because paying more is virtuous, but because the return is not only output. It is speed, instinct, ambition, and better judgment.

Make the Test Fail First

Andrei-Mihai Nicolae — Tue, 24 Mar 2026 07:02:45 GMT

At university, and for years after, TDD felt backward to me.

I understood the theory. I still preferred to write the code first, handle the edge cases, and add tests around what I had built. When implementation was the slow part, that order felt natural. The hard work was getting the code onto the page.

Working with agents changed that.

Now an implementation can appear in seconds. The expensive step is no longer producing code. It is deciding whether the result deserves to stay. That changes the job of the test. I no longer think of test-first as a ritual for disciplined programmers. I think of it as a way to pin the contract before the implementation starts drifting.

That is why red-green TDD feels useful to me again.

Make red specific

If you already know the behavior you want, start by making the failure concrete.

Suppose your billing job should skip paused subscriptions, but a paused account is still getting an invoice. A vague prompt would say, `Fix billing for paused subscriptions.` A better prompt would say, `Write a test asserting that the monthly job does not invoice a paused subscription. Run it. Confirm that it fails for the expected reason. Then make the narrowest correct change that makes it pass.`

That difference matters. The first prompt asks the agent to guess. The second asks it to establish a contract.

The test must fail, and it must fail for the right reason. That means the agent has to run it before touching the implementation. If it passes immediately, you have learned something important: the bug report is wrong, the fixture is wrong, or the test never touched the path you care about. If it fails because a seed script broke or a factory cannot build the record, you still do not have the contract. Fix the setup first.

This part is easy to skip, especially with fast code generation. A model can write a plausible test file and jump straight into the change. That is exactly what you do not want. Until the agent has executed the test and seen the right failure, it is still working from a story, not from evidence.

Red is not paperwork. Red is the moment when the expected behavior stops being prose and becomes something the codebase can reject.

Then make green small

Once the failure is real, the implementation gets simpler.

Now the agent is no longer coding against a paragraph. It is coding against an executable example. That narrows the solution space. It also makes wrong answers easier to spot. A change that sounds plausible but does not satisfy the example is not done.
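For the paused-subscription example, a narrow green might look like the sketch below. It assumes a hypothetical in-memory billing job; `Subscription` and `run_monthly_billing` are illustrative names, not a real billing API.

```python
from dataclasses import dataclass


@dataclass
class Subscription:
    customer_id: str
    status: str  # "active" or "paused"


def run_monthly_billing(subscriptions):
    """Return the customer ids to invoice this month."""
    # The narrow change: skip paused subscriptions and touch nothing else.
    return [s.customer_id for s in subscriptions if s.status != "paused"]


def test_paused_subscription_is_not_invoiced():
    subs = [Subscription("c1", "active"), Subscription("c2", "paused")]

    invoiced = run_monthly_billing(subs)

    assert "c1" in invoiced      # active accounts are still billed
    assert "c2" not in invoiced  # the executable example the fix must satisfy
```

A one-line filter is easy to review against the example. A plausible-sounding rewrite of the whole job would have to pass the same two assertions.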

This is the part that feels different in the agent era. Before, test-first could feel like extra typing before the real work. Now the implementation is often the easy part. The hard part is avoiding an answer that is locally convincing and globally wrong.

That shift helps in review too. I can ask two concrete questions instead of one fuzzy one: does the test capture the intended behavior, and does the change satisfy it without breaking adjacent cases? That is a much cheaper judgment than trying to reconstruct intent after the fact.

Where I would not force it

I would not use this pattern for everything.

If I am exploring a new API, tuning UI feel, or spiking an architecture change, I may not know the right assertion yet. In that kind of work, writing the test first can become theater. The contract is still moving.

Red-green pays off when the behavior is clear enough to state precisely: a bug, a business rule, a parser edge case, a regression, a transformation, an authorization boundary. In those cases, the test does not slow the work down. It prevents the work from drifting.

That is also why I care less about strict dogma than about order. I do not need every change to follow a perfect textbook cycle. I do want the contract to become executable before I trust the result.

The prompt can stay short

In practice, the instruction can be brief: `Use red-green TDD for this bug.`

That small prompt carries more structure than it seems to. It means: write the focused test, run it before touching the implementation, check that the failure matches the bug, fix the setup if the failure is wrong, make the narrowest correct change that turns the test green, and rerun the checks.

Sometimes that test is unit-level. Sometimes it is an integration test, a CLI snapshot, or a browser assertion. The important point is not the layer. The important point is that the acceptance criteria exist before the implementation lands.

I used to think TDD felt unnatural because it asked me to describe the answer before I had written it. With agents, it feels natural for the opposite reason. I am not using the test to help me type. I am using it to decide whether a fast answer is trustworthy.

When code arrives cheaply, the scarce resource is confidence. A good failing test is one of the fastest ways to buy it.

What an AI Agent Should Find in a New Component

Andrei-Mihai Nicolae — Tue, 17 Mar 2026 07:01:54 GMT

If you want an AI agent to become useful fast in an unfamiliar component, do not start with a bigger prompt. Start with a component that explains itself.

When an agent lands in `shipping-quotes`, it should be able to answer four questions quickly:

- What does this component own?

- How do I work here?

- How do I verify a change?

- Where does the tricky context live?

If the component cannot answer those questions locally, the agent reads too much, touches too much, and asks too many basic questions. Ownership is fuzzy. The right commands are tribal knowledge. The important context lives in a slide deck, a spreadsheet, or somebody’s memory. So the agent wanders.

That is not always a model problem. Often, it is a component problem.

This is also a different problem from feedback after the edit. Before an agent can verify its own work, it has to orient itself in the component.

What does this component own?

An agent works best when it can reason locally.

That means the component should have a clear boundary, a small blast radius, and obvious contracts. In `shipping-quotes`, the agent should be able to tell what this component owns, what it depends on, and what it should not touch. It should know whether this component computes carrier quotes, consumes package and destination data, applies zone rules, or also owns checkout totals. Those are different jobs. A good boundary makes that visible in the interfaces, the file layout, and the local docs.

This matters more than context-window size. Even with a huge context window, a tangled component is still hard to change. The problem is not only how much code the agent can read. The problem is how much uncertainty the component creates. If every small edit may break three adjacent systems, the cost of acting goes up fast.

The first test is simple: can the agent make a local change without dragging half the repo into scope?

How do I work here?

Once the boundary is clear, the local path should be obvious.

If I drop an agent into `shipping-quotes`, I want it to find a short local guide. That might be `AGENTS.md`, `CLAUDE.md`, or a tight `README`. The name matters less than the job. The file should say what this component owns, which invariants matter, which commands to run, where the tests live, and what not to edit casually.

The component should also expose one obvious way to work. There should be one obvious command to run the local flow, one obvious command to exercise the tests, and one obvious place to look when something fails. If the repo offers four test commands, three half-working scripts, and two undocumented setup paths, the agent spends its first half hour on archaeology.

Good output belongs here too. When something breaks, the error should narrow the problem. “Missing carrier rate for zone” is useful. “Package dimensions exceed supported service limits” is useful. A stack trace with no domain signal is not. Legible components produce legible failures.
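A sketch of how a component might produce that kind of failure: a domain exception that names the zone and carrier, instead of a bare `KeyError` from a dictionary lookup. The rate table, class names, and message format are assumptions for illustration.

```python
class QuoteError(Exception):
    """Base class for domain failures in a hypothetical shipping-quotes component."""


class MissingCarrierRate(QuoteError):
    def __init__(self, carrier, zone):
        super().__init__(
            f"Missing carrier rate for zone {zone!r} (carrier {carrier!r})"
        )
        self.carrier = carrier
        self.zone = zone


# Made-up rate table keyed by (carrier, zone).
RATES = {("acme", "zone-1"): 4.50}


def lookup_rate(carrier, zone):
    """Return the rate, or raise an error that narrows the problem."""
    try:
        return RATES[(carrier, zone)]
    except KeyError:
        # `from None` hides the low-level KeyError so the domain message leads.
        raise MissingCarrierRate(carrier, zone) from None
```

The message tells an agent, or a human, which data is missing and where to look, without reading the stack trace.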

How do I verify a change?

Before an agent can fix a component, it needs a fast way to learn what the component does.

That is why tests matter so much. They do not only verify the change after the edit. They orient the agent before the edit. A strong suite shows what inputs matter, which edge cases the team cares about, and how wide the surface really is.

In `shipping-quotes`, a small suite might reveal the expected output for a domestic order, what happens to oversized packages, how missing carrier rates fail, and which rounding rules matter. In five minutes, that tells the agent more than a long design doc full of abstractions.
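A suite like that might look like the sketch below. Everything in it is invented for illustration: the `quote` function, the rate, the size limit, and the half-up rounding rule are assumptions, not the behavior of a real component.

```python
from decimal import Decimal, ROUND_HALF_UP

MAX_SIDE_CM = 150                              # made-up service limit
RATE_PER_KG = {"domestic": Decimal("1.255")}   # made-up rate table


def quote(zone, weight_kg, longest_side_cm):
    """Toy quote: rate per kg times weight, rounded half-up to cents."""
    if longest_side_cm > MAX_SIDE_CM:
        raise ValueError("Package dimensions exceed supported service limits")
    if zone not in RATE_PER_KG:
        raise LookupError(f"Missing carrier rate for zone {zone}")
    total = RATE_PER_KG[zone] * Decimal(str(weight_kg))
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def test_domestic_order():
    assert quote("domestic", 2, 40) == Decimal("2.51")


def test_rounding_is_half_up_to_cents():
    # 1.255 rounds up to 1.26, not down to 1.25.
    assert quote("domestic", 1, 40) == Decimal("1.26")


def test_oversized_package_is_rejected():
    try:
        quote("domestic", 2, 500)
    except ValueError as exc:
        assert "exceed supported service limits" in str(exc)
    else:
        raise AssertionError("expected a ValueError")


def test_missing_carrier_rate_fails_loudly():
    try:
        quote("antarctica", 2, 40)
    except LookupError as exc:
        assert "Missing carrier rate for zone" in str(exc)
    else:
        raise AssertionError("expected a LookupError")
```

Read top to bottom, the test names alone describe the surface: the happy path, the rounding rule, and the two ways a quote can fail.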

If a component is hard to understand, the fix is often not more prose. It is better tests. A good test file is both a guardrail and a map.

Where does the tricky context live?

The hard part of a component is often not the syntax. It is the domain logic.

That context should live where the agent can read it: in or near the component. Examples, fixtures, schemas, expected outputs, notes on invariants, and short explanations of weird cases all help. They are discoverable, versioned, searchable, and easy for an agent to inspect.

If the meaning of `shipping-quotes` lives in a slide deck, a spreadsheet, or somebody’s memory, the agent starts blind. The operator then has to shuttle basic context into the prompt by hand. That is slow, fragile, and easy to get wrong.

This does not mean every component needs a Markdown essay. It means the important context should exist in agent-readable form and sit beside the work. A canonical fixture is better than a paragraph. A sample output is better than a vague comment. A short glossary beside the code is better than a forgotten deck.

The standard to aim for

A good operator should be able to drop an agent into `shipping-quotes` without writing a custom tour guide every time.

The agent should find a clear boundary, one obvious way to work, tests that teach the terrain, and context that lives beside the code. If it cannot, a bigger prompt will not solve the underlying problem. The component is still illegible.

That is the standard I care about. Not whether an agent can eventually muddle through, but whether it can land in an unfamiliar component and become useful fast.

If you want better work from agents, stop treating the prompt as the only interface. The component is an interface too.

The Three Feedback Loops of Agentic Software

Andrei-Mihai Nicolae — Fri, 13 Mar 2026 11:34:25 GMT

Most teams try to improve agents through prompt work. They rewrite instructions, add context, and ask for a better plan. That helps, especially when the instructions are weak. But once the basics are good enough, the main bottleneck is usually the feedback loop.

If an agent cannot see what happened, check its own work, and recover quickly, you cannot trust it with much autonomy. You end up doing courier work: take the screenshot, paste the log, restate the bug, and ask it to try again. The system looks agentic, but the human is still carrying evidence back and forth.

I think about this as three nested loops. They are not three competing philosophies. They are three layers of feedback that widen what the agent can do and narrow what the human must do.

- Loop 1: the human verifies after the fact

- Loop 2: the agent verifies its own work

- Loop 3: the human closes the loop on reality

Better feedback loops are what let agents work with more autonomy without losing contact with reality.

Loop 1: The human verifies after the fact

This is where most teams start.

Imagine a small web demo with an SVG orbit and a moving marker. You ask the agent to adjust the motion, line up the marker with a target, or change the animation. The agent edits the code and says it is done. Then you open the app, inspect the result, take a screenshot, and describe what is still wrong.

The loop works. It is also slow.

The human becomes the transport layer between the agent and the finished state. Because the agent cannot inspect the result on its own, it cannot correct itself. Every iteration depends on another round trip through you.

Loop 2: The agent verifies its own work

The second loop starts when the agent can inspect the system directly.

Give the same orbit demo browser automation, screenshots, logs, DOM assertions, and a fast test runner. Now the agent can change the code, open the page, capture the result, check whether the marker reached the target, and try again without waiting for a human to relay the evidence.

That is the key step up from Loop 1. The agent no longer needs the human to act as courier for basic evidence. It can compare the rendered result against what the code was supposed to produce. But the human often still has to do the last piece of work: open the final state, decide whether it really makes sense, and translate that judgment back into instructions for the agent.

The step from Loop 1 to Loop 2 is larger than it sounds. Better tools do not merely make the workflow nicer. They increase how much autonomy you can safely allow.

The same pattern holds outside toy demos. A broken onboarding form, a flaky CLI flow, or a staging-only UI regression all become easier when the agent can run the app, inspect structured output, and verify the result directly.

The analogy is simple. A developer can write software in a bare text editor. The same developer will move faster and make fewer mistakes with search, debugging, tests, and fast feedback. Agents are no different. A shell and a pile of text are enough to start. Logs, structured file access, screenshots, and browser tooling are what make self-correction practical.

That is Loop 2. The agent can inspect its own work, verify the result, and retry without waiting for a human to relay the evidence.
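To make the self-check concrete, here is a small sketch of the kind of assertion the agent could run against the orbit demo. It assumes a circular orbit and computes the marker position from an angle; in a real setup the position would more likely be read from the rendered page through browser automation. The function names and the one-pixel tolerance are invented for this example.

```python
import math


def marker_position(cx: float, cy: float, radius: float, angle_deg: float):
    """Point on a circular orbit, in SVG coordinates (y grows downward)."""
    a = math.radians(angle_deg)
    return (cx + radius * math.cos(a), cy + radius * math.sin(a))


def check_marker_on_target(marker, target, tolerance_px: float = 1.0) -> dict:
    """Return a structured verdict instead of a bare boolean."""
    distance = math.dist(marker, target)
    return {
        "ok": distance <= tolerance_px,
        "distance_px": round(distance, 3),
        "tolerance_px": tolerance_px,
    }


# Hypothetical scenario: orbit centered at (100, 100) with radius 50.
marker = marker_position(100, 100, 50, 90)
result = check_marker_on_target(marker, target=(100, 150))
```

Because the verdict includes the measured distance, a failed check tells the agent how far off it is, not only that it is off.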

If you want agents to do more unsupervised work, start here. Improve what they can see and shorten the time it takes them to check themselves. But do not stop there. The next step is to build a shared playground around the app.

Loop 3: The human closes the loop on reality

Loop 3 starts when you build a shared playground around the app.

Loop 2 is about self-verification. The agent can run checks, inspect output, and correct itself. Loop 3 adds a shared surface for inspection. The agent and the human can open the same exact state on demand, look at the same evidence, and talk about the same thing.

That distinction matters. Passing tests is not the same as shipping the right thing. A screenshot can prove that the marker is on the target. It cannot tell you whether the motion feels awkward, whether the interaction matches the real intent, or whether the change solves the user’s problem.

Take the orbit demo one step further. Add query parameters that pin the marker to an exact angle, set the target, pause the motion, and render known scenarios on demand. Now the agent can open a precise state by itself. It can run checks and capture screenshots. The human can open the same URL and inspect the same state. The agent closes the verification loop; the human closes the product loop.
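A sketch of how those query parameters might map to a pinned state is shown below. The parameter names (`angle`, `target`, `paused`), the `Scenario` type, and the localhost URL are assumptions for illustration; the demo's real names may differ.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlsplit


@dataclass(frozen=True)
class Scenario:
    angle_deg: float = 0.0
    target_deg: float = 0.0
    paused: bool = False


def scenario_from_url(url: str) -> Scenario:
    """Turn a URL into one exact, reproducible state of the demo."""
    params = parse_qs(urlsplit(url).query)

    def first(name: str, default: str) -> str:
        # parse_qs returns a list per key; take the first value or the default.
        return params.get(name, [default])[0]

    return Scenario(
        angle_deg=float(first("angle", "0")) % 360,
        target_deg=float(first("target", "0")) % 360,
        paused=first("paused", "0") in {"1", "true"},
    )


# Hypothetical URL that both the agent and the human can open.
scenario = scenario_from_url("http://localhost:8000/?angle=450&target=90&paused=1")
```

The URL is the shared artifact: the agent can attach it to a report, and the human can open it and see the same state without replaying any steps.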

The form of that playground depends on the app. In a web app, it might be query parameters, fixtures, and debug routes. In a CLI app, it might be a scripted tmux session with fixed terminal size, seeded input, known fixtures, and a command that recreates the flow on demand. The tool changes. The principle does not: build a surface where the agent and the human can inspect the same reality.

That is the job the human should keep. Not micromanaging every step. Not approving every command. The human anchors the work to product intent, taste, and real-world constraints.

What to add in practice

If you want agents to do more useful work, shape the environment around them. In a real repo, that often means adding a few boring, high-leverage affordances:

- Build a shared playground for the app. For a web app, that may mean fixtures, query parameters, and debug routes. For a CLI app, it may mean a scripted tmux session with fixed dimensions, seeded input, and known fixtures.

- Expose important state on purpose. Add seed scripts, stable fixtures, and one-command setup paths that let an agent open a known state without guessing.

- Make verification cheap. Keep one fast command for focused tests, one reliable way to run the app, and one short path to capture logs or screenshots.

- Prefer structured tools over scraping. Return JSON from diagnostics, use stable DOM locators, and expose clear file or API interfaces instead of forcing brittle shell parsing.
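As one example of the last point, a diagnostics command can emit JSON instead of free-form text. The sketch below is illustrative only: the `diagnostics` function, the check names, and the report fields are hypothetical.

```python
import json


def diagnostics(checks: dict[str, bool]) -> str:
    """Summarize named health checks as a stable, machine-readable report."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    report = {"ok": not failed, "failed": failed, "total": len(checks)}
    # sort_keys keeps the output stable across runs, which makes diffs meaningful.
    return json.dumps(report, sort_keys=True)


# Hypothetical check results gathered elsewhere in the app.
print(diagnostics({"database": True, "cache": False, "fixtures_loaded": True}))
```

An agent can load this with any JSON parser and branch on `failed` directly, rather than pattern-matching on log lines that may change wording between releases.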

None of this is glamorous. It is infrastructure for feedback. But it changes the shape of the work. When the system is legible, the agent can move faster. When checks are cheap, the agent can recover faster. When the human and the agent can inspect the same state, trust gets easier to build.

A small toy project helps because it makes these loops easy to see. But the lesson is not about toys. In real teams, the ceiling on agent autonomy is usually set less by intelligence than by feedback. If agents can inspect state, run checks, and recover on their own, they stop needing a human to ferry evidence between attempts. If they can step into a shared playground built for the app, the human can stop acting as courier and start acting as judge.

Prompting still matters. But once the task is understood, the practical questions are simpler: What can the agent see? What can it verify? What shared playground can it open with you? Better answers to those questions are what turn a clever demo into a reliable way of working.
