Make the Test Fail First
A practical way to use red-green TDD with AI agents
At university, and for years after, TDD felt backward to me.
I understood the theory. I still preferred to write the code first, handle the edge cases, and add tests around what I had built. When implementation was the slow part, that order felt natural. The hard work was getting the code onto the page.
Working with agents changed that.
Now an implementation can appear in seconds. The expensive step is no longer producing code. It is deciding whether the result deserves to stay. That changes the job of the test. I no longer think of test-first as a ritual for disciplined programmers. I think of it as a way to pin the contract before the implementation starts drifting.
That is why red-green TDD feels useful to me again.
Make red specific
If you already know the behavior you want, start by making the failure concrete.
Suppose your billing job should skip paused subscriptions, but a paused account is still getting an invoice. A vague prompt would say, `Fix billing for paused subscriptions.` A better prompt would say, `Write a test that proves a paused subscription is not invoiced by the monthly job. Run it. Confirm that it fails for the expected reason. Then make the narrowest correct change that makes it pass.`
That difference matters. The first prompt asks the agent to guess. The second asks it to establish a contract.
The test must fail, and it must fail for the right reason. That means the agent has to run it before touching the implementation. If it passes immediately, you have learned something important: the bug report is wrong, the fixture is wrong, or the test never touched the path you care about. If it fails because a seed script broke or a factory cannot build the record, you still do not have the contract. Fix the setup first.
This part is easy to skip, especially with fast code generation. A model can write a plausible test file and jump straight into the change. That is exactly what you do not want. Until the agent has executed the test and seen the right failure, it is still working from a story, not from evidence.
Red is not paperwork. Red is the moment when the expected behavior stops being prose and becomes something the codebase can reject.
Then make green small
Once the failure is real, the implementation gets simpler.
Now the agent is no longer coding against a paragraph. It is coding against an executable example. That narrows the solution space. It also makes wrong answers easier to spot. A change that sounds plausible but does not satisfy the example is not done.
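For the billing example above, the narrowest correct change is often a one-line filter. Again a hypothetical sketch, with the same caveat that the job and data shapes are invented for illustration:

```python
# Hypothetical sketch of the green step: the smallest change that
# satisfies the executable example. Names are illustrative, not a real API.

def monthly_billing_job(subscriptions):
    # Narrowest correct change: skip paused subscriptions.
    return [sub["id"] for sub in subscriptions if sub["status"] != "paused"]

def test_paused_subscription_is_not_invoiced():
    subs = [
        {"id": "sub_active", "status": "active"},
        {"id": "sub_paused", "status": "paused"},
    ]
    invoiced = monthly_billing_job(subs)
    assert "sub_paused" not in invoiced
    # Adjacent case: active subscriptions are still invoiced.
    assert "sub_active" in invoiced

test_paused_subscription_is_not_invoiced()
print("green: paused skipped, active still billed")
```

The point is not the filter itself. It is that the test which failed before now passes without widening the change, and the adjacent case guards against a fix that quietly stops billing everyone.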
This is the part that feels different in the agent era. Before, test-first could feel like extra typing before the real work. Now the implementation is often the easy part. The hard part is avoiding an answer that is locally convincing and globally wrong.
That shift helps in review too. I can ask two concrete questions instead of one fuzzy one: does the test capture the intended behavior, and does the change satisfy it without breaking adjacent cases? That is a much cheaper judgment than trying to reconstruct intent after the fact.
Where I would not force it
I would not use this pattern for everything.
If I am exploring a new API, tuning UI feel, or spiking an architecture change, I may not know the right assertion yet. In that kind of work, writing the test first can become theater. The contract is still moving.
Red-green pays off when the behavior is clear enough to state precisely: a bug, a business rule, a parser edge case, a regression, a transformation, an authorization boundary. In those cases, the test does not slow the work down. It prevents the work from drifting.
That is also why I care less about strict dogma than about order. I do not need every change to follow a perfect textbook cycle. I do want the contract to become executable before I trust the result.
The prompt can stay short
In practice, the instruction can be brief: `Use red-green TDD for this bug.`
That small prompt carries more structure than it seems to. It means: write the focused test, run it before touching the implementation, check that the failure matches the bug, fix the setup if the failure is wrong, make the narrowest correct change that turns the test green, and rerun the checks.
Sometimes that test is unit-level. Sometimes it is an integration test, a CLI snapshot, or a browser assertion. The important point is not the layer. The important point is that the acceptance criteria exist before the implementation lands.
I used to think TDD felt unnatural because it asked me to describe the answer before I had written it. With agents, it feels natural for the opposite reason. I am not using the test to help me type. I am using it to decide whether a fast answer is trustworthy.
When code arrives cheaply, the scarce resource is confidence. A good failing test is one of the fastest ways to buy it.