Specification-Driven Engineering with AI Agents
I’ve been trying to get more out of AI. Recently I’ve been exploring two philosophies.
Philosophy 1: Long-Running Background Agents
The idea is to always have agents running in the background: you set up a task, go do your own work, and come back to results. I mainly got this idea from https://mitchellh.com/writing/my-ai-adoption-journey and https://newsletter.pragmaticengineer.com/p/mitchell-hashimoto He seems to spend the first and last 30 minutes of his workday setting up research or longer-running tasks for AI.
I tried this and couldn’t make it work well for me. In my experience, an agent usually finishes in about 5 minutes, so there’s not much time for meaningful human work in between: setup takes 30 minutes and the agent still finishes quickly. I did try some bigger projects, and the agent actually ran longer on those, but I didn’t put enough energy into building a good agent harness for long tasks (I told Claude to create one) and I didn’t create good validation, e.g. a test specification. I didn’t care much about those long tasks; I just wanted to see if I could get an agent to stop prompting me every 5 minutes. It did create something working, and it worked for 2 hours, but I didn’t need the result at all. Maybe I just don’t have enough high-quality work for an agent to do something useful.
And when I tried using agents for research (both Gemini and Claude Code), the output was full of fluff. The best trick I found was telling the agent to keep its output under 40 lines, which forces it to include only the important parts. Limiting output length turned out to be very effective for other tasks too. It’s a simple instance of a broader family of techniques: provide a schema for the LLM’s output, validate it, and reprompt if validation fails.
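The validate-and-reprompt pattern can be sketched in a few lines. This is a minimal illustration, not any tool’s real API: `call_model` is a hypothetical stand-in for whatever LLM call you use, and the “schema” here is just the 40-line cap from above.

```python
def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call; it returns a
    # canned short answer so this sketch is runnable on its own.
    return "point 1\npoint 2\npoint 3"

def validate(output: str, max_lines: int = 40) -> bool:
    # The simplest possible "schema": a hard cap on output length.
    return len(output.splitlines()) <= max_lines

def ask_with_validation(prompt: str, max_lines: int = 40, retries: int = 3) -> str:
    full_prompt = f"{prompt}\n\nKeep the answer under {max_lines} lines."
    output = call_model(full_prompt)
    for _ in range(retries):
        if validate(output, max_lines):
            return output
        # Validation failed: reprompt with concrete feedback.
        full_prompt = (
            f"{prompt}\n\nYour previous answer was too long "
            f"({len(output.splitlines())} lines). "
            f"Rewrite it in under {max_lines} lines."
        )
        output = call_model(full_prompt)
    return output  # give up and return the last attempt
```

The same loop works with stricter schemas (JSON structure, required sections) by swapping in a different `validate`.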
Maybe the people who make this work well just have bigger projects with a deeper backlog of well-defined tasks. If you always know what needs to be done next, it’s easier to keep agents busy. I think the bottleneck for me was never “not enough compute”; it was “not enough well-scoped work.” I have ideas I work on, but most of the effort goes into researching whether someone has already built them and writing specifications. Creating the prototype is a very small part of the job, and that’s the part where the agent helps me a lot.
For longer tasks, I suspect I also need to give agents better ways to validate their own work. Without that, the output tends to be weak, especially for prototyping. For example, I should write more concrete input/output examples that agents can turn into tests.
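Concretely, if a spec pins down a few input/output pairs, the agent can copy them into a test file verbatim and run it as its validation loop. A small sketch under invented assumptions: the `slugify` function and its example pairs are hypothetical spec content, not from any real project.

```python
import re

def slugify(text: str) -> str:
    # Minimal implementation an agent might derive from the spec examples:
    # trim, lowercase, and collapse whitespace runs into single hyphens.
    return re.sub(r"\s+", "-", text.strip().lower())

# The spec's concrete input/output examples, reused verbatim as tests.
EXAMPLES = [
    ("Hello World", "hello-world"),
    ("  A  B ", "a-b"),
]

for inp, expected in EXAMPLES:
    assert slugify(inp) == expected, (inp, slugify(inp), expected)
```

The point is that the examples double as an executable definition of “done,” so the agent can check itself instead of asking me.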
I thought I had missed something, so I also went through his posts where he gives a concrete example of vibe coding a feature: https://news.ycombinator.com/item?id=45549434 https://mitchellh.com/writing/non-trivial-vibing Some of those prompts might run for somewhat longer, but at work I’m limited in which tools I can use, and I’m too afraid to let agents run all commands freely. When I run them in a sandbox, some tasks require CLI tools that I can’t safely allow there. And at home I don’t have enough high-quality work.
Philosophy 2: Specification-Driven Development
The other philosophy is to invest upfront in writing detailed specifications, then let the agent execute against them. Some people argue this approach is already unnecessary because modern models and tools like Claude Code have gotten good enough to cover the shortcomings SDD was meant to address. But I still thought SDD could help, at least for greenfield projects.
Tools for this are: https://github.com/Fission-AI/OpenSpec https://kiro.dev/ traycer.ai https://github.com/github/spec-kit
I didn’t use Traycer at first, but it looks great: you can throw a rough idea at it, and its system prompts and other harness engineering help you turn that into a specific specification. It also helps you validate that your project works.
So I tested Traycer, hoping a spec-driven tool would be interactive: it would help me figure out what I actually want to build, create a proper specification through conversation, and then make it easier for the agent to validate its own work.
I tested a spec-driven tool that someone I know had used with great success. But I found it wasn’t much better than Claude Code’s planning mode, and you have to pay extra on top. I think the difference was that they already had a really good specification going in, with concrete examples of inputs and outputs, and the tool just executed it. The problem I wanted solved was creating the specification in the first place.
What I use for now
So I haven’t been able to fully realize either approach. For now I just keep improving my AGENTS.md. I think something like this provides most of the value of SDD tools:
# Task Planning
When given a specification or design doc, break it into a todo file
with concrete, small tasks. Use this file to track progress: check off
finished tasks and note discoveries as you go.
## Structuring the todo file
- **Front-load tasks that need human approval** (web requests, CLI tools
that need internet, or destructive actions). Batch these first so the
human can approve once and the agent can run autonomously after that.
- **Commit between tasks.** Include a reminder at the top of the plan.
- **End with a validation loop:** run tests → review & fix → run tests again.
I think it works because it lets the agent “think” about one task at a time.
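For illustration, a todo file produced under these rules might look like this (the tasks themselves are made up):

```markdown
# Plan: CSV import feature
Reminder: commit after each completed task.

## Needs human approval (batched first)
- [ ] Fetch sample CSVs from the partner API (web request)
- [ ] Install the CSV parsing dependency (needs internet)

## Autonomous work
- [x] Parse header row and infer column types
  - Discovery: some files use ";" as a delimiter, handled both
- [ ] Map columns to the internal schema
- [ ] Handle malformed rows (log and skip)

## Validation loop
- [ ] Run tests
- [ ] Review failures and fix
- [ ] Run tests again
```

The approval-needing tasks sit at the top so one human sign-off unblocks the rest of the run.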