Building the primitive AI coding loop around GPT-4
Before AI coding agents could read error logs, run commands, or inspect a codebase on their own, I was building a browser-based workflow around GPT-4 that tried to speed up the loop those agents now automate.
The loop was simple: assemble context, ask the model, test the answer locally, bring back evidence, and revise.
At the time, that work was manual. I was the file reader, the test runner, the console logger, and the context manager. GPT-4 was powerful, but the product surface around it did not match the way I actually used it for hard coding problems.
That mismatch became the product idea.
Coding was never going back
Completion Was Convenience, Debugging Was Leverage
Autocomplete was useful, but it did not feel like a fundamental unlock.
Having a tool complete a function, fill in boilerplate, or even sketch out a whole file was impressive. It made me faster. It removed some typing. It saved time on code I probably could have written myself.
But the value felt incremental.
The bigger unlock was not generating code I already knew how to write. It was helping me get unstuck when I did not know what was wrong.
That happened most often during debugging.
A bug can consume hours or days because the hard part is not typing the fix. The hard part is holding the whole problem in your head: the code, the error, the system behavior, the assumptions you have already made, and the paths you have already ruled out.
That cognitive load is expensive. It is also frustrating. When you are stuck, every failed attempt adds more noise to the problem.
GPT-4 was valuable because it could reduce that load.
If I gave it the right context, it could reason through the failure with me. It could compare the code against the error message. It could notice a bad assumption. It could suggest a path I had not considered. Sometimes it could compress a debugging process that might have taken hours into a few minutes.
That felt categorically different from autocomplete.
Autocomplete helped me write code faster.
Debugging with GPT-4 helped me escape problems faster.
That was the difference that made ChatGPT feel more important than the editor tools I had been using. The serious work was not asking the model to finish what I was already typing. It was asking the model to help me understand why something was broken.
Chat Was Too Flat
GPT-4 changed what felt possible.
I got access within a few days of its announcement, and it was immediately clear that coding was not going back to the old shape. With the right context, the model could often untangle issues that would have taken me much longer on my own.
It could explain why an assumption was wrong, not just that it was wrong. It could point out the missing detail I had stopped seeing.
The model was not the problem.
The interface was.
ChatGPT put a powerful reasoning engine behind a chat box, but my actual workflow looked more like debugging with a suitcase full of code snippets, stack traces, notes, constraints, and rapidly expiring assumptions.
The workflow was an agent loop before agents
Manual Context Gathering
I would copy code from my editor, paste it into ChatGPT, ask for help, read the answer, try the suggestion locally, then return with a new error or corrected context. If the model missed something, I had to decide what it needed next and paste that in too.
In modern terms, I was doing context gathering by hand.
There was no agent reading files for me. There was no tool call running tests. There was no automatic understanding of the project. I was the part of the system that noticed which file mattered, which error changed the diagnosis, and which part of the previous answer was worth carrying forward.
The more I used GPT-4 this way, the more wrong the ChatGPT input box felt.
Evidence Over Answers
The useful discovery was not that GPT-4 could answer a coding question once.
The useful discovery was that the answer got better when I kept the model inside a tight debugging loop.
Usually, the loop started with broken code.
Sometimes I knew the exact failure. Sometimes I only knew that a feature was behaving strangely, a type was fighting me, or a refactor had created a bug I did not understand yet.
The first prompt was rarely perfect. It was a snapshot of my current understanding: the code I thought mattered, the error I could see, and the assumption I was making about why it was failing.
GPT-4 would respond with something useful, but not always final.
Sometimes it explained the bug directly. More often, it revealed what was missing. It might assume a function behaved one way when the actual implementation behaved another. It might suggest a plausible fix that was incomplete because it had not seen the right type, helper, or surrounding module.
That was still progress.
Even when the answer was wrong, it improved the shape of the problem. It showed me which assumption needed to be tested or which piece of context the model needed next.
Reality Check
The real test happened back in the codebase.
I would try the suggestion, run the code, check the type error, inspect the output, or see whether the behavior changed. That step mattered because the model was not connected to my environment.
It could reason from the material I gave it, but I had to close the loop with reality.
The workflow felt less like asking a question and more like running an experiment. The model proposed a path. The codebase answered whether that path held up.
Then the next prompt became the important one.
I could bring back the new error, the failed patch, the changed behavior, or the piece of code that contradicted the model’s assumption. Now the model had more evidence than it had on the first pass.
That changed how I thought about prompting.
The goal was not to write one perfect prompt up front. The goal was to make each pass produce better evidence for the next one.
That was the product insight I kept returning to: the model was capable, but the interface needed to make the loop fast enough to use.
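The loop described above can be sketched in a few lines. This is a minimal illustration, not the workbench's code: `debug_loop`, `Attempt`, and the three callables are hypothetical names, and the stand-in model and test runner exist only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    prompt: str    # what was sent to the model
    answer: str    # what the model proposed
    evidence: str  # what the codebase said when the proposal was tried


def debug_loop(
    initial_context: str,
    ask_model: Callable[[str], str],
    try_locally: Callable[[str], str],
    is_fixed: Callable[[str], bool],
    max_passes: int = 5,
) -> List[Attempt]:
    """Run ask -> test -> bring back evidence until fixed or out of passes."""
    attempts: List[Attempt] = []
    prompt = initial_context
    for _ in range(max_passes):
        answer = ask_model(prompt)
        evidence = try_locally(answer)
        attempts.append(Attempt(prompt, answer, evidence))
        if is_fixed(evidence):
            break
        # The next prompt carries the failed suggestion and its result forward.
        prompt = (
            f"{prompt}\n\nPrevious suggestion:\n{answer}"
            f"\n\nResult when tried:\n{evidence}"
        )
    return attempts


# Stand-ins for the model and the local test run (assumptions for the demo).
def fake_model(prompt: str) -> str:
    return "patch B" if "TypeError" in prompt else "patch A"


def fake_run(answer: str) -> str:
    return "tests pass" if answer == "patch B" else "TypeError: x is None"


history = debug_loop(
    "code + first error", fake_model, fake_run, lambda e: e == "tests pass"
)
```

In this toy run the first pass fails, and the second prompt succeeds only because it contains the evidence from the first.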
The prompt needed to become a working canvas
More Space
Once I understood that the loop was the real workflow, the input stopped feeling like a chat box.
It needed to become a place where I could assemble, read, and reshape context before sending it to the model.
The first requirement was space.
I wanted the input to use the full height of the viewport because the prompt was often larger than a normal message. It could include code, stack traces, notes, instructions, previous output, and a revised question all at once.
That material needed room to breathe. If I could not see it, I could not reason about it.
In ChatGPT, the prompt felt like an accessory to the conversation. In my workflow, the prompt was the work surface.
Structured Blocks
The second requirement was structure.
Plain text was too flat for the kind of context I was assembling. I wanted the prompt to behave more like a document made of blocks: sections, code, lists, tags, and fragments that could be moved or removed as units.
That is why the Notion-like framing mattered.
The point was not to make a prettier text area. The point was to make pasted context feel like editable material.
Readable Code
Code needed to look like code. When I pasted a function, type, component, or error-producing snippet into a plain input, I lost the visual affordances that made the code readable.
Syntax highlighting was not decoration. It was part of being able to scan and verify the prompt before sending it.
If I was asking the model to reason about code, I needed to be able to reason about the prompt too.
The same applied to everything around the code. Lists should look like lists. Headings should divide the prompt into sections. Tables should stay legible. Context tags should make it clear what role each piece of information was playing.
The final payload still needed to become text the model could consume, but the editing experience did not have to feel like raw text.
That distinction shaped the product: rich visual editing for the human, model-optimized markdown for the LLM.
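One way to picture that split is a list of typed blocks that the editor renders richly and then flattens to markdown at send time. The sketch below is an assumption about how such a serializer could look. The `Block` fields, the `[tag]` prefix, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

FENCE = "`" * 3  # markdown code fence, built here to avoid nesting fences


@dataclass
class Block:
    kind: str       # "heading", "code", "list", or "text"
    content: str
    lang: str = ""  # only used for code blocks
    tag: str = ""   # optional context tag such as "error" or "tried"


def block_to_markdown(block: Block) -> str:
    """Flatten one editor block into plain markdown."""
    if block.kind == "heading":
        body = f"## {block.content}"
    elif block.kind == "code":
        body = f"{FENCE}{block.lang}\n{block.content}\n{FENCE}"
    elif block.kind == "list":
        body = "\n".join(f"- {item}" for item in block.content.splitlines())
    else:
        body = block.content
    # A tag becomes a one-line label above the block.
    return f"[{block.tag}]\n{body}" if block.tag else body


def to_markdown(blocks: List[Block]) -> str:
    """Join all blocks into the text payload sent to the model."""
    return "\n\n".join(block_to_markdown(b) for b in blocks)


prompt = [
    Block("heading", "Failing test"),
    Block("code", "assert add(1, 2) == 3", lang="python", tag="error"),
    Block("list", "tried clearing cache\ntried pinning version"),
]
payload = to_markdown(prompt)
```

Because blocks are the unit of storage, moving or removing one is a list operation rather than a text selection.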
History had to become working material
Side-By-Side Runs
The other surface I needed was history.
The app used a two-panel layout because iteration depends on comparison. The current input and output lived on one side, while previous runs lived beside them.
That solved a problem I kept hitting in chat: history was technically available, but it was trapped in scrollback.
If I wanted to reuse something from an earlier answer, I had to scroll away from the prompt I was writing, find the relevant section, copy it, then return to where I had been working. That sounds small, but it breaks the loop. The moment I lose the current input, I have to rebuild my mental position.
With history beside the input, previous attempts became available as working material.
Reuse Without Losing Place
I could keep the current prompt in view while cycling through earlier runs. I could find a useful explanation, patch, phrase, or error and pull it into the next attempt without losing my place.
That mattered because debugging is not linear.
Sometimes the useful answer is not in the latest response. Sometimes it is buried in an earlier attempt that was wrong overall but contained one good diagnosis. Sometimes the previous prompt had better framing than the current one. Sometimes the old error becomes important again after a new test result.
The interface needed to make that kind of reuse natural.
It also needed to preserve position.
When the prompt is a large document, position matters. I might be editing the middle of a code block, pruning a section near the top, or adding a new instruction at the bottom. Looking through history should not reset that work.
The product was not only storing text.
It was preserving attention inside the loop.
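A small sketch of that idea keeps the history panel's position separate from the draft's cursor, so browsing never disturbs the edit in progress. The `Workbench` and `Run` names and their fields are hypothetical and stand in for whatever state the real app kept.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    prompt: str
    output: str


@dataclass
class Workbench:
    draft: str = ""
    cursor: int = 0                # character offset into the draft
    history: List[Run] = field(default_factory=list)
    viewing: Optional[int] = None  # run shown in the history panel

    def browse(self, step: int) -> Optional[Run]:
        """Move the history panel; the draft and cursor are not touched."""
        if not self.history:
            return None
        if self.viewing is None:
            self.viewing = len(self.history) - 1  # first browse opens the latest run
        else:
            last = len(self.history) - 1
            self.viewing = max(0, min(last, self.viewing + step))
        return self.history[self.viewing]

    def pull(self, fragment: str) -> None:
        """Insert a fragment from history at the current cursor position."""
        self.draft = self.draft[: self.cursor] + fragment + self.draft[self.cursor :]
        self.cursor += len(fragment)


wb = Workbench(
    draft="Context:\n\nQuestion: why does it fail?",
    cursor=9,  # just after "Context:\n"
    history=[Run("p1", "cache key is stale"), Run("p2", "unrelated guess")],
)
wb.browse(-1)        # opens the latest run
run = wb.browse(-1)  # steps back to the earlier one
cursor_while_browsing = wb.cursor
wb.pull(run.output)
```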
The mouse became the flow bottleneck
Prompting Was Editing
As the prompts got longer, I realized I was not just writing messages.
I was editing.
I was moving through a large prompt, inserting context in specific places, deleting stale assumptions, folding noisy sections, pulling useful pieces forward, and reshaping the structure before each run.
That changed how I thought about the interface.
The bottleneck was not only that ChatGPT had a small input box. The bottleneck was that normal browser editing made this kind of work feel slow. Every time I reached for the mouse, highlighted a section, clicked into a new spot, or manually repositioned the cursor, I broke the flow of the loop.
The mouse was useful for pointing, but it was too slow for restructuring.
Once I saw that, the product started to feel less like a chat app and more like a modal editor for context.
Modal Editing
That was how I found my way toward a Vim-style workflow.
I wanted to move through the prompt without touching the mouse. I wanted to jump between sections, operate on blocks, delete or fold context, and place the cursor exactly where the next piece of evidence belonged.
The important shift was realizing that the prompt was not a passive text field. It was an object I needed to navigate and transform.
Typing was only one mode.
There also needed to be a command mode for movement and structure: moving across blocks, selecting a region, pruning old context, opening space for a new instruction, or jumping back to the relevant part of the prompt.
That made the loop feel faster because it reduced the friction between thinking and editing.
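A toy version of that split between typing and command mode is shown below. The key bindings (`j`, `k`, `d`, `o`, `i`, `<esc>`) borrow Vim conventions as an assumption; the source does not say which keys the workbench used, and `ModalEditor` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModalEditor:
    blocks: List[str]
    current: int = 0
    mode: str = "command"  # "command" or "insert"

    def _clamp(self, index: int) -> int:
        return max(0, min(index, len(self.blocks) - 1))

    def press(self, key: str) -> None:
        if self.mode == "insert":
            if key == "<esc>":
                self.mode = "command"
            else:
                self.blocks[self.current] += key  # keys are literal text here
            return
        # Command mode: keys operate on whole blocks.
        if key == "j":
            self.current = self._clamp(self.current + 1)
        elif key == "k":
            self.current = self._clamp(self.current - 1)
        elif key == "d" and self.blocks:
            del self.blocks[self.current]
            self.current = self._clamp(self.current)
        elif key == "o":
            # Open an empty block below the current one and start typing.
            index = min(self.current + 1, len(self.blocks))
            self.blocks.insert(index, "")
            self.current = index
            self.mode = "insert"
        elif key == "i" and self.blocks:
            self.mode = "insert"


ed = ModalEditor(["stale assumption", "stack trace", "question"])
for key in ["d", "j", "o", "j", "k", "<esc>"]:
    ed.press(key)
```

The same `j` moves down a block in command mode and types a letter in insert mode, which is the whole point of making the mode explicit.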
Voice Input
Voice became interesting for the same reason.
A lot of the prompt was not code. It was context, intent, observations, and instructions. Those were often easier to say than type.
Once the editor had precise keyboard navigation, voice could fit into the workflow naturally. I could move the cursor to the exact location where a new note belonged, then speak the context into that spot.
That mattered because voice on its own is imprecise. Dictating into a generic text box can create more cleanup work than it saves. But voice combined with modal navigation was different.
The keyboard handled precision.
Voice handled expression.
I could use Vim-style movement to get to the right section, then use voice to quickly add the human part of the prompt: what I had tried, what seemed wrong, what the model should pay attention to, or what I wanted it to do next.
That made context entry faster without giving up control over structure.
The deeper insight was that prompting was not only about better text. It was about lowering the cost of reshaping thought.
The faster I could move, edit, and insert context, the tighter the debugging loop became.
Keyboard commands made context fast to reshape
Navigation
Once the prompt became a structured document, keyboard commands became the fastest way to operate on it.
The common moves were simple: navigate, prune, fold, retry, follow up, and pull useful material forward.
Navigation had to work at the level of the prompt structure. Moving by character or line was not enough. I needed to jump between blocks, move through the document, and get to the next meaningful unit of context without reaching for the mouse.
That connected the product to modal editing.
Command mode existed to manipulate the prompt as a structured object, separate from typing into it. The same key could mean different things depending on the workflow state, which later influenced how I thought about typed UI state more broadly.
Pruning
Deletion was just as important as insertion.
Long prompts can get worse when they carry irrelevant material forward. The model may still produce an answer, but the answer can drift toward stale assumptions or waste attention on details that no longer matter.
So pruning needed to be fast and precise.
The editor supported operations like deleting the current block, deleting around the cursor, and removing larger regions of context. That made pruning feel like a normal part of the workflow instead of a cleanup chore.
Folding
Some context was useful but visually noisy, so folding became important too.
I did not always want to delete a section. Sometimes I wanted to keep it in the prompt while getting it out of the way visually. Folding borrowed the feeling of code folding in an editor: a section could remain part of the document without occupying attention while I worked somewhere else.
Context management is partly about what the model sees, but it is also about what the human can still understand.
The product needed to support both.
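The difference between pruning and folding can be made concrete with two renderers over the same document, one for the human view and one for the model payload. This is a sketch under assumed names (`Section`, `prune`, `render_view`, `render_payload`), not the editor's actual data model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:
    title: str
    body: str
    folded: bool = False


def prune(sections: List[Section], start: int, end: Optional[int] = None) -> List[Section]:
    """Return a copy without sections[start:end]; by default drop one section."""
    end = start + 1 if end is None else end
    return sections[:start] + sections[end:]


def render_view(sections: List[Section]) -> str:
    """What the human sees: folded sections collapse to a single line."""
    parts = []
    for s in sections:
        if s.folded:
            parts.append(f"> {s.title} (folded)")
        else:
            parts.append(f"{s.title}\n{s.body}")
    return "\n\n".join(parts)


def render_payload(sections: List[Section]) -> str:
    """What the model receives: folding never hides content."""
    return "\n\n".join(f"{s.title}\n{s.body}" for s in sections)


doc = [
    Section("Old hypothesis", "maybe the cache"),
    Section("Build log", "400 lines of output"),
    Section("Question", "why does step 3 fail?"),
]
doc = prune(doc, 0)    # stale context leaves the prompt entirely
doc[0].folded = True   # noisy context stays in the prompt but out of sight
view = render_view(doc)
payload = render_payload(doc)
```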
Retry and follow-up created different paths
Retry
One of the most important product distinctions was the difference between retry and follow-up.
Retry meant correcting the same attempt.
If the output was close but not right, I did not want to start from nothing. I wanted to bring the previous input back into the editor, adjust it, and run it again.
The failed attempt still had value. The prompt already contained useful context and framing. Retry treated that previous input as editable starting material.
That was different from chat’s default behavior, where the next move is usually another message appended to the thread.
Follow-Up
Follow-up meant something else.
A follow-up carried the exchange forward: the prior input, the model’s output, and a new instruction built on top of them.
That distinction mattered because some outputs are not failures. They are intermediate material. A good response might explain part of the problem, produce a partial fix, or identify a better direction. Follow-up let that exchange become structured context for the next step instead of disappearing into the chat history.
This was the core of the workbench.
It treated AI coding as an iterative system, not a sequence of isolated messages.
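In data terms, the two paths differ only in what seeds the next draft. The following sketch assumes a simple `Run` record and markdown section labels. Both are illustrative choices rather than the workbench's real format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    prompt: str
    output: str


def retry(run: Run) -> str:
    """Seed the editor with the previous input only, ready to be corrected."""
    return run.prompt


def follow_up(run: Run, instruction: str) -> str:
    """Seed the editor with the whole exchange plus a new instruction."""
    return "\n\n".join(
        [
            "## Previous input",
            run.prompt,
            "## Previous output",
            run.output,
            "## Next instruction",
            instruction,
        ]
    )


last = Run(prompt="Why does parse() return None?", output="The regex never matches.")
retry_draft = retry(last)
follow_up_draft = follow_up(last, "Rewrite the regex to accept trailing spaces.")
```

Retry discards the output and keeps the framing; follow-up keeps both and stacks a new step on top.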
The next unlock was file system context
Browser Boundary
The workbench made the loop faster, but it also made the next limitation obvious.
It lived in the browser, while the project lived somewhere else.
The codebase was in the editor. The files were on disk. The tests ran in the terminal. The errors appeared in the shell, the browser console, or the app itself. Every useful piece of evidence had to cross that boundary manually before the model could reason about it.
That was the real constraint.
A better prompt surface could make context easier to shape. It could make history easier to reuse. It could make iteration feel less like copy-paste chaos.
But the web app still could not see the system where the work was happening.
When the model needed another file, I had to go find it. When an answer needed to be tested, I had to run the command. When the result failed, I had to bring the failure back. The loop worked, but I was still the bridge between the model and the codebase.
That made the next step feel inevitable.
The model needed to move closer to the environment.
File System Access
File system access was the next real unlock.
Debugging rarely lives inside one snippet. The cause might be in the caller, the helper, the type definition, the config, the test setup, the build step, or the relationship between several files.
In the browser workflow, I had to guess which pieces mattered before the model could help. Sometimes that worked. Other times, the missing file was exactly the thing preventing the model from giving a useful answer.
Once the model could inspect the project directly, the shape of the interaction changed.
Context gathering no longer had to begin with a perfect copy-paste bundle. The system could search the repo, follow references, read nearby files, compare implementations, and build a picture of the project from inside the project itself.
Commands mattered for the same reason.
Running tests, checking errors, reading logs, searching the codebase, and applying patches were all part of the debugging loop. In the browser version, those steps stayed outside the model. I performed them, summarized them, and pasted the results back into the prompt.
The natural next step was to let the model participate in those actions directly.
That is the shift from a prompt workbench toward an agent.
Terminal-Native Agents
Tools like Claude Code and Aider felt like continuations of the same idea because they moved the AI workflow into the place where development already happens.
The terminal has the repo. It has the shell. It has the test runner, package manager, file system, logs, scripts, and source control. Putting the model there changes the loop from “tell me what to do” to “look around, try something, observe the result, and keep going.”
That was the missing piece in the browser workbench.
The browser version helped me see the structure of the workflow because every step was still manual. I gathered the files. I ran the tests. I copied the errors. I revised the prompt. I decided what evidence mattered next.
Terminal-native agents could absorb more of that movement.
Instead of asking me to paste a file, the agent could read it. Instead of giving me a patch to apply, it could edit the repo. Instead of waiting for me to report the test output, it could run the command and incorporate the result.
The goal was the same: keep the model inside a tight feedback loop with the real state of the codebase.
The difference was location.
My workbench operated from outside the project.
The next generation moved the loop inside it.
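A stripped-down picture of the loop moving inside the project is sketched below. It is not how Claude Code or Aider work internally: the tool names, the `agent_loop` function, and the rule-based `toy_policy` standing in for a model are all assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, tuple]


def read_file(root: Path, path: str) -> str:
    return (root / path).read_text()


def write_file(root: Path, path: str, content: str) -> str:
    (root / path).write_text(content)
    return f"wrote {path}"


def run_command(root: Path, args: List[str]) -> str:
    proc = subprocess.run(args, cwd=root, capture_output=True, text=True)
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"


TOOLS = {"read_file": read_file, "write_file": write_file, "run_command": run_command}


def agent_loop(
    root: Path,
    choose_action: Callable[[List[str]], Optional[Action]],
    max_steps: int = 10,
) -> List[str]:
    """Let the policy pick a tool, run it, and feed the result back."""
    observations: List[str] = []
    for _ in range(max_steps):
        action = choose_action(observations)
        if action is None:
            break
        name, args = action
        observations.append(TOOLS[name](root, *args))
    return observations


def toy_policy(observations: List[str]) -> Optional[Action]:
    """Rule-based stand-in for a model: read, run, patch on error, rerun."""
    if not observations:
        return ("read_file", ("calc.py",))
    last = observations[-1]
    if last.startswith("exit=0"):
        return None  # the command succeeded, so stop
    if "SyntaxError" in last:
        return ("write_file", ("calc.py", "print(1 + 1)\n"))
    return ("run_command", ([sys.executable, "calc.py"],))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "calc.py").write_text("print(1 + )\n")  # deliberately broken
    log = agent_loop(root, toy_policy)
```

Every step I used to perform by hand in the browser workflow appears here as a tool call whose output becomes the next observation.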
Human Steering
That progression does not make the earlier workbench feel wrong.
It makes the direction clearer.
The browser tool exposed the shape of the loop: context, instruction, output, test, evidence, revision. Local agents moved that loop into the system where more of the context could be gathered automatically.
What still matters is the steering layer.
Even when an agent can read files, run commands, and edit code, the user needs to understand what happened. Which files shaped the answer? What changed? Which test failed? What assumption is the agent carrying forward? Where does the next decision need human judgment?
Automation makes the loop faster, but the interface still has to keep the human oriented.
That is the throughline from the browser workbench to modern coding agents.
The workbench was an early attempt to make the AI coding loop visible from outside the codebase.
File system access made it possible to run that loop from within.