The AI Code Review Gap
Quick to write, a pain to review: our lessons learned when worlds collide
600 users, no analytics.
Right before the holidays we launched the first product iteration of SpecStory — an extension for Cursor — and more than 600 people have installed it: great!
What we were missing, however, was usage analytics on the extension itself, so we could know (anonymously) when people were activating it: not so great!
So picture this: a product leader, a seasoned engineer, and an AI assistant enter a Zoom room, all with the expectation of adding analytics via PostHog to the extension for the next release.
Like many modern software composers, I first approached this task solo with Cursor’s Composer as my guide. The code flowed effortlessly; each prompt brought new functionality to life. Smoke tests passed, events flowed into PostHog, and everything seemed like smooth sailing.
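For context, the change itself was conceptually small. A minimal sketch of that kind of instrumentation, assuming a VS Code-style extension host and the posthog-node client, might look like the following; the event name, distinct ID choice, and placeholder API key are illustrative, not SpecStory’s actual code.

```typescript
// Minimal sketch of anonymous activation analytics in a VS Code-style extension.
// Assumes the posthog-node client; the event name and the machineId-based
// distinct ID are illustrative, not SpecStory's actual implementation.
import * as vscode from "vscode";
import { PostHog } from "posthog-node";

let client: PostHog | undefined;

export function activate(context: vscode.ExtensionContext) {
  client = new PostHog("<posthog-project-api-key>", {
    host: "https://us.i.posthog.com",
  });

  // vscode.env.machineId is an anonymized, per-install identifier,
  // so no personal data rides along with the event.
  client.capture({
    distinctId: vscode.env.machineId,
    event: "extension_activated",
    properties: { extensionVersion: context.extension.packageJSON.version },
  });
}

export async function deactivate() {
  // Flush any queued events before the extension host shuts down.
  await client?.shutdown();
}
```

Even a block this small carries the questions a reviewer will ask: what identifies a user, what gets flushed on shutdown, and what happens offline.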
Until the moment of truth: the pull request.
When Worlds Collide
Reality has a way of humbling our assumptions: my 60-minute composition session created a pull request that, as we started digging in, we realized would take more than 5x as long to review.
As Sean, our CTO, and I sat in that Zoom room together, he began asking questions about specific lines of code.
And I found myself in an unusual position: unable to explain any of the details of what I had created.
This moment crystallized for us the new gap created by AI-assisted development: the distance between intent and implementation doesn't vanish with AI; it just shifts. While I could articulate what I wanted the software to do, I had personally lost the thread connecting each line of code to its original purpose.
Not wanting to let this pain point go to waste, Sean and I spent the next day experimenting with a decidedly different approach.
Trio Programming
Instead of going solo from the beginning, we decided to connect product intent with engineering wisdom while allowing Composer to act as our implementation partner.
As the quick video shows, we moved in small, deliberate steps and caught the kinds of potential regressions that had previously made the pull request opaque and could have turned into hard-to-rectify production issues.
The good news? We ultimately got the analytics shipped (check out version 0.2.4 here) and learned a few things that will probably resonate with you:
For rapid prototypes and demonstrations, solo AI software composition shines
For production features, our trio programming approach provides the right balance while still allowing us to “eat our own dogfood”
For technical intents (like improved code quality linting or enhanced CI), traditional engineering often remains the most efficient path
Substitutes for the second human partner?
While the trio approach with me, Sean, and Cursor worked well, we can’t help but think about the future. Given the tools currently at our disposal, the time we spent together was clunky, and trio programming likely won’t scale at the rate we intend to grow.
In parallel, we’ve been debating whether another AI could serve as a stand-in. While it is still early days, Greg began pulling on that thread before the holidays. The experiment wasn’t for a SpecStory production feature, but it offers a fascinating glimpse into a promise that still needs further verification.

To help our distribution efforts, Greg is building an AI Code Editor ranking app. It sources data from Reddit and uses ChatGPT to analyze the sentiment. He had the idea to use Google’s new realtime streaming functionality to “watch his screen” as he composed with Cursor and Claude, and offer suggestions on the code as it was being generated.
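The ranking app is Greg’s own side project, but the Reddit-plus-LLM shape he describes could look roughly like the sketch below; the subreddit, model, prompt, and scoring scale are assumptions for illustration, not his actual implementation.

```typescript
// Hypothetical sketch of a "Reddit posts -> LLM sentiment" pipeline.
// The subreddit, model, and prompt are illustrative assumptions only.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface ScoredPost {
  title: string;
  sentiment: number; // -1 (negative) .. 1 (positive)
}

async function scoreSubreddit(subreddit: string): Promise<ScoredPost[]> {
  // Reddit exposes listings as public JSON endpoints.
  const res = await fetch(
    `https://www.reddit.com/r/${subreddit}/top.json?limit=25&t=week`
  );
  const listing = await res.json();
  const titles: string[] = listing.data.children.map((c: any) => c.data.title);

  const scored: ScoredPost[] = [];
  for (const title of titles) {
    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "system",
          content:
            "Rate the sentiment of this post about an AI code editor from -1 to 1. Reply with only the number.",
        },
        { role: "user", content: title },
      ],
    });
    scored.push({
      title,
      sentiment: Number(completion.choices[0].message.content),
    });
  }
  return scored;
}

// Example: rank an editor's community by average sentiment over the past week.
scoreSubreddit("cursor").then((posts) => {
  const avg = posts.reduce((sum, p) => sum + p.sentiment, 0) / posts.length;
  console.log(`r/cursor average sentiment: ${avg.toFixed(2)}`);
});
```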
While the demo above was mostly happy path, it points to the exciting possibility of AI being not only a code generator but also an active review partner. While currently also clunky to manage, it appears to be the best “test” option short of connecting ChatGPT directly to Cursor through their “Work with Apps” beta feature.
In his words:
It was great to be able to work in familiar modalities and have the second AI keep up. I was impressed with how easy it was to direct Gemini at parts of the screen and get very specific feedback on Cursor’s implementation of my intent. The jury is still out on whether it will be able to aid my efforts on more complicated software composing.
So there you have it: as we continue to build SpecStory, we’re committed to consistently sharing these types of learnings.
We believe that whether you’re a solo composer, part of a larger team, or somewhere in between, finding the right balance between AI assistance and human judgement will continue to be crucial in the months and years ahead.