We’re excited about the coming rise of autonomous AI developers. Tools like OpenHands, Copilot Workspace, Devin, and Amazon Q have the potential to greatly streamline how software is built by letting people describe tasks and hand them off to AIs to perform.
This only works if the AI developers are competent and reliable, and in our experience they’re not there yet. They’re closer than it might seem, though.
In a recent blog post we took a deep look at an example of a straightforward improvement that AI developers ought to be able to make. Without detailed instructions, the developers all failed: they don’t have a good enough understanding of the application’s behavior to know what they need to fix.
When we used Replay to analyze the application and describe its data flow to an AI, it was able to fix the problem reliably from a short prompt. Much as we saw when using AIs to fix a browser test failure, a few pieces of information are crucial. Give the AIs enough pieces of the puzzle, and they’re remarkably good at putting them together.
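To make that concrete, here is a minimal sketch of what “describing the data flow to an AI” could look like. Everything here is hypothetical: `getDataFlowSummary` stands in for whatever analysis Replay performs, and the prompt format is just one way of handing those puzzle pieces to a model.

```typescript
// Hypothetical sketch: package data-flow analysis from a recording into a prompt.
// None of these types or functions are a real Replay or OpenHands API.

interface DataFlowStep {
  file: string;        // source file where the value was produced or consumed
  line: number;        // location in that file
  description: string; // e.g. "response.items is mapped into component state"
}

interface DataFlowSummary {
  problem: string;       // the comment left on the recording
  steps: DataFlowStep[]; // how the relevant data moves through the app
}

// Placeholder for the analysis a tool like Replay could provide.
declare function getDataFlowSummary(recordingId: string): Promise<DataFlowSummary>;

// Build a short prompt that gives the AI the pieces it needs.
async function buildFixPrompt(recordingId: string): Promise<string> {
  const summary = await getDataFlowSummary(recordingId);
  const steps = summary.steps
    .map((s, i) => `${i + 1}. ${s.file}:${s.line}: ${s.description}`)
    .join("\n");
  return [
    `Problem: ${summary.problem}`,
    `Relevant data flow through the application:`,
    steps,
    `Propose a patch that fixes the problem without changing unrelated behavior.`,
  ].join("\n\n");
}
```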
We believe that these techniques for making AI developers reliable will generalize well to a wide variety of straightforward fixes and improvements to web applications. We’re considering options for a prototype tool combining Replay with OpenHands, and we’d love your feedback. We want to make it easy and effective to use AIs to fix your backlog of issues and speed up development on new features.
A rough workflow:
1. Configure your project so the AI developer knows how to build it and run its tests (a hypothetical configuration sketch follows this list).
2. When there’s a bug to fix or an improvement to make, create a Replay recording and add a comment in it describing the problem.
3. Attach the recording to an issue.
4. A little while later, get a PR that fixes the issue, passes all tests, and includes a recording of the fixed application with before/after screenshots and videos.
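For step 1, the configuration could be as simple as telling the AI developer how to build the project and run its tests. The shape below is an illustrative sketch, not the actual format the prototype will use.

```typescript
// Hypothetical project configuration for the AI developer.
// The field names are illustrative; the real prototype may use a different format.

interface ProjectConfig {
  repo: string;              // repository the AI developer will work in
  buildCommand: string;      // how to produce a build (and surface compile-time errors)
  testCommand: string;       // how to run the test suite
  devServerCommand?: string; // optional: how to serve the app so it can be rerecorded
}

const config: ProjectConfig = {
  repo: "github.com/example/webapp", // placeholder repository
  buildCommand: "npm run build",
  testCommand: "npm test",
  devServerCommand: "npm run dev",
};

export default config;
```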
There’s a lot going on under the hood between steps 3 and 4 (sketched in code after this list):
- The AI developer runs in tandem with Replay-based data flow analysis to generate a candidate patch for the problem.
- The candidate patch is built, and the AI uses feedback from logs to fix any compile-time errors.
- Tests run against the candidate patch, and the AI developer fixes any failures, using feedback from the logs and Replay to analyze the failures.
- Information from the original recording, like network payloads and user interactions, is used to rerecord the application with the candidate patch applied, making it easy to see what effect the patch had on the application’s behavior.
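Put together, that loop might look roughly like the sketch below. Every function here (`generateCandidatePatch`, `build`, `runTests`, `analyzeFailures`, `rerecord`) is a stand-in for work done by the AI developer and Replay, not a real API.

```typescript
// Hypothetical orchestration of the candidate-patch loop described above.
// All helpers are placeholders for the AI developer and Replay, not real APIs.

interface Patch { diff: string }
interface BuildResult { ok: boolean; log: string }
interface TestResult { ok: boolean; failures: string[] }

declare function generateCandidatePatch(recordingId: string, feedback?: string): Promise<Patch>;
declare function build(patch: Patch): Promise<BuildResult>;
declare function runTests(patch: Patch): Promise<TestResult>;
declare function analyzeFailures(patch: Patch, failures: string[]): Promise<string>;
declare function rerecord(recordingId: string, patch: Patch): Promise<string>; // new recording id

async function fixIssue(
  recordingId: string,
  maxAttempts = 3
): Promise<{ patch: Patch; newRecordingId: string }> {
  let feedback: string | undefined;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // 1. Generate a candidate patch from the recording's data flow analysis,
    //    plus any feedback from earlier failed attempts.
    const patch = await generateCandidatePatch(recordingId, feedback);

    // 2. Build it; compile-time errors become feedback for the next attempt.
    const buildResult = await build(patch);
    if (!buildResult.ok) {
      feedback = `Build failed:\n${buildResult.log}`;
      continue;
    }

    // 3. Run the tests; failures are analyzed (e.g. with Replay recordings of
    //    the failing tests) and fed back to the AI developer.
    const testResult = await runTests(patch);
    if (!testResult.ok) {
      feedback = await analyzeFailures(patch, testResult.failures);
      continue;
    }

    // 4. Rerecord the original interaction against the patched application so
    //    the PR can include before/after screenshots and videos.
    const newRecordingId = await rerecord(recordingId, patch);
    return { patch, newRecordingId };
  }
  throw new Error("No working patch found; hand off the partial analysis for manual follow-up.");
}
```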
Even when the AI developer doesn’t produce a working patch (this is new technology, after all), the workflow is still useful in several ways:
- If the AI is having trouble, Replay’s devtools and the data flow analysis information can be used to get a better understanding of the problem, either to write a patch by hand or to give more instructions to the AI on what it should do.
- If a developer ends up writing a patch by hand, they can rerecord the application from the original recording with that patch applied to see its effect, instead of walking through the reproduction steps again.
- The AI can autofix test failures in any PR, whether or not there is an original recording of the problem being addressed.
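That last point deserves its own small sketch: when a PR’s tests fail and there is no original recording, the same loop can be driven from recordings of the failing tests themselves. Again, every helper below is hypothetical; `fixFromRecording` plays the role of the candidate-patch loop sketched earlier.

```typescript
// Hypothetical sketch of autofixing test failures on a PR with no original recording.
// recordFailingTest stands in for recording a test run with Replay; fixFromRecording
// stands in for the candidate-patch loop shown above.

declare function runTestsOnPR(prNumber: number): Promise<{ failingTests: string[] }>;
declare function recordFailingTest(prNumber: number, testName: string): Promise<string>; // recording id
declare function fixFromRecording(recordingId: string): Promise<{ diff: string }>;

async function autofixPR(prNumber: number): Promise<void> {
  const { failingTests } = await runTestsOnPR(prNumber);
  for (const testName of failingTests) {
    // Record the failing test so the AI developer has concrete runtime behavior to work from.
    const recordingId = await recordFailingTest(prNumber, testName);
    const patch = await fixFromRecording(recordingId);
    console.log(`Proposed fix for ${testName}:\n${patch.diff}`);
  }
}
```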
If you’re interested in using autonomous AI developers to speed up new feature development, fix bugs easily, or cut down your backlog of issues, we’d love to hear from you. Reach us at hi@replay.io or fill out our contact form and we’ll be in touch.