AI Software Engineers: What Works?
Autonomous AI software engineers are attracting considerable attention, with new companies promising them launching every month. But how well do they work? Unfortunately, they're not widely accessible, and benchmark results can be hard to interpret, so it's difficult to tell.
I did a deep dive into AI software engineers (and coding assistants) to find out. In summary, I think that:
Despite the hype, AI software engineers do not work today.
They have the potential to work in the future.
Nonetheless, I think that everyone writing software should at least try coding assistants.
Read on for my reasoning as to why!
Understanding existing agents
Unfortunately, no autonomous AI software engineer is widely available to the public, so I couldn't test them directly. However, we can look at benchmark results. The most commonly used benchmark for AI software engineers is SWE-Bench, which measures the ability of AI systems to resolve real-world GitHub issues.
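To make the numbers below concrete, here's a rough sketch of how a SWE-Bench-style resolve rate is scored: each issue counts as resolved only if the agent's patch makes the issue's designated tests pass. The function below is illustrative, not the benchmark's actual harness.

```python
# Hypothetical sketch of SWE-Bench-style scoring: for each GitHub issue,
# the agent's patch is applied to the repo and the tests are run; an issue
# counts as "resolved" only if its fail-to-pass tests now succeed.

def resolve_rate(results):
    """results: one boolean per attempted issue (True = tests passed)."""
    return sum(results) / len(results)

# e.g., 11 resolved out of 50 attempted issues
print(f"{resolve_rate([True] * 11 + [False] * 39):.0%}")  # 22%
```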
The highest-performing agent on SWE-Bench achieves a 22.06% resolve rate. Unfortunately, a 22% success rate isn't usable in the real world. Even on SWE-Bench Verified (a human-validated subset where every issue is solvable, so a 100% resolve rate is attainable), the highest success rate is 45.2%. This is also unusable in the real world.
Beyond benchmark results, we can look at whether AI software engineers have actually been deployed. To my knowledge, there are no real-world deployments of autonomous AI software engineers. There is enormous demand for more productive software engineers, so there should be similar demand for AI software engineers. Given that demand, the absence of public success stories strongly suggests these systems don't yet work well enough to deploy.
Coding assistants
On the other hand, coding assistants are widely used and rapidly improving. These coding assistants include GitHub Copilot, Cursor, Supermaven, and others.
Copilot reportedly exceeded $100 million in recurring revenue last year, with rapid growth. It seems these coding assistants work - why else would developers use them? But what do we actually know about their efficacy?
Luckily, researchers have conducted high quality studies on how coding assistants affect developer productivity. One recent study randomly deployed Copilot at three large companies (Microsoft, Accenture, and an unnamed Fortune 100 company). In other words, some developers had access to Copilot and others didn’t. The researchers measured the number of tasks completed and found a 26% increase for the developers who had access to Copilot. The number of commits and code compiles also increased.
Other research studied the impact of Copilot on open-source contributions using a difference-in-differences approach. This research points in the same direction: Copilot increased the number of contributions to open-source projects. Surveys also show that a plurality of developers use some form of GenAI coding assistance.
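The difference-in-differences idea behind that open-source study can be sketched in a few lines: compare the change in output for developers who adopted Copilot against the change for those who didn't, so that background trends cancel out. All numbers below are made up for illustration; this is the general technique, not the study's actual data or code.

```python
# Difference-in-differences, minimal sketch with invented numbers:
# average monthly commits before and after Copilot became available.
treated_before, treated_after = 120, 150   # developers who adopted Copilot
control_before, control_after = 100, 110   # developers who did not

# The control group's change estimates the background trend;
# subtracting it isolates the effect attributable to Copilot.
did_estimate = (treated_after - treated_before) - (control_after - control_before)
print(did_estimate)  # 20 extra commits attributed to adoption
```

The key assumption is "parallel trends": absent Copilot, both groups would have changed by the same amount, so any extra change in the treated group is credited to the treatment.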
Beyond research, the co-creator of Django (among many others) has strongly endorsed using coding assistants. However, using them well requires a lot of experimentation. Simon Willison's blog is a great place to start learning.
Essentially, all of our best evidence points towards coding assistants being helpful. I strongly recommend them!
Predictions
Predictions are always difficult, especially in such a fast-moving field. I still think they're worth making, mostly to see how they line up with reality later. So here we go:
Coding assistants will get better: they directly benefit from better models (e.g., o1), better form factors (Copilot itself was an innovation!), and better data.
Autonomous software engineers will find a niche in the next few years, but they won’t be able to wholesale replace an entry-level Google software engineer in the next two years.
Next time, I’ll discuss some issues with autonomous software engineers, techniques to improve autonomous software engineers, and some questions around their deployments. Stay tuned and reach out if you’d like to chat about code and AI!