By BC Holmes, Chief Technologist Published: April 9, 2024 in Blog
A first impression of Cognition Labs’ AI Software Engineer and Its Potential Impact on the Future of Coding
Several months ago, I wrote a series of blog posts about ChatGPT. One of the key topics that I addressed was the claim that ChatGPT could code, and what I said at that time was that I found those claims unpersuasive. I acknowledged that it could implement a small-scale program: it could answer a basic question like, “can you write an implementation of FizzBuzz?” or “can you write a program that prints the first 100 Fibonacci numbers?” But it isn’t suitable for the much more complex tasks that we’d normally encounter in a software development project.
I did acknowledge that ChatGPT might be able to pass some technical interviews precisely because it can implement the sort of simple programming task that can be accomplished in an hour or so. It can pass the interview, but I don’t believe that it can do the job; that acknowledgement obviously raises discomfiting worries about how we structure interview coding assignments.
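For a sense of scale, here is roughly what the Fibonacci question asks for. This is a minimal sketch of my own, not output from ChatGPT; the function name `first_fibonacci` is made up for illustration, and I’m assuming the common convention that the sequence starts at 0, 1.

```python
def first_fibonacci(count: int) -> list[int]:
    """Return the first `count` Fibonacci numbers, starting from 0, 1 (an assumed convention)."""
    numbers = []
    a, b = 0, 1
    for _ in range(count):
        numbers.append(a)
        a, b = b, a + b  # slide the window forward by one term
    return numbers


# Show the first ten terms; asking for 100 works the same way.
print(first_fibonacci(10))  # -> [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

The whole thing fits on a screen and has one well-known answer, which is exactly why it makes a poor proxy for day-to-day project work.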
I’ve also written, approvingly, about GitHub Copilot: I confess that I often talk about it as a tool that automates the process of finding appropriate code snippets from StackOverflow and copying-and-pasting those snippets into your text editor. That reductive description seems to belie Microsoft’s claim about the productivity improvements that Copilot offers. For my part, I think that such a tool really does have high value for certain specific tasks that programmers do. I also know that a developer’s day involves a lot of work that Copilot can’t help with.
But in mid-March, a new, significant, AI-assisted development tool was announced: Cognition Labs’ Devin, which they bill as “the first AI software engineer.” Within days, videos started to appear on YouTube with titles like “AI just officially took our jobs” and “Should you still learn to code?” and “Meet Devin — the End of programmers as we know it.”
These are extreme claims (and obviously clickbait), but the available demos of Devin in action do give me pause.
In Cognition Labs’ introductory video, Devin is given a (relatively straightforward) programming assignment: a benchmarking comparison task described by a simple English paragraph. In the video, CEO Scott Wu describes how Devin first comes up with a plan: a description of the tasks that it’ll work on, in a simple natural-language checklist form. You can watch Devin tackle those tasks as it writes code in a code editor, runs the code on a command line, and views information in a browser. For me, one of the most eye-opening elements of the demo is how Devin is shown iterating over the coding problem, encountering errors, debugging those errors using print statements to gather more information, and correcting the errors in the code that it writes. Devin then takes the results of the programming tasks and creates a nice (but simple) styled output, including a bar graph of the benchmark results.
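To make that write, run, observe, revise cycle concrete, here is a toy sketch. It is emphatically not how Devin works internally (Cognition Labs hasn’t published that); every name in it is hypothetical, and the successive “drafts” are hard-coded where a real tool would presumably generate each revision in response to the previous error.

```python
from typing import Callable, List, Optional, Tuple

Draft = Callable[[int], str]


def run_checks(candidate: Draft) -> Optional[str]:
    """Run a draft against a few FizzBuzz cases; return the first failure message, or None."""
    expected = {1: "1", 3: "Fizz", 5: "Buzz", 15: "FizzBuzz"}
    for n, want in expected.items():
        got = candidate(n)
        if got != want:
            return f"fizzbuzz({n}) returned {got!r}, expected {want!r}"
    return None


def draft_1(n: int) -> str:
    # Hypothetical first attempt: forgets multiples of 5 entirely.
    return "Fizz" if n % 3 == 0 else str(n)


def draft_2(n: int) -> str:
    # Hypothetical second attempt: handles 3 and 5, but never the combined case.
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def draft_3(n: int) -> str:
    # Hypothetical third attempt: checks the combined case first.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)


def iterate(drafts: List[Draft]) -> Tuple[Optional[Draft], List[str]]:
    """Try each draft in order, logging what went wrong, until one passes every check."""
    log: List[str] = []
    for attempt, draft in enumerate(drafts, start=1):
        error = run_checks(draft)
        if error is None:
            log.append(f"attempt {attempt}: all checks pass")
            return draft, log
        log.append(f"attempt {attempt}: {error}")
    return None, log


winner, log = iterate([draft_1, draft_2, draft_3])
for line in log:
    print(line)
```

The interesting part of the demo is everything this sketch leaves out: deciding what to check, reading the failure, and producing the next draft.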
In another demo, Walden, a Cognition AI engineer, takes a programming job posting from Upwork and asks Devin to code the solution. Striking activities from that demo include checking out some base code from a linked GitHub repository, resolving some dependency version issues in that code that were preventing the sample from running, and then revising the sample code to perform the task. The task in question involved using an image-analysis ML model to scan through images of roads looking for potholes and other forms of damage.
Again, what distinguishes this from, say, ChatGPT is the iteration. ChatGPT offers up final solutions because it has, in effect, already memorized them. It knows the algorithm to derive Fibonacci numbers, so it immediately jumps to the final result (even though it “smartly” applies the conditions of the prompt such as “the first 100 Fibonacci numbers”). Copilot is similar: it makes inferences based on what the user is doing, and it pulls from a learned catalog of good answers, adapting the answer according to the context.
So, when I watch Devin iterate over a problem, change code, debug, and revise, I feel like I am watching how developers work on coding problems. I want to believe that it’s not artifice.
When ChatGPT first came on the scene, many people pointed out that its way of responding to questions (the slow, deliberate, word-at-a-time effect of producing sentences) was an implementation decision to improve the illusion of ChatGPT’s human-like intelligence. I’d be disheartened if I learned that Devin was similarly faking it. But the demos make me think otherwise.
A particularly striking case is offered up by Andrew He (again, a Cognition Labs employee), who instructed Devin to write a test case for a recently fixed bug in an open-source project that he maintains. During the exercise, Devin found a new bug, exposed by the test, which Devin then fixed. That’s… not nothing.
I do, however, have to take a lot on faith: while Devin is in early access, there’s no easy way to experiment with it (although they do have a Google form that I can fill in). We know that corporate demos are carefully curated. Apple is legendary in their demo stage-management discipline, but we know that they also deliver the goods. We don’t know enough about Cognition Labs to assess their follow-through.
But there are some other bits of evidence that Cognition is offering us in support of their claims. They graph themselves according to a “Real World Software Engineering Performance” measure (SWE-bench, an evaluation framework published by Princeton and University of Chicago researchers in October 2023, which involves “2,294 software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories”).
According to Cognition Labs, Devin can resolve 13.86% of those problems compared to ChatGPT’s 0.52%. (There’s an important footnote noting that Devin tackled a “random 25% subset” of the problems, and an assertion that whereas ChatGPT was assisted with information about exactly which files needed to be changed, Devin was not.) 13.86% is almost 2.9 times the next highest result, achieved by the Claude 2 tool.
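It helps to turn those percentages into rough counts. The back-of-the-envelope sketch below uses only the figures quoted above. It assumes the “random 25% subset” is exactly a quarter of the 2,294 problems (the exact subset size isn’t given here), and it back-calculates the next-best score from “almost 2.9 times,” so treat both as estimates rather than reported numbers.

```python
TOTAL_PROBLEMS = 2294    # size of the full SWE-bench set, as quoted above
SUBSET_FRACTION = 0.25   # "random 25% subset" (assumed to be exactly one quarter)
DEVIN_RATE = 0.1386      # 13.86% resolved, per Cognition Labs
CHATGPT_RATE = 0.0052    # 0.52% resolved
RATIO_TO_NEXT = 2.9      # "almost 2.9 times the next highest result"

subset_size = TOTAL_PROBLEMS * SUBSET_FRACTION   # about 574 problems attempted
devin_resolved = DEVIN_RATE * subset_size        # about 79 problems resolved
devin_vs_chatgpt = DEVIN_RATE / CHATGPT_RATE     # about 26.7 times ChatGPT's rate
implied_next_best = DEVIN_RATE / RATIO_TO_NEXT   # about 4.8%, an estimate only

print(f"Problems attempted (estimated): {subset_size:.0f}")
print(f"Problems resolved (estimated):  {devin_resolved:.0f}")
print(f"Devin vs. ChatGPT:              {devin_vs_chatgpt:.1f}x")
print(f"Implied next-best score:        {implied_next_best:.2%}")
```

In other words, the headline number corresponds to something like 80 resolved issues out of several hundred attempted.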
Those stats look impressive, but a tool that can only solve 13.86% of problems is not going to completely replace developers in the short term. I am, however, comfortable with a tool that can tackle low-hanging fruit; if Devin can free my human developers up from some easy tasks, that’s still a win.
AI-assisted development continues to be a key trend to watch in 2024, and Devin has raised the bar in a very interesting way.