The Quality of Auto-Generated Code
When Copilot writes your code, will you care whether it’s good or bad?
Kevlin Henney and I were riffing on some ideas about GitHub Copilot, the tool for automatically generating code based on GPT-3’s language model, trained on the body of code that’s in GitHub. This article poses some questions and (perhaps) some answers, without trying to present any conclusions.
First, we wondered about code quality. There are lots of ways to solve a given programming problem, but most of us have some ideas about what makes code “good” or “bad.” Is it readable? Is it well-organized? Things like that. In a professional setting, where software needs to be maintained and modified over long periods, readability and organization count for a lot.
We know how to test whether or not code is correct (at least up to a certain limit). Given enough unit tests and acceptance tests, we can imagine a system for automatically generating code that is correct. Property-based testing might give us some additional ideas about building test suites robust enough to verify that code works properly. But we don’t have methods to test for code that’s “good.” Imagine asking Copilot to write a function that sorts a list. There are lots of ways to sort. Some are pretty good—for example, quicksort. Some of them are awful. But a unit test has no way of telling whether a function is implemented using quicksort, permutation sort (which completes in factorial time), sleep sort, or one of the other strange sorting algorithms that Kevlin has been writing about.
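To make that concrete, here is a minimal Python sketch (our own toy example, not anything Copilot produced) in which a reasonable quicksort and a factorial-time permutation sort both pass exactly the same unit test. The test checks what the function returns, not how it gets there.

```python
# Two ways to sort a list: one reasonable, one catastrophically slow.
# Both pass the identical unit test, which is the point: the test
# checks behavior, not algorithmic quality.

from itertools import permutations


def quicksort(xs):
    """Average-case O(n log n)."""
    if len(xs) <= 1:
        return list(xs)
    pivot, *rest = xs
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))


def permutation_sort(xs):
    """O(n!): try orderings until one happens to be sorted."""
    return next(
        list(perm)
        for perm in permutations(xs)
        if all(a <= b for a, b in zip(perm, perm[1:]))
    )


def check_sort(sort_fn):
    # The same behavioral checks pass for both implementations.
    assert sort_fn([]) == []
    assert sort_fn([3, 1, 2]) == [1, 2, 3]
    assert sort_fn([2, 1, 2]) == [1, 2, 2]


check_sort(quicksort)
check_sort(permutation_sort)  # also passes; the test can't tell the difference
```

A property-based test (say, “the output is ordered and is a permutation of the input”) would be a stronger check of correctness, but it would still pass for both implementations; telling them apart requires measuring something like running time, not just results.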
Do we care? Well, we care about O(N log N) behavior versus O(N!). But assuming that we have some way to resolve that issue, if we can specify a program’s behavior precisely enough so that we are highly confident that Copilot will write code that’s correct and tolerably performant, do we care about its aesthetics? Do we care whether it’s readable? Forty years ago, we might have cared about the assembly language code generated by a compiler. But today, we don’t, except for a few increasingly rare corner cases that usually involve device drivers or embedded systems. If I write something in C and compile it with gcc, realistically I’m never going to look at the compiler’s output. I don’t need to understand it.
To get to this point, we may need a meta-language for describing what we want the program to do that’s almost as detailed as a modern high-level language. That could be what the future holds: an understanding of “prompt engineering” that lets us tell an AI system precisely what we want a program to do, rather than how to do it. Testing would become much more important, as would understanding precisely the business problem that needs to be solved. “Slinging code” in whatever language would become less common.
But what if we don’t get to the point where we trust automatically generated code as much as we now trust the output of a compiler? Readability will be at a premium as long as humans need to read code. If we have to read the output from one of Copilot’s descendants to judge whether or not it will work, or if we have to debug that output because it mostly works, but fails in some cases, then we will need it to generate code that’s readable. Not that humans currently do a good job of writing readable code; but we all know how painful it is to debug code that isn’t readable, and we all have some concept of what “readability” means.
Second: Copilot was trained on the body of code in GitHub. At this point, it is all (or almost all) written by humans. Some of it is good, high quality, readable code; a lot of it isn’t. What if Copilot became so successful that Copilot-generated code came to constitute a significant percentage of the code on GitHub? The model will certainly need to be re-trained from time to time. So now, we have a feedback loop: Copilot trained on code that has been (at least partially) generated by Copilot. Does code quality improve? Or does it degrade? And again, do we care, and why?
This question can be argued either way. People working on automated tagging for AI seem to be taking the position that iterative tagging leads to better results: i.e., after a tagging pass, use a human-in-the-loop to check some of the tags, correct them where wrong, and then use this additional input in another training pass. Repeat as needed. That’s not all that different from current (non-automated) programming: write, compile, run, debug, as often as needed to get something that works. The feedback loop enables you to write good code.
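As a rough illustration of that loop, the shape is: propose labels, have a human correct a sample, fold the corrections back in, and repeat. The Python below is a toy sketch under our own assumptions—every function and the toy task are hypothetical stand-ins, not any real labeling pipeline.

```python
# Toy sketch of an iterative human-in-the-loop labeling process.
# All functions here are hypothetical stand-ins, not a real API.

import random


def auto_label(model, items):
    # The model proposes a label for every item.
    return {item: model(item) for item in items}


def human_review(labels, sample_size, oracle):
    # A human checks a sample of the proposed labels and corrects them.
    corrections = {}
    for item in random.sample(list(labels), k=min(sample_size, len(labels))):
        true_label = oracle(item)          # the human's judgment
        if labels[item] != true_label:
            corrections[item] = true_label
    return corrections


def retrain(model, corrections):
    # Fold corrected labels back in. Here we just fake it: a real
    # system would fine-tune on the corrections.
    def improved(item):
        return corrections.get(item, model(item))
    return improved


# Toy task: label numbers as "even" or "odd". The starting model is
# useless; each review pass corrects part of its output.
items = list(range(100))
oracle = lambda n: "even" if n % 2 == 0 else "odd"
model = lambda n: "unknown"                  # deliberately useless start

for _ in range(5):                           # repeat as needed
    labels = auto_label(model, items)
    corrections = human_review(labels, sample_size=20, oracle=oracle)
    model = retrain(model, corrections)

accuracy = sum(model(n) == oracle(n) for n in items) / len(items)
print(f"accuracy after five review passes: {accuracy:.0%}")
```

The loop only improves what the human reviewer actually looks at—which is exactly the scaling worry raised in the next paragraph.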
A human-in-the-loop approach to training an AI code generator is one possible way of getting “good code” (for whatever “good” means)—though it’s only a partial solution. Issues like indentation style, meaningful variable names, and the like are only a start. Evaluating whether a body of code is structured into coherent modules, has well-designed APIs, and could easily be understood by maintainers is a more difficult problem. Humans can evaluate code with these qualities in mind, but it takes time. A human-in-the-loop might help to train AI systems to design good APIs, but at some point, the “human” part of the loop will start to dominate the rest.
If you look at this problem from the standpoint of evolution, you see something different. If you breed plants or animals (a highly selected form of evolution) for one desired quality, you will almost certainly see all the other qualities degrade: you’ll get large dogs with hips that don’t work, or dogs with flat faces that can’t breathe properly.
What direction will automatically generated code take? We don’t know. Our guess is that, without ways to measure “code quality” rigorously, code quality will probably degrade. Ever since Peter Drucker, management consultants have liked to say, “If you can’t measure it, you can’t improve it.” And we suspect that applies to code generation, too: aspects of the code that can be measured will improve; aspects that can’t, won’t. Or, as the accounting historian H. Thomas Johnson said, “Perhaps what you measure is what you get. More likely, what you measure is all you’ll get. What you don’t (or can’t) measure is lost.”
We can write tools to measure some superficial aspects of code quality, like obeying stylistic conventions. We already have tools that can “fix” fairly superficial quality problems like indentation. But again, that superficial approach doesn’t touch the more difficult parts of the problem. If we had an algorithm that could score readability, and restricted Copilot’s training set to code that scores in the 90th percentile, we would certainly see output that looks better than most human code. Even then, it’s unclear whether that algorithm could determine whether variables and functions had appropriate names, let alone whether a large project was well-structured.
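The filtering step itself is easy to sketch; the hard part is the scoring function. In the Python below, readability_score is a crude, hypothetical stand-in (average line length and identifier length), filling in for the rigorous metric we don’t actually have.

```python
# Sketch of "train only on the most readable decile of the corpus."
# readability_score is a hypothetical stand-in, not a real metric.

import re
import statistics


def readability_score(source: str) -> float:
    # Crude heuristic: prefer shorter lines and longer, more
    # descriptive identifiers.
    lines = source.splitlines() or [""]
    avg_line_len = statistics.mean(len(line) for line in lines)
    names = re.findall(r"[A-Za-z_]\w*", source)
    avg_name_len = statistics.mean(len(n) for n in names) if names else 0.0
    return avg_name_len - 0.1 * avg_line_len


def top_decile(corpus: dict) -> dict:
    # Keep only files whose score reaches the 90th percentile.
    scores = {path: readability_score(src) for path, src in corpus.items()}
    cutoff = statistics.quantiles(scores.values(), n=10)[-1]
    return {path: src for path, src in corpus.items() if scores[path] >= cutoff}
```

Filtering like this would shift the superficial statistics of the training set; it says nothing about naming, module boundaries, or whether the projects the files came from hang together.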
And a third time: do we care? If we have a rigorous way to express what we want a program to do, we may never need to look at the underlying C or C++. At some point, one of Copilot’s descendants may not need to generate code in a “high level language” at all: perhaps it will generate machine code for your target machine directly. And perhaps that target machine will be WebAssembly, the JVM, or something else that’s very highly portable.
Do we care whether tools like Copilot write good code? We will, until we don’t. Readability will be important as long as humans have a part to play in the debugging loop. The important question probably isn’t “do we care”; it’s “when will we stop caring?” When we can trust the output of a code model, we’ll see a rapid phase change. We’ll care less about the code, and more about describing the task (and appropriate tests for that task) correctly.