Auto-optimizing an algorithm
After I found that llms are useful, I wanted to get a better feel for what they are useful for. Clearly, they are good at generating boilerplate and writing bash one-liners, but what else?
I wanted to see how difficult it was to set up an agent loop with some objective that it can evaluate autonomously, let it run, and see what happens. I also wanted some idea of what was happening in the process, but without human intervention. Code optimization has a lot of interesting properties here: it's easy to test for correctness, and it has a (fairly) clear objective function1.
So which code should we optimize? I figured I'd start at a blank slate, choose a problem that could have some different good solutions, that is easy to manually test or verify (if needed), and that I sort-of know from before.
After some brainstorming with the machine I chose the Levenshtein edit distance.
Given two strings a and b, how many edits do you have to make to turn a into b?
One edit is a single insertion, a single removal, or replacing a single character,
and we want to find the minimal number of edits.
We don't care what the edits actually are.
The Reference Solution
I guess you could brute force this and go through a search tree trying all combinations,
but that would be very inefficient.
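To make the inefficiency concrete, here's roughly what that brute force could look like. This is my own sketch (the function name is made up, and it's not code from the experiment):

```c
#include <string.h>

// Brute-force edit distance: recursively peel characters off both
// strings, trying delete, insert, and replace at every step.
// Exponential time, so only viable for tiny inputs.
int brute_edit(const char *a, const char *b) {
    if (*a == '\0') return (int)strlen(b); // only insertions remain
    if (*b == '\0') return (int)strlen(a); // only deletions remain
    int del = brute_edit(a + 1, b) + 1;     // remove a character from a
    int ins = brute_edit(a, b + 1) + 1;     // insert a character into a
    int sub = brute_edit(a + 1, b + 1)      // replace, free if they match
            + (*a == *b ? 0 : 1);
    int d = del;
    if (ins < d) d = ins;
    if (sub < d) d = sub;
    return d;
}
```

The same three choices show up again below; the only difference is that the proper solution remembers its answers instead of recomputing them.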
The standard "proper"2 solution uses dynamic programming:
we create a table `dist[i][j]` that holds the edit distance between `a[..i]` and `b[..j]`,
and build this table bottom-up.
`dist[0][k] == dist[k][0] == k`, because to go from the empty string to any other string
you have to insert each character.
This is the base case.
For the other entries there are three cases:

- remove a character: cost `dist[i-1][j] + 1`,
- add a character: cost `dist[i][j-1] + 1`,
- replace: cost `dist[i-1][j-1]`, plus `1` if they are different and `0` otherwise.
Confused yet? Here's a half-filled out table:
| | . | a | g | e | n | t |
|---|---|---|---|---|---|---|
| . | 0 | 1 | 2 | 3 | 4 | 5 |
| p | 1 | 1 | 2 | 3 | 4 | 5 |
| a | 2 | 1 | ? | | | |
| g | 3 | | | | | |
| a | 4 | | | | | |
| n | 5 | | | | | |
Finding `?` in the table is computing `edit("pa", "ag")`.
We have some options:
- Pay `1` to remove the `'a'` in `"pa"` and then do the rest of the edits: `edit("p", "ag") + 1`.
- Do the edit from `"pa"` to just `"a"` and then insert the `'g'`: `edit("pa", "a") + 1`.
- Replace the last character and edit the rest: `edit("p", "a") + 1`. We get the `+1` part since the characters don't match.
We consider all three and pick the cheapest:

- `edit("p", "ag") + 1 = dist[1][2] + 1 == 3`
- `edit("pa", "a") + 1 = dist[2][1] + 1 == 2`
- `edit("p", "a") + 1 = dist[1][1] + 1 == 2`

In this case it's a tie:
for the full edits,
we can either go "ag" -> "aa" -> "pa" where we replace both times,
or we can go "ag" -> "a" -> "pa" where we first delete and then insert.
I don't know, it's kinda confusing, and very tempting to frame as
peeling off characters on both strings until nothing is left, dist[0][0],
but then we have to show that this solves the same problem.
Computing the table, though, is now super easy. Look at the three neighbors
to the top left and add one to the smallest of those. If the letters match
at your position, you get to choose the diagonal one for free.
Here's the filled out table with arrows showing a path we can take for the optimal cost of 3:
| | . | a | g | e | n | t |
|---|---|---|---|---|---|---|
| . | 0 | ←1 | ←2 | ←3 | ←4 | ←5 |
| p | ↑1 | ↖1 | ←2 | ←3 | ←4 | ←5 |
| a | ↑2 | ↖1 | ←2 | ←3 | ←4 | ←5 |
| g | ↑3 | ↑2 | ↖1 | ←2 | ←3 | ←4 |
| a | ↑4 | ↑3 | ↑2 | ↖2 | ←3 | ←4 |
| n | ↑5 | ↑4 | ↑3 | ↑3 | ↖2 | ←3 |
agent -> agen -> agan -> pagan
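Put together, the whole recurrence fits in a few lines. This is my own minimal full-table sketch (not the agent's code), using a C99 variable-length array, so it's only suitable for short strings:

```c
#include <string.h>

// Full-table edit distance, following the recurrence above.
int edit_distance_table(const char *a, const char *b) {
    int n = (int)strlen(a), m = (int)strlen(b);
    int dist[n + 1][m + 1]; // VLA on the stack: short strings only

    // Base case: the empty string needs k insertions to become
    // a string of length k (first row and first column).
    for (int i = 0; i <= n; i++) dist[i][0] = i;
    for (int j = 0; j <= m; j++) dist[0][j] = j;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            int del = dist[i - 1][j] + 1; // remove a[i-1]
            int ins = dist[i][j - 1] + 1; // insert b[j-1]
            int sub = dist[i - 1][j - 1]  // replace, free on a match
                    + (a[i - 1] == b[j - 1] ? 0 : 1);
            int d = del;
            if (ins < d) d = ins;
            if (sub < d) d = sub;
            dist[i][j] = d;
        }
    }
    return dist[n][m];
}
```

With `a = "pagan"` and `b = "agent"` this fills in exactly the table above and returns 3.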
Setup
Look, this isn't actual science, and you're not my boss, so I can do what I want. Or rather, I don't have to do things I don't want to do. And I didn't want to write tests or benchmarks, or even think too hard about edge-cases or input distributions. So instead of designing a test suite and a benchmark system I first had agents generate these too.
Also, the agent generated the reference implementation:
#include <stdlib.h> // needed for malloc/free

// THIS IS PURE LLM GENERATED CODE
int ref_edit_distance(
const unsigned char *s,
int s_len,
const unsigned char *t,
int t_len
) {
if (s_len == 0) return t_len;
if (t_len == 0) return s_len;
int *prev = (int *)malloc((t_len + 1) * sizeof(int));
int *curr = (int *)malloc((t_len + 1) * sizeof(int));
for (int j = 0; j <= t_len; j++)
prev[j] = j;
for (int i = 1; i <= s_len; i++) {
curr[0] = i;
for (int j = 1; j <= t_len; j++) {
int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
int del = prev[j] + 1;
int ins = curr[j - 1] + 1;
int sub = prev[j - 1] + cost;
int d = del;
if (ins < d) d = ins;
if (sub < d) d = sub;
curr[j] = d;
}
int *tmp = prev;
prev = curr;
curr = tmp;
}
int r = prev[t_len];
free(prev);
free(curr);
return r;
}
It has one funny trick, namely that instead of computing the whole table it
computes one row at a time and swaps which of the two rows it reads from
and which it writes to.
In iteration `i`, `prev` is row `i-1` and `curr` is row `i`.
Setting up the harness was a little tricky because I tried to over-engineer it at first,
with sandboxing and custom tools, trying to limit the exposure the agent had
to the benchmarks and previous versions.
In the end I used pi and wrote separate agents for separate stages
of the process: one planner, one implementor, one reviewer, and the "top" model to orchestrate.
A pretty standard setup, except that the implementor wasn't allowed to run the benchmarks.
The setup was such that the planner wrote a hypothesis, the implementor created a submission that passed
the test suite,
the orchestrator ran the benchmarks that spat out a json file with wall-clock time and perf numbers,
and the reviewer looked at those and wrote a conclusion.
The conclusions were collected in a global FINDINGS.md so that the planner in the next iteration
wouldn't have to read too much to make a new plan.
I also made sure that the agents communicated through written plans so that I could store these, in case they
turned out to be interesting.
pi can probably enforce this with an extension or something, but I didn't bother trying to figure out how.
I also found some third party code that looked like it would be a good reference and ran the benchmark on those.
Results
Short version: llms are capable of optimizing code when given a harness like this.
Here are some timings.
Benchmark names encode the input size, alphabet size, and mutation rate.
IDs are incremental, and the #99x entries are external implementations.
The table is sorted on the middle column.
| ID | 256α4m30 | 1Kα26m15 | 4Kα4m10 | 16K-asym | 32Kα4m80 |
|---|---|---|---|---|---|
| #991 | 12.6 µs | 88.6 µs | 343.0 µs | 7.34 ms | 72.18 ms |
| #992 | 25.7 µs | 123.3 µs | 607.7 µs | 224.47 ms | 1.26 s |
| #020 | 3.4 µs | 54.9 µs | 687.1 µs | 5.78 ms | 40.56 ms |
| #016 | 3.4 µs | 55.9 µs | 698.1 µs | 5.91 ms | 43.25 ms |
| #019 | 3.4 µs | 54.3 µs | 701.4 µs | 5.77 ms | 40.14 ms |
| #018 | 3.4 µs | 60.4 µs | 702.7 µs | 6.20 ms | 44.69 ms |
| #017 | 3.4 µs | 55.3 µs | 727.0 µs | 5.90 ms | 40.49 ms |
| #015 | 5.4 µs | 60.5 µs | 793.9 µs | 6.51 ms | 46.40 ms |
| #012 | 4.7 µs | 57.8 µs | 852.3 µs | 6.77 ms | 52.32 ms |
| #013 | 3.8 µs | 60.0 µs | 855.1 µs | 6.78 ms | 52.53 ms |
| #014 | 3.8 µs | 59.7 µs | 883.0 µs | 6.90 ms | 58.18 ms |
| #004 | 4.8 µs | 64.4 µs | 947.0 µs | 8.19 ms | 58.72 ms |
| #010 | 4.4 µs | 65.1 µs | 973.2 µs | 7.62 ms | 59.09 ms |
| #003 | 4.7 µs | 64.2 µs | 1.02 ms | 7.96 ms | 60.26 ms |
| #011 | 3.8 µs | 72.1 µs | 1.05 ms | 8.29 ms | 64.83 ms |
| #009 | 3.8 µs | 70.6 µs | 1.05 ms | 8.57 ms | 64.31 ms |
| #007 | 4.8 µs | 67.8 µs | 1.08 ms | 8.19 ms | 67.41 ms |
| #008 | 5.8 µs | 74.1 µs | 1.11 ms | 8.47 ms | 66.10 ms |
| #006 | 5.3 µs | 70.0 µs | 1.12 ms | 8.56 ms | 67.15 ms |
| #002 | 7.9 µs | 85.0 µs | 1.12 ms | 11.04 ms | 104.66 ms |
| #001 | 6.8 µs | 86.0 µs | 1.27 ms | 10.10 ms | 87.12 ms |
| #005 | 7.2 µs | 91.2 µs | 1.35 ms | 10.79 ms | 85.71 ms |
| #993 | 97.0 µs | 1.50 ms | 23.65 ms | 204.17 ms | 1.57 s |
Here's a table of three good versions, one of which is external.
| Case | #016 | #020 | #991 |
|---|---|---|---|
| 256-sym-alpha4-mut10 | 3.4 µs | 3.5 µs | 7.4 µs |
| 256-sym-alpha4-mut30 | 3.4 µs | 3.4 µs | 12.6 µs |
| 256-sym-alpha26-mut15 | 3.4 µs | 3.3 µs | 15.0 µs |
| 256-sym-alpha256-mut15 | 3.3 µs | 3.4 µs | 59.3 µs |
| 256-asym-128x256 | 1.9 µs | 1.9 µs | 9.0 µs |
| 256-sym-alpha4-mut80 | 3.4 µs | 3.4 µs | 16.6 µs |
| 1k-sym-alpha4-mut10 | 48.3 µs | 48.9 µs | 50.3 µs |
| 1k-sym-alpha4-mut30 | 50.2 µs | 48.7 µs | 94.2 µs |
| 1k-sym-alpha26-mut15 | 55.9 µs | 54.9 µs | 88.6 µs |
| 1k-sym-alpha256-mut15 | 39.8 µs | 37.8 µs | 351.1 µs |
| 1k-asym-512x1024 | 30.1 µs | 30.1 µs | 55.8 µs |
| 1k-sym-alpha4-mut80 | 49.7 µs | 49.7 µs | 127.6 µs |
| 4k-sym-alpha4-mut10 | 698.1 µs | 687.1 µs | 343.0 µs |
| 4k-sym-alpha4-mut30 | 696.4 µs | 688.2 µs | 833.9 µs |
| 4k-sym-alpha26-mut15 | 783.6 µs | 763.0 µs | 549.1 µs |
| 4k-sym-alpha256-mut15 | 462.5 µs | 432.8 µs | 1.56 ms |
| 4k-asym-2048x4096 | 409.2 µs | 381.4 µs | 546.3 µs |
| 4k-sym-alpha4-mut80 | 759.4 µs | 694.8 µs | 1.39 ms |
| 8k-sym-alpha4-mut10 | 2.99 ms | 2.71 ms | 918.1 µs |
| 8k-sym-alpha4-mut30 | 2.96 ms | 2.73 ms | 2.77 ms |
| 8k-sym-alpha26-mut15 | 3.10 ms | 3.02 ms | 1.62 ms |
| 8k-sym-alpha256-mut15 | 1.72 ms | 1.66 ms | 3.47 ms |
| 8k-asym-4096x8192 | 1.46 ms | 1.49 ms | 2.03 ms |
| 8k-sym-alpha4-mut80 | 2.71 ms | 2.72 ms | 5.15 ms |
| 16k-sym-alpha4-mut10 | 10.54 ms | 10.83 ms | 2.67 ms |
| 16k-sym-alpha4-mut30 | 10.61 ms | 10.70 ms | 10.01 ms |
| 16k-sym-alpha26-mut15 | 11.99 ms | 11.97 ms | 5.28 ms |
| 16k-sym-alpha256-mut15 | 6.45 ms | 6.52 ms | 9.13 ms |
| 16k-asym-8192x16384 | 5.91 ms | 5.78 ms | 7.34 ms |
| 16k-sym-alpha4-mut80 | 10.51 ms | 10.48 ms | 18.68 ms |
| 32k-sym-alpha4-mut10 | 41.42 ms | 42.74 ms | 9.61 ms |
| 32k-sym-alpha4-mut30 | 42.15 ms | 43.96 ms | 37.91 ms |
| 32k-sym-alpha26-mut15 | 49.18 ms | 46.96 ms | 17.65 ms |
| 32k-sym-alpha256-mut15 | 24.20 ms | 24.47 ms | 26.09 ms |
| 32k-asym-16384x32768 | 22.82 ms | 23.11 ms | 27.75 ms |
| 32k-sym-alpha4-mut80 | 43.25 ms | 40.56 ms | 72.18 ms |
| Total Wins | 8 | 18 | 10 |
Some of these differences are close to the benchmark noise, but the generated code
is undeniably faster on certain patterns of inputs compared to the external #991.
The flipside is also true.
The code is pretty bad.
It's very typical "highly-optimized-and-butt-ugly" code.
There are no abstractions pulled out, apart from some functions that are
specialized for smaller string lengths and marked to be placed in certain sections of the ELF.
In other words, no care was taken in constructing a program that makes sense.
But, the llm has no problem explaining what's going on, and with some external help
it's very doable to sit down and work through it, just as you would have to do
if you wanted to understand any other code that you didn't write.
As it was running, I was looking at the output every now and then and
saw the agent run objdump on the target executable to confirm that certain regions
got auto-vectorized properly.
With the right tools, I think it could have done way better;
what if it could easily see how instructions are scheduled on the execution ports to detect backend-pressure?
These tools probably do exist, but neither I nor the agent seemed to know them.
What's useful for humans is also often useful for agents. I hope this wave of llms will both lower the bar and raise the leverage to create good tools that humans and agents can use in order to understand the systems we create.
Thanks for reading.
Footnotes
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License