mht.wtf

Computer science, programming, and whatnot.

Auto-optimizing an algorithm

March 09, 2026

After I found that LLMs are useful, I wanted to get a better feel for what they are useful for. Clearly, they are good at generating boilerplate and writing bash one-liners, but what else?

I wanted to see how difficult it was to set up an agent loop with an objective it can evaluate autonomously, let it run, and see what happens. I also wanted some idea of what was happening in the process, without human intervention. Code optimization is an example of something with a lot of interesting properties: it's easy to test for correctness, and it has a (fairly) clear objective function1.

So which code should we optimize? I figured I'd start from a blank slate and choose a problem that could have several different good solutions, that is easy to test or verify manually (if needed), and that I sort of know from before.

After some brainstorming with the machine I chose the Levenshtein edit distance. Given two strings a and b, how many edits do you have to make to turn a into b? One edit is a single insertion, a single removal, or replacing a single character, and we want to find the minimal number of edits. We don't care what the edits actually are.

The Reference Solution

I guess you could brute-force this and go through a search tree trying all combinations, but that would be very inefficient. The standard "proper"2 solution uses dynamic programming: we create a table dist[i][j] that holds the edit distance between a[..i] and b[..j], and build this table bottom-up.

dist[0][k] == dist[k][0] == k, because to go from the empty string to any other string you have to insert each character. This is the base case.

For the other entries there are three cases:

  1. remove a character: cost dist[i-1][j] + 1,
  2. add a character: cost dist[i][j-1] + 1,
  3. replace: cost dist[i-1][j-1], plus 1 if they are different and 0 otherwise.

Confused yet? Here's a half-filled out table:

. a g e n t
. 0 1 2 3 4 5
p 1 1 2 3 4 5
a 2 1 ?
g 3
a 4
n 5

Finding ? in the table means computing edit("pa", "ag"). We have some options:

  1. Pay 1 to remove the 'a' in "pa" and then do the rest of the edits: edit("p", "ag") + 1.
  2. Do the edit from "pa" to just "a" and then insert the 'g': edit("pa", "a") + 1
  3. Replace the last character and edit the rest: edit("p", "a") + 1. We get the +1 since the characters don't match.

We consider all three and pick the cheapest:

  1. edit("p", "ag") + 1 = dist[1][2] + 1 == 3
  2. edit("pa", "a") + 1 = dist[2][1] + 1 == 2
  3. edit("p", "a") + 1 = dist[1][1] + 1 == 2

In this case it's a tie: for the full edits, we can either go "ag" -> "aa" -> "pa" where we replace both times, or we can go "ag" -> "a" -> "pa" where we first delete and then insert.

I don't know, it's kind of confusing, and very tempting to frame as peeling off characters from both strings until nothing is left, dist[0][0], but then we would have to show that this solves the same problem. Computing the table, though, is now super easy: look at the three neighbors to the top left and add one to the smallest of them. If the letters at your position match, you get to take the diagonal one for free.

Here's the filled out table with arrows showing a path we can take for the optimal cost of 3:

. a g e n t
. 0 ←1 ←2 ←3 ←4 ←5
p ↑1 ↖1 ←2 ←3 ←4 ←5
a ↑2 ↖1 ←2 ←3 ←4 ←5
g ↑3 ↑2 ↖1 ←2 ←3 ←4
a ↑4 ↑3 ↑2 ↖2 ←3 ←4
n ↑5 ↑4 ↑3 ↑3 ↖2 ←3

agent -> agen -> agan -> pagan

Setup

Look, this isn't actual science, and you're not my boss, so I can do what I want. Or rather, I don't have to do things I don't want to do. And I didn't want to write tests or benchmarks, or even think too hard about edge cases or input distributions. So instead of designing a test suite and a benchmark system, I had agents generate those too.

Also, the agent generated the reference implementation:

#include <stdlib.h> /* malloc, free */

// THIS IS PURE LLM GENERATED CODE
int ref_edit_distance(
    const unsigned char *s,
    int s_len,
    const unsigned char *t,
    int t_len
) {
  if (s_len == 0) return t_len;
  if (t_len == 0) return s_len;

  int *prev = (int *)malloc((t_len + 1) * sizeof(int));
  int *curr = (int *)malloc((t_len + 1) * sizeof(int));

  for (int j = 0; j <= t_len; j++)
    prev[j] = j;

  for (int i = 1; i <= s_len; i++) {
    curr[0] = i;

    for (int j = 1; j <= t_len; j++) {
      int cost = (s[i - 1] == t[j - 1]) ? 0 : 1;

      int del = prev[j] + 1;
      int ins = curr[j - 1] + 1;
      int sub = prev[j - 1] + cost;

      int d = del;
      if (ins < d) d = ins;
      if (sub < d) d = sub;
      curr[j] = d;
    }

    int *tmp = prev;
    prev = curr;
    curr = tmp;
  }

  int r = prev[t_len];
  free(prev);
  free(curr);
  return r;
}

It has one funny trick: instead of computing the whole table, it computes one row at a time and swaps which of the two rows it reads from and which it writes to. In iteration i, prev is row i-1 and curr is row i.

Setting up the harness was a little tricky because I tried to over-engineer it at first, with sandboxing and custom tools, trying to limit the agent's exposure to the benchmarks and previous versions. In the end I used pi and wrote separate agents for separate stages of the process: one planner, one implementor, one reviewer, and the "top" model to orchestrate. A pretty standard setup, except that the implementor wasn't allowed to run the benchmarks.

The setup was such that the planner wrote a hypothesis, the implementor created a submission that passed the test suite, the orchestrator ran the benchmarks, which spat out a JSON file with wall-clock times and perf numbers, and the reviewer looked at those and wrote a conclusion. Then the global FINDINGS.md was updated so that the planner in the next iteration wouldn't have to read too much to make a new plan.

I also made sure the agents communicated through written plans so that I could store these, in case they turned out to be interesting. pi can probably enforce this with an extension or something, but I didn't bother trying to figure out how.

I also found some third-party implementations that looked like they would make good references and ran the benchmark on those.

Results

Short version: LLMs are capable of optimizing code when given a harness like this. Here are some timings. Benchmark names encode input size, alphabet size, and mutation rate. IDs are incremental, and the #99x entries are external implementations. The table is sorted on the middle column.

ID 256α4m30 1Kα26m15 4Kα4m10 16K-asym 32Kα4m80
#991 12.6 µs 88.6 µs 343.0 µs 7.34 ms 72.18 ms
#992 25.7 µs 123.3 µs 607.7 µs 224.47 ms 1.26 s
#020 3.4 µs 54.9 µs 687.1 µs 5.78 ms 40.56 ms
#016 3.4 µs 55.9 µs 698.1 µs 5.91 ms 43.25 ms
#019 3.4 µs 54.3 µs 701.4 µs 5.77 ms 40.14 ms
#018 3.4 µs 60.4 µs 702.7 µs 6.20 ms 44.69 ms
#017 3.4 µs 55.3 µs 727.0 µs 5.90 ms 40.49 ms
#015 5.4 µs 60.5 µs 793.9 µs 6.51 ms 46.40 ms
#012 4.7 µs 57.8 µs 852.3 µs 6.77 ms 52.32 ms
#013 3.8 µs 60.0 µs 855.1 µs 6.78 ms 52.53 ms
#014 3.8 µs 59.7 µs 883.0 µs 6.90 ms 58.18 ms
#004 4.8 µs 64.4 µs 947.0 µs 8.19 ms 58.72 ms
#010 4.4 µs 65.1 µs 973.2 µs 7.62 ms 59.09 ms
#003 4.7 µs 64.2 µs 1.02 ms 7.96 ms 60.26 ms
#011 3.8 µs 72.1 µs 1.05 ms 8.29 ms 64.83 ms
#009 3.8 µs 70.6 µs 1.05 ms 8.57 ms 64.31 ms
#007 4.8 µs 67.8 µs 1.08 ms 8.19 ms 67.41 ms
#008 5.8 µs 74.1 µs 1.11 ms 8.47 ms 66.10 ms
#006 5.3 µs 70.0 µs 1.12 ms 8.56 ms 67.15 ms
#002 7.9 µs 85.0 µs 1.12 ms 11.04 ms 104.66 ms
#001 6.8 µs 86.0 µs 1.27 ms 10.10 ms 87.12 ms
#005 7.2 µs 91.2 µs 1.35 ms 10.79 ms 85.71 ms
#993 97.0 µs 1.50 ms 23.65 ms 204.17 ms 1.57 s

Here's a table of three good versions, one of which is external.

Case #016 #020 #991
256-sym-alpha4-mut10 3.4 µs 3.5 µs 7.4 µs
256-sym-alpha4-mut30 3.4 µs 3.4 µs 12.6 µs
256-sym-alpha26-mut15 3.4 µs 3.3 µs 15.0 µs
256-sym-alpha256-mut15 3.3 µs 3.4 µs 59.3 µs
256-asym-128x256 1.9 µs 1.9 µs 9.0 µs
256-sym-alpha4-mut80 3.4 µs 3.4 µs 16.6 µs
1k-sym-alpha4-mut10 48.3 µs 48.9 µs 50.3 µs
1k-sym-alpha4-mut30 50.2 µs 48.7 µs 94.2 µs
1k-sym-alpha26-mut15 55.9 µs 54.9 µs 88.6 µs
1k-sym-alpha256-mut15 39.8 µs 37.8 µs 351.1 µs
1k-asym-512x1024 30.1 µs 30.1 µs 55.8 µs
1k-sym-alpha4-mut80 49.7 µs 49.7 µs 127.6 µs
4k-sym-alpha4-mut10 698.1 µs 687.1 µs 343.0 µs
4k-sym-alpha4-mut30 696.4 µs 688.2 µs 833.9 µs
4k-sym-alpha26-mut15 783.6 µs 763.0 µs 549.1 µs
4k-sym-alpha256-mut15 462.5 µs 432.8 µs 1.56 ms
4k-asym-2048x4096 409.2 µs 381.4 µs 546.3 µs
4k-sym-alpha4-mut80 759.4 µs 694.8 µs 1.39 ms
8k-sym-alpha4-mut10 2.99 ms 2.71 ms 918.1 µs
8k-sym-alpha4-mut30 2.96 ms 2.73 ms 2.77 ms
8k-sym-alpha26-mut15 3.10 ms 3.02 ms 1.62 ms
8k-sym-alpha256-mut15 1.72 ms 1.66 ms 3.47 ms
8k-asym-4096x8192 1.46 ms 1.49 ms 2.03 ms
8k-sym-alpha4-mut80 2.71 ms 2.72 ms 5.15 ms
16k-sym-alpha4-mut10 10.54 ms 10.83 ms 2.67 ms
16k-sym-alpha4-mut30 10.61 ms 10.70 ms 10.01 ms
16k-sym-alpha26-mut15 11.99 ms 11.97 ms 5.28 ms
16k-sym-alpha256-mut15 6.45 ms 6.52 ms 9.13 ms
16k-asym-8192x16384 5.91 ms 5.78 ms 7.34 ms
16k-sym-alpha4-mut80 10.51 ms 10.48 ms 18.68 ms
32k-sym-alpha4-mut10 41.42 ms 42.74 ms 9.61 ms
32k-sym-alpha4-mut30 42.15 ms 43.96 ms 37.91 ms
32k-sym-alpha26-mut15 49.18 ms 46.96 ms 17.65 ms
32k-sym-alpha256-mut15 24.20 ms 24.47 ms 26.09 ms
32k-asym-16384x32768 22.82 ms 23.11 ms 27.75 ms
32k-sym-alpha4-mut80 43.25 ms 40.56 ms 72.18 ms
Total Wins 8 18 10

Some of these differences are within benchmark noise, but the generated code is undeniably faster than the external #991 on certain input patterns. The flip side is also true.
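The benchmark names describe the input shape. A generator along those lines might look like the following; this is a hypothetical sketch of mine (gen_pair and its parameters are made up), since the actual LLM-written generator isn't shown here:

```c
#include <stdlib.h>

// Hypothetical input generator matching the naming scheme (e.g. alpha4,
// mut30): draw a random string over `alpha` symbols, then mutate roughly
// mut_pct percent of its positions to produce the second string.
void gen_pair(unsigned char *a, unsigned char *b, int n, int alpha, int mut_pct) {
  for (int i = 0; i < n; i++)
    a[i] = (unsigned char)('a' + rand() % alpha);
  for (int i = 0; i < n; i++) {
    b[i] = a[i];
    if (rand() % 100 < mut_pct) // mutate this position
      b[i] = (unsigned char)('a' + rand() % alpha);
  }
}
```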

The code is pretty bad. It's very typical "highly-optimized-and-butt-ugly" code. No abstractions are pulled out, apart from some functions that are specialized for smaller string lengths and marked to go in certain sections of the ELF. In other words, no care was taken in constructing a program that makes sense. But the LLM has no problem explaining what's going on, and with some external help it's very doable to sit down and work through it, just as you would with any other code you didn't write.

As it was running, I looked at the output every now and then and saw the agent run objdump on the target executable to confirm that certain regions were auto-vectorized properly. With the right tools, I think it could have done way better: what if it could easily see how instructions are scheduled on the execution ports to detect backend pressure? Such tools probably exist, but neither I nor the agent seemed to know them.

What's useful for humans is often useful for agents too. I hope this wave of LLMs will both lower the bar and raise the leverage for creating good tools that humans and agents alike can use to understand the systems we create.

Thanks for reading.


Footnotes

  1. Creating a benchmark isn't all that easy, because the shape of the inputs can have a huge impact on the performance of the system. As is the case in this benchmark!

  2. The properness of this solution depends on how comfortable you are with DP, I guess.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License