Searching High and fLow

October 19, 2023 back to posts

Recently¹ at work I found myself with an interesting problem in need of solving. The problem was one stage of a larger algorithm, and we wanted to be able to run the whole algorithm in an optimization loop, so it was important that it was fast.

The problem is this: given a set of points $P \subseteq\mathbb{R}^2$ and a set of sites $S\subseteq\mathbb{R}^2$, assign each point to a site so that:

The sum of the distances from each assigned point to its site is minimized
The number of points assigned to a site is below some limit $L$.

Without constraint (2) we can always choose the closest site to each point and be happy. This is also really fast, since we just need to compute the pairwise distances. Before, this is what the system did, but with the introduction of requirement (2), I had to come up with something else. Here is an example of how the optimal solutions look with and without constraint (2), for a sample set of points and sites:

Intuitively, it's clear what's going on here; the left site has to "give up" some of its points to the right site, so that while these points were closer to the left site, it frees up the site to also be assigned to the points on the far left. Computationally, however, it is not so straight forward to see how we can do this.

Looking up the problem

Or, "How I nearly got tricked into thinking this was NP-Hard".

When presented with a problem like this, the first thing I always do is to figure out what the "proper" name of the problem is, because it is very likely to have already been solved. Some quick searching lead me to a couple of problems:

Facility location problem

In the facility location problem you are given a set of potential facility sites $L$ and a set of demand points $D$, and the task is to choose which facilities to open so that the distances from each demand point to an open facility is minimized. This problem is NP-hard.

It's very similar to my problem -- we're given two sets of points and we're minimizing distances -- but it's not exactly the same, since my problem asked to optimize the choice of facility per demand point, with the capacity constraint. So I had to keep looking.

Vertex k-center problem

In the vertex k-center problem we're given a complete undirected graph $G=(V,E)$ and a cost function $c: E\to\mathbb{R}$, and we're asked to choose vertices $V_0\subseteq V$ to minimize the cost of the vertex farthest away from $V_0$: $$\text{minimize} \max_{v\in V}\min_{ w\in V_0} c(v, w)$$

The vertex k-center problem is also NP-hard.

Again, it looks related, but not quite right. We already know which of our points are in which class, so the "combinatorial choosing" aspect of this problem isn't a part of my problem. The search continued.

Assignment problem

Eventually, my search lead me to the assignment problem. Here we want to assign tasks to workers where each pair has a cost associated to it, and we seek to minimize the total cost. In graph theory terms (and if the number of tasks and workers is the same), this is finding a minimal matching of a certain size, of a weighted bipartite graph.

This too looks similar to what we want, but again, not quite, due to our constraint. Also, our number of points and sites are not the same, so the matching stuff doesn't apply. However, this lead me to Ulrich Bauer's master thesis, which table of contents include a section named "Minimum Cost Flow". Reading the section name was enough.

Minimum Cost Flow

Let's start with the more known, related problem: max-flow. In max-flow, you're given a flow network, and the task is to figure out how much flow you can send through the network. Imagine a network of water pipes through a city with a water basin (a source node) on one side of the city, and a pipe to the ocean (a sink node) on the other side. We want to figure out how much water we can send from the basin to the ocean.

In graph terms, it looks like this: we have a directed graph $G=(V,E)$ and two special nodes: the source $v_s$ and the sink $v_t$. All edges $e\in E$ have a capacity $e_c$. Now we assign a flow, a non-negative number $f(e)\in\mathbb{R}^+$, to the edges. In the analogy above, the flow correspond to the amount of water flowing through the pipe that is that edge. We also have some constraints:

The flow in an edge cannot exceed its capacity.
Nodes have to conserve the flow, so that the flow in the edges going in to the node has to be equal to the flow in the edges going out of that node.
Only the source $v_s$ is allowed to produce flow (meaning it can send out more than it got in).
Only the sink $v_t$ is allowed to consume flow (meaning it can take in more than it sends out).

We want to maximize the flow that the source node produces, which, due to the conservation of flow (req. (2)), is the same as what the sink consumes, while respecting the constraints.

That's max-flow. In minimum cost flow, we also have a cost $c(e)$ for each edge, per flow. Instead of finding the maximum amount of flow we can send through the network, we want to find the cheapest way of sending a certain amount of flow.

Going back to our problem

Water pipes and capacities can seem like a long way from points and sites in the plane; what's the connection? If we create a graph $G=(P\cup S, E)$ and imagine the sites $S$ to be on the left side, and points $P$ on the right side, we can draw edges between all pairs of sites and points to get a bipartite graph:

A bipartite graph with the sites on the left and points on the right.

Now we want to say that a flow going through an edge $(s,p)\in E$ means that site $s$ and point $p$ are connected. The cost of this edge should be the distance between the site and point, so that we'll reduce the total distance of the pairs that we end up assigning.

There's a couple more things we need to make sure that a minimum cost flow through our made-up network actually solves our original problem.

The network has to be a flow network
The sites respect their given limit $L$
Each point is only connected to one site

For (1) we can add in a source to the very left and a sink on the very right and connect them to the two groups, like so:

The leftmost node is the source node, and the rightmost the sink node.

For (2) we can set the capacity on the edge from the source to the site equal to $L$; this way, if we also set the capacity on the edges in the middle to 1, then each edge that is filled with flow will spend one capacity of the source-site edge. For (3), we can set the capacity of the edges from the points to the sink to be 1. This way we expect all the flow that comes in along the edge from the site to the point to go along this edge and into the sink. We still haven't assigned costs to the edges adjacent to the source and sinks, but we don't really care about which of these edges are used, so we can set them all to 0. We also know how much flow we want to send through the network, since all the points should be connected to a site, and each of these connections uses 1 flow.

Here's the final network, where I have set the site capacity $L$ to be 2. For readability, only one edge in each "layer"² is labeled, but the only difference from the other edges are the costs in the middle layer.

The middle gray edges will vary in cost. Otherwise, the capacities and costs are constant for each layer.

And here's one possible solution, in which edges with saturated with flow are black, and edges without flow are gray. Counting from the top with 1-indexing, the first site is connected to the first and fourth point, the second site to the third and sixth, and the third site to the second and fifth point. We see that all sites have maxed their capacity since they have two edges going out to the right, and all points are accounted for, since all of them have an edge to the sink.

Gray edges are not used, and black edges are saturated with flow.

A quick note before I continue, I've glossed over one point: what happens if the flows we get from solving the problem aren't integer? Could we get a bunch of 0.5 flows in the graph? For the approach I went for, the answer is no by construction (as we'll see). However, when writing this post I tried to prove that any optimal flow with fractional flows could be converted to an integer flow that was at least as cheap, irrespective of how this flow was found, but I couldn't quite figure out how. If you do know, my public inbox is open³.

Solving

Great, we now have a flow network in which we want to find a min-cost flow. This was a Rust codebase, so I searched through crates.io and found mcmf, a crate that wraps the LEMON library. LEMON was referenced in the Wikipedia page for minimum cost flow, so I figured it was a safe bet. I added it to my project, set up my graph, ran .mcmf() which ran in a fraction of a second, read out the results I wanted, and it all Just Work™ed.

However ...

LEMON is a C++ library, and I am compiling my Rust crate to wasm using wasm-pack. This works really well for Rust code, but not for a joint C++ code base; it seems the issue is with the compilation target. rustc has both wasm32-unknown-unknown and wasm32-unknown-emscripten listed as Tier 2 supported platforms, but rustc cannot, of course, compile C++ code. So we need a separate toolchain for C++. emscripten is a complete toolchain for compiling C++ to wasm32-unknown-emscripten, but wasm-pack compiles to wasm32-unknown-unknown. Doesn't sound like a big difference, right?

Wrong.

I'm still a little hazy about the details here, but it seems that the two targets are fundamentally different, and that it is not possible to compile for the two targets and somehow join them. Furthermore, it also seems that adding -emscripten as a target to wasm-pack is also a no-no. If there is a solution to this problem, I'd love to hear it! Please send a mail to my public inbox if you know.

I gave up on this path, and went down the other path: implementing it myself.

Implementation

Truth be told, I have never been very comfortable with implementing max-flow. It's not something I do very often, and there are a lot choices one has to make. Choice of algorithm: Standard Ford–Fulkerson with BFS (aka. Edmonds-Karp), try Dinic's, or finally try to understand Push-relabel? Ford-Fulkerson feels like a safe choice for the first version. How do you represent the graph? Everything on the heap? Vec<Node> for the nodes and have Node contain an adjacency list of indices for the edges? Where does the flow and capacities go? BFS through the graph sounds okay, but how do you represent a path? Won't there be a lot of them? Vec<usize> again for each path sounds like a lot of allocations, but maybe it's okay. Oh and by the way, this is all just for max-flow. How do we even solve min-cost-max-flow?

When bogged down with these uncertainties, the best way forward is to just do something, with the expectation that you're only trying something out. At this stage, the only important thing is building a theory of the problem.

First version

I decided that I didn't want to represent the graph completely as-is; the source and sink nodes could be implicit in the code⁴. The flow and capacities for the edges adjacent to the source and sink could also be handled separately, since we know what the graph looks like around these two vertices.

Further, I wasn't sure if the min-cost aspect of the problem was difficult, and decided on a greedy approach without making sure that the solutions produced were optimal. I wanted to write a loop in which we find the cheapest way of increasing the flow by 1, and do that $|P|$ times. This is just Ford-Fulkerson where you find the min-cost path, and I figured it was probably right, but I didn't sit down and prove it. By checking against mcmf later I would get a hunch for whether this really is optimal or not, but I didn't want to spend time figuring this out before seeing if the implementation was feasible⁵ .

For the vertices I made an enum with an index to identify the two different types

pub enum Node {
    Site(usize),
    Point(usize),
}

and for the edges I made a struct containing these indices, as well as the edge cost.

pub struct Edge {
    pub site: usize,
    pub point: usize,
    pub cost: F64,
}

We don't actually have to represent the edges with an adjacency list or anything like that, because we have a complete bipartite graph, so we already know what all the edges are.

Edge costs (I duplicated these for some reason) were precomputed and stored in a $|P|\times |S|$ matrix; Since all edges are there, the matrix is completely full. Capacities were handled in a slightly funny way; I made a Vec<usize> of length $|S|$ ($|P|$), where each entry corresponded to the capacity of the site (point) with the same index, respectively. For the cross-edges I made a Matrix<bool> of size $|S|\times |P|$ called edge_used where each entry corresponded to whether the edge was used or not, since these edges had unit capacity.

The funny looking F64 is a type alias

type F64 = float_ord::FloatOrd<f64>;

using float_ord so that we can order floats⁶.

A path through the network was also a struct listing the Nodes in the path, as well as the negative cost of the path, because the only priority queue in Rust's standard library is std::collections::BinaryHeap, which is a max-heap⁷.

pub struct Path {
    pub neg_cost: F64,
    pub edges: Vec<Node>,
}

Now we can write the function step on Path which extends the path by one move, to all possible new paths:

pub fn step(&self, cost: &Matrix<F64>, edge_used: &Matrix<bool>) -> Vec<Path> {
    let last = self.edges.last().unwrap();
    match last {
        Node::Site(si) => (0..cost.cols)
            .filter(|pi| !edge_used.get(*si, *pi))
            .flat_map(|pi| {
                if self.has_edge(*si, pi) {
                    return None;
                }
                let cost = cost.get(*si, pi);
                let mut edges = self.edges.clone();
                edges.push(Node::Point(pi));
                Some(Path {
                    neg_cost: FloatOrd(self.neg_cost.0 - cost.0),
                    edges,
                })
            })
            .collect(),
        Node::Point(pi) => (0..cost.rows)
            .filter(|si| *edge_used.get(*si, *pi))
            .flat_map(|si| {
                if self.has_edge(si, *pi) {
                    return None;
                }
                let cost = cost.get(si, *pi);
                let mut edges = self.edges.clone();
                edges.push(Node::Site(si));
                Some(Path {
                    neg_cost: FloatOrd(self.neg_cost.0 + cost.0),
                    edges,
                })
            })
            .collect(),
    }
}

Some notes on what's going on here:

When going from sites to points we only want to try edges that haven't already been used, since these would have no capacity left, so these are .filtered out.
has_edge checks that the edge is not already included in the path, in order to avoid looping.
Not sure why I used flat_map and returned Option instead of just another .filter.
When going from points back to sites, we can only go along edges that have been used, since we are effectively undoing the flow that goes along the edge. For this reason, we're adding to the negative cost.

These are the main mechanisms for computing the min-cost flow, apart from the actual flow algorithm itself. This was pretty simple now, but the implementation was somewhat noisy, so here's the pseudo code:

loop {
    create initial paths from sites that has capacity to all points
    put the paths in a max-heap

    while max-heap has elements {
        path = pop(max-heap)
        if path leads to a point that's not assigned yet {
            reduce site capacity by one
            set point capacity to zero (reduce by 1, but we know it is 1)
            mark edges as used
            restart main loop
        }
        children = expand the path
        add children to the heap
    }
}

This algorithm produced exactly the same pairings as mcmf did, but it was a lot slower. Not 2x or 5x, more like 1000x.

The program took over 20 seconds.

Second version

Why was the first attempt so slow? Here I had a few hypotheses straight off the bat, with some ideas for solutions:

Inefficient representations of the graph and paths. Many allocations.
- Try to add back links instead of storing whole paths?
Poor search through the graph; too many paths are expanded
- Maybe prune based on cost? Can we bound cost?
std::collections::BinaryHeap is slow; should try something else
- Probably something on crates.io?
- Write my own?
Only a single flow is added in each iteration; should figure out how to augment many paths at the same time.
- Doesn't Dinic's do this? Or was that Push-relabel?

I tried cost-pruning; it helped, but not by much. I tried to change Path::has_edge to just check for a site, since I was pretty sure I didn't have cycles of negative cost (if you have such a cycle, it will pay off to walk it, which means you'll visit a node twice, so the two checks aren't the same); it helped but not by much. I tried the priority-queue crate (which also easily supported making a min-heap), but that was even slower.

Eventually, I decided to specialize the search to my instance of the problem⁸. I didn't need to solve MCMF for any general graph, since I had very specific knowledge about the types of graph I would solve it on. When searching for a path from a site to the sink (which, again, we didn't include in the graph explicitly), we can do two operations: (1) go to an unassigned point and finish there, let's call this operation Connect; or (2) go to a point and follow the back-edge to a site, let's call this operation Route. Note here that Route is unique for a point, since there is at most one site connected to it.

Here's Connect, when standing at the filled-in site, and routing along the edges with arrows:

`Connect` corresponds to finding a path straight to the sink.

And here is Route:

Left: We find a shortest-path following a back-edge. Right: The resulting flow.

Instead of having long paths, each listing the vertices in the path, maybe it would help to have a path be a Vec<Move>:

type SiteId = usize; // I also aliased these, for later on
type PointId = usize;

pub enum Move {
    Connect(SiteId, PointId),
    Route(SiteId, PointId, SiteId),
}

Why would this help? Consider the path in the Route operation in the figure above. Had we done this in a full graph representation, we would have had six vertices in the path: the source, the site we're currently at, the point we're routing around, the site currently connected to that point, the cheapest point for that site to get to the sink, and the sink. For the Move representation however, we only need three IDs, namely two for which site we're talking about, and one for which point we're rerouting.

Further, we can imagine splitting up the list of points into two: the points that are already assigned to a site, and the ones that aren't: Connect only works for unassigned points, and Route only works for assigned points. Thus, when standing at a site, we have $|P|$ choices to make, since each point correspond to one Move. Before, we had $|P|$ choices to make (which point to visit next), and then for each choice we had $|S|+1$ choices to make (go back to any of the sites, or go to the sink). Most of these were quickly pruned, but I figured they might have added a lot of work for the search.

There was one more important insight to make: When we perform the move Route(a, p, b) , we disconnect the point p from site b and connect it to a, to free up capacity at b, by paying the cost difference of the new edge ap from the old edge pb, as well as spending as capacity. This is the only thing going on: we move one capacity from b to a by paying the difference in edge cost. Thus, for the rest of the search, it doesn't matter which t we choose. The only thing that matters is the edge cost difference.

This means that when considering different Route(a, t, b)s for different choices of t, we only need to look at the cheapest, because the net-result of performing the Route is the same for all Routes around these two sites. We can look at all possible ts, choose the cheapest, and continue the search with only that Route. This is a huge help, because for each pair of sites we don't have a combinatorial explosion of different Route operations. We only have one.

A few other small optimizations (another stab at a cost_limit to prune old paths, sorting the points for each site so that it should be easier to find cheap moves, move a Vec::clone down below an if .. { ... continue; }, and other small improvements) were small done after this. I got big speedup, but I was still not where I would have to be.

The program took around 2 seconds.

Third version

I felt like I was really getting somewhere with Move, but at the same time, the second version still felt too general. The insight about "only the best route matters" helped me find a new framing of the problem: when routing from a site, there is really only $(|S|-1)+1$ moves we can do:

Connect to the closest unpaired point and be done with the search (1 move).
Route the best route around any of the other sites and continue the search ($|S|-1$ moves).

I also knew that for my application, $|S|$ would always be less than 10, and 10 is a really small number. What does this buy us?

Forget the graph, and don't think about nodes and edges. We only need to find the cheapest sequence of these Moves. And for this, we need their cost.

I made a routing_table that was a $|S|\times|S|$ matrix. Entry $S_{ij}$ contained the cost of the best Route(i, p, j) over all points $p$, and the diagonal entries $S_{ii}$ contained the cost of the best Connect(i, p). In addition, the table contained the index of the point p, which works out nicely in both the Route and Connect case, since they both have one point. Here's the code to initialize the table:

fn initialize_routing_table(&mut self) {
    let num_sites = self.cost.rows;
    let num_pts = self.cost.cols;
    self.routing_table = Matrix::new(
        num_sites,
        num_sites,
        (FloatOrd(f64::INFINITY), PointId::MAX)
    );
    for a in 0..num_sites {
        for b in 0..num_sites {
            if a == b {
                continue;
            }

            let mut cand_cost = f64::INFINITY;
            let mut cand_ind = SiteId::MAX;

            for t in 0..num_pts {
                if !*self.edge_used.get(b, t) {
                    continue;
                }
                let route = Move::Route(a, t, b);
                let cost = self.move_cost(&route).0;
                if cost < cand_cost {
                    cand_cost = cost;
                    cand_ind = t;
                }
            }
            *self.routing_table.get_mut(a, b) = (FloatOrd(cand_cost), cand_ind);
        }
    }

    for s in 0..num_sites {
        if let Some((min_cost, t)) = (0..num_pts)
            .filter(|pi| self.tur_cap[*pi] == 1)
            .map(|pi| (*self.cost.get(s, pi), pi))
            .min()
        {
            *self.routing_table.get_mut(s, s) = (min_cost, t);
        }
    }
}

To make up some numbers, here's what the routing_table could look like, for a graph with 3 sites⁹:

$$ \begin{bmatrix} (32.08, t_3) & (32.79, t_1) & (12.01, t_7)\\ (28.62, t_4) & (41.41, t_2) & (24.88, t_4)\\ (14.19, t_5) & (21.31, t_9) & (15.89, t_6) \end{bmatrix} $$

Here we're saying that if we wanted to connect the first site to the nearest unpaired point ($t_3$), it would cost $32.08$ ($S_{1,1}$). However, re-routing a point from site 3 to site 1 around $t_5$ costs $14.19$ ($S_{3,1}$), and connecting site 3 to its nearest unpaired turbine ($t_6$) costs $15.89$ ($S_{3,3}$) for a total of $30.08$. If site 3 is already full, this is the shortest path.

I want to highlight that this table is the only information we need to perform the search. We don't need to know anything about the graph, or the points, or the sites. We don't even need to look at capacities, because the information they're giving us is already encoded in the table. Once we have this table, these 18 numbers is all that's required to find the cheapest way of increasing the network flow by 1.

This was already a pretty large leap from the last version, so I decided to be blunt in the next step, and pre-compute all possible paths. After all, we don't have that many of them. A list of SiteInds can be used as a path, where the first site is the start site, the intermediate sites are routed around using the precomputed points, and the last site goes to its closest unmatched point. Since the sites in this list have to be unique, we have at most $|S|!$ of them, which sounds bad, but since $|S|$ is at most 10, this is at most 3'628'800. If $|S|$ is a more reasonable¹⁰ 5, this is merely 120. Compute all paths, and choose the cheapest.

Each iteration of this main loop invalidates the routing_table, since the site-point assignment has changed. Since I had just implemented the table approach, I didn't want to also incrementally update only the parts of it that did change, so instead I recomputed the whole table before every iteration.

This caveman solution took 200ms.

Version 3.5

200ms is better, but this was still a very decent chunk of the total time of my program. Recall that this whole computation is just the first step of a bigger system. But with a few very naïve choices in the last solution I was confident that I could speed it up some more.

I stored a Search (my new name for Path) in a struct with a SmallVec listing the indices in the matrix corresponding to the move that made up the path:

struct Search {
    moves: SmallVec<[(SiteInd, SiteInd); 10]>,
    cost: F64,
}

There's a lot of duplicate data here, since moves is of the form [(a, b), (b, c), (c, d)], but maybe this was easier to use? I am not sure why I did it this way. Now it's just a matter of finding the shortest path through the routing_table.

If the entries in the routing_table are all non-negative, life is good, because we can use Dijkstra's algorithm to find the shortest path to a diagonal entry (which, recall, represents ending the flow path). We've gone full circle, and are back again at std::collections::BinaryHeap and searching through a graph (this time, $K_{|S|}$: the complete graph on $|S|$ vertices).

I initialized the queue with the legal subset¹¹ of the total $|S|^2$ initial moves, and started to pop. If I got a diagonal entry back, that's the path. If not, I expanded the path from the end of the path (search.moves.last().unwrap().1), and considered all other possible sites to extend to. Sites that were already in the path were filtered out. Here's the code:

while let Some(search) = searches.pop() {
    let mov = search.moves.last().unwrap();
    // Diagonal entry is the best; take it if possible.
    if mov.0 == mov.1 {
        if search.cost < candidate.cost {
            candidate = search;
            break;
        }
        continue;
    }

    // Non-diagonal entry; expand to the possible next moves in the sequence.
    let a = mov.1;
    for b in (0..num_sites).filter(|&b| !search.contains_site(b)) {
        let cost =
            FloatOrd(search.cost.0 + self.routing_table.get(a, b).0 .0);

        self.cost.partial_cmp(&other.cost).map(|o| o.reverse())
        let mut moves = search.moves.clone();
        moves.push((a, b));
        let s = Search { moves, cost };
        searches.push(s);
    }
}

I hinted at this above, but note that we don't have to check for site capacities in this loop. Direct connections are only in the table if they're valid (otherwise they're $\infty$), and Route operations don't need the target site to have capacity, since we're freeing up capacity at the source site. Since the initial paths are also only to sites with capacity, we don't have to check for capacities at all.

The case with negative entries in routing_table calls for a different strategy, since now you can suddenly produce cheaper paths by continuing to route around other sites. Instead of implementing a "real" search algorithm, I bounded the cost savings possible, and used this to prune paths that were so expensive that there would be no way for the negative entries to make up for it.

I did this in a very loose way: if $\text{m}$ is the smallest entry in routing_table, then that's the most savings we can get from extending a path of length $l$ to $l+1$. Since we also know the max length of a path, $|S|$, the best possible cost decrease of a started path of length $l$ is $m(|S|-l)$. This is not at all tight¹², but it's really easy. Here's what it looked like:

while let Some(search) = searches.pop() {
    let mov = search.moves.last().unwrap();
    // Diagonal entry is the best; take it if possible.
    if mov.0 == mov.1 {
        if search.cost < candidate.cost {
            candidate = search;
        }
        continue;
    }

    // Non-diagonal entry; expand to the possible next moves in the sequence.
    let a = mov.1;
    for b in (0..num_sites).filter(|&b| !search.contains_site(b)) {
        let cost =
            FloatOrd(search.cost.0 + self.routing_table.get(a, b).0 .0);
        let best_future_cost = cost.0
            + (num_subs - (search.moves.len() as SiteInd + 1)) as f64 * min_table_entry;

        if candidate.cost.0 < best_future_cost {
            continue;
        }

        let mut moves = search.moves.clone();
        moves.push((a, b));
        let s = Search { moves, cost };
        searches.push(s);
    }
}

Along the way I also pulled the trigger and changed SiteInd and PointInd to be u8 and u16 respectively, which, surprisingly, sped up the code by 30% (!). I continued to recompute the routing_table from scratch in between every single call to route.

Now the program took 4ms, and I declared it Good Enough.

The "Right" Solution

I had a lot of fun with this problem. It's both fun and rewarding to iterate on a problem and seeing the time required to solve it going from "get a coffee", to "impatiently wait" to "wait" to "quick, if you run it once" to "fast" to "can be called in a loop by another program". It's also fun when this isn't just an exercise in how good it can get, but actually a part of what you're really trying to do¹³.

But mostly, it was fun because the solution feels right. I have a theory that most problems¹⁴ we programmers are dealing with are pretty simple, when viewed from the right angle. This angle is often hard to find, but once you have found it, things just seem to "work out"¹⁵, in terms of complexity, number of bugs, maintainability, debugability, all of these axes.

My final version is around 10'000 times faster than my initial version. When presented with such a huge difference without having any context, it is very easy to jump to conclusions. For instance, one might attribute this to:

Language; it was written in slowlang first, and then ported to fastlang.
Lack of optimizations; ran without --release, -O2, or similar.
Algorithmic improvements; change a naïve algorithm to a high performing one.
A full team of experts worked on it for months, creating an engineering jewel that normal programmers simply can't match.
Hyper-optimized code; inline assembly, unsafe everywhere, PGO, lot's of impossible to read code, impossible to debug, probably requires a blood sacrifice.

In this case though, none of the above is true¹⁶.

Language was the same.
Optimization levels were the same.
I would argue that the algorithm is still the same --- we're still solving min-cost-max-flow with successive shortest paths --- but since we are making assumptions about the input I can see a claim for the algorithm being different.
Full team for months is also off the mark; I'm no expert, and certainly not a whole team of them. Further, this whole process, from initial test with mcmf to final code written, took slightly longer than two working days. The first commit was around 15:00 on Wednesday¹⁷, and the last commit was 16:15 on Friday (plus a small bugfix on Monday morning).
Most importantly though, the last one is not true. There's no inline assembly, no unsafe, no special tooling, no architecture specific code, and no "every-trick-in-the-book-pulled" code. Quite the opposite: there's very little code. The whole module (excluding the Matrix struct) is 212 lines of code, as reported by tokei.

So how come we got a 10'000x speedup? I think it's the all due to the updated framing of what we're really trying to do. "The problem" was never about flow through a graph. This is a made up mental framework for us to work in, so that we can use general techniques to specific problems.

Shoutout to LEMON

I did compare my solution to LEMON when I was getting below 100ms. I had given up trying to integrate it, but it was still a very useful benchmark. For the "largest reasonable" input I was testing with, LEMON was still faster than my final version. LEMON, of course, solves the problem in its general form, and as such, is a way better implementation than mine. However, LEMON do "pull-many-tricks-in-the-book"¹⁸, and are written by people who have extensive experience with max-flow-min-cost, so I didn't feel so bad.

I have stepped through their source code, and started reading some of their references to better understand how they achieve what they have; most notably this experimental study. The "Network Simplex Method" seems to be a key term, but I haven't understood this yet.

There's still a hope, of course, that if I only partially invalidate my routing_table instead of recomputing it at every iteration, and store the visited sites as a bitfield in the Search struct, I'll be faster. If the requirements for my system drastically changes, maybe I'll get to find out.

As always, any input, shorter paths, or excess flow, can be sent to my public inbox.

Thanks for reading.

"Hunting High and Low" by a-ha. Other title candidates include "one Flow over the cuckoo's nest" and "Even Flow". ↩
I tried to avoid naming that could be associated to neural networks, but here I came short. ↩
Something something subgraph induced by fractional flow edges, and cancel loops? Something no negative cycles in residual graph (assumption by optimality)? ↩
This is a pattern I feel like I keep seeing when looking at "good" code; algorithms and data structures are very often used in an "abstract" sense, as opposed to directly implemented in the code. ↩
This is a chicken-and-egg problem, because you need both of these. You both want to know what guarantees a method provides (in this case, optimality of the solution), but you also want to make sure that the method is implementable and has the right characteristics (performance, usability, maintainability) for what you're trying to do. Looking back it seemed risky that I started writing code without knowing if what I was trying to do would even lead me to a correct solution, but on the other hand, sitting down trying to prove the correctness of an hypothetical implementation. ↩
While I understand why f{32,64} aren't Ord, this is so annoying to always have to go around that I can't imagine this being the best choice. I wish the ordering was consistently defined to be NaN at either end of the ordering. Maybe there are hairly details I'm not thinking about though. ↩
One alternative to doing this is to implement PartialOrd and Ord yourself, and .reverse the ordering there. I ended up doing this later. ↩
More often than not, we're not dealing with the full-general version of these computational problems, and sometimes there's significant savings when we only solve the actual problem we have. ↩
The numbers here are completely made up, and I did not spend any time checking if they made sense. ↩
Again, this was domain specific knowledge I had about the problem that I was trying to solve. ↩
Each entry in routing_table corresponds to one initial move. Off-diagonal entry $S_{i,j}$ means go from the source to site $i$ and route to site $j$ around whichever point was the best; this is legal iff site $i$ has capacity for another path. Diagonal entry $S_{i,i}$ is legal iff site $i$ has capacity. ↩
You can't use the edge of cost $m$ more than once, since one site can only appear in a path once. However, there could be multiple entries in the table of cost $m$. You could get a tighter bound by looking at the $|S|$ cheapest entries, but this would also not be very tight, depending on which sites your path already contains. It gets complicated. ↩
A business requirement if you will. ↩
"problem" in this specific CS-y narrow sense. The world is big and complicated, and contains plenty of hard problems. ↩
Sometimes this manifests really clearly. I will have tried to write something --- an API, a function, a class, a library, anything --- but it's awkward to use right, often has off-by-one errors, weird bugs, and somehow things are never in the right place. Then, I write it again, changing for instance what I store in some state, and this time, everything just falls out naturally. Off by one errors suddenly can't exist any more, things are always conveniently where they need to be, and everything runs smoothly. ↩
This is not to say that these aren't the reasons for similar speedups in other circumstances; these are often the culprits. But sometimes, there just exist code that is orders of magnitude better. ↩
This was the time of the first commit, but I don't remember when on Wednesday I started this. I also wasn't exclusively working on this during these days, so it's hard to get a time estimate with an hour granularity. ↩
I wanted to write "pull-every-trick", but this simply isn't true. They do, however, pull some tricks. ↩

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License