Posts on mht.wtf

Another Static Site

2024-04-02T22:51:56+01:00

In January 2016 I moved this website from Jekyll to Hugo.The motivation was to make deployments easier, since Hugo was a single static binary, and Jekyll is not. Over the years, I did a minimal amount of work on the website itself, and as Hugo kept changing, warnings kept piling up every time I, for whatever reason, updated it. In addition, the few times I did want to change something to the site, I inevitabely got lost in the Hugo docs. It became increasingly clear that Hugo was not made for my use-case, and so I wanted to migrate off of it.

This easter I decided to bite the bullet and try something else. I spent an afternoon trying to set up Cobalt, followed by Zola, but they both felt too complex for me. So instead, I decided to write my own, and after a few days I have it all set up:

Markdown parsing with markdown-rs
Templating with Tera
Atom feed generation with atom
Syntax highlighting with Prism
Pretty math with KaTeX
Rewrote most of the CSS, and added dark mode

I tried to keep things simple, and I'm pretty happy with the current state. The code is in one file and is around 450 lines of code. It reads a directory structure like this:

mht.wtf/
├── pages # Markdown files that are templated.  Directory stucture is kept.
│  ├── index.md # This turns into https://mht.wtf/index.html
│  ├── painting
│  │  └── index.md # ... and this to https://mht.wtf/painting/index.html
│  └── post
│     ├── index.md # This is the page with the list of blog posts
│     ├── flow
│     │  └── index.md # https://mht.wtf/post/flow/
│     └── static-site
│        └── index.md # ... and so on
├── publish.sh # Convenience script to build and `fsync` to the server.
├── README.md
├── static # These files are copied to the output folder.
│  ├── iosevka.css
│  ├── post
│  │  └── flow
│  │     ├── bipartite.svg
│  │     ├── flow-graph.svg
│  │     ├── route-connect.svg
│  │     └── route.svg
│  └── style.css
└── templates # Tera templates referenced by the files in pages/
   ├── blog-post.html
   ├── blog.html
   ├── cc-by-sa.html
   └── index.html

The pages directory contains all files that will be transformed to html files in exactly the same directory structure. For every markdown file, the template to use is specified in the front matter. The static directory contains files that should be copied as-is, like css, fonts, or svgs and other assets for specific pages. For instance, the blog post flow is located at pages/post/flow/index.md and its pictures are e.g. at static/post/flow/bipartite.svg.

Javascript

There are two sources to Javascript in these blog posts: syntax highlighting and math typesetting.

In Hugo, I had to manually mark blog posts as mathy so that I could include MathJax in the <head> of the template. Initially I ported over the same system here, but I realized that that's only busywork when I have written the generator myself. Now I look for a $ in the Markdown text, and if katex is not explicitly set in the front matter, I set it to true. This way I don't need to specify anywhere that I am using for math, I can just use it. Blog posts that don't use it doesn't include it, and for false positives, katex = false will opt-out.

I do the same with Prism; if I have a code block with a language specified, like ```rust I include Prism, unless prism = false is in the front matter.

Templating

I wanted to be able to write markdown and produce HTML, and so one way or another I needed a way of specifying what that HTML should look like. Templates seemed like the least complicated but still powerful enough solution for this. I am not using anything fancy with templates though, it's pretty much accessing fields from the front matter (e.g. katex or date), and formatting the date.

There was one catch however, namely listing the blog posts.

My plan was to read in the directory structure and pass that to the template, but this made it difficult to write out the template, because

The unique identifier (static-site for this post, flow for the Flow post) is in the directory name, and not the front matter.
I wanted to sort the posts based on a date, which is in the front matter.
index.md should be skipped when on the first level.

Tera, like most templaing languages, isn't a joy to use, and so simple data transformations like this turned out to be difficult. However, it has a nice escape hatch in which you can write a Rust function and call register_function to make it callable in the template. That way you can do whatever transformation you want in Rust instead. Convenient, if not pretty.

Others

Instead of writing an HTTP server to serve the files when writing and reloading when any of the files change, I used python -m http.server and watchexec. Maybe there's a nice "hot-reload simple serve http server" out there that would do both for me, but this setup was very low friction. I have to reload the page myself though, but since I mainly write Markdown anyways there's no real reason to have the page update live.

Rewriting the page was also a good excuse to have another look at the CSS, and with it, some nice positioning for <aside> elements, when space allows for it. These are the gray margin notes you can see above. They are positioned with CSS grid, using named columns, and with a @media query for narrow screens to place it back in the regular flow. <code> is also highlighted almost like in my editor now, with mostly white on dark, not too many colors, and bright yellow comments. I'm still not 100% happy with the spacing around certain elements, but it's okay.

Efficient Simulation Through Linear Algebra

2022-08-12T17:18:54+02:00

I spent a lot of time working on a project in which physically plausible simulation of soft materials with pressure chambers was a key part,and in doing so, we managed to improve a part of our simulation by a significant amount. I was very happy with how this small part of the whole system turned out, and I've been wanting to share it for a while.

A fair warning though, we need to spend a little time setting up the context in order to see why this is a thing that can happen very naturally, as opposed to a magic algebraic trick that we can pull out of a hat.

If you haven't seen physically based simulations before, don't worry too much about the details. It helps if we can get on the same page regarding why we are even here in the first place, but the details of the context really doesn't matter for the point I'm trying to get across.

If you have seen physically based simulations before, also don't worry too much about the details. There isn't anything fancy going on here; no second order elements, no fancy time stepping, no dynamics, basically nothing that hasn't been around for 20 years¹. The trick is somewhat fancy though, to me anyways.

Finite-Elements

We wanted to simulate the behavior of a soft material with a certain geometry when we inject pressurized air into it. A simple way of doing so is by representing the geometry of the material with a tetrahedral mesh, and defining an energy that is a function of the deformation of those tetrahedra, or "tets". The nodal positions of the mesh are our "degrees of freedom": they are what we can move around, and the energy of the system is a function of those positions. You can imagine an energy function for each tet similar to²:

fn energy(nodes: [Node; 4]) -> f64 { ... }

If you are given any nodal positions, you can compute an energy from it. For instance, if a tet was supposed to have 1 volume, but it is stretched out to have 2 volume, it would make sense that it has a lot of energy, which it can "use"³ to return to it's preferred ("rest") position, of having 1 volume again. The energy function⁴ defines exactly how much the tet would want to return to some other configuration when deformed. By summing the energies for all the tets in the system, we get the total energy of the whole system.

You can also have other energies that adds into the whole system. Since we are dealing with a pneumatic system, we assign pressure forces to the faces of our mesh that are adjacent to the pressure chamber, such that the forces are proportional to the face area and the pressure. If we know how much gas is in the chamber (this is one our our degrees of freedom), and we know the volume of the chamber, we can compute the pressure using the ideal gas law⁵.

Finding Equilibrium

Having only this energy function, we can compute the forces that act on the nodes in our system as the direction in which they would have to move to decrease that energy. In other words, we let $$f = -\frac{\partial E}{\partial x}.$$ Note the minus sign: the gradient of a function is the direction in which it increases the most, and we would like it to decrease. This is also where notation gets a little messy: the $x$ above represents the positions of all the nodes, so it is really a vector in $\mathbb R^{3n}$ for a 3 dimensional system of $n$ nodes.

For a single tet, we have 12 numbers, namely the $x$, $y$, and $z$ coordinate of the four vertices. We can pretend that the energy function above reads

fn energy(nodes: [f64; 12]) -> f64 { ... }

With this, we see that $f_t$, the forces on a single tet, is also a vector of 12 numbers, which corresponds to the forces on the respective nodes in their respective coordinate, whichever way we flattened⁶ it in the first place⁷. $$f_t \in\mathbb R^{12}$$

We can use this information to move the nodes in our system in order to decrease the global energy of the whole system: loop over all tets, compute the forces from that tet to its four nodes, sum up the forces on all the nodes into one big vector $f\in \mathbb R^{3n}$, and move the vertices some amount $\eta > 0$ in this direction: $$x^{(i+1)} = x^{(i)} + \eta f.$$ This is called gradient descent, and it's not so great, at least not for these kinds of systems, because it takes a long time before it finds equilibrium. When $f=0$ we have reached equilibrium, and we're at rest.

Newton's Method

To improve convergence⁸ we can compute yet another derivative, namely $$ \frac{\partial^2 E}{\partial x \partial x} = \frac{\partial f}{\partial x},\qquad \frac{\partial f_t}{\partial x_t}\in\mathbb R^{12\times 12}$$ Now we've got $12 \times 12 = 144$ numbers, for each tet! Similarly to what we did above, we can combine all of these smaller matrices to one giant matrix⁹ that we'll call the Hessian $H\in\mathbb R^{3n \times 3n}$, and perform Newtons's method.

What we want to do with $H$ is find a direction $d$ such that $Hd = -f$ and then set our new node positions to be¹⁰ $$x^{(i+1)}=x^{(i)}+\eta d$$ Don't panic if this jumps out of nowhere, because it kind of does. Roughly speaking, what this means is that we pretend that our energy function is quadratic, because then this update will make us go straight to the minimum point, which in our case is force equilibrium. If the function is not quadratic (and it probably isn't), then we hope that we'll get closer, and indeed, as long as we start "sufficiently close" to the minima, we will.

Linear Systems

How do we "solve" $Hd = -f$ when we know $H$ and $f$? This is what we call a "linear system of equations", and is a workhorse of scientific computation, geometry processing, computer graphics, and many related fields. It is often written as the equation

$$Ax = b$$

or, if we choose dimensions of the variables (I chose 6 here) and write everything out explicitly:

$$ \begin{pmatrix} a_{1,1} & a_{1,2} & a_{1,3} & a_{1,4} & a_{1,5} & a_{1,6}\\ a_{2,1} & a_{2,2} & a_{2,3} & a_{2,4} & a_{2,5} & a_{2,6}\\ a_{3,1} & a_{3,2} & a_{3,3} & a_{3,4} & a_{3,5} & a_{3,6}\\ a_{4,1} & a_{4,2} & a_{4,3} & a_{4,4} & a_{4,5} & a_{4,6}\\ a_{5,1} & a_{5,2} & a_{5,3} & a_{5,4} & a_{5,5} & a_{5,6}\\ a_{6,1} & a_{6,2} & a_{6,3} & a_{6,4} & a_{6,5} & a_{6,6} \end{pmatrix} \begin{pmatrix} x_1\\ x_2\\ x_3\\ x_4\\ x_5\\ x_6 \end{pmatrix} =\begin{pmatrix} b_1\\ b_2\\ b_3\\ b_4\\ b_5\\ b_6\end{pmatrix} $$

The operation we want to do is find the $x$ given $A$ and $b$. That is, which $x$ (if any!) should I multiply $A$ with to get $b$? Algebraically, we can simply write $$x = A^{-1}b,$$ but this is very rarely done in practice because computing the inverse of a matrix is rather expensive¹¹. People have figured out that there are ways of finding $x$ without computing $A^{-1}$ explicitly, and it is this we mean by a linear solve.

For instance, in Julia we can use the \ operator for linear solves. Observe:

julia> A = rand(6,6) # Get a random 6x6 matrix (and hope it is full rank)
6×6 Matrix{Float64}:
 0.610793     0.0588659  0.90725   0.723158  0.480303   0.00631715
 0.10528      0.229984   0.536642  0.91345   0.650178   0.237762
 0.600606     0.24921    0.349393  0.626754  0.0971094  0.771216
 0.536192     0.0458314  0.541457  0.556307  0.132692   0.55307
 0.936709     0.215612   0.284619  0.304965  0.926599   0.719019
 0.000957923  0.852531   0.290136  0.151528  0.129307   0.0528658

julia> b = rand(6) # Get a random b
6-element Vector{Float64}:
 0.7716876359155332
 0.4285009788970344
 0.8110655185850537
 0.19638254649350662
 0.6621420580446692
 0.06633609289427767

julia> x = A \ b
6-element Vector{Float64}:
  2.77721146947569
  0.7894422416781481
 -2.7841498287174837
  2.5819747913641087
 -0.6223503841138821
 -2.1248845687489477

julia> A * x - b # If  Ax = b  then  Ax-b = 0
6-element Vector{Float64}:
 -1.1102230246251565e-16
  2.220446049250313e-16
  0.0
  5.551115123125783e-16
  1.1102230246251565e-16
  5.551115123125783e-17

There are many things to be said about solving linear systems, but there's only one more thing we'll need to know here: sparsity.

Solving Sparse Linear Systems

The picture below is the Hessian matrix $H$ of a one of these finite elements systems. The pixel at position i,j correspond to $H_{ij}$, and it is color coded so that blue means negative, red means positive, and gray is zero.

The noteworthy thing about this picture is the amount of gray: most pixels are gray. Since the Hessian quantifies how sensitive the forces on our nodes are to the position of the nodes themselves, this makes sense. Moving around a node on one side of the mesh does not change anything about the forces on the other side. That is, unless those nodes both are on the pressure boundary: in this case the volume is changed ever so slightly, which in turn changes the pressure, which in changes the forces on all of the nodes that are on the pressure boundary. These nodes correspond to the block we are seeing in the upper left corner or the picture¹².

Recall from above that there are a bunch of methods for solving these systems, but, perhaps obviously, any one of these methods will for sure need to look at each element in the matrix. If there are many elements in the matrix, there will be a lot of work; you can think of this as $O(n^2)$¹³ where $n$ is the number of degrees of freedom we have (the number of rows and columns in $H$). On the other hand, if most of the elements in $H$ are zero, we can store the matrix in a sparse format, so that any algorithm working on $H$ does not have to iterate over a whole lot of zeroes. It will still need to look at each non-zero number, but if we only have a constant number $c$ of entries in each row (or column), we only have a total of $O(cn)$ entries in total.

The problem, of course, is that the matrix in the picture above isn't really sparse, since it has this giant block of roughly $\frac{1}{4}n^2$ numbers in it.

... or is it?

Property vs. Representation

This brings us to the key of post. It certainly looks like the matrix is dense, and in general, there is no way of making a dense matrix sparse, since there is simply more information in a dense matrix. But maybe there is a lot of duplicate information in our matrix? To show what I mean, consider the matrix $$A = uv^\top\qquad\text{or equivalently }\qquad A_{i,j} = u_iv_j$$ or for some concrete numbers, consider this:

julia> u, v = rand(6), rand(6);

julia> u
6-element Vector{Float64}:
 0.17648645508411875
 0.9501460722894218
 0.7570256767954698
 0.9097476055645976
 0.7514042466862265
 0.2594892833200104

julia> v
6-element Vector{Float64}:
 0.9880351017724492
 0.7271356154478763
 0.29724548913210114
 0.7470357266014565
 0.8131233770317735
 0.26312703421677464

julia> u * v' # v' is Julia's way of transposing
6×6 Matrix{Float64}:
 0.174375  0.12833   0.0524598  0.131842  0.143505  0.0464384
 0.938778  0.690885  0.282427   0.709793  0.772586  0.250009
 0.747968  0.55046   0.225022   0.565525  0.615555  0.199194
 0.898863  0.66151   0.270418   0.679614  0.739737  0.239379
 0.742414  0.546373  0.223352   0.561326  0.610984  0.197715
 0.256385  0.188684  0.077132   0.193848  0.210997  0.0682786

The matrix is a "full" matrix of 36 numbers, but they all come from only 12 numbers¹⁴. In a sense, the matrix should be sparse, because it's only 12 numbers, but its representation is not sparse. If we can rewrite our $H$ above into a form that looks like this, maybe there's hope for speeding up the solves.

The way we compute pressure forces on the faces of the tets is first to compute the volume of the air chamber, compute the pressure using the ideal gas law, and apply the pressure on each face so that the force is proportional to both the pressure and the face area, and in the direction of the inward normal of the face. Roughly, following the notation I've used already, it looks like this¹⁵: $$f_p = p(x) n(x)$$ where both the pressure $p$ and the area scaled normal vector $n$ is a function of the node positions $x$. When we compute the Hessian entries $H_p$ for only the pressure forces, we use the product rule to get $$H_p=\frac{\partial f_p}{\partial x} = \frac{\partial p}{\partial x}(x)n(x) + p(x)\frac{\partial n}{\partial x}(x).$$ Writing it all out like this is useful since we can pinpoint exactly where in the formulas the density problem comes from. The term $\partial p /\partial x$ is dense, since it depends on the volume of the air chamber, and all nodes along the boundary of this chamber influences the volume if they move¹⁶ . But do note here that, similarly to the toy example above, we really only have $3n$ numbers in ${\partial p}/{\partial x}$, since for given $x$, $p(x)$ is only a single number --- the pressure --- so $\frac{\partial p}{\partial x} \in \mathbb R^{3n}$ (because $x$ is $3n$ numbers). Somehow this is expanded to $O(n^2)$ numbers in the process of assembly.

In fact, if we write $u=\partial p/\partial x$ and $v=n(x)$ then the first summand is just $uv^\top$. We use this to rewrite the computation of $H$ by first doing the pressure computation separately, and then the rest of $H$: $$H = H_p + H_r$$ ($r$ for rest) and then write the pressure terms as $$H_p = uv^\top + p(x)\frac{\partial n}{\partial x}(x)$$ and at last, we write the whole Hessian in a slightly more readable form as $$H = H_s + uv^\top,\qquad H_s = H_r + p(x)\frac{\partial n}{\partial x}(x)$$ This system is still as dense as before if we multiply out $uv^\top$ and add it all together, but we're not going to do that.

Solving The New System

Before we had the system $Hd = -f$ which we wanted to solve for $d$. Now our new system is the slightly less nice $$(H_s + uv^\top) d = -f$$ and it doesn't seem like we've made much progress.

What helps us is the Sherman-Morrison formula, which tells us how to invert a matrix of type $A + uv^\top$; see this and this post on solving these systems. The closed form solution includes inverting $A$ itself ($H_s$ in our case), but we can avoid computing this explicitly because we are not looking for the inverse of the matrix we have, we just want to solve the linear system.

For matrices that are easy to invert, the formula is useful for us; in particular, we choose $A=I$, and write out the inverse explicitly: $${\left(I + uv^T\right)}^{-1} = I - \frac{uv^T}{1 + u^Tv}.$$ Again, this does not help us directly yet, because in our case we have $H_s$ as the matrix inside the parenthesis, and not $I$. We will need to somehow massage it out.

The first step is to take our system $$(H_s + uv^\top)d = -f$$ and algebraically multiply in $H_s^{-1}$ from the left so that we get $$(I + H_s^{-1}uv^\top)d = -H_s^{-1}f.$$ Let's call $H_s^{-1}u=w$, or in other words, $H_sw = u$. Since $H_s$ is sparse we can easily solve for $w$, and insert this back into the equation: $$(I + wv^\top)d = -H_s^{-1}f.$$ Now we introduce a new variable, just to make this step easier: let $c = (I + wv^\top)d$. We haven't found $c$ yet, and we still don't know $d$, this too is just algebra. We are left with $$c = -H_s^{-1}f$$ or $$H_s c = -f$$ in which only $c$ is unknown. $H_s$ is still sparse, so we can solve for $c$. At last, we look at the definition of $c$ that we came up with. We have all quantities¹⁷ except for $d$: $$(I + wv^\top) d = c$$ and we already have a analytical inverse for this matrix, thanks to Sherman-Morrison. By inserting the inverse on the right and multiplying out (notice that we don't even have to construct the matrix that is the SM inverse!) we get: $$\begin{align} d &= {\left(I + wv^\top \right)}^{-1} c \\ &= (I - \frac{wv^\top }{1 + w^\top v}) c\\ &= c - \frac{w(v^\top c)}{1 + w^\top v} \end{align}$$ which is just two dot product, a scalar-vector multiply, and a vector-vector subtraction.

That's quite a mouthful, but in the end we have only solved two sparse linear system with the same matrix $H_s$, and done a few dot products at the end. We avoided the dense solve, and in fact, we avoided even constructing a new matrix.

The fact that we used the same matrix on both of the linear solves is also really important: linear solvers usually factorize the matrix in some way or another before they solve the system, for instance into an LU, LDLT, or QR factorization. When we have the factorization we can very easily solve the system, and so by solving multiple linear systems with the same matrix (and different $b$s) we only need to factorize once, so the second solve is really fast.

Quick Micro benchmark

What does this really give us? Instead of making a proper comparison from the simulation code base, I decided to hack together a small Julia program to illustrate. Here is the measured data of solving what basically amounts to the linear system above.

|$n$|slow|fast|speedup| |--:|--|--|:-:| |500 | 0.01742 | 0.01053 | 1.65 |1000 | 0.03732 | 0.01933 | 1.93 |2000 | 0.12346 | 0.06351 | 1.94 |5000 | 0.96995 | 0.45724 | 2.12 |10000 | 6.12298 | 2.22139 | 2.75 |30000 | 131.273 | 39.2934 | 3.34

The data is generated from the following Julia code:

using LinearAlgebra
using SparseArrays

mod1p(n, m) =  ((n - 1) % m) + 1

# Compute a random sparse matrix in which each column has at most `k` entries
function randomsparse(n, k)
    A = zeros(n, n)
    for i=1:n
        ixs = rand(UInt32, k) .|> a->mod1p(a, n)
        nums = rand(k)
        A[ixs,i] = nums
    end
    sparse(A + I) # ensure we get a full rank
end
    
function doit(n)
    A = randomsparse(n, 5)
    u = rand(n)
    v = rand(n)
    b = rand(n)

    @time(begin # slow path
        slow = A + u * v'
        factor = factorize(slow)
        x = factor \ b
    end);

    @time(begin # fast path
        factor = factorize(A)
        w = factor \ u
        c = factor \ b
        x = c - w * dot(v, c) / (1 + dot(w, v))
    end);
end

The code for the fast path is a little more complicated than the straight-forward slow path, but overall, not by a lot. And the speedup we're getting is well worth it.

Conclusion

One of the reasons for why I really like this solution is that it's such a good example of good things happening because we looked closely at our problem. We already knew that linear solves would be the majority of the time spent in our pipeline. We also knew that sparse solves are quicker than dense solves. We also knew that our system felt dense due to the dependence of all the nodes along the air chamber boundary. Despite all of this, we managed to massage the problem we had from one dense solve into two sparse solves, and we got a significant speedup out of it.

This wouldn't have happened if we were content with the fact that "Linear solves takes up the majority of time in Newton's algorithm" (which is true; the linear solve is the bottleneck).

This wouldn't have happened it we looked at the Hessian and concluded that "The system is dense, therefore the solve will be slow" (which is true; dense systems are slower to solve).

Sometimes there are better solutions, but they require that we look closely at the problem at hand. Without looking closely in the first place, we wouldn't even have known that better solutions could exist.

Even though this example was full of math I really think the general sentiment translates well into programming, or completely different aspects of life. It is really hard to tell the difference between how something appears and how it really is¹⁸. I can't illustrate this with an example from your life, but I hope that having made the distinction here, you might come up with one.

Comments, questions, pointers, and prefactorized matrices, can be sent to my public inbox (plain text email only).

Thanks for reading.

I think, at least! ↩
If you're wondering why I'm using Rust syntax here, when the only real code in this post is Julia code, then you'll have an unanswered question. ↩
If you're accusing me of anthropomorphizing here, I'm guilty as charged. ↩
Often, this energy is a function of the deformation gradient $F$, and not the nodal positions directly. $F$ is the matrix that transforms the shape of the tetrahedron from its initial shape to its deformed shape¹⁹. If nothing has happened, $F$ is the identity matrix, if the tet is only rotated, $F$ would be a rotation matrix, and so on. ↩
We're assuming here that the temperature is constant. ↩
"Flattening" is a fairly common practice when we don't want to deal with tensors in our derivatives; if we have a matrix valued function differentiated with respect to a matrix, we get a 4th order tensor, which is different to deal with algebraically than what we might be used to. A kind of hack to avoid this is to let the positions of all the nodes not be a matrix of size $\mathbb R^{3\times n}$ but a vector of size $\mathbb R^{3n}$ instead. As long as we're willing to put up with the change of indices from the flattened to un-flattened configurations, we're fine. ↩
We basically have two options: xyzxyzxyz... or xxx...yyy...zzz.... ↩
Roughly, how fast we go from a configuration to our goal; in this case a rest configuration. If we keep getting closer and closer, but the amount by which we're getting closer and closer also shrinks proportionally we have "linear" convergence, which is not great. ↩
This operation is often called "assembly". ↩
The step size $\eta$ in Newton's method is kinda optional, in the sense that it should converge to $1$, but intermediate steps might not be $1$, for instance if taking a full step will cause some elements to invert. Some energies are not defined for inverted elements, and for those cases one has to be careful about never taking too long steps. ↩
That is, unless the matrix is small, like a 2x2 or 3x3 matrix. In these cases, computing its inverse is both completely feasible, and often also the preferred way. ↩
In this picture i have moved the nodes at the pressure boundary to have low index, which is why they are all in the top left. Initially I had not ordered any of the nodes, which spread the block out around in the whole matrix. ↩
To be clear, I'm not claiming that linear solves are quadratic in the number of columns/rows in the matrix. But it is clearly a lower bound for dense matrices. ↩
The technical term here is that $vv^\top$ is a "rank-1 matrix". ↩
Again, I'm abusing notation ever so slightly here; it is easier to follow exactly if we treat each coordinate of each node separately, but then we often need to reason about index sets of coordinates for the same nodes, or the nodes which share a triangle or a tet. When implementing this stuff, this is something that has to be done at some point, but for this post I hope isn't not too bad to follow while being a little sloppy. ↩
Unless they move exactly along the walls. ↩
Now it's important to be extra careful; in the initial SM formula we had $uv^\top$ in the parentheses, but we have $wv^\top$. ↩
Some fields, like differential geometry, use this taxonomy all the time. In differential geometry we can talk about intrinsic properties vs. extrinsic properties. If a property is intrinsic to a manifold, it doesn't matter how this manifold is embedded in some space, because the property is the same. On the other hand, an extrinsic property does depend on this. Examples of intrinsic and extrinsic properties include the Gaussian curvature (which is intrinsic) and the Mean curvature (which is extrinsic). ↩
Since it is a matrix this transformation is linear. Another way of looking at this is that we only really have three directions that we are transforming, namely the vectors out from one of the corners. Since we don't care about translation in space, we can assume that this corner starts and ends at the origin. What's left is just to move the three vectors that come out from the fixed corner, and since we are operating in $\mathbb R^3$ there is exactly one linear transformation that moves the vectors from the initial to the deformed directions. ↩

Confusing Words for a Beginner

2020-06-18T17:10:19+02:00

As people are arguing over what to call the default git branch,I started thinking about other words in programming that I now take for granted but that I also remember being very confusing and/or strange when starting out. Instead of reading more internet arguments, I figured I'd try to write about some of these words instead.

Int

Short for integer. Makes sense now, but didn't make much sense when I first encountered it; in my native language they're just called whole numbers. Luckily, the word isn't similar to any other word that I knew and could confuse it with.

Float and Double

This was very confusing when I first encountered it in CheatEngine back in the day. Float means a decimal? Why does it float? At this point I was picturing a life buoy floating in wavy water. And Double is just the same? But more exact? Oh it's bigger? Then shouldn't Float be called Single?

(This was even more confusing when I tried to read up on it as in my native language a "decimal number" refers to a real number that is not an integer, like 1.2, and not a number written in the decimal system, nor the parts after the decimal separator .2, which, according to Wikipedia are two interpretations of "decimal number".)

String

This was a big one, because in my mind a string was what's on a guitar. I might have heard "pearls on a string", but probably hadn't heard about "a string of pearls"¹, since only the latter really points to anywhere near what we call Strings in programming today. Why do we even call it a String? Is it because we have characterse neatly in a row? Isn't things in a row what Arrays are? This is still rather confusing, and I've simply come to accept that String means Text.

Can new programming languages please stop calling Text for String?

Print

As in printing something to the screen. I think this was mainly confusing because in my native language we have borrowed the verb "to print", but it is exclusively used for a physical printer. Hopefully you can appreciate my relief² when after running my first Hello World, which I think was in Java, Windows didn't complain that I didn't have a printer connected.

Argument

You know, as in a parameter³! Not as in reasoning, convincing, or fighting! Why would you think that?

Function

I'm not really sure that I was ever confused about this, but I still think it's a bad name for when you really mean a procedure. However, the word function is way easier to write since you don't have the tricky c e d sequence. It's even easier to pronounce.

Void

This was actually a good word, since it meant nothing to me.

Thanks for reading.

Even now this sounds like it should mean a string made out of the material that pearls are made of. ↩
relief/disappointment/surprise ↩
I don't think I've ever encountered a situation in which having to differentiate between parameters and arguments are useful. I think a simpler way of thinking about the distinction is with bindings and data. ↩

Simplicity as a Value

2023-11-11T12:00:17+01:00

I used to think that simplicity is good because of the other values it often brings:a simple system is easier to write; a simple system is easier to build; a simple system is more portable; a simple system is easier to debug and reason about; a simple system performs well; a simple system is easier to change. The word "simple" does some heavy lifting here, and I found that I would often use these as a metric for if something was simple or not. In other words, these weren't the result of simplicity, they were the definition.

I have since found that I don't actually need any of these values to be true for me to value simplicity. It's not that I don't care about how easy the system is to debug, how easy it is to extend, or how fast it runs, but all of these are universally good qualities. Nobody would prefer a system that's hard to debug over one that is easy to debug, all other things equal. I still think that simplicity very often brings these advantages¹, but among the advantages is simplicity itself.

My definition of simple has also changed a lot over the years. When I was starting out, Java's ArrayList felt simple, but handling arrays² felt awkward and complicated; "buffer" was a scary word. When I learned Python, it's syntax was simple, but requiring the python interpreter was not³; I couldn't just copy over my program to another computer and run it there. When I read K&R, C felt very simple, but when I tried to even build some C projects I found in the wild, my experience was very different⁴. These days I'm mostly excited about low(er) languages, like Rust, Zig, and Hare; programming in these languages feels simple because of how they map to my mental model of my computer. The code that I write that I'm happiest about is the simple code.

I guess I simply value simplicity.

Thanks for reading.

Further, I think it often gives an 80% solution for all of these values. If you want to really maximize, say, speed, readability, debuggability, portability, and all the other -abilites will suffer. ↩
A sidenote: when learning about arrays, I could not for the life of me see any use-case for it. ArrayList made perfect sense, since it was a list you could put stuff in, but I did not see the value of having a fixed-size list of things. Tangentially, I also wanted to string interpolate myself a variable; with int a1, a2, a3; int i = 3; I wanted to be able to write a3 = 0; as a{i} = 0;. I don't remember how long it took to see the connection. ↩
In hindsight, it is curious that my biggest problem with Python was that I needed the interpreter as opposed to just being able to copy a .exe, when this is also true for Java, my first language. For some reason, the lack of ahead-of-time compilation made a very big difference for me. ↩
Here's some keywords: autoconf, glibc, cmake, dependency management, dynamic libraries, macros. ↩

Expanding TeX's \newif

2021-06-19T16:29:02+02:00

Introduction

Like most of my colleagues, I use LaTeX to write papers, reports, notes, or what have you. In fact, I think all of the places that I regularly write supports some variable subset of LaTeX. Also like most of my colleagues, I'm not a TeXnician. I'm not proud to be ignorant in this regard, but there's only so many hours in a day, and the gains from properly learning a huge ecosystem like LaTeX seems minuscule compared to the initial buy-in cost.

Still, I was curious.

LaTeX and TeX, tomato tomato? Here's how I see it. If LaTeX is like C++20 --- big, complex, confusing, full of cruft, but still very popular --- then TeX is like C89 --- small, simpler¹, confusing, a child of its time, and often neglected.

There's a certain pleasure in going far enough down the stack that the systems you are using becomes simple enough to reason about on a deep level. It's the feeling you might get sitting down one afternoon trying to write some assembly after a long week of debugging consistency errors in your sharded database across multiple kubernetes clusters². No magic, no need to constantly search for other people who's had the same problems you're dealing with on StackOverflow. It's just you and the CPU, and likely the Intel Instruction Set Manual or something as big and scary. I wanted that, but with typesetting.

This was my romantic motivation to dig into TeX and try to see whether it really is rewarding to step back a few decades to avoid the complexity of newer and bigger typesetting systems. I bought the TeXbook, and read it from start to finish. Well, some paragraphs are marked with "dangerous bends", signalling that the content covered or the background assumed for those paragraphs are more advanced. I read the single bends, but skipped the double bends, at least most of the time.

Somewhere in the book I found the definition of \newif, a macro that's used to define conditionals, which you can later query, and branch on. Booleans, in other words. I read it, and really didn't understand a single thing, and I figured that if I can manage to sit down and figure out what on earth this macro is doing and why, then I've had a good taste of what it's like digging down this low in the world of TeX.

This post is the result of that process.

How Do I Write TeX?

This is not really as obvious as it might sound. After all, TeX produces a document, but when playing with macros we really want to see what forms expand to, which macros are defined, and so on. I have to say upfront that the method I used here probably wasn't the ideal, because I just started used tex (or sometimes pdftex, for the purposes of this post they seem to be exactly the same), and started writing. The repl doesn't support readline bindings or arrow keys, or clicking to move the cursor, so if I wanted to add something in the middle of a line, I had to hold backspace all the way back to where I wanted to go and write out the rest of the expression. Sometimes I pasted back and forth from a text editor, which worked okay.

Here's exactly how I got started³.

/h/martin$ tex
This is TeX, Version 3.141592653 (TeX Live 2021/Arch Linux) (preloaded format=tex)
**\relax  % don't read input from a file

*\tracingall=1                 % Give us lots of output
{vertical mode: \tracingstats}
{\tracingpages}
{\tracingoutput}
{\tracinglostchars}
{\tracingmacros}
{\tracingparagraphs}
{\tracingrestores}
{\showboxbreadth}
{\showboxdepth}
{the character =}
{horizontal mode: the character =}
{blank space  }

*\message{This will show somewhere}    % some sample message
{\message}
This will show somewhere               % here's the things you wrote above
{blank space  }

*\def\mymacro{from the macro}  % Make a new macro
{\def}
{blank space  }

*\message{\mymacro}         % \message will expand the macro
{\message}

\mymacro ->from the macro   % \mymacro is expanded to `from the macro`
from the macro              % ... and we get the fully expanded form out.
{blank space  }

*

Input lines start with a *. It's very useful to set \tracingall=1, which makes TeX output a bunch of things some of which you care about. Note that I've changed up the formatting of the output throughout this post so that it's easier to see what's going on.

Another quick note: I didn't want to spend hours write an intro to TeX as well as whatever this is, so if you have never written a line of TeX or LaTeX, this might be difficult to follow. If you've written some LaTeX, and maybe defined your own simple macros, I think you'll be fine.

The Goal

This is the definition we'll unravel, copied verbatim from The TeXbook.

\outer\def\newif#1{\count@=\escapechar \escapechar=-1
  \expandafter\expandafter\expandafter
   \def\@if#1{true}{\let#1=\iftrue}%
  \expandafter\expandafter\expandafter
   \def\@if#1{false}{\let#1=\iffalse}%
  \@if#1{false}\escapechar=\count@} % the condition starts out false
\def\@if#1#2{\csname\expandafter\if@\string#1#2\endcsname}
{\uccode`1=`i \uccode`2=`f \uppercase{\gdef\if@12{}}} % `if` is required

Don't despair if this is nonsense: the whole point of this post is to explain what's going on, and to get some better idea of how real and (somewhat) involved TeX macros work.

How TeX Reads Tokens

To start on the right foot, let's make sure that we properly understand how TeX reads tokens. A token is the input "unit" that TeX reads when it reads a document. For instance if you were to write Let $n=\numb$ be a number. then this will be transformed into a queue of tokens from which we will read one at a time. Exactly how the tokens are split up is not crucial to understanding, but in this example it looks something like this:

tokens = ['L', 'e', 't', ' ', $, 'n', '=', \numb, $, ...]

Notice three things. First, a letter is a token in of itself and we do not have one "word" be a token Second, $ is not the character '$', but the special begin/end math mode token. If we were to write \$ we would get the character token '$'. Third, the whole macro \numb is one single token. When you hear "token", think "input unit".

So how does TeX read the tokens? One mental model is like this:

while tokens is not empty
    t <- pop(tokens)
    if shouldexpand(t)
        exp <- expand(t)
        tokens.push(ex)
    else
        process(t)

Some tokens, like the \newif token we will figure out in this post, expand, and the expansion is another list of tokens, some of which might be regular character tokens, and some of which might be other tokens that also expand. Therefore when we expand a token we will push the result back onto the front of the queue.

Note that when we expand a macro that takes arguments, like \def\paren#1{(#1)} the expansion of \paren will pop more tokens from the queue, and then push the tokens of the expanded form back onto the queue.

What does it mean to "process" a token? For a character, this basically means to write that character at the current position on the page⁴. For a macro definition like \def\bob{123} it means to make the definition and storing it somewhere in memory so that if you ever encounter a \bob token you know that it expands to the three tokens 1,2,3.

A Short Example

Let \def\A{a} \def\B{\A b} \def\C{\B\B} and the input token queue be [\C]. To make sure we understand how this works, let's manually expand this whole thing. The left column is the token queue, and the left side of the queue is the front, which is the place at which we will be working. The right column explains what we're about to do.

Tokens | Current action :------|:-------------- [\C] | take \C out of the front of the queue [] | \C expands to \B\B, which we push back [\B \B] | take the first \B out [\B] | \B expands to \A b [\A b \B] | \A is taken out, expanded to a and pushed back [ a b \B] | a is taken out and processed, because it doesn't expand. [ b \B] | b is taken out and processed. [\B] | you get the idea... [\A b] | [ a b] | [ b] | [] |

The end result of this execution is that we have sent the tokens a, b, a, b to the processing part of TeX.

A Primer on Catcodes

We need to know one more thing about tokens, or rather how the characters of your input are split into them. Each character have a category code, or catcode for short. Catcodes decide how to group and split characters into a token. There is a character code for letters (11), a code for space (10), and one for math shift (3) (there are also others). This way TeX knows that in the input let $ consists of three characters, one space, and one "math shift". This is also how TeX figures out when the name of a macro ends and new tokens begin, as in \hey3: here we have one token with catcode 0 (the escape character \), three of catcode 11, and one of catcode 12 ("others", which include numbers). The name of a macro is only letters, so this way TeX knows that \hey is a macro and 3 is just the next token in the queue.

But catcodes can be changed. Why is this useful? Well, if we would like to make some macros that another user wouldn't accidentally redefine we have it include a character that, by default, isn't allowed to be in its name, like @. The catcode of @ is 12, and so the input \h@ will be read as two tokens \h and '@'. However, if we change the catcode of @ to 11 it's as if @ is just a regular letter, and \h@ will be read as a single token \h@.

This is how we change the catcode of @ to 11 and then back to 12:

*\catcode`\@=11  % Category 11 consists of regular letters
*\catcode`\@=12  % Category 12 consists of "other characters"

Some Not So Bad Macros

We need to know about a few other macros that \newif uses internally. Most of these are pretty straight forward.

`\string`

Takes an argument and replaces it by the non-expanded token list. \string\foo expands to the four tokens \ f o o, no matter what the macro \foo would expand to. A crucial detail which we will come back to is that the tokens \string produces will get catcode 12 (unless it's a space).

`\escapechar`

The character which is used when a control sequence is outputted as text. Normally set to \. If this is set to for instance @, then \string\foo would expand to the four tokens @ f o o instead.

`\uccode`

Short for uppercase code. This allows one to set the uppercase character code for another letter. Usually this would be \uccode`x=`X \uccode`X=`X and so on, but this, like most things in TeX, can be changed, and changes, like most things in TeX, are local to the current group.

`\csname` and `\endcsname`

Read and expand everything up until the matching \endcsname. The expansion result should be a list of character tokens, and this list will be made into a single control sequence token. If this is currently not defined it will be defined to \relax.

For instance \csname hello\endcsname will expand to the single token \hello and make the macro \hello expand to \relax. More interestingly, \def\inner{hello}\csname\inner\endcsname will do the same: Here the inner macro expands to the list of tokens h e l l o, and the csname pair of macros expand this macro, effectively replacing it with \csname hello\endcsname.

`\gdef`

Normally definitions made with \def are local to your scope, just like in most programming languages. However, sometimes we want to define global macros, and gdef does exactly this. When a macro is defined with \gdef it is as if it was defined in the top level scope.

{ 
    \def\inner{hello}
    \inner  % expands to  h e l l o
}
\inner   % this doesn't work, because \inner is no longer defined

{
    \gdef\inner{hello}
    \inner  % expands to  h e l l o
}
\inner % also expands to  h e l l o

`\outer`

This is a safety measure that you put before a \def which ensures that this macro is not allowed to be an argument, in the parameter text, or in the replacement text of another macro.

The `\expandafter` Macro

Now that we've seen a few simple macros we turn to one that is slightly less simple. The \expandafter macro first reads the very next token in the queue without expanding it. Then, it'll read and expand the next token after that. Last, it will put the first token back in front, without expanding it. Here's a small example of how it runs:

*\def\first{first}
*\def\second{second}
*\expandafter\first\second
{\expandafter}

\second ->SECOND

\first ->FIRST
{the letter F}
*

Here the output shows that \second is expanded before \first, and that the first token that we process is f. Note that the second form is only expanded and not actually processed, so the following does not work:

*\expandafter\first\def\first{another first!}

The second term, the \def will be expanded, but it will not "run", so when \expandafter later expands \first it will still have the same value as before, for instance not to be defined.

Due to how TeX expansion rules work, a macro doesn't have to have all of it's arguments in place when you use it; currying⁵ is in a sense possible. We can use \expandafter to use this fact if the first token expands to a curried macro, and the first token in the expansion of the second token is the argument we want to give to the curried form.

Here's an example. Say we have a macro \twoarray that takes two things and wraps them in square brackets divided by a comma, as well as a macro \tuple that expands to two tokens 4 and 5. If we want to have \twoarray wrap the two tokens from \tuple, it doesn't work out of the box:

*\def\twoarray#1#2{[ #1 , #2 ]}
*\def\tuple{4 5}
*\twoarray\tuple X  % X is just a placeholder for whatever's next; we don't want it.
[ 4 5 , X ]
% This does not work because `\twoarray` will read two tokens, `\tuple` and `X`

*\expandafter\twoarray\tuple X
[ 4 , 5 ] X
% This does work because `\tuple` is expanded before `\twoarray`, and so the token
% queue when we process `\twoarray` is  `4 5 X`

Chaining

So what happens when we chain multiple \expandafters together? Let's work it out with some notation: dashes under a line means \expandafter is skipping that line, and it's expanding the token above the hat ^. Primed a' letters means expanded.

*\expandafter a  b  c  d ...
%             -  ^
% token list: a  b' c  d

With two \expandafters this becomes

*\expandafter \expandafter a  b  c  d ...
%             ------------ ^
% token list:  \expandafter a' b  c  d
*\expandafter  a' b  c  d ...
%              -  ^
% token list:  a' b' c  d

It undid itself! The expansion order was a and then b. Let's try three expands in a row. Now we're getting somewhere, because when expanding the second token that \expandafter finds, we might end up reading additional tokens, if that token takes arguments. In this case this token is \expandafter, which does indeed take two arguments!

*\expandafter \expandafter \expandafter a  b  c  d ...
%             ------------     ^^^
%                          [eat 2 arguments]
*             \expandafter       a  b'        c  d ...
% This is just the first example again.
% token list:  a  b'' c  d ...

and we're again back to having the expansion order of a and b flipped. Despite this though, they are not identical, because expandafter does not expand a form until it only expands to itself, but only once. We can think of regular expansion as taking out the next token in the queue and if it is expandable we push back the expansion onto the queue.

Let's get concrete. As a warm up, here is the easy case where the two forms are identical, namely when expanding once is fully expanded. The list of \A ->a beneath each input line is the evaluation sequence such that the macro \A expands to the token a.

*\def\A{a}\def\B{b}\def\C{c}

*\A\B\C
\A ->a   \B ->b   \C ->c   
*\expandafter\A\B\C
\B ->b   \A ->a   \C ->c   
*\expandafter\expandafter\A\B\C
\A ->a   \B ->b   \C ->c   
*\expandafter\expandafter\expandafter\A\B\C
\B ->b   \A ->a   \C ->c

Note that just like we said above, the first and third lines are the same, and the second and fourth are the same.

Next we make it slightly more interesting by expanding macros which body is another macro:

*\def\AA{\A}\def\BB{\B}\def\CC{\C}

*\AA\BB\CC
\AA ->\A   \A ->a     \BB ->\B    \B ->b   \CC ->\C   \C ->c   
*\expandafter\AA\BB\CC
\BB ->\B   \AA ->\A   \A ->a      \B ->b   \CC ->\C   \C ->c   
*\expandafter\expandafter\AA\BB\CC
\AA ->\A   \BB ->\B   \A ->a      \B ->b   \CC ->\C   \C ->c   
*\expandafter\expandafter\expandafter\AA\BB\CC
\BB ->\B   \B ->b     \AA ->\A    \A ->a   \CC ->\C   \C ->c

The four lines have all distinct orders on which macros are expanded when, in contrast with the last example. With four expandafters we are back to as if we had none.

What if we had \AAA and friends?

The TeX tracing output is getting pretty big, so I've compressed it down to the following table, where the left column is the number of \expandafters before \AAA\BBB\CCC, and each row is the order in which macros were expanded. For instance, in the first row we first expanded \AAA, then \AA, then \A and so on.

0     AAA     AA      A    BBB     BB      B    CCC     CC      C
1     BBB    AAA     AA      A     BB      B    CCC     CC      C 
2     AAA    BBB     AA      A     BB      B    CCC     CC      C 
3     BBB     BB    AAA     AA      A      B    CCC     CC      C 
4     AAA     AA    BBB      A     BB      B    CCC     CC      C
5     BBB    AAA     BB     AA      A      B    CCC     CC      C 
6     AAA    BBB     BB     AA      A      B    CCC     CC      C 
7     BBB     BB      B    AAA     AA      A    CCC     CC      C 
8     AAA     AA      A    BBB     BB      B    CCC     CC      C

After 8 of them we are back to where we started. Also note that the CCCs never change.

\meaning\noexpand\foo

Start Actually Expanding `\newif`

If you've made it this far, good job! I realize this is a fair amount of prerequisites before getting to the point of the post.

Here's the definition of \newif again, but formatted a little differently:

\outer\def\newif#1{
    \count@=\escapechar
    \escapechar=-1
    \expandafter\expandafter\expandafter \def\@if#1{true}{\let#1=\iftrue}%
    \expandafter\expandafter\expandafter \def\@if#1{false}{\let#1=\iffalse}%
    \@if#1{false} % the condition starts out false
    \escapechar=\count@
}
\def\@if#1#2{\csname\expandafter\if@\string#1#2\endcsname}
{
    \uccode`1=`i
    \uccode`2=`f
    \uppercase{\gdef\if@12{}}
} % `if` is required

Let's do this in parts, starting with the bottom group, then the middle \def, and then move on to the actual \newif. Note that only the first form is the actual body of \newif and that the bottom group and the \def in the middle is just part of the one-time setup. We'll start with the bottom group.

The Bottom Group

{
    \uccode`1=`i
    \uccode`2=`f
    \uppercase{\gdef\if@12{}}
} % `if` is required

Recall from before that the \uccode macro sets the character code of the uppercase version of a character, so we can for instance change the uppercase of g to be H by writing \uccode`g=`H. In our snippet we are setting the uppercase version of the numbers 1 and 2 to be i and f. Yes really. Also recall that the change is local to the current group, so this change will be undone after the third macro.

So we've changed the uppercase of 1 and 2, and next we're uppercasing a gdef which name is if@12.

Let's make this slightly easier by only having one character we uppercase

*{\uccode`1=`M \uppercase{\gdef\bob1{bob}}}
*\bob
\bob M->BOB

Notice that the name of the macro is just \bob, not \bob1 or \bobM.

A note about more advanced parameter texts

TeX allows us to ensure that there are other tokens in the argument list of a macro expansion, or that the arguments are delimited by certain tokens. For instance consider the following:

*\def\commasep#1,#2{(#1, #2)}
*\message{\commasep 1 2 3 , 9 8 7}
(1 2 3 ,9) 8 7

We see that the first argument was not in fact just the first token, but all tokens up until we hit , which we had after the #1 in the parameter text. The last argument however, was just the next token.

We can also do this:

*\def\mfirst m#1{(#1)}
*\message{\mfirst a a}
! Use of \mfirst doesn't match its definition.
<*> \mfirst a
              a
*\message{\mfirst m a}
(a)

Here we've said that we need an m before we get the next token as the first argument to the macro. If the next token is not an m, like in the first attempt, we error. It is basically a very simple version of pattern matching.

Back to Bob

In our definition of \bob we have ensured that the parameter text should end with an uppercase 1, which was M. There is a problem though:

*\bob M
! Use of \bob doesn't match its definition.
<*> \bob M

?

The reason this doesn't work is that while the uppercase of 1 is temporarily set to M and the macro really does expect to be called as \bob M, the M we send in now has the wrong character code: it's a letter and not a number. We can temporarily change this in a group, and it will work.

*{\catcode`M=12 \bob M}
{begin-group character {}
{entering simple group (level 1)}
{\catcode}
{changing \catcode77=11}
{into \catcode77=12}

\bob M->BOB
{the letter B}
{end-group character }}
{restoring \catcode77=11}
{leaving simple group (level 1)}
{blank space  }

*

Now we are ready to understand the current snippet

{\uccode`1=`i \uccode`2=`f \uppercase{\gdef\if@12{}}} % `if` is required

This will define a macro \if@ that ensures that the first two tokens after it is i and f with category code 12. Also note that it will expand to nothing, but it will eat the matched tokens in the parameter list. In other words:

*\def\eat h{H} \message{\eat hello}
Hello

The h is eaten and replaced with the body of the macro, H, and the rest of the tokens ello are just characters so nothing is done to them, and the result is Hello.

To summarize, we've now globally defined a macro if@ which ensures that when applied the next two tokens in the token list will be two tokens with catcode 12 that is i and f, and these tokens will be taken out of the token list.

The Middle `\def`

Moving on to this part:

\def\@if#1#2{\csname\expandafter\if@\string#1#2\endcsname}

Let's peel the onion. We've got a csname/endcsname pair, so the output of the function will be a control sequence name, which will, unless already defined, be defined to expand to \relax. The name will be the result of \expandafter\if@\string#1#2; the arguments passed to \@if (the def we're looking at) will thus be sent to \if@, but the first argument will be eaten by \string first. We just learned that the only thing that \if@ does is to ensure that the first two tokens given are i f of catcode 12. And it just so happen that the tokens that we get from expanding \string are exactly of catcode 12!

Let's try to expand \@if{ifeven}{true}:

\@if{ifeven}{true}
\csname \expandafter\if@\string{i f e v e n}{t r u e}\endcsname
\csname \if@ i f e v e n {t r u e}\endcsname
\csname e v e n {t r u e}\endcsname
\csname e v e n t r u e\endcsname   % csname doesn't care about grouping
eventrue

The result is a single control sequence token with the name eventrue. That's it! As long as the \string expansion of the first argument starts with i f we will get a control sequence token that is the concatenation of the two arguments.

The First `\def`

Phew, back at the top. Here it is, once more:

\outer\def\newif#1{
    \count@=\escapechar
    \escapechar=-1
    \expandafter\expandafter\expandafter \def\@if#1{true}{\let#1=\iftrue}%
    \expandafter\expandafter\expandafter \def\@if#1{false}{\let#1=\iffalse}%
    \@if#1{false} % the condition starts out false
    \escapechar=\count@
}

We're almost there; it's just a matter of piecing together some of the parts that we've already unravelled. First we can note that we are temporarily setting \escapechar to be -1 and then restoring it at the end. There are two questions we can answer here: (1) why do we set it, and (2) why can't we group it instead?

We want the argument to \newif to be a control sequence, like \newif\ifred, and we also need to check that the given control sequence starts with if, which we do in \if@ through the \string macro. If naively applied, \string\ifred would expand to \ i f r e d, but we need it to be i f r e d. By setting \escapechar=-1 we make \string output nothing for \, and we are good.
Had we used grouping the \defs we have inside would be local to the group and effectively destroyed by the time we are done expanding \newif. If we were to use \gdef then all defined macros with \newif would have to be global. This way we can have the user define \newifs that are local to their groups.

That only leaves three lines in the macro body, and two of them are of the same form. From earlier we remember that three \expandafter would expand the second token in the token list twice. Let's assume #1 = \ifred. With the total form

\expandafter\expandafter\expandafter \def \@if \ifred {true} {\let \ifred = \iftrue}

we would first expand \@if, which will eat two tokens, #1 and {true} and be replaced with the body of the macro, as seen above. Then we need a second expansion to expand the csname pair, and this will expand to the control sequence token redtrue. This would be put back in the token queue,

\expandafter \def \csname \expandafter\if@\string\ifred{true}\endcsname{\let \ifred = \iftrue}
\def \redtrue{\let \ifred = \iftrue}

and at the end we have a familiar form. The same happens with the false variant. The next line is then ran:

\@if\ifred{false} % expand:
\csname \expandafter\if@\string\ifred{true}\endcsname  % eval the csname pair
\redfalse  % we just defined this macro
\let\ifred=\iffalse  % run this

At last, we restore \escapechar to whatever it was initially.

In Conclusion

Taking it all together, running \newif\ifred expands to this:

% In the preamble we have the forms
\def\@if#1#2{\csname\expandafter\if@\string#1#2\endcsname}
{\uccode`1=`i \uccode`2=`f \uppercase{\gdef\if@12{}}} % `if` is required

% The user writes
\newif\ifred
% .. which expands to
\count@=\escapechar
\escapechar=-1
\expandafter\expandafter\expandafter \def\@if\ifred{true}{\let\ifred=\iftrue}
\expandafter\expandafter\expandafter \def\@if\ifred{false}{\let\ifred=\iffalse}
\@if\ifred{false}
\escapechar=\count@
% ... which is basically the same as
\def\redtrue{\let\ifred=\iftrue}
\def\redfalse{\let\ifred=\iffalse}
\redfalse

and that's it! So hey, we had to peel a few onions⁶, but in the end we managed to unravel the mystery and really understand what's going on in \newif; it turns out it's quite a lot, though the main functionality seems that we don't have to write these three lines every time we want to define a new conditional, but that only one suffices.

If you want to know more "real" definition and edge cases, check out this site; I went back and forth on that and in the TeXbook when writing this post, and having a searchable index of basically the entire language is, well, indispensable. Of course, if you don't know much about TeX from before I can only assume that the reference will be hard to dig into.

Notes, comments, questions, and tomatoes can be sent to my public inbox.

Hope you learned something, and thanks for reading.

I couldn't call C89 or TeX simple in good faith. ↩
I don't know what I'm talking about here; can you tell? ↩
btw I use arch ↩
This isn't really how it works, but for the purposes of this post we might as well pretend it is. ↩
This example is more close to destructuring, but I didn't want to get in the weeds of constructing an example that looked more like currying. Here's a sketch: you can have a macro in the body of another macro \func #1 x y such that #1 expands to another macro. If we \expandafter the #1 here we might get something like \func u v w x y and so we've effectively constructed a function f(g) = h(g(), x, y). ↩
Something something crying when peeling an onion. ↩

Building Zig structs at Compile Time

2022-06-11T22:05:08+02:00

Let's talk about comptime in Zig. comptime is the feature that allowyou to run code at compile time, and is maybe Zig's biggest differentiator from other languages in the same space. Combined with having types as values we get both type specializaton, generics, reflection, and even code generation.

For readers who are not familiar with Zig, here's a small example. We can make a Range type that is generic over the element type by writing a function called Range that takes a type (which is required to be compile time known), and produces a struct with two fields of that type. Returning the struct from the function is no problem; types are values after all. It looks like this:

fn Range(comptime t: type) type {
    return struct {
        from: t,
        to: t,
    };
}

We can use this new function, and the type it returns, like this¹:

test "range-create" {
    var a = Range(i32){ .from = 0, .to = 10 };
    std.debug.print("\n[{}, {})\n", .{ a.from, a.to });
}

which, when ran, prints out the numbers we gave in a math-like format.

$ zig test comptime-struct/cs.zig --test-filter range
Test [0/1] test "range-create"...
[0, 10)
All 1 tests passed.

We can also add a method to the newly created type, for instance for checking whether a value is in the range or not. The type of this parameter other in the contains method is the type t that we're given as argument in Range, and it works just as expected.

fn Range(comptime t: type) type {
    return struct {
        from: t,
        to: t,

        pub fn contains(this: @This(), other: t) bool {
            return this.from <= other and other < this.to;
        }
    };
}

Here we're using the @This() builtin which gives us the type in which we currently are. We need this here since we don't have a name for the type yet, as we're still defining it. There's nothing special about the name this, but it is familiar from many other languages, and since the builtin is called @This it's a convenient name to give. The new method can be tested like so:

test "range-contains" {
    var r = Range(i32){ .from = 0, .to = 10 };
    try std.testing.expect(r.contains(5));
    try std.testing.expect(r.contains(0));
    try std.testing.expect(r.contains(9));
    try std.testing.expect(!r.contains(10));
    try std.testing.expect(!r.contains(-1));
}

which works:

$ zig test comptime-struct/cs.zig --test-filter range-contains
All 1 tests passed.

Building Structs

Usually in Zig, the way you define a struct is by assigning the value of a struct { .. } expression to a name, like this:

const MyString = struct {
    someNumber: i32,
    aBool: bool,
    yourString: []const u8,
};

We have just seen how to control the types of the struct fields programatically (and, I stress, with completely regular Zig code!). What about the names? Or both? It is possible to construct, at compile time, a new struct in which the names and types of all of the field come from some other data?

The answer is yes! The key is the @Type builtin, which takes a std.builtin.TypeInfo² and reifies³ the description of the type into a real type. Here's how it looks:

test "reify-empty" {
    const Type = @Type(.{
        .Struct = .{
            .layout = .Auto,
            .fields = &[_]std.builtin.TypeInfo.StructField{},
            .decls = &[_]std.builtin.TypeInfo.Declaration{},
            .is_tuple = false,
        },
    });
    try std.testing.expect(@sizeOf(Type) == 0);
}

This will create an empty struct, since we're instantiating the .Struct field of the TypeInfo enum with both .fields and .decls empty. So far this only seems to be a difficult way of writing const Type = struct {};, but this is just regular Zig code, and while we require that the value passed to @Type is compile time known, we don't require it to be one big literal like it is now. It can very well be the result of a complex computation, as long as it is compile time known.

We can for instance write a function that takes an anonymous struct literal with names and types that should be the fields of a struct, and if the name starts with a ? it automatially makes the field optional. In code, calling our function

const Foo = MakeStruct(.{
    .{ "someNumber", i32 },
    .{ "?aBool", bool },
    .{ "?yourString", yourString },
});

should be the same writing

const Foo = struct {
    someNumber: i32,
    aBool ?bool,
    yourString: ?[]const u8,
};

One way of doing this by building up a list of StructFields with the right names and types, making a TypeInfo struct with those fields, and pass it to @Type. The only thing we must do is to branch on whether the variable name starts with a ?, and if so, remove the ? from the name and turn the given type into an optional type, T to ?T. Here is an example:

fn MakeStruct(comptime in: anytype) type {
    var fields: [in.len]std.builtin.TypeInfo.StructField = undefined;
    for (in) |t, i| {
        var fieldType: type = t[1];
        var fieldName: []const u8 = t[0][0..];
        if (fieldName[0] == '?') {
            fieldType = @Type(.{ .Optional = .{ .child = fieldType } });
            fieldName = fieldName[1..];
        }
        fields[i] = .{
            .name = fieldName,
            .field_type = fieldType,
            .default_value = null,
            .is_comptime = false,
            .alignment = 0,
        };
    }
    return @Type(.{
        .Struct = .{
            .layout = .Auto,
            .fields = fields[0..],
            .decls = &[_]std.builtin.TypeInfo.Declaration{},
            .is_tuple = false,
        },
    });
}

There's another thing to highlight here. We are declaring fields to be an array of length in.len, even though in is the argument of the function. This is fine since in is declared to be comptime known, and so of course we should be able to declare statically sized arrays of that length, and indeed, in Zig we can.

We can see that we're getting what we expect by using the "inverse" builtin of @Type which is @typeInfo. @typeInfo takes a type and returns its std.builtin.TypeInfo, which we can operate on.

test "make-struct" {
    const Type = MakeStruct(.{
        .{ "someNumber", i32 },
        .{ "?aBool", bool },
        .{ "?yourString", []const u8 }, 
    });
    
    std.debug.print("\n", .{});
    inline for (@typeInfo(Type).Struct.fields) |f, i| {
        std.debug.print("field {} is {s} type is {s}\n", .{ i, f.name, f.field_type });
    }
}

Here we are just looping over the fields of the struct and printing out the names and types in order. The result is this:

$ zig test comptime-struct/cs.zig --test-filter make
Test [0/1] test "make-struct"...
field 0 is someNumber type is i32
field 1 is aBool type is ?bool
field 2 is yourString type is ?[]const u8
All 1 tests passed.

We have succesfully moved the ? from the field names and over to the field types. Granted, this new way of making structs does not offer very much in terms of readability or functionality. Putting the ? in the name isn't any easier than having it in the type.

So What?

Even though we are effectively generating code at compile time, there's no magic here: we're just writing regular Zig code. The data types we're making are from std.builtin, so they're tightly bound to the language, but there's no special syntax, and no second language to learn and remember. By simply filling in a std.builtin.TypeInfo we can construct new types at compile time.

Also, the input to our function was an anonymous struct literal, but this doesn't have to be the case. We could have taken a []const u8 with source code of a struct definition from another language like C++ or Rust, parsed it, and constructed the corresponding Zig type for the given definition. Parsing the other language would be the vast majority of the work, because as we've just seen, making the Zig struct is really easy.

Another idea is to have a compile-time readable configuration .ini file embedded in the source with @embedFile, and a function that reads in the file, finds the names and types of the values in the file, and collects it all into a struct. This struct would always be in perfect correspondance with the .ini file, and so there is no danger of reference configuration file and code diverging. There would be one definite source of truth for the configuration values.

In most other compiled languages, this is very difficult to do without any external tools. One would most likely try to go the other way, and have the struct definition be the single source of truth, and output the default config file from that, either through a function that has to be kept up-to-date as fields are added and changed, or by a macro system, which is likely to be written in some DSL. If you would want to have the configuration file as a plain text file, you would need to ensure that the file on disk is always consistent with the code; maybe you would want this to be a distinct step in the build process of the program.

Either way, the value proposition of Zig is clear: by simply allowing Zig code to be ran at compile time⁴ we get a powerful and easy to use metaprogramming system without requiring to learn a second language or use external tools.

Pointers, complaints, suggestions, and others can be sent to my public inbox (plain text emails only).

Thanks for reading.

We could also have written var a: Range(i32) = .{ .from = 0, .to = 10 }; even though it might look funny that we have put a function call in the type specifier position of the expression, as this is usually reserved for type literals in other languages. Not so in Zig! ↩
This type is about to be renamed to just Type, but I'm running my code samples with the 0.9.1 compiler which is still using the old name. ↩
From Merriam-Webster: "Reify: to consider or represent (something abstract) as a material or concrete thing : to give definite content and form to (a concept or idea)" ↩
At comptime, the full Zig language is available, but there are some limitations. For instance, I/O is not allowed. ↩

Content Aware Image Resize

2017-02-13T17:30:00+01:00

Content aware image resizing, liquid image resizing, retargeting, or seam carving,refers to a image resizing technique where one can insert or remove seams, or "paths of least importance", in order to shrink or grow the image. I was introduced to the concept by a YouTube video by Shai Avidan and Ariel Shamir.

In this blog post, I'll go through a simple proof-of-concept implementation of content aware image resizing, naturally in Rust :)

For our sample image, I simply searched¹ for "sample image", and got back this²:

Sketching out a top down approach

Let's start with some brainstorming. I imagine the library to be used like this:

/// caller.rs
let mut image = car::load_image(path);
// Resize to a known size?
image.resize_to(car::Dimensions::Absolute(800, 580));
// or remove 20 rows?
image.resize_to(car::Dimensions::Relative(0, -20));
// Maybe show the image in a window?
car::show_image(&image);
// or save to disk?
image.save("resized.jpeg");

The most important functions in lib.rs could look something like this:

/// lib.rs
pub fn load_image(path: Path) -> Image {
    // We'll forget about error handling for now :)
    Image {
        inner: some_image_lib::load(path).unwrap(),
    }
}

impl Image {
    pub fn resize_to(&mut self, dimens: Dimensions) {
        // How many columns and rows do we need to insert/remove?
        let (mut xs, mut ys) = self.size_diffs(dimens);
        // When we want to add columns and rows, we would like
        // to always pick the path with the lowest score, no
        // matter if it's a row or a column.
        while xs != 0 && ys != 0 {
            let best_horizontal = image.best_horizontal_path();
            let best_vertical = image.best_vertical_path();
            // Insert the best
            if best_horizontal.score < best_vertical.score {
                self.handle_path(best_horizontal, &mut xs);
            } else {
                self.handle_path(best_vertical, &mut ys);
            }
        }
        // Insert the rest in either direction.
        while xs != 0 {
            let path = image.best_horizontal_path();
            self.handle_path(path, &mut xs);
        }
        while ys != 0 {
            let path = image.best_vertical_path();
            self.handle_path(path, &mut ys);
        }
    }
}

This gives us some idea on how to approach writing system. We need to load an image, we need to find these seams, or paths, and we need to handle removing such a path from the image. In addition, we would perhaps like to be able to see our result.

Let's do the image loading first, so we know what kind of API we're working with.

`image`

The image library from the Piston developers seems useful, so we'll add image = "0.12" to our Cargo.toml. A quick search in the docs is all that it takes for us to write the image loading:

struct Image {
    inner: image::DynamicImage,
}

impl Image {
    pub fn load_image(path: &Path) -> Image {
        Image {
            inner: image::open(path).unwrap()
        }
    }
}

A natural next step is figuring out how to get the gradient magnitudes from a image::DynamicImage. The image crate doesn't provide a way to do this directly, but the imageproc crate does: imageproc::gradients::sobel_gradients. Here however, we run into trouble³. The sobel_gradient function takes an 8-bit grayscale image, and returns a 16-bit grayscale image. The image we have loaded is an RGB image with 8-bits per channel, so we'll have to decompose the channels, convert the three channels into separate grayscale images, compute the gradients of the three component images, and then merge the gradients together into one image, in which we will do the path searching.

Is this elegant? No. Does it work? Maybe :)

type GradientBuffer = image::ImageBuffer<image::Luma<u16>, Vec<u16>>;

impl Image {
    pub fn load_image(path: &Path) -> Image {
        Image {
            inner: image::open(path).unwrap()
        }
    }

    fn gradient_magnitude(&self) -> GradientBuffer {
        // We'll assume RGB
        let (red, green, blue) = decompose(&self.inner);
        let r_grad = imageproc::gradients::sobel_gradients(red.as_luma8().unwrap());
        let g_grad = imageproc::gradients::sobel_gradients(green.as_luma8().unwrap());
        let b_grad = imageproc::gradients::sobel_gradients(blue.as_luma8().unwrap());

        let (w, h) = r_grad.dimensions();
        let mut container = Vec::with_capacity((w * h) as usize);
        for (r, g, b) in izip!(r_grad.pixels(), g_grad.pixels(), b_grad.pixels()) {
            container.push(r[0] + g[0] + b[0]);
        }
        image::ImageBuffer::from_raw(w, h, container).unwrap()
    }
}

fn decompose(image: &image::DynamicImage) -> (image::DynamicImage,
                                              image::DynamicImage,
                                              image::DynamicImage) {
    let w = image.width();
    let h = image.height();
    let mut red = image::DynamicImage::new_luma8(w, h);
    let mut green = image::DynamicImage::new_luma8(w, h);
    let mut blue = image::DynamicImage::new_luma8(w, h);
    for (x, y, pixel) in image.pixels() {
        let r = pixel[0];
        let g = pixel[1];
        let b = pixel[2];
        red.put_pixel(x, y, *image::Rgba::from_slice(&[r, r, r, 255]));
        green.put_pixel(x, y, *image::Rgba::from_slice(&[g, g, g, 255]));
        blue.put_pixel(x, y, *image::Rgba::from_slice(&[b, b, b, 255]));
    }
    (red, green, blue)
}

When ran, Image::gradient_magnitune takes our bird image, and returns this:

The path of least resistance

Now we have to implement the arguably hardest part of the program: the DP algorithm to find the path of least resistance. Let's take a quick look at how this will work out. For simplicitys sake, we'll only look at the case where we find a vertical path. Imagine the table below being the gradient image of a 6x6 image.

$$ G = \begin{bmatrix} 1 & 4 & 3 & 4 & 2 & 1\\\ 2 & 2 & 3 & 5 & 3 & 2\\\ 1 & 4 & 5 & 5 & 1 & 2\\\ 4 & 4 & 3 & 1 & 5 & 3\\\ 5 & 3 & 2 & 2 & 3 & 1\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} $$

The point of the algorithm is to find a path $P=p_1 \dots\ p_6$ from one of the top cells $G_{1i}$ to one of the bottom cells $G_{6j}$, such that we minimize $\sum_{1 \leq i \leq 6} p_i$. This can be done by creating a new table $S$ using the following recurrence relation (ignoring boundaries):

$$ S_{ji} = \begin{cases} G_{6i} & \text{ if } i = 6\\ G_{ji} + \min(S_{j + 1, i - 1}, S_{j + 1, i}, S_{j + 1, i + 1}) & \text{ otherwise} \end{cases} $$

That is, each cell in $S$ is the minimum sum from that cell to a cell on the bottom. Every cell selects the smallest of the three cells below it in the table to be the next cell in the path. When we have completed $S$, we simply select the smallest number in the top row to be our start.

Let's find $S$:

$$ S^{(1)} = \begin{bmatrix} - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} \hspace{1cm} S^{(2)} = \begin{bmatrix} - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ 6 & 4 & 3 & 3 & 4 & 2\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} $$ $$ S^{(3)} = \begin{bmatrix} - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ 8 & 7 & 6 & 4 & 7 & 5\\\ 6 & 4 & 3 & 3 & 4 & 2\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} \hspace{1cm} S^{(4)} = \begin{bmatrix} - & - & - & - & - & -\\\ - & - & - & - & - & -\\\ 8 & 10 & 9 & 9 & 5 & 7\\\ 8 & 7 & 6 & 4 & 7 & 5\\\ 6 & 4 & 3 & 3 & 4 & 2\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} $$ $$ S^{(5)} = \begin{bmatrix} \ - & - & - & - & - & -\\\ 10 & 10 & 12 & 10 & 8 & 7\\\ 8 & 10 & 9 & 9 & 5 & 7\\\ 8 & 7 & 6 & 4 & 7 & 5\\\ 6 & 4 & 3 & 3 & 4 & 2\\\ 3 & 1 & 4 & 4 & 1 & 1 \end{bmatrix} \hspace{1cm} S^{(6)} = \begin{bmatrix} 11 & 14 & 13 & 13 & 10 & \textbf{8}\\\ 10 & 10 & 12 & 10 & 8 & \textbf{7}\\\ 8 & 10 & 9 & 9 & \textbf{5} & 7\\\ 8 & 7 & 6 & \textbf{4} & 7 & 5\\\ 6 & 4 & 3 & \textbf{3} & 4 & 2\\\ 3 & 1 & 4 & 4 & \textbf{1} & 1 \end{bmatrix} $$

And there it is! We can see that there is a path which sums to only 8, and that the path starts in the upper right corner. In order to find the path, we could have saved which way we went for each cell (left, down, or right), but we don't have to: we can simply choose the minimum child of each cell, because the cells in $S$ says how long the shortest path from that cell to a bottom cell is.

Also note that there are two paths that sum to 8 (the two bottom cells differ in the two paths).

Implementation

Since we are just prototyping we will do the simplest thing. We'll make a struct with an array for the table, and just for loop our way through the algorithm.

struct DPTable {
    width: usize,
    height: usize,
    table: Vec<u16>,
}

impl DPTable {
    fn from_gradient_buffer(gradient: &GradientBuffer) -> Self {
        let dims = gradient.dimensions();
        let w = dims.0 as usize;
        let h = dims.1 as usize;
        let mut table = DPTable {
            width: w,
            height: h,
            table: vec![0; w * h],
        };
        // return gradient[h][w], save us some typing
        let get = |w, h| gradient.get_pixel(w as u32, h as u32)[0];

        // Initialize bottom row
        for i in 0..w {
            let px = get(i, h - 1);
            table.set(i, h - 1, px)
        }
        // For each cell in row j, select the smaller of the cells in the
        // row above. Special case the end rows
        for row in (0..h - 1).rev() {
            for col in 1..w - 1 {
                let l = table.get(col - 1, row + 1);
                let m = table.get(col    , row + 1);
                let r = table.get(col + 1, row + 1);
                table.set(col, row, get(col, row) + min(min(l, m), r));
            }
            // special case far left and far right:
            let left = get(0, row) + min(table.get(0, row + 1), table.get(1, row + 1));
            table.set(0, row, left);
            let right = get(0, row) + min(table.get(w - 1, row + 1), table.get(w - 2, row + 1));
            table.set(w - 1, row, right);
        }
        table
    }
}

After running, we can convert the DPTable back to a GradientBuffer, and write it to a file. The pixels in the image below are the path weights divided by 128.

The image can be interpreted as follows: white pixels are cells that have a large sum from it to the bottom. These pixels has so much detail (change of color) around it (which we would like to preserve) so the gradient, which tells something about the rate of change, is large. Since the path finding algorithm will search for the smallest sum, which here is the "darkest path", the algorithm will try its best to avoid these pixels. That is, the white parts in the gradient image are the most distinct parts.

Finding the path

Now that we have the entire table, finding the best path is easy: it's just a matter of searching through the uppper row and creating a vec of indices, by always choosing the smallest child:

impl DPTable {
    fn path_start_index(&self) -> usize {
        // Has FP Gone Too Far?!
        self.table.iter()
            .take(self.width)
            .enumerate()
            .map(|(i, n)| (n, i))
            .min()
            .map(|(_, i)| i)
            .unwrap()
    }
}

struct Path {
    indices: Vec<usize>,
}

impl Path {
    pub fn from_dp_table(table: &DPTable) -> Self {
        let mut v = Vec::with_capacity(table.height);
        let mut col: usize = table.path_start_index();
        v.push(col);
        for row in 1..table.height {
            // Leftmost, no child to the left
            if col == 0 {
                let m = table.get(col, row);
                let r = table.get(col + 1, row);
                if m > r {
                    col += 1;
                }
            // Rightmost, no child to the right
            } else if col == table.width - 1 {
                let l = table.get(col - 1, row);
                let m = table.get(col, row);
                if l < m {
                    col -= 1;
                }
            } else {
                let l = table.get(col - 1, row);
                let m = table.get(col, row);
                let r = table.get(col + 1, row);
                let minimum = min(min(l, m), r);
                if minimum == l {
                    col -= 1;
                } else if minimum == r {
                    col += 1;
                }
            }
            v.push(col + row * table.width);
        }

        Path {
            indices: v
        }
    }
}

In order to see if the paths selected are at least plausible, I generated 10 paths, and colored them yellow:

Looks plausible to me!

Removal

The only thing remaining now is to remove the path instead of coloring it yellow. Since we simply want to get something to work, we could do this in a pretty simple way: get the raw bytes from the image, and copy the intervals between in indexes we want to remove over in a new array, which we create a new image from.

impl Image {
    fn remove_path(&mut self, path: Path) {
        let image_buffer = self.inner.to_rgb();
        let (w, h) = image_buffer.dimensions();
        let container = image_buffer.into_raw();
        let mut new_pixels = vec![];

        let mut path = path.indices.iter();
        let mut i = 0;
        while let Some(&index) = path.next() {
            new_pixels.extend(&container[i..index * 3]);
            i = (index + 1) * 3;
        }
        new_pixels.extend(&container[i..]);
        let ib = image::ImageBuffer::from_raw(w - 1, h, new_pixels).unwrap();
        self.inner = image::DynamicImage::ImageRgb8(ib);
    }
}

Finaly, the time has come. Now we can remove a line from an image, or we could loop, and remove, say, 200 lines:

let mut image = Image::load_image(path::Path::new("sample-image.jpg"));
for _ in 0..200 {
    let grad = image.gradient_magnitude();
    let table = DPTable::from_gradient_buffer(&grad);
    let path = Path::from_dp_table(&table);
    image.remove_path(path);
}

However, we can see that the algorithm has removed quite a lot of the right side of the image, that is, the image is more or less cropped, which was exactly one of the problems that we would like to solve! A quick and somewhat dirty fix to this is to simply alter the gradient a little, by explicitly setting the borders to some large number, say 100.

Tada!

There are quite a few artifacts here, which makes the end result a little less satisfactory. The bird however is almost untouched, and still looks great (to me). You could also argue that we have destroyed all sense of image composition in the process of making this image only slightly smaller. To this I will say .... uum.... yes.

Seeing is believing

Saving the images to a file and looking at it is kind of cool, but it isn't resize-window-live-update cool! As a final effort, let's try to hack something together.

First, we need to be able to load, get, and resize an image outside of the crate. We'll try to make something like our initial plan:

extern crate content_aware_resize;
use content_aware_resize as car;

fn main() {
    let mut image = car::load_image(path);
    image.resize_to(car::Dimensions::Relative(-1, 0));
    let data: &[u8] = image.get_image_data();
    // Somehow show this data in a window
}

We start simple, by only adding exactly what we need, and taking shortcuts where we can.

pub enum Dimensions {
    Relative(isize, isize),
}
...
impl Image {
    fn size_difference(&self, dims: Dimensions) -> (isize, isize) {
        let (w, h) = self.inner.dimensions();
        match dims {
            Dimensions::Relative(x, y) => {
                (w as isize + x, h as isize + x)
            }
        }
    }

    pub fn resize_to(&mut self, dimensions: Dimensions) {
        let (mut xs, mut _ys) = self.size_difference(dimensions);
        // Only horizontal downsize for now
        if xs < 0 { panic!("Only downsizing is supported.") }
        if _ys != 0 { panic!("Only horizontal resizing is supported.") }
        while xs > 0 {
            let grad = self.gradient_magnitude();
            let table = DPTable::from_gradient_buffer(&grad);
            let path = Path::from_dp_table(&table);
            self.remove_path(path);
            xs -= 1;
        }
    }

    pub fn get_image_data(&self) -> &[u8] {
        self.inner.as_rgb8().unwrap()
    }
}

Just a little copy-paste!

Now, maybe we want the resizable window. We can start a new project, include the library crate, and use, say, sdl2 to get something up fast.

extern crate content_aware_resize;
extern crate sdl2;
use content_aware_resize as car;
use sdl2::rect::Rect;
use sdl2::event::{Event, WindowEvent};
use sdl2::keyboard::Keycode;
use std::path::Path;

fn main() {
    // Load image
    let mut image = car::Image::load_image(Path::new("sample-image.jpeg"));
    let (mut w, h) = image.dimmensions();

    // Setup sdl2 stuff, and get a window
    let sdl_ctx = sdl2::init().unwrap();
    let video = sdl_ctx.video().unwrap();
    let window = video.window("Context Aware Resize", w, h)
        .position_centered()
        .opengl()
        .resizable()
        .build()
        .unwrap();

    let mut renderer = window.renderer().build().unwrap();

    // Convenience function to update `texture` with a resized image
    let update_texture = |renderer: &mut sdl2::render::Renderer, image: &car::Image| {
        let (w, h) = image.dimmensions();
        let pixel_format = sdl2::pixels::PixelFormatEnum::RGB24;
        let mut tex = renderer.create_texture_static(pixel_format, w, h).unwrap();
        let data = image.get_image_data();
        let pitch = w * 3;
        tex.update(None, data, pitch as usize).unwrap();
        tex
    };
    let mut texture = update_texture(&mut renderer, &image);

    let mut event_pump = sdl_ctx.event_pump().unwrap();
    'running: loop {
        for event in event_pump.poll_iter() {
            // Handle exit and resize events
            match event {
                Event::Quit {..}
                | Event::KeyDown { keycode: Some(Keycode::Escape), .. } => { break 'running },
                Event::Window {win_event: WindowEvent::Resized(new_w, _h), .. } => {
                    // Find out how many pixels we sized down, and scale down
                    // the image accordingly
                    let x_diff = new_w as isize - w as isize;
                    if x_diff < 0 {
                        image.resize_to(car::Dimensions::Relative(x_diff, 0));
                    }
                    w = new_w as u32;
                    texture = update_texture(&mut renderer, &image);
                },
                _ => {}
            }
        }
        // Clear, copy, and present.
        renderer.clear();
        renderer.copy(&texture, None, Some(Rect::new(0, 0, w, h))).unwrap();
        renderer.present();
    }
}

And that's it. A days work, wih only very little knowledge of sdl2, image, and blog post writing. I hope you enjoyed it, if only just a little bit :)

Somehow, duckduckgoed doesn't work as well as googled when used as a verb. ↩
http://imgsv.imaging.nikon.com/lineup/lens/zoom/normalzoom/af-s_dx_18-140mmf_35-56g_ed_vr/img/sample/sample1_l.jpg ↩
I'd like to know if there is an easier way to do this! In addition, saving the resulting gradient is seemingly not possible at the moment, as the function returns an ImageBuffer over u16, while ImageBuffer::save requires the underlying data to be u8. I also couldn't figure out how to create a DynamicImage (which also has a ::save, with a slightly cleaner interface) from an ImageBuffer, but this might be possible. ↩

WTFs in Floating Point Math

2016-07-17T19:33:54+02:00

We all know that floating point nubmers have a limited precision.With small numbers, say between zero and one, we can describe the fractional part of the number in great detail. However, if the number is large, there may not be room for any fractional part. So how large numbers do we need before a == a + 1?

We'll find our answer by simply checking:

#include <stdio.h>

int simple() {
  float num = 1.0f;
  float inc = num + 1;

  while (num != inc) {
    num *= 10;
    inc = num + 1;
  }
  printf("%f\n", num);
}

which outputs 100000000.000000. If we multiply by 2 insead of 10, we may get a more intuitive number: 16777216 = 2^24. If we look at the IEEE 754 single-precision floating-point format, we see that the fraction part uses only 24 bits; our number 2^24 needs 25 bits, so we can safely add one, because the last bit gets truncated.

How high can we go if we are using double instead of float? As IEEE 754 double-presicion uses 53 bits, we are expecting to get to 2^53 = 9007199254740992, or 10000000000000000 if we multiply by 10. And that is what we get.

Adding more than one

How much can we add before getting a new number? One would think the answer was simply until the addition carries over to the 24 first bits. However, the answer is not that straight forward. Instead, IEEE has defined Rounding rules, for deciding what to do when the number we would like to represent doesn't quite fit in 24 bits. The default rule is «Round to nearest, ties to even», meaning if we are in between two numbers, round to the even number.

Let's try this out with the totaly random number 314159256:

314159265 = 10010101110011011000010100001
            |------- 24 bits ------||---|
// The last 5 bits will be truncated
// .. so the 16 numbers from
314159264 = 10010101110011011000010100000
// to
314159279 = 10010101110011011000010101111
// are all the same.

This is fairly easy to confirm:

void confirm_suspicions(void) {
  double first = 314159264;
  double number = first;
  while (((float) first) == ((float) number)) {
    number++;
  }
  printf("%f != %f\n", number, first);
  // 314159280 != 314159264
}

If we however were to set the fifth bit, the rounding rule would round all the way up to 314156296

314159280 = 10010101110011011000010110000
            |------- 24 bits ------||---|
// Round up, so the least significant bit of the 24 bits becomes 0
314159296 = 10010101110011011000011000000
// This will be the same number all the way up to
314159312 = 10010101110011011000011010000
// .. where the rounding rule allows us to include ..10000, as the lsb is 0.
// This makes 33 different numbers, all represented as the same floating point number.

Conclusion

What can we take home from this? First of all, floating point numbers doesn't behave like real numbers, and its easy to forget this:

float a,b;
// ...
a = b;
a++;
a == b // true ???

or what about this?

float a;
// ...
for (; a < large_float; a++) {
  // ...
}

which is an infinite loop, if a == (a + 1). Lastly, consider this:

void oops(void) {
  float a = 314159264;
  float c = a + 10 + 10;
  printf("%d\n", c == 314159284.f); // 0
  float d = a + (10 + 10);
  printf("%d\n", d == 314159284.f); // 1
}

And you thought addition was associative? Not in the world of floating point numbers!

Advent of Common Lisp, Day 1-4

2018-12-01T13:46:06+01:00

The only exposure that I have to Common Lisp is that I wrote about 1000 linesof it about 4 years ago. Since I don't have any excuse to write CL day-to-day, the days since I last typed defun seems to have added up. Luckily, the Advent of Code is upon us, which is a great way of learning a new language or brushing dust of old skills of a language you once knew; I'm taking the opportunity to finally write me some Common Lisp.

Common Lisp, Emacs, Slime, and QuickLisp

People seem to say that the way of writing CL is in Emacs using Slime; I am a long going vim addict, but I have spend the last few months in Spacemacs, in order to see what I've been missing out on, so being pressured into using Emacs isn't all that bad.

I'm still not sure exactly what Slime is, but it seems to be something that allows me to write code in emacs, and send it to a Lisp process, which sounds useful enough. Oh, and it also has a debugger which, though a little difficult to use, looks promising. Slime is installed using package-install, like most other things in the emacs world.

Installing QuickLisp

QuickLisp is a library manager for Common Lisp, and it comes in handy when we want to do something that the standard library doesn't offer but that we don't want to write ourselves. Installing quicklisp is rather easy, and the process is pretty much described on its website. We download a file quicklisp.lisp, load it with sbcl --load <path-to-file>, and that's it. Now all we must do is evaluate (load "~/quicklisp/setup.lisp") in Lisp, and we're ready to go.

Reading Input

We will probably read input from a file every day, so having a function that returns a list of strings, one for each line, makes sense. uiop is a library that comes with asdf and contains the function uiop:read-file-lines which does exactly this. This is the function we will be using, if nothing else is mentioned.

Day 1

Part 1

The first challenge was simple enough: sum a list of numbers. This is straight forward in any Lisp, provided you remember whether the function is called fold or reduce:

(defparameter *input-1* (mapcar #'parse-integer (uiop:read-file-lines "1.input")))
(defun day-1/1 (numbers)
  (reduce #'+ numbers))

Part 2

The second part is slightly worse: we are asked to keep track of all partial sums through the list and see what sum we get twice first. In addition, if no collisions are found throughout the first iteration of the list, we should restart, while keeping the accumulated sum.

I first attempted using the loop macro:

(defun day-1/2-loop (numbers)
  (loop for n in (cons 0 numbers)
        summing n into freq
        when (find n seen)
          return freq
        append (list freq) into seen))

One quirk with this attempt is that append seemingly want a list as its first argument, and not the element you are appending --- we're really just joining two lists --- so we construct the first list explicitly. A worse thing about this is that this function only runs through the list once. After spending 15 minutes looking at tutorials, cookbooks, and other documentation, looking for a way to just repeat the for loop if we exhaust the list, I guessed that it's not possible using loop, so I rewrote it using a much worse looking recursive function:

(defun day-1/2-list (numbers)
  (labels ((inner (numbers current freq seen)
                 (if (current)
                     (if (find freq seen)
                         freq
                         (inner numbers 
                                (cdr current) 
                                (+ freq (car current)) 
                                (cons freq seen)))
                     (inner numbers numbers freq seen))))
  (inner numbers numbers 0 nil)))

Since we're checking through the seen list in each call, this is has squared complexity. Looking at the runtime, it shows:

(time (day-1/2-list *input-1*))
Evaluation took:
  54.224 seconds of real time
  54.176661 seconds of total run time (54.173330 user, 0.003331 system)
  99.91% CPU
  157,466,185,510 processor cycles
  2,162,688 bytes consed
  
219

One option of improving this is to use a hash table instead of a list for seen:

(defun day-1/2-ht (numbers)
  (let ((seen (make-hash-table)))
    (labels ((inner (numbers current freq)
                    (if current
                        (if (gethash freq seen)
                            freq
                            (progn (setf (gethash freq seen) t)
                                (inner numbers (cdr current) (+ freq (car current)))))
                        (inner numbers numbers freq))))
      (inner numbers numbers 0))))

As one would expect the running time is much better now:

(time (day-1/2-ht *input-1*))
Evaluation took:
  0.026 seconds of real time
  0.025140 seconds of total run time (0.025136 user, 0.000004 system)
  [ Run times consist of 0.007 seconds GC time, and 0.019 seconds non-GC time. ]
  96.15% CPU
  73,008,592 processor cycles
  20,931,664 bytes consed
  
219

Day 2

Part 1

The task of the second day amounts to checking whether a string contains exactly two or exactly three of any character. Since we've seen that list processing in Lisp can be quite slow, I want to go for a more traditional solution:

Turn the String into an Array.
Sort the Array.
Loop through and count the length of equal character runs.

As it turns out, Strings in Common Lisp are already Arrays: off to a good start. Next we want to sort it. Running (describe #'sort) tells me the following:

* (describe #'sort)
#<FUNCTION SORT>
  [compiled function]


Lambda-list: (SEQUENCE SB-IMPL::PREDICATE &REST SB-IMPL::ARGS &KEY
              SB-IMPL::KEY)
Dynamic-extent arguments: positional=(1), keyword=(:KEY)
Declared type: (FUNCTION
                (SEQUENCE (OR FUNCTION SYMBOL) &REST T &KEY
                 (:KEY (OR FUNCTION SYMBOL)))
                (VALUES SEQUENCE &OPTIONAL))
Documentation:
  Destructively sort SEQUENCE. PREDICATE should return non-NIL if
     ARG1 is to precede ARG2.
Inline proclamation: MAYBE-INLINE (inline expansion available)
Known attributes: call
Source file: SYS:SRC;CODE;SORT.LISP

There are a few things to note here. First off, we need to pass a predicate, since sort doesn't know the types of the values that we want to sort, so we need to find a character comparing function. In addition, sort destructively sorts the sequence; this should be fine (even preferable), but we need to take that into account. Browsing lispcookbook we find an example using a function char=, so we guess there is a function char<.

* (sort "hello world" #'char<)
" dehllloorw"

Great! Now that we have a sorted Array of the characters we loop through the Array and increment a counter if there is exactly two or three equal characters. Something like this should work:

(defun count-runs-2-3 (string)
  (let ((arr (sort string #'char<))
        (2-count 0)
        (3-count 0)
        (prev-char #\NULL)
        (curr-count 0))
    (loop for c across arr
          if (char= prev-char c) do (incf curr-count)
          else do (progn
                    (case curr-count
                      (2 (incf 2-count))
                      (3 (incf 3-count))
                      (otherwise))
                    (setf prev-char c)
                    (setf curr-count 1)))
    (case curr-count    ; Don't forget adding the last run
      (2 (incf 2-count))
      (3 (incf 3-count))
      (otherwise))
    (list 2-count 3-count)))

Now we can loop through each line in the input file and sum to two counters, one for 2 runs, and one for 3 runs. However, if there are multiple runs, they should only count as one. While this could have been done in count-runs-2-3, we might as well make 1-if-pos, and handle it in the summing.

(defun 1-if-pos (x) (if (< 0 x) 1 0))

(defun day-2/1 (input)
  (loop for line in input
        for tuple = (count-runs-2-3 line)
        summing (1-if-pos (first tuple)) into 2-sum
        summing (1-if-pos (second tuple)) into 3-sum
        finally (return (* 2-sum 3-sum))))

This solves part 1.

Part 2

In the second part we are asked to find a pair of strings in the input that differs by exactly one character. By this time I realize that the destructive sorting has messed up my input variables:

*test-input-2*
("abcdef" "aabbbc" "abbcde" "abcccd" "aabcdd" "abcdee" "aaabbb")

Oops! Instead of fixing this (eg. by cloning the strings before sorting, or find out whether sort offers an option to be non-destructive) I'll just leave it as is, and read the input file again.

In any case, there's a few different ways we can do part 2. The simplest is just to check all pairs, calculate the difference, and output the pair if the difference is two.

First we need to find all pairs of elements in a list. Again, after looking at loop for a while I couldn't find anything useful (discoverability is hard!), so I decided to roll my own:

(defun all-pairs (list)
  (if list
      (let ((head (car list))
            (rest (cdr list)))
        (append (mapcar (lambda (e) (list head e)) rest)
                (all-pairs rest)))))

Now, this isn't quite correct: (all-pairs '(1)) returns NIL, but with exception of this case the function seems to do the trick. Next we need to count the number of different chars in a pair. Again we're doing the simplest thing possible:

(defun count-difference (first second)
  (loop for i from 0 below (length first)
        for a = (char first i)
        for b = (char second i)
        counting (not (char= a b)) into diffs
        finally (return diffs)))

Now we can find the two strings that differ in exactly one position. However, the task asks us to find the portion of the two strings that are the same, and not the two strings themselves, so we need yet another function:

(defun remove-equals (first second)
  (with-output-to-string (out)
    (loop for i from 0 below (length first)
          for a = (char first i)
          for b = (char second i)
          when (char= a b) do (write-char a out))))

Now the final function for today's task is done:

(defun day-2/2 (input)
  (loop for (a b) in (all-pairs input)
        when (eq (count-difference a b) 1)
        do (return (remove-equals a b))))

I figured that since we have consequently done the simplest, and probably the least efficient things, running the function on the input would take some time:

(time (day-2/2 *input-2*))
Evaluation took:
  0.012 seconds of real time
  0.011565 seconds of total run time (0.008236 user, 0.003329 system)
  100.00% CPU
  33,790,325 processor cycles
  1,998,496 bytes consed
  
"mbruvapghxlzycbhmfqjonsie"

... but apparently not. One simple optimization we could have done is early return from 'count-difference', since we only care if the difference is 1 or not. Had the strings been very long this could have been significantly faster; our strings are only 25 chars long, so for our input it doesn't matter much, at least not wall clock wise:

(time (day-2/2-opt *input-2*))
Evaluation took:
  0.007 seconds of real time
  0.006663 seconds of total run time (0.000033 user, 0.006630 system)
  100.00% CPU
  19,506,926 processor cycles
  1,998,496 bytes consed
  
"mbruvapghxlzycbhmfqjonsie"

~5ms less in real time, but only 66% of the CPU cycles.

Day 3

Part 1

Day three is here, and the first task of today is to parse lines of the format #<id> @ <x>,<y>: <w>x<h>, like #1 @ 1,3: 4x4. This sounds like a regex job! Which means we must figure out how to regex in Common Lisp.

The Cookbook informs us that there is not support for regex in the standard library, but that packages, like cl-ppcre exist. Let's try

* (ql:quickload "cl-ppcre")
To load "cl-ppcre":
  Load 1 ASDF system:
    asdf
  Install 1 Quicklisp release:
    cl-ppcre
; Fetching #<URL "http://beta.quicklisp.org/archive/cl-ppcre/2018-08-31/cl-ppcre-20180831-git.tgz">
; 151.37KB
==================================================
155,003 bytes in 0.00 seconds (151370.13KB/sec)
; Loading "cl-ppcre"
[package cl-ppcre]................................
..........................
("cl-ppcre")

Fancy!

Ideally we would be able to write our regex with group names, match each line, and retrieve the groups by name. Identifying groups by index is also fine. Looking through the docs it seems like *allow-named-registers* is somewhat important here, so we set it to t and try ppcre:scan:

(ppcre:scan "(?<num>[0-9]+)" "number is 1234 lol")

10
14
#(10)
#(14)

It seems to work fine --- we're presumably getting out start and end index of our match --- but our name num is nowhere to be seen in the return values. We are maybe meant to use something else than scan, but this seems strange, since the docs for *allow-named-registers* mostly used scan. Looking further in the docs, and with a little inspiration from the cookbook we end up with

* (ppcre:register-groups-bind (a b)
	 ("([0-9]+).*(lol)" "number is 1234 lolxD")
   (list a b))
("1234" "lol")

We didn't get to set the name in the regex itself, but this seems alright. Now we can write our regex:

* (defun day-3/match-line (line)
  (ppcre:register-groups-bind (id x y w h)
                              ("#(\\d+) @ (\\d+),(\\d+): (\\d+)x(\\d+)" line)
                              (list id x y w h)))

* (day-3/match-line "#1 @ 1,3: 4x4")
("1" "1" "3" "4" "4")

Good! Now we're able to parse the input

The actual first task of the day is to find the number of overlapping tiles of the squares defined by the lines we just parsed. The one solution that first comes to mind is to have a hash map mapping coordinates to number of squares touching them.

Now the plan is to parse the line into something that is easier to work with, loop through all points in the rectangle, and insert them into a hash map. Perhaps something like this:

(defstruct rect x y w h)
(defstruct point x y)

(defun day-3/insert-coordinates (rect hashmap)
  (loop for y from (rect-y rect) below (+ (rect-y rect) (rect-h rect))
        do (loop for x from (rect-x rect) below (+ (rect-x rect) (rect-w rect))
                 do (incf (gethash (make-point :x x :y y) hashmap)))))

(defun day-3/1 (input)
  (let ((hashmap (make-hash-table)))
    (day-3/insert-coordinates (make-rect :x 0 :y 0 :w 3 :h 3) hashmap)
		;; For now we print out the map so we can see if we succeeded or not
    (loop for key being the hash-keys of hashmap
          do (format t "~S -> ~S" key (gethash key hashmap)))))

... but day-3/insert-coordinates isn't quite right, since we cannot incf a value when it is not present in the map. For this we try to write a new function:

(defun inc-or-1 (key hashmap)
  (let ((entry (gethash key hashmap)))
    (if entry
        (incf entry)
      (setf entry 1))))

The idea is that by having the let we 1) have less code, and 2) might not need to lookup into the hash table twice. However, this doesn't work:

* (defparameter my-map (make-hash-table))
MY-MAP
* (inc-or-1 123 my-map)
1
* (inc-or-1 123 my-map)
1
* (inc-or-1 123 my-map)
1

Apparently, we are required to use (setf (gethash key table) value), and cannot go through the let. Okay.

(defun inc-or-1 (key hashmap)
  (if (gethash key hashmap)
      (incf (gethash key hashmap))
    (setf (gethash key hashmap) 1)))

(inc-or-1 123 my-map)
1
* (inc-or-1 123 my-map)
2
* (inc-or-1 123 my-map)
3
*

Good. Updating day-3/insert-coordinates to use inc-or-1 rather than setf directly causes us to print out the coordinates correctly. Now it's just a matter of changing the print loop in day-3/1 to two loops: first parse and insert all input lines, then count the number of points which count is more than 1.

(defun day-3/1 (input)
  (let ((hashmap (make-hash-table)))
    (loop for line in input
          do (day-3/insert-coordinates (day-3/match-line line) hashmap))
    (loop for key being the hash-keys of hashmap
          counting (< 1 (gethash key hashmap)) into collisions
          finally (return collisions))))

Reading the test input we're given into *test-input-3* and running gives us:

* (day-3/1 *test-input-3*)
The value
  "3"
is not of type
	NUMBER
when binding SB-KERNEL::X
  [Condition of type TYPE-ERROR]

Ooops! We can change one line in day-3/match-line to

  (ppcre:register-groups-bind ((#'parse-integer id x y w h))

which apparently works. This is pretty much macro magic if you ask me. However, our solution still doesn't work:

* (day-3/1 *test-input-3*)
0

There is a few things that could have gone wrong. Input parsing, count incrementing (though we somewhat checked this), count printing, or messing up indices. After a quick format debugging session, I see what's wrong: when printing out the keys in the hashmap, there are multiple "equal" keys being shown! We must tell the hashmap how to compare keys! ... or, maybe it's hashing to different values?

Now, make-hash-table do take a :test argument. However, according to this site, this is only allowed to be either #'eq, #'eql, or #'equal, neither of which helps. Luckily, LispWorks helps us out by saying that it can in fact also be #'equalp, and this fixes our bug.

(day-3/1 *test-input-3*)
4

Which means that we have finally solved part 1!

Part 2

We spent quite a bit of time on part 1. Luckily, when we have this setup, part 2 does not take that long. We are asked to find the line in the input that does not overlap with any other line. This property only holds for a single line. We can do this in the following way: have a set of all lines that have not overlapped with any other yet. When we are adding counts into the hashmap, we detect overlaps (if the point is already there). Then we can remove the current line from the set of non-overlapping lines. When we are done only one line should remain.

This doesn't quite work though, since the first rectangle somewhere doesn't know that some other rectangle overlapped with it. In order to fix this we map point to id in the hashmap, so that when a rectangle finds another rectangle that it overlaps with, it has both ids, and can remove both from the list of non-overlapping ids. Now all consecutive lines that overlap with this line will also attempt to remove the line from the unique set, but this is fine.

(defun day-3/2 (input)
  (let ((map (make-hash-table :test #'equalp))
        (unique (make-hash-table)))
    (loop for line in input
          do (let ((rect (day-3/match-line line)))
               (setf (gethash (rect-id rect) unique) t)
               (loop for y from (rect-y rect)
                           below (+ (rect-y rect) (rect-h rect))
                     do (loop for x from (rect-x rect) 
                                    below (+ (rect-x rect) (rect-w rect))
                              do (let ((p (make-point :x x :y y)))
                                   (if (gethash p map)
                                       (progn (remhash (gethash p map) unique)
                                              (remhash (rect-id rect) unique))
                                       (setf (gethash p map) (rect-id rect)))))))
          finally (return (loop for key being the hash-keys of unique return key)))))

Figuring out that I had to finally (return (loop took me 15 minutes of (format debugging, but this solves it.

Day 4

Day four is upon us, and we continue.

Part 1

Today's first part is a little convoluted, but there are a few things that come to mind when we want to clean up the data.

Read in the input such that for each guard we have a list of intervals in which they sleep
Make a length 60 array -- one for each minute --- for each guard, and count the "number of sleeps" they have in that time.

We start off with input reading. One option is to go full regex, as we did yesterday, but we might do just fine without it: the time part of each line always has the same length, so we can index directly into the string on the positions we want, in order to identify which variant of message it is, and extract the data; the only exception begin the guard ID, where we must scan until we find a space, but also here is the starting position known ahead of time.

We can start out by writing a couple of predicates and accessor functions:

(defun guard-line-p (line) (eq (char line 19) #\G))
(defun sleep-line-p (line) (eq (char line 19) #\f))
(defun wake-line-p (line) (eq (char line 19) #\w))
(defun line-mm (line) (parse-integer (subseq line 15 17)))
(defun line-id (line)
  (let ((end (position #\SPACE (subseq line 26))))
    (parse-integer (subseq line 26 (+ 26 end)))))

Now we want to loop through the lines, take out the data we need, and insert it into a hash map that maps guard IDs to a list of intervals when they sleep. This is slightly awkward since we must keep track of when the guard began sleeping until the next iteration when we get the wake-up time. A better structure would be to directly advance the line iterator, while still in the body of the loop, like this pseudo-code (notice how I'm already moving away from lisp syntax):

lines = input.lines()
while line = lines.next() {
  if guard_line(line) { ... }
  else if sleep_line(line) {
    next = lines.next()
    start = line_mm(line)
    end	= line_mm(line)
    ...
 	}
}

Not knowing how one would do something like this in CL, I settled for the traditional state-keeping approach. Here we just print out all values of the hashmap at the end.

(defun day-4/1 (input)
  (let ((guard-sleeps (make-hash-table))
        (sleep-start)
        (current-guard))
    (loop for line in input
          when (guard-line-p line) do (setf current-guard (line-id line))
          when (sleep-line-p line) do (setf sleep-start (line-mm line))
          when (wake-line-p line) do
          (let ((interval (make-interval :from sleep-start :to (line-mm line))))
            (if (gethash current-guard guard-sleeps)
                (push interval (gethash current-guard guard-sleeps))
                (setf (gethash current-guard guard-sleeps) (list interval)))))
    (loop for key being the hash-keys of guard-sleeps
          do (format t "~S -> ~S~%" key (gethash key guard-sleeps)))))

Next we make an array for each guard, and count the number of times the guard has slept through each of the 60 minutes.

(defun day-4/1 (input)
  (let ((guard-arrays (make-hash-table))
				...
    (loop ...
    (loop for guard-id being the hash-keys of guard-sleeps
          do (let ((arr (make-array 60)))
               (loop for interval in (gethash guard-id guard-sleeps)
                     do (loop for i from (interval-from interval) below (interval-to interval)
                          do (incf (aref arr i))))
               (setf (gethash guard-id guard-arrays) arr)))
    (loop for k being the hash-keys of guard-arrays
          do (format t "~S ~S~%" k (gethash k guard-arrays)))))

10 #(0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 0 1 1 1 1 1 1 1
     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0)
99 #(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
     1 1 1 2 2 2 2 2 3 2 2 2 2 1 1 1 1 1 0 0 0 0 0)

Based on the table given in the problem description, this looks very plausible. Next up is solving the actual task: we wanted to find the guard that was most asleep, find the minute they spent the most asleep, and multiply that minute with the guards ID.

As a side note, while trying to write this out I ran into some weird problems, and the debugger didn't help me much due to variables being optimized away. Evaluating

(declaim (optimize (speed 0) (safety 3) (debug 3)))

helped out a lot. In any case, I eventually arrived at something that succeeded with the test input. I'm not too happy with this function: it is pretty messy, but it works. One thing one could do is split the three steps up into three functions, but when things are supposed to happen sequentially I try to avoid splitting the steps up into functions, at least if the only rationale is that one function is "too long" to not be split.

(defun day-4/1 (input)
  (let ((guard-sleeps (make-hash-table))
        (guard-arrays (make-hash-table))
        (sleep-start)
        (current-guard))
    (loop for line in input
          when (guard-line-p line) do (setf current-guard (line-id line))
          when (sleep-line-p line) do (setf sleep-start (line-mm line))
          when (wake-line-p line) do
          (let ((interval (make-interval :from sleep-start :to (line-mm line))))
            (if (gethash current-guard guard-sleeps)
                (push interval (gethash current-guard guard-sleeps))
                (setf (gethash current-guard guard-sleeps) (list interval)))))
    (loop for guard-id being the hash-keys of guard-sleeps
          do (let ((arr (make-array 60)))
               (loop for interval in (gethash guard-id guard-sleeps)
                     do (loop for i from (interval-from interval) below (interval-to interval)
                          do (incf (aref arr i))))
               (setf (gethash guard-id guard-arrays) arr)))
    (let* ((sums (loop for k being the hash-keys of guard-arrays
                      collect (list (reduce #'+ (gethash k guard-arrays)) k)))
           (laziest (second (first (sort sums #'> :key #'car))))
           (arr (gethash laziest guard-arrays))
           (max-freq (reduce #'max arr)))
      (* laziest (position max-freq arr)))))

This, however, didn't run with the real data; whoever made the input decided to put in one little gotcha, and shuffle all lines. Luckily this is pretty straight forward to fix with a (sort input #'string<). After this, part 1 was solved.

However, the story doesn't end there. After trying to run it a second time, we get this:

The value
  NIL
is not of type
  REAL
when binding I
  [Condition of type TYPE-ERROR]

Looking at the debugger we're in a weird situation where we are looping through the keys of guard-sleeps, but the key guard-id is nil. Pressing RET while the cursor is over GUARD-SLEEPS in the backtrace shows us this:

#<HASH-TABLE {1008752283}>
--------------------
Count: 21
Size: 32
Test: EQL
Rehash size: 1.5
Rehash threshold: 1.0
[clear hashtable]
Contents: 
241 = (#S(INTERVAL :FROM 47 :TO 57) #S(INTERVAL :FROM 6 :TO 44) #S(INTERVAL :FROM 40 :TO 50) #S(INTERVAL :FROM 22 :TO 46) #S(INTERVAL :FROM 1 :TO 12) #S(INTERVAL :FROM 46 :TO 52) #S(INTERVAL :FROM 19 :TO 42) #S(INTERVAL :FROM 50 :TO 58) #S(INTERVAL :FROM 56 :TO 57) #S(INTERVAL :FROM 48 :TO 49) #S(INTERVAL :FROM 29 :TO 41) #S(INTERVAL :FROM 41 :TO 49) #S(INTERVAL :FROM 12 :TO 19) #S(INTERVAL :FROM 47 :TO 51) #S(INTERVAL :FROM 31 :TO 42) #S(INTERVAL :FROM 18 :TO 24) #S(INTERVAL :FROM 33 :TO 52) ..) [remove entry]
1213 = (#S(INTERVAL :FROM 53 :TO 57) #S(INTERVAL :FROM 48 :TO 50) #S(INTERVAL :FROM 6 :TO 49) #S(INTERVAL :FROM 46 :TO 57) #S(INTERVAL :FROM 2 :TO 35) #S(INTERVAL :FROM 33 :TO 49) #S(INTERVAL :FROM 18 :TO 56) #S(INTERVAL :FROM 46 :TO 52) #S(INTERVAL :FROM 0 :TO 26) #S(INTERVAL :FROM 21 :TO 42) #S(INTERVAL :FROM 40 :TO 46) #S(INTERVAL :FROM 23 :TO 45) #S(INTERVAL :FROM 17 :TO 55) #S(INTERVAL :FROM 26 :TO 39) #S(INTERVAL :FROM 12 :TO 19)) [remove entry]
2903 = (#S(INTERVAL :FROM 4 :TO 48) #S(INTERVAL :FROM 39 :TO 42) #S(INTERVAL :FROM 34 :TO 40) #S(INTERVAL :FROM 49 :TO 56) #S(INTERVAL :FROM 30 :TO 41) #S(INTERVAL :FROM 54 :TO 58) #S(INTERVAL :FROM 24 :TO 53) #S(INTERVAL :FROM 32 :TO 46) #S(INTERVAL :FROM 56 :TO 59) #S(INTERVAL :FROM 26 :TO 42) #S(INTERVAL :FROM 35 :TO 52) #S(INTERVAL :FROM 27 :TO 47)) [remove entry]
1283 = (#S(INTERVAL :FROM 22 :TO 42) #S(INTERVAL :FROM 56 :TO 59) #S(INTERVAL :FROM 6 :TO 49) #S(INTERVAL :FROM 32 :TO 42) #S(INTERVAL :FROM 9 :TO 21) #S(INTERVAL :FROM 17 :TO 46) #S(INTERVAL :FROM 45 :TO 47) #S(INTERVAL :FROM 13 :TO 55) #S(INTERVAL :FROM 57 :TO 59) #S(INTERVAL :FROM 40 :TO 48) #S(INTERVAL :FROM 26 :TO 52) #S(INTERVAL :FROM 2 :TO 17) #S(INTERVAL :FROM 53 :TO 55) #S(INTERVAL :FROM 19 :TO 47) #S(INTERVAL :FROM 41 :TO 46) #S(INTERVAL :FROM 24 :TO 29) #S(INTERVAL :FROM 22 :TO 52) ..) [remove entry]
829 = (#S(INTERVAL :FROM 40 :TO 50) #S(INTERVAL :FROM 11 :TO 24) #S(INTERVAL :FROM 29 :TO 32) #S(INTERVAL :FROM 37 :TO 45) #S(INTERVAL :FROM 31 :TO 32) #S(INTERVAL :FROM 32 :TO 52) #S(INTERVAL :FROM 20 :TO 39) #S(INTERVAL :FROM 57 :TO 59) #S(INTERVAL :FROM 10 :TO 26) #S(INTERVAL :FROM 4 :TO 39) #S(INTERVAL :FROM 8 :TO 18)) [remove entry]
3347 = (#S(INTERVAL :FROM 36 :TO 46) #S(INTERVAL :FROM 14 :TO 25) #S(INTERVAL :FROM 7 :TO 48) #S(INTERVAL :FROM 18 :TO 56) #S(INTERVAL :FROM 7 :TO 14) #S(INTERVAL :FROM 48 :TO 57) #S(INTERVAL :FROM 9 :TO 53) #S(INTERVAL :FROM 41 :TO 57) #S(INTERVAL :FROM 39 :TO 47) #S(INTERVAL :FROM 33 :TO 34) #S(INTERVAL :FROM 52 :TO 59) #S(INTERVAL :FROM 30 :TO 46) #S(INTERVAL :FROM 41 :TO 46) #S(INTERVAL :FROM 13 :TO 26) #S(INTERVAL :FROM 54 :TO 55) #S(INTERVAL :FROM 23 :TO 48) #S(INTERVAL :FROM 57 :TO 59) ..) [remove entry]
1319 = (#S(INTERVAL :FROM 50 :TO 59) #S(INTERVAL :FROM 2 :TO 37) #S(INTERVAL :FROM 37 :TO 45) #S(INTERVAL :FROM 46 :TO 58) #S(INTERVAL :FROM 0 :TO 31) #S(INTERVAL :FROM 33 :TO 50) #S(INTERVAL :FROM 29 :TO 45) #S(INTERVAL :FROM 1 :TO 42) #S(INTERVAL :FROM 25 :TO 29) #S(INTERVAL :FROM 24 :TO 42) #S(INTERVAL :FROM 50 :TO 55) #S(INTERVAL :FROM 18 :TO 27) #S(INTERVAL :FROM 19 :TO 57) #S(INTERVAL :FROM 29 :TO 35) #S(INTERVAL :FROM 8 :TO 53)) [remove entry]
439 = (#S(INTERVAL :FROM 19 :TO 55) #S(INTERVAL :FROM 33 :TO 39) #S(INTERVAL :FROM 41 :TO 51) #S(INTERVAL :FROM 37 :TO 50) #S(INTERVAL :FROM 9 :TO 53) #S(INTERVAL :FROM 31 :TO 38) #S(INTERVAL :FROM 38 :TO 59) #S(INTERVAL :FROM 14 :TO 25) #S(INTERVAL :FROM 51 :TO 59) #S(INTERVAL :FROM 19 :TO 24) #S(INTERVAL :FROM 5 :TO 32) #S(INTERVAL :FROM 52 :TO 54) #S(INTERVAL :FROM 1 :TO 41) #S(INTERVAL :FROM 51 :TO 56) #S(INTERVAL :FROM 7 :TO 33) #S(INTERVAL :FROM 6 :TO 30) #S(INTERVAL :FROM 24 :TO 57) ..) [remove entry]
2213 = (#S(INTERVAL :FROM 52 :TO 58) #S(INTERVAL :FROM 38 :TO 48) #S(INTERVAL :FROM 54 :TO 57) #S(INTERVAL :FROM 27 :TO 53) #S(INTERVAL :FROM 46 :TO 57) #S(INTERVAL :FROM 30 :TO 43) #S(INTERVAL :FROM 57 :TO 58) #S(INTERVAL :FROM 36 :TO 46) #S(INTERVAL :FROM 6 :TO 29) #S(INTERVAL :FROM 33 :TO 55) #S(INTERVAL :FROM 23 :TO 26)) [remove entry]
3319 = (#S(INTERVAL :FROM 50 :TO 57) #S(INTERVAL :FROM 41 :TO 43) #S(INTERVAL :FROM 10 :TO 36) #S(INTERVAL :FROM 7 :TO 53) #S(INTERVAL :FROM 4 :TO 42) #S(INTERVAL :FROM 35 :TO 58) #S(INTERVAL :FROM 57 :TO 58) #S(INTERVAL :FROM 51 :TO 54) #S(INTERVAL :FROM 3 :TO 19) #S(INTERVAL :FROM 54 :TO 57) #S(INTERVAL :FROM 7 :TO 34) #S(INTERVAL :FROM 56 :TO 59) #S(INTERVAL :FROM 21 :TO 53) #S(INTERVAL :FROM 32 :TO 38) #S(INTERVAL :FROM 42 :TO 46) #S(INTERVAL :FROM 21 :TO 35) #S(INTERVAL :FROM 11 :TO 15) ..) [remove entry]
2539 = (#S(INTERVAL :FROM 42 :TO 51) #S(INTERVAL :FROM 10 :TO 27) #S(INTERVAL :FROM 45 :TO 55) #S(INTERVAL :FROM 33 :TO 35) #S(INTERVAL :FROM 44 :TO 56) #S(INTERVAL :FROM 12 :TO 36) #S(INTERVAL :FROM 43 :TO 57) #S(INTERVAL :FROM 23 :TO 34) #S(INTERVAL :FROM 57 :TO 58) #S(INTERVAL :FROM 15 :TO 39) #S(INTERVAL :FROM 52 :TO 54) #S(INTERVAL :FROM 32 :TO 36) #S(INTERVAL :FROM 7 :TO 22)) [remove entry]
631 = (#S(INTERVAL :FROM 44 :TO 58) #S(INTERVAL :FROM 3 :TO 27) #S(INTERVAL :FROM 51 :TO 56) #S(INTERVAL :FROM 23 :TO 47) #S(INTERVAL :FROM 6 :TO 17) #S(INTERVAL :FROM 50 :TO 56) #S(INTERVAL :FROM 15 :TO 46) #S(INTERVAL :FROM 55 :TO 56) #S(INTERVAL :FROM 1 :TO 49) #S(INTERVAL :FROM 23 :TO 57) #S(INTERVAL :FROM 44 :TO 48) #S(INTERVAL :FROM 3 :TO 29) #S(INTERVAL :FROM 33 :TO 45) #S(INTERVAL :FROM 11 :TO 21)) [remove entry]
2129 = (#S(INTERVAL :FROM 51 :TO 57) #S(INTERVAL :FROM 36 :TO 46) #S(INTERVAL :FROM 42 :TO 43) #S(INTERVAL :FROM 43 :TO 51) #S(INTERVAL :FROM 15 :TO 38) #S(INTERVAL :FROM 54 :TO 59) #S(INTERVAL :FROM 41 :TO 43) #S(INTERVAL :FROM 54 :TO 59) #S(INTERVAL :FROM 6 :TO 47) #S(INTERVAL :FROM 48 :TO 57) #S(INTERVAL :FROM 32 :TO 56) #S(INTERVAL :FROM 38 :TO 54)) [remove entry]
1889 = (#S(INTERVAL :FROM 57 :TO 59) #S(INTERVAL :FROM 30 :TO 35) #S(INTERVAL :FROM 31 :TO 42) #S(INTERVAL :FROM 31 :TO 41) #S(INTERVAL :FROM 39 :TO 40) #S(INTERVAL :FROM 28 :TO 33) #S(INTERVAL :FROM 56 :TO 57) #S(INTERVAL :FROM 29 :TO 34) #S(INTERVAL :FROM 27 :TO 42) #S(INTERVAL :FROM 24 :TO 32) #S(INTERVAL :FROM 57 :TO 59) #S(INTERVAL :FROM 44 :TO 51) #S(INTERVAL :FROM 31 :TO 36) #S(INTERVAL :FROM 22 :TO 36) #S(INTERVAL :FROM 11 :TO 15) #S(INTERVAL :FROM 2 :TO 47) #S(INTERVAL :FROM 27 :TO 50) ..) [remove entry]
2137 = (#S(INTERVAL :FROM 49 :TO 59) #S(INTERVAL :FROM 43 :TO 53) #S(INTERVAL :FROM 4 :TO 47) #S(INTERVAL :FROM 55 :TO 56) #S(INTERVAL :FROM 35 :TO 52) #S(INTERVAL :FROM 50 :TO 55) #S(INTERVAL :FROM 46 :TO 47) #S(INTERVAL :FROM 52 :TO 58) #S(INTERVAL :FROM 23 :TO 26) #S(INTERVAL :FROM 45 :TO 57)) [remove entry]
2251 = (#S(INTERVAL :FROM 22 :TO 38) #S(INTERVAL :FROM 17 :TO 31) #S(INTERVAL :FROM 27 :TO 54) #S(INTERVAL :FROM 8 :TO 22) #S(INTERVAL :FROM 49 :TO 56) #S(INTERVAL :FROM 7 :TO 14) #S(INTERVAL :FROM 12 :TO 35) #S(INTERVAL :FROM 56 :TO 58) #S(INTERVAL :FROM 25 :TO 32) #S(INTERVAL :FROM 3 :TO 20) #S(INTERVAL :FROM 55 :TO 59) #S(INTERVAL :FROM 14 :TO 40) #S(INTERVAL :FROM 52 :TO 55) #S(INTERVAL :FROM 8 :TO 56) #S(INTERVAL :FROM 21 :TO 37)) [remove entry]
2389 = (#S(INTERVAL :FROM 57 :TO 58) #S(INTERVAL :FROM 28 :TO 49) #S(INTERVAL :FROM 5 :TO 22) #S(INTERVAL :FROM 57 :TO 59) #S(INTERVAL :FROM 52 :TO 53) #S(INTERVAL :FROM 13 :TO 20) #S(INTERVAL :FROM 28 :TO 58) #S(INTERVAL :FROM 11 :TO 14) #S(INTERVAL :FROM 42 :TO 54) #S(INTERVAL :FROM 53 :TO 55) #S(INTERVAL :FROM 9 :TO 33) #S(INTERVAL :FROM 51 :TO 55) #S(INTERVAL :FROM 37 :TO 39) #S(INTERVAL :FROM 56 :TO 59) #S(INTERVAL :FROM 15 :TO 48) #S(INTERVAL :FROM 53 :TO 55) #S(INTERVAL :FROM 52 :TO 59) ..) [remove entry]
1777 = (#S(INTERVAL :FROM 46 :TO 51) #S(INTERVAL :FROM 9 :TO 37) #S(INTERVAL :FROM 52 :TO 59) #S(INTERVAL :FROM 36 :TO 39) #S(INTERVAL :FROM 47 :TO 56) #S(INTERVAL :FROM 24 :TO 34) #S(INTERVAL :FROM 48 :TO 52) #S(INTERVAL :FROM 6 :TO 38) #S(INTERVAL :FROM 1 :TO 49) #S(INTERVAL :FROM 53 :TO 58) #S(INTERVAL :FROM 34 :TO 45) #S(INTERVAL :FROM 28 :TO 30) #S(INTERVAL :FROM 10 :TO 58) #S(INTERVAL :FROM 10 :TO 49) #S(INTERVAL :FROM 40 :TO 52) #S(INTERVAL :FROM 15 :TO 35) #S(INTERVAL :FROM 31 :TO 57) ..) [remove entry]
3371 = (#S(INTERVAL :FROM 38 :TO 50) #S(INTERVAL :FROM 8 :TO 15) #S(INTERVAL :FROM 53 :TO 54) #S(INTERVAL :FROM 11 :TO 29) #S(INTERVAL :FROM 27 :TO 53) #S(INTERVAL :FROM 33 :TO 48) #S(INTERVAL :FROM 33 :TO 49) #S(INTERVAL :FROM 39 :TO 52) #S(INTERVAL :FROM 34 :TO 36) #S(INTERVAL :FROM 0 :TO 22) #S(INTERVAL :FROM 51 :TO 57) #S(INTERVAL :FROM 52 :TO 54) #S(INTERVAL :FROM 6 :TO 49) #S(INTERVAL :FROM 38 :TO 57) #S(INTERVAL :FROM 27 :TO 43) #S(INTERVAL :FROM 37 :TO 53) #S(INTERVAL :FROM 0 :TO 28) ..) [remove entry]
103 = (#S(INTERVAL :FROM 56 :TO 59) #S(INTERVAL :FROM 1 :TO 30) #S(INTERVAL :FROM 38 :TO 41) #S(INTERVAL :FROM 31 :TO 41) #S(INTERVAL :FROM 48 :TO 55) #S(INTERVAL :FROM 23 :TO 36) #S(INTERVAL :FROM 38 :TO 49) #S(INTERVAL :FROM 12 :TO 25) #S(INTERVAL :FROM 26 :TO 49) #S(INTERVAL :FROM 17 :TO 23) #S(INTERVAL :FROM 13 :TO 56) #S(INTERVAL :FROM 39 :TO 56) #S(INTERVAL :FROM 24 :TO 36) #S(INTERVAL :FROM 26 :TO 55) #S(INTERVAL :FROM 31 :TO 37) #S(INTERVAL :FROM 57 :TO 58) #S(INTERVAL :FROM 15 :TO 50) ..) [remove entry]
NIL = (#S(INTERVAL :FROM NIL :TO 57)) [remove entry]

We see that the last entry has key nil, and the interval it maps to has :from nil. Strange! Looking closer at the Slime debugging window we do find the problem though: in the value of input. We have already seen that sort destructively sort the given list. In addition, we do know that lists in Lisp are linked, so sorting a list means shuffling around pointers. Aha! If a list is just a pointer to its first element and we sort the list, that means that the reference we have to the list, the pointer to an element that used to be first, is no longer first, and all elements that were put in front of it is no longer reachable! The following illustrates:

* (defparameter bing '(9 6 3 5 7 1 8 2 3 5))
* (format t "~S~%" bing)
(9 6 3 5 7 1 8 2 3 5)
* (defparameter bong (sort bing #'<))

* (format t "~S~%" bing)
(6 7 8 9)
* (format t "~S~%" bong)
(1 2 3 3 5 5 6 7 8 9) ; so much for attempting to type in 1 through 9 shuffled

The solution is rather simple: instead of sorting inside day-4/1 we just sort when we set value, in defparameter.

Part 2

Part two asks us for a tiny modification to our function: instead of selecting the guard that sleeps the most, we want the guard who has slept the most times on any minute. So instead of maximizing by summing, we will maximize by maxing (At this point I'm tempted to refactor out most of the logic, but at the same time, this is a write once, run once situation):

(defun day-4/2 (input)
  (let ((guard-sleeps (make-hash-table))
        (guard-arrays (make-hash-table))
        (sleep-start)
        (current-guard))
    (loop for line in input
          when (guard-line-p line) do (setf current-guard (line-id line))
          when (sleep-line-p line) do (setf sleep-start (line-mm line))
          when (wake-line-p line) do
          (let ((interval (make-interval :from sleep-start :to (line-mm line))))
            (if (gethash current-guard guard-sleeps)
                (push interval (gethash current-guard guard-sleeps))
                (setf (gethash current-guard guard-sleeps) (list interval)))))
    (loop for guard-id being the hash-keys of guard-sleeps
          do (let ((arr (make-array 60)))
               (loop for interval in (gethash guard-id guard-sleeps)
                     do (loop for i from (interval-from interval) below (interval-to interval)
                          do (incf (aref arr i))))
               (setf (gethash guard-id guard-arrays) arr)))
    (let* ((sums (loop for k being the hash-keys of guard-arrays
                      collect (list (reduce #'max (gethash k guard-arrays)) k))) ;; HERE!!
           (laziest (second (first (sort sums #'> :key #'car))))
           (arr (gethash laziest guard-arrays))
           (max-freq (reduce #'max arr)))
      (* laziest (position max-freq arr)))))

The only thing we changed was changing a #'+ to a #'max. Hooray! Code reuse!

Thoughts so far

Common Lisp is a bit of a weird language for me. Certain things I figured would be easy, like making tuples, seems to force you to use (list ..), which presumably allocates. In addition, the dynamic nature of the language is something I'm still getting used to, begin a big fan of statically typed languages. Despite being foreign, I think that most of what I have wanted to do has been expressible in CL, and this is, after all, the main point of a programming language.

Lastly, the Slime experience is something I look forward to getting to know better. Being able to interactively looking through the state of the stack, including all local variables, at once when you get an error is simply not something I'm used to; this is the reason I included the hash table output above, despite not actually using it to find the source of the bug. It was just really cool!

Thank you for reading.

The Problem with OOP is "Oriented"

2020-05-16T17:33:35+02:00

The problem with OOP isn't the "object" part, it's really the "oriented" part.

My problem with OOP isn't really that I despise design patterns, nor that I like all of my data to be in global scope, which OOP at least discourages, nor that I'm convinced that all functions should be pure and any state is the root of all evil. My problem is simply that it doesn't really solve the problems I have when I'm programming. It does little to solve the problems I do have, and it imposes limitations and shepherds me into getting problems that I didn't have before. Even calling this a trade-off is generous, since a trade would imply that I actually am getting something out of it.

How I Program

It is very rare that I know up front exactly what I'm making and how it will look in the end. Most of the time I have some high level idea of what should happen in my program, and I need to figure out how to integrate this in the current codebase. This means understanding what data I need to transform and how to organize this data. Sometimes this is straight forward, and sometimes this reveals that my high-level idea of the problem was wrong, which means I have to go back and adjust my high level idea, this time with more information in mind. This process goes back and forth, and in the end my program might look nothing like my first attempt.

I think most programmers experience this often.

Since I have to go back and forth a lot when developing, most of my code will not make it far. This is a crucial counterexample to the meme that "code is written once but read a thousand times". Most of the code I write is probably just read a couple of times (and only by me!), before it's replaced by new code that better solves the problems I'm having. Therefore, spending time on this code is very often a complete waste of time.

A Lack of Upsides

In OOP it's central to make class hierarchies with methods that the subclasses override, and having the callers of these methods be agnostic to the actual class of the objects they are operating on. Having had some experience in languages in which this is either discouraged or simply not possible, I've come to the conclusion that having a superclass that defines methods and a number of different subclasses with which the methods are automatically dispatched, is really not something I need that often. Maybe I can come up with an example where this is actually a good solution, given some time I probably can, but this comes back to the second O of OOP: "Oriented".

The case where I actually want dynamic dispatch is very rare, and so having the entire programming language (or worse, your codebase) be oriented around this concept does not make any sense. Similarly, most of my types are significantly different enough for it not to make sense having code agnostic to the types it's operating on. It just doesn't really happen that often.

Similarly, having intentionally been very generous with visibility qualifiers in my code, I cannot think of a single time where a member being visible has caused any problems. On the other hand, I remember vividly a case where a 3rd party library had messed up the qualifiers on a tuple-like type containing the coordinates of a mouse click, which made it impossible to get them. Try to think back: when was the last time you tried to access a field or method, only to get told that it's private, and then being forced into a setter/getter with other logic which saved you from a bug?

It's "Pay up front"

OOP is about structuring your program according to certain principles. That is, in of itself it doesn't to any work. Thus, the time you spend on making your codebase adhere to OO principles is time spent not making your program do what it needs to do, and thus wasted. That is, unless it pays back later.

I think what often brings people into OOP is the idea that the work you spend on having a good class structure, with subclassing and getters/setters and proper encapsulation, will pay back in dividends during the lifetime of the project by avoiding code duplication, maintaining invariants on a class' private state, help debugging, making your code more flexible, and a bunch of other things.

So by spending time designing class structures, figuring out which fields are properly encapsulated, and making sure methods are abstracted sufficiently high enough up in the hierarchy, you're spending time in the hopes that this will make it easier to work with down the line.

You are, effectively, paying the price to fight a problem you think you might get further down the line, and you hope that the price, which you definitely paid, was sufficient for these problems to not get out of hand.

Encapsulation by default

I'm also becoming increasingly vary to complete encapsulation. At it's essence, encapsulation is about only exposing the minimal subset of the members (data or methods) of a class that the caller needs. This sounds rather reasonable and sometimes it is a good idea. However, it also opens up an extremely difficult problem for the programmer, because they now have to decide exactly which subset of their members are sufficient for any possible use-case of that class. When marking a field private, the programmer is really saying that there is no possible valid program which needs to access this field, whatsoever.

I genuinely think hidden by default is the wrong choice. You shouldn't have to make the case to have a function or data members visible, because now the programmer will have to imagine a valid use-case in which having this member visible is necessary, and there will always be use-cases that they miss.

It's Object Oriented

The mere presence of dynamic dispatched method calls doesn't make a whole codebase OO. Batching together data into something that looks like a class certainly does not make the codebase OO. Having subclasses does not make the codebase OO. For a codebase to be Object Oriented, it really has to be oriented around objects.

Conclusion

I don't this the OO mindset is without merits, but I think the gains are sufficiently far and few in between that having your codebase be oriented around objects is a mistake. Rather, I think the codebase should be oriented around the data that your programming is operating on; after all, everything a program does is transforming input data to output data, and the codebase should reflect this.

At last, don't start quoting Alan Kay; I know his objections to what we now call OOP, and if you've made it this far I'm sure you understand that I'm not talking about his idea of it.

Thanks for reading.

Copilot

2023-04-05T11:29:43+02:00

I have been playing around with LLMs for the past couple of days and decided to try out GitHub Copilot.Overall, the experience so far has been similar to that of ChatGPT, in that sometimes it's helpful, and sometimes it's not. It pretty much always require some tweaking, so you have to know what you're after, and you have to be able to check that its suggestions are actually any good because it'll often be wrong, and sometimes in subtle ways.

Anyways, something funny happened. I was trying to see if Copilot would also generate natural text unrelated to surrounding code when prompted to do so in a comment. I tried

# Write a funny prompt mimicing youtube personalities asking for more subscribers:
#

and Copilot suggested

# Write a funny prompt mimicing youtube personalities asking for more subscribers:
# https://www.youtube.com/watch?v=9bZkp7q19f0

Thanks for reading.

Code Generation and Merge Sort

2019-04-24T10:29:54+02:00

I was reading a few pages of Knuths The Art of Computer Programming, Volume4A about "branchless computation" (p. 180) in which he demonstrates how to get rid of branches by using conditional instructions. As an instructive example he consideres the inner part of merge sort, in which we are to merge two sorted lists of numbers into one bigger list of the numbers. The description as given by Knuth is as follows:

If $x_i < y_j$ set $z_k \gets x_i$, $i \gets i+1$, and go to x_done if $i = i_{max}$.
Otherwise set $z_k \gets y_i$, $j \gets j+1$, and go to y_done if $j = j_{max}$.
Then set $k \gets k+1$ and go to z_done if $k = k_{max}$.

$x$ and $y$ are the input lists, $z$ is the output merged list. $i$, $j$, and $k$ are loop indices for the three respective lists and the $_{max}$ variants are the lists length.

I got curious and decided to see how a standard optimizing compilier would handle this case, and whether writing the assmebly yourself would provide any gain in performance. After all, this is just slightly more complicated than the trivial examples used to show off good codegen, so it would not be unreasonable for the compiler to manage to fix a bad implementation of this. In addition, it would serve as a great excuse to finally learn how to write x86.

Basics

Here's the inner loop in C code:

void branching(uint64_t *xs, size_t xmax, uint64_t *ys, size_t ymax, 
               uint64_t *zs, size_t zmax) {
  size_t i = 0, j = 0, k = 0;
  while (k < zmax) {
    if (xs[i] < ys[j]) {
      zs[k++] = xs[i++];
      if (i == xmax) { // x_done
        memcpy(zs + k, ys + j, 8 * (zmax - k));
        return; 
      }
    } else {
      zs[k++] = ys[j++];
      if (j == ymax) { // y_done
        memcpy(zs + k, xs + i, 8 * (zmax - k));
        return; 
      }
    }
  } // z_done
}

This seems to be a more or less straight forward textbook implementation of the procedure, so it will do fine as a benchmark. As a quick check before going any deeper into this we can use godbolt.org to see whether this experiment is even worth doing. Godbolts x86-64 gcc 8.3 with -O3 spits out this (annotations are by me):

branching(unsigned long*, unsigned long, unsigned long*, unsigned long, 
          unsigned long*, unsigned long):
        test    r9, r9       ; if (r9 == 0)
        je      .L15         ;   goto .L15
        push    r13          ;
        xor     eax, eax     ;
        xor     r11d, r11d   ; j = 0
        xor     r10d, r10d   ; i = 0
        push    r12          ;
        push    rbp          ;
        push    rbx          ;
        jmp     .L2          ;
.L17:
        add     r10, 1                        ; i++
        mov     QWORD PTR [r8-8+rax*8], rbp   ; zs[k-1] = xi
        cmp     r10, rsi                      ; if (i == xmax)
        je      .L16                          ;   goto .L16
.L6:
        cmp     r9, rax      ; if (k == zmax)
        je      .L1          ;   goto .L1
.L2:
        lea     r12, [rdi+r10*8]             ; calculate xs + i
        lea     r13, [rdx+r11*8]             ; calculate ys + j
        add     rax, 1                       ; k++
        mov     rbp, QWORD PTR [r12]         ; xi = xs[i]
        mov     rbx, QWORD PTR [r13+0]       ; yj = ys[j]
        cmp     rbp, rbx                     ; if (xi < yj)
        jb      .L17                         ;   goto .L17
        add     r11, 1                       ; j++
        mov     QWORD PTR [r8-8+rax*8], rbx  ; zs[k-1] = yj
        cmp     r11, rcx                     ; if (j != ymax)
        jne     .L6                          ;   goto .L6
        sub     r9, rax            ; y_done 
        pop     rbx                ;
        mov     rsi, r12           ;
        pop     rbp                ;
        lea     rdi, [r8+rax*8]    ;
        pop     r12                ;
        lea     rdx, [0+r9*8]      ;
        pop     r13                ;
        jmp     memcpy             ;
.L1:
        pop     rbx       ; z_done
        pop     rbp       ;
        pop     r12       ;
        pop     r13       ; 
        ret               ;
.L16:
        sub     r9, rax            ; x_done
        pop     rbx                ;
        mov     rsi, r13           ;
        pop     rbp                ;
        lea     rdi, [r8+rax*8]    ;
        pop     r12                ;
        lea     rdx, [0+r9*8]      ;
        pop     r13                ;
        jmp     memcpy             ;
.L15:
        ret

Plenty of branches!¹

Now, maybe it turns out that it doesn't matter if we're branching or not and that the compiler knows best. We could guess that the reason we're still getting branches is because that's really the best way to go here. After all "you can't beat the compiler" seems to be the consensus in many programming circles. Let's try to write a version in C without exessive use of branching. Then perhaps the compiler will generate different code, and we can see what that difference amounts to in terms of running time. We can adopt Knuth's branchless version:

void nonbranching_but_branching(uint64_t *xs, size_t xmax, uint64_t *ys, size_t ymax, 
                                uint64_t *zs, size_t zmax) {
  size_t i = 0, j = 0, k = 0;
  uint64_t xi = xs[i], yj = ys[j];
  while ((i < xmax) && (j < ymax) && (k < zmax)) {
    int64_t t = one_if_lt(xi - yj);
    yj = min(xi, yj);
    zs[k] = yj;
    i += t;
    xi = xs[i];
    t ^= 1;
    j += t;
    yj = ys[j];
    k += 1;
  }
  if (i == xmax)
    memcpy(zs + k, ys + j, 8 * (zmax - k));
  if (j == ymax)
    memcpy(zs + k, xs + i, 8 * (zmax - k));
}

What is going on, you might ask? The general idea is to first get min(xi, yj), and then have a number t that's 1 if xi < yj and 0 otherwise: we can add t to i, since t=1 if we just wrote xi to zs[k]. Then we can xor it with 1, effectively flipping 1 to 0 and 0 to 1, and then add t^1 to j; this causes either i or j to be incremented but not both. We used two convenience functions here, one_if_lt and min, both implemented straight forward with branching, hoping that the compiler will figure this out for us, now that the branches are much smaller.

Next, if we cheat a litte and assume that the highest bit in the numbers are never set we can get rid of those branches²:

void nonbranching(uint64_t *xs, size_t xmax, uint64_t *ys, size_t ymax, 
                  uint64_t *zs, size_t zmax) {
  size_t i = 0, j = 0, k = 0;
  uint64_t xi = xs[i], yj = ys[j];
  while ((i < xmax) && (j < ymax) && (k < zmax)) {
    uint64_t neg = (xi - yj) >> 63;
    yj = neg * xi + (1 - neg) * yj;
    zs[k] = yj;
    i += neg;
    xi = xs[i];
    neg ^= 1;
    j += neg;
    yj = ys[j];
    k += 1;
  }
  if (i == xmax)
    memcpy(zs + k, ys + j, 8 * (zmax - k));
  if (j == ymax)
    memcpy(zs + k, xs + i, 8 * (zmax - k));
}

What is up with (xi - yj) >> 63 you may ask? This result is negative if xi < yj, and so it will overflow and its most significant bit will be set. Then we shift down logically (since we're using unsigned integers³) so the bits that are filled in are all zeroes. Since the width is 64, we effectively move the upper bit to the lowest position while setting all other bits to zero.

Knuth has another quirk, namely that his arrays usually points to the end of the array, and his indices are negative, going from -xmax up to 0 instead of the more standard going from 0 up to xmax. One consequence of this is that the termination check can be done with one comparison instead of three, by anding together the three indices: since they are negative they have their most significant bit set, unless zero. Here's both of the previous versions with this reversal trick:

void nonbranching_but_branching_reverse(uint64_t *xs, size_t xmax, 
                                        uint64_t *ys, size_t ymax, 
                                        uint64_t *zs, size_t zmax) {
  uint64_t *xse = xs + xmax;
  uint64_t *yse = ys + ymax;
  uint64_t *zse = zs + zmax;

  ssize_t i = -((ssize_t) xmax);
  ssize_t j = -((ssize_t) ymax);
  ssize_t k = -((ssize_t) zmax);

  uint64_t xi = xse[i], yj = yse[j];
  while (i & j & k) {
    uint64_t t = one_if_lt(xi - yj);
    yj = min(xi, yj);
    zse[k] = yj;
    i += t;
    xi = xse[i];
    t ^= 1;
    j += t;
    yj = yse[j];
    k += 1;
  }
  if (i == 0)
    memcpy(zse + k, yse + j, -8 * k);
  if (j == 0)
    memcpy(zse + k, xse + i, -8 * k);
}

void nonbranching_reverse(uint64_t *xs, size_t xmax, uint64_t *ys, size_t ymax, 
                          uint64_t *zs, size_t zmax) {
  uint64_t *xse = xs + xmax;
  uint64_t *yse = ys + ymax;
  uint64_t *zse = zs + zmax;

  ssize_t i = -((ssize_t) xmax);
  ssize_t j = -((ssize_t) ymax);
  ssize_t k = -((ssize_t) zmax);

  uint64_t xi = xse[i], yj = yse[j];
  while (i & j & k) {
    uint64_t neg = (xi - yj) >> 63;
    yj = neg * xi + (1 - neg) * yj;
    zse[k] = yj;
    i += neg;
    xi = xse[i];
    neg ^= 1;
    j += neg;
    yj = yse[j];
    k += 1;
  }
  if (i == 0)
    memcpy(zse + k, yse + j, -8 * k);
  if (j == 0)
    memcpy(zse + k, xse + i, -8 * k);
}

Technically, I suppose we do assume that the length of the arrays are not >2**63, so that they fit in an ssize_t, but considering that the address space of x86-64 is not 64 bits, but merely 48 bits⁴, this is not a problem, even in theory.

Writing the ASM ourselves

Lastly, we can try to write the assembly ourselves. When translating the branch-free routine by Knuth into x86 there are a number of things to do. First we need to figure out how to get -1/0/+1 by comparing two variables, as MMIXs CMP instruction does. However, instead of trying to translate this line by line, which would end up with us having more instructions than needed, we should rather look more closely at what we're doing, so that we really understand the minimal amount of work that we have to do.

We only need to do two things: compare $x_i$ and $y_i$ and load the smaller into a register, and increment either i or j. The former can be done using cmovl, and the latter can be done in a similar fasion as Knuth does it, which is basically what we've been doing up to this point in C. This is the version I ended up with (here in inline-GCC asm format):

1: mov   %[minxy], %[yj]                     ;
   cmp   %[xi], %[yj]                        ; minxy = min(xi, yj)
   cmovl %[minxy], %[xi]                     ;
   mov   QWORD PTR [%[zse]+8*%[k]], %[minxy] ; zs[k] = minxy
   mov   %[t], 0                             ; t = 0
   cmovl %[t], %[one]                        ; if xi < yj: t = 1
   add   %[i], %[t]                          ; i += t
   mov   %[xi], QWORD PTR [%[xse]+8*%[i]]    ; xi = xs[i]
   xor   %[t], 1                             ; t ^= 1
   add   %[j], %[t]                          ; j += t
   mov   %[yj], QWORD PTR [%[yse]+8*%[j]]    ; yj = ys[j]
   add   %[k], 1                             ; k += 1
   mov   %[u], %[i]                          ; 
   and   %[u], %[j]                          ;
   test  %[u], %[k]                          ; if ((i & j & k) != 0)
   jnz   1b                                  ;   goto 1

There's a few quirks here, like having a couple of mov instructions in between the second conditional load and the instruction it conditions on, and the fact that cmovl couldn't take an immediate value, so I had to setup a register with only the value 1 in it. A sneaky detail to keep in mind is that when we set t = 0 we cannot use the trick of xoring t with itself, since this will change the flags, causing the subsequent cmovl to be wrong.

Now we can take a look at the assembly generated from some of the other fuctions by using objdump -d. Our own programs are compiled with -O3 -march=native. Here is the inner loop in nonbranching_reverse:

<nonbranching_reverse>:
1ef0:	mov    rax,rdi
1ef3:	sub    rax,rsi
1ef6:	shr    rax,0x3f
1efa:	mov    rdx,r8
1efd:	sub    rdx,rax
1f00:	imul   rdx,rsi
1f04:	imul   rdi,rax
1f08:	add    rbp,rax
1f0b:	xor    rax,0x1
1f0f:	add    rdi,rdx
1f12:	mov    QWORD PTR [r13+r12*8+0x0],rdi
1f17:	add    rcx,rax
1f1a:	inc    r12
1f1d:	mov    rax,rbp
1f20:	and    rax,r12
1f23:	mov    rdi,QWORD PTR [rbx+rbp*8]
1f27:	mov    rsi,QWORD PTR [r10+rcx*8]
1f2b:	test   rax,rcx
1f2e:	jne    1ef0 <nonbranching_reverse+0x40>

Sure looks a lot better than branching! This seems more or less reasonable, but we can see that the multiplication trickery that we used to avoid the min branch takes up some space here; presumably it also takes some time. Maybe one little branch isn't too bad though, and perhaps the compiler is more willingly to use conditional instructions if we use the ternary operator, like this:

void nonbranching_reverse_ternary(uint64_t *xs, size_t xmax, uint64_t *ys, size_t ymax, 
                                  uint64_t *zs, size_t zmax) {
  uint64_t *xse = xs + xmax;
  uint64_t *yse = ys + ymax;
  uint64_t *zse = zs + zmax;

  ssize_t i = -((ssize_t) xmax);
  ssize_t j = -((ssize_t) ymax);
  ssize_t k = -((ssize_t) zmax);

  uint64_t xi = xse[i], yj = yse[j];
  while (i & j & k) {
    uint64_t ybig = (xi - yj) >> 63;
    yj = ybig ? xi : yj;
    zse[k] = yj;
    i += ybig;
    xi = xse[i];
    ybig ^= 1;
    j += ybig;
    yj = yse[j];
    k += 1;
  }
  if (i == 0)
    memcpy(zse + k, yse + j, -8 * k);
  if (j == 0)
    memcpy(zse + k, xse + i, -8 * k);
}

This time, if we look at the assembly, we can see that the compiler is finally getting it: cmove!

2080:	mov    rax,yj                     ;
2083:	sub    rax,xi                     ;
2086:	shr    rax,0x3f                   ; t = (yj - xi) >> 63
208a:	cmove  yj,xi                      ; yj = t == 0 ? xi : yj
208e:	add    j,rax                      ; j += t
2091:	mov    QWORD PTR [zs+k*8],yj      ; z[k] = yj
2096:	xor    rax,0x1                    ; t ^= 1
209a:	inc    k                          ; k++
209d:	add    i,rax                      ; i += t
20a0:	mov    rax,k                      ; 
20a3:	and    rax,j                      ; t = k & j
20a6:	mov    yj,QWORD PTR [ys+j*8]      ; yj = ys[j]
20aa:	mov    xi,QWORD PTR [xs+i*8]      ; xi = xs[i]
20ae:	test   rax,i                      ; if ((i & j & k) != 0)
20b1:	jne    2080                       ; goto .2080

So we see it's really the same! Curiously, the compiler turned our code around to have t be 1 if xi was the bigger, whereas our ybig was 1 if yj was the bigger.

Results

And now for the results! We fill two arrays with random elements and run branching on it, such that we get the merged array back. This is used as the ground truth which all other variations are checked agaist, in case we have messed up. Then we use clock_gettime to measure the wall clock time that we spend, per method. The following is running time in milliseconds where both lists are 2**25 elements long, averaged over 100 runs; 10 iterations per seed and 10 different seeds (srand(i) for each iteration).

These are the numbers I got on a Intel i7-7500U@2.7GHz (avg +/- var):

branching:                          30.998 +/- 0.001
nonbranching_but_branching:         27.330 +/- 0.002
nonbranching:                       24.770 +/- 0.000
nonbranching_but_branching_reverse: 19.387 +/- 0.000
nonbranching_reverse:               20.015 +/- 0.000
nonbranching_reverse_ternary:       19.038 +/- 0.000
asm_nb_rev:                         18.987 +/- 0.001

I also ran the suite on another machine with a Intel i5-8250U@1.60GHz, in order to see if there would be any significant difference:

branching:                          31.405 +/- 0.034
nonbranching_but_branching:         27.646 +/- 0.097
nonbranching:                       27.894 +/- 0.021
nonbranching_but_branching_reverse: 22.760 +/- 0.040
nonbranching_reverse:               21.284 +/- 0.050
nonbranching_reverse_ternary:       19.299 +/- 0.002
asm_nb_rev:                         19.793 +/- 0.009

Interestingly, on this CPU our assembly is slightly slower than the ternary version; I guess this is due to us using a cmovl where the compiler generated version used the shifting trick.

Bonus: Sorting

We can't possibly have done all this merging without making a proper mergesort in the end! Luckily for us, the merge part is really the only difficult part of the routine:

void merge_sort(uint64_t *xs, size_t n, uint64_t *buf) {
  if (n < 2) return;
  size_t h = n / 2;
  merge_sort(xs, h, buf);
  merge_sort(xs + h, n - h, buf + h);
  merge(xs, h, xs + h, n - h, buf, n);
  memcpy(xs, buf, 8 * n);
}

Unfortunately we have to merge to a buffer and then memcpy it back. Perhaps this is fixable: we can make the sorting routine either put the result in xs or in buf, and by having the recursive calls say which we can merge into the other, assuming both recursive calls agree(!!⁵). That is, if the recursive calls say that the sorted subarrays are in xs, we merge into buf and tell our caller that our result is in buf. At the end, we just need to make sure that the final sorted numbers are in xs.

void _sort_asm(uint64_t *xs, size_t n, uint64_t *buf, int *into_buf) {
  if (n < 2) {
    *into_buf = 0;
    return;
  }
  size_t h = n / 2;
  int res_in_buf;
  _sort_asm(xs, h, buf, &res_in_buf); // WARNING: `res_in_buf` for the two calls is needs
  _sort_asm(xs + h, n - h, buf + h, &res_in_buf); // not be the same in the real world!
  *into_buf = res_in_buf ^ 1;
  if (res_in_buf)
    asm_nb_rev(buf, h, buf + h, n - h, xs, n);
  else
    asm_nb_rev(xs, h, xs + h, n - h, buf, n);
}

void sort_asm(uint64_t *xs, size_t n, uint64_t *buf) {
  int res_in_buf;
  _sort_asm(xs, n, buf, &res_in_buf);
  if (res_in_buf) {
    memcpy(xs, buf, 8 * n);
  }
}

and similar, for the other variants. You might see the branch and wonder if we can remove it --- I tried, by making an array {xs, buf} and index it with res_in_buf, but it caused a minor slowdown: maybe some branching is fine after all.

Here are the running times:

                                         i7-7500U              i5-8250U
sort_branching:                          369.479 +/- 0.047     393.762 +/- 0.082
sort_nonbranching_but_branching:         324.337 +/- 0.014     337.120 +/- 0.099
sort_nonbranching:                       325.658 +/- 0.028     352.802 +/- 0.120
sort_nonbranching_but_branching_reverse: 279.237 +/- 0.164     287.799 +/- 0.154
sort_nonbranching_reverse:               283.927 +/- 0.033     299.277 +/- 0.929
sort_nonbranching_reverse_ternary:       270.668 +/- 0.009     278.644 +/- 1.677
sort_asm_nb_rev:                         270.228 +/- 0.009     281.657 +/- 0.360

If you would like to run the suite yourself, the git repo is avaiable here.

Thanks for reading.

Originally I had omitted the _done parts, and the code was much cleaner, and I'm not sure why having it in complicates this that much. Also, why is k incremented before storing zs[k] so that we have to store zs[k-1] instead? ↩
Curiously, if we change from uint64_t to int64_t and use ((a-b)>>63)&1 for the test we do not depend on the magnitudes of the numbers (as the compiler can assume signed overflow will not happen); also the and never makes it to the assembly, and we still use logical instead of arithmetic shift. ↩
The alternative is arithmetic shift in which the sign bit is propagated down. In this case we would end up with either all zeroes or all ones. ↩
https://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details ↩
This is really only the case if n is a power of two: otherwise you'll have two siblings in the call tree with different ns, and this difference will cause two leaf nodes to be at different depths, which in turn will make them "out of sync". ↩

Recursion

2016-04-10T14:50:41+01:00

So, what is recursion¹ anyways?

Mathematical Induction

Explaining recursion might be easier if we understand mathematical induction. Indunction is a proof--technique, which works somewhat like this²:

Show that we can get from one state to the next state
Show that we have an initial state
We can now get to any state, after the initial state.

The concept can be visalized as climbing a staircase. If we are on an arbitrary step, we know how to get to the next step. We also know how to get to the 0th step, which (i guess) is the ground. From this, we conclude that we can get to step n, for any positive n.

What does this have to do with recursion? Well, recursion is kind of the opposite. With recursion, we need the following:

The problem can be solved by first solving a smaller instance of the same problem
When the input is small enough, it is trivial to solve
We can now solve an instance of any size, simply by making the problem smaller enough times.

Recursion is heavily used in functional programming, so if you are familiar with, say a Lisp, or Haskell, this might be familiar.

For instance, we can find the length of a list using recursion: the length of a list is one more then the length of the same list, with the first element removed, and the length of an empty list is 0.

num list_length(List)
    if List == []
        return 0
    let head be first element
    let tail be rest of list
    return 1 + list_length(tail)

Or, in actual python:

def list_len(lst):
  if lst == []:
    return 0
  return 1 + list_len(lst[1:])

A Not So Straight Forward Recursion

Let's say we're trying to sort a list. We could to a lot of different things, like keeping a sorted list, and inserting one new element at a time, or trying to build a data structure, with which extracting a sorted list is trivial³. But we'll try something else. We can begin with the observation that if we split the list in two, and somehow manage to sort the two list separately, we can merge the lists, pretty easily: just take out the smallest element of the two in front of the lists, and push it at the end onto a new list. Then the new list will consist of all of the elements in sorted order, because we allways took the smaller element.

So great, we just found a way to sort a list... except we didn't, because in order for our algorithm to work, we need an additional algorithm to sort the two lists. But, do we really? The algorithm we just made is itself a sorting algorithm! What happends if we try to use the algorithm itself?

list merge_sort(List)
    split List at middle into A, B
    merge_sort(A)
    merge_sort(B)
    return merge(A, B)

Does this work? What happends if List is empty? Or contains one element? Actually, when the length of the list is less than two, the list is already sorted (if we can say that an empty list is sorted).

list merge_sort(List)
    if List.length < 2
        return List
    split List at middle into A, B
    merge_sort(A)
    merge_sort(B)
    return merge(A, B)

Ok, we are (perhaps) starting to build up confidence in our new, albeit weird, algorithm. But, this call to the function itself looks a little dangerous. How can we know if it will end?

We first observe that if the length of the list is less than two, we return the list, so the algorithm returns from that call. If the length of the list is larger or equal to two, we split the list at the middle, and call the parts A and B. But the length of both A and B is the half of List (or something very close to half). Neither of them can possibly be larger, or even of equal length. Hence, we know that the algorithm calls itself, but with a smaller input. Eventually, when the input it small enough --- containing less than two elements --- we return. Hence, no matter what size the input is, the algorithm always terminates!

Now, this result is kind of amazing. We have just constructed a sorting algorithm, kind of without even thinking about the problem of sorting! By assuming our own algorithm actually works, we gained confidence that the same algorithm works!

The merge step looks a bit scary, though⁴. Maybe we can try something similar, but without having to merge the lists.

Ok, what if we sort the list just a little bit, so that we can make the recursion work. For instance, we can take a element from the list and then split, or partition, the list in two, based on that one element, such that all elements in the first list are less than or equal to our selected element, and all elements in the last list are greater than it. At the end, we simply put them together, and put our special element, the pivot, in betweeen.

list sort(List)
    if List.length < 2
        return List
    select e from List
    make A such that a in A --> a < e
    sort(A)
    make B such that b in B --> b > e
    sort(B)
    return A + [e] + B

This might look more like sorting, because we partition the list into two, based on the pivot. But one would think there is still lots to do. There is not. This python function implements this algorithm, quick-sort.

def sort(lst):
  if len(lst) < 2:
    return lst
  pivot = lst.pop()
  A = [a for a in lst if a <= pivot]
  B = [b for b in lst if b <= pivot]
  return sort(A) + [pivot] + sort(B)

and this is the haskell version:

sort (x:xs) = l ++ [x] ++ g
            where l = sort [a | a<-xs, a <= x]
                  g = sort [b | b<-xs, b > x]
sort l = l

Recursion Gone Wrong

Recursion is an amazing tool, and sometimes it works out just right. However, there are multiple things that can go wrong⁵.

Make Sure It Ends

It is very easy to get burnt on this, either by looping forever⁶:

def fac(n):
    if n == 1:
        return 1
    return n * fac(n - 1)

# fac(-2) loops infinitely

or incorrectly handling, or even forgetting, the base case⁷, and result in a crash:

sort [x] = [x]
sort (x:xs) = l ++ [x] ++ g
            where l = sort [a | a<-xs, a <= x]
                  g = sort [b | b<-xs, b > x]
-- Non-exhaustive patterns:
-- the empty list case is not handled

As mentioned here¹, making sure the recursion ends is tricky.

If we have multiple recursive calls it is also easy to do one thing over and over using recursion. The classical example here is the recursive fibonacci function⁸, which runs in exponential time. Try to run fib(35)⁹.

def fib(n):
    if n < 3:
        return 1
    return fib(n-1) + fib(n-2)

A solution to this is memoization¹⁰, where we save intermediate results, and look up values instead of recomputing them. Of course, in the case on calculating fibonacci numbers, this is still worse than an iterative solution, because we are using more memory. This is not to say that it is not possible to write a equally good implementation, while still using recursion; we just have to go upwards, instead of downwards:

def fib(n):
    def go(current, previous, a):
        if a == n:
            return current
        return go(current + previous, current, a + 1)
    if n < 3:
        return 1
    return go(1, 0, 1)

At this point, one could argue that this is basically a loop¹¹, so there is no point in using recursion; we might be better off with simply writing

def fib(n):
    current = 1
    previous = 0
    for _ in range(n - 1):
      current, previous =  current + previous, current
    return current

Hopefully, we have gained a little insight in recursion --- both its magic and its dangers. While there might be tricky to get recursion right, actually getting it right is too much fun to miss out on.

While trying to make the Recursion: See Recursion joke, the static site generator ended up crashing due to a never ending recursion so that the call stack became too large. ↩ ↩²
This is not a rigid, and probably not a very good, intro to induction --- but that is fine, because this post is about recursion. ↩
This is my personal favorite. ↩
It's not, but it is easy to mess up the merge routine. ↩
These problems aren't unique to recursion --- both infinite loops and horrible running times are naturally possible without using recursion. ↩
You could argue that the factorial of a negative number is not well defined, so that calling fac(-2) does not make any sense. However, defining the factorial of a negative number is useful in many situations. ↩
This actually was the version I wrote when trying to write the correct version above. ↩
We will be using the same definition as on wikipedia, where fib(1) = fib(2) = 1. ↩
... or fib(50), if you have a lot of time on your hands. ↩
Note that there is no r in memoization. ↩
The function is Tail--Recursive, so in many languages this would be transformed by the compiler to a loop. However, Python is not one of those languages. ↩

Fixing My Wacom Tablet

2020-06-21T01:00:39+02:00

A quick warning: this isn't one of these "here's the problem, here's the solution, bam bam bam" type of write-ups.This was written while I tried to fix my tablet, and with very little editing after the fact. Don't go in expecting a well though out story arc, as this is meant to reflect how I was working and what I was thinking. I think generally there's way too little material online on how people work day to day, and too many write-ups of just the good parts, so this is me helping pushing the ratio a little in the right direction.

With that out of the way, let's start with the background.

Over a year ago I bought a Wacom drawing tablet, having been increasingly annoyed with my handwritten notes and doodles. I figured if I got a tablet I could draw digitally which would simplify erasing, colors, and layout. And, of course, I would be able to access it digitally. Since then I've mainly used XournalPP for this, and I think it's been working okay.

Not great though; there are definitely quirks with both XournalPP and the wacom driver, and it took some tinkering before I had a setup that was usable. Still, one thing that never worked is button presses on the drawing pad. The buttons on the stylus works fine, and the on/off button on the pad works, but none of the four remaining pad buttons do anything.

Worse yet, I'm using Wayland on my home computer, which I suspect will make things tougher.

Still, I figured after so long I'd try to properly fix this, whatever it takes. Just¹ getting some button event presses shouldn't be that hard, right?

libinput

libinput is a library to handle input devices in Wayland. My system also has the libinput tool for interfacing with this library, and the tool has, among other things, the command debug-events:

$ sudo libinput debug-events
-event1   DEVICE_ADDED     Power Button                      seat0 default group1  cap:k
-event0   DEVICE_ADDED     Power Button                      seat0 default group2  cap:k
-event20  DEVICE_ADDED     Logitech Performance MX           seat0 default group3  cap:p left scroll-nat scroll-button
-event3   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=3        seat0 default group4  cap:
-event4   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=7        seat0 default group4  cap:
-event5   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=8        seat0 default group4  cap:
-event6   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=9        seat0 default group4  cap:
-event7   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=10       seat0 default group4  cap:
-event8   DEVICE_ADDED     HDA ATI HDMI HDMI/DP,pcm=11       seat0 default group4  cap:
-event21  DEVICE_ADDED     Kingsis Peripherals Evoluent VerticalMouse 4 seat0 default group5  cap:p left scroll-nat scroll-button
-event26  DEVICE_ADDED     HD Pro Webcam C920                seat0 default group6  cap:k
-event24  DEVICE_ADDED     Wacom Intuos BT M Pen             seat0 default group7  cap:T  size 216x135mm
-event25  DEVICE_ADDED     Wacom Intuos BT M Pad             seat0 default group7  cap:P buttons:4 strips:0 rings:0 mode groups:1
-event17  DEVICE_ADDED     ZSA Ergodox EZ                    seat0 default group8  cap:k
-event18  DEVICE_ADDED     ZSA Ergodox EZ Mouse              seat0 default group8  cap:p left scroll-nat scroll-button
-event19  DEVICE_ADDED     ZSA Ergodox EZ System Control     seat0 default group8  cap:k
-event22  DEVICE_ADDED     ZSA Ergodox EZ Consumer Control   seat0 default group8  cap:kp scroll-nat
-event23  DEVICE_ADDED     ZSA Ergodox EZ Keyboard           seat0 default group8  cap:k
-event10  DEVICE_ADDED     HD-Audio Generic Rear Mic         seat0 default group4  cap:
-event11  DEVICE_ADDED     HD-Audio Generic Line             seat0 default group4  cap:
-event12  DEVICE_ADDED     HD-Audio Generic Line Out Front   seat0 default group4  cap:
-event13  DEVICE_ADDED     HD-Audio Generic Line Out Surround seat0 default group4  cap:
-event14  DEVICE_ADDED     HD-Audio Generic Line Out CLFE    seat0 default group4  cap:
-event15  DEVICE_ADDED     HD-Audio Generic Line Out Side    seat0 default group4  cap:
-event16  DEVICE_ADDED     HD-Audio Generic Front Headphone  seat0 default group4  cap:
-event9   DEVICE_ADDED     HD-Audio Generic Front Mic        seat0 default group4  cap:

two of which looks pretty interesting:

-event24  DEVICE_ADDED     Wacom Intuos BT M Pen             seat0 default group7  cap:T  size 216x135mm
-event25  DEVICE_ADDED     Wacom Intuos BT M Pad             seat0 default group7  cap:P buttons:4 strips:0 rings:0 mode groups:1

The pad with its four buttons seems to be properly detected. Drawing on the pad while libinput debug-events is running spits out a bunch of lines of the form

 event24  TABLET_TOOL_AXIS +1.666s		121.71*/69.45*	distance: 0.94*
 event24  TABLET_TOOL_AXIS +1.674s		121.76*/69.46*	distance: 0.94
 event24  TABLET_TOOL_AXIS +1.682s		121.83*/69.47*	distance: 0.87*
 event24  TABLET_TOOL_AXIS +2.552s		106.09*/43.80*	pressure: 0.34*
 event24  TABLET_TOOL_AXIS +2.558s		106.06*/43.79*	pressure: 0.35*
 event24  TABLET_TOOL_AXIS +2.566s		106.05*/43.79	pressure: 0.35*

that is, events from the pen. We can see that we're getting events that the pen is near, but not touching, the pad with the lines saying distance, and that the events corresponding to when we're actually touching² the pad have pressure. The numbers in the middle are positional coordinates, ranging from 0/0 at the top left to 216/135 at the bottom right, which I suspect are in millimeters³.

So far this seems to be working rather well. But uh oh, what happens when we try to press the pad buttons? Nothing. And thus begins the adventure.

libwacom

I figure that since the pen is working well, it might be the driver for the pad that's lacking. This is slightly supported by the fact that the only apparent usage of the pad is to detect when the pen is near (see this² footnote). Assuming this is handled across other pads in the same way, maybe the specifics of my pad, like the buttons, are problematic in the driver. The output from libinput did state correctly that it has 4 button though, but ... uuuh, let's just check anyways.

Let's see what relevant kernel modules and packages we have:

$ lsmod | grep wacom
wacom                 126976  0
usbhid                 65536  2 wacom,hid_logitech_dj
hid                   143360  5 wacom,usbhid,hid_generic,hid_logitech_dj,hid_logitech_hidpp
$ pacman -Q | grep wacom
libwacom 1.3-1

Okay. A quick search reveals that libwacom is on Github, and the README contains the following helpful note:

Use the libwacom-list-local-devices tool to list all local devices recognized by libwacom. If your device is not listed, but it is available as an event device in the kernel (see /proc/bus/input/devices) and in the X session (see xinput list), the device is missing from libwacom's database.

Again, we are using Wayland, and since the README assumes an X system we might run into trouble. Let's give it a shot:

$ libwacom-list-local-devices
# Device node: /dev/input/event25
[Device]
Name=Wacom Intuos BT M
ModelName=CTL-6100WL
DeviceMatch=usb:056a:0378;bluetooth:056a:0379;
Class=Bamboo
Width=9
Height=5
IntegratedIn=
Layout=intuos-m-p3.svg
Styli=0x862;

[Features]
Reversible=false
Stylus=true
Ring=false
Ring2=false
Touch=false
TouchSwitch=false
# StatusLEDs=
NumStrips=0
Buttons=4

[Buttons]
# Left=
# Right=
Top=A;B;C;D;
# Bottom=
# Touchstrip=
# Touchstrip2=
# OLEDs=
# Ring=
# Ring2=
EvdevCodes=0x110;0x111;0x115;0x116;
RingNumModes=0
Ring2NumModes=0
StripsNumModes=0

---------------------------------------------------------------
# Device node: /dev/input/event24
[Device]
Name=Wacom Intuos BT M
ModelName=CTL-6100WL
DeviceMatch=usb:056a:0378;bluetooth:056a:0379;
Class=Bamboo
Width=9
Height=5
IntegratedIn=
Layout=intuos-m-p3.svg
Styli=0x862;

[Features]
Reversible=false
Stylus=true
Ring=false
Ring2=false
Touch=false
TouchSwitch=false
# StatusLEDs=
NumStrips=0
Buttons=4

[Buttons]
# Left=
# Right=
Top=A;B;C;D;
# Bottom=
# Touchstrip=
# Touchstrip2=
# OLEDs=
# Ring=
# Ring2=
EvdevCodes=0x110;0x111;0x115;0x116;
RingNumModes=0
Ring2NumModes=0
StripsNumModes=0

---------------------------------------------------------------

Recall from above that event24 is the pen and event25 is the pad. The driver seems to be confused as to the difference between the pad and the pen, as both devices have the exact same output; Maybe this is due to to the fact that they share a device id, or maybe it makes things simpler in the driver. For instance, having the pen be Width=9 and Height=5 is obviously not true, but those are the limits of the pen pressure events that you'd get since you always would use the pen together with the pad. I'll assume that's not a problem.

The output also states, once again, that we do have four buttons, but now they also correctly state that the buttons are on the top of the pad. They are labeled A-D. Since there are a listing of four numbers in EvdevCodes, I think those are the "scan codes", so to speak, that are sent when the buttons are pressed. In case of confusion, let's write those down in hex and decimal:

0x110 = 272
0x111 = 273
0x115 = 277
0x116 = 278

Just cat it

Come to think of it, why don't we just cat the right event file? If there are any events coming through we would at least know that the hardware is recognizing that we're using it and sending something into the driver. Then we would have narrowed down slightly more where in the stack the problems are.

$ sudo cat /dev/input/event25
�`��`�(�`��*D	�*D	(�*D	�y
                                  �y
                                    (�y
                                       ����(���)�)(�)^C⏎

That looks about right? Slightly unreadable tough; here are the output after having pressed the buttons one at a time, piped through hexdump with a newline in between each event:

$ sudo cat /dev/input/event25 | hexdump
0000000 04c1 5eee 0000 0000 ffb0 0005 0000 0000
0000010 0001 0100 0001 0000 04c1 5eee 0000 0000
0000020 ffb0 0005 0000 0000 0003 0028 000f 0000
0000030 04c1 5eee 0000 0000 ffb0 0005 0000 0000

0000040 0000 0000 0000 0000 04c1 5eee 0000 0000
0000050 39ff 0008 0000 0000 0001 0100 0000 0000
0000060 04c1 5eee 0000 0000 39ff 0008 0000 0000
0000070 0003 0028 0000 0000 04c1 5eee 0000 0000
0000080 39ff 0008 0000 0000 0000 0000 0000 0000

0000090 04c4 5eee 0000 0000 4288 0004 0000 0000
00000a0 0001 0101 0001 0000 04c4 5eee 0000 0000
00000b0 4288 0004 0000 0000 0003 0028 000f 0000
00000c0 04c4 5eee 0000 0000 4288 0004 0000 0000

00000d0 0000 0000 0000 0000 04c4 5eee 0000 0000
00000e0 d49a 0007 0000 0000 0001 0101 0000 0000
00000f0 04c4 5eee 0000 0000 d49a 0007 0000 0000
0000100 0003 0028 0000 0000 04c4 5eee 0000 0000
0000110 d49a 0007 0000 0000 0000 0000 0000 0000

0000120 04c4 5eee 0000 0000 d983 000e 0000 0000
0000130 0001 0102 0001 0000 04c4 5eee 0000 0000
0000140 d983 000e 0000 0000 0003 0028 000f 0000
0000150 04c4 5eee 0000 0000 d983 000e 0000 0000

0000160 0000 0000 0000 0000 04c5 5eee 0000 0000
0000170 0a14 0003 0000 0000 0001 0102 0000 0000
0000180 04c5 5eee 0000 0000 0a14 0003 0000 0000
0000190 0003 0028 0000 0000 04c5 5eee 0000 0000
00001a0 0a14 0003 0000 0000 0000 0000 0000 0000

00001b0 04c5 5eee 0000 0000 7e2c 000b 0000 0000
00001c0 0001 0103 0001 0000 04c5 5eee 0000 0000
00001d0 7e2c 000b 0000 0000 0003 0028 000f 0000
00001e0 04c5 5eee 0000 0000 7e2c 000b 0000 0000

00001f0 0000 0000 0000 0000 04c5 5eee 0000 0000
0000200 3f1e 000f 0000 0000 0001 0103 0000 0000
0000210 04c5 5eee 0000 0000 3f1e 000f 0000 0000
0000220 0003 0028 0000 0000 04c5 5eee 0000 0000
0000230 3f1e 000f 0000 0000 0000 0000 0000 0000

Somehow, the the down presses are less data than the releases. This might be true, but another explanation is that hexdump is buffering up the data so that it can output each line as 0x10 bytes. After carefully reading the man page of hexdump, and with some trial and error⁴, the following does the job:

$ sudo cat /dev/input/event25 | hexdump -ve "1/1 \"%02x\n\""
# A down                                              **    **
0a 0a ee 5e 00 00 00 00 9c 82 06 00 00 00 00 00 01 00 00 01 01 00 00 00 0a 0a ee 5e 00 00 00 00 9c 82 06 00 00 00 00 00 03 00 28 00 0f 00 00 00 0a 0a ee 5e 00 00 00 00 9c 82 06 00 00 00 00 00 00 00 00 00 00 00 00 00
# A up
0b 0a ee 5e 00 00 00 00 3b fa 03 00 00 00 00 00 01 00 00 01 00 00 00 00 0b 0a ee 5e 00 00 00 00 3b fa 03 00 00 00 00 00 03 00 28 00 00 00 00 00 0b 0a ee 5e 00 00 00 00 3b fa 03 00 00 00 00 00 00 00 00 00 00 00 00 00
# B down
0c 0a ee 5e 00 00 00 00 a9 ea 03 00 00 00 00 00 01 00 01 01 01 00 00 00 0c 0a ee 5e 00 00 00 00 a9 ea 03 00 00 00 00 00 03 00 28 00 0f 00 00 00 0c 0a ee 5e 00 00 00 00 a9 ea 03 00 00 00 00 00 00 00 00 00 00 00 00 00
# B up
0c 0a ee 5e 00 00 00 00 42 0e 0f 00 00 00 00 00 01 00 01 01 00 00 00 00 0c 0a ee 5e 00 00 00 00 42 0e 0f 00 00 00 00 00 03 00 28 00 00 00 00 00 0c 0a ee 5e 00 00 00 00 42 0e 0f 00 00 00 00 00 00 00 00 00 00 00 00 00
# C down
0e 0a ee 5e 00 00 00 00 f4 ee 01 00 00 00 00 00 01 00 02 01 01 00 00 00 0e 0a ee 5e 00 00 00 00 f4 ee 01 00 00 00 00 00 03 00 28 00 0f 00 00 00 0e 0a ee 5e 00 00 00 00 f4 ee 01 00 00 00 00 00 00 00 00 00 00 00 00 00
# C up
0e 0a ee 5e 00 00 00 00 4c 76 0c 00 00 00 00 00 01 00 02 01 00 00 00 00 0e 0a ee 5e 00 00 00 00 4c 76 0c 00 00 00 00 00 03 00 28 00 00 00 00 00 0e 0a ee 5e 00 00 00 00 4c 76 0c 00 00 00 00 00 00 00 00 00 00 00 00 00
# D down
0f 0a ee 5e 00 00 00 00 ee 8f 08 00 00 00 00 00 01 00 03 01 01 00 00 00 0f 0a ee 5e 00 00 00 00 ee 8f 08 00 00 00 00 00 03 00 28 00 0f 00 00 00 0f 0a ee 5e 00 00 00 00 ee 8f 08 00 00 00 00 00 00 00 00 00 00 00 00 00
# D up
10 0a ee 5e 00 00 00 00 c2 96 04 00 00 00 00 00 01 00 03 01 00 00 00 00 10 0a ee 5e 00 00 00 00 c2 96 04 00 00 00 00 00 03 00 28 00 00 00 00 00 10 0a ee 5e 00 00 00 00 c2 96 04 00 00 00 00 00 00 00 00 00 00 00 00 00

We can even see some signs of what data is sent through here. For instance, in the columns as marked by ** we see 00 through 03, likely the button number, and 01 for press and 00 for release. In other words, there seems to be reasonable data sent from the pad that we can read from /dev/input/event25.

Next, we need to find out on which side libwacom is; is that doing the mapping from whatever goes to over the wire and to what we just read, or is that supposed to read from event25 to the events that we did not get from libinput?

A Closer Look At That Data

kernel.org has some documentation for the Linux input subsystem, but I think it's written in such a way that it's not very helpful unless you already have a pretty good idea of what's going on. However, Section 1.5 has the following info:

You can use blocking and nonblocking reads, and also select() on the /dev/input/eventX devices, and you’ll always get a whole number of input events on a read. Their layout is:

struct input_event {
    struct timeval time;
    unsigned short type;
    unsigned short code;
    unsigned int value;
};

This doesn't seem right, since the size of input_event is way less than the data we read above. Unless, of course, we didn't get only one event. Since the first member is the time we can assume that all events should start with more or less the same. In addition, we suspect that an event is about 8 + 2 + 2 + 4 = 16 bytes long. Or maybe struct timeval is 16 bytes big? That seems to align much better with the data we have read:

0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  01 00  00 01  01 00 00 00
0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  03 00  28 00  0f 00 00 00
0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  00 00  00 00  00 00 00 00

man 3 timeval says that struct timeval has two members, a time_t and a suseconds_t; in addition, we're on a big-endian machine, so our struct members should look like this⁵:

struct input_event {
    struct timeval {
        time_t       tv_sec = 0x000000005eee0a0a;
        suseconds_t tv_usec = 0x000000000006829c;
    }                                  // and for the other two events:
    unsigned short type  =     0x0001; //     0003 / 0000
    unsigned short code  =     0x0100; //     0028 / 0000
    unsigned int   value = 0x00000001; // 0000000f / 00000000
}

The time certainly looks reasonable:

$ printf "%016x\n" (date +"%s")
000000005eee1baa

The docs says that the types are defined in include/uapi/linux/input-event-codes.h, which I found on my system in /usr/include/linux/input-event-codes.h⁶. Looking through it we can infer that the events we're reading are; The difficulty is that how to interpret the code or value of an event depends on the type of the event. §2.2.1 is helpful here. As far as I can tell, this is what's going on:

| Type | Code | Value | Meaning | |:- |:- |-: |:-| | EV_KEY | BTN_0 | 1 | pressed | | EV_ABS | ABS_MISC | 15 | ? | | EV_SYN | SYN_REPORT | 0 | undef |

Note that in §2.2.1 they say

EV_SYN event values are undefined. Their usage is defined only by when they are sent in the evdev event stream.

Here are all of the events from earlier, but this time one per line and annotated on the right:

# A down
0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  01 00  00 01  01 00 00 00 # KEY/BTN_0 Press
0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  03 00  28 00  0f 00 00 00 # ABS/MISC 15
0a 0a ee 5e 00 00 00 00  9c 82 06 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# A up
0b 0a ee 5e 00 00 00 00  3b fa 03 00 00 00 00 00  01 00  00 01  00 00 00 00 # KEY/BTN_0 Release
0b 0a ee 5e 00 00 00 00  3b fa 03 00 00 00 00 00  03 00  28 00  00 00 00 00 # ABS/MISC 0
0b 0a ee 5e 00 00 00 00  3b fa 03 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# B down
0c 0a ee 5e 00 00 00 00  a9 ea 03 00 00 00 00 00  01 00  01 01  01 00 00 00 # KEY/BTN_1 Press
0c 0a ee 5e 00 00 00 00  a9 ea 03 00 00 00 00 00  03 00  28 00  0f 00 00 00 # ABS/MISC 15
0c 0a ee 5e 00 00 00 00  a9 ea 03 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# B up
0c 0a ee 5e 00 00 00 00  42 0e 0f 00 00 00 00 00  01 00  01 01  00 00 00 00 # KEY/BTN_1 Release
0c 0a ee 5e 00 00 00 00  42 0e 0f 00 00 00 00 00  03 00  28 00  00 00 00 00 # ABS/MISC 0
0c 0a ee 5e 00 00 00 00  42 0e 0f 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# C down
0e 0a ee 5e 00 00 00 00  f4 ee 01 00 00 00 00 00  01 00  02 01  01 00 00 00 # KEY/BTN_2 Press
0e 0a ee 5e 00 00 00 00  f4 ee 01 00 00 00 00 00  03 00  28 00  0f 00 00 00 # ABS/MISC 15
0e 0a ee 5e 00 00 00 00  f4 ee 01 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# C up
0e 0a ee 5e 00 00 00 00  4c 76 0c 00 00 00 00 00  01 00  02 01  00 00 00 00 # KEY/BTN_2 Release
0e 0a ee 5e 00 00 00 00  4c 76 0c 00 00 00 00 00  03 00  28 00  00 00 00 00 # ABS/MISC 0
0e 0a ee 5e 00 00 00 00  4c 76 0c 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# D down
0f 0a ee 5e 00 00 00 00  ee 8f 08 00 00 00 00 00  01 00  03 01  01 00 00 00 # KEY/BTN_3 Press
0f 0a ee 5e 00 00 00 00  ee 8f 08 00 00 00 00 00  03 00  28 00  0f 00 00 00 # ABS/MISC 15
0f 0a ee 5e 00 00 00 00  ee 8f 08 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT
# D up
10 0a ee 5e 00 00 00 00  c2 96 04 00 00 00 00 00  01 00  03 01  00 00 00 00 # KEY/BTN_3 Release
10 0a ee 5e 00 00 00 00  c2 96 04 00 00 00 00 00  03 00  28 00  00 00 00 00 # ABS/MISC 0
10 0a ee 5e 00 00 00 00  c2 96 04 00 00 00 00 00  00 00  00 00  00 00 00 00 # SYN/REPORT

In trying to find out more info about ABS_MISC I found a wiki page on the linuxwacom/input-wacom repository named Kernel Input Event Overview, which explains pretty well how the wacom driver works. They state the following:

In addition to the BTN_TOOL_* events for informing user land what tool the current events being sent belong to, there is a MSC_SERIAL event that contains a serial # to aid in tracking current tool as well as a ABS_MISC which is a hard code device ID. Of these two, the MSC_SERIAL is the most useful to user land.

So, uuh.. maybe I won't worry too much about the ABS/MISC events. But this is good; we have confirmed that the data we're reading from /dev/input/event25 is of the type input_event, and that it makes sense, more or less. From reading the wiki page it really does sound like the driver is mapping whatever goes over the wire to the Linux input subsystem format, which is input_event.

At this point I realize that linuxwacom has three primary components: input-wacom which is the kernel driver, which presumably does the mapping just mentioned; xf86-input-wacom the X driver, which I suppose makes kernel driver events into X events? and libwacom, which really just seems to be a utility for simpler querying of state and button mapping and so on.

Going back, we can see that libwacom-list-local-devices seems to work just fine, which I think means that the pad is properly detected. In addition, we know that we get "good" events to /dev/input/event25 so presumably the kernel driver also works fine. However, libinput debug-events did not list the button presses, so libinput doesn't get those events, although it does get the stylus events.

Back to libinput

Next, we got back to libinput; looking through some of the docs it seems there's another command, libinput record. According to man 1 libinput-record,

The libinput record tool records kernel events from a device and prints them in a format that can later be replayed with the libinput replay(1) tool.

Running it, and selecting our device, actually shows that the buttons are detected:

$ sudo libinput record
Available devices:
/dev/input/event0:	Power Button
/dev/input/event1:	Power Button
/dev/input/event2:	PC Speaker
/dev/input/event3:	HDA ATI HDMI HDMI/DP,pcm=3
/dev/input/event4:	HDA ATI HDMI HDMI/DP,pcm=7
/dev/input/event5:	HDA ATI HDMI HDMI/DP,pcm=8
/dev/input/event6:	HDA ATI HDMI HDMI/DP,pcm=9
/dev/input/event7:	HDA ATI HDMI HDMI/DP,pcm=10
/dev/input/event8:	HDA ATI HDMI HDMI/DP,pcm=11
/dev/input/event9:	HD-Audio Generic Front Mic
/dev/input/event10:	HD-Audio Generic Rear Mic
/dev/input/event11:	HD-Audio Generic Line
/dev/input/event12:	HD-Audio Generic Line Out Front
/dev/input/event13:	HD-Audio Generic Line Out Surround
/dev/input/event14:	HD-Audio Generic Line Out CLFE
/dev/input/event15:	HD-Audio Generic Line Out Side
/dev/input/event16:	HD-Audio Generic Front Headphone
/dev/input/event17:	ZSA Ergodox EZ
/dev/input/event18:	ZSA Ergodox EZ Mouse
/dev/input/event19:	ZSA Ergodox EZ System Control
/dev/input/event20:	Logitech Performance MX
/dev/input/event21:	Kingsis Peripherals Evoluent VerticalMouse 4
/dev/input/event22:	ZSA Ergodox EZ Consumer Control
/dev/input/event23:	ZSA Ergodox EZ Keyboard
/dev/input/event24:	Wacom Intuos BT M Pen
/dev/input/event25:	Wacom Intuos BT M Pad
/dev/input/event26:	HD Pro Webcam C920
Select the device event number: 25
Recording to 'stdout'.
version: 1
ndevices: 1
libinput:
  version: "1.15.5"
  git: "unknown"
system:
  kernel: "5.7.2-arch1-1"
  dmi: "dmi:bvnAmericanMegatrendsInc.:bvr3.D0:bd07/11/2018:svnMicro-StarInternationalCo.,Ltd.:pnMS-7A33:pvr2.0:rvnMSI:rnX370SLIPLUS(MS-7A33):rvr2.0:cvnMicro-StarInternationalCo.,Ltd.:ct3:cvr2.0:"
devices:
- node: /dev/input/event25
  evdev:
    # Name: Wacom Intuos BT M Pad
    # ID: bus 0x3 vendor 0x56a product 0x378 version 0x110
    # Size in mm: unknown, missing resolution
    # Supported Events:
    # Event type 0 (EV_SYN)
    # Event type 1 (EV_KEY)
    #   Event code 256 (BTN_0)
    #   Event code 257 (BTN_1)
    #   Event code 258 (BTN_2)
    #   Event code 259 (BTN_3)
    #   Event code 331 (BTN_STYLUS)
    # Event type 3 (EV_ABS)
    #   Event code 0 (ABS_X)
    #       Value           0
    #       Min             0
    #       Max             1
    #       Fuzz            0
    #       Flat            0
    #       Resolution      0
    #   Event code 1 (ABS_Y)
    #       Value           0
    #       Min             0
    #       Max             1
    #       Fuzz            0
    #       Flat            0
    #       Resolution      0
    #   Event code 40 (ABS_MISC)
    #       Value           0
    #       Min             0
    #       Max             0
    #       Fuzz            0
    #       Flat            0
    #       Resolution      0
    # Properties:
    name: "Wacom Intuos BT M Pad"
    id: [3, 1386, 888, 272]
    codes:
      0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15] # EV_SYN
      1: [256, 257, 258, 259, 331] # EV_KEY
      3: [0, 1, 40] # EV_ABS
    absinfo:
      0: [0, 1, 0, 0, 0]
      1: [0, 1, 0, 0, 0]
      40: [0, 0, 0, 0, 0]
    properties: []
  hid: [6, 13, 255, 9, 1, 161, 1, 133, 16, 9, 32, 53, 0, 69, 0, 21, 0, 37, 1, 161, 0, 9, 66, 9, 68, 9, 90, 37, 1, 117, 1, 149, 3, 129, 2, 149, 2, 129, 3, 9, 50, 9, 54, 149, 2, 129, 2, 149, 1, 129, 3, 10, 48, 1, 101, 17, 85, 13, 71, 96, 84, 0, 0, 39, 96, 84, 0, 0, 117, 24, 149, 1, 129, 2, 10, 49, 1, 71, 188, 52, 0, 0, 39, 188, 52, 0, 0, 129, 2, 9, 48, 85, 0, 101, 0, 38, 255, 15, 117, 16, 129, 2, 117, 8, 149, 6, 129, 3, 10, 50, 1, 37, 63, 117, 8, 149, 1, 129, 2, 9, 91, 9, 92, 23, 0, 0, 0, 128, 39, 255, 255, 255, 127, 117, 32, 149, 2, 129, 2, 9, 119, 21, 0, 38, 255, 15, 117, 16, 149, 1, 129, 2, 192, 133, 17, 101, 0, 85, 0, 53, 0, 69, 0, 9, 57, 161, 0, 10, 16, 9, 10, 17, 9, 10, 18, 9, 10, 19, 9, 21, 0, 37, 1, 117, 1, 149, 4, 129, 2, 149, 4, 129, 3, 117, 8, 149, 7, 129, 3, 192, 133, 19, 101, 0, 85, 0, 53, 0, 69, 0, 10, 19, 16, 161, 0, 10, 59, 4, 21, 0, 37, 100, 117, 7, 149, 1, 129, 2, 10, 4, 4, 37, 1, 117, 1, 129, 2, 9, 0, 38, 255, 0, 117, 8, 129, 2, 117, 8, 149, 6, 129, 3, 192, 9, 14, 161, 2, 133, 2, 10, 2, 16, 21, 2, 37, 2, 117, 8, 149, 1, 177, 2, 133, 3, 10, 3, 16, 21, 0, 38, 255, 0, 149, 1, 177, 2, 133, 4, 10, 4, 16, 21, 1, 37, 1, 149, 1, 177, 2, 133, 7, 10, 9, 16, 21, 0, 38, 255, 0, 149, 1, 177, 2, 177, 3, 10, 7, 16, 9, 0, 39, 255, 255, 0, 0, 117, 16, 149, 2, 177, 2, 117, 8, 149, 9, 177, 3, 133, 12, 10, 48, 13, 10, 49, 13, 10, 50, 13, 10, 51, 13, 101, 17, 85, 13, 53, 0, 70, 200, 0, 21, 0, 38, 144, 1, 117, 16, 149, 4, 177, 2, 133, 13, 10, 13, 16, 101, 0, 85, 0, 69, 0, 37, 1, 117, 8, 149, 1, 177, 2, 133, 20, 10, 20, 16, 38, 255, 0, 149, 13, 177, 2, 133, 204, 10, 204, 16, 149, 2, 177, 2, 133, 49, 10, 49, 16, 37, 100, 149, 3, 177, 2, 149, 2, 177, 3, 192, 10, 172, 16, 161, 2, 21, 0, 38, 255, 0, 117, 8, 133, 172, 9, 0, 150, 191, 0, 129, 2, 133, 21, 9, 0, 149, 14, 177, 2, 133, 51, 9, 0, 149, 18, 177, 2, 133, 68, 9, 0, 149, 4, 177, 2, 133, 69, 9, 0, 149, 32, 177, 2, 133, 96, 9, 0, 149, 63, 177, 2, 133, 97, 9, 0, 149, 62, 177, 2, 133, 98, 9, 0, 149, 62, 177, 2, 133, 101, 9, 0, 149, 4, 177, 2, 133, 102, 9, 0, 149, 4, 177, 2, 133, 103, 9, 0, 149, 4, 177, 2, 133, 104, 9, 0, 149, 17, 177, 2, 133, 111, 9, 0, 149, 62, 177, 2, 133, 205, 9, 0, 149, 2, 177, 2, 133, 22, 9, 0, 149, 14, 177, 2, 133, 53, 9, 0, 149, 10, 177, 2, 192, 133, 208, 9, 1, 150, 8, 0, 177, 2, 133, 209, 9, 1, 150, 4, 1, 177, 2, 133, 210, 9, 1, 150, 4, 1, 177, 2, 133, 211, 9, 1, 150, 4, 0, 177, 2, 133, 212, 9, 1, 150, 4, 0, 177, 2, 133, 213, 9, 1, 150, 4, 0, 177, 2, 133, 214, 9, 1, 150, 4, 0, 177, 2, 133, 215, 9, 1, 150, 8, 0, 177, 2, 133, 216, 9, 1, 150, 12, 0, 177, 2, 133, 217, 9, 1, 150, 0, 5, 177, 2, 133, 218, 9, 1, 150, 4, 2, 177, 2, 133, 219, 9, 1, 150, 6, 0, 177, 2, 133, 220, 9, 1, 150, 2, 0, 177, 2, 133, 221, 9, 1, 150, 4, 0, 177, 2, 133, 222, 9, 1, 150, 4, 0, 177, 2, 133, 223, 9, 1, 150, 34, 0, 177, 2, 133, 224, 9, 1, 150, 1, 0, 177, 2, 133, 225, 9, 1, 150, 2, 0, 177, 2, 133, 226, 9, 1, 150, 2, 0, 177, 2, 133, 227, 9, 1, 150, 2, 0, 177, 2, 133, 228, 9, 1, 150, 255, 1, 177, 2, 192 ]
  udev:
    properties:
    - ID_INPUT=1
    - ID_INPUT_TABLET=1
    - ID_INPUT_TABLET_PAD=1
    - LIBINPUT_DEVICE_GROUP=3/56a/378:usb-0000:29:00.3-2
  quirks:
  events:
  - evdev:
    - [  0,      0,   1, 256,       1] # EV_KEY / BTN_0                     1
  - evdev:
    - [  0,      0,   3,  40,      15] # EV_ABS / ABS_MISC                 15 (+15)
    - [  0,      0,   0,   0,       0] # ------------ SYN_REPORT (0) ---------- +0ms
  - evdev:
    - [  0, 214000,   1, 256,       0] # EV_KEY / BTN_0                     0
    - [  0, 214000,   3,  40,       0] # EV_ABS / ABS_MISC                  0 (-15)
    - [  0, 214000,   0,   0,       0] # ------------ SYN_REPORT (0) ---------- +214ms

These are exactly the same events as the ones we reverse engineered above, which is a good sign; libinput gets the same events as we are, but it seems to decide that they aren't worth sending further. Maybe if we dig a bit into libinput we can find out how devices and events are treated, set a breakpoint somewhere when libinput is reading the button press event and see what happens.

The libinput docs has an overview over libinput.

evdev_device_create calls evdev_configure_device with the device as a parameter; device->devname contains the name of the device and will contain Wacom for the devices we're interested in.

(gdb) break evdev_configure_device if ((int) strstr(device->devname, "acom"))
(gdb) c
Continuing.

Breakpoint 3, evdev_configure_device (device=0x55555566f000) at ../src/evdev.c:1775
1775		struct libevdev *evdev = device->evdev;
(gdb) p device->devname
$7 = 0x5555556365f0 "Wacom Intuos BT M Pad"
(gdb)

We're stepping through the function to see whether anything strange is happening. We would like it to be recognized as a tablet pad so that the correct dispatch methods are set up.

(gdb) p udev_tags
$8 = (EVDEV_UDEV_TAG_INPUT | EVDEV_UDEV_TAG_TABLET | EVDEV_UDEV_TAG_TABLET_PAD)

So far so good. Continuing down it does correctly go into the if on line 1852, and calls evdev_tablet_pad_create. The pad is thus identified as a tablet pad. Now we need to find out how events are read in libinput.

Apparently, libinput uses libevdev which is a wrapper library for evdev devices. So instead of reading the files in /dev/input like we did above, we can get the events from handles that we get through libevdev.

The function libevdev_next_event is called in four places in src/evdev.c, but the most promising one is in evdev_device_dispatch. We can set a conditional breakpoint here for when our event is coming through⁷:

(gdb) break evdev.c:1061 if (ev->type == 1 && ev->code == 0x100)
No source file named evdev.c.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (evdev.c:1061 if (ev->type == 1 && ev->code == 0x100)) pending.
(gdb) run debug-events
Starting program: /home/mht/src/libinput/build/libinput debug-events
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
process 48478 is executing new program: /home/mht/src/libinput/build/libinput-debug-events
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
-event1   DEVICE_ADDED     Power Button                      seat0 default group1  cap:k
-event0   DEVICE_ADDED     Power Button                      seat0 default group2  cap:k
-event20  DEVICE_ADDED     Logitech Performance MX           seat0 default group3  cap:p left scroll-nat scroll-button
-event21  DEVICE_ADDED     Kingsis Peripherals Evoluent VerticalMouse 4 seat0 default group4  cap:p left scroll-nat scroll-button
-event26  DEVICE_ADDED     HD Pro Webcam C920                seat0 default group5  cap:k
-event24  DEVICE_ADDED     Wacom Intuos BT M Pen             seat0 default group6  cap:T  size 216x135mm
-event25  DEVICE_ADDED     Wacom Intuos BT M Pad             seat0 default group6  cap:P buttons:4 strips:0 rings:0 mode groups:1
-event17  DEVICE_ADDED     ZSA Ergodox EZ                    seat0 default group7  cap:k
-event18  DEVICE_ADDED     ZSA Ergodox EZ Mouse              seat0 default group7  cap:p left scroll-nat scroll-button
-event19  DEVICE_ADDED     ZSA Ergodox EZ System Control     seat0 default group7  cap:k
-event22  DEVICE_ADDED     ZSA Ergodox EZ Consumer Control   seat0 default group7  cap:kp scroll-nat
-event23  DEVICE_ADDED     ZSA Ergodox EZ Keyboard           seat0 default group7  cap:k

Breakpoint 1, evdev_device_dispatch (data=0x5555556718e0) at ../src/evdev.c:1061
1061			if (rc == LIBEVDEV_READ_STATUS_SYNC) {
(gdb) n
1075			} else if (rc == LIBEVDEV_READ_STATUS_SUCCESS) {
(gdb) n
1076				if (!once) {
(gdb) n
1077					evdev_note_time_delay(device, &ev);
(gdb) n
1078					once = true;
(gdb) n
1080				evdev_device_dispatch_one(device, &ev);
(gdb) s
evdev_device_dispatch_one (device=0x5555556718e0, ev=0x7fffffffdf60) at ../src/evdev.c:989
989	{
(gdb) n
990		if (!device->mtdev) {
(gdb) n
991			evdev_process_event(device, ev);
(gdb) s
evdev_process_event (device=0x5555556718e0, e=0x7fffffffdf60) at ../src/evdev.c:974
974		struct evdev_dispatch *dispatch = device->dispatch;
(gdb) n
975		uint64_t time = input_event_time(e);
(gdb) n
981		libinput_timer_flush(evdev_libinput_context(device), time);
(gdb) n
983		dispatch->interface->process(dispatch, device, e, time);
(gdb) s
pad_process (dispatch=0x555555674c00, device=0x5555556718e0, e=0x7fffffffdf60, time=33217800469) at ../src/evdev-tablet-pad.c:483
483		struct pad_dispatch *pad = pad_dispatch(dispatch);
(gdb) bt
#0  pad_process (dispatch=0x555555674c00, device=0x5555556718e0, e=0x7fffffffdf60, time=33217800469)
    at ../src/evdev-tablet-pad.c:483
#1  0x00007ffff7f7bd06 in evdev_process_event (device=0x5555556718e0, e=0x7fffffffdf60) at ../src/evdev.c:983
#2  0x00007ffff7f7bd4b in evdev_device_dispatch_one (device=0x5555556718e0, ev=0x7fffffffdf60) at ../src/evdev.c:991
#3  0x00007ffff7f7bfef in evdev_device_dispatch (data=0x5555556718e0) at ../src/evdev.c:1080
#4  0x00007ffff7f74f06 in libinput_dispatch (libinput=0x5555555773b0) at ../src/libinput.c:2125
#5  0x000055555555d1e0 in handle_and_print_events (li=0x5555555773b0) at ../tools/libinput-debug-events.c:827
#6  0x000055555555d6df in mainloop (li=0x5555555773b0) at ../tools/libinput-debug-events.c:953
#7  0x000055555555db1e in main (argc=1, argv=0x7fffffffe588) at ../tools/libinput-debug-events.c:1091
(gdb) list
478	pad_process(struct evdev_dispatch *dispatch,
479		    struct evdev_device *device,
480		    struct input_event *e,
481		    uint64_t time)
482	{
483		struct pad_dispatch *pad = pad_dispatch(dispatch);
484	
485		switch (e->type) {
486		case EV_ABS:
487			pad_process_absolute(pad, device, e, time);
(gdb) n
485		switch (e->type) {
(gdb) n
490			pad_process_key(pad, device, e, time);
(gdb) s
pad_process_key (pad=0x555555674c00, device=0x5555556718e0, e=0x7fffffffdf60, time=33217800469) at ../src/evdev-tablet-pad.c:332
332		uint32_t button = e->code;
(gdb) p e
$1 = (struct input_event *) 0x7fffffffdf60
(gdb) p *e
$2 = {time = {tv_sec = 33217, tv_usec = 800469}, type = 1, code = 256, value = 1}
(gdb) n
333		uint32_t is_press = e->value != 0;
(gdb) n
336		if (e->value == 2)
(gdb) n
339		pad_button_set_down(pad, button, is_press);
(gdb) s
pad_button_set_down (pad=0x555555674c00, button=256, is_down=true) at ../src/evdev-tablet-pad.c:88
88		struct button_state *state = &pad->button_state;
(gdb) list
83	static inline void
84	pad_button_set_down(struct pad_dispatch *pad,
85			    uint32_t button,
86			    bool is_down)
87	{
88		struct button_state *state = &pad->button_state;
89	
90		if (is_down) {
91			set_bit(state->bits, button);
92			pad_set_status(pad, PAD_BUTTONS_PRESSED);
(gdb) n
90		if (is_down) {
(gdb) n
91			set_bit(state->bits, button);
(gdb) n
92			pad_set_status(pad, PAD_BUTTONS_PRESSED);
(gdb) n
97	}

At this point it's apparent that the event is going through the code successfully. That is, the pad is recognized as a pad, and events from libevdev are correctly setting the state of the evdev_device in libinput. But, we're not getting any events, so where are events in libinput made?

Events in libinput

In order to find out how events in libinput works we can take a look in the only place so far that we know we've seen them: in libinput-debug-events⁸. libinput-debug-events.c has a main function that does some argument parsing and initialization, and then calls mainloop, which contains a do while loop, which polls a (the?) fd from libinput, and calls handle_and_print_events. This function gets all events with libinput_get_event, and has a giant switch to dispatch how the event should be printed. But, there is no default branch so, while unlikely, we might already hit a dead end. We jump back into gdb to test this (running with permissions; otherwise we get nothing!).

Oh, that's right, my keyboard is also sending events, and so instead of cing though the initialization events until I get to press the button on the pad and see what happens, I'm getting swamped with events for me pressing c⁹! Okay; we'll just add a default case to the switch, put a printf there, and break on it.

But oh, there's nothing coming through.

Okay, so let's see where libinput_get_event get its events. It's here, from the circular buffer libinput->events. Let's see where this is written to:

$ rg "events\[.*\]\s*=" src/
src/libinput.c
2971:	events[libinput->events_in] = event;

src/evdev-tablet.c
2000:	struct input_event events[2] = {

Oh, that's five lines over the function we just looked at. The magical function in question is libinput_post_event, and so, presumably, all events we're getting from libevdev should end up being sent to post_event, and our precious button click isn't. This function is also sparingly called:

$ rg "libinput_post_event" src/
src/libinput.c
335:libinput_post_event(struct libinput *libinput,
2217:	libinput_post_event(libinput, event);
2244:	libinput_post_event(device->seat->libinput, event);
2923:libinput_post_event(struct libinput *libinput,

The first occurrence is the prototype, the second is in post_base_event, the third is in post_device_event which sounds promising, and the fourth is the function itself. The problem is that both of these are static, and so they have quite a few callers in the file. We want to get closer to where the mapping from libevdev events to struct libinput_events are, so maybe it makes sense to go hunting for where the libinput_events are initialized. There's even a struct libinput_event_tablet_pad. Looking further, we find the tablet_pad_notify_button function which creates a libinput_event_tablet_pad, and sends to to post_device_event. This is probably where the button click should en up.

$ rg "tablet_pad_notify_button" src/
src/libinput.c
2700:tablet_pad_notify_button(struct libinput_device *device,

src/libinput-private.h
662:tablet_pad_notify_button(struct libinput_device *device,

src/evdev-tablet-pad.c
394:				tablet_pad_notify_button(base,

The call in evdev-tablet-pad.c comes from the function pad_notify_button_mask, which we'll breakpoint.

/h/m/s/l/build$ sudo gdb ./libinput-debug-events
[sudo] password for mht:
GNU gdb (GDB) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./libinput-debug-events...
(gdb) break pad_notify_button_mask
Function "pad_notify_button_mask" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (pad_notify_button_mask) pending.
(gdb) run
Starting program: /home/mht/src/libinput/build/libinput-debug-events
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
-event1   DEVICE_ADDED     Power Button                      seat0 default group1  cap:k
-event0   DEVICE_ADDED     Power Button                      seat0 default group2  cap:k
-event20  DEVICE_ADDED     Logitech Performance MX           seat0 default group3  cap:p left scroll-nat scroll-button
-event21  DEVICE_ADDED     Kingsis Peripherals Evoluent VerticalMouse 4 seat0 default group4  cap:p left scroll-nat scroll-button
-event26  DEVICE_ADDED     HD Pro Webcam C920                seat0 default group5  cap:k
-event24  DEVICE_ADDED     Wacom Intuos BT M Pen             seat0 default group6  cap:T  size 216x135mm
-event25  DEVICE_ADDED     Wacom Intuos BT M Pad             seat0 default group6  cap:P buttons:4 strips:0 rings:0 mode groups:1
-event17  DEVICE_ADDED     ZSA Ergodox EZ                    seat0 default group7  cap:k
-event18  DEVICE_ADDED     ZSA Ergodox EZ Mouse              seat0 default group7  cap:p left scroll-nat scroll-button
-event19  DEVICE_ADDED     ZSA Ergodox EZ System Control     seat0 default group7  cap:k
-event22  DEVICE_ADDED     ZSA Ergodox EZ Consumer Control   seat0 default group7  cap:kp scroll-nat
-event23  DEVICE_ADDED     ZSA Ergodox EZ Keyboard           seat0 default group7  cap:k

Breakpoint 1, pad_notify_button_mask (pad=0x555555672b30, device=0x55555566f400, time=42251078117, buttons=0x7fffffffddb0,
    state=LIBINPUT_BUTTON_STATE_PRESSED) at ../src/evdev-tablet-pad.c:365
365		struct libinput_device *base = &device->base;

And it fires! We're now here, and we would like to get to 394, or 402, which leads to tablet_pad_notify_key, which seems to be basically the same but different. We can set breakpoints and run, just in case we do end up in either:

(gdb) break 394
Breakpoint 2 at 0x7ffff7fa3b13: file ../src/evdev-tablet-pad.c, line 394.
(gdb) break 402
Breakpoint 3 at 0x7ffff7fa3b49: file ../src/evdev-tablet-pad.c, line 402.
(gdb) c
Continuing.

Breakpoint 1, pad_notify_button_mask (pad=0x555555672b30, device=0x55555566f400, time=42251378122, buttons=0x7fffffffddb0,
    state=LIBINPUT_BUTTON_STATE_RELEASED) at ../src/evdev-tablet-pad.c:365
365		struct libinput_device *base = &device->base;
(gdb) c
Continuing.

but we don't. We just get the second event, which is the key release. This is good, because we know exactly where our event gets lost.

The Last Missing Piece?

Now it's time to make sense of what's going on in that for loop. A quick gdb p of buttons->bits shows that they're mostly 0, so we'll put another breakpoint on line 378, just inside the while loop, which we also hit. Here are the local variables at that time:

(gdb) info locals
enabled = 21845
map = {value = 1432824624}
buttons_slice = 1 '\001'
base = 0x55555566f400
group = 0x555555672bd8
code = 256
i = 32

Note that enabled and map are garbage values so far. Stepping down to the first if changes things a little:

(gdb) info locals
enabled = 1
map = {value = 1432824624}
buttons_slice = 0 '\000'
base = 0x55555566f400
group = 0x555555672bd8
code = 257
i = 32

which means we're enabled. Good. Then we step past the map assignment

(gdb) p map
$4 = {value = 4294967295}

Now, since we know that we didn't get to either tablet_pad_notify function, or the abort call, map_is_unmapped will be true, which it is:

(gdb) n
387					continue;

How does one know whether a map is unmapped? Well,

#define map_is_unmapped(x_) ((x_).value == (uint32_t)-1)

and (uint32_t) -1 == 4294967295, which means that we need to rewind, and look at the line

map = pad->button_map[code - 1];

Looking further at what's in this map gives us a very important clue:

(gdb) p pad->button_map
$7 = {{value = 4294967295} <repeats 272 times>, {value = 0}, {value = 1}, {value = 4294967295}, {value = 4294967295}, {
    value = 4294967295}, {value = 2}, {value = 3}, {value = 4294967295} <repeats 489 times>}
(gdb) p pad->button_map[272]
$8 = {value = 0}
(gdb) p pad->button_map[273]
$9 = {value = 1}
(gdb) p pad->button_map[277]
$10 = {value = 2}
(gdb) p pad->button_map[278]
$11 = {value = 3}

These are exactly¹⁰ the numbers that we saw in the output from libwacom-list-local-devices, and that we bothered translating from hex to decimal! So the map is here, but we're skipping the iteration with the button click because we're mistaken in which index to look at. At this point, I really hoped this was a off-by-one thing and that code - 1 was 271. But,

(gdb) p code
$19 = 257

This is when clicking the first button, and clicking the second yields code == 258 and so on. In other words, it looks like we're off by 16 bits.

Let's get the overview: buttons->bits is a byte array of 96 bytes, and we're looking at which bits are set. To do this, we look at each byte (this is the for loop), and look at each bit in that byte until buttons_slice, the current byte, is 0 (this is the while loop). Our problem is that code, which is the bit offset in the whole byte array, is off by 16, i.e. two bytes. In other words, we need to find out where buttons->bits are set.

For at least one caller of the function, pad_notify_buttons, the buttons are set in pad_get_buttons_{pressed,released}. Looking at the stack trace (with bt in gdb) we see this is indeed the place where we come from. But the logic there is very simple, and leaves no room for errors such as this. In addition, pad->button_state has the same error:

(gdb) p pad->button_state
$16 = {bits = '\000' <repeats 32 times>, "\001", '\000' <repeats 62 times>}

We know this is wrong, since we are supposed to end up at 272. Well, according to libwacom anyways.

Back to libwacom

At this point I'm getting suspicious. How certain are we really that the mapping isn't set up wrong? After all, in the evdev events we read out from /dev/input/event25 was 0x0100 == BTN_0, and not 272 == 0x110 == BTN_LEFT, which I think is strangely well fitting for out problem. This would also make sense with libinput, since it presumably queries either the pad itself or libwacom to get the mapping, but there's a mismatch between the mapping and what's really being sent.

Let's push our current bug hunt onto our mental stack, and try to look at this map instead. Okay, so where does libwacom-list-local-devices get those numbers from? tools/list-local-devices.c contains a call to libwacom_print_device_description() in libwacom.c, which again calls print_buttons_for_device, which again calls print_button_evdev_codes, which calls libwacom_get_button_evdev_code. This function basically indexes into device->button_codes, which we now assume are wrong.

The button codes are set in this file, but by simple inspection it's not clear what's wrong, so we clone the repository, and build the tool ourselves with the build instructions from the wiki. We compile, start up gdb, set the breakpoints, but uh oh, SIGSEGV. I revert to the libwacom-1.3 tag, and now we don't segfault any more, but we get

Failed to initialize device database

which we solve by passing --database ../data when running. All is well, and the evdev codes are still 0x110 and counting. We run it in gdb:

(gdb) set args --database ../data
(gdb) break set_button_codes_from_heuristics if (device->model_name && ((int) strcmp(device->model_name, "CTL-6100WL")) == 0)
Breakpoint 5 at 0x7ffff7fc215e: file ../libwacom/libwacom-database.c, line 418.
(gdb) break set_button_codes_from_string if (device->model_name && ((int) strcmp(device->model_name, "CTL-6100WL")) == 0)
Breakpoint 6 at 0x7ffff7fc2024: file ../libwacom/libwacom-database.c, line 391.
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/mht/src/libwacom/builddir/libwacom-list-local-devices --database ../data
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".

Breakpoint 5, set_button_codes_from_heuristics (device=0x5555555acdd0) at ../libwacom/libwacom-database.c:418
418		for (i = 0; i < device->num_buttons; i++) {
(gdb) p *device
$16 = {name = 0x5555555a30c0 "Wacom Intuos BT M", model_name = 0x5555555acc90 "CTL-6100WL", width = 9, height = 5, match = 0,
  matches = 0x5555555acbb0, nmatches = 2, paired = 0x0, cls = WCLASS_BAMBOO, num_strips = 0, features = 1, integration_flags = 0,
  strips_num_modes = 0, ring_num_modes = 0, ring2_num_modes = 0, num_styli = 1, supported_styli = 0x5555555abbe0, num_buttons = 4,
  buttons = 0x5555555ad100, button_codes = 0x5555555ad090, num_leds = 0, status_leds = 0x0,
  layout = 0x555555586690 "../data/layouts/intuos-m-p3.svg", refcnt = 1}
(gdb) list
413	
414	static inline void
415	set_button_codes_from_heuristics(WacomDevice *device)
416	{
417		gint i;
418		for (i = 0; i < device->num_buttons; i++) {
419			if (device->cls == WCLASS_BAMBOO ||
420			    device->cls == WCLASS_GRAPHIRE) {
421				switch (i) {
422				case 0:
(gdb)

So we're in set_button_codes_from_heuristics, and since our device class is BAMBOO, although I don't know why that is, we default to BTN_LEFT as the first button, which is 0x110.

The Fix

I'm not really sure what the Class field does in this config, apart from heuristically setting key codes, but the fix that made it all word was simple: set the class to something else. I changed it on my system (the file was in /usr/share/libwacom/intuos-m-p3-wl.tablet), and submitted a PR upstream. All in all, this adventure took my entire Saturday, and the fix was one line, but I'm finally getting events when I'm pressing the buttons.

Now, how do I make these buttons to anything useful?

Thanks for reading.

Never mind that XournalPP doesn't have good (or decent, or any?) support for key rebinding. ↩
Interestingly, we don't even have to touch the pad itself; it seems to be sufficient for the tip of the pen to be pushed in for the action to be interpreted as drawing. ↩ ↩²
This is supported by the output of debug-events, which humorously states that the size of the pen is 216x135mm. ↩
two things were confusing: the fact that in the format string you need a space in between the byte count and the format, which was not explicitly stated, and that "squeezing" is by default on, which completely messes up the output if you are defining your own format. ↩
It would probably be easier to just write a small program and cast a pointer to an array with the data we read to the struct we suspect. ↩
I guess uapi is for user space, and that the directory is superfluous when you're not doing kernel dev? ↩
This represents most of my workflow in gdb: set breakpoints, n or s down wherever, list unless I have the source code right by, and p expressions; sometimes I'll also pt for when I don't know the types of things. It's... not great? But it's alright. I would like to have better integration in my text editor, that is, I don't really want to leave my text editor when debugging, since mentally I'm doing the same in both programs, but I haven't actually bothered seeing what's out there. My experience from trying out gdb integration in vim was pretty bad, and if it doesn't work well in vim, I don't see how semi-obscure editors stand a chance. ↩
At least this is a place where we know that some of the libevdev events are coming through and some are not. ↩
I'm sure there's a way of breaking conditionally based on the event type, and I browsed through the types a little bit with gdb, but couldn't find anything that seemed useful. When the alternative, adding a default branch to a switch in a codebase I already had cloned and build, was so simple, it makes sense to do, despite not really being what I wanted to do. ↩
Had this been in a textbook I would think "yeah sure, that's reeeaaallly convenient how that minor thing we did way back when turned out to be useful.", but I promise, I did not go back and added the conversion after the fact! ↩

Writing a JPEG decoder in Rust - Part 1: Background

2016-08-05T13:12:00+02:00

In the past months I have spent the evenings and weekends on a little project:a JPEG decoder and encoder, written in Rust.

First, I should drop a little disclaimer: at the time I'm writing this post, I have successfully decoded multiple test images, but these are fairly standard type images, so more exotic and advanced parts of the JPEG and JFIF standard are yet to be implemented properly, or even at all. Therefore, current program design decisions, as well as explanations of formats and techniques, may be executed poorly, as they are based on what I currently know, and the subset of functionality I have implemented. Hopefully, I am not too far off.

When I started working on this project, I had not decided how this post should be. Should it be a step-by-step kind of guide, or more of a writeup of a working program? Initially, I wanted the former, but I changed my mind as I was progressing, because I struggeled with understanding how the decoding process worked, which lead to strange design decisions, commits going back and fourth, and weird naming conventions. I do not believe this process would make a pleasant reading experience for anyone. Instead I want to write a post on how it has turned out so far. This first post will cover the background needed to understand this mess that is JPEG.

Why?

But first, why am I writing this? The choice of writing a JPEG encoder/decoder is somewhat arbitrary. In fact, if I knew what I know now, I think I would have chosen a different format than JPEG. This is mostly because this was supposed to be a weekend, or maybe weeklong, project; as I'm writing this, there is approximately five weeks since the initial git commit¹.

The choice of using Rust, however, is not arbitrary. I have found Rust to be an expressive, performant, and fun language to use; it allows high level abstractions and patterns, while still keeping that raw, low-level feeling you get from e.g C. I guess the project of writing a JPEG encoder/decoder was a great excuse for writing a non-trivial program in Rust.

And as always, I want to improve my writing. Writing is hard! Feedback is of course very welcome, even though there is no comment section here.

Let us look a little closer on JPEG.

JPEG?

JPEG is a method for lossy image compression; it is not², as you might believe, an image format. There are however image formats which use JPEG, such as JPEG/Exif --- this is what your digital camera spits out (unless you are only capturing RAW) --- and JPEG/JFIF --- the file format we will work with. JFIF³ is the most common format of transmitting images on the web⁴. This is great knowledge to pull out when someone says "JPEG file" or something similiar: "Uhmm, actually ... "⁵.

Let's see how a JFIF file is laid out, and then what the JPEG data format looks like.

The JFIF Part

Apart from actual image data, the JFIF file contains data such as image dimensions, comment, different tables, and more. A JFIF file consists of segments. Each segment contains a marker, a length, and data. The marker is two bytes, and is used to identify the segment. The length is two bytes, and specifies how long the segment is, excluding the marker bytes, but including the length bytes. The data fills the rest of the segment, according to the length.

For instance, the marker for "Comment" is 0xfffe, making the bytes for specifying "Hello, World!" as a comment:

ff fe 00 10 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 00

The format of the data varies with the different segment types. When implementing the decoder⁶, I simply read parts of the JPEG specification, which is available here (pdf). See page 38 (marked as page 34) for an overview of the format. The format of different markers follow for the next 13 pages.

There is no need to go into too much detail just yet. We can simply start with saying that an image consists of a frame, which again consists of one or more scans. Each scan contains one or more entropy-coded segments (ECS). So far, the images I have tested have contained one frame, one scan, and one ECS, which is a good starting point. Complexity does not always have to be payed for up front.

The JPEG Part

Let's have a look at the image data --- this is after all the most exciting part.

Encoding from 10000 meters

In essence, this is how JPEG encoding works:⁷

Split the image into 8x8 blocks. In case the image is not perfectly divided into 8x8 blocks, extend the borders of the image such that it is
Convert the block to frequency domain, using the Discrete Cosine Transform
Reorder the block to a "zigzag" ordering
Quantize frequency coefficients
Encode the block using huffman coding

In order to understand why we are doing this, we need to take a closer look at frequency transforms, and huffman coding.

Discrete Cosine Transform

The Discrete Cosine Transform(DCT) creates a representation of a signal as a sum of cosines of different amplitude and frequency. I do not feel I am able to explain this good enough, but you are encouraged to look at the Wikipedia page for Fourier Series, which contains some great animation and images; for our use, DCT is basically the same thing. If this went over your head, do not worry. Understanding exactly how it works is not as important as understanding why we would like to use it.

So what does DCT have to do with images? Instead of looking at an image as a grid of pixels, we can interpret the image as a two dimentional signal. For instance, say we have a grayscale image of size 8x1 px, and that the image data, where each pixel is a number between 0 (black) and 255 (white), looks like this:

[0, 32, 64, 96, 128, 160, 192, 224]

The image looks like this (scaled by 3200%):

We can interpret this image as mathematical function; in this case it is rather easy: $f(x,y) = 32x$. So how does the DCT of this signal look? Like this:

[896.0, -583.1, 0.0, -61.0, 0.0, -18.2, 0.0, -4.6]

If we go backwards, using the inverse DCT, we get the exact same image data as the ones we fed into the DCT; no information is lost. This does not look too impressive; sure --- we got some zeroes here and there, but there seems to still be some data which needs to be saved. What if we increase our image to be 8x8, instead of 8x1?

Signal
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0
   0.0   32.0   64.0   96.0  128.0  160.0  192.0  224.0

After DCT
 896.0 -583.1    0.0  -61.0    0.0  -18.2    0.0   -4.6
  -0.0    0.0    0.0   -0.0    0.0    0.0   -0.0    0.0
   0.0   -0.0   -0.0   -0.0    0.0    0.0    0.0    0.0
  -0.0   -0.0   -0.0   -0.0    0.0    0.0    0.0    0.0
   0.0   -0.0   -0.0    0.0    0.0    0.0    0.0    0.0
  -0.0    0.0   -0.0    0.0    0.0    0.0   -0.0    0.0
   0.0   -0.0   -0.0   -0.0   -0.0   -0.0    0.0    0.0
  -0.0    0.0   -0.0   -0.0    0.0    0.0    0.0   -0.0

Now it is clear that for some images --- or image blocks --- there are great opportunities to minimize the size of the encoded data. There are even more tricks to this, such as quantization, which we will have a look at later.

Quantization

Quantization is the only lossy step in our 5 steps, and so it is this step that controls the level of compression. Quantization is the process of mapping a set of values to a smaller set, and it is done with a quantization matrix, which is the same size as an image block: 8x8. Predefined matrices are suggested in the JPEG standard⁸, but each image encodes its own quantization matrices.

Quantization is applied after DCT, and is used to reduce the information of the coefficients in every image block. By making almost equal numbers equal, we decrease the number of unique numbers, and increase the compression ratio enabled by huffman conding.

Let us predend blocks are 2x2 instead of 8x8. Say the data we got back from the DTC is

$$G = \begin{bmatrix} 230 & 68 \\ 99 & 72 \end{bmatrix} $$

A quantization matrix could be

$$Q = \begin{bmatrix} 10 & 11 \\ 12 & 13 \end{bmatrix} $$

We take each component of $G$ and divide it by its corresponding component of $Q$, and round the numbers to the nearest integer:

$$B = \begin{bmatrix} \frac{230}{10} & \frac{68}{11}\\ \frac{99}{12} & \frac{72}{13} \end{bmatrix} = \begin{bmatrix} 23 & 6 \\ 8 & 6 \end{bmatrix} $$

And we are done. The data from $B$ is what is passed down to the next step.

We can also see that if we are going the other way, taking the inner product, with the same quantization matrix, we get

$$G' = \begin{bmatrix} 230 & 66 \\ 96 & 78 \end{bmatrix} $$

Huffman Coding

So we found a way to represent our image block with a lot of similar numbers --- 0 in our case. How can we take advantage of this? If we use 32 bit integers, the size of the number is 32 bits, no matter the number! Or is it?

Huffman Coding is a coding scheme used for lossless coding. The gist of the scheme is to code numbers as bit strings, with the property that no code can be the prefix of another code (if 01 is a code, 011 can not be a code), and that frequent numbers should have a shorter code than less frequent numbers.

Say we want to encode this (totally random) list of bytes

[2, 7, 1, 8, 2, 8, 1, 8, 2, 8]

We can count the number of occurences of each number in the list,

Number | Occurences
-------------------
   1   |      2
   2   |      3
   7   |      1
   8   |      4

and create the codes⁹,

Number | Code
-------------------
   8   |  0
   2   |  10
   1   |  111
   7   |  110

making our data

// data
 2   7   1 8  2 8   1 8  2 8
// coded data
10 110 111 0 10 0 111 0 10 0
// squash together
1011011101001110100
// look at each byte
10110111 01001110 100?????
// pad with 1
10110111 01001110 10011111
// which is the same as
183 78 159

Effectively, we coded 10 bytes as 3 bytes¹⁰ ¹¹, by taking advanage of the fact that some numbers are more frequent than others.

Back to the 10000 meter view

Now that we have some control over roughly what happens, we can take another look at the whole procedure, with some additional steps.

So, we split the image into blocks, and each block is more or less processed by itself. This is done because of the principle of locality: pixels within a block are likely to be somewhat similar.

Next, we transform the image into frequency domain, so we can make it easier to encode the numbers. Then, we reorder the blocks, with the goal of getting long runs of zeroes. We also take the inner division (element-by-element division) of our frequencies and a quantization matrix, in order to make the coefficients somewhat similar. This is the lossy part. At last, we use huffman coding to write out the image data, taking advantage of the fact that we are writing lots of similiar numbers.

And that is pretty much it --- from a 10000 meters view.

Conclusion

We have looked at how a JFIF file looks like, and roughly how the image data is encoded. In Part 2 we will start implementing this, in actual, runnable, (hopefully) working, Rust code.

If you found some things confusing, or simply want better explanations than I can give, check of the Wiki page for JPEG; it explains the encoding process using a 8x8 sample block.

Read part 2 here

Errata

Quantization was listed before zigzagging the data. This was the wrong way around.

Although I have not worked on the project every day --- in fact, as of this writing, there are commits from only 14 distinct days: git log --format="%cd" --date=short | uniq | wc -l. ↩
As /u/AlyoshaV points out, pure JPEG files do exist, but are of limited use, because decoders have to guess how to decode it. I stand corrected! ↩
Wikipedia writes JPEG/JFIF, but the J in JFIF stands for JPEG. Not sure what to make of this, so I will call it JFIF. ↩
Ok, that sentence was nearly copied from Wikipeida. No matter --- here is the source ↩
Please don't. ↩
Which is still an ongoing process, mind you! ↩
We are taking a little shortcut here, ignoring things as multiple channels, color conversion and chroma-subsampling. ↩
See Annex K in the JPEG specification for two suggested quantization matrices. ↩
The algorithm of devising the bit strings can be found on the Wiki page¹². Check it out! ↩
Note that this introduces a little bit of a challenge: how do you know when you are done reading? Two possible solutions are to either know ahead of time how many elements we are to read, or we could encode a special byte, such as 0xff or 0x00 to mark end-of-data. ↩
Also note how it is posisble to decode the bit stream, since no code is a prefix of another code; we can simply check is 1 a code? No. Is 10 a code? Yes! Got 2. Now, the bit stream is 1101110100111010011111, and we can start again. Is 1 a code? No. Is 11 a code? No. etc. Alternatively, one can somehow know the length of the next code. ↩
I used online generators to get the code listed. It is worth noting that different generators gave different codes, so either the generators I found are not, strictly speaking, correct, or there are some ambiguity here. ↩

Hello World

2016-01-25T00:23:46+01:00

My last attempt at writing a blog didn't go so well. It turns out that finding the time to write a post that is even slightly interesting can be a challenge. But as we've just entered 2016, I'm trying once again.This first post will be a super fast rundown of how this blog was made.

I didn't want to spend too much time on creating this web page, as web really isn't my thing. Therefore, I went with the static site generator Hugo, which is written in golang. The only reason I chose Hugo over, say Jekyll or Hexo, is that it was easiest to set up. I tried to use Jekyll for a while, as my previous blog attempt used Jekyll, but after messing around with Ruby for 30 minutes trying to get something like virtualenv to work, I gave up.

The sites design is very simple, and somewhat inspired from this, although my css file ended up around 50 lines. It is also worth noting that there is no JavaScript here --- at least on the pages that don't have any math. (I did consider figuring out how to generate images from latex, but I think using MathJax is a better alternative.) Of course, in the year of Let's Encrypt, the site is https only.

Lastly, my posts gets from my laptop to the webserver by git. I have set up a remote repository on theserver, with a post-receive hook, which runs Hugo, and moves some files around. The script is really short, and probably has faults.

Thats about it. Nothing more, nothing less. Hopefully, I'll get around actually writing posts this time --- I'm targeting at least one new post each month.

mht

Stable Timings

2018-02-05T10:27:54+01:00

It was friday afternoon and I was working on a proof-of-concept garbagecollector for Rust. The general idea of the collector is to have threads register their roots --- pointers to memory that the collector cares about --- and every once in a while a thread collects these roots, fork off a new process, find memory that is no longer reachable from any root, and return these pointers back to the parent process using mmap.

The initial implementation was obviously not well tuned for performance. In order to get a rough picture of where my system spent its cycles, I inserted calls to time::precise_time_ns before and after certain blocks, and wrote out the difference at the end. The reported timings looked like this:

consolidate wait for signals 0ms
               collect roots 0ms
                        fork 0ms
              wait for child 67.108864ms
                   read ptrs 0ms
                   call free 201.3266ms

consolidate wait for signals 0ms
               collect roots 0ms
                        fork 0ms
              wait for child 67.108864ms
                   read ptrs 0ms
                   call free 134.21773ms

consolidate wait for signals 0ms
               collect roots 0ms
                        fork 0ms
              wait for child 67.108864ms
                   read ptrs 0ms
                   call free 134.21773ms

Strange. Don't mind the fact that several of the timings are 0 --- zero --- ms, but wait for child is the exact same number across the three iterations shown. And call free is the exact same the last two iterations. And what about the fact that 67.1 + 134.2 = 201.3?? Something is obviously wrong here. Maybe there is some weird OS synchronization going on with fork and/or mmap¹? Or maybe the numbers just are very stable since I'm using a test program that does the exact same thing a billion times? (This could be the case, as the number of pointers returned were in fact the exact same number on all iterations)

It was friday afternoon so I commited and went home.

The day after, I found myself programming on a side project, inserting the very same timing calls. Again, I got back very strange numbers. It was then I realized what was happening.

My code looked roughly like this:

let t0 = time::precise_time_ns();
do_some_work();
let t1 = time::precise_time_ns();
....

println!("do_some_work: {}ns", (t1 as f32 - t0 as f32) / 1_000_000.0);

time::precise_time_ns returns a u64. As I'm writing this, it returns 659_950_597_875_582. That's a large number. That number is way larger than the largest number a f32 can properly represent. By just casting back and forth, we clearly see this:

extern crate time;
fn main() {
    let t = time::precise_time_ns();
    println!("{}", t);                // prints 660_149_119_845_010
    println!("{}", t as f32 as u64);  // prints 660_149_089_861_632
}

So since I casted to f32 before doing the subtraction, time differences in the order of 29_983_378ns², 29ms, would simply disappear, causing the reported difference to be zero. And the larger differences would be truncated to the highest multiple of ~33. This explains it all!

Now, why did I cast before subtracting? I don't know. Why did I use f32 instead of f64 (which in this case would be sufficient)? I don't know. It was just one of those times when you write quick and dirty code without really thinking about the details of what you're doing. Luckily, this time the bug was fairly easy to spot.

I mean, it could obviously not be anything wrong with my code! ↩
Finding the exact bound is left as an exercise to the reader. ↩

A Neat Approximation Algorithm

2021-12-05T16:21:19+01:00

The algorithm and notation is based on section 9.3 of Williamson and Shmoys "The Design of Approximation Algorithms".

Let $G=(V,E)$ be a graph. A spanning tree $T$ of $G$ is a sub graph $T=(V, E')$ such that $T$ is connected and acyclic. Visually, you can think of it as a network that touches all vertices and doesn't contain and loops. Computing a spanning tree, even a minimal one if we have weights on the edges that we want to reduce, is easy¹.

Maybe we would like to ensure that no vertex is overloaded by having too many edges adjacent to it in $T$. We can look for spanning trees such that the maximum degree $\Delta(T)$ is bounded by some input number $k$. Is this problem difficult?

Yes, there is no polynomial algorithm that solves this in the general case². Consider what happens if we let $k=2$: now we are asking whether there is a Hamiltonian path in $G$, which is a well known NP-hard problem. To see why this is the same, note that if $\Delta = 2$ then the spanning tree can never branch out, since every vertex has at most one input and one output, and since it is connected we touch every vertex. Thus we get a simple path that touches every vertex exactly once --- a Hamiltonian path.

Okay, so we can't solve it exactly in polynomial time, but can we approximate it? It turns out yes, and with a surprising bound. Let $\text{OPT}$ denote the minimal maximum³ vertex degree in a spanning tree in a given graph. There is a polynomial algorithm that finds a spanning tree $T$ such that $\Delta(T)\leq\text{OPT}+1$. That is, it will either output a spanning tree that is optimal, or its maximum degree will be $1$ above the optimal.

We'll start off by stating a condition to have a tree with the bound above, then describe the algorithm, and then prove some claims.

Optimality Condition

First some notation. We are given a graph $G$ and consider a spanning tree $T$. Let $k=\Delta(T)$, $D_k$ be a non-empty set of vertices of degree $k$ in $T$, and $D_{k-1}$ be any set of vertices of degree $k-1$:

$$D_k \subseteq \{v\in V \mid d_T(v) = k \},\quad D_k\neq \emptyset$$ $$D_{k-1} \subseteq \{v\in V \mid d_T(v) = k-1 \}$$

$D_k$ are all of the bad vertices since they have the highest degree, and we want low degree vertices. $D_{k-1}$ are fine for now, but we need to be careful with them. We can't add any edges to these vertices, since then they'd be bad too.

Next, let $F$ be all edges in $T$ that touches either $D_k$ or $D_{k-1}$ (or both). These are the bad edges that we want to get rid of, because they are adjacent to the overloaded vertices in $D_k$ and $D_{k-1}$.

Last, let $C$ denote all connected components in $T$ if we were to remove all edges in $F$ from $T$. Note that we have exactly $|F|+1$ connected components: we start off by one component, since $T$ is connected, and for every edge we remove we split a component into two.

Here comes the condition:

If each edge in $G$ that connects distinct components in $C$ has at least one endpoint in $C_k\cup C_{k-1}$ then $\Delta(T)\leq\text{OPT}+1$.

A proof is at the bottom of the post (it is not very involved). This condition is almost all we need.

The High Level Picture

First a note on spanning trees. Since they span the entire graph and are acyclic, if we add a single edge to a spanning tree we will get a cycle⁴. Further, we can then remove any edge in the cycle and still have a spanning tree. At a high level, the operation we will do is exactly this: insert an edge adjacent to good vertices to make a cycle that involves some of the bad vertices, and then remove an edge that is adjacent to a bad vertex, making the total badness less.

Let's have some figures⁵ to make things a little more concrete.

Say this is the graph $G$ with some spanning tree $T$ marked with bold edges. In the spanning tree, $a$ and $f$ are the bad vertices since $d_T(f)=4$ and $d_T(a)=3$. This means that we can choose⁶ $D_k=\{f\}$ and $D_{k-1}=\{a\}$. $F$ is all edges touching $a$ and $f$. Removing $F$ from the graph leaves us with the following graph, still with the edges from the spanning tree highlighted.

In this example, $C$ contains 8 components⁷: $C=\{ \{a\}, \{b\}, \{c, d\}, \{e\}, \{f\}, \{g, j, k\}, \{h\}, \{i\} \}$. Here the optimality condition above does not hold, since there are plenty of edges that connects components in $C$ that doesn't touch either $a$ or $f$; in fact, it just so happens that all the edges that are not spanning edges (i.e. bold) are connecting distinct components. This needs not be the case: imagine $(g,j)$ as an edge.

The idea of the algorithm is to see that the component $\{c,d\}$ and $\{g,j,k\}$ and be connected through the edge $(d,g)$⁸ (in blue below), and that this opens up the possibility of removing either $(b,f)$ or $(f,j)$ (in red below), which reduces the degree of $f$.

The resulting tree still spans the graph, and now we have reduced the maximum degree in the tree from $4$ to $3$. You can imagine doing the same trick with replacing $(a,b)$ in favor of $(b,e)$ and $(f,i)$ in favor of $(i,j)$. The resulting spanning tree would be optimal since its maximum degree is 2.

The Algorithm

The algorithm iterates until the optimality condition above is true. Let $k=\Delta(T)$ be the maximal degree of the spanning tree. In each step we aim to reduce one vertex of degree $k$ to $k-1$. If we reduce all such vertices we'll set $k=k-1$ and continue, aiming to lower the new $k$ even further. Initialize $D_k$ and $D_{k-1}$ to be all⁹ vertices of degree $k$ and $k-1$ respectively in $T$, and compute $F$ and $C$ as defined above. Find an edge that connects distinct components of $C$. If no such edge exist, we have our optimality condition. Let $e=(u,v)$ be this edge, and look at the cycle we get when adding $e$ to $T$.

Case 1

If the cycle does not contain a vertex $x\in D_k$, we don't do the swap. Instead we make a note that if we want to reduce any of the vertices in the cycle that is also in $D_{k-1}$ we can do so though $e$. Further, we'll remove all of these vertices from $D_{k-1}$, and update $F$ and $C$ accordingly. Effectively what this removal does is to merge all¹⁰ components that were connected to the cycle.

We have not done any real work yet, but we have made the set $D_k\cup D_{k-1}$ smaller, so maybe next time we'll get lucky. The downside is that we can no longer trust $D_{k-1}$ to contain all vertices of degree $k-1$. We removed the vertices from the sets, but we didn't remove any edges from the tree $T$. However, with the edge $e$ we have said that if we need to reduce the degree of one of these vertices, we can do so with $e$. Let's call this edge $e_u$ for a vertex $u$.

Case 2

If the cycle does contain a vertex $x\in D_k$, we want to add $e$ to $T$ and remove either of the two edges in the cycle that is adjacent to $x$. Now, if the degree of both $u$ and $v$ is less than $k-1$ all is well, since we've reduced the degree of $x$ from $k$ to $k-1$, and increased the degrees of $u$ and $v$ from something to something smaller than $k-1$ to something smaller than $k$.

However, if either of them is of degree $k-1$ we cannot do this. WLOG¹¹ let $u$ be the problem vertex. Now the setup in Case 1 pays off. Since $u\notin D_{k-1}$, but $d(u)=k-1$ we know that we have had Case 1 with $u$, and that $u$ was in $D_{k-1}$ along a cycle, and that we removed it. We also made the note saying that we can reduce the degree of $u$ upon request, though the edge $e_u$. Now we can add $e_u$ and remove either of the two edges adjacent to $u$ in the cycle that was formed by adding $e_u$. This makes $d(u)=k-2$ and we still have a spanning tree. Then we add $e$ and remove an edge adjacent to $x$, which increases $d(u)$ back up to $k-1$, and decreases $d(x)$ to $k-1$. If $d(v)=k-1$ as well we do the same with it.

We have successfully improved our spanning tree, since the number of vertices of degree $k$ has been reduced. Repeat this from the start by setting $D_k$ and $D_{k-1}$ to be all vertices of degree $k$ and $k-1$ and recomputing $F$ and $C$, until we hit the optimality condition.

There is one catch though. When we are reducing $u$ we are adding in the edge $e_u$ to the tree. How do we know that this edge will not make either of its endpoints' degree too high? The answer is that we don't! These vertices might also be of degree $k-1$, but then they, too, will have a designated edge $e_w$ that we can add to reduce $w$. We may end up with a chain of reductions, but this chain has to eventually terminate. See below.

And that's it! Some bookkeeping, component queries and merging, and basic graph operations, and we're left with a minimum degree spanning tree that is at most one from the optimal case, which is NP-hard.

Proofs

Here are proofs for the optimality condition and the termination and soundness of the cascading chain of reduction.

Proof of optimality condition

For brevity I will write "the set" for $D_k\cup D_{k-1}$.

We first find a lower bound for $\text{OPT}$. We have seen that $T\setminus F$ contains exactly $|F|+1$ components (this is the size of $C$). When the condition holds this is also true for $G$, since there are no edges connecting distinct components that doesn't also touch the set that are left in $G\setminus F$. This implies that any spanning tree in $G$ will need $|F|$ edges to connect them all. Furthermore, if we look at all the vertices in the set we know that their average degree must be at least¹² $|F| / |D_k\cup D_{k-1}|$, since $F$ is exactly all edges touching those vertices. The maximum has to be at least as large as the average, and so

$$\left\lceil\frac{|F|}{|D_k\cup D_{k-1}|}\right\rceil \leq \text{OPT}.$$

Now we find a bound on $|F|$. If we sum up the degrees of all the vertices in our set we get $k|D_k|+(k-1)|D_{k-1}|$, since the $D$ sets are exactly vertices of that degree. This sum can be more than the number of edges in $F$, since edges internal to the set will be counted twice. However, we also know that $T$ is acyclic, and so there can be at most $|D_k\cup D_{k-1}|-1$ such edges¹³, so we can add up the degrees of the vertices in the set and subtract the edges internal to the set to not double count. This gives us a bound on the number of edges touching the set¹⁴:

$$\begin{align} &k|D_k|+(k-1)|D_{k-1}| - \left(|D_k\cup D_{k-1}|-1\right) \\ =\ \ &k|D_k|+(k-1)|D_{k-1}| - |D_k| - |D_{k-1}|+1 \leq |F| \end{align}$$

Now we can combine the two inequalities by inserting the bound for $|F|$ into the bound for $\text{OPT}$:

$$\begin{align} \text{OPT} &\geq \left\lceil\frac{k|D_k|+(k-1)|D_{k-1}| - |D_k| - |D_{k-1}|+1} {|D_k| + |D_{k-1}|}\right\rceil \\ &= \left\lceil\frac{k(|D_k|+|D_{k-1}|) - \left(|D_k| + |D_{k-1}|\right) - |D_{k-1}|+1} {|D_k| + |D_{k-1}|}\right\rceil \\ &= \left\lceil k - 1 - \frac{|D_{k-1}|-1} {|D_k| + |D_{k-1}|}\right\rceil \\ &\geq k-1 \end{align}$$

Recall that $k=\Delta(T)$, so $\Delta(T) \leq \text{OPT}+1$.

Proof of termination of the algorithm

Since each step is either reducing $D_k$ or $D_{k-1}$ each iteration in the algorithm will eventually terminate, and since $k=2$ is the best possible maximal degree (unless we only have 1 or 2 vertices) we cannot reduce forever.

The less obvious part of the algorithm to terminate is the reduction of a $k-1$ degree vertex. Recall that in reducing a bad vertex $u$ with $d(u)=k$ we had to add an edge $(v,w)$ to the graph where $d(v)=k-1$, and this was done by adding in a second edge $e_v$. The problem is that $e_v$ might again be adjacent to a vertex of degree $k-1$, which will also have to be reduced. Now we show that this procedure will terminate. We do so by induction on the iteration number $i$, showing that when we mark a node as reducible (Case 1) we can perform the later reductions as well.

$i=1$: In the first iteration we have not marked any nodes as reducible through some edge, and so this terminates.

$i=l$: Let $u$ be the node we want to reduce, and $e=(v,w)$ the edge we reduce it with. WLOG let $v$ be the node we need to reduce, and $j$ the iteration in which we marked $v$ as reducible. By induction we can reduce $v$ from degree $k-1$ to $k-2$, and the same is true for $w$. How do we know that reducing $v$ using the edge $e_v$ doesn't mess up our reduction of $u$ with $e$? Because in marking $v$ as reducible we joined the two components that $e_v$ connected, and so in iteration $i$, the edge $v_e$ is internal to some component. In fact, the entire cycle formed by adding $v_e$ to $T$ is in that component. Since $C$ is not affected by the reduction of $v$, the reduction of $u$ via $e$ is still valid.

The only thing left to consider is that there is no chain of reductions that all add an edge to some vertex $x$, so that $k \leq d(x)$ after all reductions are done. This cannot happen: in fact, in a single iteration the set of edges that we add to the tree when reducing vertices are pairwise disjoint, and so any vertex will at most have its degree bumped once. Since the vertices of degree $k-1$ are already taken care of, no other vertex will suddenly have degree $k$, and so we are guaranteed progress.

To show disjointedness, let $u$ be reduced by the edge $e = (v, w)$, and assume that $v$ also needs to be reduced. Recall that, by definition, when we decided that $v$ is reducible through edge $e_v=(x,y)$, $e_v$ connected two different components. Further, we merged the components of the cycle that was formed by adding the edge $e_v$ into a bigger component $C_v$, which contains both the two components that $e_v$ connected and $v$ itself. This is one of the components that $e$ connects, and it does so through $v$, so we know that $e$ and $e_v$ are disjoint. Now, if either $x$ or $y$ also needs to be reduced, say $x$, that will be through $e_x$, which will, by the same logic, be internal to the component $C_x$, which does not contain neither $u$, $v$, or $w$, so $e_x$ is also disjoint from both $e$ and $e_v$¹⁵.

Conclusion

We've seen an algorithm that finds a spanning tree that minimizes the maximum degree of the tree up to an error of 1. The original paper where this was taken from, link here, is called "Approximating the minimum degree spanning tree to within one from the optimal degree" by Fürer and Raghavachari. I didn't actually read the paper, just the book chapter mentioned in the beginning.

I thought the algorithm was especially neat due to the bound, since approximation algorithms often are multiplicative factors off the optimal solution, for instance with a factor of 2 or 1.5. It is interesting that finding a spanning tree with maximal degree 2 is NP-hard, but that there is a polynomial algorithm that will find a spanning tree of maximal degree of 2 or 3 if the graph is Hamiltonian.

I haven't tried to implement this, but from what I can tell it should not be too difficult. It would be interesting to see some statistics on the performance of this algorithm on a corpus of graphs that are known to be Hamiltonian. Things like:

How long do the reduction chains become?
How many iterations are required before we reach optimality?
How often do we get the optimal tree?
How does $|F|$ change during the lifetime of the algorithm?
How does $C$ change during the lifetime of the algorithm?

and probably many other thing. If you know of any work on this or have any ideas yourself, feel free to send it to my public inbox (plain text emails only).

Thanks for reading.

For instance with Prims or Kruskals algorithm. ↩
Unless $P=NP$, that is. ↩
That is, among all possible spanning tree, we are looking for the one where the maximal degree $\Delta$ is minimized. ↩
If you don't believe me, draw a spanning tree and try to insert an edge between any two vertices. ↩
I'd be interested in knowing how people make graphs for the web in some vector format. These are tikz converted to pngs using ImageMagick, and it's fine. ↩
To reach the optimality condition it is always better to have $D_k$ and $D_{k-1}$ be as large as possible, since then it touches more edges. ↩
Recall that we look at the components in $T$, and not in $G$. We only care about the bold edges. ↩
We could also have chosen $(c,g)$ to connect the two components. ↩
Note that in the optimality condition we said that they could be arbitrary subsets of these vertices. Now we choose all of them. ↩
The book said that it would merge the components of the endpoints of $e$, but I cannot see how it would not join other components that are attached to the cycle as well. It is also not listed in the errata. ↩
"Without loss of generality": if it was in fact $v$ and not $u$ you can just mentally swap them in the text. ↩
The "at least" comes from the fact that there might be edges internal to the set, i.e. connecting two high degree vertices. In that case we would have to count the edge twice to get the actual average. The proof doesn't need it though. ↩
This is just like how there are at most $n-1$ edges in a $n$ vertex graph without cycles. ↩
The equality holds since the two sets are disjoint. ↩
The logic here kind of follows from the $i=l$ step of the induction; I'm sure there's a nicer way of phrasing it though. ↩

Algorithm complexity

2014-09-23T12:00:00+01:00

Recently I've stumbeled upon a few blogposts and Internet discussions involving the complexity of popular algorihms and operations of well known data structures.In these posts, one usually can't miss the Big-O notation - including it's pitfalls, of which there are a few. Therefore I would like to try to clear things up.

The definition

Let's start head on. Wikipedia says:

Let $f(x)$ and $g(x)$ be two functions defined on some subset of the real numbers. One writes: $$f(x) = O(g(x)) \text{ as } x \to \infty$$ if and only if there is a positive constant $M$ such that for all sufficiently large values of $x$, $f(x)$ is at most $M$ multiplied by $g(x)$ in absolute value. That is, $f(x) = O(g(x))$ if and only if there exists a positive real number $M$ and a real number $x_0$ such that $$|f(x)|\leq M|g(x)| \text{ for all } x > x_0.$$

Pretty straight forward? No? Ok, let's try to break it down.

The basics

We have two functions, $f(n)$ and $g(n)$. $f(n)$ is the actual running time of our algorithm, and $g(n)$ is whats inside the $O()$. Lets set $f(n) = 2n^2+12n+61$, and $g(n) = n^2$ so we have some good ol' nubmers to look at. Now we could write:

$$f(n) = O(g(n)) \implies 2n^2+12n+61 = O(n^2).$$

If we simplyfy, our claim is that when $n$ gets large, the two functions are almost equal. This means that if we need $f(999999)$, a pretty good aproximation would be $g(999999) = 999998000001$.

Note the 'pretty'! $f(999999)$ is actually $ 2000008000051 $, which is twice as much. Now, look again at $f(n)$, and take a guess why it's twice, and not any other multiple. So far so good. This probably isn't news for anyone having seen the big-O notation before; however, remember that we simplyfied things a little.

The runtimes

When working with algorithms, one usualy is interested in getting the fastest. Why spend two seconds sorting a list of numbers when you could use only one second? Having now learnt the magic of big-O, we find a table of popular sorting algorimths, and their average case runtime; we laugh when we realize the author of the table has included multiple $O(n^2)$ algorithms, when we see multiple $O(n \log{n})$, and even some $O(n)$ algorithms. "Why would you even do something like that?" we say to ourselves, and shake our heads.

Later we decide to learn the almighty quicksort. We look up an implementation in our favorite language, but we are startled when the first lines (maybe) looks something like this:

def quicksort(array):
  if array.length < 10:
    insertionsort(array)
  # ...

We rush back to the runtime table, and finds insertion sort: $O(n^2)$?! Surely something must be wrong here; why else would one use a slower algorithm? We decide this implementation is far from optimal, and find another one. Which turns out to be exacly the same. So what is going one here? big-O promised us that $O(n\log{n})$ is better than $O(n^2)$, so how come quicksort wants to use a $O(n^2)$ algorithm? This is when our simplification comes back to bite us in the rear.

The catch

We said "If we simplyfy, our claim is that when $n$ gets large, the two functions are almost equal." So what about when $n$ isn't very large? Looking back to the Wikipedia definition, there was something about an $x_0$. It turns out Big-O got the definition of "large" covered; Large is if $x\gt x_0$. This doesn't say much, though. But here's the kicker: we get to choose $x_0$.

Consider the following graph:

If we choose $x_0=a$ we can't really say much; the graphs are switching between being on top and bottom. However, if we choose $x_0=b$ we can see that (for all we know, and which is probable) $f(x)$ is the larger function, and hence the one with the longer run time. In this case, $f(x)=x^2+8 = O(n^2)$ and $g(x) = 8x = O(n)$. Given that, we would say that $g(x)$ is the faster of the two. But what if the only possible values for $n$ is between $a$ and $b$? Then, clearly, $f(x)$ is faster, as seen from the graph, even though $O(n^2) \gt O(n)$.

Now we understand that the Big-O notation doesn't really say anything about actual runtime, just the runtime when your input is really large. This is exacly what happends in quicksort - the insertion sort algorithm, though pretty slow on large input, performs really well on smaller inputs. Even timsort, the standard sorting algorithm in Python, Java SE 7, and the Android platform, uses insertion sort¹.

What we've got so far

It's often easier to grasp a concept when you don't look at the general case, but rather a example. Let's say one of our functions have a runnning time of $f(n) = 2n^2 + 9n +120$. A start would be to set $f(n) = O(2n^2 + 8n + 120)$. Figuring out the asymptotic runitme of a function isn't excacly straight forward. The short story here is you can forget all parts without the highest exponential², which in this case is the $2n^2$ . Then we're left with $O(2n^2)$. We can also forget about all constants; now there's really nothing left for us to remove, as we've got $O(n^2)$.

How can we know this is correct? Let's just look at the following graph:

We see $M=3$, so $Mg(n)=3n^2$, and $f(n)=2n^2+8n+120$. We can see in the beginning, $f(n)$ is the slower one (remember, the one on top is the slower!), but $3n^2$ catches up pretty fast, and it only goes one way from there. From the definition, we now see that:

$$f(n) = 2n^2 + 9n + 120 = O(n^2)$$

because $3n^2$ was always larger than $f(n)$ from the intersection point.

Again, we don't know excacly where the graphs intersect, so just given the graphs we couldn't say what $x_0$ could be, but that doesn't matter. The intersection point could be a gazillion, and it would be OK. It could take the function a gazillion years to compute with the input $n=x_0$, but it wouldn't matter, because asymptotically $Mg(n)$ would be the larger function.

What does it all mean?

So we've seen a few cool graphs, and some math written in $\LaTeX$, but what does it all mean? Why should you care? Let's say you're writing a cool program with a function you know will be called a lot. Maybe it looks a little bit like this:

for (int i = 0; i < n; i++){
  for (int j = 0; j < n; j++){
    for (int k = 0; k < n; k++){
      // Lots of cool stuff
    }
  }
}

After what we've learned you could easily see that this code runs at least in $O(n^3)$, and being familiar with quite a few algorithms, you realize that's qutie a bit.³ You could now consider rewriting it, making it more efficient, even without ever running it! Sounds too good to be true? Well, it kind of is.

Remember that the actual running time and asymptotic running time is not the same thing. This means if this part of your code is crucial to your application, you should use a profiler to find out which of your algorithms that have the shortest actual running time. If it is not crucial, you should probably leave it be. Remember kids: premature optimization is very bad! Having said that, the asymptotic running time is usually a good indication.

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

--- Donald E. Knuth

Question time!

What about $\Omega()$ and $\Theta()$? : I have decided not to say anything about neither of those, because big-O is the one people usually run into, and it is definitely the most used - at least outside computer science circles (you could even say outside academia).

You said one shouldn't optimize without profiling anyway. Doesn't that make this whole mess kind of useless? : A lot of CS students jump to this conclusion when realizing you shouldn't go on an optimizing spree as soon as you have learned algrorithm analysis. And it isn't completely wrong; you probably did fine before reading this text, and you would probably continue to do so without all this. This is just another tool in your programmer toolbox. But the same way a musician should understand his instrument, a programmer should understand his algorithms and data structures - you wouldn't care to listen to a guitarist who had no idea where all the sound came from, would you?

Questions for you!

Time to see what you've learned.

When finding the runnning time of $f(n) = O(2n^2 + 8n + 120)$, I said you could ignore constants. Why?
How come I can say $2n^2 + 4 = O(n^4)$ and still be right?
How would you know which algorithm is faster: merge-sort or heap-sort, both of which are $O(n \log n)$?

Advent of Common Lisp, Day 5-9

2018-12-07T19:33:42+01:00

We continue our Common Lisp adventure!

Day 5

Part 1

It feels natural to first figure out how to find an reduce a pair of chars. A solution based on arrays might be easier to implement without having to loop through the input line multiple times, since we risk the situation that removing one pair of chars may make the new neighbors be reducable, like so:

abcdDCBA
 abcCBA
  abBA
   aA

With the array based solution we can just decrement the current index after reducing a char pair. This makes for plenty of problems: we first want to destructively remove a part of a string given two indices, and we also want a loop in which we can alter the iterating value; neither of which seems straight forward to do in CL. My first, non-working attempt looked like this:

(defun reducable (a b)
  (and (eq (char-downcase a) (char-downcase b))
       (not (eq a b))))

(defun reduce-chars (chars)
  (let ((end (- (length chars) 1)))
    (loop for i from 0 below (- (length chars) 1)
          when (or (> 0 i) (< end i) return ""
          when (reducable (char chars i) (char chars (+ i 1)))
            do (progn
                 (setf chars (remove-if #'(lambda (_x) t) chars :start i :end (+ i 2)))
                 (setf i (- i 2)))
          finally (return chars)))))

Having #'(lambda (_x) t) as the removal predicate is... not great. This was the first attempt in which I figured "OK, this might just work". However, it turns out that the termination check is not ran on each iteration in loop: it seems to be the case that (- (length chars) 1) is only evaluated once before the first iteration of the loop, as opposed to how for (int i = 0; i < length(foo); i++) works in eg. C. This makes for out-of-bounds indexing in the body.

As a note, there is a destructive version of remove called delete, which would seemingly allow me to skip the (setf stuff. This did not work, since delete for some reason would not update the length of the string, and I would end up with repeating the last two chars. That is, this happend:

abcCBA
abBABA
aABABA
BABABA

I eventually figured out a solution, and ended up with this:

(defun reduce-chars (chars)
  (loop for i from 0
        for end = (- (length chars) 2)
        when (or (> 0 i) (< end i)) return chars
        when (reducable (char chars i) (char chars (+ i 1)))
            do (progn
                 (setf chars (remove-if #'(lambda (_x) t) chars :start i :end (+ i 2)))
                 (setf i (- i 2)))
        finally (return chars)))

We run it on the input and ... we get 49998. Strange. After printing out i in the progn we see that the first index matches, and so i gets subtracted to -2, causing the loop to terminate on the next iteration. Ok, we fix this by adding a (max 0 ..) after subtracting 2. We run again, aaand .... wrong answer. The resulting string is slightly above 10000 chars, so looking at the output manually will probably take too much time. We try to run it twice on the same input, which should not change anything, and yet we get

* (length (reduce-chars (reduce-chars *input-5*)))
10768
* (length (reduce-chars (reduce-chars *input-5*)))
10766

A little thinking reveals the bug: since i is incremented after the loop body we should not max it to 0, but to -1. This fixes the bug, and solves part 1.

Part 2

The second part asks us to remove all occurences of one letter such that the reduced string is as short as possible. The simplest way to do this is just to try out all possible letter choices:

(defun day-5/2 (input)
  (let ((all-chars (remove-duplicates input :key #'char-downcase )))
    (loop for c across all-chars
          for inp = (remove-if #'(lambda (ch) (or (eq c ch) (eq (char-upcase c) ch))) input)
          minimizing (length (reduce-chars inp)) into l
          finally (return l))))

There's definitely room for optimization here:

 * (time (day-5/2 *input-5*))
Evaluation took:
  104.868 seconds of real time
  104.697084 seconds of total run time (104.137960 user, 0.559124 system)
  [ Run times consist of 2.633 seconds GC time, and 102.065 seconds non-GC time. ]
  99.84% CPU
  304,535,002,262 processor cycles
  57,568,946,208 bytes consed
  
6538

By using a list instead of an array, and swinging cdr to cdddr when we want to remove a pair of chars we significantly cut down on running time (this is part 1 again):

* (time (length (reduce-chars *input-5*)))
Evaluation took:
  4.307 seconds of real time
  4.305834 seconds of total run time (4.289201 user, 0.016633 system)
  [ Run times consist of 0.112 seconds GC time, and 4.194 seconds non-GC time. ]
  99.98% CPU
  12,506,949,589 processor cycles
  2,383,071,584 bytes consed
  
10766
* (time (day-5/1-list *input-5*))
Evaluation took:
  0.106 seconds of real time
  0.106221 seconds of total run time (0.106221 user, 0.000000 system)
  100.00% CPU
  309,274,661 processor cycles
  819,200 bytes consed
  
10766

If we now use this solution for the second part, we get the following running time:

* (time (day-5/2-list *input-5*))
Evaluation took:
  2.496 seconds of real time
  2.494232 seconds of total run time (2.494232 user, 0.000000 system)
  99.92% CPU
  7,248,068,477 processor cycles
  28,657,200 bytes consed
  
6538

Not bad!

Day 6

There might be fancy tricks to this task, but we'll try out the simplest approach first: we make the grid (and hope that the input isn't too big!), loop over each cell, find the closest point, and stores that in the cell. Afterwards we go through the borders and find all of the areas that touch it, because these areas will have inifinite area. At last we just sum up the number of cells for each area, and choose the maximum, excluding the infinite ones.

This time I would like to try out propper top-down programming. This will be our final function:

(defun day-6/1 (input)
  (let* ((points (parse-points input))
         (grid (make-grid points)))
    (mark-closest grid points)
    (let* ((infinites (get-border-areas grid))
           (area-sizes (count-area-sizes grid))
           (valids (exclude-infinites area-sizes infinites))
           (second (car (sort valids #'second)))))))

Now it's just a matter of filling in the blanks.

First with input parsing. The lines are in the format ", ", but regex seemes like overkill, so I figured I'd try out split-sequence. I could, however, not get it to install, so instead I went with the simpler solution:

(defstruct point x y id)
(defparameter *point-count* 0)
(defun pt (x y) (make-point :x x :y y :id (incf *point-count*)))

(defun parse-points (lines)
  (loop for line in lines
        for i = (search ", " line)
        collect (make-point :x (parse-integer (subseq line 0 i))
                            :y (parse-integer (subseq line (+ 2))))))

With a flash of clairvoyance we realize that we could use an id for all areas, in addition to their coordinates.

Creating the grid was slightly worse, since make-array didn't want to take my dynamic sizes as dimension arguments.

Quick sidenote, after entering a function name wrong, Slime prompted me to enter another expression as the function. I tried this, and Emacs froze. Having spent 6 years in Vim, I can not recall any specific time it has crashed (I know it has, but I can't remember it). After restarting I had to install Slime again (I am yet to find out how to properly install stuff with Emacs), and after installing it again, I run into problems with No Lisp subprocess; see variable 'inferior-lisp-buffer', despite Slime and Swank and whatever running just fine. Restarting Emacs, yet again, and installing Slime again, seems to fix it.

After not figuring out how to make arrays without a fixed size, since I couldn't make a 2d array of a dynamic size, I realized I could make it work using make-array and loop:

(defun make-grid (points)
  (let* ((max-x (+ 1 (reduce #'max (mapcar #'point-x points))))
         (max-y (+ 1 (reduce #'max (mapcar #'point-y points))))
         (grid (make-array max-y)))
    (loop for y from 0 below max-y do
      (setf (aref grid y) (make-array max-x)))
    grid))

We're adding 1 to max-{x,y} so that we can index with all coordinates in the input list.

Calculating the closest point for each cell in the grid is done with nested loops. The logic for finding the best got somewhat messy, but it should work.

(defun manhattan (a b)
  (+ (abs (- (point-x a) (point-x b)))
     (abs (- (point-y a) (point-y b)))))

(defun mark-closest (grid points)
  (let ((mx (length (aref grid 0)))
        (my (length grid)))
    (loop for y from 0 below my do
      (loop for x from 0 below mx
        for c = (make-point :x x :y y)
            do (let* ((dists (loop for p in points collect (list (manhattan c p) p)))
                      (sorted (sort dists #'< :key #'car ))
                      (best (car sorted))
                      (tie (eq (first (first sorted)) (first (second sorted)))))
                 (setf (aref (aref grid y) x)
                       (if tie 0 (point-id (second best)))))))
    grid))

In order to get the areas touching the border, we just loop through the four borders, collect the numbers we see, and dedup at the end.

(defun get-border-areas (grid)
  (let ((mx (length (aref grid 0)))
        (my (length grid)))
    (remove-duplicates (append
                        (loop for y from 0 below my collect (aref (aref grid y) 0))
                        (loop for y from 0 below my collect (aref (aref grid y) (- mx 1)))
                        (loop for x from 0 below mx collect (aref (aref grid 0) x))
                        (loop for x from 0 below mx collect (aref (aref grid (- my 1)) x))))))

For counting the sizes of the areas we could have used a hashmap, but we might as well use the fact that all areas are numbered between 0 and the number of areas. Then we can make an array of counts, loop over the grid, and count up. When done we return (id, count) pairs for all areas that were non-null.

(defun count-area-sizes (grid num-areas)
  (let ((arr (make-array num-areas))
        (mx (length (aref grid 0)))
        (my (length grid)))
    (loop for y from 0 below my do
      (loop for x from 0 below mx
        for area = (aref (aref grid y) x)
        do (incf (aref arr area))))
    (loop for i from 0 below num-areas
          when (< 0 (aref arr i)) collect (list i (aref arr i)))))

Removing the infinite area areas from the list of (id, area) tuples didn't have to be its own function, but we've come so far with the top-down mindset, so let's overuse it a little.

(defun exclude-infinities (area-sizes infinities)
  (remove-if #'(lambda (l) (find (car l) infinities)) area-sizes))

This is the last function we needed to implement day-6/1. Having all the helper functions, we just need to make some small adjustments to the function, and we're good to go.

(defun day-6/1 (input)
  (setf *point-count* 0)
  (let* ((points (parse-points input))
         (num-areas (1+ (length points)))
         (grid (make-grid points)))
    (mark-closest grid points)
    (let* ((infinites (get-border-areas grid))
           (area-sizes (count-area-sizes grid num-areas))
           (valids (exclude-infinities area-sizes infinites))
           (max-area (car (sort valids #'> :key #'second))))
      (second max-area))))

Part 2

The second part in comparison require very little code. We simply do the same thing: find the size of the grid, loop over the grid, measure the sum of the distances to all points, and count the number of cells with a suficciently low distance.

(defun day-6/2 (input)
  (setf *point-count* 0)
  (let* ((points (parse-points input))
         (max-x (+ 1 (reduce #'max (mapcar #'point-x points))))
         (max-y (+ 1 (reduce #'max (mapcar #'point-y points))))
         (count 0))
    (loop for y from 0 below max-y do
      (loop for x from 0 below max-x
            for point = (pt x y)
            when (< (reduce #'+ (mapcar #'(lambda (p) (manhattan p point)) points)) 10000)
              do (incf count)))
    count))

and that's it!

Day 7

Part 1

We start out by parsin each input line to a pair, so that we can easier handle the dependency edges.

(defun line-to-pair (line)
  (let ((a (subseq line 5 6))
        (b (subseq line 36 37)))
    (list a b)))

One approach we can take is to try to continuously find all nodes that does not depend on any other node, and select the first alphabetically. The most straight forward way of doing this is to look through the list of edges, and count the number any node is the second element of the list. Then we look throught the counts and chose the first node with a count of 0.

(defun get-next (nodes edges)
  (defun zero-keys (hm)
    (loop for k being the hash-keys of hm
          when (eq 0 (gethash k hm)) collect k))
  (let ((hm (make-hash-table :test #'equalp)))
    (loop for node in nodes do (setf (gethash node hm) 0))
    (let ((available (loop for e in edges
                           do (print (second e))
                           do (incf (gethash (second e) hm))
                           finally (return (zero-keys hm)))))
    (reduce #'(lambda (a e) (if (string< a e) a e)) available))))

The outer loop is mostly keeping track of the nodes and edges we have left, and removing the elements that we no longer use after outputting a node.

(defun day-7/1 (input)
  (let* ((output)
         (edges (mapcar #'line-to-pair input))
         (nodes (remove-duplicates (flatten edges) :test #'string=)))
    (loop when (not (car nodes)) return output
            do (let ((next (get-next nodes edges)))
                 (setf edges (delete-if #'(lambda (edge) (string= (first edge) next)) edges))
                 (setf nodes (delete next nodes))
                 (setf output (cons next output))))
    (reduce #'(lambda (a b) (concatenate 'string a b)) (reverse output))))

We also used a flatten function stolen from [rosettacode].

Part 2

Todays second part seems very different from the first. We are asked to schedule variable length tasks with 5 workers.

First off, it does not matter which worker does which task. Second off, we probably want to prioritize starting with longer tasks, if possible. We still have task dependencies, which we need to remember.

One approach to solving this is to have a queue of all tasks that is currently processed. Then at each step we would find the next task, assign a worker to it, and find the time for when the task is done. If there are multiple nodes without any dependencies we would chose as many as we have workers. In addition, we probably want to chose the longest tasks first; that is, the largest lexiograhpically, as opposed to the smallest, as in part 1.

This is the task data that we work with: id is the task name, done is the time at which the task is done, and worker is a worker id.

(defstruct task id done worker)
(defun task-cost (id)
  (+ 60 (- (char-int (char id 0)) 64)))

get-next/2 is just like get-next, except that we chose the largest instead of the smallest alphabetically, since this has the largest cost. Now our main function looks like this:

(defun day-7/2 (input)
  (let* ((edges (mapcar #'line-to-pair input))
         (nodes (remove-duplicates (flatten edges) :test #'string=))
         (available-workers (loop for i from 1 to 5 collect i))
         (in-flight-tasks))
    (loop for time from 0
          when available-workers do
            (let ((next (get-next/2 nodes edges)))
              (when next
                (setf nodes (delete next nodes))
                (push (make-task :id next
                                 :done (+ time 1 (task-cost next))
                                 :worker (pop available-workers))
                      in-flight-tasks)
                (setf in-flight-tasks (sort in-flight-tasks #'< :key #'task-done))))
          when in-flight-tasks do
              (loop while in-flight-tasks
                when (< time (task-done (car in-flight-tasks))) return nil
                do (let ((task (pop in-flight-tasks)))
                     (setf edges (delete-if #'(lambda (edge) (string= (first edge) (task-id task))) edges))
                     (push (task-worker task) available-workers)))
          when (not nodes) return time)))

Using this we pass the test input, but on the real input our output is wrong. A little format debugging shows us two things: 1. we should not add 1 to the task cost when constructing new tasks, and 2. we need to remove tasks that are done before trying to add new tasks this round. Without this a task that takes only one cycle would spend two: the one in which it gets dispatched, and the one in which it completes. In addition, we're not dispatching multiple tasks at a time, which we should. The end condition was also wrong, as it terminated as soon as the last task was dispatched, but not completed. Somehow all these errors canceled out when ran on the test input.

(defun day-7/2 (input num-workers)
  (let* ((edges (mapcar #'line-to-pair input))
         (nodes (remove-duplicates (flatten edges) :test #'string=))
         (available-workers (loop for i from 1 to num-workers collect i))
         (in-flight-tasks))
    (loop for time from 0
          when in-flight-tasks do
            (loop while in-flight-tasks
                  when (< time (task-done (car in-flight-tasks))) return nil
                    do (let ((task (pop in-flight-tasks)))
                         (setf edges (delete-if #'(lambda (edge) (string= (first edge) (task-id task))) edges))
                         (push (task-worker task) available-workers)))
          when available-workers do
            (loop for next = (get-next nodes edges)
                  when (not available-workers) return nil
                  if next do (progn
                               (setf nodes (delete next nodes))
                               (push (make-task :id next
                                                :done (+ time (task-cost next))
                                                :worker (pop available-workers))
                                     in-flight-tasks)
                               (setf in-flight-tasks (sort in-flight-tasks #'< :key #'task-done)))
                  else return nil)
          when (and (not nodes) (not in-flight-tasks)) return time)))

After about 90 minutes of debugging, formating, and asking around how people did resolve ties, I ended up with this, which gives me the correct answer.

Regarding ties, I was confused since the task did not explicitly say that ties should still be resolved alphabetically, and I suspect it does matter (although I haven't come up with an example). In order to see whether this actually was the error in my code, I resolved ties randomly, and ran the function on the input 100 times; they all gave me the same answer.

I'm still not sure what was the bug, since I ended up not doing anything meaningful edits in the last hour of debugging. There might have been stale function implementations or something as well, or just that I forgot to turn back the cost function or the number of workers, between testing the function on the test input vs. the real input.

In any case, day 7 is complete.

Day 8

Today we have good news and bad news. The good news is that todays data structure is the tree! The bad news is that the input is a single line of space separated digits, so we'll have to make split-sequence work.

split-sequence is not in the standard library, so we cannot just use it. It is apparently a part of the Common Lisp Utilities, although that tells me nothing; browsing through the homepage of the utilities doesn't give me much information about how to actually use this. The examples given for split-sequence seem to have already loaded a package called split-sequence. I guess that it is installable using quicklisp:

* (ql:quickload "split-sequence")
To load "split-sequence":
  Load 1 ASDF system:
    asdf
  Install 1 Quicklisp release:
    split-sequence

(rant warning)

... but nothing happens after this is ran, and I need to abort it with C-C C-C. Trying "cl-utilities" and "utilities" instead did not help out: I got ETIMEDOUT from the former in the debugger, and the latter did apparently not do anything. I figured that the ETIMEDOUT might be due to me having outdated stuff in my quicklisp installation, so I ran (ql:update-dist "quicklisp"), which I found at the quicklisp website. After (I'm guessing) 30 seconds without any feedback to whether something actually happend when typing that in the repl, I get yet another ETIMEDOUT. Maybe the quicklisp client is outdated? (this seems very unlikely, since I installed it about eight days ago, but at this point I have no idea what's going on) Running (ql:update-client) gets me nowhere: yet another ETIMEDOUT. I suppose the quicklisp site could be down? Folloing the install instructions I followed about a week ago I run

curl -O https://beta.quicklisp.org/quicklisp.lisp.asc

and... nothing happens! Great! .. or is it? Investigating further I tried to check isitdownrightnow.com to confirm that the site was indeed down, but I failed to connect to isitdownrightnow.com! It seemes unlikely that both of these sites are down, so I check downforeveryoneorjustme.com, which claims that both beta.quicklisp.org and isitdownrightnow.com` are in fact down for just me. QuickLisp just times out, but isitdownrightnow gives me a cloudflare page, so presumably the problem is not in my house, which means that there's probably not much I can do.

(Update the 10th: the network is more or less back to normal, and installing sequence-split was as simple as (ql:quickload "split-sequence"); so much for getting annoyed :)

(rant warning end)

Luckily, cl-ppcre offers the samf functionality with (ppcre:split delim string).

(defun line-to-numbers (line)
  (mapcar #'parse-integer (ppcre:split " " line)))

Then we make a function to parse the line into a tree. We first find the number of children an number of metadata entries, then for each child we recursively call the parse function on the list without the two numbers we've already read. This is slightly awkward since we have to both collect the child nodes, as well as keep track of where in the input list we are. For this we use multiple-value-bind, and have the function return a pair node, remaining-input.

(defun parse-tree (items)
  (let* ((num-children (first items))
         (num-metadata (second items))
         (rest (cddr items))
         (children
           (loop for i from 0 below num-children
                 collect
                 (multiple-value-bind (node new-rest) (parse-tree rest)
                   (setf rest new-rest)
                   node)))
         (metadata (subseq rest 0 num-metadata)))
    (values (list children metadata) (subseq rest num-metadata))))

Running this on the test input gives us this:

(((NIL (10 11 12)) (((NIL (99))) (2))) (1 1 2))
; formatted, and with labels:
node: (
  children: (
    node: (children: NIL metadata: (10 11 12))
    node: (
      children: (
        node: (children: NIL metadata: (99))
      metadata: (2)))
  metadata: (1 1 2))

So the tree looks like this:

      (1 1 2)
       /  \
      /    \
     /      \
(10 11 12)  (2)
             |
             |
            (99)

Which looks right, compared to the description on the task page.

Now we just need to sum up all metadata entries. We will again go for a recursive solution:

(defun metadata-sum (tree)
  (+ (reduce #'+ (second tree))
     (reduce #'+ (mapcar #'metadata-sum (first tree)))))

* (metadata-sum (parse-tree (line-to-numbers *test-input-8*)))
138

Yay! This completes the first part.

Part 2

Now we're asked to sum up the values of the children if the childs index is in the nodes metadata, with a note that if a number is multiple times in the list, it should be counted multiple times. This makes it possible to make inputs so that the running time becomes exponential, but we'll try to do the naive thing still.

The function is almost a straight mapping from the description of the value scoring. If the node has children, the metadata is the indices (ops: 1 indexed) for the children we count. If not, the sum of the metadata is the value.

(defun node-value (node)
  (if (first node)
    (loop for data in (second node)
          summing (node-value (nth (- data 1) (first node))) into sum
          finally (return sum))
    (reduce #'+ (second node))))

Tada! The running time is also pretty good!

* (time (day-8/2 *input-8*))
Evaluation took:
  0.113 seconds of real time
  0.113493 seconds of total run time (0.113443 user, 0.000050 system)
  [ Run times consist of 0.025 seconds GC time, and 0.089 seconds non-GC time. ]
  100.00% CPU
  328,615,301 processor cycles
  302,886,128 bytes consed
  
30063

Day 9

Today I want to try out something a little different. The marbles in the task are in a circle, so I want to try out having a circular list.

(defun make-circular (e)
  (let* ((l (list e)))
    (setf (cdr l) l)
    l))

Trying to print this out results in looping, so this seems to work. One downside of this is that we should be able to move in both directions around the circle; if we want to go in the other direction we would have to first find the length of the list, and then go n-k steps in the opposite direction. This might take some time, but we can try to do it this way first, in case it works out.

Next up is doing stuff with the list. Naturally we cannot just mapcar over our circular list (we cannot even subseq it - I exhausted my heap attempting to do so), so we need to write our own map:

(defun map-circular (f circ)
  (let* ((head (first circ))
        (result (list (funcall f head))))
    (loop for e in (cdr circ)
          when (eq e head) return (reverse result)
          do (push (funcall f e) result))))

Now (map-circular #'print (make-circular 1)) prints 1 and gives me back (1). Next we want to insert things, so we'll write a function for that. We return t here so that the REPL doesn't try to print out the entire list every time we add something

(defun insert-circular (e circ)
  (setf (cdr circ) (cons e (cdr circ)))
  t)

* (defparameter nums (make-circular 1))
NUMS
* (insert-circular 2 nums)
T
* (insert-circular 3 nums)
T
* (insert-circular 4 nums)
T
* (map-circular #'print nums)
1 
4 
3 
2 
(1 4 3 2)

Good.

Now implementing the game is not too difficult. We keep track of the current node, the player, and the player scores. For each round of the game we increment the current player, and check if the marble is special or not. If it is, we count the list, and go forward n-8 steps (backwards 8) steps, so we end up with the node before the one we want to remove. Then we remove in and increment the score for the current player. If the marble is not special we just insert it after the next marble in the circle. Lastly, we get the maximum of the scores.

(defun play-game (num-marbles players)
  (let* ((circle (make-circular 0))
         (current circle)
         (player 0)
         (scores (make-array players)))
    (insert-circular 1 circle)
    (setf current (cdr circle))
    (loop for marble from 2 to num-marbles do
      (progn
        (setf player (mod (1+ player) players))
        (if (eq (mod marble 23) 0)
            (let* ((len (length-circular circle))
                   (to-remove (nthcdr (- len 8) current)))
              (incf (aref scores player) (+ marble (second to-remove)))
              (remove-circular to-remove)
              (setf current (cdr to-remove)))
            (progn
              (insert-circular marble (cdr current))
              (setf current (cddr current))))))
    (loop for s across scores maximizing s into m finally (return m))))

This is a very wasteful implementation, since we need to go through almost the entire list twice when removing marbles, since we have to go backwards. Still, for the input we were given, it doesn't perform too bad:

* (time (play-game 70848 425))
Evaluation took:
  1.323 seconds of real time
  1.322374 seconds of total run time (1.252015 user, 0.070359 system)
  [ Run times consist of 0.207 seconds GC time, and 1.116 seconds non-GC time. ]
  99.92% CPU
  3,842,548,402 processor cycles
  3,189,760,080 bytes consed
  
413188

Part 2

This, of course, was foreseen by the creator of the task: part 2 asks for the same game, but for 100 times the number of marbles. Since the current implementation is roughly quadratic, this means that we'll use 10.000X the time: 10.000 seconds is roughly three hours, which means back to the drawing board.

The natural approach is to add support for both-way traversal of the list. We can do this by not using lists, but make our own list:

(defstruct node b e f)

(defun make-circular (e)
  (let ((n (make-node :f nil :e e :b nil)))
    (setf (node-f n) n)
    (setf (node-b n) n)
    n))

Inserts and removals are similar to as before, except that we must swing two pointers instead of one.

(defun insert-circular (e node)
  (let ((n (make-node :b node :e e :f (node-f node))))
    (setf (node-b (node-f node)) n)
    (setf (node-f node) n)
    t))


(defun remove-circular (node)
  (setf (node-f node) (node-f (node-f node)))
  (setf (node-b (node-f node)) node)
  t)

Having this it is very easy to go n steps backwards:

(defun n-back (n node)
  (if (eq n 0) node
      (n-back (- n 1) (node-b node))))

The main function is almost not changed at all, with the exception of swapping cars with node-e and cdrs with node-f, in addition to, of course, using n-back instead of nthcdr.

(defun play-game (num-marbles players)
  (let* ((circle (make-circular 0))
         (current circle)
         (player 0)
         (scores (make-array players)))
    (insert-circular 1 circle)
    (setf current (node-f circle))
    (loop for marble from 2 to num-marbles do
      (progn
        (setf player (mod (1+ player) players))
        (if (eq (mod marble 23) 0)
            (let* ((to-remove (n-back 8 current)))
              (incf (aref scores player) (+ marble (node-e (node-f to-remove))))
              (remove-circular to-remove)
              (setf current (node-f to-remove)))
            (progn
              (insert-circular marble (node-f current))
              (setf current (node-f (node-f current)))))))
    (loop for s across scores maximizing s into m finally (return m))))

Here's the running times for both input:

* (time (day-9/1))
Evaluation took:
  0.008 seconds of real time
  0.007581 seconds of total run time (0.007487 user, 0.000094 system)
  100.00% CPU
  22,124,513 processor cycles
  2,162,688 bytes consed
  
413188
* (time (day-9/2))
Evaluation took:
  1.169 seconds of real time
  1.167412 seconds of total run time (1.080873 user, 0.086539 system)
  [ Run times consist of 0.748 seconds GC time, and 0.420 seconds non-GC time. ]
  99.83% CPU
  3,394,813,588 processor cycles
  216,923,648 bytes consed
  
3377272893

A Checkmate Poster

2020-09-20T16:50:39+02:00

No affiliation with checkmateposters.com blabla

Some weeks ago I saw checkmateposters.com on HN(?). The concept is very simple: you generate a poster showing chess positions thoughout a game of your choosing. Colors are also, to some degree, customizable. After considering getting one for a friends birthday, I rather got one for myself, both since their birthday was a little further in the future, I wasn't sure they'd like it, and hey, what if the poster is not any good?

And besides, I wasn't sure which game to get.

At the same time I read Brian Kernighan's "Unix: A History and a Memoir", in which he mentions a chess game in between the programs Blitz 6.5 and Belle, Belle which was co-authored by Ken Thompson. In the book the game was only covered in annotated FEN, so I had to play it out on a board to see how it looks (my mental FEN skills are obviously not up to par).

Some days ago it arrived and I figured I'd post some pictures of it since (a) it's a pretty new service so new potential customers will have a hard time evaluating the product they're buying, and (b) I was very happy with the result, and don't mind posting about services that I enjoy.

The Shipping

I ordered the poster the 3rd of September, and it arrived here the 15th. I got three emails in between: one order confirmation which contained the shipping information (i.e. my address) and a low-res thumbnail of the poster; one update email from Avery on the 8th, saying they'll forward the tracking information as soon as they get it; one containing the link to the tracking info.

I think it's unfortunate that I only got a low-res picture of the poster in the confirmation, since I couldn't send pictures to friends saying "look what I just ordered". Furthermore, it would also have been nice if the colors (and FEN even) was included in the mail, since I now have no way of knowing exactly which colors I chose. This isn't a problem for me right now, but you could imaginge wanting to order either a copy or a new game in the same style as before. Unless you chose either of the presets it seems that you don't really have a way of buying the same again.

Here's the box I got:

Inside the box the poster was wrapped in soft-ish wrapping paper:

The Poster

Here's the poster itself, with a 0.5 liter bottle for scale.

The paper is semi thick, and feels pretty good. Here's a closeup picture of the poster; it looks a little blurry because (a) my camera isn't great, and (b) it is a little blurry.

It's difficult to get an idea of the print quality by looking at a picture of the poster itself, since it's hard to differentiate blur in the print and blur in the photo. I've tried to make this a little easier with a side-by-side comparison with a paper I had at hand.

Looking at the picture, the difference isn't too big, although I think it's more in real life; I attribute this to the phone camera. Still, this is only really noticable if you are pretty close to the poster. At an arms length away it is not noticably blurry.

Framing The Poster

I wanted to frame the poster to get some constrast between my wall and the poster itself since I chose a white background. The dimensions of the poster is found on the webpage: 18' by 24' ¹. Living in a SI country, this is slightly unfortunate since the dimensions aren't as nice in meters.

Furthermore, when framing you will probably end up covering parts of the poster with the mat. Unfortunately, it's not clear exactly how much space you have in between the border of the poster and where the chess boards start. In my case I had pretty little space, and the parts of the poster I didn't want to hide was slightly wider than the mat from the frame I got. I therefore had to cut in it, which is somewhat visible on the pictures.

It would be nice if this was somehow taken into account when the poster is generated, e.g. that you could get exact dimensions for the different parts of the poster so that you would know beforehand exactly how big the frame and mat would have to be. Still, this was fairly easy to get around.

Here's the poster, framed and standing on a chair:

You can tell that I didn't want to cut too much into the mat, as the borders of the boards are just slightly within the mat. You can also see my terrible paper cutting skills, especially on the right side of the mat.

Here's the poster, framed up on my wall.

End

In total, I'm very happy with the poster I got, and I think the minor hiccups I had (or thought of) are easily fixable if needed, and not really a dealbreaker if left as is.

I hope this either inspired you into getting some new wall decoration, or helped you in deciding whether to buy a poster or not.

Thanks for reading.

well, the website says 18' x 24', which as far as I can tell, not being american, would mean 28 by 24 feet, and not inches. I don't think it's really possible to misunderstand this though, considering there's a big picture of the poster right on the front page, as well as the fact that 24 feet is a lot. ↩

Writing a JPEG decoder in Rust - Part 2: Implementation I

2016-08-19T13:44:00+02:00

This is a blog series. Read part 1 here

Last time we got a basic understanding of the different steps in decoding a JPEG image, as well as how the file is structured. What we did not get was any code or hints at implementation. Finally we get to see an attempt of writing a JPEG decoder, in Rust.

Additionally, I have now open sourced the project on GitHub. If you are interested in seeing the full thing, or want to know what I will try to cover in Part 3 of this series, take a look! Again, feedback is very welcome, be it typos, questions, or suggestions for improvements. As I mentioned in Part 1, the project is very much ongoing, and I have cut quite a few corners here and there. If you are testing the decoder, and find an image that is not decoded properly, I will be very interested in hearing from you¹!

At last, this project is purely educational, for me and hopefully also for you. I do not actually need a new JPEG decoder, and chances are you do not either :)

Implementation

Now that we have a high level understanding of how JPEG works, we can write a simple decoder for a test image of Lena²:

Our Program

For testing and validation purposes, we will create a binary program, which we will run with

cargo run <input.jpeg> <output.ppm>

The output file is a ppm file, which was the simplest way I found to see the image data we decode. The format is very simple: see here how it works. Common image viewers should open ppm files. Personally I have used eog.

JFIF

What does the JFIF specification (pdf) enforce? The file has to start with the SOI (Start of Image) marker, followed by the APP0 (Application Segment 0) marker³. According to the JPEG spec, there are 16 application markers, 0xffe0-0xffef, which are "reserved for application use". JFIF uses the segment to hold an identifier (the string "JFIF"), and thumbnail data, among a few other things. The images I have tried to decode did not contain a thumbnail, so this is seemingly rare.

Reading the file

Initially we will simply read the whole image file. There might be some memory and/or speed optimization potential using some kind of streaming approach, but for now, let's stick with a Vec<u8>⁴.

fn file_to_bytes(path: &Path) -> Result<Vec<u8>, std::io::Error> {
    File::open(path).and_then(|mut file| {
        let mut bytes = Vec::new();
        try!(file.read_to_end(&mut bytes));
        Ok(bytes)
    })
}

The Marker Segment Loop

This will be the main loop of the decoder. We keep track of where we are in the file, and read segment for segment. Since most segments say exactly how long they are, this works out pretty well.

let mut i = 0;
while i < vec.len() {
    if let Some(marker) = bytes_to_marker(&vec[i..]) {
        if marker == Marker::EndOfImage || marker == Marker::StartOfImage {
            // These markers doesn't have length bytes, so they must be
            // handled separately, in order to to avoid out-of-bounds indexes,
            // or reading nonsense lengths.
            i += 2;
            continue;
        }

        let data_length = (u8s_to_u16(&vec[i + 2..]) - 2) as usize;
        i += 4;

        match marker {
            Marker::Comment => { /* Read comment data */ }
            Marker::QuantizationTable => { /* Read table data */ }
            // Handle the rest of the markers
        }
        i += data_length;
    } else {
        panic!("Unhandled byte marker: {:02x} {:02x}", vec[i], vec[i + 1]);
    }
}

There are quite a few different segments, but not all are very interesting. The segments we will look at in this post are Comment, DefineHuffmanTable, QuantizationTable, and StartOfScan.

Reading a segment

As a simple example to get us started reading data in a segment, consider the Comment marker, which is of the following form⁵:

 2 bytes  2 bytes   length-2 bytes
| marker | length | comment       |

The marker bytes are already read, and so are the length bytes, from which we also subtracted 2, making data_length the actual length of the data part of the segment (comment in this case). Reading the comment into a String is pretty straight forward:

Marker::Comment => {
    let comment = str::from_utf8(&vec[i..i + data_length])
        .map(|s| s.to_string())
        .ok();
    image.comment = comment;
}

`QuantizationTable`

This segment contains one or more quantization tables (or matrices). Each table is 65 bytes, where the first byte contains two fields: Element Precision and Table Destination, each 4 bits large. Element precision specifies how large each value in the matrix is; 0 for 8-bits, 1 for 16-bits. For baseline sequential DCT (which is the only mode we will support), this has to be 0. Table destination specifies one of four possible "slots" for the matrix to be saved in; that is, it is possible to have four quantization matrices and use different ones in different scans.

The remaining 64 bytes are the values in the matrix. If there are multiple tables in the segment they are located right after one another.

Marker::QuantizationTable => {
    // JPEG B.2.4.1
    let mut index = i;
    while index < i + data_length {
        let precision = (vec[index] & 0xf0) >> 4;
        assert!(precision == 0);
        let identifier = vec[index] & 0x0f;
        let table: Vec<u8> = vec[index + 1..index + 65]
            .iter()
            .cloned()
            .collect();

        image.quantization_tables[identifier as usize] = Some(table);
        // 64 entries + one header byte
        index += 65;
    }
}

`DefineHuffmanTable`

Reading the Huffman tables are some work. Each table consists of a Table Class and a table destination (sharing one byte), 16 bytes specifying the number of codes of length 1 through 16, and a one byte value for each code. The table class specifies wether the table is used for DC or AC coefficients, but we will get back to this in Part 3.

Marker::DefineHuffmanTable => {
    // JPEG B.2.4.2

    // Head of data for each table
    let mut huffman_index = i;
    // End of segment
    let segment_end = i + data_length;

    while huffman_index < segment_end {
        let table_class = (vec[huffman_index] & 0xf0) >> 4;
        let table_dest_id = vec[huffman_index] & 0x0f;
        huffman_index += 1;

        // There are `size_area[i]` number of codes of length `i + 1`.
        let size_area: &[u8] = &vec[huffman_index..huffman_index + 16];
        huffman_index += 16;

        // TODO: replace with `.sum` as of Rust 1.11
        let number_of_codes = size_area.iter()
            .fold(0, |a, b| a + (*b as usize));

        // Code `i` has value `data_area[i]`
        let data_area: &[u8] = &vec[huffman_index..huffman_index +
                                                   number_of_codes];
        huffman_index += number_of_codes;

        let huffman_table =
            huffman::HuffmanTable::from_size_data_tables(size_area, data_area);
        // DC = 0, AC = 1
        if table_class == 0 {
            image.huffman_dc_tables[table_dest_id as usize] =
                Some(huffman_table);
        } else {
            image.huffman_ac_tables[table_dest_id as usize] =
                Some(huffman_table);
        }
    }
}

Processing the table data we read from the file is a little work: this happens in huffman::HuffmanTable::from_size_data_tables, to which we pass a "size table" (how many codes are of length i?), and a "data table" (which value is code i mapped to?)⁶.

The Huffman module defines two structs:

#[derive(Debug, Clone)]
pub struct HuffmanCode {
    /// How many bits are used in the code
    length: u8,
    /// The bit code. If the number of bits used to represent the code is less
    /// than `length`, prepend `len-length` `0`s in front.
    code: u16,
    /// The value the code is mapped to.
    value: u8,
}

#[derive(Debug)]
pub struct HuffmanTable {
    /// A list of all codes in the table, sorted on code length
    codes: Vec<HuffmanCode>,
}

For simplicity, the actual table is just a Vec of HuffmanCodes⁷; we may see in a later post how to improve the performance here.

Creating the table is done in two steps. First we create a Vec with the length of each code, such that code i has length code_lengths[i]; remember that the size_data read from the file is the number of codes of each length, and now we want the length of each code. Then we create a Vec with the codes, by merging the three parts: code length, code bit string, and code value.

impl HuffmanTable {
    pub fn from_size_data_tables(size_data: &[u8], data_table: &[u8]) -> HuffmanTable {
        let code_lengths: Vec<u8> = (0..16)
            .flat_map(|i| repeat(i as u8 + 1).take(size_data[i] as usize))
            .collect();

        let code_table: Vec<u16> = HuffmanTable::make_code_table(&code_lengths);

        let codes: Vec<HuffmanCode> = data_table.iter()
            .zip(code_lengths.iter())
            .zip(code_table.iter())
            .map(|((&value, &length), &code)| {
                HuffmanCode {
                    length: length,
                    code: code,
                    value: value,
                }
            })
            .collect();

        HuffmanTable { codes: codes }
    }

    fn make_code_table(sizes: &[u8]) -> Vec<u16> {
        // This is more or less just an implementation of a
        // flowchart (Figure C.2) in the standard.
        let mut vec = Vec::new();
        let mut code: u16 = 0;
        let mut current_size = sizes[0];
        for &size in sizes {
            while size > current_size {
                code <<= 1;
                current_size += 1;
            }
            vec.push(code);
            if current_size > 16 || code == 0xffff {
                break;
            }
            code += 1;
        }
        vec
    }
}

Beware: make_code_table was previously purely an implmementation of the flow chart mentioned in the comment, but this was so ugly that I decided to implement it from scratch. It has not been thoroughly tested, but it seems to work as intended.

Now the table is read, processed, and put into its place. Next up is actually decoding image data.

`StartOfScan`

First we read in fields for the current scan. This includes eg. the number of components, and which tables they use. There is nothing fancy just yet; the data format is, as usual, listed in the JPEG specification.

Marker::StartOfScan => {
    // JPEG B.2.3
    let num_components = vec[i];
    let mut scan_components = Vec::new();
    for component in 0..num_components {
        scan_components.push(ScanComponentHeader {
            component_id: vec[i + 1],
            dc_table_selector: (vec[i + 2] & 0xf0) >> 4,
            ac_table_selector: vec[i + 2] & 0x0f,
        });
        i += 2;
    }

    let scan_header = ScanHeader {
        num_components: num_components,
        scan_components: scan_components,
        start_spectral_selection: vec[i + 1],
        end_spectral_selection: vec[i + 2],
        successive_approximation_bit_pos_high: (vec[i + 3] & 0xf0) >> 4,
        successive_approximation_bit_pos_low: vec[i + 3] & 0x0f,
    };
    // Register read data
    i += 4;

    if image.scan_headers.is_none() {
        image.scan_headers = Some(Vec::new());
    }
    image.scan_headers
        .as_mut()
        .map(|v| v.push(scan_header.clone()));

As we have seen, markers are of the format 0xff__. So what if the image data contains 0xff? The solution is to encode 0xff as 0xff00, which is not a marker code. Therefore we need to replace ff00 with ff before sending it to the huffman decoder, where we decode actual image data. Additionally we need to keep track of how many bytes we skip, in order to increment i correctly when we are done with this scan.

    // Copy data, and replace 0xff00 with 0xff.
    let mut bytes_skipped = 0;
    let mut encoded_data = Vec::new();
    {
        let mut i = i;
        while i < vec.len() {
            encoded_data.push(vec[i]);
            if vec[i] == 0xff && vec[i + 1] == 0x00 {
                // Skip the 0x00 part here.
                i += 1;
                bytes_skipped += 1;
            }
            i += 1;
        }
    }

Note that we are copying the entire image data here. Is this necessary? Not really. Is it slow? Uuh, maybe? It is simple? Hell yes.

After this we pass in the huffman tables and quantization matrices which we hopefully have read in an earlier segment. The code for this is nothing special, and somewhat ugly, so it has been omitted. At last, we decode the image, and advance the index appropriately.

    let (image_data, bytes_read) = jpeg_decoder.decode();
    image.image_data = Some(image_data);

    // Since we are calculating how much data there is in this segment,
    // we update `i` manually, and `continue` the `while` loop.
    i += bytes_read + bytes_skipped;
    continue;
}

Note that we assume the image contains only one scan. As I mentioned in Part 1, all images I have tested containst just that, so this is an assumption we will continue to make. If we are afraid of forgetting this, we could always add an assert!(image.image_data.is_none()) at the top of the StartOfScan block.

So what happens in JpegDecoder::decode? That will have to wait for the next part.

I know images with certain sampling factors are messed up. If possible, check identify -verbose <image.jpeg> | grep sampling (requires ImageMagick). Only 1x1 and 2x1 are supported. ↩
When developing, I did not use this image, because of its size (decoding this actually takes almost two seconds on my laptop!), and its complexity (multiple channels, present scaling factors, etc.). Rather, I tested the decoder with a 16x8 grayscale version of the same image, and advanced to a full size grayscale image of Lena, when the small image showed correctly. ↩
Note that while this is the JFIF spec, the markers used are defined in the JPEG spec. ↩
/u/DroidLogician pointed out a huge performance flaw in the original file reading code. Check the reddit thread out! ↩
If you think my ASCII skills are nonexistent, check page 47 (marked 43) in the JPEG spec. ↩
The naming here is the same as what is used in the spec. I think there are room for improvements, but I have yet to come up with better names. ↩
By now, you might see how I plan to use the Huffman table to decode data. In retrospect, constructing an actual tree and traversing it with a bit stream iterator of sorts would perhaps have been a better idea, although it would have been more work up front. ↩

Other quotes from Structured Programming with go to Statements

2019-02-25T11:31:54+01:00

Most of us have heard the (unfortunately) famous quote from Donald E. Knuths1974 paper "Structured Programming with go to Statements". Yes, it's probably the one you're thinking about¹. It's a really good paper with plenty of interesting ideas that holds up very well especially considering the paper is over 40 years old and is in great part about optimization. I highly recommend you to stop reading my blog, and to go and read the paper itself instead.

What follows is a list of other great quotes from the same paper. They are mostly copied straight from the paper, but I've occasionally omitted parts in order not to make them too long or filled with irrelevant context. My own edits are written [like this]. Enjoy!

[...] people are now beginning to renounce every feature of programming that can be considered guilty by virtue of its association with difficulties. Not only go to statements are being questioned; we also hear complaints about floating-point calculations, global variables, semaphores, pointer variables, and even assignment statements.

The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by penny-wise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs. In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering. Of course I wouldn't bother making such optimizations on a one-shot job, but when it's a question of preparing quality programs, I don't want to restrict myself to tools that deny me such efficiencies.

I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.

He [Tony Hoare] points out quite correctly that the current practice of compiling subscript range checks into the machine code while a program is being tested, then suppressing the check during production runs, is like a sailor who wears his life preserver while training on land but leaves it behind when he sails!

[...] I also know of places where I have myself used a complicated structure with excessively unrestrained go to statements, especially the notorious Algorithm 2.3.3A for multivariate polynomial addition. The original program had at least three bugs; exercise 2.3.3-14 "Give a formal proof (or disproof) of the validity of Algorithm A", was therefore unexpectedly easy.

It is important to keep efficiency in its place, as mentioned above, but when efficiency counts we should also know how to achieve it.

A programmer should create a program P which is readily understood and well-documented, and then he should optimize it into a program Q which is very efficient. Program Q may contain go to statements and other low-level features, but the transformation from P to Q should be accomplished by completely reliable and well-documented "mechanical" operations. At this point many readers will say, "But he should only write P, and an optimizing compiler will produce Q." To this I say, "No, the optimizing compiler would have to be so complicated that it will in fact be unreliable"

We found ourselves always running up against the same problem: the compiler needs to be in a dialog with the programmer; it needs to know properties of the data, and whether certain cases can arise, etc. And we couldn't think of a good language in which to have such a dialog.

The programmer using such a system will write his beautifully-structure, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient. Such a system will be much more powerful and reliable and a completely automatic one. [...] The original program P should be retained along with the transformation specifications, so that it can be properly understood and maintained as time passes.

He [Edsger Dijkstra] went on to say that he looks forward to the day when machines are so fast that we won't be under pressure to optimize our programs.

Though [a previous code snippet] is slightly cleaner looking than the method in my book, it is noticeable slower, and we have nothing to fear by using a slightly more complicated method once it has been proved correct. Beautiful algorithms are, unfortunately, not always the most useful.

One thing we haven't spelled out clearly, however, is what makes some go to's bad and others acceptable. The reason is that we've really been directing our attention to the wrong issue, to the objective question of go to elimination instead of the important subjective question of program structure.

We should ordinarily keep efficiency considerations in the background when we formulate our programs. We need to be subconsciously aware of the data processing tools available to us, but we should strive most of all for a program that is easy to understand and almost sure to work.

On the off-chance that you have no idea what I'm talking about: don't bother looking it up! Read the paper itself instead; this will provide you with a much needed context that is usually omitted from the quote. ↩

Swapping memory blocks in C

2016-02-10T14:01:17+01:00

Sometimes one have a memory block where we want to put the first n bytes at the end of the block, rather than the beginning, without changing the blocks themselves, as in the figure below. a, b, and e are pointers to the beginning of A, the beginning of B, and the end of B.

        a |=========|        |=========|
          | block A |  want  | block B |
        b |---------| =====> |         |
          |         |        |         |
          |         |        |         |
          | block B |        |---------|
          |         |        | block A |
        e |=========|        |=========|

By using extra space, more specifically $O(n)$ space, this is trivial.

// external buffer to hold A
char *tmp = malloc(sizeof(A));
// copy A to the buffer
memmove(tmp, a, sizeof(A));
// move B up to the top
memmove(b, a, sizeof(B));
// insert A at the bottom
memmove(a + sizeof(B)), tmp, sizeof(A));
free(tmp);

Of course, sizeof(A) will not work when A is a pointer, but the meaning is still clear. Additinally, we could use memcpy instead of memmove on the first and last call, if we really cared about not copying the data too much around¹.

This is a fine solution; it works. But can we do better? Well, can we do it without the malloc call?

A constant memory `block_swap`

This algorithm is based on the simple idea that if we swap block A to its final position, we have swapped a block B2, which is of the same size as A, to the top. We are then left with a similar but smaller problem, which is to swap B2 and B1:

  a |=========|            |=========|               |=========|
    | block A |  one pass  | block B2|  new problem  | block B2|
  b |---------| =========> |---------| ============> |---------|
    |         |            |         |               |         |
    | block B1|            | block B1|               | block B1|
  c |- - - - -|            |---------|               |=========|
    | block B2|            | block A |               :(block A):
  e |=========|            |=========|               ...........

With this we can sketch out the general idea of our alogrithm:

void one_by_one_swap(char *a, char *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        char tmp = a[i];
        a[i] = b[i]:
        b[i] = tmp;
    }
}

void block_swap(char *a, char *b, char *e) {
    char *c = e - sizeof(A);
    one_by_one_swap(a, c, sizeof(A));
    block_swap(a, b, c);
}

Specifics

One assumption we have made so far is that the B block is larger than the A block. In order to fix this, we could check which of the blocks is the larger, and swap around the logic, such that we always swap the smaller 'into' the larger. This allows us for a little optimization: if the blocks are of equal size, we can simply make one call to one_by_one_swap. In the second and final listing, we have even added some argument checking. This code is runable.

void block_swap(char *a, char *b, char *e) {
    assert(a < b);
    assert(b < e);
    size_t a_size = b - a;
    size_t b_size = e - b;
    if (a_size < b_size) {
        // The case we assumed above
        char *c = e - a_size;
        one_by_one_swap(a, c, a_size);
        block_swap(a, b, c);
    } else if (b_size < a_size) {
        // The opposite case
        // Now `c` is between `a` and `b`
        char *c = a + b_size;
        one_by_one_swap(a, b, b_size);
        block_swap(c, b, e);
    } else {
        // The trivial case
        one_by_one_swap(a, b, a_size);
    }
}

We can still do one more thing, which is replacing the recursion with a loop. However, the function is tail-recursive, so this is a trivial transformation for the compiler. Additionally, the transformation isn't very interesting, so we will keep this recursive definition.

Efficiency

What is the running time of this? We can figure this out pretty intuetively, by the observation that each pass swaps a block of a elements into their correct position. Then, these elements are not touched again. This means that all elements are moved exacly once --- no more, no less. Hence, this is a linear algorithm, as one would expect from a memory moving algorithm.

What about the space complexity? Even though the function is recursive, it is tail recursive, as the last thing that happends in the code paths where recursion is used is the recursive call itself. The compiler transforms this into a simple loop, such that we keep the linear space complexity (which, after all, was our main motivation to do this).

Lastly, a final optimization that could be used is in the one_by_one_swap. Instead of swapping one byte at a time, we could swap, say, eight bytes at a time while there are more than eight bytes left to swap, and swap the remaining bytes one by one.

We have shown a possible implementation of the problem of swapping two ajacent memory block, without using an auxiliary buffer.

The GNU C library² implementation of memmove calls memcpy if the memory blocks are not overlapping, so this would be a minor optimization. ↩
https://www.gnu.org/software/libc/download.html ↩

Negative Comments

2020-06-12T18:47:04+02:00

This post is a stream of conciousness from reading comments at this post.More specifically, this comment got me thinking¹:

You know no-one is forcing you to play CS in your browser, right? Why is it so offensive to you that this exists and someone else is finding joy in playing it? Why does HN love to rag on web technologies so much?

The assumption here is that in some places, like HN, it's not unusual for projects to be, for the lack of a better word, shat on. I too have experienced this when I started, and unfortunately didn't finish, a series on writing a JPEG encoder/decoder in Rust. People were complaining among other things that rewriting something in A New And Different Language didn't solve a real problem. They were competely right, of course, but they also missed the point completely, as I didn't really need a new encoder/decoder of JPEG files. It was strictly educational, as I wanted to learn more about JPEG, as well as experiencing how it was trying to implement a spec.

Back then I felt the criticism was stupid. Hadn't they read the post at all? How could they really think I did this for some technical gain? My own feelings were at the time backed up by the general consensus on various sites.

Still, looking back, I think their frustration was warranted, and I think the same frustration is surfacing on the CS 1.6 thread from HN linked above.

The problem isn't that some people took the time to to a thing that they're proud of, and posted it to some news aggregation site.

The problem is that these posts are celebrated by the community when their only contribution is "oh, that's neat".

Our attention is limited, and literally with mankinds full knowledge at our fingertips our attention is one of the most important resources we have. Therefore, we must be frugal with it. When people are getting upset that some silly hobby project is invading their news feed, it's not because they were forced to look at it, or because they thought the project was stupid and not worth doing. I think it's mainly because it shows that their peers are more interested in neat hacks as opposed to more meaningful content.

You could argue that hacker news is the natural place for neat hacks, and I do agree, and this problem is of course not isolated to HN. However, I think the frustration comes from a deeper level where we are concerned that we as a community, are getting lost in cheap flashy tricks instead of sound solid concepts and ideas.

You only have about 16 hours of attention every day², and where you spend this time is paramount.

Thanks for reading.

In a sense, this is really the perfect post for these thoughts I've had. For people who don't know, Counter Strike is a highly competitive first person shooter game focused on fast paced combat and shooting accuracy. The professional players, because there are professional players, play with monitors which refresh rates are either 144Hz or 240Hz, which amounts to 6.94ms and 4.17ms per frame, respectively. Putting this game in a web browser, a program known for being bloated and sloppy (although they have their reasons!), is not because it is viable in any sense, but stricktly because "here's a game people know, and look! we can play it in the browser!" ↩
Assuming you are focusing on something every single hour of your day, of course ;) ↩

A Sunday Morning Boot Problem

2020-06-14T15:55:32+02:00

I woke up today to my computer not booting into my arch installation, but into memtest86+.A few months ago I also had problems with booting when I flashed a new BIOS¹ that turned out to be a beta version (thanks MSI!) and not working. At the time I removed GRUB and decided to use EFISTUB instead, since I don't need anything fancy for my booting; I only have one disk from which I boot.

After having changed to EFISTUB I had some problems the first times I upgraded my Linux version; when a new version is installed you build two important files, vmlinuz-linux and initramfs-linux.img, which, as far as I can tell, are the kernel itself and the initial data you want to be in RAM. So, when you update linux you'll get a new vmlinuz-linux with that new version:

/h/mht$ file /boot/EFI/arch/vmlinuz-linux
/boot/EFI/arch/vmlinuz-linux: Linux kernel x86 boot executable bzImage, version 5.7.2-arch1-1 (linux@archlinux) #1 SMP PREEMPT Wed, 10 Jun 2020 20:36:24 +0000, RO-rootFS, swap_dev 0x7, Normal VGA

The problem I got was that the new generated files were put in /boot, but my EFI partition, which should either contain (or know the location of) the files above, was mounted to /boot/efi, and so when I tried to boot there was a mismatch between the linux image loaded, which was the old version in /boot/efi and the new version, which was installed to my system at /.

The solution was to make a systemd service thingy that would run whenever /boot/initramfs-linux-fallback.img changed and copy the three files into /boot/efi/EFI/arch. This worked, and all was well.

That is, until this morning.

I don't know why it stopped working, but it suddenly did, and my system refused to boot. In the boot options menu from my motherboard I still had the correct boot entry, but upon selection it would flash black and a message along the lines of

The file '\EIF\arch\vmlinuz-linux' could not be found.

would flash for about a frame.

I flashed a USB drive with the June arch .iso on it, and looked around in the UEFI shell, which I got kind of familiar with from the last time I messed around with these things. I was able to find the linux image on the EFI partition, and boot it with the right kernel parameters saying which block device is to be mounted as root (which involves typing in a long PARTUUID) and where the initramfs file is. Luckily it all worked, which kind of rules out hardware failure.

I went back and forth a bit, trying to edit the boot entries with bcfg in the UEFI shell or with efibootmgr after having arch-chrooted into my disk from the live flash drive. Nothing seemed to work; in fact, nothing even seemed wrong about the boot entry that I had from before.

I didn't really know what to try out next, so I tried to trim some of the paths to the files on the EFI partition from being full to just the filename. This did not work. Then, looking through the EFI system partition page on the Arch wiki I noticed the following:

/efi is a replacement[6] for the previously popular (and possibly still used by other Linux distributions) ESP mountpoint /boot/efi.

Alright, maybe this is an issue for some reason. I changed it and updated the systemd scripts, but before I got the chance to test it I read a little bit more on the wiki:

mount ESP to /boot. This is the preferred method when directly booting a EFISTUB kernel from UEFI.

Oh okay, I guess I'll just mount it to /boot then. This even means I don't need the systemd scripts anymore since this is the default place in which to put the files.

> mkdir boot
> cp /boot/* boot
> du -h boot
60M
> rm -r /boot/*
> rm -r /efi
> mount /dev/nvme0n1p1 /boot
> mv -r boot/ /boot/
> kak /etc/fstab # update mount point
> rm -r boot

Restart, and now it all works. This time I also tried the two boot entries with relative and absolute paths, and they both worked. Note that I didn't have to change the boot entries since they contain the file paths within the ESP partition, and I didn't change anything on the partition itself, only where in the root partition the ESP partition would be mounted. This is what's strange about it all to me.

The whole ordeal took about 3 hours, from starting the download of the arch ISO to the final working boot.

The day after writing this I stumbled upon a post on /r/archlinux. It turns out that the mount point of my EFI partition wasn't the problem after all, but the path containing forward slashes instead of backslashes were. Apparently, the parsing code was rewritten, and it just so happens that it accidently worked before.

Looking back, I can't really make sense of this, since I thought I'd tried to use the proper slashes after having read that the backslashes are the way EFI wants it. This is especially strange since I tried once to not have the full path but only the filenames (perhaps I started with a /?). Oh well.

I'm not sure if calling it BIOS is technically correct, as I think UEFI, which I now use, is a replacement for BIOS, and that BIOS really is only the booting part and not the terrible GUI in which you change motherboard settings. ↩

Playback speed on Substack

2023-08-25T16:33:19+02:00

I'm subscribed to a Substack that I enjoy, but the Substack video player doesn't haveany options for adjusting the playback speed of the video. Often, I prefer 1.33 speed, or even 1.5 depending on the content. This has been a little annoying, and what I thought was the alternative, downloading the video (somehow) and playing it in some player which does support changing the speed was a little too much work.

However, HTML5 video elements do support changing the playback speed, and while it's not easy to do as a user, it's very straight-forward for a developer. Here's how (on Firefox; I assume other browsers are similar):

Select the video element with the DOM inspector
Right click on the element and select "Use in Console". This opens the console with temp0 bound to the video player element.
Execute temp0.playbackRate = 1.33 (or whatever speed you want)

That's it!

Here's the MDN docs on the playbackRate property. It says that if you set playbackRate to a negative value, the video will play backwards (!). This, however, doesn't seem to always be supported.

Any remarks can be sent to my public inbox.

Thanks for reading.

Languages, Performance, and Intent

2022-08-24T22:51:20+02:00

Optimizing compilers are really cool!They look at your code and rewrite it so that it's behavior is unchanged and it's execution time is reduced. The fact that the compiler cannot change the semantics of your code sounds obvious, but there is a crucial detail here: it cannot change your code for any input¹. If you write a function fn foo(i32) and the compiler wants to generate the code fn fast_foo(i32), it must hold that for any input, like 0, 1, 123, -999, i32::MIN, i32::MAX, or 1337, the behavior of foo and fast_foo is identical. This means the compiler is forced to take into account the corner cases of yout code, which may or may not be a part of your intent when writing that code.

Chandler Carruth shows an example in his talk "Garbage In, Garbage Out: Arguing about Undefined Behavior With Nasal Demons" as seen here. His example shows the difference between signed and unsigned integers in C++, and how the defined wrapping of unsigned integers causes the compiler to output bad code, whereas using signed integers would make the compiler generate good code because it can assume that overflow does not happen. He suggests that the reason the programmer chose unsigned in this case was (a) because it is semantically correct², and (b) that they were "a little bit worried about a narrow contract"³. A little later he answers a question by underscoring the fact that the behavior is different:

Q: Isn't this just a failure of the optimizer doing the right thing?
A: No! We cannot produce this assembly [shows good assembly] for this function [shows initial function]. [...] They are semantically different.

I like this example because it is such a minor choice. It sounds like unsigned would be the right choice since we wouldn't have to worry about accidentally passing in negative offsets⁴, and yet it has a very significant performance impact due to how the compiler is allowed to reason.

Actually checking the bzip2 asm

I decided to double check the asm from Chandler's presentation. The function from the video is still using unsigned integers, and I can't find an issue or a pull request suggesting to make this change, so I decided to check the compiler output myself. If the performance improvement is so great by using signed integers instead, why aren't they doing so?

Here's exactly what I did:

$ git clone https://gitlab.com/bzip2/bzip2
$ cd bzip2
$ mkdir build
$ cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release

Building straight away doesn't help us, since mainGtU is marked static, and we'd like to have it exported in the final executable. I removed the two lines static __inline__ from mainGtU, and we're off to the races:

make -Cbuild
objdump build/bzip2 | less

Search for mainGtU and I get the following:

0000000000006ce0 <mainGtU.part.0>:
    6ce0:	48 89 d0             	mov    %rdx,%rax
    6ce3:	49 89 ca             	mov    %rcx,%r10
    6ce6:	8d 57 03             	lea    0x3(%rdi),%edx
    6ce9:	8d 4e 03             	lea    0x3(%rsi),%ecx
    6cec:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6cf0:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6cf4:	38 ca                	cmp    %cl,%dl
    6cf6:	75 12                	jne    6d0a <mainGtU.part.0+0x2a>
    6cf8:	8d 57 04             	lea    0x4(%rdi),%edx
    6cfb:	8d 4e 04             	lea    0x4(%rsi),%ecx
    6cfe:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6d02:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6d06:	38 ca                	cmp    %cl,%dl
    6d08:	74 06                	je     6d10 <mainGtU.part.0+0x30>
    6d0a:	38 d1                	cmp    %dl,%cl
    6d0c:	0f 92 c0             	setb   %al
    6d0f:	c3                   	ret    
    6d10:	8d 57 05             	lea    0x5(%rdi),%edx
    6d13:	8d 4e 05             	lea    0x5(%rsi),%ecx
    6d16:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6d1a:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6d1e:	38 ca                	cmp    %cl,%dl
    6d20:	75 e8                	jne    6d0a <mainGtU.part.0+0x2a>
    6d22:	8d 57 06             	lea    0x6(%rdi),%edx
    6d25:	8d 4e 06             	lea    0x6(%rsi),%ecx
    6d28:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6d2c:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6d30:	38 ca                	cmp    %cl,%dl
    6d32:	75 d6                	jne    6d0a <mainGtU.part.0+0x2a>
    6d34:	8d 57 07             	lea    0x7(%rdi),%edx
    6d37:	8d 4e 07             	lea    0x7(%rsi),%ecx
    6d3a:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6d3e:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6d42:	38 ca                	cmp    %cl,%dl
    6d44:	75 c4                	jne    6d0a <mainGtU.part.0+0x2a>
    6d46:	8d 57 08             	lea    0x8(%rdi),%edx
    6d49:	8d 4e 08             	lea    0x8(%rsi),%ecx
    6d4c:	0f b6 14 10          	movzbl (%rax,%rdx,1),%edx
    6d50:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    6d54:	38 ca                	cmp    %cl,%dl
    6d56:	75 b2                	jne    6d0a <mainGtU.part.0+0x2a>

Compare this to what we get from the same function if we replace the types of i1 and i2 with Int32. I copied the whole function, added a _2 suffix, and recompiled.

00000000000085d0 <mainGtU_2>:
    85d0:	48 89 d0             	mov    %rdx,%rax
    85d3:	4c 63 d6             	movslq %esi,%r10
    85d6:	48 89 ca             	mov    %rcx,%rdx
    85d9:	48 63 cf             	movslq %edi,%rcx
    85dc:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    85e0:	46 0f b6 14 10       	movzbl (%rax,%r10,1),%r10d
    85e5:	44 38 d1             	cmp    %r10b,%cl
    85e8:	74 0e                	je     85f8 <mainGtU_2+0x28>
    85ea:	41 38 ca             	cmp    %cl,%r10b
    85ed:	0f 92 c0             	setb   %al
    85f0:	c3                   	ret    
    85f1:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    85f8:	8d 4f 01             	lea    0x1(%rdi),%ecx
    85fb:	48 63 c9             	movslq %ecx,%rcx
    85fe:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    8603:	8d 4e 01             	lea    0x1(%rsi),%ecx
    8606:	48 63 c9             	movslq %ecx,%rcx
    8609:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    860d:	41 38 ca             	cmp    %cl,%r10b
    8610:	74 0e                	je     8620 <mainGtU_2+0x50>
    8612:	44 38 d1             	cmp    %r10b,%cl
    8615:	0f 92 c0             	setb   %al
    8618:	c3                   	ret    
    8619:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
    8620:	8d 4f 02             	lea    0x2(%rdi),%ecx
    8623:	48 63 c9             	movslq %ecx,%rcx
    8626:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    862b:	8d 4e 02             	lea    0x2(%rsi),%ecx
    862e:	48 63 c9             	movslq %ecx,%rcx
    8631:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    8635:	41 38 ca             	cmp    %cl,%r10b
    8638:	75 d8                	jne    8612 <mainGtU_2+0x42>
    863a:	8d 4f 03             	lea    0x3(%rdi),%ecx
    863d:	48 63 c9             	movslq %ecx,%rcx
    8640:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    8645:	8d 4e 03             	lea    0x3(%rsi),%ecx
    8648:	48 63 c9             	movslq %ecx,%rcx
    864b:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    864f:	41 38 ca             	cmp    %cl,%r10b
    8652:	75 be                	jne    8612 <mainGtU_2+0x42>
    8654:	8d 4f 04             	lea    0x4(%rdi),%ecx
    8657:	48 63 c9             	movslq %ecx,%rcx
    865a:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    865f:	8d 4e 04             	lea    0x4(%rsi),%ecx
    8662:	48 63 c9             	movslq %ecx,%rcx
    8665:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    8669:	41 38 ca             	cmp    %cl,%r10b
    866c:	75 a4                	jne    8612 <mainGtU_2+0x42>
    866e:	8d 4f 05             	lea    0x5(%rdi),%ecx
    8671:	48 63 c9             	movslq %ecx,%rcx
    8674:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    8679:	8d 4e 05             	lea    0x5(%rsi),%ecx
    867c:	48 63 c9             	movslq %ecx,%rcx
    867f:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    8683:	41 38 ca             	cmp    %cl,%r10b
    8686:	75 8a                	jne    8612 <mainGtU_2+0x42>
    8688:	8d 4f 06             	lea    0x6(%rdi),%ecx
    868b:	48 63 c9             	movslq %ecx,%rcx
    868e:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    8693:	8d 4e 06             	lea    0x6(%rsi),%ecx
    8696:	48 63 c9             	movslq %ecx,%rcx
    8699:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    869d:	41 38 ca             	cmp    %cl,%r10b
    86a0:	0f 85 6c ff ff ff    	jne    8612 <mainGtU_2+0x42>
    86a6:	8d 4f 07             	lea    0x7(%rdi),%ecx
    86a9:	48 63 c9             	movslq %ecx,%rcx
    86ac:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    86b1:	8d 4e 07             	lea    0x7(%rsi),%ecx
    86b4:	48 63 c9             	movslq %ecx,%rcx
    86b7:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    86bb:	41 38 ca             	cmp    %cl,%r10b
    86be:	0f 85 4e ff ff ff    	jne    8612 <mainGtU_2+0x42>
    86c4:	8d 4f 08             	lea    0x8(%rdi),%ecx
    86c7:	48 63 c9             	movslq %ecx,%rcx
    86ca:	44 0f b6 14 08       	movzbl (%rax,%rcx,1),%r10d
    86cf:	8d 4e 08             	lea    0x8(%rsi),%ecx
    86d2:	48 63 c9             	movslq %ecx,%rcx
    86d5:	0f b6 0c 08          	movzbl (%rax,%rcx,1),%ecx
    86d9:	41 38 ca             	cmp    %cl,%r10b
    86dc:	0f 85 30 ff ff ff    	jne    8612 <mainGtU_2+0x42>

The setup code of the two functions is a little different, but once we get going it looks something like this:

              UInt32                          Int32
====================================================================
                                  | jne    8612 <mainGtU_2+0x42>
jne    6d0a <mainGtU.part.0+0x2a> | lea    0x8(%rdi),%ecx   
lea    0x8(%rdi),%edx             | movslq %ecx,%rcx   
lea    0x8(%rsi),%ecx             | movzbl (%rax,%rcx,1),%r10d   
movzbl (%rax,%rdx,1),%edx         | lea    0x8(%rsi),%ecx   
movzbl (%rax,%rcx,1),%ecx         | movslq %ecx,%rcx   
cmp    %cl,%dl                    | movzbl (%rax,%rcx,1),%ecx   
jne    6d0a <mainGtU.part.0+0x2a> | cmp    %cl,%r10b   
                                  | jne    8612 <mainGtU_2+0x42>

I'm not sure if the signed code is better or worse, although in terms of instruction count, it is definitely more. Weird.

I also tried to CFLAGS=-march=native before building, thinking that maybe there's some platform specific code that we wanted the compiler to generate, but the code for the two functions seems to be identical with and without this flag.

There are cases where we are explicitly being made aware of a hidden tradeoff, and are given tools for dealing with it. A good example of this is the C and C++⁵ flag -ffast-math, which enables a collections of other flags that relaxes some of the requirements of the IEEE-754 floating-point number standard. One of the flags it sets, -fassociative-math, allows the compiler to reorder the operation (a + b) + c to a + (b + c)⁶. Another is -freciprocal-math, which allows the compiler to consider a / b the same as a * (1 / b). In of itself these transformations are not so valuable, but in combination with common subexpression elimination or loop hoisting they can yield good speedups. By specifying -ffast-math we can allow the compiler to change the semantics of our programs (in a limited sense) such that it can make the output code faster. However, we still need to know that this is a flag we can set. If we don't konw about -ffast-math and we don't mind these transformations⁷ we are inhibiting the compiler's ability to generate good code without gaining any benefit.

John Carmack's conversation on Lex Fridman's podcast contains a similar example of the idea that some of these trade-offs are very skewed. The part can be heard here. In talking about the innovations required to make Quake, and speficically about optimization, John says this:

The most leverage comes from making the decisions that are a litte bit higher up, where you figure out how to change your large scale problem so that these lower level problems are easier to do, or it makes it possible to do them in a uniquely fast way.
--- John Carmack

The changes John are talking about are slightly different than Chandler's because in John's case there is, say, a design decision with some wiggle room which can be used to yield huge benefits in terms of speeup. Chandler's signedness case is an example of a decision that was accidentally made⁸, maybe without knowing so and probably without knowing it's impact⁹.

In an earlier post I explored the codegen of a merge function¹⁰ and tried to have the compiler output branchless code with the conditional instruction cmove. By writing the function in slightly different ways, and eventually writing it straight in x86, I got a total of seven variants of what I considered the same function. The difference in time spent on a micro benchmark was 31ms for the slowest (the initial straight-forward way) to 19ms for the fastest (the asm, but two C-variants were also down there). At the time I was happy with having beaten the compiler on optimizing such a small and simple function. Now I'm not so sure this is what happened, and I suspect that there are inputs which would make my trivial-and-slow implementation behave differently than any of the fast ones. This would mean that the optimizer wasn't too stupid to get it right, but that I accidentally encoded behavior in the implementation that was too constraining for it to work around. Behavior that I potentially didn't care about and that I would gladly give up for a 38% reduction in execution time.

If we want our programs to be fast we clearly need to understand what our computers can do, but we also need to understand what our programs are actually instructing the computer to do, and which constraints we are setting for an optimizing compiler. We cannot simply out-source the job of generating fast machine code to the compiler, because we need to use the wiggle room of our design space in our advantage, and the compiler cannot do this. Without working from both ends we may often find ourselves with terrible machine code that a simple and insignificant change to our code would have fixed. By choosing to ignore this, we pay the price.

Suggestions, comments, tips, and the signed bit of your integers can be sent to my public inbox (plain text email only).

Thanks for reading.

This is also where undefined behavior comes in; the compiler is allowed to assume that UB does not happen, because a program execution in which is does happen is non-sensical. If if can show that a variable having a certain value would cause UB it is allowed to assume that this variable will not have that value. Depending on how offended you would be by being called a "language lawyer" you might find this argument to be non-sensical, but this is the status quo. ↩
The numbers in this case were used for offsets from a base pointer. These should never be negative, and so by using unsigned integers we can enforce this trait. ↩
I don't think Chandler is being very charitable in his guesswork here, but it is a talk trying to disarm UB-fear so as a story telling device I guess it's ... fine? ↩
Conversely, if we accidentally have huge positive offsets it is very likely that we segfault at once, as the address of block[2147483648] and higher is very likely not mapped. ↩
I assume many more languages have either the same flag or a similar flag with a different name. ↩
I wrote an blog post about some surprises of floating point numbers here where I also give an explicit example of non-associativity. ↩
There are good reasons for not using -ffast-math; very carefully written numerical code will often depend on the exact order of operations in order to avoid losing precision, dealing correctly with NaNs, and so on. -ffast-math throws all of this out the window. ↩
Again, I'm guessing here. ↩
Now that we did the work and found that the assembly looked, if not worse, not better, it is worth questioning whether this decision really had any impact or not. Maybe this is the real lesson here? ↩
merge takes two sorted lists and merges them together to one sorted list. Usually this is done by walking along the fronts of the two lists and poping the smaller of the two elements. ↩

Mathematica's Scoping is Weird

2021-12-30T23:30:11+01:00

I've been using Mathematica a little bit in the past few weeks to do some simple plotting and symbolic manipulation of equations.It's okay; I keep running into weird behavior and getting funny errors that I assume more seasoned Mathematica users would not get. Here's one of them.

`With`

Mathematica has weird scoping rules. For instance, there's a thing called With that let's you assign values to variables in some expression and then have these values be replaced in that expression. It feels similar to a regular block in C-like languages. It looks like this:

In[1]:= With[{x=1}, x+1]
Out[1]= 2

No surprise so far, since 1 + 1 == 2. However, what happens if you make a new variable in the expression in a With?

In[2]:= With[{}, inner=1]
Out[2]= 1
In[3]:= inner
Out[3]= 1

Okay, so inner has now leaked out to the global scope. Annoying, since it might be difficult to avoid having symbols leak out of your scope, but maybe it's not so bad.

`Block`

Another similar form to With is Block, which is used for dynamic scoped variables. Assume we have a bunch of values that we don't want to keep passing around all functions that we use. For instance we can have a function that just adds its argument to some "global" symbol:

In[1]:= addX[a_]:=a + x

Here x is a free variable in the function addX. We can evaluate the function and assign a value to a:

In[2]:= addX[12]
Out[2]= 12+x

We can also define x to be some value.

In[3]:= x=3;
        addX[12]
Out[4]= 15

Maybe we would like to evaluate addX but use a different temporary value for x. We can use Block for this:

In[5]:= Block[{x=10}, addX[10]]
Out[5]= 20

This does not change the value of x

In[6]:= x
Out[6]= 3

Using the other construct from the beginning, With, does not work the same way, since the addX function will already look up the global value of x when it is evaluated. In a sense, Block makes references to x give higher precedence to the Block value instead of the global value.

In[7]:= With[{x=10}, addX[10]]
Out[7]= 13

This might be surprising, but hey, different constructs for different semantics; presumably there are times when you'd want With semantics and other times when you want Block semantics.

Another problem arise when we have an old variable still in the notebook, maybe introduced from a scope you thought was local. Consider the following

In[8]:= Block[{y=10, x=y+10}, addX[10]]
Out[8]= 30

So far all is well; y is 10, x is 20, and we add 10 to x which gives us 30. What happens now if we add a global variable named y?

In[9]:= y=0;
        Block[{y=10, x=y+10}, addX[10]]
Out[10]= 20

We get a different answer! It turns out that when Block evaluates its arguments, it does so without binding the values it creates as dynamic, so the evaluation of x=y+10 does not use the newly made variable y but rather the global value 0. Unless, of course, y has no value yet, in which, I guess at a later stage, it is bound to the value introduced by Block. Makes sense? No?

Presumably this is documented somewhere in the language specification, if you just know where to look and exactly how the language works. But man, this is not intuitive.

Pointers, complaints, suggestions, and your bitcoin wallet, can be sent to my public inbox (plain text emails only).

Thanks for reading.

A Ghost Story

2019-10-15T21:29:19+02:00

Spooktober is upon us and it's been a while since I've written anything here,so here's a short computing ghost story that I experienced a few years back.

I was living in Zürich at the time with a bunch of people from all over the world, including "Bee" from south korea, and as many other ghost stories, we were enjoying ourselves with beer. I cannot remember exactly what we were talking about, but my phone, at the time a Samsung, was lying on top of Bees phone, which brand or model I do not remember.

For some reason, I ended up checking that maps app but to my surprise, the GPS didn't indicate that I was drinking beer in Zürich, but that I was in Busan. Busan, South Korea.

Humoured by this, I give my phone to Bee and ask "Hey Bee, do you know where this is?". He laughs, and looks at me in a funny way: "Yes, this is my parents house?"

I want to clarify, my phone didn't magically open up that maps app on the position of Bees parent's home addresss in Busan. That would have been strange, but I could probably explain it. Maybe there was some NFC thing going on, or maybe he (or even I) had picked up my phone and, for some reason, found his parents house. No. My phone decided that it was in Busan.

I have absolutely no explanation for what happened here, apart from a visit by the Android ghosts. There is absolutely no reason whatsoever that I can come up with for my phones GPS position to be changed to, not anything arbitrary, but to the house of the parents of another person in the room.

It is unclear whether Bees phone had saved that address as his home address, but I simply cannot believe that it hasn't. Thus, this is really all that I have for unraveling this mystery.

TL;DR

Phone A is on top of phone B.
Phone B has saved address.
???
Phone As GPS thinks its position is that address.

Please fill in step 3.

If anyone has suggestions or ideas for what happened, please do send a mail to my public inbox:

https://lists.sr.ht/~mht/public-inbox

How Much Abstraction Is Too Much?

2017-06-21T12:37:40+02:00

Let's talk about abstraction.As we know from RFC 1925 it is easier to move a problem around than it is to solve it. This directly suggests Abstraction Based Development. It goes like this:

Have a problem
Solve 5% of the problem
Invent an abstraction to solve the remaining 95%
Recucrse on the abstraction

After only 91 steps (more or less) we have reduced the problem to only 1% of its original size, making the problem trivial to solve, since it can be solved with a one-liner in Python (this is the unit of problem hardness in CS¹).

On a more serious note, abstractions are useful. Abstractions are everywhere. Dynamically-sized arrays are an abstraction. Iterators are an abstraction. Abstractions are all about hiding the stuff we do not care about, and reduce it to the stuff we do care about. We don't really care about the fact that arrays are of fixed size in memory, and that we have to resize it when we need more, we just want to push stuff onto our Vec. It is easy to infer from this that abstractions are about simplification: we do not want the details, but only the big picture.

What often follows from abstraction is indirection. Dynamic dispatch. Compile time generics. The magic stuff the compiler does for us when we only kinda say what we want. Calling iterator.next()? Ah compiler, you understand what I mean. But to someone reading the code it is not always obvious what happens. In the conceptual sense, we understand that the iterator will produce the next value in the set of values it is iterating over. That part is alright. But what exactly is happening? Where is that code? Is this operation cache friendly? Is it computaitonally complex? How confident can we be in that the implementation is correct? What can we do if we suspect something is wrong? These questions are not always simple to answer².

I will try to argue that abstractions have a very real and very serious downside, that (seemingly) is often overlooked: complexity.

But first of all, the code in this post is real code written by real people. This should go without saying, but just to be absolutely clear: I do not mean to talk down on neither the code nor the authors of the code, and I do not think this is bad code. It is just a good example.

A Motivating Example: `String::contains`

Maybe you have just learned about string searching algorithms. Aho-Corasick, Boyer-Moore, Knuth-Morris-Pratt, you name it. You, a curious person and a Rust programmer, start to wonder. How is String::contains implemented in the Rust standard library? Let us take a look. First we need to find the method:

$ rg "struct String"
src/liballoc/string.rs
262:pub struct String {
...
...

Okay, String is defined in liballoc, which is kind of weird? But alright.

We enter the file and search for fn contains, but nothing shows up. Strange, isn't it listed under String in the docs? After scrolling up 21 methods in the docs, we can find our issue: Methods from Deref<Target = str>

Yes, of course. The function is not a String method, but a str method (we know the difference between String and &str, but what was str again? Oh, maybe str becomes &str in (&self) methods). rg "struct str" and rg "struct Str " gives us nothing. No worries, we have fuzzy file search in our editor. Besides, str sounds fundamental enough that it should be in libcore. And we do find libcore/str/mod.rs.

Again we search for fn contains, and now we get three matches:

fn contains_nonascii(x: usize) -> bool {

...

/// Methods for string slices
pub trait StrExt {
    // NB there are no docs here are they're all located on the StrExt trait in
    // liballoc, not here.

    #[stable(feature = "core", since = "1.6.0")]
    fn contains<'a, P: Pattern<'a>>(&'a self, pat: P) -> bool;

...

#[stable(feature = "core", since = "1.6.0")]
impl StrExt for str {
    #[inline]
    fn contains<'a, P: Pattern<'a>>(&'a self, pat: P) -> bool {
        pat.is_contained_in(self)
    }

Here they are. The function is generic over Pattern. What is a Pattern anyways? If we read the docs of contains, we clearly see that Pattern is the argument, even though the examples would suggest that the argument is a &str. So we click on Pattern. Aha, it is just an abstraction that allows us to use different types as the pattern - both char, String, &str, and more³. Personally I would think that searching for a char and matching a &str are rather different problems⁴, but maybe this turned out to be a convenient way to handle Rusts lack of function overloading.

Back to contains. Pattern::is_contained_in is called. Maybe this is used to allow the types that implements Pattern to choose how they want to search themselves. Sounds reasonable, since we then are in the same situation as if we had function overloading. We are mostly concerned about &str (or String, or str?).

pub trait Pattern<'a>: Sized {
    ...
    /// Checks whether the pattern matches anywhere in the haystack
    #[inline]
    fn is_contained_in(self, haystack: &'a str) -> bool {
        self.into_searcher(haystack).next_match().is_some()
    }
    ...
}

So we make the pattern into a Searcher, and the haystack is the text we are searching in. The searcher seemingly iterates over all matches, but we are only interested if it is there at all, so we take the first and see if we got something.

We search for Searcher (a little ironic, don't you think?), and get this, in code form. We are getting closer. next_match seems alright: if we get a match between a and b, we got a match. If we are done without getting a match, we didn't get a match. Otherwise, we continue to call next (it is not really clear what the remaining case is, but maybe we get some information about mismatches). So what does next do? And where is its implementation for &str? Let's search for &str:

/////////////////////////////////////////////////////////////////////////////
// Impl for &str
/////////////////////////////////////////////////////////////////////////////

/// Non-allocating substring search.
///
/// Will handle the pattern `""` as returning empty matches at each character
/// boundary.
impl<'a, 'b> Pattern<'a> for &'b str {
    type Searcher = StrSearcher<'a, 'b>;

    #[inline]
    fn into_searcher(self, haystack: &'a str) -> StrSearcher<'a, 'b> {
        StrSearcher::new(haystack, self)
    }

Oh, okay so this is how we got the implementor of Searcher in the first place. Here we find StrSearcher, which sounds promising. The struct has members, there is an enum here, and another struct with something fw and bw (forwards and backwards?). No need to worry, we can try to understand all this stuff when it comes up. Let us look at the Searcher implementation.

unsafe impl<'a, 'b> Searcher<'a> for StrSearcher<'a, 'b> {
    ...
    fn next(&mut self) -> SearchStep {
        match self.searcher {
            StrSearcherImpl::Empty(ref mut searcher) => {
                // empty needle rejects every char and matches every empty string between them
                let is_match = searcher.is_match_fw;
                searcher.is_match_fw = !searcher.is_match_fw;
                let pos = searcher.position;
                match self.haystack[pos..].chars().next() {
                    _ if is_match => SearchStep::Match(pos, pos),
                    None => SearchStep::Done,
                    Some(ch) => {
                        searcher.position += ch.len_utf8();
                        SearchStep::Reject(pos, searcher.position)
                    }
                }
            }
            StrSearcherImpl::TwoWay(ref mut searcher) => {
                // TwoWaySearcher produces valid *Match* indices that split at char boundaries
                // as long as it does correct matching and that haystack and needle are
                // valid UTF-8
                // *Rejects* from the algorithm can fall on any indices, but we will walk them
                // manually to the next character boundary, so that they are utf-8 safe.
                if searcher.position == self.haystack.len() {
                    return SearchStep::Done;
                }
                let is_long = searcher.memory == usize::MAX;
                match searcher.next::<RejectAndMatch>(self.haystack.as_bytes(),
                                                      self.needle.as_bytes(),
                                                      is_long)
                {
                    SearchStep::Reject(a, mut b) => {
                        // skip to next char boundary
                        while !self.haystack.is_char_boundary(b) {
                            b += 1;
                        }
                        searcher.position = cmp::max(b, searcher.position);
                        SearchStep::Reject(a, b)
                    }
                    otherwise => otherwise,
                }
            }
        }
    }

We start of by matching on self.searcher, which is either Empty or TwoWay (what about OneWay?), and by the comment in the first case, we understand what is happening: StrSearchImpl::Empty is actually an empty pattern (this is confirmed by StrSearcher::new above). Not a very interesting case for us, so we move on to TwoWay.

First we check if we are Done. If so, we return Done. Then we check something about searcher.memory, but it is not clear what memory is, so maybe we should check that out. We find struct TwoWaySearcher (which is the type that TwoWay contains), and lo and behold: A comment describing the algorithm! Well, some background information anyways, but the code in TwoWaySearcher, which turns out to be the place where the real stuff happens, is well documented. Natually, the algorithm is rather convoluted (hard problems are hard --- who knew?), but we found out what we wanted to.

Let us try to sum up our journey. We wanted to know which string searching algorithm String::contains was. This method is from str, as String Derefs to str, and we would like to call contains on strings we don't own. Then our search string becomes a Pattern which we transform into a Searcher, which takes a haystack, which is our original text, and this searcher does the searching. Simple, right?

So what is the point?

I think that this journey was not trivial. We have skimmed a lot of code (I have, anyways), jumped between files and modules, read inline comments and markdown docs, and finally, at the bottom of the rabbit hole, we actually found out what we wanted. Why does something seemingly so simple have to be behind so many layers⁵?

Some of these layers have benefits --- there is no doubt about that. Maybe we would like to write s.contains('a'), s.contains("ab") or |s: &[char]| "ayyy".contains(s), and since we don't have function overloading, we need Pattern to abstract over the variations. Maybe we would like to have a common argument type for &str methods: &str have 20 methods that takes a Pattern, including contains, find, split, and replace.

But all of the layers? If we list all structs, enums, and Traits one needs to understand in order to dig through something rather simple like this⁶, can we explain why they all have to be there? Are some of them there because we might need them in the future⁷? Is it possible that we made this more convoluted than strictly needed? I don't claim to have an answer to any of these questions, but I think these are important questions, and I do not think they are asked often enough.

Of course, this is not to say that the str module is overengineered, or that I think there is anything wrong with this implementation. The only reason I brought it up as an example is because I did try to find out how contains works some months ago, but I gave up because I could not understand the whole system (admittedly I didn't spend too much time on it). There were simply too much stuff! It is ironic that we can simplify a system with abstractions so far as to end up with an even more complex system.

We like short functions⁸, and we like to introduce types to ensure type safety⁹. We like flexible solutions, and generalized interfaces. But it is easy to overlook what we are giving up by building a tower of abstractions, namely simplicity.

I think simplicity is something we, as developers, should strive for, and I think it is often something that is forgotten. I don't think that there is an inherit tradeoff between complex and flexible, and simple but inflexible. I think we can get both. But, as always, the best solution is the hardest to find.

And don't let your algorithm professor/friend/family relative tell you otherwise! ↩
Documentation is a great tool here, but docs might be (1) misleading, (2) out of date, (3) lacking, or (4) non-existent. ↩
... and impl<'a, F> Pattern<'a> for F where F: FnMut(char) -> bool? This is seemingly the same as s.chars().any(f). ↩
In the implementation sense, not the conceptual sense. ↩
Is str::contains actually an orge? ↩
You might think "Aha, but this isn't rather simple, because such and such", but I do think that something like this should be simple. I did not want to understand how the algorithm works, I just wanted to find it! ↩
In which case why are they there today? ↩
Although why functions and methods should be short is not always explained. ↩
For instance, see Pascal Hertleif's talk Writing Idiomatic Libraries in Rust [10:38] ↩

Searching High and fLow

2023-10-19T18:10:56+02:00

Recently¹ at work I found myself with an interesting problem in need of solving.The problem was one stage of a larger algorithm, and we wanted to be able to run the whole algorithm in an optimization loop, so it was important that it was fast.

The problem is this: given a set of points $P \subseteq\mathbb{R}^2$ and a set of sites $S\subseteq\mathbb{R}^2$, assign each point to a site so that:

The sum of the distances from each assigned point to its site is minimized
The number of points assigned to a site is below some limit $L$.

Without constraint (2) we can always choose the closest site to each point and be happy. This is also really fast, since we just need to compute the pairwise distances. Before, this is what the system did, but with the introduction of requirement (2), I had to come up with something else. Here is an example of how the optimal solutions look with and without constraint (2), for a sample set of points and sites:

Points grouped to their closest site.

Sites have a capacity of 5.

Intuitively, it's clear what's going on here; the left site has to "give up" some of its points to the right site, so that while these points were closer to the left site, it frees up the site to also be assigned to the points on the far left. Computationally, however, it is not so straight forward to see how we can do this.

Looking up the problem

Or, "How I nearly got tricked into thinking this was NP-Hard".

When presented with a problem like this, the first thing I always do is to figure out what the "proper" name of the problem is, because it is very likely to have already been solved. Some quick searching lead me to a couple of problems:

Facility location problem

In the facility location problem you are given a set of potential facility sites $L$ and a set of demand points $D$, and the task is to choose which facilities to open so that the distances from each demand point to an open facility is minimized. This problem is NP-hard.

It's very similar to my problem -- we're given two sets of points and we're minimizing distances -- but it's not exactly the same, since my problem asked to optimize the choice of facility per demand point, with the capacity constraint. So I had to keep looking.

Vertex k-center problem

In the vertex k-center problem we're given a complete undirected graph $G=(V,E)$ and a cost function $c: E\to\mathbb{R}$, and we're asked to choose vertices $V_0\subseteq V$ to minimize the cost of the vertex farthest away from $V_0$: $$\text{minimize} \max_{v\in V}\min_{ w\in V_0} c(v, w)$$

The vertex k-center problem is also NP-hard.

Again, it looks related, but not quite right. We already know which of our points are in which class, so the "combinatorial choosing" aspect of this problem isn't a part of my problem. The search continued.

Assignment problem

Eventually, my search lead me to the assignment problem. Here we want to assign tasks to workers where each pair has a cost associated to it, and we seek to minimize the total cost. In graph theory terms (and if the number of tasks and workers is the same), this is finding a minimal matching of a certain size, of a weighted bipartite graph.

This too looks similar to what we want, but again, not quite, due to our constraint. Also, our number of points and sites are not the same, so the matching stuff doesn't apply. However, this lead me to Ulrich Bauer's master thesis, which table of contents include a section named "Minimum Cost Flow". Reading the section name was enough.

Minimum Cost Flow

Let's start with the more known, related problem: max-flow. In max-flow, you're given a flow network, and the task is to figure out how much flow you can send through the network. Imagine a network of water pipes through a city with a water basin (a source node) on one side of the city, and a pipe to the ocean (a sink node) on the other side. We want to figure out how much water we can send from the basin to the ocean.

In graph terms, it looks like this: we have a directed graph $G=(V,E)$ and two special nodes: the source $v_s$ and the sink $v_t$. All edges $e\in E$ have a capacity $e_c$. Now we assign a flow, a non-negative number $f(e)\in\mathbb{R}^+$, to the edges. In the analogy above, the flow correspond to the amount of water flowing through the pipe that is that edge. We also have some constraints:

The flow in an edge cannot exceed its capacity.
Nodes have to conserve the flow, so that the flow in the edges going in to the node has to be equal to the flow in the edges going out of that node.
Only the source $v_s$ is allowed to produce flow (meaning it can send out more than it got in).
Only the sink $v_t$ is allowed to consume flow (meaning it can take in more than it sends out).

We want to maximize the flow that the source node produces, which, due to the conservation of flow (req. (2)), is the same as what the sink consumes, while respecting the constraints.

That's max-flow. In minimum cost flow, we also have a cost $c(e)$ for each edge, per flow. Instead of finding the maximum amount of flow we can send through the network, we want to find the cheapest way of sending a certain amount of flow.

Going back to our problem

Water pipes and capacities can seem like a long way from points and sites in the plane; what's the connection? If we create a graph $G=(P\cup S, E)$ and imagine the sites $S$ to be on the left side, and points $P$ on the right side, we can draw edges between all pairs of sites and points to get a bipartite graph:

A bipartite graph with the sites on the left and points on the right.

Now we want to say that a flow going through an edge $(s,p)\in E$ means that site $s$ and point $p$ are connected. The cost of this edge should be the distance between the site and point, so that we'll reduce the total distance of the pairs that we end up assigning.

There's a couple more things we need to make sure that a minimum cost flow through our made-up network actually solves our original problem.

The network has to be a flow network
The sites respect their given limit $L$
Each point is only connected to one site

For (1) we can add in a source to the very left and a sink on the very right and connect them to the two groups, like so:

The leftmost node is the source node, and the rightmost the sink node.

For (2) we can set the capacity on the edge from the source to the site equal to $L$; this way, if we also set the capacity on the edges in the middle to 1, then each edge that is filled with flow will spend one capacity of the source-site edge. For (3), we can set the capacity of the edges from the points to the sink to be 1. This way we expect all the flow that comes in along the edge from the site to the point to go along this edge and into the sink. We still haven't assigned costs to the edges adjacent to the source and sinks, but we don't really care about which of these edges are used, so we can set them all to 0. We also know how much flow we want to send through the network, since all the points should be connected to a site, and each of these connections uses 1 flow.

Here's the final network, where I have set the site capacity $L$ to be 2. For readability, only one edge in each "layer"² is labeled, but the only difference from the other edges are the costs in the middle layer.

The middle gray edges will vary in cost. Otherwise, the capacities and costs are constant for each layer.

And here's one possible solution, in which edges with saturated with flow are black, and edges without flow are gray. Counting from the top with 1-indexing, the first site is connected to the first and fourth point, the second site to the third and sixth, and the third site to the second and fifth point. We see that all sites have maxed their capacity since they have two edges going out to the right, and all points are accounted for, since all of them have an edge to the sink.

Gray edges are not used, and black edges are saturated with flow.

A quick note before I continue, I've glossed over one point: what happens if the flows we get from solving the problem aren't integer? Could we get a bunch of 0.5 flows in the graph? For the approach I went for, the answer is no by construction (as we'll see). However, when writing this post I tried to prove that any optimal flow with fractional flows could be converted to an integer flow that was at least as cheap, irrespective of how this flow was found, but I couldn't quite figure out how. If you do know, my public inbox is open³.

Solving

Great, we now have a flow network in which we want to find a min-cost flow. This was a Rust codebase, so I searched through crates.io and found mcmf, a crate that wraps the LEMON library. LEMON was referenced in the Wikipedia page for minimum cost flow, so I figured it was a safe bet. I added it to my project, set up my graph, ran .mcmf() which ran in a fraction of a second, read out the results I wanted, and it all Just Work™ed.

However ...

LEMON is a C++ library, and I am compiling my Rust crate to wasm using wasm-pack. This works really well for Rust code, but not for a joint C++ code base; it seems the issue is with the compilation target. rustc has both wasm32-unknown-unknown and wasm32-unknown-emscripten listed as Tier 2 supported platforms, but rustc cannot, of course, compile C++ code. So we need a separate toolchain for C++. emscripten is a complete toolchain for compiling C++ to wasm32-unknown-emscripten, but wasm-pack compiles to wasm32-unknown-unknown. Doesn't sound like a big difference, right?

Wrong.

I'm still a little hazy about the details here, but it seems that the two targets are fundamentally different, and that it is not possible to compile for the two targets and somehow join them. Furthermore, it also seems that adding -emscripten as a target to wasm-pack is also a no-no. If there is a solution to this problem, I'd love to hear it! Please send a mail to my public inbox if you know.

I gave up on this path, and went down the other path: implementing it myself.

Implementation

Truth be told, I have never been very comfortable with implementing max-flow. It's not something I do very often, and there are a lot choices one has to make. Choice of algorithm: Standard Ford–Fulkerson with BFS (aka. Edmonds-Karp), try Dinic's, or finally try to understand Push-relabel? Ford-Fulkerson feels like a safe choice for the first version. How do you represent the graph? Everything on the heap? Vec<Node> for the nodes and have Node contain an adjacency list of indices for the edges? Where does the flow and capacities go? BFS through the graph sounds okay, but how do you represent a path? Won't there be a lot of them? Vec<usize> again for each path sounds like a lot of allocations, but maybe it's okay. Oh and by the way, this is all just for max-flow. How do we even solve min-cost-max-flow?

When bogged down with these uncertainties, the best way forward is to just do something, with the expectation that you're only trying something out. At this stage, the only important thing is building a theory of the problem.

First version

I decided that I didn't want to represent the graph completely as-is; the source and sink nodes could be implicit in the code⁴. The flow and capacities for the edges adjacent to the source and sink could also be handled separately, since we know what the graph looks like around these two vertices.

Further, I wasn't sure if the min-cost aspect of the problem was difficult, and decided on a greedy approach without making sure that the solutions produced were optimal. I wanted to write a loop in which we find the cheapest way of increasing the flow by 1, and do that $|P|$ times. This is just Ford-Fulkerson where you find the min-cost path, and I figured it was probably right, but I didn't sit down and prove it. By checking against mcmf later I would get a hunch for whether this really is optimal or not, but I didn't want to spend time figuring this out before seeing if the implementation was feasible⁵ .

For the vertices I made an enum with an index to identify the two different types

pub enum Node {
    Site(usize),
    Point(usize),
}

and for the edges I made a struct containing these indices, as well as the edge cost.

pub struct Edge {
    pub site: usize,
    pub point: usize,
    pub cost: F64,
}

We don't actually have to represent the edges with an adjacency list or anything like that, because we have a complete bipartite graph, so we already know what all the edges are.

Edge costs (I duplicated these for some reason) were precomputed and stored in a $|P|\times |S|$ matrix; Since all edges are there, the matrix is completely full. Capacities were handled in a slightly funny way; I made a Vec<usize> of length $|S|$ ($|P|$), where each entry corresponded to the capacity of the site (point) with the same index, respectively. For the cross-edges I made a Matrix<bool> of size $|S|\times |P|$ called edge_used where each entry corresponded to whether the edge was used or not, since these edges had unit capacity.

The funny looking F64 is a type alias

type F64 = float_ord::FloatOrd<f64>;

using float_ord so that we can order floats⁶.

A path through the network was also a struct listing the Nodes in the path, as well as the negative cost of the path, because the only priority queue in Rust's standard library is std::collections::BinaryHeap, which is a max-heap⁷.

pub struct Path {
    pub neg_cost: F64,
    pub edges: Vec<Node>,
}

Now we can write the function step on Path which extends the path by one move, to all possible new paths:

pub fn step(&self, cost: &Matrix<F64>, edge_used: &Matrix<bool>) -> Vec<Path> {
    let last = self.edges.last().unwrap();
    match last {
        Node::Site(si) => (0..cost.cols)
            .filter(|pi| !edge_used.get(*si, *pi))
            .flat_map(|pi| {
                if self.has_edge(*si, pi) {
                    return None;
                }
                let cost = cost.get(*si, pi);
                let mut edges = self.edges.clone();
                edges.push(Node::Point(pi));
                Some(Path {
                    neg_cost: FloatOrd(self.neg_cost.0 - cost.0),
                    edges,
                })
            })
            .collect(),
        Node::Point(pi) => (0..cost.rows)
            .filter(|si| *edge_used.get(*si, *pi))
            .flat_map(|si| {
                if self.has_edge(si, *pi) {
                    return None;
                }
                let cost = cost.get(si, *pi);
                let mut edges = self.edges.clone();
                edges.push(Node::Site(si));
                Some(Path {
                    neg_cost: FloatOrd(self.neg_cost.0 + cost.0),
                    edges,
                })
            })
            .collect(),
    }
}

Some notes on what's going on here:

When going from sites to points we only want to try edges that haven't already been used, since these would have no capacity left, so these are .filtered out.
has_edge checks that the edge is not already included in the path, in order to avoid looping.
Not sure why I used flat_map and returned Option instead of just another .filter.
When going from points back to sites, we can only go along edges that have been used, since we are effectively undoing the flow that goes along the edge. For this reason, we're adding to the negative cost.

These are the main mechanisms for computing the min-cost flow, apart from the actual flow algorithm itself. This was pretty simple now, but the implementation was somewhat noisy, so here's the pseudo code:

loop {
    create initial paths from sites that has capacity to all points
    put the paths in a max-heap

    while max-heap has elements {
        path = pop(max-heap)
        if path leads to a point that's not assigned yet {
            reduce site capacity by one
            set point capacity to zero (reduce by 1, but we know it is 1)
            mark edges as used
            restart main loop
        }
        children = expand the path
        add children to the heap
    }
}

This algorithm produced exactly the same pairings as mcmf did, but it was a lot slower. Not 2x or 5x, more like 1000x.

The program took over 20 seconds.

Second version

Why was the first attempt so slow? Here I had a few hypotheses straight off the bat, with some ideas for solutions:

Inefficient representations of the graph and paths. Many allocations.
- Try to add back links instead of storing whole paths?
Poor search through the graph; too many paths are expanded
- Maybe prune based on cost? Can we bound cost?
std::collections::BinaryHeap is slow; should try something else
- Probably something on crates.io?
- Write my own?
Only a single flow is added in each iteration; should figure out how to augment many paths at the same time.
- Doesn't Dinic's do this? Or was that Push-relabel?

I tried cost-pruning; it helped, but not by much. I tried to change Path::has_edge to just check for a site, since I was pretty sure I didn't have cycles of negative cost (if you have such a cycle, it will pay off to walk it, which means you'll visit a node twice, so the two checks aren't the same); it helped but not by much. I tried the priority-queue crate (which also easily supported making a min-heap), but that was even slower.

Eventually, I decided to specialize the search to my instance of the problem⁸. I didn't need to solve MCMF for any general graph, since I had very specific knowledge about the types of graph I would solve it on. When searching for a path from a site to the sink (which, again, we didn't include in the graph explicitly), we can do two operations: (1) go to an unassigned point and finish there, let's call this operation Connect; or (2) go to a point and follow the back-edge to a site, let's call this operation Route. Note here that Route is unique for a point, since there is at most one site connected to it.

Here's Connect, when standing at the filled-in site, and routing along the edges with arrows:

Connect corresponds to finding a path straight to the sink.

And here is Route:

Left: We find a shortest-path following a back-edge. Right: The resulting flow.

Instead of having long paths, each listing the vertices in the path, maybe it would help to have a path be a Vec<Move>:

type SiteId = usize; // I also aliased these, for later on
type PointId = usize;

pub enum Move {
    Connect(SiteId, PointId),
    Route(SiteId, PointId, SiteId),
}

Why would this help? Consider the path in the Route operation in the figure above. Had we done this in a full graph representation, we would have had six vertices in the path: the source, the site we're currently at, the point we're routing around, the site currently connected to that point, the cheapest point for that site to get to the sink, and the sink. For the Move representation however, we only need three IDs, namely two for which site we're talking about, and one for which point we're rerouting.

Further, we can imagine splitting up the list of points into two: the points that are already assigned to a site, and the ones that aren't: Connect only works for unassigned points, and Route only works for assigned points. Thus, when standing at a site, we have $|P|$ choices to make, since each point correspond to one Move. Before, we had $|P|$ choices to make (which point to visit next), and then for each choice we had $|S|+1$ choices to make (go back to any of the sites, or go to the sink). Most of these were quickly pruned, but I figured they might have added a lot of work for the search.

There was one more important insight to make: When we perform the move Route(a, p, b) , we disconnect the point p from site b and connect it to a, to free up capacity at b, by paying the cost difference of the new edge ap from the old edge pb, as well as spending as capacity. This is the only thing going on: we move one capacity from b to a by paying the difference in edge cost. Thus, for the rest of the search, it doesn't matter which t we choose. The only thing that matters is the edge cost difference.

This means that when considering different Route(a, t, b)s for different choices of t, we only need to look at the cheapest, because the net-result of performing the Route is the same for all Routes around these two sites. We can look at all possible ts, choose the cheapest, and continue the search with only that Route. This is a huge help, because for each pair of sites we don't have a combinatorial explosion of different Route operations. We only have one.

A few other small optimizations (another stab at a cost_limit to prune old paths, sorting the points for each site so that it should be easier to find cheap moves, move a Vec::clone down below an if .. { ... continue; }, and other small improvements) were small done after this. I got big speedup, but I was still not where I would have to be.

The program took around 2 seconds.

Third version

I felt like I was really getting somewhere with Move, but at the same time, the second version still felt too general. The insight about "only the best route matters" helped me find a new framing of the problem: when routing from a site, there is really only $(|S|-1)+1$ moves we can do:

Connect to the closest unpaired point and be done with the search (1 move).
Route the best route around any of the other sites and continue the search ($|S|-1$ moves).

I also knew that for my application, $|S|$ would always be less than 10, and 10 is a really small number. What does this buy us?

Forget the graph, and don't think about nodes and edges. We only need to find the cheapest sequence of these Moves. And for this, we need their cost.

I made a routing_table that was a $|S|\times|S|$ matrix. Entry $S_{ij}$ contained the cost of the best Route(i, p, j) over all points $p$, and the diagonal entries $S_{ii}$ contained the cost of the best Connect(i, p). In addition, the table contained the index of the point p, which works out nicely in both the Route and Connect case, since they both have one point. Here's the code to initialize the table:

fn initialize_routing_table(&mut self) {
    let num_sites = self.cost.rows;
    let num_pts = self.cost.cols;
    self.routing_table = Matrix::new(
        num_sites,
        num_sites,
        (FloatOrd(f64::INFINITY), PointId::MAX)
    );
    for a in 0..num_sites {
        for b in 0..num_sites {
            if a == b {
                continue;
            }

            let mut cand_cost = f64::INFINITY;
            let mut cand_ind = SiteId::MAX;

            for t in 0..num_pts {
                if !*self.edge_used.get(b, t) {
                    continue;
                }
                let route = Move::Route(a, t, b);
                let cost = self.move_cost(&route).0;
                if cost < cand_cost {
                    cand_cost = cost;
                    cand_ind = t;
                }
            }
            *self.routing_table.get_mut(a, b) = (FloatOrd(cand_cost), cand_ind);
        }
    }

    for s in 0..num_sites {
        if let Some((min_cost, t)) = (0..num_pts)
            .filter(|pi| self.tur_cap[*pi] == 1)
            .map(|pi| (*self.cost.get(s, pi), pi))
            .min()
        {
            *self.routing_table.get_mut(s, s) = (min_cost, t);
        }
    }
}

To make up some numbers, here's what the routing_table could look like, for a graph with 3 sites⁹:

$$ \begin{bmatrix} (32.08, t_3) & (32.79, t_1) & (12.01, t_7)\\ (28.62, t_4) & (41.41, t_2) & (24.88, t_4)\\ (14.19, t_5) & (21.31, t_9) & (15.89, t_6) \end{bmatrix} $$

Here we're saying that if we wanted to connect the first site to the nearest unpaired point ($t_3$), it would cost $32.08$ ($S_{1,1}$). However, re-routing a point from site 3 to site 1 around $t_5$ costs $14.19$ ($S_{3,1}$), and connecting site 3 to its nearest unpaired turbine ($t_6$) costs $15.89$ ($S_{3,3}$) for a total of $30.08$. If site 3 is already full, this is the shortest path.

I want to highlight that this table is the only information we need to perform the search. We don't need to know anything about the graph, or the points, or the sites. We don't even need to look at capacities, because the information they're giving us is already encoded in the table. Once we have this table, these 18 numbers is all that's required to find the cheapest way of increasing the network flow by 1.

This was already a pretty large leap from the last version, so I decided to be blunt in the next step, and pre-compute all possible paths. After all, we don't have that many of them. A list of SiteInds can be used as a path, where the first site is the start site, the intermediate sites are routed around using the precomputed points, and the last site goes to its closest unmatched point. Since the sites in this list have to be unique, we have at most $|S|!$ of them, which sounds bad, but since $|S|$ is at most 10, this is at most 3'628'800. If $|S|$ is a more reasonable¹⁰ 5, this is merely 120. Compute all paths, and choose the cheapest.

Each iteration of this main loop invalidates the routing_table, since the site-point assignment has changed. Since I had just implemented the table approach, I didn't want to also incrementally update only the parts of it that did change, so instead I recomputed the whole table before every iteration.

This caveman solution took 200ms.

Version 3.5

200ms is better, but this was still a very decent chunk of the total time of my program. Recall that this whole computation is just the first step of a bigger system. But with a few very naïve choices in the last solution I was confident that I could speed it up some more.

I stored a Search (my new name for Path) in a struct with a SmallVec listing the indices in the matrix corresponding to the move that made up the path:

struct Search {
    moves: SmallVec<[(SiteInd, SiteInd); 10]>,
    cost: F64,
}

There's a lot of duplicate data here, since moves is of the form [(a, b), (b, c), (c, d)], but maybe this was easier to use? I am not sure why I did it this way. Now it's just a matter of finding the shortest path through the routing_table.

If the entries in the routing_table are all non-negative, life is good, because we can use Dijkstra's algorithm to find the shortest path to a diagonal entry (which, recall, represents ending the flow path). We've gone full circle, and are back again at std::collections::BinaryHeap and searching through a graph (this time, $K_{|S|}$: the complete graph on $|S|$ vertices).

I initialized the queue with the legal subset¹¹ of the total $|S|^2$ initial moves, and started to pop. If I got a diagonal entry back, that's the path. If not, I expanded the path from the end of the path (search.moves.last().unwrap().1), and considered all other possible sites to extend to. Sites that were already in the path were filtered out. Here's the code:

while let Some(search) = searches.pop() {
    let mov = search.moves.last().unwrap();
    // Diagonal entry is the best; take it if possible.
    if mov.0 == mov.1 {
        if search.cost < candidate.cost {
            candidate = search;
            break;
        }
        continue;
    }

    // Non-diagonal entry; expand to the possible next moves in the sequence.
    let a = mov.1;
    for b in (0..num_sites).filter(|&b| !search.contains_site(b)) {
        let cost =
            FloatOrd(search.cost.0 + self.routing_table.get(a, b).0 .0);

        self.cost.partial_cmp(&other.cost).map(|o| o.reverse())
        let mut moves = search.moves.clone();
        moves.push((a, b));
        let s = Search { moves, cost };
        searches.push(s);
    }
}

I hinted at this above, but note that we don't have to check for site capacities in this loop. Direct connections are only in the table if they're valid (otherwise they're $\infty$), and Route operations don't need the target site to have capacity, since we're freeing up capacity at the source site. Since the initial paths are also only to sites with capacity, we don't have to check for capacities at all.

The case with negative entries in routing_table calls for a different strategy, since now you can suddenly produce cheaper paths by continuing to route around other sites. Instead of implementing a "real" search algorithm, I bounded the cost savings possible, and used this to prune paths that were so expensive that there would be no way for the negative entries to make up for it.

I did this in a very loose way: if $\text{m}$ is the smallest entry in routing_table, then that's the most savings we can get from extending a path of length $l$ to $l+1$. Since we also know the max length of a path, $|S|$, the best possible cost decrease of a started path of length $l$ is $m(|S|-l)$. This is not at all tight¹², but it's really easy. Here's what it looked like:

while let Some(search) = searches.pop() {
    let mov = search.moves.last().unwrap();
    // Diagonal entry is the best; take it if possible.
    if mov.0 == mov.1 {
        if search.cost < candidate.cost {
            candidate = search;
        }
        continue;
    }

    // Non-diagonal entry; expand to the possible next moves in the sequence.
    let a = mov.1;
    for b in (0..num_sites).filter(|&b| !search.contains_site(b)) {
        let cost =
            FloatOrd(search.cost.0 + self.routing_table.get(a, b).0 .0);
        let best_future_cost = cost.0
            + (num_subs - (search.moves.len() as SiteInd + 1)) as f64 * min_table_entry;

        if candidate.cost.0 < best_future_cost {
            continue;
        }

        let mut moves = search.moves.clone();
        moves.push((a, b));
        let s = Search { moves, cost };
        searches.push(s);
    }
}

Along the way I also pulled the trigger and changed SiteInd and PointInd to be u8 and u16 respectively, which, surprisingly, sped up the code by 30% (!). I continued to recompute the routing_table from scratch in between every single call to route.

Now the program took 4ms, and I declared it Good Enough.

The "Right" Solution

I had a lot of fun with this problem. It's both fun and rewarding to iterate on a problem and seeing the time required to solve it going from "get a coffee", to "impatiently wait" to "wait" to "quick, if you run it once" to "fast" to "can be called in a loop by another program". It's also fun when this isn't just an exercise in how good it can get, but actually a part of what you're really trying to do¹³.

But mostly, it was fun because the solution feels right. I have a theory that most problems¹⁴ we programmers are dealing with are pretty simple, when viewed from the right angle. This angle is often hard to find, but once you have found it, things just seem to "work out"¹⁵, in terms of complexity, number of bugs, maintainability, debugability, all of these axes.

My final version is around 10'000 times faster than my initial version. When presented with such a huge difference without having any context, it is very easy to jump to conclusions. For instance, one might attribute this to:

Language; it was written in slowlang first, and then ported to fastlang.
Lack of optimizations; ran without --release, -O2, or similar.
Algorithmic improvements; change a naïve algorithm to a high performing one.
A full team of experts worked on it for months, creating an engineering jewel that normal programmers simply can't match.
Hyper-optimized code; inline assembly, unsafe everywhere, PGO, lot's of impossible to read code, impossible to debug, probably requires a blood sacrifice.

In this case though, none of the above is true¹⁶.

Language was the same.
Optimization levels were the same.
I would argue that the algorithm is still the same --- we're still solving min-cost-max-flow with successive shortest paths --- but since we are making assumptions about the input I can see a claim for the algorithm being different.
Full team for months is also off the mark; I'm no expert, and certainly not a whole team of them. Further, this whole process, from initial test with mcmf to final code written, took slightly longer than two working days. The first commit was around 15:00 on Wednesday¹⁷, and the last commit was 16:15 on Friday (plus a small bugfix on Monday morning).
Most importantly though, the last one is not true. There's no inline assembly, no unsafe, no special tooling, no architecture specific code, and no "every-trick-in-the-book-pulled" code. Quite the opposite: there's very little code. The whole module (excluding the Matrix struct) is 212 lines of code, as reported by tokei.

So how come we got a 10'000x speedup? I think it's the all due to the updated framing of what we're really trying to do. "The problem" was never about flow through a graph. This is a made up mental framework for us to work in, so that we can use general techniques to specific problems.

Shoutout to LEMON

I did compare my solution to LEMON when I was getting below 100ms. I had given up trying to integrate it, but it was still a very useful benchmark. For the "largest reasonable" input I was testing with, LEMON was still faster than my final version. LEMON, of course, solves the problem in its general form, and as such, is a way better implementation than mine. However, LEMON do "pull-many-tricks-in-the-book"¹⁸, and are written by people who have extensive experience with max-flow-min-cost, so I didn't feel so bad.

I have stepped through their source code, and started reading some of their references to better understand how they achieve what they have; most notably this experimental study. The "Network Simplex Method" seems to be a key term, but I haven't understood this yet.

There's still a hope, of course, that if I only partially invalidate my routing_table instead of recomputing it at every iteration, and store the visited sites as a bitfield in the Search struct, I'll be faster. If the requirements for my system drastically changes, maybe I'll get to find out.

As always, any input, shorter paths, or excess flow, can be sent to my public inbox.

Thanks for reading.

"Hunting High and Low" by a-ha. Other title candidates include "one Flow over the cuckoo's nest" and "Even Flow". ↩
I tried to avoid naming that could be associated to neural networks, but here I came short. ↩
Something something subgraph induced by fractional flow edges, and cancel loops? Something no negative cycles in residual graph (assumption by optimality)? ↩
This is a pattern I feel like I keep seeing when looking at "good" code; algorithms and data structures are very often used in an "abstract" sense, as opposed to directly implemented in the code. ↩
This is a chicken-and-egg problem, because you need both of these. You both want to know what guarantees a method provides (in this case, optimality of the solution), but you also want to make sure that the method is implementable and has the right characteristics (performance, usability, maintainability) for what you're trying to do. Looking back it seemed risky that I started writing code without knowing if what I was trying to do would even lead me to a correct solution, but on the other hand, sitting down trying to prove the correctness of an hypothetical implementation. ↩
While I understand why f{32,64} aren't Ord, this is so annoying to always have to go around that I can't imagine this being the best choice. I wish the ordering was consistently defined to be NaN at either end of the ordering. Maybe there are hairly details I'm not thinking about though. ↩
One alternative to doing this is to implement PartialOrd and Ord yourself, and .reverse the ordering there. I ended up doing this later. ↩
More often than not, we're not dealing with the full-general version of these computational problems, and sometimes there's significant savings when we only solve the actual problem we have. ↩
The numbers here are completely made up, and I did not spend any time checking if they made sense. ↩
Again, this was domain specific knowledge I had about the problem that I was trying to solve. ↩
Each entry in routing_table corresponds to one initial move. Off-diagonal entry $S_{i,j}$ means go from the source to site $i$ and route to site $j$ around whichever point was the best; this is legal iff site $i$ has capacity for another path. Diagonal entry $S_{i,i}$ is legal iff site $i$ has capacity. ↩
You can't use the edge of cost $m$ more than once, since one site can only appear in a path once. However, there could be multiple entries in the table of cost $m$. You could get a tighter bound by looking at the $|S|$ cheapest entries, but this would also not be very tight, depending on which sites your path already contains. It gets complicated. ↩
A business requirement if you will. ↩
"problem" in this specific CS-y narrow sense. The world is big and complicated, and contains plenty of hard problems. ↩
Sometimes this manifests really clearly. I will have tried to write something --- an API, a function, a class, a library, anything --- but it's awkward to use right, often has off-by-one errors, weird bugs, and somehow things are never in the right place. Then, I write it again, changing for instance what I store in some state, and this time, everything just falls out naturally. Off by one errors suddenly can't exist any more, things are always conveniently where they need to be, and everything runs smoothly. ↩
This is not to say that these aren't the reasons for similar speedups in other circumstances; these are often the culprits. But sometimes, there just exist code that is orders of magnitude better. ↩
This was the time of the first commit, but I don't remember when on Wednesday I started this. I also wasn't exclusively working on this during these days, so it's hard to get a time estimate with an hour granularity. ↩
I wanted to write "pull-every-trick", but this simply isn't true. They do, however, pull some tricks. ↩

Actix and FOSS Responsibility

2020-01-18T16:32:36+01:00

The developer of the Rust web framework actix is "done with open source"¹,as you might have already seen on Reddit², HN³, Lobsters⁴, or somewhere else. Steve Klabnik⁵ has said a few words, and Raph Linus⁶ has some suggestions for how to avoid something like this in the future, but I don't feel like either of them (or any other write-up of this I've seen) does a proper job of addressing what I think is the real "issue" at play here, hence this post.

Edit, 2021.06.22: some months later Drew DeVault wrote a very relevant post aobut exactly this; also see other posts on his blog.

Here's a quick rundown of what happened. Person makes web framework. It gets some traction in the "community" and is often in benchmarks, where it does quite well. However, it uses unsafe liberally, and is not 100% secure from invoking UB. This is a common complaint of the framework, and people have tried to get patches that addresses this merged in. The maintainer doesn't like the patches. People get angry, and insults are thrown left and right. Maintainer takes their ball and goes home.

Let's do a Q&A-esque thing, in no apparent order.

<Maintainer> was an asshole

This pops up every now and then, which is strange since I'd assume that it was common knowledge that others being assholes doesn't give you the permission to be an asshole yourself.

But The Community⁷ didn't just complain, "we" even submitted a patch!

Yes, but a maintainer has no obligation to accept any patches coming their way. They also have no obligation to explain why.

`Actix` was high up on Some Benchmark, so the maintainers have a bigger responsibility

A programmer doesn't have to defend their code when it's compared against other code. If you think, maybe rightfully so, that Actixs usage of unsafe made an unfair comparison in a benchmark, then that's the fault of the authors of the benchmark. They are comparing apples to oranges, in that case. Of course, if the benchmark is simply "Which web framework written in Rust is fastest?" then this shouldn't be contested at all, as Actix definitely is written in Rust.

`Actix` was well known in the ecosystem, so the maintainers have a bigger responsibility

If The Community have chosen Actix as a valid representation of how Rust can be used, then so be it. It does not seem to have been a secret that the author was relaxed when it comes to safe-proofing the code; when it has become such a high profile project, that seem to speak for itself when it comes to whether Rust programmers really care about "safety over all".

`Actix` was professional-looking, so the maintainers have a bigger responsibility

This does not make much sense to me. Having a polished landing page or README.md does not somehow magically imply that the project itself is polished and ready for indefinite use by anyone. The author have not implicitly made a promise on anything related to their code. Almost any licence clearly say this.

But <Maintainer> deleted GitHub issues!

As is their right. None of us have any right to post any kind of content to another persons bug tracker/wiki/"digital property". They are in charge. If you don't like it, don't use it.

`Actix` was not "sound" and therefore bad

I'm not sure I like this craze for soundness. One of the big selling points of Rust for many people is the safety combined with the low level control. However, sometimes these things are simply at odds, and it's not clear that going with the safety all the time is really the best way.

Taken to the extreme, if some library was, for any practical purpose, useless, unless one did some unsafe trickery which also caused some adversarial code to invoke UB, would this library remain useless and safe, or would we just say "don't use it that way, then"?

It seems to me that the consensus among Rust programmers is that any safe code should be safe under any usage. I fear this greatly limits the potential of the language and its ecosystem⁸.

`Actix` was not "sound" and it hurts the Rust ecosystem

If the ecosystem of a programming language is so fragile that a single project which values aren't completely aligned with a large part of The Community is enough to bring it all down, then we should just pack our stuff and go home. There would be no reason to continue doing any of this, if Rust reputation depended on that all users of the language was equally evangelist about it.

<Maintainer> should step down and let someone else take over the project

This seems to confuse Actix, which as far as I understand is pretty much one persons library, with a community project. Actix does not belong to The Community, and the author has no obligation to The Community to give it to them when The Community feels like it. That "Contributions are welcome" doesn't change anything at all; it is still not a community project.

Taking this concept to any other situation reveals how crazy this idea is: if I invite people over to my house for dinner and tell them that contributions are welcome, they are in no right to take over my kitchen if they disagree with the way I cook.

This behaviour should not be acceptable for any open source maintainer

This is simply ridiculous, because it implies that the moment someone publishes FOSS code they are automatically subject to rules governing how they should deal with the technical decisions in their project.

It also seem to imply that being a maintainer sets a higher bar to what is social behaviour than merely being a contributor or user. There should be no differentiation here. Be civil, always.

If they don't care about safety, why are they even using Rust?

There are plenty of good reasons to use Rust that does not concern safety at all, as I'm sure any Rust programmer would be able to tell you. The author of Actix chose to write Rust. That's okay, and they does not have to explain why. This certainly does not give anyone the right to tell them that they should not be writing Rust.

`Actix` was labelled production ready when it had soundness bugs!

See the other comment about soundness. I would guess that almost all of the system code running on the device you're reading this on contain what you in Rust land would call soundness bugs. Yet, it is very much in production.

The <Maintainer> screws over people depending on their code

If you have committed to using third party code without any backup plan, you have already screwed over yourself. For very popular libraries this usually is not a problem, since, in case of maintainer rage-quit there are a lot of other people with the same needs as you - it's a kind of safety of the herd. Ideally, the only change most users would be affected by something like this is that they have to wait a little longer on the next release, and update the repo url. That's it.

Why are you defending <Maintainer> when they were an asshole?

I don't care who was an asshole or not, and I'm not here to put labels on people; I just think that most of the comments I've read on this event has had a lot of things in the FOSS world backwards.

Conclusion

I don't think a FOSS author has any responsibility over their users. If you don't like how they're prioritizing patches, handling criticism, or tuning to win benchmarks, you don't have to use it. Nobody is forcing you to use this code, and if someone is, you have (potentially) valid concerns you can raise.

If you want to bring in third-party code into your codebase, then that is your responsibility as a developer. If you don't trust the authors you should seriously reconsider using their code for anything more than a weekend project⁹. You don't get to complain if the other developers take down their code¹⁰, leave bugs unfixed, refuses patches, or deletes tickets from their bug tracker. In fact, you don't get to say anything at all except "Thank you!".

Don't be an asshole.

Thanks for reading. Thoughts and comments are welcome in my public inbox.

https://twitter.com/fafhrd91/status/1218135374339301378 ↩
https://www.reddit.com/r/rust/comments/epzukc/actix_web_repository_cleared_by_author_who_says/ ↩
https://news.ycombinator.com/item?id=22073908 ↩
https://lobste.rs/s/brcn0w/actix_web_author_i_am_done_with_open_source ↩
https://words.steveklabnik.com/a-sad-day-for-rust ↩
https://raphlinus.github.io/rust/2020/01/18/soundness-pledge.html ↩
I still find it strange to have people flock around a tool and call it a community. Each to their own, I guess. ↩
I can't dig it up now, but I've read a blog post about how some C++ concurrency primitive tries to detect whether any other threads have been spawned and takes a fast-path without any synchronization if not. The author provided an example of this behaviour being faulty, but under adversarial conditions. In my opinion, this is a perfectly reasonable approach for the library in question to take, since under any normal working condition it would behave as intended. ↩
This is in my opinion one of the last big blockers for proper package management; it's sad that the web-of-trust approach somehow didn't really work for PGP, since that would be the first approach to consider for package management as well. ↩
You know which Javascript library I'm thinking about ↩

Posts on mht.wtf

Another Static Site

Javascript

Templating

Others

Efficient Simulation Through Linear Algebra

Finite-Elements

Finding Equilibrium

Newton's Method

Linear Systems

Solving Sparse Linear Systems

Property vs. Representation

Solving The New System

Quick Micro benchmark

Conclusion

Footnotes

Confusing Words for a Beginner

Int

Float and Double

String

Print

Argument

Function

Void

Footnotes

Simplicity as a Value

Footnotes

Expanding TeX's \newif

Introduction

How Do I Write TeX?

The Goal

How TeX Reads Tokens

A Short Example

A Primer on Catcodes

Some Not So Bad Macros

\string

\escapechar

\uccode

\csname and \endcsname

\gdef

\outer

The \expandafter Macro

Chaining

Start Actually Expanding \newif

The Bottom Group

A note about more advanced parameter texts

Back to Bob

The Middle \def

The First \def

In Conclusion

Footnotes

Building Zig structs at Compile Time

Building Structs

So What?

Footnotes

Content Aware Image Resize

Sketching out a top down approach

image

The path of least resistance

Implementation

Finding the path

Removal

Seeing is believing

Footnotes

WTFs in Floating Point Math

Adding more than one

Conclusion

Advent of Common Lisp, Day 1-4

Common Lisp, Emacs, Slime, and QuickLisp

Installing QuickLisp

Reading Input

Day 1

Part 1

Part 2

Day 2

Part 1

Part 2

Day 3

Part 1

Part 2

`\string`

`\escapechar`

`\uccode`

`\csname` and `\endcsname`

`\gdef`

`\outer`

The `\expandafter` Macro

Start Actually Expanding `\newif`

The Middle `\def`

The First `\def`

`image`