Languages, Performance, and Intent
Optimizing compilers are really cool! They look at your code and rewrite it so that its behavior is unchanged and its execution time is reduced.
The fact that the compiler cannot change the semantics of your code sounds obvious, but there is a crucial detail here:
it cannot change your code for any input1.
If you write a function `fn foo(i32)` and the compiler wants to generate the code `fn fast_foo(i32)`, it must hold that for any input, like `0`, `1`, `123`, `-999`, `i32::MIN`, `i32::MAX`, or `1337`, the behavior of `foo` and `fast_foo` is identical.
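As a hypothetical illustration (the function names here are mine, not from any real code base), consider a pair of C functions where the "obvious" fast rewrite is only legal for one signedness:

```c
#include <limits.h>

/* For unsigned x, x + 1 wraps to 0 when x == UINT_MAX, so x + 1 > x is
   false for that single input; the compiler cannot fold it to a
   constant 1. For signed int, overflow is undefined behavior, so the
   compiler may assume it never happens and fold x + 1 > x to 1. */
int gt_next_unsigned(unsigned x) { return x + 1 > x; }
int gt_next_signed(int x)        { return x + 1 > x; }
```

A single corner-case input is enough to rule out the transformation for the `unsigned` version.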
This means the compiler is forced to take into account the corner cases of your code, which may or may not be a part of your intent when writing that code.
Chandler Carruth shows an example in his talk "Garbage In, Garbage Out: Arguing about Undefined Behavior With Nasal Demons" as seen here.
His example shows the difference between `signed` and `unsigned` integers in C++, and how the defined wrapping of `unsigned` integers causes the compiler to output bad code, whereas using `signed` integers would make the compiler generate good code because it can assume that overflow does not happen.
He suggests that the reason the programmer chose `unsigned` in this case was (a) because it is semantically correct2, and (b) that they were "a little bit worried about a narrow contract"3.
A little later he answers a question by underscoring the fact that the behavior is different:
Q: Isn't this just a failure of the optimizer doing the right thing?
A: No! We cannot produce this assembly [shows good assembly] for this function [shows initial function]. [...] They are semantically different.
I like this example because it is such a minor choice.
It sounds like `unsigned` would be the right choice since we wouldn't have to worry about accidentally passing in negative offsets4, and yet it has a very significant performance impact due to how the compiler is allowed to reason.
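To sketch why the compiler's reasoning differs (this is my own reduction, not the function from the talk): with 32-bit `unsigned` indices, `i + 1` is defined to wrap to `0` at `UINT_MAX`, so on a 64-bit target the compiler has to honor the wrapped 32-bit value before widening it into an address. With `int`, overflow is undefined, so the compiler may assume `i + 1` never wraps.

```c
/* With unsigned i, block[i + 1] must honor 32-bit wraparound: if
   i == UINT_MAX then i + 1 == 0, and the access must be block[0]. */
unsigned char load_next_unsigned(const unsigned char *block, unsigned i) {
    return block[i + 1];
}

/* With signed i, overflow of i + 1 is UB, so the compiler may compute
   the address directly in 64 bits as block + i + 1. */
unsigned char load_next_signed(const unsigned char *block, int i) {
    return block[i + 1];
}
```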
Actually checking the bzip2 asm
I decided to double check the asm from Chandler's presentation.
The function from the video is still using `unsigned` integers, and I can't find an issue or a pull request suggesting to make this change, so I decided to check the compiler output myself.
If the performance improvement from using `signed` integers instead is so great, why aren't they doing so?
Here's exactly what I did:
$ git clone https://gitlab.com/bzip2/bzip2
$ cd bzip2
$ mkdir build
$ cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release
Building straight away doesn't help us, since `mainGtU` is marked `static`, and we'd like to have it exported in the final executable. I removed the two lines `static __inline__` from `mainGtU`, and we're off to the races:
make -Cbuild
objdump -d build/bzip2 | less
Searching for `mainGtU`, I get the following:
0000000000006ce0 <mainGtU.part.0>:
6ce0: 48 89 d0 mov %rdx,%rax
6ce3: 49 89 ca mov %rcx,%r10
6ce6: 8d 57 03 lea 0x3(%rdi),%edx
6ce9: 8d 4e 03 lea 0x3(%rsi),%ecx
6cec: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6cf0: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6cf4: 38 ca cmp %cl,%dl
6cf6: 75 12 jne 6d0a <mainGtU.part.0+0x2a>
6cf8: 8d 57 04 lea 0x4(%rdi),%edx
6cfb: 8d 4e 04 lea 0x4(%rsi),%ecx
6cfe: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d02: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d06: 38 ca cmp %cl,%dl
6d08: 74 06 je 6d10 <mainGtU.part.0+0x30>
6d0a: 38 d1 cmp %dl,%cl
6d0c: 0f 92 c0 setb %al
6d0f: c3 ret
6d10: 8d 57 05 lea 0x5(%rdi),%edx
6d13: 8d 4e 05 lea 0x5(%rsi),%ecx
6d16: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d1a: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d1e: 38 ca cmp %cl,%dl
6d20: 75 e8 jne 6d0a <mainGtU.part.0+0x2a>
6d22: 8d 57 06 lea 0x6(%rdi),%edx
6d25: 8d 4e 06 lea 0x6(%rsi),%ecx
6d28: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d2c: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d30: 38 ca cmp %cl,%dl
6d32: 75 d6 jne 6d0a <mainGtU.part.0+0x2a>
6d34: 8d 57 07 lea 0x7(%rdi),%edx
6d37: 8d 4e 07 lea 0x7(%rsi),%ecx
6d3a: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d3e: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d42: 38 ca cmp %cl,%dl
6d44: 75 c4 jne 6d0a <mainGtU.part.0+0x2a>
6d46: 8d 57 08 lea 0x8(%rdi),%edx
6d49: 8d 4e 08 lea 0x8(%rsi),%ecx
6d4c: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d50: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d54: 38 ca cmp %cl,%dl
6d56: 75 b2 jne 6d0a <mainGtU.part.0+0x2a>
Compare this to what we get from the same function if we replace the types of `i1` and `i2` with `Int32`. I copied the whole function, added a `_2` suffix, and recompiled.
00000000000085d0 <mainGtU_2>:
85d0: 48 89 d0 mov %rdx,%rax
85d3: 4c 63 d6 movslq %esi,%r10
85d6: 48 89 ca mov %rcx,%rdx
85d9: 48 63 cf movslq %edi,%rcx
85dc: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
85e0: 46 0f b6 14 10 movzbl (%rax,%r10,1),%r10d
85e5: 44 38 d1 cmp %r10b,%cl
85e8: 74 0e je 85f8 <mainGtU_2+0x28>
85ea: 41 38 ca cmp %cl,%r10b
85ed: 0f 92 c0 setb %al
85f0: c3 ret
85f1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
85f8: 8d 4f 01 lea 0x1(%rdi),%ecx
85fb: 48 63 c9 movslq %ecx,%rcx
85fe: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8603: 8d 4e 01 lea 0x1(%rsi),%ecx
8606: 48 63 c9 movslq %ecx,%rcx
8609: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
860d: 41 38 ca cmp %cl,%r10b
8610: 74 0e je 8620 <mainGtU_2+0x50>
8612: 44 38 d1 cmp %r10b,%cl
8615: 0f 92 c0 setb %al
8618: c3 ret
8619: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
8620: 8d 4f 02 lea 0x2(%rdi),%ecx
8623: 48 63 c9 movslq %ecx,%rcx
8626: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
862b: 8d 4e 02 lea 0x2(%rsi),%ecx
862e: 48 63 c9 movslq %ecx,%rcx
8631: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8635: 41 38 ca cmp %cl,%r10b
8638: 75 d8 jne 8612 <mainGtU_2+0x42>
863a: 8d 4f 03 lea 0x3(%rdi),%ecx
863d: 48 63 c9 movslq %ecx,%rcx
8640: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8645: 8d 4e 03 lea 0x3(%rsi),%ecx
8648: 48 63 c9 movslq %ecx,%rcx
864b: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
864f: 41 38 ca cmp %cl,%r10b
8652: 75 be jne 8612 <mainGtU_2+0x42>
8654: 8d 4f 04 lea 0x4(%rdi),%ecx
8657: 48 63 c9 movslq %ecx,%rcx
865a: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
865f: 8d 4e 04 lea 0x4(%rsi),%ecx
8662: 48 63 c9 movslq %ecx,%rcx
8665: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8669: 41 38 ca cmp %cl,%r10b
866c: 75 a4 jne 8612 <mainGtU_2+0x42>
866e: 8d 4f 05 lea 0x5(%rdi),%ecx
8671: 48 63 c9 movslq %ecx,%rcx
8674: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8679: 8d 4e 05 lea 0x5(%rsi),%ecx
867c: 48 63 c9 movslq %ecx,%rcx
867f: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8683: 41 38 ca cmp %cl,%r10b
8686: 75 8a jne 8612 <mainGtU_2+0x42>
8688: 8d 4f 06 lea 0x6(%rdi),%ecx
868b: 48 63 c9 movslq %ecx,%rcx
868e: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8693: 8d 4e 06 lea 0x6(%rsi),%ecx
8696: 48 63 c9 movslq %ecx,%rcx
8699: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
869d: 41 38 ca cmp %cl,%r10b
86a0: 0f 85 6c ff ff ff jne 8612 <mainGtU_2+0x42>
86a6: 8d 4f 07 lea 0x7(%rdi),%ecx
86a9: 48 63 c9 movslq %ecx,%rcx
86ac: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
86b1: 8d 4e 07 lea 0x7(%rsi),%ecx
86b4: 48 63 c9 movslq %ecx,%rcx
86b7: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
86bb: 41 38 ca cmp %cl,%r10b
86be: 0f 85 4e ff ff ff jne 8612 <mainGtU_2+0x42>
86c4: 8d 4f 08 lea 0x8(%rdi),%ecx
86c7: 48 63 c9 movslq %ecx,%rcx
86ca: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
86cf: 8d 4e 08 lea 0x8(%rsi),%ecx
86d2: 48 63 c9 movslq %ecx,%rcx
86d5: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
86d9: 41 38 ca cmp %cl,%r10b
86dc: 0f 85 30 ff ff ff jne 8612 <mainGtU_2+0x42>
The setup code of the two functions is a little different, but once we get going it looks something like this:
UInt32 Int32
====================================================================
| jne 8612 <mainGtU_2+0x42>
jne 6d0a <mainGtU.part.0+0x2a> | lea 0x8(%rdi),%ecx
lea 0x8(%rdi),%edx | movslq %ecx,%rcx
lea 0x8(%rsi),%ecx | movzbl (%rax,%rcx,1),%r10d
movzbl (%rax,%rdx,1),%edx | lea 0x8(%rsi),%ecx
movzbl (%rax,%rcx,1),%ecx | movslq %ecx,%rcx
cmp %cl,%dl | movzbl (%rax,%rcx,1),%ecx
jne 6d0a <mainGtU.part.0+0x2a> | cmp %cl,%r10b
| jne 8612 <mainGtU_2+0x42>
I'm not sure if the `signed` code is better or worse, although in terms of instruction count, there is definitely more of it.
Weird.
I also tried setting `CFLAGS=-march=native` before building, thinking that maybe there's some platform-specific code that we wanted the compiler to generate, but the code for the two functions seems to be identical with and without this flag.
There are cases where we are explicitly being made aware of a hidden tradeoff, and are given tools for dealing with it.
A good example of this is the C and C++5 flag `-ffast-math`, which enables a collection of other flags that relax some of the requirements of the IEEE-754 floating-point number standard.
One of the flags it sets, `-fassociative-math`, allows the compiler to reorder the operation `(a + b) + c` to `a + (b + c)`6.
Another is `-freciprocal-math`, which allows the compiler to consider `a / b` the same as `a * (1 / b)`.
In and of themselves these transformations are not so valuable, but in combination with common subexpression elimination or loop hoisting they can yield good speedups.
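As a small, hypothetical sketch (the function is mine): a loop that divides by a loop-invariant scale. Under `-freciprocal-math` the compiler may rewrite each `v[i] / scale` as `v[i] * (1.0f / scale)`, and hoisting then computes the reciprocal once outside the loop instead of dividing on every iteration:

```c
/* Without -freciprocal-math the compiler must emit one (slow) division
   per element; with it, a single reciprocal plus one multiply per
   element is allowed, even though rounding may differ slightly. */
void normalize(float *v, int n, float scale) {
    for (int i = 0; i < n; i++)
        v[i] = v[i] / scale;
}
```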
By specifying `-ffast-math` we can allow the compiler to change the semantics of our programs (in a limited sense) such that it can make the output code faster.
However, we still need to know that this is a flag we can set.
If we don't know about `-ffast-math` and we don't mind these transformations7, we are inhibiting the compiler's ability to generate good code without gaining any benefit.
John Carmack's conversation on Lex Fridman's podcast contains a similar example of the idea that some of these trade-offs are very skewed. The part can be heard here. In talking about the innovations required to make Quake, and specifically about optimization, John says this:
The most leverage comes from making the decisions that are a little bit higher up, where you figure out how to change your large scale problem so that these lower level problems are easier to do, or it makes it possible to do them in a uniquely fast way.
--- John Carmack
The changes John is talking about are slightly different from Chandler's, because in John's case there is, say, a design decision with some wiggle room which can be used to yield huge benefits in terms of speedup. Chandler's signedness case is an example of a decision that was accidentally made8, maybe without knowing so and probably without knowing its impact9.
In an earlier post I explored the codegen of a `merge` function10 and tried to have the compiler output branchless code with the conditional move instruction `cmove`.
By writing the function in slightly different ways, and eventually writing it straight in `x86`, I got a total of seven variants of what I considered the same function.
The time spent on a micro benchmark ranged from 31 ms for the slowest (the initial straightforward way) to 19 ms for the fastest (the asm, but two C variants were also down there).
At the time I was happy with having beaten the compiler on optimizing such a small and simple function.
Now I'm not so sure this is what happened, and I suspect that there are inputs which would make my trivial-and-slow implementation behave differently than any of the fast ones.
This would mean that the optimizer wasn't too stupid to get it right, but that I accidentally encoded behavior in the implementation that was too constraining for it to work around.
Behavior that I potentially didn't care about and that I would gladly give up for a 38% reduction in execution time.
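For reference, here is a sketch of the kind of `merge` step involved (assuming `int` elements; this is not the code from the earlier post). The selection is written so that a compiler can plausibly emit a conditional move instead of an unpredictable branch:

```c
/* Merge two sorted arrays a (length na) and b (length nb) into out.
   take_a is 0 or 1, so the select and the index bumps are all
   branch-free candidates for cmove-style code generation. */
void merge(const int *a, int na, const int *b, int nb, int *out) {
    int ia = 0, ib = 0, io = 0;
    while (ia < na && ib < nb) {
        int take_a = a[ia] <= b[ib];
        out[io++] = take_a ? a[ia] : b[ib];
        ia += take_a;
        ib += !take_a;
    }
    while (ia < na) out[io++] = a[ia++];  /* drain leftovers */
    while (ib < nb) out[io++] = b[ib++];
}
```

Whether the compiler actually emits `cmove` here still depends on the exact phrasing and the optimizer, which was the point of the experiment.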
If we want our programs to be fast we clearly need to understand what our computers can do, but we also need to understand what our programs are actually instructing the computer to do, and which constraints we are setting for an optimizing compiler. We cannot simply outsource the job of generating fast machine code to the compiler, because we need to use the wiggle room of our design space to our advantage, and the compiler cannot do this. Without working from both ends we may often find ourselves with terrible machine code that a simple and insignificant change to our code would have fixed. By choosing to ignore this, we pay the price.
Suggestions, comments, tips, and the signed bit of your integers can be sent to my public inbox (plain text email only).
Thanks for reading.
Footnotes
-
This is also where undefined behavior comes in; the compiler is allowed to assume that UB does not happen, because a program execution in which it does happen is nonsensical. If it can show that a variable having a certain value would cause UB, it is allowed to assume that this variable will not have that value. Depending on how offended you would be by being called a "language lawyer", you might find this argument to be nonsensical, but this is the status quo. ↩
-
The numbers in this case were used for offsets from a base pointer. These should never be negative, and so by using `unsigned` integers we can enforce this trait. ↩
-
I don't think Chandler is being very charitable in his guesswork here, but it is a talk trying to disarm UB-fear so as a story telling device I guess it's ... fine? ↩
-
Conversely, if we accidentally have huge positive offsets it is very likely that we `segfault` at once, as the address of `block[2147483648]` and higher is very likely not mapped. ↩
-
I assume many more languages have either the same flag or a similar flag with a different name. ↩
-
I wrote a blog post about some surprises of floating-point numbers here, where I also give an explicit example of non-associativity. ↩
-
There are good reasons for not using `-ffast-math`; very carefully written numerical code will often depend on the exact order of operations in order to avoid losing precision, dealing correctly with `NaN`s, and so on. `-ffast-math` throws all of this out the window. ↩
-
Again, I'm guessing here. ↩
-
Now that we did the work and found that the assembly looked, if not worse, not better, it is worth questioning whether this decision really had any impact or not. Maybe this is the real lesson here? ↩
-
`merge` takes two sorted lists and merges them into one sorted list. Usually this is done by walking along the fronts of the two lists and popping the smaller of the two elements. ↩
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License