Languages, Performance, and Intent
Optimizing compilers are really cool! They look at your code and rewrite it so that its behavior is unchanged and its execution time is reduced.
The fact that the compiler cannot change the semantics of your code sounds obvious, but there is a crucial detail here:
it cannot change your code for any input1.
If you write a function `fn foo(i32)` and the compiler wants to generate the code `fn fast_foo(i32)`, it must hold that for any input, like `0`, `1`, `123`, `-999`, `i32::MIN`, `i32::MAX`, or `1337`, the behavior of `foo` and `fast_foo` is identical.
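As a hypothetical illustration (the function names here are mine, not from any real code base), consider a pair of C functions where the "obvious" fast rewrite is only legal for one signedness:

```c
#include <limits.h>

/* For unsigned x, x + 1 wraps to 0 when x == UINT_MAX, so x + 1 > x is
   false for that single input; the compiler cannot fold it to a
   constant 1. For signed int, overflow is undefined behavior, so the
   compiler may assume it never happens and fold x + 1 > x to 1. */
int gt_next_unsigned(unsigned x) { return x + 1 > x; }
int gt_next_signed(int x)        { return x + 1 > x; }
```

A single corner-case input is enough to rule out the transformation for the `unsigned` version.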
This means the compiler is forced to take into account the corner cases of your code, which may or may not be a part of your intent when writing that code.
Chandler Carruth shows an example in his talk "Garbage In, Garbage Out: Arguing about Undefined Behavior With Nasal Demons" as seen here.
His example shows the difference between `signed` and `unsigned` integers in C++, and how the defined wrapping of `unsigned` integers causes the compiler to output bad code, whereas using `signed` integers would make the compiler generate good code because it can assume that overflow does not happen.
He suggests that the reason the programmer chose `unsigned` in this case was (a) because it is semantically correct2, and (b) that they were "a little bit worried about a narrow contract"3.
A little later he answers a question by underscoring the fact that the behavior is different:
Q: Isn't this just a failure of the optimizer doing the right thing?
A: No! We cannot produce this assembly [shows good assembly] for this function [shows initial function]. [...] They are semantically different.
I like this example because it is such a minor choice.
It sounds like `unsigned` would be the right choice since we wouldn't have to worry about accidentally passing in negative offsets4, and yet it has a very significant performance impact due to how the compiler is allowed to reason.
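To sketch why the compiler's reasoning differs (this is my own reduction, not the function from the talk): with 32-bit `unsigned` indices, `i + 1` is defined to wrap to `0` at `UINT_MAX`, so on a 64-bit target the compiler has to honor the wrapped 32-bit value before widening it into an address. With `int`, overflow is undefined, so the compiler may assume `i + 1` never wraps.

```c
/* With unsigned i, block[i + 1] must honor 32-bit wraparound: if
   i == UINT_MAX then i + 1 == 0, and the access must be block[0]. */
unsigned char load_next_unsigned(const unsigned char *block, unsigned i) {
    return block[i + 1];
}

/* With signed i, overflow of i + 1 is UB, so the compiler may compute
   the address directly in 64 bits as block + i + 1. */
unsigned char load_next_signed(const unsigned char *block, int i) {
    return block[i + 1];
}
```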
Actually checking the bzip2 asm
I decided to double check the asm from Chandler's presentation.
The function from the video is still using `unsigned` integers, and I can't find an issue or a pull request suggesting to make this change, so I decided to check the compiler output myself.
If the performance improvement from using `signed` integers instead is so great, why aren't they doing so?
Here's exactly what I did:
$ git clone https://gitlab.com/bzip2/bzip2
$ cd bzip2
$ mkdir build
$ cmake -H. -Bbuild -DCMAKE_BUILD_TYPE=Release
Building straight away doesn't help us, since `mainGtU` is marked `static`, and we'd like to have it exported in the final executable. I removed the two lines `static __inline__` from `mainGtU`, and we're off to the races:
make -Cbuild
objdump -d build/bzip2 | less
Searching for `mainGtU`, I get the following:
0000000000006ce0 <mainGtU.part.0>:
6ce0: 48 89 d0 mov %rdx,%rax
6ce3: 49 89 ca mov %rcx,%r10
6ce6: 8d 57 03 lea 0x3(%rdi),%edx
6ce9: 8d 4e 03 lea 0x3(%rsi),%ecx
6cec: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6cf0: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6cf4: 38 ca cmp %cl,%dl
6cf6: 75 12 jne 6d0a <mainGtU.part.0+0x2a>
6cf8: 8d 57 04 lea 0x4(%rdi),%edx
6cfb: 8d 4e 04 lea 0x4(%rsi),%ecx
6cfe: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d02: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d06: 38 ca cmp %cl,%dl
6d08: 74 06 je 6d10 <mainGtU.part.0+0x30>
6d0a: 38 d1 cmp %dl,%cl
6d0c: 0f 92 c0 setb %al
6d0f: c3 ret
6d10: 8d 57 05 lea 0x5(%rdi),%edx
6d13: 8d 4e 05 lea 0x5(%rsi),%ecx
6d16: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d1a: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d1e: 38 ca cmp %cl,%dl
6d20: 75 e8 jne 6d0a <mainGtU.part.0+0x2a>
6d22: 8d 57 06 lea 0x6(%rdi),%edx
6d25: 8d 4e 06 lea 0x6(%rsi),%ecx
6d28: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d2c: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d30: 38 ca cmp %cl,%dl
6d32: 75 d6 jne 6d0a <mainGtU.part.0+0x2a>
6d34: 8d 57 07 lea 0x7(%rdi),%edx
6d37: 8d 4e 07 lea 0x7(%rsi),%ecx
6d3a: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d3e: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d42: 38 ca cmp %cl,%dl
6d44: 75 c4 jne 6d0a <mainGtU.part.0+0x2a>
6d46: 8d 57 08 lea 0x8(%rdi),%edx
6d49: 8d 4e 08 lea 0x8(%rsi),%ecx
6d4c: 0f b6 14 10 movzbl (%rax,%rdx,1),%edx
6d50: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
6d54: 38 ca cmp %cl,%dl
6d56: 75 b2 jne 6d0a <mainGtU.part.0+0x2a>
Compare this to what we get from the same function if we replace the types of `i1` and `i2` with `Int32`. I copied the whole function, added a `_2` suffix, and recompiled.
00000000000085d0 <mainGtU_2>:
85d0: 48 89 d0 mov %rdx,%rax
85d3: 4c 63 d6 movslq %esi,%r10
85d6: 48 89 ca mov %rcx,%rdx
85d9: 48 63 cf movslq %edi,%rcx
85dc: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
85e0: 46 0f b6 14 10 movzbl (%rax,%r10,1),%r10d
85e5: 44 38 d1 cmp %r10b,%cl
85e8: 74 0e je 85f8 <mainGtU_2+0x28>
85ea: 41 38 ca cmp %cl,%r10b
85ed: 0f 92 c0 setb %al
85f0: c3 ret
85f1: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
85f8: 8d 4f 01 lea 0x1(%rdi),%ecx
85fb: 48 63 c9 movslq %ecx,%rcx
85fe: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8603: 8d 4e 01 lea 0x1(%rsi),%ecx
8606: 48 63 c9 movslq %ecx,%rcx
8609: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
860d: 41 38 ca cmp %cl,%r10b
8610: 74 0e je 8620 <mainGtU_2+0x50>
8612: 44 38 d1 cmp %r10b,%cl
8615: 0f 92 c0 setb %al
8618: c3 ret
8619: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
8620: 8d 4f 02 lea 0x2(%rdi),%ecx
8623: 48 63 c9 movslq %ecx,%rcx
8626: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
862b: 8d 4e 02 lea 0x2(%rsi),%ecx
862e: 48 63 c9 movslq %ecx,%rcx
8631: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8635: 41 38 ca cmp %cl,%r10b
8638: 75 d8 jne 8612 <mainGtU_2+0x42>
863a: 8d 4f 03 lea 0x3(%rdi),%ecx
863d: 48 63 c9 movslq %ecx,%rcx
8640: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8645: 8d 4e 03 lea 0x3(%rsi),%ecx
8648: 48 63 c9 movslq %ecx,%rcx
864b: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
864f: 41 38 ca cmp %cl,%r10b
8652: 75 be jne 8612 <mainGtU_2+0x42>
8654: 8d 4f 04 lea 0x4(%rdi),%ecx
8657: 48 63 c9 movslq %ecx,%rcx
865a: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
865f: 8d 4e 04 lea 0x4(%rsi),%ecx
8662: 48 63 c9 movslq %ecx,%rcx
8665: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8669: 41 38 ca cmp %cl,%r10b
866c: 75 a4 jne 8612 <mainGtU_2+0x42>
866e: 8d 4f 05 lea 0x5(%rdi),%ecx
8671: 48 63 c9 movslq %ecx,%rcx
8674: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8679: 8d 4e 05 lea 0x5(%rsi),%ecx
867c: 48 63 c9 movslq %ecx,%rcx
867f: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
8683: 41 38 ca cmp %cl,%r10b
8686: 75 8a jne 8612 <mainGtU_2+0x42>
8688: 8d 4f 06 lea 0x6(%rdi),%ecx
868b: 48 63 c9 movslq %ecx,%rcx
868e: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
8693: 8d 4e 06 lea 0x6(%rsi),%ecx
8696: 48 63 c9 movslq %ecx,%rcx
8699: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
869d: 41 38 ca cmp %cl,%r10b
86a0: 0f 85 6c ff ff ff jne 8612 <mainGtU_2+0x42>
86a6: 8d 4f 07 lea 0x7(%rdi),%ecx
86a9: 48 63 c9 movslq %ecx,%rcx
86ac: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
86b1: 8d 4e 07 lea 0x7(%rsi),%ecx
86b4: 48 63 c9 movslq %ecx,%rcx
86b7: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
86bb: 41 38 ca cmp %cl,%r10b
86be: 0f 85 4e ff ff ff jne 8612 <mainGtU_2+0x42>
86c4: 8d 4f 08 lea 0x8(%rdi),%ecx
86c7: 48 63 c9 movslq %ecx,%rcx
86ca: 44 0f b6 14 08 movzbl (%rax,%rcx,1),%r10d
86cf: 8d 4e 08 lea 0x8(%rsi),%ecx
86d2: 48 63 c9 movslq %ecx,%rcx
86d5: 0f b6 0c 08 movzbl (%rax,%rcx,1),%ecx
86d9: 41 38 ca cmp %cl,%r10b
86dc: 0f 85 30 ff ff ff jne 8612 <mainGtU_2+0x42>
The setup code of the two functions is a little different, but once we get going it looks something like this:
UInt32 Int32
====================================================================
| jne 8612 <mainGtU_2+0x42>
jne 6d0a <mainGtU.part.0+0x2a> | lea 0x8(%rdi),%ecx
lea 0x8(%rdi),%edx | movslq %ecx,%rcx
lea 0x8(%rsi),%ecx | movzbl (%rax,%rcx,1),%r10d
movzbl (%rax,%rdx,1),%edx | lea 0x8(%rsi),%ecx
movzbl (%rax,%rcx,1),%ecx | movslq %ecx,%rcx
cmp %cl,%dl | movzbl (%rax,%rcx,1),%ecx
jne 6d0a <mainGtU.part.0+0x2a> | cmp %cl,%r10b
| jne 8612 <mainGtU_2+0x42>
I'm not sure if the `signed` code is better or worse, although in terms of instruction count, there is definitely more of it.
Weird.
I also tried setting `CFLAGS=-march=native` before building, thinking that maybe there's some platform-specific code that we wanted the compiler to generate, but the code for the two functions seems to be identical with and without this flag.
There are cases where we are explicitly being made aware of a hidden tradeoff, and are given tools for dealing with it.
A good example of this is the C and C++5 flag `-ffast-math`, which enables a collection of other flags that relax some of the requirements of the IEEE-754 floating-point number standard.
One of the flags it sets, `-fassociative-math`, allows the compiler to reorder the operation `(a + b) + c` to `a + (b + c)`6.
Another is `-freciprocal-math`, which allows the compiler to consider `a / b` the same as `a * (1 / b)`.
In and of themselves these transformations are not so valuable, but in combination with common subexpression elimination or loop hoisting they can yield good speedups.
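As a small, hypothetical sketch (the function is mine): a loop that divides by a loop-invariant scale. Under `-freciprocal-math` the compiler may rewrite each `v[i] / scale` as `v[i] * (1.0f / scale)`, and hoisting then computes the reciprocal once outside the loop instead of dividing on every iteration:

```c
/* Without -freciprocal-math the compiler must emit one (slow) division
   per element; with it, a single reciprocal plus one multiply per
   element is allowed, even though rounding may differ slightly. */
void normalize(float *v, int n, float scale) {
    for (int i = 0; i < n; i++)
        v[i] = v[i] / scale;
}
```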
By specifying `-ffast-math` we can allow the compiler to change the semantics of our programs (in a limited sense) such that it can make the output code faster.
However, we still need to know that this is a flag we can set.
If we don't know about `-ffast-math` and we don't mind these transformations7, we are inhibiting the compiler's ability to generate good code without gaining any benefit.
John Carmack's conversation on Lex Fridman's podcast contains a similar example of the idea that some of these trade-offs are very skewed. The part can be heard here. In talking about the innovations required to make Quake, and specifically about optimization, John says this:
The most leverage comes from making the decisions that are a little bit higher up, where you figure out how to change your large scale problem so that these lower level problems are easier to do, or it makes it possible to do them in a uniquely fast way.
--- John Carmack
The changes John is talking about are slightly different from Chandler's, because in John's case there is, say, a design decision with some wiggle room which can be used to yield huge benefits in terms of speedup. Chandler's signedness case is an example of a decision that was accidentally made8, maybe without knowing so and probably without knowing its impact9.
In an earlier post I explored the codegen of a `merge` function10 and tried to have the compiler output branchless code with the conditional move instruction `cmove`.
By writing the function in slightly different ways, and eventually writing it straight in `x86`, I got a total of seven variants of what I considered the same function.
The time spent on a micro benchmark ranged from 31 ms for the slowest (the initial straightforward way) to 19 ms for the fastest (the asm, but two C variants were also down there).
At the time I was happy with having beaten the compiler on optimizing such a small and simple function.
Now I'm not so sure this is what happened, and I suspect that there are inputs which would make my trivial-and-slow implementation behave differently than any of the fast ones.
This would mean that the optimizer wasn't too stupid to get it right, but that I accidentally encoded behavior in the implementation that was too constraining for it to work around.
Behavior that I potentially didn't care about and that I would gladly give up for a 38% reduction in execution time.
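For reference, here is a sketch of the kind of `merge` step involved (assuming `int` elements; this is not the code from the earlier post). The selection is written so that a compiler can plausibly emit a conditional move instead of an unpredictable branch:

```c
/* Merge two sorted arrays a (length na) and b (length nb) into out.
   take_a is 0 or 1, so the select and the index bumps are all
   branch-free candidates for cmove-style code generation. */
void merge(const int *a, int na, const int *b, int nb, int *out) {
    int ia = 0, ib = 0, io = 0;
    while (ia < na && ib < nb) {
        int take_a = a[ia] <= b[ib];
        out[io++] = take_a ? a[ia] : b[ib];
        ia += take_a;
        ib += !take_a;
    }
    while (ia < na) out[io++] = a[ia++];  /* drain leftovers */
    while (ib < nb) out[io++] = b[ib++];
}
```

Whether the compiler actually emits `cmove` here still depends on the exact phrasing and the optimizer, which was the point of the experiment.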
If we want our programs to be fast we clearly need to understand what our computers can do, but we also need to understand what our programs are actually instructing the computer to do, and which constraints we are setting for an optimizing compiler. We cannot simply outsource the job of generating fast machine code to the compiler, because we need to use the wiggle room of our design space to our advantage, and the compiler cannot do this. Without working from both ends we may often find ourselves with terrible machine code that a simple and insignificant change to our code would have fixed. By choosing to ignore this, we pay the price.
Suggestions, comments, tips, and the signed bit of your integers can be sent to my public inbox (plain text email only).
Thanks for reading.
Footnotes
-
This is also where undefined behavior comes in; the compiler is allowed to assume that UB does not happen, because a program execution in which it does happen is nonsensical. If it can show that a variable having a certain value would cause UB, it is allowed to assume that this variable will not have that value. Depending on how offended you would be by being called a "language lawyer", you might find this argument to be nonsensical, but this is the status quo. ↩
-
The numbers in this case were used for offsets from a base pointer. These should never be negative, and so by using `unsigned` integers we can enforce this trait. ↩
-
I don't think Chandler is being very charitable in his guesswork here, but it is a talk trying to disarm UB-fear so as a story telling device I guess it's ... fine? ↩
-
Conversely, if we accidentally have huge positive offsets it is very likely that we `segfault` at once, as the address of `block[2147483648]` and higher is very likely not mapped. ↩
-
I assume many more languages have either the same flag or a similar flag with a different name. ↩
-
I wrote a blog post about some surprises of floating-point numbers here, where I also give an explicit example of non-associativity. ↩
-
There are good reasons for not using `-ffast-math`; very carefully written numerical code will often depend on the exact order of operations in order to avoid losing precision, dealing correctly with `NaN`s, and so on. `-ffast-math` throws all of this out the window. ↩
-
Again, I'm guessing here. ↩
-
Now that we did the work and found that the assembly looked, if not worse, not better, it is worth questioning whether this decision really had any impact or not. Maybe this is the real lesson here? ↩
-
`merge` takes two sorted lists and merges them into one sorted list. Usually this is done by walking along the fronts of the two lists and popping the smaller of the two elements. ↩
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License