July 17, 2016

We all know that floating point nubmers have a limited precision.
With small numbers, say between zero and one, we can describe the fractional part of the number in great detail.
However, if the number is large, there may not be room for any fractional part.
So how large numbers do we need before `a == a + 1`

?

We’ll find our answer by simply checking:

```
#include <stdio.h>
int simple() {
float num = 1.0f;
float inc = num + 1;
while (num != inc) {
num *= 10;
inc = num + 1;
}
printf("%f\n", num);
}
```

which outputs `100000000.000000`

.
If we multiply by `2`

insead of `10`

, we may get a more intuitive number: `16777216 = 2^24`

.
If we look at the IEEE 754 single-precision floating-point format,
we see that the fraction part uses only `24`

bits;
our number `2^24`

needs `25`

bits, so we can safely add one, because the last bit gets truncated.

How high can we go if we are using `double`

instead of `float`

?
As IEEE 754 double-presicion uses `53`

bits, we are expecting to get to `2^53 = 9007199254740992`

, or
`10000000000000000`

if we multiply by `10`

.
And that is what we get.

How much can we add before getting a new number?
One would think the answer was simply until the addition carries over to the `24`

first bits.
However, the answer is not that straight forward.
Instead, IEEE has defined Rounding rules, for deciding what to do when the number we would like to
represent doesn’t quite fit in `24`

bits.
The default rule is *«Round to nearest, ties to even»*, meaning if we are in between two numbers, round to the even number.

Let’s try this out with the totaly random number `314159256`

:

```
314159265 = 10010101110011011000010100001
|------- 24 bits ------||---|
// The last 5 bits will be truncated
// .. so the 16 numbers from
314159264 = 10010101110011011000010100000
// to
314159279 = 10010101110011011000010101111
// are all the same.
```

This is fairly easy to confirm:

```
void confirm_suspicions(void) {
double first = 314159264;
double number = first;
while (((float) first) == ((float) number)) {
number++;
}
printf("%f != %f\n", number, first);
// 314159280 != 314159264
}
```

If we however were to set the fifth bit, the rounding rule would round all the way up to `314156296`

```
314159280 = 10010101110011011000010110000
|------- 24 bits ------||---|
// Round up, so the least significant bit of the 24 bits becomes 0
314159296 = 10010101110011011000011000000
// This will be the same number all the way up to
314159312 = 10010101110011011000011010000
// .. where the rounding rule allows us to include ..10000, as the lsb is 0.
// This makes 33 different numbers, all represented as the same floating point number.
```

What can we take home from this? First of all, floating point numbers doesn’t behave like real numbers, and its easy to forget this:

```
float a,b;
// ...
a = b;
a++;
a == b // true ???
```

or what about this?

```
float a;
// ...
for (; a < large_float; a++) {
// ...
}
```

which is an infinite loop, if `a == (a + 1)`

.
Lastly, consider this:

```
void oops(void) {
float a = 314159264;
float c = a + 10 + 10;
printf("%d\n", c == 314159284.f); // 0
float d = a + (10 + 10);
printf("%d\n", d == 314159284.f); // 1
}
```

And you thought addition was associative? Not in the world of floating point numbers!

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License