Introduction
I'm writing this blog because of the large number of blogs asking about why they get strange floating arithmetic behaviour in C++. For example:
"WA using GNU C++17 (64) and AC using GNU C++17" https://codeforces.me/blog/entry/78094
"The curious case of the pow function" https://codeforces.me/blog/entry/21844
"Why does this happen?" https://codeforces.me/blog/entry/51884
"Why can this code work strangely?" https://codeforces.me/blog/entry/18005
and many many more.
Example
Here is a simple example of the kind of weird behaviour I'm talking about
Looking at this example, the output that one would expect from $$$10 * 10 - 10^{-15}$$$ is exactly $$$100$$$ since $$$100$$$ is the closest representable value of a double. This is exactly what happens in 64 bit g++. However, in 32 bit g++ there seems to be some kind of hidden excess precision causing the output to only sometimes(???) be $$$100$$$.
Explanation
In C and C++ there are different modes (referred to as methods) of how floating point arithmetic is done, see (https://en.wikipedia.org/wiki/C99#IEEE_754_floating-point_support). You can detect which one is being used by the value of FLT_EVAL_METHOD
found in cfloat
. In mode 2 (which is what 32 bit g++ uses by default) all floating point arithmetic is done using long double. Note that in this mode numbers are temporarily stored as long doubles while being operated on, this can / will cause a kind of excess precision. In mode 0 (which is what 64 bit g++ uses by default) the arithmetic is done using each corresponding type, so there is no excess precision.
Detecting and turning on/off excess precision
Here is a simple example of how to detect excess precision (partly taken from https://stackoverflow.com/a/20870774)
If b is rounded (as one would "expect" since it is a double), then the result is zero. Otherwise it is something like 8e-17 because of excess precision. I tried running this in custom invocation. MSVC(C++17), Clang and g++17(64bit) all use mode 0 and round b to 0, while g++11, g++14 and g++17 as expected all use mode 2 and b = 8e-17.
The culprit behind all of this misery is the old x87 instruction set, which only supports (80 bit) long double arithmetic. The modern solution is to on top of this use the SSE instruction set (version 2 or later), which supports both float and double arithmetic. On GCC you can turn this on with the flags -mfpmath=sse -msse2
. This will not change the value of FLT_EVAL_METHOD
, but it will effectively turn off excess precision, see 81993714.
It is also possible to effectively turn on excess precision with -mfpmath=387
, see 81993724.
Fun exercise
Using your newfound knowledge of excess precision, try to find a compiler + input to "hack" this
Conclusion / TLDR
32 bit g++ by default does all of its floating point arithmetic with (80 bit) long double. This causes a ton of frustrating and weird behaviours. 64 bit g++ does not have this issue.
So what is the suggested solution for CP contests?
#pragma GCC target("fpmath=387") // Turns on excess precision
=> This seems very brittle and platform-dependent.Does using
long double
for all calculations also work?Solution: Don't do stupid things with floating point. Don't check floating point for equality. Be aware how floating point behaves wrt precision.
If you need extended precision you can explicitly use it, but in most cases it won't be needed if you take some care with how you do floating point operations (or avoid floating point when it's not needed).
I explicitly use
-Wfloat-equal
to ensure I never end up equating floating points.However, at least in the example codes in the blog, a very simple subtraction operation is done. While I am aware of floating point issues, I would say that extended precision is more preferable than thinking about whether this particular line may lead to WA later.
Is there any substantial difference between
a - b == 0
anda == b
? The code in the blog doesa - b
and realizes it can give either 0 or stupidly close to zero. This is the exact same phenomenon as trying to compare two floats. I.e. not doing equality checks resolves it.Yes, using long double everywhere works, doing it like that means you won't have to worry about excess precision. In general simply using 64 bit C++ means you don't have to worry about excess precision.
I wrote this blog to inform people about how floating arithmetic is done in C++. That does not mean I think excess precision is a good idea. The way I see it, excess precision is a thing of the past and I'm happy I don't have to bother with it when I use 64 bit C++.
Well explained! But the real problem is people comparing floats using a == b :)
Doesn't this violate IEEE 754 since it requires basic floating point operations to be correctly rounded (results in closest representable value)? Is it related to https://codeforces.me/blog/entry/21844?
Edit: Seems that IEEE 754 allows excess precision.
After testing I'm pretty sure https://codeforces.me/blog/entry/21844 is related to excess precision. I found a minimum working example that I believe has the same issue.
As you can see, excess precision is able to "leak out" of
f
.The thing I'm not sure of is if pow having excess precision should be seen as a bug or not. From testing on cf, only when submitting under g++11 does pow have excess precision. It seems to not have excess precision on any later version. So probably pow having excess precision should be categorized as a bug, and it has now been fixed.
Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
Auto comment: topic has been updated by pajenegod (previous revision, new revision, compare).
The Code in (Detecting and turning on/off excess precision) Section:
using g++20 in Clion Gives 0 and the
FLT_EVAL_METHOD
says that the mode is 2 but when I changedouble b = a*a
tolong double b = a*a
it give 8e-17 as I understand from you they should work the same or did i get something wrong?and The code in (Fun exercise) Section :
how does the
int w = pow(y, 2);
affect the result?The way mode 2 works is that all consecutive floating point calculations are done using
long doubles
. But when the floating point numbers are stored in memory, they are stored as their respecitive type.The pow call forces
y
to be stored in memory, which roundsy
. For example, ify
had the value $$$1-2^{-60}$$$, then the pow call rounds it to exactly1.0
since1.0
is the closest representable double to $$$1-2^{-60}$$$.