Discrepancy in performance depending on compiler

While trying to squeeze my solution to 1336E2 - Chiori and Doll Picking (hard version) into the time limit, I encountered an unexpectedly large difference in execution times among the various C++ compilers offered by Codeforces. Since I think this is worth knowing, I am sharing the discovery.

Consider the following minimal working example (it generates $$$2N=30$$$ random $$$56$$$-bit numbers and computes the sum of the popcounts of the xors of all $$$2^{30}$$$ subsets of these $$$30$$$ numbers).

Code

#pragma GCC optimize ("unroll-loops")
#pragma GCC target("sse3,sse4")

#include <assert.h>
#include <stdlib.h>

typedef unsigned long long ULL;

ULL rand_short() { // 8 bits
    return rand() & ((1<<8)-1);
}

ULL rand_ull() { // 56 bits
    ULL res = 0;
    for (int i = 0; i < 7; i++) res = (res<<8)|rand_short();
    return res;
}

int main() {
    srand(123);
    const int N = 15;
    
    ULL a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = rand_ull(), b[i] = rand_ull();

    // c[bb] (resp. d[bb]) is the xor of the subset of a (resp. b) selected by the bits of bb.
    ULL c[1<<N], d[1<<N];
    for (int bb = 0; bb < (1<<N); bb++) {
        c[bb] = d[bb] = 0;
        for (int i = 0; i < N; i++) {
            if (bb&(1<<i)) c[bb] ^= a[i], d[bb] ^= b[i];
        }
    }
    
    ULL res = 0;
    for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
        res += __builtin_popcountll(c[i]^d[j]);
    }
    assert(res == 30064771072);
}

The important lines are the following, where __builtin_popcountll and a 64-bit xor are executed $$$2^{30}$$$ times.

 ULL res = 0;
 for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
     res += __builtin_popcountll(c[i]^d[j]);
 }

Executing the above program in Codeforces custom invocation yields these execution times:

 Compiler                         Execution Time
 GNU GCC C11 5.1.0                4040 ms
 GNU G++11 5.1.0                  4102 ms
 GNU G++14 6.4.0                  1123 ms
 GNU G++17 7.3.0                  1107 ms
 GNU G++17 9.2.0 (64bit, msys 2)  374 ms

Notice that the 64-bit compiler produces a much faster executable: about 3 times faster than the best of the other compilers and about 11 times faster than the slowest (and notice that there is quite a difference among the other compilers as well). Hence, the next time you have to optimize a solution with many bit operations on 64-bit integers (a situation that is not so uncommon), consider using the compiler GNU G++17 9.2.0 (64bit, msys 2).

It might be that the differences among the execution times are due to the way I have written the program (maybe the wrong pragmas? maybe some idiom that prevents certain compilers from optimizing? maybe something else?). If this is the case, please enlighten me!

Rev.	By	When	Δ	Comment
en3	dario2994	2020-04-17 17:49:49	3	(published)
en2	dario2994	2020-04-17 17:45:49	195	Tiny change: 'ion times when using the var' -> 'ion times among the var'
en1	dario2994	2020-04-17 13:18:41	2783	Initial revision (saved to drafts)
