Huge discrepancy in performance depending on compiler

№	Пользователь	Рейтинг
1	tourist	4009
2	jiangly	3823
3	Benq	3738
4	Radewoosh	3633
5	jqdai0815	3620
6	orzdevinwang	3529
7	ecnerwala	3446
8	Um_nik	3396
9	ksun48	3390
10	gamegame	3386

№	Пользователь	Вклад
1	cry	164
1	maomao90	164
3	Um_nik	163
4	atcoder_official	160
4	adamant	160
6	-is-this-fft-	158
7	awoo	157
8	TheScrasse	154
8	Dominater069	154
8	nor	154

While trying to squeeze my solution of 1336E2 - Chiori and Doll Picking (hard version) into the time limit, I have noticed a huge discrepancy in performance among the C++ compilers offered by codeforces. Since I think it is something worth knowing (and it is rather unexpected for me), I am sharing this discovery.

Consider the following minimal working example (it generates $$$2N=30$$$ random $$$56$$$-bits numbers and computes the sum of the bits of the xors of all the subsets of the $$$30$$$ numbers).

Code

#pragma GCC optimize ("unroll-loops")
#pragma GCC target("sse3,sse4")

#include <assert.h>
#include <stdlib.h>

typedef unsigned long long ULL;

ULL rand_short() { // 8 bits
    return rand() & ((1<<8)-1);
}

ULL rand_ull() { // 56 bits
    ULL res = 0;
    for (int i = 0; i < 7; i++) res = (res<<8)|rand_short();
    return res;
}

int main() {
    srand(123);
    const int N = 15;
    
    ULL a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = rand_ull(), b[i] = rand_ull();

    ULL c[1<<N], d[1<<N];
    for (int bb = 0; bb < (1<<N); bb++) {
        c[bb] = d[bb] = 0;
        for (int i = 0; i < N; i++) {
            if (bb&(1<<i)) c[bb] ^= a[i], d[bb] ^= b[i];
        }
    }
    
    ULL res = 0;
    for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
        res += __builtin_popcountll(c[i]^d[j]);
    }
    assert(res == 30064771072);
}

The important lines are the following ones, where the function __builtin_popcountll is called $$$2^{30}$$$ times.

 ULL res = 0;
 for (int i = 0; i < (1<<N); i++) for (int j = 0; j < (1<<N); j++) {
     res += __builtin_popcountll(c[i]^d[j]);
 }

Executing the above program in Codeforces custom invocation, yields these execution times:

 Compiler                         Execution Time
 GNU GCC C11 5.1.0                4040 ms
 GNU G++11 5.1.0                  4102 ms
 GNU G++14 6.4.0                  1123 ms
 GNU G++17 7.3.0                  1107 ms
 GNU G++17 9.2.0 (64bit, msys 2)  374 ms

Notice that the only 64bit-native compiler that Codeforces offers produces a much faster executable (and notice that also among the other compilers there is quite a difference).

Next time you have to optimize a solution with a lot of bit-operations on 64bit integers (and, in fact, this situation is not so uncommon), consider using the compiler GNU G++17 9.2.0 (64bit, msys 2).

It might be that the differences among the various compilers are due to the way I have written the program (maybe the wrong PRAGMAS? Maybe preventing some compilers from optimizing because of a certain idiom? Maybe something else), if this is the case, please enlighten me!

Rev.	Кто	Когда	Δ	Комментарий
en3	dario2994	2020-04-17 17:49:49	3	(published)
en2	dario2994	2020-04-17 17:45:49	195	Tiny change: 'ion times when using the var' -> 'ion times among the var'
en1	dario2994	2020-04-17 13:18:41	2783	Initial revision (saved to drafts)

Rev.

Язык

Кто

Когда

Комментарий

en3