[ASK] Huge performance problem: C++ is faster than C++ 64 bit!

#	User	Rating
1	tourist	3856
2	jiangly	3747
3	orzdevinwang	3706
4	jqdai0815	3682
5	ksun48	3591
6	gamegame	3477
7	Benq	3468
8	Radewoosh	3462
9	ecnerwala	3451
10	heuristica	3431

#	User	Contrib.
1	cry	167
2	-is-this-fft-	162
3	Dominater069	160
4	Um_nik	158
5	atcoder_official	156
6	Qingyu	155
7	djm03178	152
7	adamant	152
9	luogu_official	150
10	awoo	147

Hello!

Recently I've faced a confusing problem:

I've accidentally re-submitted my solution for problem 1554E - You on different compiler. Suddenly I've noticed that its running time has significantly increased.

This solution 128134170 takes 1263 ms under GNU C++17 compiler.
This solution 128134185 takes 2277 ms under GNU C++17 (64) compiler.
You can see that their codes are fully identical.

So, common code with some vectors, self-written modular class and lambdas causes that much time difference — approximately 2 times. The memory also increased (1.5x times), it is also kinda suspicious because I haven't used any pointers.

I've re-written it without lambdas — lifted dfs out of solve (128134438 and 128134426), both execution times have increased in about 200 ms.

I had somewhere heard that vector<bool> is very problematic structure. I even have changed it to vector<int> and resubmitted again (128141354 and 128141276). The execution times have increased a lot and gone closer, but the gap is still quite big (2105 ms / 2807 ms).

So, I am quite at loss, why would 64-bit compiled program take (2 times) more time to execute. Can anyone help me and explain the reason?

P. S. I can hardly think of any good query for googling. This comment is talking about something similiar; however, it isn't answered.

# g++-9.3.0 -O2 -g test.cpp && perf record ./a.out < testcase.txt > /dev/null && perf report 92.85% a.out a.out [.] solve()::{lambda(auto:1, int, int)#1}::operator()<{lambda(auto:1, int, int)#1}> 2.11% a.out a.out [.] solve 0.76% a.out libstdc++.so.6.0.29 [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<char> > >::_M_extract_int<long> 0.46% a.out libc-2.33.so [.] _int_malloc 0.40% a.out libc-2.33.so [.] _int_free 0.38% a.out a.out [.] std::vector<int, std::allocator<int> >::_M_realloc_insert<int const&> 0.38% a.out libc-2.33.so [.] malloc_consolidate 0.37% a.out libc-2.33.so [.] cfree@GLIBC_2.2.5 0.34% a.out libc-2.33.so [.] __memmove_sse2_unaligned_erms 0.32% a.out libc-2.33.so [.] unlink_chunk.isra.2 0.29% a.out libc-2.33.so [.] malloc # perf stat ./a.out < testcase.txt > /dev/null Performance counter stats for './a.out': 5,079.73 msec task-clock:u # 1.000 CPUs utilized 0 context-switches:u # 0.000 K/sec 0 cpu-migrations:u # 0.000 K/sec 8,253 page-faults:u # 0.002 M/sec 14,121,518,317 cycles:u # 2.780 GHz 12,510,832,066 stalled-cycles-frontend:u # 88.59% frontend cycles idle 10,556,594,313 stalled-cycles-backend:u # 74.76% backend cycles idle 3,359,030,926 instructions:u # 0.24 insn per cycle # 3.72 stalled cycles per insn 546,097,415 branches:u # 107.505 M/sec 25,891,347 branch-misses:u # 4.74% of all branches 5.080345274 seconds time elapsed 5.040075000 seconds user 0.040000000 seconds sys

Comments (24)

Show archived | Write comment?

LeoPro

3 years ago, # |

Auto comment: topic has been updated by LeoPro (previous revision, new revision, compare).

→ Reply

Kyou_mo_kawaii

← Rev. 2 →

+76

The memory also increased (1.5x times), it is also kinda suspicious because I haven't used any pointers.

You implicitly use pointers when you use recursion (for pushing return addresses and temporary variables onto the stack, and &tree, &minus, &right are pointers).

3 years ago, # ^ |

It makes sense. Thank you!

faker_faker

Just a note, I think that allocating vectors in for loops is very slow. For a heavily composite number + 1 as n, you allocate $$$2*128*n = 256n \leq 768e5 \approx 7e7$$$, which sounds really excessive. My hunch didn't translate well when I submitted with changing doing this allocation outside the loop, but I think this is a problem, even if not the cause for this timing issue.

MrDindows

+54

GCC 7.3 is so cool that it's able to inline (unroll) your DFS somehow, probably partially. While GCC 9.2 can't do that :(

never_giveup

← Rev. 3 →

Do you know why is it so, and if there are more such tricks? Sounds interesting!

Also I'm not able to replicate 32bit result on 64bit using pragmas, but maybe you know some way to do that?

Wow, if it's so, the compiler is really cool!

By the way, how did you know that? Did you examine the compiled assembly code?

ssvb

+28

GCC 7.x generates significantly larger code than GCC 9.x at least in my Linux system with and without -m32 option (32-bit and 64-bit modes), which is likely caused by more inlining/unrolling:

Spoiler

# g++-7.3.0 --version
g++-7.3.0 (Gentoo 7.3.0-r3 p1.4) 7.3.0

# g++-9.3.0 --version
g++-9.3.0 (Gentoo 9.3.0-r1 p3) 9.3.0

# g++-9.3.0 -std=c++17 -O2 -c test.cpp && size test.o
   text	   data	    bss	    dec	    hex	filename
   5189	     16	      1	   5206	   1456	test.o

# g++-7.3.0 -std=c++17 -O2 -c test.cpp && size test.o
   text	   data	    bss	    dec	    hex	filename
   7366	     16	      1	   7383	   1cd7	test.o

# g++-7.3.0 -std=c++17 -O2 -m32 -c test.cpp && size test.o
   text	   data	    bss	    dec	    hex	filename
   7775	      8	      1	   7784	   1e68	test.o

I tried to use the following testcase generator for producing random large input (which is obviously not good enough, but nobody came up with anything better yet):

Ruby script

t = 3
puts t
t.times do
  n = 10 ** 5
  puts n
  nodes = [1]
  2.upto(n) do |new_node|
    random_old_node = nodes.sample
    nodes.push(new_node)
    puts "#{random_old_node} #{new_node}"
  end
end

With this input file, the performance of GCC 9.x and GCC 7.x linux binaries is roughly the same (the 32-bit output of GCC 7.x is slightly slower than its 64-bit output):

# g++-9.3.0 -std=c++17 -O2 test.cpp && time ./a.out < large_input.txt > /dev/null

real	0m0.358s
user	0m0.348s
sys	0m0.010s

# g++-7.3.0 -std=c++17 -O2 test.cpp && time ./a.out < large_input.txt > /dev/null

real	0m0.364s
user	0m0.354s
sys	0m0.010s

# g++-7.3.0 -std=c++17 -O2 -m32 test.cpp && time ./a.out < large_input.txt > /dev/null

real	0m0.409s
user	0m0.399s
sys	0m0.010s

Profiling with the perf tool shows that the time is primarily spent on I/O processing and malloc/free, and less than 30% in the solve() function:

# g++-9.3.0 -std=c++17 -O2 -g test.cpp && perf record ./a.out < large_input.txt > /dev/null && perf report

    28.90%  a.out    a.out                [.] solve
    11.18%  a.out    libstdc++.so.6.0.29  [.] std::num_get<char, std::istreambuf_iterator<char, std::char_traits<char> > >::_M_extract_int<long>
     7.57%  a.out    libc-2.33.so         [.] _int_malloc
     6.56%  a.out    libc-2.33.so         [.] cfree@GLIBC_2.2.5
     5.55%  a.out    libc-2.33.so         [.] malloc_consolidate
     5.08%  a.out    a.out                [.] std::vector<int, std::allocator<int> >::_M_realloc_insert<int const&>
     5.08%  a.out    libc-2.33.so         [.] __memmove_sse2_unaligned_erms
     4.64%  a.out    libc-2.33.so         [.] _int_free
     4.26%  a.out    libc-2.33.so         [.] memcpy@GLIBC_2.2.5
     3.57%  a.out    libc-2.33.so         [.] unlink_chunk.isra.2
     3.29%  a.out    libc-2.33.so         [.] malloc
     1.72%  a.out    libstdc++.so.6.0.29  [.] std::istream::operator>>
     1.60%  a.out    libstdc++.so.6.0.29  [.] std::__ostream_insert<char, std::char_traits<char> >
     1.59%  a.out    libstdc++.so.6.0.29  [.] std::istream::sentry::sentry
     1.15%  a.out    libstdc++.so.6.0.29  [.] std::basic_filebuf<char, std::char_traits<char> >::xsputn
     1.02%  a.out    libstdc++.so.6.0.29  [.] std::num_put<char, std::ostreambuf_iterator<char, std::char_traits<char> > >::_M_insert_int<long>
     0.95%  a.out    libstdc++.so.6.0.29  [.] operator new
     0.70%  a.out    libstdc++.so.6.0.29  [.] std::basic_streambuf<char, std::char_traits<char> >::xsputn
     0.57%  a.out    libstdc++.so.6.0.29  [.] std::__use_cache<std::__numpunct_cache<char> >::operator()
     0.51%  a.out    libstdc++.so.6.0.29  [.] std::ostream::_M_insert<long>
     0.51%  a.out    libstdc++.so.6.0.29  [.] std::istreambuf_iterator<char, std::char_traits<char> >::equal

But again, without a proper worst case input file my comment is mostly useless. It just shows how performance analysis can be done in general. If somebody provides a better testcase, then we can look into it further.

dalex

+10

In most tree problems worst case is either line tree (1-2-...-n) or sun tree (1-2, 1-3, ...1-n). It's also worth to try their combination (1-2-...-(n/2), then (n/2)-(n/2+1), (n/2)-(n/2+2), (n/2)-n).

oversolver

just line is not enough, need random shuffled line

What's wrong with them? My program takes $$$O(n \cdot (d(n - 1) + \log n))$$$ for line and the same for sun (and it's the worst case). Do you have any smarter solution?

I answered to comment without context, 1-2-..-n is good for cache and usually works 2-3 times faster than shuffled

+21

Yes, I took a look on assembly code on Godbolt, and saw that dfs function by gcc 7.3 was like 5 times larger than by gcc 9.2. It still has recursive call inside, so it's not a full unroll. Also I tested some code without lambdas, where both compilers produced short and simple assembly implementation, and after adding inline keyword to dfs function gcc7 gave out huge assembly again, like in lambda version, while gcc9 didn't change anything.

And replying to never_giveup question too: I did not read the logic of large generated code. I just assumed that it is something like unrolling, but I do not know what the optimization is exactly.

How to construct the slowest testcase for this problem? Just a random tree doesn't seem to be good enough.

TimeTraveler

I had the opposite experience in round 714. For problem C I made 3 submissions(all TLE 1000ms) with GNU C++17 compiler and 2 submission(545ms and 654 ms) with GNU C++17 (64) compiler.

First two submissions were inefficient and TLEd. After improving it further it still TLEd on pretest 1. Then I submitted the same code in GNU C++17 (64) compiler and surprisingly it did not TLE and gave WA on pretest 3. Changed the ans variable from int to long long and I got AC.

My submissions ran twice as fast in GNU C++17 (64) compiler. Since that day I always submit in 64 bit compiler.

+16

LL operations are around twice faster in 64: 32, 64

CodingKnight

The following update to your code has slight improvement in the execution time when compiled using GNU G++17 9.2.0 (64-bit). The same update compiled using GNU G++17 7.3.0 is slower than your code, even though it is still faster than the former execution time.

128493289

128493370

Hm, what did you do? I read your code; it seems that you tried to optimize it as mich as possible.

However, you changed the type of dfs to const function<void(int, int)> which is supposed to be slow (I can't find any source, but here it is: 128507433 and 128507422. The first version of my code is much slower if auto is replaced by function). So that big speed increase is rather strange.

Yes, you are right. That's exactly what I did. I have just done another bit of optimization. I moved the recursive function dfs out of the test-case loop, and changed its object type back to auto. However, I changed its type in the parameters list of the function to const auto& instead of auto. The following is the execution time obtained after this optimization.

128540002

← Rev. 7 →

The following is yet another bit of optimization. I added a function to prune the adjacency vectors of the tree nodes, by removing the parent node from the adjacency list of each of its children nodes. Therefore, the dfs function does not need to check that an adjacent node is not parent of the current node. I also changed the return type of the dfs so that it returns the value right[current_node] or false. The following is the execution time obtained after this optimization.

128545594

hly1204

I tried to use int_fast32_t instead of int, but it seems even slower. https://codeforces.me/contest/1554/submission/128509324

pajenegod

+20

On the subject of vector<bool>. Instead of switching it out with vector<int>, try using something like vector<char>. It cuts down your time in C++17(32 bit) to 950 ms 128678604.

I also tried to apply this change together with other optimizations, but it:

slightly improved performance for the 32-bit version 873 ms 128686428 vs. 841 ms 128686708
slightly regressed performance for the 64-bit version 1154 ms 128685795 vs. 1232 ms 128686737

The larger storage space tradeoff makes cache thrashing problem even worse for the 64-bit version and that's probably what is degrading performance. The difference is not very significant though and it might be within the measurement error margin.

+18

Thanks to helpful comments from dalex and oversolver, I finally managed to construct a reasonably good testcase generator:

def gen_testcase(n)
  puts n
  nodes = (1 .. n).to_a.shuffle
  edges = []
  1.upto(n - 1) {|i| edges.push([nodes[i - 1], nodes[i]]) }
  edges.shuffle.each {|edge| puts "#{edge[0]} #{edge[1]}" }
end

t = 3
puts t
t.times do
  srand(957)
  gen_testcase(98280 + 1)
end

Yes, randomly shuffling nodes before connecting them in a line and then randomly shuffling edges in the input data indeed does the job. And picking the right $$$n$$$ value to maximize the number of divisors was important too. The choice of seed for randomization makes a very big difference, resulting in times between ~300ms and ~3500ms on my computer. Using one of the worst performing testcases, profiling results for the original LeoPro's solution from this blog now look like this:

So the time is now spent in the solve() function. Branches are predicted just fine, but only 0.24 instructions per cycle IPC is very bad. Such low performance is primarily caused by data cache misses. And 64-bit code is noticeably slower than the 32-bit code for both GCC 7.x and GCC 9.x, because 64-bit pointers (such as return addresses on stack and other data) have a larger footprint and suffer more from cache thrashing.

Some performance improvements are possible. Sorting nodes in the adjacency list partially reverses randomization and can make memory access pattern a bit more cache friendly (this helps with the current codeforces testcases, but is not always effective). Reworking DFS to manually enforce inlining also helps to reduce the amount of data pushed to stack. Here's my modification of LeoPro's solution: 128685795 (64-bit version, 1154 ms) and 128686428 (32-bit version, 873 ms).

LeoPro's blog