Hello!
Recently I've faced a confusing problem:
I've accidentally re-submitted my solution for problem 1554E - You on different compiler. Suddenly I've noticed that its running time has significantly increased.
- This solution 128134170 takes 1263 ms under GNU C++17 compiler.
- This solution 128134185 takes 2277 ms under GNU C++17 (64) compiler.
- You can see that their codes are fully identical.
So, common code with some vectors, self-written modular class and lambdas causes that much time difference — approximately 2 times. The memory also increased (1.5x times), it is also kinda suspicious because I haven't used any pointers.
I've re-written it without lambdas — lifted dfs
out of solve
(128134438 and 128134426), both execution times have increased in about 200 ms.
I had somewhere heard that vector<bool>
is very problematic structure. I even have changed it to vector<int>
and resubmitted again (128141354 and 128141276). The execution times have increased a lot and gone closer, but the gap is still quite big (2105 ms / 2807 ms).
So, I am quite at loss, why would 64-bit compiled program take (2 times) more time to execute. Can anyone help me and explain the reason?
P. S. I can hardly think of any good query for googling. This comment is talking about something similiar; however, it isn't answered.
Auto comment: topic has been updated by LeoPro (previous revision, new revision, compare).
You implicitly use pointers when you use recursion (for pushing return addresses and temporary variables onto the stack, and &tree, &minus, &right are pointers).
It makes sense. Thank you!
Just a note, I think that allocating vectors in for loops is very slow. For a heavily composite number + 1 as n, you allocate $$$2*128*n = 256n \leq 768e5 \approx 7e7$$$, which sounds really excessive. My hunch didn't translate well when I submitted with changing doing this allocation outside the loop, but I think this is a problem, even if not the cause for this timing issue.
GCC 7.3 is so cool that it's able to inline (unroll) your DFS somehow, probably partially. While GCC 9.2 can't do that :(
Do you know why is it so, and if there are more such tricks? Sounds interesting!
Also I'm not able to replicate 32bit result on 64bit using pragmas, but maybe you know some way to do that?
Wow, if it's so, the compiler is really cool!
By the way, how did you know that? Did you examine the compiled assembly code?
GCC 7.x generates significantly larger code than GCC 9.x at least in my Linux system with and without -m32 option (32-bit and 64-bit modes), which is likely caused by more inlining/unrolling:
I tried to use the following testcase generator for producing random large input (which is obviously not good enough, but nobody came up with anything better yet):
With this input file, the performance of GCC 9.x and GCC 7.x linux binaries is roughly the same (the 32-bit output of GCC 7.x is slightly slower than its 64-bit output):
Profiling with the
perf
tool shows that the time is primarily spent on I/O processing and malloc/free, and less than 30% in thesolve()
function:But again, without a proper worst case input file my comment is mostly useless. It just shows how performance analysis can be done in general. If somebody provides a better testcase, then we can look into it further.
In most tree problems worst case is either line tree (1-2-...-n) or sun tree (1-2, 1-3, ...1-n). It's also worth to try their combination (1-2-...-(n/2), then (n/2)-(n/2+1), (n/2)-(n/2+2), (n/2)-n).
just line is not enough, need random shuffled line
What's wrong with them? My program takes $$$O(n \cdot (d(n - 1) + \log n))$$$ for line and the same for sun (and it's the worst case). Do you have any smarter solution?
I answered to comment without context, 1-2-..-n is good for cache and usually works 2-3 times faster than shuffled
Yes, I took a look on assembly code on Godbolt, and saw that dfs function by gcc 7.3 was like 5 times larger than by gcc 9.2. It still has recursive call inside, so it's not a full unroll. Also I tested some code without lambdas, where both compilers produced short and simple assembly implementation, and after adding
inline
keyword to dfs function gcc7 gave out huge assembly again, like in lambda version, while gcc9 didn't change anything.And replying to never_giveup question too: I did not read the logic of large generated code. I just assumed that it is something like unrolling, but I do not know what the optimization is exactly.
How to construct the slowest testcase for this problem? Just a random tree doesn't seem to be good enough.
I had the opposite experience in round 714. For problem C I made 3 submissions(all TLE 1000ms) with GNU C++17 compiler and 2 submission(545ms and 654 ms) with GNU C++17 (64) compiler.
First two submissions were inefficient and TLEd. After improving it further it still TLEd on pretest 1. Then I submitted the same code in GNU C++17 (64) compiler and surprisingly it did not TLE and gave WA on pretest 3. Changed the ans variable from int to long long and I got AC.
My submissions ran twice as fast in GNU C++17 (64) compiler. Since that day I always submit in 64 bit compiler.
LL operations are around twice faster in 64: 32, 64
The following update to your code has slight improvement in the execution time when compiled using GNU G++17 9.2.0 (64-bit). The same update compiled using GNU G++17 7.3.0 is slower than your code, even though it is still faster than the former execution time.
128493289
128493370
Hm, what did you do? I read your code; it seems that you tried to optimize it as mich as possible.
However, you changed the type of dfs to
const function<void(int, int)>
which is supposed to be slow (I can't find any source, but here it is: 128507433 and 128507422. The first version of my code is much slower ifauto
is replaced byfunction
). So that big speed increase is rather strange.Yes, you are right. That's exactly what I did. I have just done another bit of optimization. I moved the recursive function
dfs
out of the test-case loop, and changed its object type back toauto
. However, I changed its type in the parameters list of the function toconst auto&
instead ofauto
. The following is the execution time obtained after this optimization.128540002
The following is yet another bit of optimization. I added a function to prune the adjacency vectors of the tree nodes, by removing the parent node from the adjacency list of each of its children nodes. Therefore, the
dfs
function does not need to check that an adjacent node is not parent of the current node. I also changed the return type of thedfs
so that it returns the valueright[current_node]
orfalse
. The following is the execution time obtained after this optimization.128545594
I tried to use
int_fast32_t
instead ofint
, but it seems even slower. https://codeforces.me/contest/1554/submission/128509324On the subject of
vector<bool>
. Instead of switching it out withvector<int>
, try using something likevector<char>
. It cuts down your time in C++17(32 bit) to950 ms
128678604.I also tried to apply this change together with other optimizations, but it:
slightly improved performance for the 32-bit version
873 ms
128686428 vs.841 ms
128686708slightly regressed performance for the 64-bit version
1154 ms
128685795 vs.1232 ms
128686737The larger storage space tradeoff makes cache thrashing problem even worse for the 64-bit version and that's probably what is degrading performance. The difference is not very significant though and it might be within the measurement error margin.
Thanks to helpful comments from dalex and oversolver, I finally managed to construct a reasonably good testcase generator:
Yes, randomly shuffling nodes before connecting them in a line and then randomly shuffling edges in the input data indeed does the job. And picking the right $$$n$$$ value to maximize the number of divisors was important too. The choice of seed for randomization makes a very big difference, resulting in times between ~300ms and ~3500ms on my computer. Using one of the worst performing testcases, profiling results for the original LeoPro's solution from this blog now look like this:
So the time is now spent in the solve() function. Branches are predicted just fine, but only 0.24 instructions per cycle IPC is very bad. Such low performance is primarily caused by data cache misses. And 64-bit code is noticeably slower than the 32-bit code for both GCC 7.x and GCC 9.x, because 64-bit pointers (such as return addresses on stack and other data) have a larger footprint and suffer more from cache thrashing.
Some performance improvements are possible. Sorting nodes in the adjacency list partially reverses randomization and can make memory access pattern a bit more cache friendly (this helps with the current codeforces testcases, but is not always effective). Reworking DFS to manually enforce inlining also helps to reduce the amount of data pushed to stack. Here's my modification of LeoPro's solution: 128685795 (64-bit version, 1154 ms) and 128686428 (32-bit version, 873 ms).