Блог пользователя LeoPro

Автор LeoPro, история, 3 года назад, По-английски

Hello!

Recently I've faced a confusing problem:

I've accidentally re-submitted my solution for problem 1554E - You on different compiler. Suddenly I've noticed that its running time has significantly increased.

  • This solution 128134170 takes 1263 ms under GNU C++17 compiler.
  • This solution 128134185 takes 2277 ms under GNU C++17 (64) compiler.
  • You can see that their codes are fully identical.

So, common code with some vectors, self-written modular class and lambdas causes that much time difference — approximately 2 times. The memory also increased (1.5x times), it is also kinda suspicious because I haven't used any pointers.

I've re-written it without lambdas — lifted dfs out of solve (128134438 and 128134426), both execution times have increased in about 200 ms.

I had somewhere heard that vector<bool> is very problematic structure. I even have changed it to vector<int> and resubmitted again (128141354 and 128141276). The execution times have increased a lot and gone closer, but the gap is still quite big (2105 ms / 2807 ms).

So, I am quite at loss, why would 64-bit compiled program take (2 times) more time to execute. Can anyone help me and explain the reason?

P. S. I can hardly think of any good query for googling. This comment is talking about something similiar; however, it isn't answered.

  • Проголосовать: нравится
  • +111
  • Проголосовать: не нравится

»
3 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

Auto comment: topic has been updated by LeoPro (previous revision, new revision, compare).

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +76 Проголосовать: не нравится

The memory also increased (1.5x times), it is also kinda suspicious because I haven't used any pointers.

You implicitly use pointers when you use recursion (for pushing return addresses and temporary variables onto the stack, and &tree, &minus, &right are pointers).

»
3 года назад, # |
  Проголосовать: нравится +4 Проголосовать: не нравится

Just a note, I think that allocating vectors in for loops is very slow. For a heavily composite number + 1 as n, you allocate $$$2*128*n = 256n \leq 768e5 \approx 7e7$$$, which sounds really excessive. My hunch didn't translate well when I submitted with changing doing this allocation outside the loop, but I think this is a problem, even if not the cause for this timing issue.

»
3 года назад, # |
  Проголосовать: нравится +54 Проголосовать: не нравится

GCC 7.3 is so cool that it's able to inline (unroll) your DFS somehow, probably partially. While GCC 9.2 can't do that :(

  • »
    »
    3 года назад, # ^ |
    Rev. 3   Проголосовать: нравится +3 Проголосовать: не нравится

    Do you know why is it so, and if there are more such tricks? Sounds interesting!

    Also I'm not able to replicate 32bit result on 64bit using pragmas, but maybe you know some way to do that?

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +3 Проголосовать: не нравится

    Wow, if it's so, the compiler is really cool!

    By the way, how did you know that? Did you examine the compiled assembly code?

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +28 Проголосовать: не нравится

      GCC 7.x generates significantly larger code than GCC 9.x at least in my Linux system with and without -m32 option (32-bit and 64-bit modes), which is likely caused by more inlining/unrolling:

      Spoiler

      I tried to use the following testcase generator for producing random large input (which is obviously not good enough, but nobody came up with anything better yet):

      Ruby script

      With this input file, the performance of GCC 9.x and GCC 7.x linux binaries is roughly the same (the 32-bit output of GCC 7.x is slightly slower than its 64-bit output):

      Spoiler

      Profiling with the perf tool shows that the time is primarily spent on I/O processing and malloc/free, and less than 30% in the solve() function:

      Spoiler

      But again, without a proper worst case input file my comment is mostly useless. It just shows how performance analysis can be done in general. If somebody provides a better testcase, then we can look into it further.

      • »
        »
        »
        »
        3 года назад, # ^ |
          Проголосовать: нравится +10 Проголосовать: не нравится

        In most tree problems worst case is either line tree (1-2-...-n) or sun tree (1-2, 1-3, ...1-n). It's also worth to try their combination (1-2-...-(n/2), then (n/2)-(n/2+1), (n/2)-(n/2+2), (n/2)-n).

        • »
          »
          »
          »
          »
          3 года назад, # ^ |
            Проголосовать: нравится +9 Проголосовать: не нравится

          just line is not enough, need random shuffled line

          • »
            »
            »
            »
            »
            »
            3 года назад, # ^ |
            Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

            What's wrong with them? My program takes $$$O(n \cdot (d(n - 1) + \log n))$$$ for line and the same for sun (and it's the worst case). Do you have any smarter solution?

            • »
              »
              »
              »
              »
              »
              »
              3 года назад, # ^ |
                Проголосовать: нравится +10 Проголосовать: не нравится

              I answered to comment without context, 1-2-..-n is good for cache and usually works 2-3 times faster than shuffled

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +21 Проголосовать: не нравится

      Yes, I took a look on assembly code on Godbolt, and saw that dfs function by gcc 7.3 was like 5 times larger than by gcc 9.2. It still has recursive call inside, so it's not a full unroll. Also I tested some code without lambdas, where both compilers produced short and simple assembly implementation, and after adding inline keyword to dfs function gcc7 gave out huge assembly again, like in lambda version, while gcc9 didn't change anything.

      And replying to never_giveup question too: I did not read the logic of large generated code. I just assumed that it is something like unrolling, but I do not know what the optimization is exactly.

»
3 года назад, # |
  Проголосовать: нравится +5 Проголосовать: не нравится

How to construct the slowest testcase for this problem? Just a random tree doesn't seem to be good enough.

»
3 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

I had the opposite experience in round 714. For problem C I made 3 submissions(all TLE 1000ms) with GNU C++17 compiler and 2 submission(545ms and 654 ms) with GNU C++17 (64) compiler.

First two submissions were inefficient and TLEd. After improving it further it still TLEd on pretest 1. Then I submitted the same code in GNU C++17 (64) compiler and surprisingly it did not TLE and gave WA on pretest 3. Changed the ans variable from int to long long and I got AC.

My submissions ran twice as fast in GNU C++17 (64) compiler. Since that day I always submit in 64 bit compiler.

»
3 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

The following update to your code has slight improvement in the execution time when compiled using GNU G++17 9.2.0 (64-bit). The same update compiled using GNU G++17 7.3.0 is slower than your code, even though it is still faster than the former execution time.

128493289

128493370

  • »
    »
    3 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +8 Проголосовать: не нравится

    Hm, what did you do? I read your code; it seems that you tried to optimize it as mich as possible.

    However, you changed the type of dfs to const function<void(int, int)> which is supposed to be slow (I can't find any source, but here it is: 128507433 and 128507422. The first version of my code is much slower if auto is replaced by function). So that big speed increase is rather strange.

    • »
      »
      »
      3 года назад, # ^ |
      Rev. 2   Проголосовать: нравится 0 Проголосовать: не нравится

      Yes, you are right. That's exactly what I did. I have just done another bit of optimization. I moved the recursive function dfs out of the test-case loop, and changed its object type back to auto. However, I changed its type in the parameters list of the function to const auto& instead of auto. The following is the execution time obtained after this optimization.

      128540002

      • »
        »
        »
        »
        3 года назад, # ^ |
        Rev. 7   Проголосовать: нравится 0 Проголосовать: не нравится

        The following is yet another bit of optimization. I added a function to prune the adjacency vectors of the tree nodes, by removing the parent node from the adjacency list of each of its children nodes. Therefore, the dfs function does not need to check that an adjacent node is not parent of the current node. I also changed the return type of the dfs so that it returns the value right[current_node] or false. The following is the execution time obtained after this optimization.

        128545594

»
3 года назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

I tried to use int_fast32_t instead of int, but it seems even slower. https://codeforces.me/contest/1554/submission/128509324

»
3 года назад, # |
  Проголосовать: нравится +20 Проголосовать: не нравится

On the subject of vector<bool>. Instead of switching it out with vector<int>, try using something like vector<char>. It cuts down your time in C++17(32 bit) to 950 ms 128678604.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    I also tried to apply this change together with other optimizations, but it:

    • slightly improved performance for the 32-bit version 873 ms 128686428 vs. 841 ms 128686708

    • slightly regressed performance for the 64-bit version 1154 ms 128685795 vs. 1232 ms 128686737

    The larger storage space tradeoff makes cache thrashing problem even worse for the 64-bit version and that's probably what is degrading performance. The difference is not very significant though and it might be within the measurement error margin.

»
3 года назад, # |
  Проголосовать: нравится +18 Проголосовать: не нравится

Thanks to helpful comments from dalex and oversolver, I finally managed to construct a reasonably good testcase generator:

Ruby script

Yes, randomly shuffling nodes before connecting them in a line and then randomly shuffling edges in the input data indeed does the job. And picking the right $$$n$$$ value to maximize the number of divisors was important too. The choice of seed for randomization makes a very big difference, resulting in times between ~300ms and ~3500ms on my computer. Using one of the worst performing testcases, profiling results for the original LeoPro's solution from this blog now look like this:

Spoiler

So the time is now spent in the solve() function. Branches are predicted just fine, but only 0.24 instructions per cycle IPC is very bad. Such low performance is primarily caused by data cache misses. And 64-bit code is noticeably slower than the 32-bit code for both GCC 7.x and GCC 9.x, because 64-bit pointers (such as return addresses on stack and other data) have a larger footprint and suffer more from cache thrashing.

Some performance improvements are possible. Sorting nodes in the adjacency list partially reverses randomization and can make memory access pattern a bit more cache friendly (this helps with the current codeforces testcases, but is not always effective). Reworking DFS to manually enforce inlining also helps to reduce the amount of data pushed to stack. Here's my modification of LeoPro's solution: 128685795 (64-bit version, 1154 ms) and 128686428 (32-bit version, 873 ms).