-14's blog

By -14, history, 11 months ago, In English

When I was doing the problem C. Pokémon Arena in the last Div. 1 round, I submitted my solution and got a TLE on pretest 20.

I was not suspecting anything but my code, I thought there may be some degenerate issue, undefined behavior or constant issue, but nothing really found. After some unsuccessful attempts, I was able to pass the pretest using fast IO. It's then I found something weird: it only takes a few hundred milliseconds, which is contradictory to my intuition: cin with sync_with_stdio(false) is fairly fast and it should not take up so much more (at least 10x) time.

After the contest, I submitted exactly the same code with different language. You know what?

It's not only me, and some other participants also encountered such issue. For example:

Here's snippet for the key code:

int n, m;
cin >> n >> m;
vector a(n, vector (m, 0));
vector c(n, 0);
for (int i = 0; i < n; i++) cin >> c[i];
for (int i = 0; i < n; i++) {
	for (int j = 0; j < m; j++) {
		cin >> a[i][j];
	}
}

But there are also some successful cin submissions using GNU C++20 (64). After investigation by (including but not limited to) Sugar_fan, Boboge, -skyline-, the key components of TLE are:

  • The language must be GNU C++20 (64).
  • vector must be of int. If the elements are long long, it passed. 249050206
  • The definition of 2D vector must be before reading c. If swap these two lines, it passed. 249050425

So here's the thing. It could not even be simply interpreted as some branch mispredictions or cache misses, it seems something is completely broken, and we still don't understand what is actually wrong.

  • Vote: I like it
  • +392
  • Vote: I do not like it

»
11 months ago, # |
  Vote: I like it +3 Vote: I do not like it

Auto comment: topic has been updated by -14 (previous revision, new revision, compare).

»
11 months ago, # |
  Vote: I like it -8 Vote: I do not like it

GNU C++17(20) also fails. 249014238 Maybe it's related with 64bit platform. int is 32bit. There may be some bugs in the complier dealing with this case, I think.

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

It seems that the issue is related to the Windows system or codeforces judger. I can't reveal the problem on my Linux(Ubuntu) computer.

»
11 months ago, # |
  Vote: I like it +2 Vote: I do not like it

Probably, it is caught by cache misses. You can swap indices by which you access data in the a array (vector a(m, vector (n, 0)); -> vector a(n, vector (m, 0));, a[x][i] -> a[i][x] and so on) and check if it gets passed.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +16 Vote: I do not like it

    It's not the issue. My reason is: replacing cin into scanf leads to AC. 249050013

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it
      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +11 Vote: I do not like it

        Yes, it works. But why I said it's not the issue:

        1. Keeping the same memory structure, replacing cin into scanf passed. So swapping indices is not necessary.
        2. There are many subtle changes to let it work: even swapping two lines of code, or changing int to long long (which is bad for cache) as mentioned in the main text. I am not surprised if there are many more changes to pass, but that's not the real issue which should be supported by ablation experiments.
        • »
          »
          »
          »
          »
          11 months ago, # ^ |
            Vote: I like it +3 Vote: I do not like it

          249055299 should give some ideas ;)

          • »
            »
            »
            »
            »
            »
            11 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            That's interesting. I had never thought of the vector capacity is not a power of two when constructing it with size. This may contribute to the slowdown in an indirect (but still mysterious) way. Incorporating the long long observation, the bug trigger must include a weird specific vector size (in this case, $$$40000 \times 10$$$).

            But to my understanding of cache, using power of two as the low dimension size leads to more cache conflicts. So currently my main suspection is: there may be some allocator issues in STL, so that cin and 2D vector slowed down each other.

»
11 months ago, # |
Rev. 2   Vote: I like it 0 Vote: I do not like it

Maybe there is difference between std::ios_base::sync_with_stdio(false); and std::ios::sync_with_stdio(false);? I used first one and had no issue on 64-bit compiler: https://codeforces.me/contest/1936/submission/248927650 You used second one and caught TLE

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    It seems not the issue. My code still got TLE. 249055450

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    249054815 without sync_with_stdio(false) at all.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +39 Vote: I do not like it

    All of the sync_with_stdio functions are calling the same exact function. sync_with_stdio is a static member function inherited from the ios_base class. I presonally prefer to call it using cin.sync_with_stdio, but you could call it in all kind of ways. It doesn't matter how you do it.

»
11 months ago, # |
  Vote: I like it +9 Vote: I do not like it

C++ is a weird programming language where bugs can be elusive, only appearing seemingly at random. Small seemingly unimportant changes to the code can make them appear or dissapear.

  • vector must be of int. If the elements are long long, it passed. 249050206
  • The definition of 2D vector must be before reading c. If swap these two lines, it passed. 249050425

Just from reading this, it sounds like you likely have UB (undefined behaviour), possibly due to overflow.

Claiming that fast IO is the culprit is far too hasty of a conclusion. I would need to be 100% sure that no UB is present before drawing any conclusion like that.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +21 Vote: I do not like it

    I am 99.9% sure it's not my code that UB.

    See 249078517. On test case 20, it should be equivalent to something like this, which I don't think an UB can exist.

    #include <bits/stdc++.h>
    using namespace std;
    
    void work () {
    	int n, m;
    	cin >> n >> m;
    	vector <vector <int>> a(n, vector <int> (m, 0));
    	vector <int> c(n, 0);
    	for (int i = 0; i < n; i++) cin >> c[i];
    	for (int i = 0; i < n; i++) {
    		for (int j = 0; j < m; j++) {
    			cin >> a[i][j];
    		}
    	}
    	if (n == 40000 && m == 10) {
    	    return;
    	}
    	// all following skipped
    }
    
    int main() {
    	ios::sync_with_stdio(false); cin.tie(0);
    	int T = 1;
    	cin >> T;
    	for (int ca = 1; ca <= T; ca ++)
    		work();
    }
    
    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +31 Vote: I do not like it

      I did some digging. I took multiple different C++ submissions from different users and modified their IO. Then by submitting to codeforces GYM I'm able to test the problem with a higher TL.

      First thing, I've only been able to trigger the slowness bug using cin, and not using scanf. I've also noticed that it doesn't not seem matter if you use cin.tie and/or sync_with_stdio. In all cases I've tested, the slowness bug triggered independently of this.

      One way to trigger the slowness bug is by reading input like this (It takes 4.1s on both TC 20 and 21):

      int n, m;
      cin >> n >> m;
      vector<int> c(n);
      vector<vector<int>> a(n, vector<int>(m));
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      The following also takes 4.1s (on both TC 20 and 21):

      int n, m;
      cin >> n >> m;
      vector<vector<int>> a(n, vector<int>(m));
      vector<int> c(n);
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      While reading input like this gives AC (typically < 0.2s):

      int n, m;
      cin >> n >> m;
      vector<int> c(n);
      for (int i = 0; i < n; ++i)
          cin >> c[i];
      vector<vector<int>> a(n, vector<int>(m));
      for (int i = 0; i < n; ++i)
          for (int j = 0; j < m; ++j)
              cin >> a[i][j];
      

      It is completely unreasonable that reading $$$4 \cdot 10^5$$$ integers would take 4 s. I have no clue what is actually going wrong here.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +19 Vote: I do not like it

      Here is an update on this. I tested doing

      cin >> n >> m;
      if (n == 40000 && m == 10)
        work2();
      else
        work(n,m);
      

      where

      void work2 () {
          const int n = 40000, m = 10;
          vector<vector<int>> a(n, vector<int>(m));
          for (int i = 0; i < n; i++) {
              for (int j = 0; j < m; j++) {
                  cin >> a[0][0];
              }
          }
      }
      

      and it still TLEs on TC20 249166970. This is absurd. It gets TLE simply by reading $$$4 \cdot 10^5$$$ values into a[0][0].

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Where do you read c[i] though

        • »
          »
          »
          »
          »
          11 months ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          In my code I simply read 4e5 ints. That's all my code is doing on TC20.

»
11 months ago, # |
  Vote: I like it +6 Vote: I do not like it

To add on, I was easily able to reproduce this myself by swapping two lines in my submission: 249017883.

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Hmm, maybe there is a bug in the compiler or libstdc++: 249085796.

»
11 months ago, # |
Rev. 10   Vote: I like it +21 Vote: I do not like it

hey

I've set up test20 as a separate problem, you can add it by using 50437e8d91037c0fdbe5c65a in the mashup, or you can download here

I hope this helps in any way.

After deleting all calculations, it uses 4000ms for inputs


Observed some strange content, here's test code

code

It seems like it's just uniformly slow for no reason at all, There isn't any suspicious lag.

result graph
  • »
    »
    11 months ago, # ^ |
      Vote: I like it +21 Vote: I do not like it

    Just FYI, it seems both TC20 and TC21 triggers the bug. It is not only TC20.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +19 Vote: I do not like it

    Reading input in 4000ms :clown

»
11 months ago, # |
Rev. 2   Vote: I like it +141 Vote: I do not like it

I've figured out how to trigger this slowness bug on basically any problem on CF!

  1. Take any problem with a decent amount of IO on cf (like 2e5 ints).
  2. Take a random AC C++ submission that uses cin.
  3. Add the line vector<vector<int>> TLE(40000, vector<int>(7)); somewhere in global space.
  4. ???
  5. TLE

For example take Tourist's solution 248948695 to problem 1936-D - Bitwise Paradox. With the vector of death added to the code 249181295, it gets TLE on TC5 (taking 5s). While without the vector of death, Tourist's submission takes 155 ms on TC5.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    When you compile the code locally asm and submit it directly (a cheat trick known to many), it will still result in a TLE, but locally there is no problem.

    My guess is that there is a problem with memory allocation in codeforces 64 bit.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it +47 Vote: I do not like it

    Here is a stand alone example with the slowdown (credit to kostia244 for showing me this)

    #include<bits/stdc++.h>
    using namespace std;
    vector<vector<int>> TLE(40000, vector<int>(7));
    int main() {
        int n = 17;
        string s;
        for(int i = 0; i < n; i++) s += to_string(i) + " ";
        
        for (int j = 0; j < 30000; ++j) {
            istringstream ss(s);
        
            int x = 0;
            while (ss >> x);
        }
    }
    

    The vector of death makes this run about 100 times slower.

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +71 Vote: I do not like it

      Maybe, it's time to invite MikeMirzayanov to have a look at the problem. It's almost not reproducible on any other environment, so only administrators can further investigate the issue.

    • »
      »
      »
      11 months ago, # ^ |
      Rev. 3   Vote: I like it 0 Vote: I do not like it

      original comment: replacing " " with to_string(' ') take 290ms as pointed out below this is because ' ' is 32 lol

      Reproduced it for 64 bit integers using size 4 for inner vector

      #include<bits/stdc++.h>
      using namespace std;
      vector<vector<long long>> TLE(40000, vector<long long>(4));
      int main() {
          int n = 17;
          string s;
          for(int i = 0; i < n; i++) s += to_string(i) + ' ';
          
          for (int j = 0; j < 30000; ++j) {
              istringstream ss(s);
          
              int x = 0;
              while (ss >> x);
          }
      }
      
      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +8 Vote: I do not like it

        AFAIU, it was intended to generate a string containing n integers to pass it to istringstream. You broke this because to_string(' ') is just to_string(32).

»
11 months ago, # |
  Vote: I like it +17 Vote: I do not like it

I'm having strong flashbacks by comparing this blog and https://codeforces.me/blog/entry/108168 ...

It feels kinda frightening when you realize that your correct solution can get TLE due to some absolutely unexpected stuff D:

»
11 months ago, # |
  Vote: I like it -8 Vote: I do not like it

what about basic_string or valarray?

»
11 months ago, # |
  Vote: I like it 0 Vote: I do not like it

How do you install GNU compiler on Windows? I have the MinGW compiler from CodeBlocks (added to PATH) and I can't even compile with -c++17 using it.

  • »
    »
    11 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    I installed mingw64 compiler from this (don't follow any link, download from this page).

    g++ -v

    Now my g++ version is 13.2.0 and it supports c++20. I took this test and compiled with g++ main.cpp -std=c++20 -o main.exe. It executes in about 600 ms on both c++20 and c++17. I don't see strange behaviour.

    Did someone experienced this locally?

    • »
      »
      »
      11 months ago, # ^ |
        Vote: I like it +3 Vote: I do not like it

      I installed specifically version 11.2.0, used compilation flags from here:

      g++ -Wall -Wextra -Wconversion -static -DONLINE_JUDGE -Wl,--stack=268435456 -O2 -std=c++20 %1

      and used this program from one of the comments above.

      Code

      Still, no strange behaviour. It executes in about 850 ms.

      • »
        »
        »
        »
        11 months ago, # ^ |
          Vote: I like it +22 Vote: I do not like it

        Wait, I just noticed strange thing on Codeforces.

        I just ran the exact this code in custom test. It shows time 4851 ms always and actually, the time between the moment I press button Run and log appears is about 1.5 seconds.

        Seems the problem is in judge system, not the compiled program.

        • »
          »
          »
          »
          »
          11 months ago, # ^ |
            Vote: I like it +28 Vote: I do not like it

          I think that is simply due to codeforces caching the result of the invocation. You need to change the code/input slightly to make codeforces actually run your code.

          • »
            »
            »
            »
            »
            »
            11 months ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Indeed.

            However, the question remains: could someone experience this locally, or it is the issue of only mysterious codeforces compiler.

»
11 months ago, # |
  Vote: I like it +18 Vote: I do not like it

A while back i faced a similar issue: https://codeforces.me/blog/entry/99038

Maybe they are related

»
11 months ago, # |
  Vote: I like it +13 Vote: I do not like it

I have new conjecture. Look at this submission: https://codeforces.me/contest/1936/submission/249331767 It is same as the one that gets TLE, but with fast allocator. So I suppose that each invocation std::cin allocates and free memory which is slow due to "vector of death" allocating many small chunks of memory. And scanf seems to not allocate or free memory.