TL;DR
People are currently discussing a slowdown bug on Codeforces that seems to trigger randomly and can make code run 100x slower, causing TLE. More details in pajenegod's blog post.
In this article, I present a mitigation: add the following to your code.
#include <new>  // std::bad_alloc
#include <windows.h>

void *operator new(size_t size) {
    // Go straight to the OS heap, bypassing the C runtime's malloc.
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}

void operator delete(void *ptr) noexcept {
    HeapFree(GetProcessHeap(), 0, ptr);
}
TL;DR, part 2
This section is for Codeforces admins including MikeMirzayanov.
I have very high confidence that some poorly implemented functions in Microsoft's C runtime are the root cause of the bug, which is outside contestants' control and creates unfair disadvantages in contests. The bug has likely existed since C++17 (64-bit) first became available on Codeforces.
Every C++ submission on Codeforces (and also the entire Codeforces infrastructure) currently depends on that library, on top of Windows internals that very few people have any clue about. Both are infamous for having bad implementations in many places, and I don't think it's a good idea for one of the most-used online judges to rely on them.
I ask Codeforces developers to seriously consider other, more reliable options, for example judging on Linux machines, possibly using one of the many judge-server implementations already available. Increased stability and robustness would hopefully make the contest and judging experience on Codeforces a lot better.
What's happening?
There has been a bug on Codeforces causing random TLEs recently. Sometimes when a seemingly unrelated line of code is added, the order of two lines is swapped, a vector size is changed, etc., the code suddenly becomes 100x slower on certain inputs. The bug got attention because of its unexplained behavior, and because many submissions (including tourist's) that should have passed a problem got TLE or were hacked because of it. An example showing the bug (by kostia244):
#include <bits/stdc++.h>
using namespace std;

int main() {
    vector<vector<int>> TLE(40000, vector<int>(7));
    string s;
    for (int i = 0; i < 17; i++) s += to_string(i) + " ";
    for (int j = 0; j < 10000; ++j) {
        istringstream ss(s);
        int x;
        while (ss >> x);
    }
}
A loop of about 10000 * 17 iterations takes 1.5s to complete. The weird and funny thing about the bug is that if the 40000 or the 7 is changed slightly, or if the vector is moved to the middle of the program, nothing happens and the code runs normally. This peculiar behavior indicates that something is buggy and the TLE is not the code authors' fault. It only happens on Codeforces, and only with 64-bit C++. No one has been able to reproduce it anywhere else or explain why.
After pajenegod also posted a blog on the situation, people in the comments (drdilyor, mtw, et al.) quickly discovered that only the (de-)allocations matter. Here's my attempt at reducing the code to only new and delete:
#include <bits/stdc++.h>
using namespace std;

char *alloc(size_t s) {
    char *c = new char[s];
    asm volatile("" ::"r"(c));  // prevent the allocation from being optimized out
    return c;
}

int main() {
    char *c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) {
        alloc(0x1c);  // intentionally leaked
    }
    delete[] c;
    for (int i = 0; i < 100000; i++) {
        delete[] alloc(0x1c);
    }
}
This consistently takes around 900ms in both C++17 (64) and C++20 (64).
An overview of what I investigated in the past few hours
The vector<int>(7) in kostia244's example immediately caught my attention, because it makes an allocation of 28 bytes. 28 is special because it's $$$32 - 4$$$, where $$$32$$$ is a common chunk size of various memory allocation algorithms, and $$$4$$$ is the size of a pointer used to keep track of chunks (if you didn't know, on Codeforces malloc uses 32-bit libraries and returns 32-bit pointers, even with 64-bit C++; you are scammed). This, combined with the fact that changing the 7 makes the bug disappear, hints that something is wrong with the allocations.
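To see where the 28 comes from, here's a small portable sketch (the logging override and the vector_alloc_size helper are mine, purely for illustration): replace the global operator new with a wrapper that records each requested size, and check what vector<int>(7) asks the allocator for.

```cpp
#include <cstdlib>
#include <new>
#include <vector>

// Record the size of the most recent allocation request, then forward
// to malloc, as GCC's default operator new would.
static std::size_t last_alloc_size = 0;

void *operator new(std::size_t size) {
    last_alloc_size = size;
    if (void *ptr = std::malloc(size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}

void operator delete(void *ptr) noexcept { std::free(ptr); }

// How many bytes does vector<int>(n) request from the allocator?
std::size_t vector_alloc_size(std::size_t n) {
    std::vector<int> v(n);
    return last_alloc_size;  // n * sizeof(int), e.g. 28 for n == 7
}
```

vector<int>(7) makes a single request of 7 * sizeof(int) = 28 bytes: exactly the $$$32 - 4$$$ pattern discussed above.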
By default, new and delete with GCC simply call the malloc and free functions from the C standard library. I tried adding some time measurements to see if malloc is really at fault. And sure enough, it is.
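The measurement itself can be as simple as a chrono stopwatch around an allocation burst (a generic sketch, not the author's actual instrumentation; time_alloc_burst is an illustrative name):

```cpp
#include <chrono>
#include <cstdlib>

// Time a burst of same-size malloc/free pairs, in milliseconds.
// On an affected Codeforces setup, a burst like this after the
// 40000 leaked allocations is where the time disappears.
double time_alloc_burst(int iterations, std::size_t size) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++) {
        void *p = std::malloc(size);
        std::free(p);
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```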
Knowing that the bug is not in the C++ containers themselves, but in the allocation functions called when creating them, debugging became a little easier.
With my superficial knowledge of heaps, I could quickly guess what was going on: the 40000 mallocs eat the top chunk down to a specific size, then the first free sends a chunk into the tcache (or the 0x20 fastbin, if tcache isn't a thing). Then the following sequence of malloc/free calls keeps taking and returning that same chunk, and this, combined with the 40000 mallocs, somehow breaks the allocation algorithm and makes it take a lot more time. That's what happened, right? Oh, wait. uh. no...
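For what it's worth, the chunk-reuse behavior speculated here is easy to observe on glibc (a quick demo; reuses_chunk is my helper name, and the result is allocator-specific):

```cpp
#include <cstdlib>

// Free a small chunk and immediately allocate the same size again:
// glibc's tcache (and most allocators' fast paths) hand back the
// very same chunk that was just freed.
bool reuses_chunk(std::size_t size) {
    void *a = std::malloc(size);
    std::free(a);
    void *b = std::malloc(size);
    bool same = (a == b);
    std::free(b);
    return same;
}
```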
...wait. WAIT. I forgor 💀
The above is what might happen in GNU's C library. Codeforces runs on Windows, using Microsoft's C library. What does that do?
Codeforces's C++20 (64) uses WinLibs, which uses either MSVCRT (Microsoft Visual C++ Runtime) or UCRT (Universal C Runtime) as the C standard library. We don't know Codeforces's exact setup, but we can first investigate what each library does. Fortunately, the source code of both is available.
Let's look at UCRT first. In non-debug mode, malloc simply calls _malloc_base, which is just a wrapper around the Windows API HeapAlloc. Similarly, free calls HeapFree. These memory management functions are provided by the operating system; could they be the source of bugs? We can test this by replacing the malloc and free calls with the API calls directly.
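A sketch of such a test (Windows-only, reconstructed from the description; it mirrors the reduced new/delete reproduction above, but goes through HeapAlloc/HeapFree directly):

```cpp
#include <windows.h>

// Same allocation pattern as the reduced reproduction, but calling
// the OS heap directly, skipping the C runtime's malloc/free.
void *alloc(size_t s) {
    void *c = HeapAlloc(GetProcessHeap(), 0, s);
    asm volatile("" ::"r"(c));  // keep the call from being optimized out
    return c;
}

int main() {
    void *c = alloc(0x1c);
    for (int i = 0; i < 40000; i++) alloc(0x1c);  // intentionally leaked
    HeapFree(GetProcessHeap(), 0, c);
    for (int i = 0; i < 100000; i++) HeapFree(GetProcessHeap(), 0, alloc(0x1c));
}
```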
This runs normally, which means UCRT is unlikely to be the reason for the bug.
Then we look at MSVCRT's source code. It also uses the Windows API functions, except when it doesn't: depending on the system version, there are two more implementations of the memory allocation algorithms, V6_HeapAlloc and V5_HeapAlloc, which seem to allocate memory by walking a linked list of free chunks. I did not have enough time or energy to read the implementation, but it is possible that these functions perform badly in certain cases.
Then I found this interesting article, which shows that MSVCRT's malloc might really be a terrible implementation, and that Codeforces users are not its first victims.
If we knew for sure which C library C++20 (winlibs) and C++17 (msys2) use, we could basically pin the bug down to one bad function in one library. However, it's difficult to get the exact setup on Codeforces even with custom test. If only there were a way to download compiled binaries from Codeforces... wait, there is. On Polygon, full problem packages include compiled versions of checkers, validators, etc. I can simply add a generator to a random problem and download the package to obtain an executable file (although only C++17). Fortunately, it has some debug info and is statically linked.
After briefly inspecting the 3MB, impossible-to-reverse-engineer PE32+ executable, I can confirm that it indeed uses MSVCRT's malloc. Now we can basically be sure that the cause of the weirdest slowdown bugs in Codeforces history is the poor implementation of the C runtime.
A possible fix?
Now we know the exact issue behind the mysterious slowdowns: Microsoft's astronomically disastrous dumpster fire of a C library. How to fix this? Easy: replace it with a memory allocator that is known to work well, for example jemalloc, mimalloc (ironically made by Microsoft themselves to replace the library functions they themselves know are terrible), or ptmalloc. One could also replace the entire C library with, say, UCRT or GNU libc.
But as regular Codeforces users we can't control which libraries the judge uses; only Codeforces can change that. What we can do is rewrite our code so that it doesn't go through the C library to allocate at all, by calling raw Windows API functions as shown above. The following is copy-pasted from the TL;DR section:
#include <new>  // std::bad_alloc
#include <windows.h>

void *operator new(size_t size) {
    // Go straight to the OS heap, bypassing the C runtime's malloc.
    if (void *ptr = HeapAlloc(GetProcessHeap(), 0, size ? size : 1)) return ptr;
    throw std::bad_alloc{};
}

void operator delete(void *ptr) noexcept {
    HeapFree(GetProcessHeap(), 0, ptr);
}
Does this fix the slowdown bug? No one will likely ever know, because unlike Microsoft's C libraries or the STL, the lack of source code (and the sheer complexity) makes the Windows API impossible to fully understand. But at least we now know the bug is probably caused by library functions, and we can pray that skipping the library avoids it.
In a way you could say that the bug is not Codeforces's fault... but it's not really not their fault either, as there is a very simple solution (that would also solve countless other issues reported by users before): run GNU/Linux, like literally every other online judge of considerable scale.
Update: About hacks
After the bug was discovered, people have been hacking submissions using special inputs that cause a particular sequence of allocations to happen. Since this is a problem of the library and not of contestants' code, I don't think doing that is fair.
A similar case is hacking bad std::unordered_map usage, compared to, e.g., hacking bad rolling hashes. The latter is a problem with contestants' implementations, and I enjoy hacking them. But the former is just a bad implementation in the C++ STL and isn't really the contestants' fault, so I don't usually hack them. That problem is more well known, though, and has many simple fixes available, so it's probably OK to hack it.
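For reference, the standard fix in the unordered_map case is a randomized custom hash (the well-known splitmix64 trick; custom_hash and safe_map are illustrative names):

```cpp
#include <chrono>
#include <cstdint>
#include <unordered_map>

// splitmix64 mixed with a per-run random constant: hack inputs
// prepared against the default (identity-like) hash no longer collide.
struct custom_hash {
    static uint64_t splitmix64(uint64_t x) {
        x += 0x9e3779b97f4a7c15;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9;
        x = (x ^ (x >> 27)) * 0x94d049bb133111eb;
        return x ^ (x >> 31);
    }
    size_t operator()(uint64_t x) const {
        static const uint64_t FIXED_RANDOM =
            std::chrono::steady_clock::now().time_since_epoch().count();
        return splitmix64(x + FIXED_RANDOM);
    }
};

using safe_map = std::unordered_map<uint64_t, int, custom_hash>;
```

Since the random constant changes every run, an input that forces collisions in one run is harmless in the next.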
Until this slowdown bug and its fixes are well known, I think hacking using the bug is equivalent to exploiting n-day vulnerabilities on Codeforces to manipulate verdicts, and I'm not very supportive of that. Let's hope a patch is deployed soon.
Afterword
I'm not touching Windows or PE for the next 6 months. Leave a comment if you play CTFs and understand my feeling.
Subscribe for more posts like this, where I solve CP bugaboos with the most absurd techniques.