This is my answer to https://codeforces.me/blog/entry/137484, in which -is-this-fft- asked a bizarre question. This answer is too big for a comment so I made a blog post instead.
The question
The submission 297353626 behaves normally with 180 MiB memory usage while 297352244 got a runtime error due to stack overflow (with 256 MiB stack). Both submissions have identical code, don't seem to contain bugs, and an assertion exists to make sure the recursion is at most $$$2\times10^5$$$ layers deep. What's happening?
TL;DR
Notice that the solution with stack overflow uses GCC 7 (32-bit) while the accepted ones use GCC 13 (64-bit). This is not a coincidence.
The main reason of unreasonable stack usage boils down to two points: GCC 7, and 32-bit.
32-bit
First of all, the code is compiled and run in 32-bit mode. Compiling the submission with GCC 7.5 in 32-bit on godbolt shows that the long long solve(long long, long long)
function has a gargantuan stack frame, seen from its prelude:
_Z5solvexx:
push ebp
push edi
push esi
push ebx
sub esp, 1948
This means that each call to solve
uses $$$8\times2+4+4\times4+1948=1984$$$ bytes of stack space (2 8-byte arguments + 4-byte return address + 4 4-byte registers + 1948 bytes for stack variables). In a $$$2\times10^5$$$-layer recursion, the stack usage would be about 378 MiB, which probably overflows the stack. However, solve
looks like a very simple function. Why does it need that much memory?
In 32-bit x86, there are less registers to use than in 64-bit mode, and each register only stores 32 bits. In the submission, every integer variable is a 64-bit long long, including loop indices, answer values, etc. Since 64-bit arithmetic cannot be done directly, all those variables have to be stored somewhere on the stack, and when they are used, only half of a variable (32 bits) is pulled out and processed each time. If 64-bit intermediate values are required in the middle of a computation, they must be stored as well.
If we count all the variables, including temporary values used in statements, it can be seen that they use a good amount of space (at least a couple hundred bytes) but nowhere close to 1948 bytes. What else is happening?
cdecl my beloved
Another disastrous part comes from function calling. Let's take the function long long solve(long long, long long)
as an example.
In 64-bit x86, the main calling conventions used in C are Microsoft x64 and System V. Since there are many registers, the first few (small) arguments and the return value can be put in registers for faster access. In both conventions, the 2 64-bit arguments are placed in 2 registers, and the 64-bit return value is stored in a register.
In 32-bit, the main calling convention in C is cdecl. The 16 bytes of function arguments are pushed onto the stack in reverse order, as well as an extra pointer. This extra pointer points to 8 bytes of stack-allocated space and is used to store the return value. This causes every solve
call itself to use 24 more bytes of stack space.
The problem gets worse when functions get a bit bigger, for example:
pair<long long, long long> op_min(pair<long long, long long> p, pair<long long, long long> q)
: in 64-bit, a pair is stored in 2 registers for both arguments and the return value, so no extra stack is used; in 32-bit, they are all on the stack, taking up 48 bytes.pair<long long, long long> Segtree::query(Segtree *this, ll ql, ll qr, ll u)
: left as exercise for the reader.
GCC
You might think compiler optimizations help by optimizing register usage, reusing stack space, etc., but in this case, GCC actually made the problem a lot worse.
Looking at the compiled code closely, you can see that the solve
function has too much calls to Segtree::query
. 28, to be exact. This is because of function inlining — to reduce the overhead of function calls, compilers sometimes replace a function call with the code of the function itself. This can be done recursively, for example if the query
calls in the query
function were to be replaced by itself twice, it results in a large function with 8 query
calls.
With inlining, the short solve
function gets expanded into a lot of code. Each function call and its surrounding code requires stack space to operate, and it seems that the older GCC 7 is less efficient at utilizing stack space for different parts. For example, it can be seen that the results of different query
or op_min
calls are all stored in variables in different places, but their values are not used together. While the optimization seems to work well for speed, it multiplies the stack usage problems from previous sections.
GCC 14 does the optimizing job a lot better. In 32-bit mode, a lot of code is inlined too, but stack variables are reused more efficiently, and it uses only 380 bytes instead of 1948. In 64-bit mode, even more stack space is saved, requiring only 88 bytes.
Conclusion
It's almost 2025, consider using something newer than GCC 7 (32-bit) on Codeforces. Also to the clown developers/end users who keep shipping/running 32-bit executables on Windows for "compatibility issues" or other reasons: please stop.