In the recent Goodbye 2021 Round on Problem E my code seems to run a lot slower when running on pypy64 as opposed to regular pypy.
The following submissions are identical, with the exception being the language used.
141113239 (pypy3.7 — AC 732ms)
141185703 (pypy3.8 64bit — TLE)
Running on Custom Invocation on one testcase of size $$$10^5$$$ shows that pypy3.7 will take about 250ms whereas pypy3.8-64 will take 5600ms.
Profiling my code it seems like the culprit is my find function
#Finds smallest index of segtree such that seg[ind] <= x
#Only works if func = min
def find(self, x):
curr = 1
if self.data[curr] > x:
return -1
while curr < self._size:
assert self.data[curr] <= x
if self.data[2 * curr] <= x:
curr = 2 * curr
else:
curr = 2 * curr + 1
assert self.data[curr] <= x
return curr - self._size
Replacing this function with a trivial (but incorrect) function makes the submission run in ~230 ms in both languages. However it seems like this function should run in $$$O(log n)$$$ and not be 20x slower than the rest of my code (and if itis slow, why is it not an issue in 32bit pypy?). If anyone can figure out the issue here (and what I need to avoid to be able to run code in pypy64 quickly) it would be greatly appreciated.
I've also noticed that 64 bit PyPy on CF sometimes runs super slow. Like for example here
PyPy3 64 bit 1949 ms 140113619
PyPy3 32 bit 374 ms 141188674
I've also had many people messaging me about weird TLEs with 64 bit PyPy:
https://codeforces.me/blog/entry/98385?#comment-872293
https://codeforces.me/blog/entry/98385?#comment-872290
Something definitely seems buggy with 64 bit PyPy on CF. But I haven't yet had the time to figure out what is causing it or if I can even replicate the TLEs locally.
An update, I found a really really good killer
Using custom invocation on Codeforces, it runs in 4.5 s in 64 bit PyPy3 and 61 ms in 32 bit PyPy3.
I've opened up an issue about this on PyPy's website https://foss.heptapod.net/pypy/pypy/-/issues/3629 .
It seems that when PyPy made the jump from Python3.7 to Python3.8 they introduced some kind of bug.
pypy3.8-v7.3.7
has the bug, butpypy3.7-v7.3.7
doesn't. A quick fix for CF would be to switch frompypy3.8-v7.3.7-win64
topypy3.7-v7.3.7-win64
. The problem is not related to 32 bit vs 64 bit, nor is it a windows vs linux thing.One fun thing to note is that adding empty for loops everywhere seems to fix it. For example
runs fast no matter PyPy version.
Thank you. I temporarily reverted it to 7.3.5 here.
Oh nice!
I just tried resubmitting many of the mysterious TLEs, and they all got AC now. =)
This is why I don't use 64-bit on contests :P
I tested it a bit in custom invocation with random input. If you comment out the asserts in the while loop, it runs in 650ms, otherwise 6.5 seconds. Funny thing is that you can replace it with an
assert True
orif 0 > 1: break
and the same bug will still occur (but lowered to ~5seconds).So my theory is that loop got unrolled but the additional branches blew up the number of possible code paths. I think told pajenegod about an similar issue before for pyrival's RMQ in 117322863. The solution there was to add a no-op
for _ in range(1): pass
to prevent inlining (and indeed the same trick seems to work for unrolling too). Explanation from the dev about why this happens can be found in this issue. Some background on how jit guards/bridges work can be found at: https://youtu.be/NQfpHQII2cU?t=1389Anyway this is occurring often enough that you should report it to https://foss.heptapod.net/groups/pypy/-/issues. The devs are fairly responsive and have fixed other jit bugs reported by codeforce users before. You should probably try to reduce it to a minimal reproducible example first though.
Though I can't read them, you can also output jit logs with something like
PYPYLOG=jit-log-opt,jit-summary,jit-backend:logfile pypy3.8 bug.py
.Summary without that assert line:
Summary with the assert:
I updated PyPy3-64 to 7.3.9 (3.9.10). I will be glad if you could test this version.
I retested their submissions and it's now very good:
new version 779 ms vs old version TLE.
new version 358 ms vs old version 1949 ms.
The below code also run in custom invocation in 779 ms in pypy3 64bit vs 1185 ms in pypy3 32bit
Yes, the 3.9 version is better. Thank you MikeMirzayanov!
I have tested it now against one of the recent TLEs I faced in a contest.
Mysterious TLE — 2000ms
Sweetly Accepted — 1996ms
What I was able to learn from this is the importance of even .004 seconds in CP for a Python coder, to move from a TLE code to an AC code. I had say that's a narrow escape.
Also I updated Go and PHP. Welcome!
new pypy3 64 sub pypy3 sub do not know the issue of this random TLE