What makes pypy64 slow here?

#	User	Rating
1	tourist	4009
2	jiangly	3823
3	Benq	3738
4	Radewoosh	3633
5	jqdai0815	3620
6	orzdevinwang	3529
7	ecnerwala	3446
8	Um_nik	3396
9	ksun48	3390
10	gamegame	3386

#	User	Contrib.
1	cry	167
2	Um_nik	163
3	maomao90	162
3	atcoder_official	162
5	adamant	159
6	-is-this-fft-	158
7	awoo	157
8	TheScrasse	154
9	Dominater069	153
9	nor	153

In the recent Goodbye 2021 Round on Problem E my code seems to run a lot slower when running on pypy64 as opposed to regular pypy.

The following submissions are identical, with the exception being the language used.

141113239 (pypy3.7 — AC 732ms)

141185703 (pypy3.8 64bit — TLE)

Running on Custom Invocation on one testcase of size $$$10^5$$$ shows that pypy3.7 will take about 250ms whereas pypy3.8-64 will take 5600ms.

Profiling my code it seems like the culprit is my find function

#Finds smallest index of segtree such that seg[ind] <= x
#Only works if func = min
def find(self, x):
    curr = 1
    
    if self.data[curr] > x:
        return -1
    while curr < self._size:
        assert self.data[curr] <= x
        if self.data[2 * curr] <= x:
            curr = 2 * curr
        else:
            curr = 2 * curr + 1
        assert self.data[curr] <= x
    return curr - self._size

Replacing this function with a trivial (but incorrect) function makes the submission run in ~230 ms in both languages. However it seems like this function should run in $$$O(log n)$$$ and not be 20x slower than the rest of my code (and if itis slow, why is it not an issue in 32bit pypy?). If anyone can figure out the issue here (and what I need to avoid to be able to run code in pypy64 quickly) it would be greatly appreciated.

Tracing: 81 0.070865 Backend: 81 0.029770 TOTAL: 0.422342 ops: 116056 heapcached ops: 129132 recorded ops: 35992 calls: 3995 guards: 7006 opt ops: 20512 opt guards: 3911 opt guards shared: 2224 forcings: 0 abort: trace too long: 0 abort: compiling: 0 abort: vable escape: 0 abort: bad loop: 0 abort: force quasi-immut: 0 abort: segmenting trace: 0 nvirtuals: 2512 nvholes: 825 nvreused: 625 vecopt tried: 0 vecopt success: 0 Total # of loops: 30 Total # of bridges: 52 Freed # of loops: 0 Freed # of bridges: 0

Tracing: 672 1.453424 Backend: 671 0.186138 TOTAL: 3.310882 ops: 3332025 heapcached ops: 1164403 recorded ops: 318645 calls: 111855 guards: 68843 opt ops: 338865 opt guards: 43555 opt guards shared: 39539 forcings: 0 abort: trace too long: 1 abort: compiling: 0 abort: vable escape: 0 abort: bad loop: 0 abort: force quasi-immut: 0 abort: segmenting trace: 0 nvirtuals: 2999 nvholes: 768 nvreused: 611 vecopt tried: 0 vecopt success: 0 Total # of loops: 29 Total # of bridges: 643 Freed # of loops: 10 Freed # of bridges: 19

Comments (12)

Write comment?

pajenegod

3 years ago, # |

+48

I've also noticed that 64 bit PyPy on CF sometimes runs super slow. Like for example here

PyPy3 64 bit 1949 ms 140113619

PyPy3 32 bit 374 ms 141188674

I've also had many people messaging me about weird TLEs with 64 bit PyPy:

https://codeforces.me/blog/entry/98385?#comment-872293

https://codeforces.me/blog/entry/98385?#comment-872290

Something definitely seems buggy with 64 bit PyPy on CF. But I haven't yet had the time to figure out what is causing it or if I can even replicate the TLEs locally.

→ Reply

3 years ago, # ^ |

← Rev. 2 →

+97

An update, I found a really really good killer

for i in range(10**7):
    if sum([]):
        break

Using custom invocation on Codeforces, it runs in 4.5 s in 64 bit PyPy3 and 61 ms in 32 bit PyPy3.

+55

I've opened up an issue about this on PyPy's website https://foss.heptapod.net/pypy/pypy/-/issues/3629 .

It seems that when PyPy made the jump from Python3.7 to Python3.8 they introduced some kind of bug. pypy3.8-v7.3.7 has the bug, but pypy3.7-v7.3.7 doesn't. A quick fix for CF would be to switch from pypy3.8-v7.3.7-win64 to pypy3.7-v7.3.7-win64. The problem is not related to 32 bit vs 64 bit, nor is it a windows vs linux thing.

One fun thing to note is that adding empty for loops everywhere seems to fix it. For example

for i in range(10**7):
    for _ in range(1): pass
    if sum([]):
        break

runs fast no matter PyPy version.

MikeMirzayanov

+52

Thank you. I temporarily reverted it to 7.3.5 here.

+23

Oh nice!

I just tried resubmitting many of the mysterious TLEs, and they all got AC now. =)

whatshisbucket

This is why I don't use 64-bit on contests :P

at_f

+87

I tested it a bit in custom invocation with random input. If you comment out the asserts in the while loop, it runs in 650ms, otherwise 6.5 seconds. Funny thing is that you can replace it with an assert True or if 0 > 1: break and the same bug will still occur (but lowered to ~5seconds).

So my theory is that loop got unrolled but the additional branches blew up the number of possible code paths. I think told pajenegod about an similar issue before for pyrival's RMQ in 117322863. The solution there was to add a no-op for _ in range(1): pass to prevent inlining (and indeed the same trick seems to work for unrolling too). Explanation from the dev about why this happens can be found in this issue. Some background on how jit guards/bridges work can be found at: https://youtu.be/NQfpHQII2cU?t=1389

Anyway this is occurring often enough that you should report it to https://foss.heptapod.net/groups/pypy/-/issues. The devs are fairly responsive and have fixed other jit bugs reported by codeforce users before. You should probably try to reduce it to a minimal reproducible example first though.

Though I can't read them, you can also output jit logs with something like PYPYLOG=jit-log-opt,jit-summary,jit-backend:logfile pypy3.8 bug.py.

Summary without that assert line:

Summary with the assert:

2 years ago, # |

+28

I updated PyPy3-64 to 7.3.9 (3.9.10). I will be glad if you could test this version.

beethoven97

2 years ago, # ^ |

+12

I retested their submissions and it's now very good:
new version 779 ms vs old version TLE.
new version 358 ms vs old version 1949 ms.

The below code also run in custom invocation in 779 ms in pypy3 64bit vs 1185 ms in pypy3 32bit

for i in range(10**9):
    if sum([]):
        break

blackhatinside

+14

Yes, the 3.9 version is better. Thank you MikeMirzayanov!

I have tested it now against one of the recent TLEs I faced in a contest.

Mysterious TLE — 2000ms

Sweetly Accepted — 1996ms

What I was able to learn from this is the importance of even .004 seconds in CP for a Python coder, to move from a TLE code to an AC code. I had say that's a narrow escape.

+33

Also I updated Go and PHP. Welcome!

shiviDON

new pypy3 64 sub pypy3 sub do not know the issue of this random TLE

conqueror_of_tourist's blog