Comprehensive guide to avoid TLE in python

№	Пользователь	Рейтинг
1	tourist	3993
2	jiangly	3743
3	orzdevinwang	3707
4	Radewoosh	3627
5	jqdai0815	3620
6	Benq	3564
7	Kevin114514	3443
8	ksun48	3434
9	Rewinding	3397
10	Um_nik	3396

№	Пользователь	Вклад
1	cry	167
2	Um_nik	163
3	maomao90	162
3	atcoder_official	162
5	adamant	159
6	-is-this-fft-	158
7	awoo	155
8	TheScrasse	154
9	Dominater069	153
10	djm03178	152

Introduction

Hello everyone, I have recieved some messages asking how to improve python code to avoid TLE. There have been blogs before but they are rather old, so I will share some tips, hope it helps! These are the tips that I have learnt over the years and some of them may not be well known to non-python mains.

Do let me know in the comments if you have more tips as well.

Essentials (Most questions can be solved using these tips)

Basically, try to make the code as "Pythonic" as possible, the more python gets to call its underlying C functions the faster and better it is.

Use PyPy 3 64 bit, it is faster than python 3.
For fast I/O, I use input = sys.stdin.readline, then for output, if the output has multiple lines, I just create a list to store the outputs, then print("\n".join(lst)) on the list. Because multiple print statements are slow in python.
Put all your code into functions like a main function, as local variables lookup in python is faster than global variables, I think your code will run around ~30% faster just by doing this.
List creation: [0]*n is fastest, list comprehension is also fast. For 2d lists: I do: [[0]*m for _ in range(n)], as [[0]*m]*n will not work due to reference issues. Also, .extend() is faster than .append() if you are adding multiple elements to the list at a time.
Use for i in arr instead of for i in range(n): arr[i] as it is faster. If you need the index as well as the data, use enumerate() as it is faster. Also for x, y in arr: to iterate over multiple elements.
Use built-in methods like map() and sum() as they are faster than implementing the stuff yourself. The more Pythonic the better.
For deep recursion, use an explicit stack instead of a recursive function. Recursion in python is slow and memory intensive.
Try to keep your code as simple as possible especially in the for loops, to reduce the amount of computation needed and also so that PyPy can optimise your code easily.
String concatenation in python is slow in a for loop as new intermediate strings are always being created. To solve this, use a list and .append() (or extend()), at the end call "".join(lst)
Try to not change the data type of variables and collections. Collections should store a single data type. For example PyPy has special dict strategies to speed up dicts with keys of all same types. I believe for lists and other collections there will be some optimisations as well, as python can predict the type better and reduce overhead from mutating types.

🦀-Python

As mentioned in this comment: https://codeforces.me/blog/entry/106541?#comment-949024

Non-Pythonic ways to improve python performance. You won't need to use some of these unless the time limits are tight.

Flattening 2d lists into 1d: This TLE: https://cses.fi/paste/f6db6ab935e23d2072b133/ This passes: https://cses.fi/paste/c7fa7fcb039aa28f72b17f/
Packing tuple into ints to sort, then unpacking as sorting tuples are slow: This TLE: https://cses.fi/paste/972de727f25b8f1171102a/ This passes: https://cses.fi/paste/25ce406e0f776551711046/
Iterating over multiple elements using zip: This TLE: https://cses.fi/paste/1c4afcf48c0400b071e0f8/ This passes: https://cses.fi/paste/85c4f40b683a6eec71e10d/
In some cases deque() can be faster than list, at least for some questions I've done in CSES graphs.

Other useful tips:

Learn your standard library properly (take a look at the documentations), Python has some very useful functions and classes, for example I sometimes use Decimal class to deal with floats of higher precision (but take note the precision cannot be too high else TLE). Counter and bisect are nice to use as well.
Pyrival is good, has lots of algorithms and data structures, lots of useful stuff there like SortedList() and prime factors.
To avoid integer hash hacking TLE for set() and dict(), you can random.shuffle() the array if order is not important, otherwise you can hash the str(int) or use random int xor to wrap the ints.

def splode_sort(x): # plain array in d = defaultdict(list) # beware hash hax for ri, e in enumerate(reversed(x)): d[e].append(n-ri-1) return [d[e].pop() for e in sorted(x)] # could render element-index pairs too

cpython pypy-64 sorted(intlist) 189ms 66ms tuplesort 778ms 1282ms keyed tuplesort 420ms 780ms getter_sort 256ms* 712ms splode_sort 1765ms 477ms*

Комментарии (18)

Показать архивные | Написать комментарий?

drugkeeper

13 месяцев назад, # |

Auto comment: topic has been updated by drugkeeper (previous revision, new revision, compare).

→ Ответить

exhausted

bro your nickname is awesome

Pash_Mak

13 месяцев назад, # ^ |

+20

bro your organization is awesome

fbrunodr

As sad as it is, I would still suggest using C++. If your goal is doing great on ICPC finals, then it's okay to have Python as the main language, as they set different time limits to each language. But if your goal is getting a certain rank in codeforces, you are better using C++. Some problem setters will make the time limit really small and if you use python and get an unexpected TLE, you kind don't know if the problem is your implementation or the language itself.

Anyway, great post, Python is a higher level language and it is really easy to write unoptimized code as it is not always clear what is actually going on. This list of char tip instead of string is really usefull btw.

Forgot to mention 1 useful tip: if I suspect that python may be slow, I will usually check the problem status to see if there are submissions in python, if there is, it means its possible to solve.

Sometimes I get lazy to optimise the solution and will just use chatGPT to convert my python code to C++, with the help of some modifications.

It is possible to reach a very high level (LGM) using python: conqueror_of_tourist

For ICPC, I don't think ill make it, but I do agree that if the solution seems to slow to use python, can just use C++

Also, I am not advocating that Python is superior or that people should switch to Python, use whatever you prefer (I prefer Python). Just wanted to share some ways to make python not TLE, where some of the optimisations are rather simple to fix.

Some of the optimizations to make python pass are not that trivial to implement, so use whatever is best for you.

Geothermal

The ICPC World Finals (as well as all North American regionals and the NAC, not sure about the rest of the world) does not set different time limits for each language.

← Rev. 2 →

Thanks for your input. The latam regionals set different time limits for each language and that confused me. This is even another reason to favor C++.

rope_and_snake

I don't think points 3,5,6 are that important. It may vary depending on the problem. Also regarding point 2, sometimes sys.stdin.readline is not fast enough. sys.stdin.buffer.readline or io.BytesIO(os.read(0, os.fstat(0).st_size)).readline is required. Note that you need to use input().decode() for string inputs in both. For interactive problems the latter is preferred along with os.write, which is much faster than print followed by sys.stdout.flush

To avoid hashing issues in dict/defaultdict, sort the array before adding elements and store the keys of every element in the dict as a string.

bjy

It'd help if you could provide example submissions or some underlying explanation. Your last point specifically indicates you're kinda kludging together rules of thumb and/or habits... ie to avoid integer-key hash hacks, you don't need to do both. Sorting isn't always a good option for a problem (I know I happened to pick a problem for which it was in an old post about this, so, sorry if that's how the rumor started). Xor-ing the integer key with a random salt will be faster than string conversions, but adding the module random will add startup/load/initialization time if you're trying to measure differences.

You're probably right about micro-adjustments mattering less if pypy can go brrr.

The reason I default to 'just use stdin.readline' as advice is that it's pretty rare that the difference between that and more heroic measures is what makes a solve pass. There's some benefit to read-all-available methods if the problem isn't interactive and if the input is presented by a troll e.g. hundreds of thousands of pairs, one line each. It's been my experience that you often need to turn I/O heroics OFF for some interactive problems with no way to really know in advance (ie it depends on how they wrote their end...?) -- reliably flushing is still the main point... counterexamples are welcome.

carolinux

for (7) -> " use an explicit stack instead of a recursive function." do you have an example?

I typically switch to c++ when I see deep recursion

For example my code for https://cses.fi/problemset/task/1666:

https://cses.fi/paste/cf06db4f6f9856bc72c271/

Apologies if any of the noise-y advice out there was amplified by me. A lot of the time I was working through it out loud myself, and you know, mentally it was okay because I'd post something comprehensive like this next someday... but really I meant next someday. So, thanks OP for doing it this time around.

Some overly specific stuff that probably doesn't matter, overall post is good:

Fear of sorting tuples — 32-bit pypy3 died in a not-great state, and comparisons-are-hard is a meme for all implementations of python because if you don't know the types in advance, there's always another case you might have to branch into. SO, for ye olde 32-bit pypy one might bounce between the issues of comparing non-primitive type tuple with the tendency to multiply-pack them into a single large presumed-primitive type int only to be surprised when python handily converts the latter into a non-primitive bigint type... and sadness happened.

It took until pypy-64 for me to realize what was going on... so most of the time, key=lambda x:x[42] is enough as long as the selected element is a single primitive type. But also the multiply-pack strategy is now less likely to accidentally blow up into bigint range (vs. old 32-bit threshold). I... don't know off-hand which pypy cses uses, actually.

Tip 8: 'simple' is general... "branches choke pypy's attempts to optimize" would be more specific, and while I'll mentally avoid some check-0-then-check-max-then-most-likely-codepath sequences by padding a grid's borders, I haven't explicitly thought about it in-contest. I did run into a floyd-warshall problem (so, inner of 3 nested loops) where setting the default distance to float('inf') was the source of many many many type-mismatched (read: branch-y) comparisons. Setting the default to a large non-bigint int worked annoyingly well in that case.

Tip 9: this relates to why I default to tip 1 as advice rather than juggling cpython's specifically-good-generally-slower quirks. Standard cpython is pretty 'nice' about optimizing certain things while still being generally slow about things you can't depend on in this way. So something like str+=str works well in cpython but doesn't get special treatment in pypy which will dutifully go fast at doing the wrong thing: mangling/reallocating immutable types. Therefore, advice like in tip 9 is useful in any language with immutable types/sequences (java string vs. stringbuilder comes to mind, might be outdated tho), preferring the use of mutable types to build, and therefore no longer wasting pypy's relative speed advantage. This was also mentioned in the context of a.pop(0) which ordinarily would be a no-no but cpython was nice about it, so to speak. Counterexample to 1 but invoking 9's 'do better on your end' sentiment: meta hackercup 2022 had a problem that I lazily flattened into a substring search, I anticipated the mostly-equal-elements countercase, looked up what algo python uses and noted it'd be okay, aaaaand then died in TLE. Why? I was using pypy by default and they hadn't (yet?) made the same choice of algo. Whatever y'all do, please don't mindlessly relay this as "use cpython for strings" because it's incomplete enough to make things worse in the long run (ie risks developing even more weird coping mechanisms when stuck in cpython's overall slowness).

Tip 10: added later? similar ideas... there isn't usually occasion to mix types in a single collection, but you may find yourself making compound/tuple types like (string, int-index) especially when using zip or enumerate. But that's covered by the above bits about keying your function (e.g. sort) to focus on one element/position and ideally a single easily-compared type.

Only things I'd add are around the libraries/modules bit: itertools and functools are neat...!

edit: out of curiosity I tried to reproduce the pop(0) and substring-search-TLE differences, but I haven't been able to... since it was more than a year ago, it's possible I too am propagating rumors due to what was likely still 32-bit pypy on my local machine. My level of interest caps at rendering that as 'just use pypy-64' though. The str+=str difference still holds fwiw...

Thanks for the tips!

← Rev. 4 →

Upon yet closer inspection, sorting tuples is still something to be wary of and a case where cpython can beat pypy, even if keying helps significantly for both.

For a subset of that, sorting (value, index) pairs:

cpython likes:

def getter_sort(x):  # plain array in, returns only indices
   return sorted(range(n), key=x.__getitem__)

and seems faster than what pypy-64 likes:

Maybe there are other 'spellings' that'll work better though.

Local timings for n=5*10**5 give or take some noise, errors, flawed timing rig:

Shrug?

Whenever I need to sort tuples I will usually want to sort both their 1st and 2nd values and get the original tuple.

As you can see in my original post 2. for tuple sort, thats usually how I do it, I wonder how fast it compares to your other sorting algorithms

Mukundan314

You might also want to try out ordersort from PyRival, which is faster than builtin sorted on PyPy (but much slower on CPython).

Comparison of sorting functions at different maximum values (PyPy 3-64)

<= 2 ** 16
sorted           1.174s / 100
getter_sort      8.191s / 100
splode_sort      2.866s / 100
bucketsort       0.171s / 100
ordersort        0.349s / 100
long_ordersort   0.829s / 100

<= 2 ** 32
sorted           1.156s / 100
getter_sort      8.150s / 100
splode_sort      4.287s / 100
ordersort        0.513s / 100
long_ordersort   0.970s / 100

<= 2 ** 62
sorted           1.161s / 100
getter_sort      8.149s / 100
splode_sort      4.291s / 100
long_ordersort   1.156s / 100

<= 2 ** 64 (long)
sorted           7.422s / 100
getter_sort      9.450s / 100
splode_sort     13.320s / 100
long_ordersort   1.848s / 100

For numbers larger than 2 ** 64, 3 passes in long_ordersort will probably be better

Benchmark Code

sarvajnya_18

4 месяца назад, # |

orz. Thank you for this!

Блог пользователя drugkeeper

Introduction

Essentials (Most questions can be solved using these tips)

🦀-Python

Other useful tips: