I would like to thank redpanda and qmk for their great help in writing and reviewing this blog. They are also the users who queried O1-mini for all the following problems.
We tried O1-mini on several problems, from a variety of sources. Let's list the results first.
Note :
Some of the WA verdicts here actually means that the AI just "stopped thinking" which means the AI thought for a long enough time without any useable results so it ran into an error.
All ratings mentioned are Codeforces ratings, the Atcoder ratings have been converted to codeforces rating. To measure the approximate codeforces rating of the atcoder problems mentioned here, you can use https://kenkoooo.com/atcoder/#/table/ + https://silverfoxxxy.github.io/rating-converter.
D2ABs
- 1762B - Make Array Good [1100] AC
- 1998B - Minimize Equal Sum Subarrays [1000] AC
- 1998A - Find K Distinct Points with Fixed Center [800] AC
- 1831A - Twin Permutations [800] AC
- 1831B - Array merging [1000] AC
- 1905B - Begginer's Zelda [1100] AC (this one even used 4o)
- 1905A - Constructive Problems [800] AC
- 2009C - The Legend of Freya the Frog [1100] AC
- 1816A - Ian Visits Mary [800] WA
- 1944B - Equal XOR [1100] WA
- 1758B - XOR = Average [900] WA
Standard problems
- 2000E - Photoshoot for Gorillas [1400] AC
- 13E - Holes [2700] AC after giving hidden test
- abc365_e — Xor Sigma Problem [1544] AC
- abc293_g — Triple Index [2037] AC
- abc258_g — Triangle [2181] AC
- abc342_g — Retroactive Range Chmax [2305] WA
- abc362_g — Count Substring Query [1940] TLE
- 1891F - A Growing Tree [2000] WA
ARC AB Problems
- arc174_b — Bought Review [1378] AC
- arc179_a — Partition [1092] AC
- arc183_a — Median of Good Sequences [1118] WA
- arc175_b — Parenthesis Arrangement [1410] WA
D2C+ (Non-standard) problems
- 1904D1 - Set To Max (Easy Version) [1600] WA
- 1904C - Array Game [1400] WA
- 1967A - Permutation Counting [1400] WA
- 1965A - Everything Nim [1400] WA
- 1943A - MEX Game 1 [1300] WA
- 1893A - Anonymous Informant [1400] WA
- agc051_a — Dodecagon [1842] WA
- agc043_a — Range Flip Find Route [1417] WA
- the famous 1188B - Count Pairs [2300] WA
Conclusion
Overview
gpt o1-mini is very good and faster than humans at some standard problems ≤ atcoder blue, these types of problems can't appear in contests if we want fair contests where AI doesnt determine ranking
gpt o1-mini is very good at most easy "adhoc" problems like D2AB and is faster than or as fast as tourist at these problems, I'm not certain about how to make D2A's that ai cant solve and D2B has limited options too.
the AI struggles a lot on harder "adhoc" problems , especially multistep problems where the steps aren't standard
Div1 people are completely unaffected (for now at least). Some extra care needs to go into choosing d2Bs such that O1-mini can't solve them, but otherwise Div2 contests are also mostly unaffected.
D2AB Ad-Hoc problems
It can "solve" most of the Div2AB level problems in under a minute. Why do i double quote around "solve"? Often, even when the AI can't really reason the solution, if the solution is simple enough like $$$min(a, b)$$$, it gets there by brute forcing.
It is hard to create problems which are suitable for such positions and still not solvable by beginners, but I don't believe it is impossible. We have 3 examples listed above ($$$1$$$ d2A and $$$2$$$ d2B), which are all constructive problems (and the construction isn't as simple as print $$$1, 2, ... n$$$). Constructive problems are one of the weaknesses of the AI, or more generally, problems where the ad-hoc part cannot be reached easily through bruteforcing.
It also failed arc183_a — Median of Good Sequences and 1943A - MEX Game 1, which are easy enough to be D2Bs.
$$$2$$$ tips in particular for easy problems i would like to stress on for problemsetters :
Don't have them be a stupid (even if it is cute) result. For example, output $$$a + b$$$ or print $$$1, 2, .. n$$$.
Try to have problems where genuine insight can't be easily replaced by guessing.
I know these are hard to do, especially for D2A. D2B might still be possible....and it is important if we try to make it as fair as possible.
Another option is to remove D2A and start at D2B (with registered => rated) with a less steep difficulty progress over the contest. There won't be a big difference for most participants however maybe completely new beginners will feel bad however if they solve no problems.
Standard Contests (ABC/Div3/Div4)
AI can perform exceedingly well in such contests. It is trained on infinite problems, and can recognize techniques with ease. For, example it solved the MO's problem in only $$$27$$$ seconds, while it took minutes for several Ad hoc problems (even when it managed to solve it)
These contests simply aren't fair in the current format. Even if you ban AI, i can always just understand and rewrite it's solution. I strongly urge one of the following $$$3$$$ actions :
Remove them
Have them unrated and solely for practice
Rework them to not have so many Standard problems
$$$2$$$-nd option probably isn't very feasible, and you may not want to do the $$$1$$$-st. Thus, I suggest the $$$3$$$-rd.
Div2C+ Ad-hoc Problems
These problems generally need multi steps, and thus, the AI has a very low probability to solve it. It failed each and every one of the last $$$5$$$ D1As under 1400 rating.
Problems requiring some thinking which can't be easily replaced by brute-forcing solutions is the way to beat it, along with having multistep problems. With every non-trivial step (or even a trivial step given how the AI works) reduces the probability for the AI to solve the problem exponentially.
A meta shift needs to happen in the recent future, where we have to solely focus on making multistep problems with several thinking non-trivial steps. However, we are not there yet.
Rating Approximations
(These are purely approximations from the data we have, and is my personal opinion. They could be more accurate with more data but unfortunately we only have limited queries.)
ABC, Div3, Div4 contests : [1900+] according to CF rating
Div2 contests : [1500]
ARC : [1300] according to CF rating
Div1 & AGC : -
Final Remarks
We are mostly fine. Some amount of care needs to go to build easy problems which AI can't solve but other than that not majorly affected. Banning AI probably is the right move now however.
In the future, I don't see this type of "generate $$$10^9$$$ solutions and find the right one" AI being good at solving harder problems, especially if we try to counter it. An AI that can actually reason is much more dangerous imo...