chromate00's blog

By chromate00, 4 days ago, In English

As the main problemsetter of Codeforces Round 1000 (Div. 2), I know that a lot of you were somewhat disappointed to see that it was not a Div.1 contest. Though I am not the one who caused the lack of Div.1 contests, I understand your feelings. So I present to you a great surprise: the Anti-LLM Evaluation Report for Codeforces Round 1000 (Div. 2), the first of its kind for a Div.2! In this blog we discuss how the round holds up against LLMs (focusing especially on OpenAI o1, the strongest of its kind while we prepared and tested the problemset), by looking at the timeline of how the problemset changed.


So, let us begin with the initial problemset we had when testing began. (Task names are anonymized unless they were released to the public before or during the round.)

In the beginning, we had a problemset that looked like this:

  • A' — B — C0 — D' — E' — F0

(If the notation is confusing: (letter)' means the task existed but was not used in the final problemset, while (letter)0 means that it appeared in the final problemset in a different form.)

As you can see, at least half of the problemset changed during the testing phase. How did o1 do against this revision of our problemset?

  • A': Not Solved.
  • B: Not Solved.
  • C0: Solved.
  • D': Not Solved.
  • E': Not Solved.
  • F0: Not Solved.

Okay, so the set was already a bit strong against o1, but C0 being solved by LLMs while A' was not looked too odd to me. It felt like LLMs would get too large an unfair advantage if it was kept this way.

Now, the timeline of the problemset begins. (Note that the timeline is based purely on my memory and whatever information is left in the testing mashup.)


  • The first task to get swapped was A'. A lot of testers felt A' was a bit too hard for its position, and we had to find a new one. Luckily, we found A'', and swapped A' out for it. A'' was solved by o1, but we believed that would be fine if we could find a replacement for C0 that isn't solved by LLMs. More on that later.

  • The second task to get swapped was F0. F0 was a version of F that only asked for uniqueness. It was interactive, and we didn't really have a good way to force participants to solve it online. There were also unexpected solutions even in the online setting, so we made F ask for counting instead and split it into two subtasks. As you might expect, F was not solved by o1.

  • The third task to get swapped was C0. We swapped it out for C' initially, but the testers did not like it (probably because it was too hard for its position), and we swapped it back to C0. C' was not used afterwards, and it was not evaluated against LLMs either.

It was very hard to find a replacement for C0. Until...


  • We found C. It was just a random idea I pulled out while ranting about how hard it is to make tasks for that position. But somehow, very surprisingly, it was not solved by o1. I am still surprised that o1 cannot solve it; don't ask us, go ask Sam Altman instead. Anyway, this replaced C0. Very lucky!

  • At this point, most people struggled on E'. We decided that it was not a good fit for the position. As a replacement, we found E and swapped E' out for it. Thankfully, the testers pointed out that it was just the right difficulty for the position. And it was also not solved by o1. Another good one. Later, E' became COUNTGOOD on CodeChef Starters 167.

  • Some time after that, some testers pointed out that D' is very similar to a task from GCJ. But worry not, we had already spent a long time discussing how to balance the difficulty. Surprise: C0 got buffed into D. And it is now not solved by o1! Nice.

Around this time, I asked rewhile to test. He knew more about prompting LLMs, so I explicitly asked him to test using only o1 (and he gladly accepted). Here was the result.

  • A'': AC
  • B: -11
  • C: -4
  • D: -2
  • E: -1
  • F (Easy): -2
  • F (Hard): -1

So yes, I believed that o1 would die horribly if things stayed this way.


  • After this, I just changed A'' to A, which was much easier for humans than A''. It is solved by o1, but that's fine now.

There are also some omitted changes (such as a task proposed for Div2D but rejected immediately after I found out that it's a Div2B), but these are all of the significant ones.


So the final result for o1 is as follows.

  • Expected Score: $$$498$$$.
  • Corresponding Rating: 911 (762 rating points less than the 1673 initially claimed by OpenAI).

The point we need to focus on is that when these LLMs came out, people thought it was the end of the world for Div.2. But not yet! It might not be the end of the world! It does take more effort to combat them, though. Here are some things I found, as guidance for problemsetters who want to make their Div.2 contests strong against LLMs.

  • You might have noticed that the round has a significantly different problem style compared to the usual Div.2; that might have helped combat them.
  • Maybe it is time to change the meta again; the last time it changed was probably when 1188B - Count Pairs appeared? I don't know. That is up to the next Div.2 problemsetters to decide.
  • In terms of problem style, problems that require multiple small observations will generally be more robust against LLMs than those that require one or two big observations. Stronger LLMs usually get the first one or two observations correct, but requiring more small observations also makes them suffer from the limit on the number of tokens. For example, on problem C, o1 got the immediate first observation, but got WA by choosing the first $$$k=\mathcal{O}(\sqrt{n})$$$ values and brute-forcing $$$\binom{k}{2}$$$ pairs.
  • It will take you significantly more effort to make easier tasks than to make hard tasks. Position C took us more effort than position F did. For harder tasks you can worry less, as long as your task isn't extremely classic or already known, which I assume is usually not the case.

For a few of the things noted here I might not be the closest to the ground truth. Tell me in the comments if you need to point something out.


Also, to the cheaters who tried to use o1:

I gave you a hint already. I hope you learn from the negative delta and become honest and diligent again. I hope you are a sane person. I myself truly improved only after moving on from bad behaviour. Yes, I cheated back in the day, and now I have become as honest as one can be. I believe you can be better as well.

Please take this opportunity as a lesson.

Thank you for reading.

  • Vote: I like it
  • +188
  • Vote: I do not like it

»
3 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

Yann LeCun Vindicated

»
2 hours ago, # |
  Vote: I like it 0 Vote: I do not like it

This is the first contest I've seen with an anti-LLM report explaining the process behind the contest design, and I really appreciate this and hope more Codeforces contests do this in the future. I do have a question for rewhile about how he tested the LLMs; I find that the issue with LLMs isn't just that they can solve single-observation problems, but that they can also help a user make observations they would not make otherwise, thus cheating for them by indirectly solving the problem. If possible, could you explain some of the general methods you used in testing against o1?

»
117 minutes ago, # |
  Vote: I like it +1 Vote: I do not like it

It's awesome, though I think OpenAI should pay you for testing their model :)

»
97 minutes ago, # |
  Vote: I like it 0 Vote: I do not like it

how to make anti-gpt contest -> set problems until not solvable by gpt 😂

if any proposed problem is solvable then just reject immediately

»
94 minutes ago, # |
  Vote: I like it +1 Vote: I do not like it

Just a query: was there any special rating distribution for this round? There are participants who placed lower than me in the contest and are rated higher than me, but they finished with a positive delta while I received a negative one.

  • »
    »
    87 minutes ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Uhh, I think your rank was incorrectly applied for the rating change. You could ask KAN for a quick answer, or probably wait until the recalculation after the plagiarism check.

»
93 minutes ago, # |
Rev. 2   Vote: I like it +3 Vote: I do not like it

Your problemset is not as anti-LLM as you think. After seeing this post, I tested it out for myself; this is the result:

  • A: solved (one try)
  • B: not solved yet (two tries so far)
  • C: not solved yet (three tries so far)
  • D: solved (one try)
  • E: solved (one try)

As you can see, although B and C are indeed hard for o1 to solve, it managed to solve D and E in just one try. I suspect F1 might be solvable too, but I have not tried it myself.

Solving ABC only corresponds to a performance of around 1500, but if you manage to solve ABCDE, your performance is 2200+. This means that a normal Specialist with GPT o1 can easily cheat their way to Master.

For anyone interested, check out my submissions.

  • »
    »
    88 minutes ago, # ^ |
      Vote: I like it +4 Vote: I do not like it

    Update: C is also solved

    • »
      »
      »
      84 minutes ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Could you share screenshots of GPT's response or a link to view the interaction?

    • »
      »
      »
      68 minutes ago, # ^ |
        Vote: I like it +4 Vote: I do not like it

      B is also solved as of this moment.

  • »
    »
    86 minutes ago, # ^ |
      Vote: I like it +1 Vote: I do not like it

    itsover

    I kneel chatgpt sama

  • »
    »
    83 minutes ago, # ^ |
    Rev. 4   Vote: I like it 0 Vote: I do not like it

    We checked multiple times that it could not solve the tasks, and there were rated participants who seem to have used o1 and almost matched the expected rating. Maybe it could differ with different prompts, for example depending on whether you can give it better ideas.

    I also understand that the results can be very different for DeepSeek-r1. In fact, we have already caught some users suspected of using r1 :skull:

    • »
      »
      »
      77 minutes ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      No, all of this was achieved using o1, not r1, and no ideas were given from my end either.

      • »
        »
        »
        »
        63 minutes ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        True, I didn't mean to say you didn't. The experiment was probably done under different conditions, considering that our tester didn't have o1-pro.

        Maybe the round is not $$$100$$$ percent LLM-proof (DeepSeek-r1 was about that level), but it is still meaningful that we found it is at least possible to fool LLMs partially.

        • »
          »
          »
          »
          »
          59 minutes ago, # ^ |
            Vote: I like it +1 Vote: I do not like it

          I was not using o1-pro, nor do I have GPT Pro; everything was conducted on the normal version of o1. The only difference would probably be in the prompting, and I argue that this should not count, as it is something easily changed. I appreciate your effort, but please do not make such bold claims.

          • »
            »
            »
            »
            »
            »
            49 minutes ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Sorry, we were not sure if we were really seeing the same version of o1 as the one we tested the problemset against...

            Our testers are telling me that o1 has never thought for more than 2 minutes for them.

            • »
              »
              »
              »
              »
              »
              »
              46 minutes ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              I am quite sure this is due to a difference in prompting. If I just tell it to solve normally, I don't see it thinking for long either. You can check out my prompt in the video above.

  • »
    »
    61 minute(s) ago, # ^ |
      Vote: I like it +6 Vote: I do not like it

    I don't know much about LLMs, but could it be that the o1 model improves with each new interaction/inference? Maybe it can use recent data in some way (such as code on the Internet that appeared after the contest).

    • »
      »
      »
      31 minute(s) ago, # ^ |
        Vote: I like it +1 Vote: I do not like it

      no way an LLM like ChatGPT can improve in real time... or can they...?