yummy's blog

By yummy, 4 months ago, In English

Hi Codeforces! I am a member of the reasoning team at OpenAI. We are especially excited to see your interest in the OpenAI o1 model launch, many of us being Codeforces users ourselves (chenmark, meret, qwerty787788, among others). Given the curiosity around the IOI results, we wanted to share the submissions that scored 362.14—above the gold medal threshold—from the research blog post with you. These were the highest scoring among 10,000 submissions, so still a ways to go until top human performance, but we aspire to be there one day.

The following C++ programs (including comments!) are written entirely by the model. Special thanks to PavelKunyavskiy for maintaining the IOI mirror, which we used to check our scores. We hope you enjoy taking a look!

nile (100/100)

message (79.64/100)

  • Submission (79.64/100; subtask 1 and partial credit on subtask 2)

tree (30/100)

hieroglyphs (44/100)

mosaic (37/100)

sphinx (71.5/100)

Lastly, we hope you find the new model magical and delightful—we can’t wait to hear about the amazing things you’ll build with it. (But please don’t use it to cheat on Codeforces!)

  • Vote: I like it
  • +362
  • Vote: I do not like it

»
4 months ago, # |
  Vote: I like it +62 Vote: I do not like it

Great work!

It seems that o1 has extremely impressive scores all around; its most impressive score is probably actually hieroglyphs, where a score of 44 would place it fourth relative to onsite contestants! It seems that the model was able to decipher some of the subtasks where we could not!

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

And how was the performance of the model on Codeforces problems measured? Did it participate in rated rounds? Is it possible to reveal a username of the model on Codeforces?

  • »
    »
    4 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    We evaluated the Codeforces performance of the model via simulation, doing a best effort to approximate how the model would have performed had it participated live. With our Codeforces eval, the model is limited to 10 submissions per problem. We use these submissions to simulate the score; from the score we get a ranking; and from the ranking we estimate the model's rating.

    • »
      »
      »
      3 months ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      'Simulation' means not participating in contest, only engaging virtually? Is the ranking solely based on the results from the IOI_contest_codeforces? How is the model's rating estimated—does it use the Elo system for just that one contest?

      In the problem Sphinx, submission1 scored 50 and submission2 scored 43. How is the total score calculated to be 71.5?

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Could be a naive question but: Do you guys (OpenAI) plan on watermarking the code generated by future models? It could make the process of detecting AI generated code much easier.

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +3 Vote: I do not like it

    Watermarking a 50 line code seems impossible, unlike watermarking an image

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +18 Vote: I do not like it

    I can't speak to future plans for OpenAI. That said, speaking for myself (and not OpenAI), I think watermarking is a cool research direction but not a panacea. For many problems, all AC solutions fall into a few broad buckets, and within those buckets, it is difficult to identify AI vs. non-AI solutions if one is allowed to rewrite/obfuscate code.

»
4 months ago, # |
  Vote: I like it +7 Vote: I do not like it

Out of curiosity: can you share if there are any endeavors in problem setting?

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +20 Vote: I do not like it

    We don't have any results on problem setting, and I could imagine that writing creative problems is a bit out of reach of current models. (I struggle to even get them to tell me a new joke :)) But synthetic problems have been used in the training of models e.g. AlphaGeometry

»
4 months ago, # |
Rev. 3   Vote: I like it +13 Vote: I do not like it

This is probably obvious, but I want to ask: Did AI use stress testing locally to check for correctness? I suppose it is capable of writing the brute force solution and test locally. Just curious if the 10k submissions could be avoided (or if this could even improve the performance).

Maybe you didnt want to do that because adding human heuristics on top of the AI just for the sake of performance is not the goal?

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +12 Vote: I do not like it

    In the blog post, we discussed this a little:

    For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

    It would be super cool if one day the AI could do stress testing without human heuristics on top!

    • »
      »
      »
      4 months ago, # ^ |
      Rev. 4   Vote: I like it +5 Vote: I do not like it

      Edit: Oh, I got it. The model only submitted 50 solutions, as is the competition constraint. It generated thousands of solutions, but it only submitted 50.

      • »
        »
        »
        »
        4 months ago, # ^ |
          Vote: I like it +36 Vote: I do not like it

        I think you misunderstood here...

        There are actually 3 different results:

        • submitting 50 random submissions: 156 points
        • strategically choosing 50 submissions: 213 points
        • up to 10k submissions: 362 points

        It can be seen on this webpage: https://openai.com/index/learning-to-reason-with-llms/#coding

        • »
          »
          »
          »
          »
          4 months ago, # ^ |
          Rev. 4   Vote: I like it +11 Vote: I do not like it

          Oh, thanks! I’m just lazy to read about it. I prefer to read on codeforces comments :)

          So I keep my position: I would expect that a sophisticated heuristic on top of the model with stress testing would, in most cases, be as accurate as the real verdict. That is, score should not improve by allowing more submissions.

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

When do you think AI will be able to solve Master level problems? Or is that even possible?

»
4 months ago, # |
  Vote: I like it +2 Vote: I do not like it

No way competitive programmers are the ones trying to ruin the sport

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +20 Vote: I do not like it

    Creating 1 algorithm to solve all problems is the ultimate challenge.

»
4 months ago, # |
  Vote: I like it +10 Vote: I do not like it

For each task and the 10,000 submissions, if the score distrubution histogram can be shared, it will be more impressive!

»
4 months ago, # |
  Vote: I like it +36 Vote: I do not like it

How much computing power was used?

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

what was the prompt after seeing that the code is failing? did it generate some testcases somehow?

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it
  1. What is the effective context size for o1-ioi model (in tokens)? I assume since competitive programming doesn't require major decomposition for tasks (and tasks themselves are small) it should be way lower. Or even here bigger context size always leads to better results with no diminishing returns so far?
  2. Problem with RLHF is that it generally tries to optimize human vibe (humans liking the answer), not some clear final metric. When tuning o1-ioi, have you tried giving rating/number of points as a reward function and using it instead of hf?
»
4 months ago, # |
  Vote: I like it +4 Vote: I do not like it

Why competitive programming?

»
4 months ago, # |
  Vote: I like it +82 Vote: I do not like it

Thanks for the posting such details. You guys do so interesting things!

»
4 months ago, # |
Rev. 2   Vote: I like it +27 Vote: I do not like it

Very insightful.

Edit: Seems like its solution is in fact correct, so 1-0 for the AI against me

Original: In particular I find the results on "message" interesting — it seems like its basic idea of determining a known safe column is not really correct, but given 10 000 submissions I imagine it tried a lot of different ways to communicate a safe column and eventually one went past the grader. That gives one view of why more submissions can be more helpful. I haven't examined the sphinx code, but I imagine in principle a similar thing is possible there, too.

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +7 Vote: I do not like it

    What's wrong with it?

    • »
      »
      »
      4 months ago, # ^ |
        Vote: I like it +6 Vote: I do not like it

      Actually, you're right, I seem to have misunderstood its approach. Not sure why so many people agreed with me.

  • »
    »
    4 months ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    Are you suggesting it can be hacked? Can you hack it?

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

but, what about cheaters who use ai?

  • »
    »
    4 months ago, # ^ |
      Vote: I like it +24 Vote: I do not like it

    It is not much different from cheaters who use their friends/submit from multiple accounts. New technology, but the same old problem.

»
4 months ago, # |
  Vote: I like it 0 Vote: I do not like it

Amazing things you are doing :)

Any plan on participating in ICPC world finals?