yummy's blog

By yummy, 2 days ago, In English

Hi Codeforces! I am a member of the reasoning team at OpenAI. We are especially excited to see your interest in the OpenAI o1 model launch, many of us being Codeforces users ourselves (chenmark, meret, qwerty787788, among others). Given the curiosity around the IOI results, we wanted to share the submissions that scored 362.14—above the gold medal threshold—from the research blog post with you. These were the highest scoring among 10,000 submissions, so still a ways to go until top human performance, but we aspire to be there one day.

The following C++ programs (including comments!) are written entirely by the model. Special thanks to PavelKunyavskiy for maintaining the IOI mirror, which we used to check our scores. We hope you enjoy taking a look!

nile (100/100)

message (79.64/100)

  • Submission (79.64/100; subtask 1 and partial credit on subtask 2)

tree (30/100)

hieroglyphs (44/100)

mosaic (37/100)

sphinx (71.5/100)

Lastly, we hope you find the new model magical and delightful—we can’t wait to hear about the amazing things you’ll build with it. (But please don’t use it to cheat on Codeforces!)

  • Votes: +318

»
2 days ago
  Votes: +72

Great work!

It seems that o1 has extremely impressive scores all around; its most impressive score is probably actually hieroglyphs, where a score of 44 would place it fourth relative to onsite contestants! It seems that the model was able to decipher some of the subtasks where we could not!

»
2 дня назад, # |
  Проголосовать: нравится +8 Проголосовать: не нравится

And how was the performance of the model on Codeforces problems measured? Did it participate in rated rounds? Is it possible to reveal a username of the model on Codeforces?

  • »
    »
    2 days ago
      Votes: +5

    We evaluated the model's Codeforces performance via simulation, making a best effort to approximate how the model would have performed had it participated live. In our Codeforces eval, the model is limited to 10 submissions per problem. We use these submissions to simulate the score; from the score we get a ranking; and from the ranking we estimate the model's rating.
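    The last step above (ranking to rating) can be illustrated with a common Elo-style "performance rating" calculation. This is only a sketch under assumptions: the reply does not say which formula was used, and the names `expectedRank` and `estimateRating`, the 400-point logistic scale, and the search bounds are all illustrative choices rather than OpenAI's method.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Expected rank of a player rated `rating` among `opponents` under an
// Elo-style model: 1 + sum of the probabilities that each opponent places higher.
double expectedRank(double rating, const std::vector<double>& opponents) {
    double rank = 1.0;
    for (double r : opponents) {
        rank += 1.0 / (1.0 + std::pow(10.0, (rating - r) / 400.0));
    }
    return rank;
}

// Finds the rating whose expected rank matches the simulated rank.
// expectedRank is strictly decreasing in rating, so bisection applies.
double estimateRating(double simulatedRank, const std::vector<double>& opponents) {
    assert(!opponents.empty());
    double lo = 0.0, hi = 5000.0;  // assumed search bounds, not from the source
    for (int iter = 0; iter < 100; ++iter) {
        double mid = (lo + hi) / 2.0;
        if (expectedRank(mid, opponents) > simulatedRank) {
            lo = mid;  // this rating would place worse than observed: go higher
        } else {
            hi = mid;  // this rating would place at least as well: go lower
        }
    }
    return (lo + hi) / 2.0;
}
```

    For instance, with two hypothetical opponents both rated 1500, a simulated rank of 2 maps back to a rating of roughly 1500, and a better (smaller) rank maps to a higher rating.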

»
2 days ago
  Votes: +10

This could be a naive question, but do you guys (OpenAI) plan on watermarking the code generated by future models? It could make detecting AI-generated code much easier.

  • »
    »
    2 days ago
      Votes: +3

    Watermarking 50 lines of code seems impossible, unlike watermarking an image.

  • »
    »
    46 hours ago
      Votes: +18

    I can't speak to future plans for OpenAI. That said, speaking for myself (and not OpenAI), I think watermarking is a cool research direction but not a panacea. For many problems, all AC solutions fall into a few broad buckets, and within those buckets, it is difficult to identify AI vs. non-AI solutions if one is allowed to rewrite/obfuscate code.

»
2 days ago
  Votes: +7

Out of curiosity: can you share if there are any endeavors in problem setting?

  • »
    »
    46 hours ago
      Votes: +20

    We don't have any results on problem setting, and I could imagine that writing creative problems is a bit out of reach of current models. (I struggle to even get them to tell me a new joke :)) But synthetic problems have been used in the training of models, e.g. AlphaGeometry.

»
2 days ago
Rev. 3   Votes: +13

This is probably obvious, but I want to ask: did the AI use stress testing locally to check for correctness? I suppose it is capable of writing a brute-force solution and testing locally. Just curious whether the 10k submissions could be avoided (or whether this could even improve the performance).

Maybe you didn't want to do that because adding human heuristics on top of the AI just for the sake of performance is not the goal?
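For readers unfamiliar with the term, stress testing means comparing a fast candidate solution against a slow but obviously correct one on many small random inputs. The sketch below is a generic illustration on a toy problem (maximum subarray sum); the problem choice and the names `bruteMaxSubarray`, `fastMaxSubarray`, and `stressTest` are assumptions made for the example, not anything the model or OpenAI is known to have used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Input = std::vector<int>;
using Solver = std::function<long long(const Input&)>;

// Reference solution: O(n^2) maximum sum over all non-empty subarrays.
long long bruteMaxSubarray(const Input& a) {
    assert(!a.empty());
    long long best = a[0];
    for (std::size_t i = 0; i < a.size(); ++i) {
        long long sum = 0;
        for (std::size_t j = i; j < a.size(); ++j) {
            sum += a[j];
            best = std::max(best, sum);
        }
    }
    return best;
}

// Candidate solution: Kadane's algorithm, O(n).
long long fastMaxSubarray(const Input& a) {
    assert(!a.empty());
    long long best = a[0], cur = a[0];
    for (std::size_t i = 1; i < a.size(); ++i) {
        cur = std::max<long long>(a[i], cur + a[i]);
        best = std::max(best, cur);
    }
    return best;
}

// Runs `rounds` small random tests. Returns true if the candidate always
// agrees with the reference; otherwise stores the failing input in
// `counterexample` and returns false.
bool stressTest(const Solver& reference, const Solver& candidate, int rounds,
                unsigned seed, Input& counterexample) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> lengthDist(1, 8);
    std::uniform_int_distribution<int> valueDist(-10, 10);
    for (int t = 0; t < rounds; ++t) {
        Input a(lengthDist(rng));
        for (int& x : a) x = valueDist(rng);
        if (reference(a) != candidate(a)) {
            counterexample = a;
            return false;
        }
    }
    return true;
}
```

Calling `stressTest(bruteMaxSubarray, fastMaxSubarray, 1000, seed, cex)` with any seed should return true, while a buggy candidate typically fails within a few rounds and leaves a small failing input in `cex` for debugging.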

  • »
    »
    46 hours ago
      Votes: +12

    In the blog post, we discussed this a little:

    For each problem, our system sampled many candidate submissions and submitted 50 of them based on a test-time selection strategy. Submissions were selected based on performance on the IOI public test cases, model-generated test cases, and a learned scoring function. If we had instead submitted at random, we would have only scored 156 points on average, suggesting that this strategy was worth nearly 60 points under competition constraints.

    It would be super cool if one day the AI could do stress testing without human heuristics on top!
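    As a rough illustration of the selection step quoted above, the sketch below ranks candidates by the three signals the blog post names and keeps the best ones up to a submission limit (50 in the quote). How the signals are actually combined is not stated; the `Candidate` fields and the lexicographic priority used here are assumptions for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-candidate signals; field names are illustrative only.
struct Candidate {
    int id;
    int publicPassed;     // IOI public test cases passed
    int generatedPassed;  // model-generated test cases passed
    double learnedScore;  // output of a learned scoring function
};

// Returns the ids of at most `limit` candidates, best first.
// Assumed priority: public tests, then generated tests, then learned score.
std::vector<int> selectSubmissions(std::vector<Candidate> pool, std::size_t limit) {
    std::stable_sort(pool.begin(), pool.end(),
                     [](const Candidate& a, const Candidate& b) {
                         if (a.publicPassed != b.publicPassed)
                             return a.publicPassed > b.publicPassed;
                         if (a.generatedPassed != b.generatedPassed)
                             return a.generatedPassed > b.generatedPassed;
                         return a.learnedScore > b.learnedScore;
                     });
    std::vector<int> chosen;
    for (std::size_t i = 0; i < pool.size() && i < limit; ++i) {
        chosen.push_back(pool[i].id);
    }
    assert(chosen.size() <= limit);
    return chosen;
}
```

    With `limit` set to 50, the function returns the 50 highest-ranked candidate ids, or all of them if fewer were sampled.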

    • »
      »
      »
      46 hours ago
      Rev. 4   Votes: +5

      Edit: Oh, I got it. The model only submitted 50 solutions, as is the competition constraint. It generated thousands of solutions, but it only submitted 50.

      • »
        »
        »
        »
        45 hours ago
          Votes: +36

        I think you misunderstood here...

        There are actually 3 different results:

        • submitting 50 random submissions: 156 points
        • strategically choosing 50 submissions: 213 points
        • up to 10k submissions: 362 points

        It can be seen on this webpage: https://openai.com/index/learning-to-reason-with-llms/#coding

        • »
          »
          »
          »
          »
          45 hours ago
          Rev. 4   Votes: +11

          Oh, thanks! I was just too lazy to read it. I prefer reading Codeforces comments :)

          So I keep my position: I would expect that a sophisticated heuristic on top of the model with stress testing would, in most cases, be as accurate as the real verdict. That is, score should not improve by allowing more submissions.

»
42 hours ago
  Votes: 0

When do you think AI will be able to solve Master-level problems? Or is that even possible?

»
42 hours ago
  Votes: +2

No way competitive programmers are the ones trying to ruin the sport

»
41 hours ago
  Votes: +10

If the score distribution histogram over the 10,000 submissions could be shared for each task, it would be even more impressive!

»
41 hours ago
  Votes: +36

How much computing power was used?

»
34 hours ago
  Votes: 0

What was the prompt after seeing that the code was failing? Did it generate some test cases somehow?

»
33 hours ago
  Votes: 0

  1. What is the effective context size for the o1-ioi model (in tokens)? I assume that since competitive programming doesn't require major task decomposition (and the tasks themselves are small), it should be much lower. Or does a bigger context size always lead to better results even here, with no diminishing returns so far?
  2. The problem with RLHF is that it generally tries to optimize human vibe (humans liking the answer), not some clear final metric. When tuning o1-ioi, have you tried using rating or number of points as the reward function instead of human feedback?
»
33 hours ago
  Votes: +4

Why competitive programming?

»
32 hours ago
  Votes: +82

Thanks for posting such details. You guys do such interesting things!

»
31 hours ago
Rev. 2   Votes: +27

Very insightful.

Edit: Seems like its solution is in fact correct, so 1-0 for the AI against me

Original: In particular I find the results on "message" interesting: it seems like its basic idea of determining a known safe column is not really correct, but given 10,000 submissions I imagine it tried a lot of different ways to communicate a safe column and eventually one got past the grader. That gives one view of why more submissions can be more helpful. I haven't examined the sphinx code, but I imagine in principle a similar thing is possible there, too.

»
30 hours ago
  Votes: 0

But what about cheaters who use AI?

  • »
    »
    26 hours ago
      Votes: +14

    It is not much different from cheaters who use their friends/submit from multiple accounts. New technology, but the same old problem.

»
30 hours ago
  Votes: 0

Amazing things you are doing :)

Any plans to participate in the ICPC World Finals?