Блог пользователя aberter0x3f

Автор aberter0x3f, история, 13 часов назад, По-английски

Are you tired of the inconsistencies and unpredictability in competitive programming evaluations? Do you wish for a more stable, efficient, and fair judging system? Look no further! We are excited to unveil our innovative evaluation solution that leverages the power of WebAssembly (WASM) and the Word-RAM model to transform the way of competitive programming judgement.

Why This Project Matters

Traditional evaluation systems often suffer from instability, environmental discrepancies, and outdated performance metrics. Competitors frequently face the frustration of inconsistent results due to machine fluctuations, leading to unnecessary optimization efforts. Our project addresses these challenges head-on, ensuring a level playing field for all participants.

Key Features of Our Solution

  • Absolute Stability: By simulating program execution, our system guarantees consistent results across different machines, eliminating the unpredictability that often plagues competitive programming.
  • Enhanced Security: The WASM runtime operates in a secure sandbox environment, protecting the host system from any potential risks posed by participant code.
  • Unified Interface: With standardized interfaces, handling interactive and communication problems becomes a breeze, simplifying the development process and enhancing compatibility across various problems.
  • Broad Compatibility: Unlike traditional systems that rely on specific operating environments, our solution is designed to work seamlessly across a wide range of platforms, making deployment easier than ever.

How It Works

Our approach involves compiling participant code into WASM bytecode, which is then executed in a controlled environment. We utilize a cost function based on the Word-RAM model to evaluate the complexity of the code, providing a fair assessment of its performance. This method not only reflects the actual execution time but also adapts to the evolving landscape of computing power.

For a more detailed explanation, please refer to the following link: Principles Overview (Chinese).

Experimentation and Results

We have rigorously tested our system using a variety of classic algorithms and data structures, ensuring comprehensive evaluation. The results show a strong correlation between WASM execution costs and traditional execution times, validating the effectiveness of our approach.

It's an Evaluation!

Imagine a future where this system is widely adopted. We can achieve some outcomes that might even transform the landscape of competitive programming.

In competitions where real-time feedback is absent and evaluations are concentrated at the end, our system ensures that participants will not lost points due to discrepancies between their personal computers and the judging machines. This means that every competitor can reproduce the same evaluation environment on any computer without worrying about performance differences.

For regions or schools facing economic challenges that hinder the upgrade of evaluation machines, our solution provides a lifeline. By standardizing execution results across all devices, participants can compete on an equal footing with their peers, regardless of their local resources.

Our technology allows for the fair assessment of old classic problems without the need to constantly adjust time limits based on the evolving performance of judging machines now and the past to avoid brute force solutions pass.

We need YOU!

We are currently conducting a large-scale sample collection and experimental evaluation activity. We invite you to participate by submitting your code for various problems. Your contributions will help us refine and enhance this groundbreaking project.

  • Test Online Judge: Experience our evaluation system firsthand at https://waj.11316396.xyz/. Follow the guidelines on the homepage to ensure your submissions are considered for our dataset.
  • Feedback and Community: Join our user feedback QQ group at 1023748105 to share your thoughts, ask questions, and connect with fellow participants.

Together, we can revolutionize competitive programming and create a more equitable environment for all competitors. Thank you for your support and enthusiasm!

Let us embark on this exciting journey towards a brighter future for competitive programming!

  • Проголосовать: нравится
  • +92
  • Проголосовать: не нравится

»
13 часов назад, # |
  Проголосовать: нравится +1 Проголосовать: не нравится

First, I'd like to express my support to this project. It does provide us a stable and portable method for program evalution, which is something valuable in competitive programming.

However, I think it's not perfect. One of the main weakness is that nothing can perfectly illustrate the asymptotic complexity of a program. Running on a real machine can't. Neither can WASM. If a $$$O(n \log n)$$$ solution has a huge constant factor, it still can't be filtered out from $$$O(n \log^2 n)$$$ soultions: because it does invoke more instructions.

There is something more to consider: how to set a responsible cost for each WASM instruction? Can libstdc++ (which is more familiar to contestants) be ported to this system?... But I believe although such creative approach may not be used in real contests in the near future, it will be a remarkable technologic improvement.

»
13 часов назад, # |
  Проголосовать: нравится 0 Проголосовать: не нравится

In your Chinese blog you declare that:

基于上述方案,我们提出了一种改进的在线比赛赛制.它类似于 pre test — system test 赛制和 OI 赛制的结合.具体而言,数据集被划分为预评测集和系统评测集,两者均采用相同的数据生成器,但使用不同的随机数种子进行生成.(Translate: We present a improved competition format. It resembles the combination of the pre test — system test format and OI format. Specifically, the test data is divided into pretest data and system test data. Both test datasets share the same data generator, but are generated from different random seeds.)

In practice, we know that tests are not all generated randomly. For example, the test 2 of recent CodeForces problems enumerates all small possible cases. Furthermore, some tests that hack certain solutions are generated by stress tests, which aren't generated by a generator, but put into tests manually. How will you handle these tests?

  • »
    »
    12 часов назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    Generator is not necessarily based on randomization, it can also be some constructed data. Just like you said, enumerate all small data.

    As for generating hack data through stress test, I think since you can find a seed to generate one set of hack data, you can definitely find a second one.

    In fact, this new competition system is not much different from the pre test — system test system currently used by Codeforces. It just changes the form of pre test from black box to white box, that is, the pre test is sent to the contestants directly without the need to judge on the evaluation server.

    Because WASM has the ability to eliminate platform differences, it is easy to prove that the results of contestants testing these white box data on their own computers are the same as those on the evaluation server.

»
5 часов назад, # |
Rev. 2   Проголосовать: нравится +3 Проголосовать: не нравится

I am unfamiliar with wasm, but it looks div is compiled as div, in contrast to cheaper instructions when the divisor is constant as x86-64 compilers are doing, which may result in a ~2.7x difference. Not sure how much impact will it has when the div operations are heavy in a solution, like when mod math is involved, will someone doing this kind of constant optimization gain some advantanges?

I was playing with

This source code

where f2 is rewritten based on asm generated from f, f uses 5.2e9 ticks while f2 uses 1.9e9 ticks.

Also, for the proposed competition format, while it improves user experience, will it make gen AI cheating be much easier? Nowadays if a cheater has a generated code, it needs to be submitted to be evaluated, leaving trace. But with the "white box pretest", a cheater can probably go many iterations locally without being noticed,

Despite the concerns, I like the idea, and thought something similar: using some kind of VM to count native instructions.