Geothermal's blog

By Geothermal, history, 15 months ago, In English

While I very much enjoy Codeforces rounds overall, FSTs are by far my least favorite part of the Codeforces contest format. There are two reasons I dislike contests with weak pretests. The first is stylistic: I don't care very much about penalizing people who don't rigorously prove their solutions before submitting (which is the main justification for having weak pretests), and I think it's essentially impossible to FST these solutions without more frequently FSTing contestants who make minor mistakes in their code. (The vast majority of my FSTs are due to this sort of minor glitch rather than an error in the underlying idea.)

The second is that FSTs introduce an element of hidden information to Codeforces rounds. During a contest, you have no way of knowing whether the authors have written strong pretests. If you act as though the pretests are weak, you will be at a disadvantage if pretests are actually strong (had you known pretests were strong, you could have, for example, submitted ideas that are difficult to prove but easy to code, or submitted code without spending time checking every detail for silly mistakes like int vs long long). If you act as though the pretests are strong, you will be at a disadvantage if pretests turn out to be weak (because you didn't thoroughly check your solutions before submitting them). When pretests are weak, there is also ambiguity over which errors the pretests will catch, and it's not obvious to me why some mistakes deserve to be penalized by FSTing more than others. Thus, the current system (pretests are usually strong, but sometimes they are unintentionally weak, and occasionally an author will write weak pretests on purpose) introduces a degree of variance that I believe unnecessarily decreases the competitive value of Codeforces rounds.

In this post, I offer four proposals to improve the state of pretests on Codeforces. The first proposal serves to eliminate the hidden information problem described above. The remaining three focus on minimizing FSTs for authors who seek to write strong pretests. (These three practices are already widely followed; my purpose in posting this blog is to make them universal norms.) All of these proposals are supported by concrete examples of problems that I believe they would have improved.


Proposal One

If an author intentionally writes weak pretests, they should announce this decision in the contest announcement or in the statements of the affected problems.

Case study: Round 869, Div. 1 E. In this problem, the author chose to include only six pretests in order to discourage people from incorrectly guessing the solution idea and attempting proofs by AC. The issue is that (a) many minor implementation bugs seem to have been caught in the crossfire, and (b) contestants had no way of knowing that pretests would be weak (only one large test was included, so unlike on most problems around this difficulty level, the typical implementation errors that show up on large test cases did not fail pretests).

I disagree with the stylistic choice to use weak pretests here, but I think it's a matter of judgment and I respect the author's right to decide to make pretests weak. However, to eliminate the hidden information problem, contestants should be told that pretests are weak in advance, especially because most contestants have come to assume that pretests for late problems in Div. 1 are strong.

I can think of a couple of ways to implement this, but my favorite is to add a note at the end of the problem statement reading "Note that the pretests for this problem are intentionally made weak. There are six pretests, including the sample test case." This lets contestants know that they are responsible for carefully confirming the validity of their solutions before submitting their code. This isn't a full solution to the hidden information problem, since there is still ambiguity over which specific errors the pretests are designed to catch, so I would oppose weak pretests even if this change were implemented; still, I think it would significantly reduce the damage weak pretests do to the contest experience.

Sidenote: This proposal, including the language of the note above, was inspired by maroonrk's comment at this link.

The remaining proposals are specific to authors who intend to write strong pretests. (I think this applies to the strong majority of CF setters.)


Proposal Two

If a problem is expected to receive under 100 solutions, the pretests should be the same as the system tests. No exceptions.

Case study: Round 896, Div. 1 D. In this problem, there were 41 pretests, which I assume were meant to be strong (for example, I accidentally declared the variable storing the sum of the input degrees as an int rather than a long long, and this error was caught by pretests, even though the only way it should matter is through undefined behavior or if the sum of the input degrees is not equal to $$$2N$$$ but is evaluated as $$$2N$$$ due to overflow). However, several people FSTed when it turned out that there were 37 additional system tests not included in the pretests. Given that this problem received only about 25 solutions in-contest, running the full test suite on every submission would not have created significant load for the contest servers, so these FSTs could easily have been avoided by setting pretests equal to systests.
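
To make this bug class concrete, here is a minimal, hypothetical sketch (the limits and variable names are mine, not taken from the actual problem) of how values that individually fit in an int can produce a sum that silently wraps around in a 32-bit accumulator while a long long accumulator stays correct:

```cpp
// Hypothetical illustration of the int-vs-long-long bug class described above.
#include <bits/stdc++.h>
using namespace std;

int main() {
    const int n = 200000;
    vector<long long> deg(n, 1000000000LL);  // worst-case degrees, each fits in an int

    unsigned int bad = 0;  // 32-bit accumulator: wraps around past 2^32
                           // (with a signed int this would additionally be UB)
    long long good = 0;    // 64-bit accumulator: the true sum 2*10^14 fits easily
    for (int i = 0; i < n; i++) {
        bad += (unsigned int)deg[i];
        good += deg[i];
    }
    // The two disagree as soon as the true sum exceeds 2^32 - 1.
    cout << bad << " vs " << good << "\n";
    return 0;
}
```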

More generally, running a large number of test cases is not especially expensive when a problem receives very few solutions throughout the contest, so unless pretests are intentionally made weak, there is no reason for them to not coincide with the system tests.


Proposal Three

If pretests do not include all system tests, then multitesting should be used. If possible, pretests should include all possible small cases. Exceptions can be made for, e.g., Div. 2 As if, with only 2-3 pretests allowed, it is impossible to include all small cases, any larger edge cases, and larger maximal cases (though this situation is fairly rare, because most modern D2As have small inputs and large numbers of test cases).

I'm ambivalent on whether exceptions should be allowed for, e.g., problems where multitesting complicates the complexity analysis. However, any exception should require a justification approved by the contest coordinator (and if possible, these problems should simply have pretests = systests).

Case study: CodeTON Round 5 D. In this problem, I FSTed due to a minor bug in my logic that would have been caught by a fair number of small cases (if I recall correctly, my solution fails most cases in which there is a zero-length edge incident to $$$N$$$). There was no meaningful reason for the problem not to use multitesting, and incorporating multitests would have prevented this and similar FSTs.

Multitesting is a very powerful tool for preventing FSTs, and authors should use it (and take full advantage by including all small cases) unless they have a good reason not to.
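
To make the "include all small cases" idea concrete, here is a minimal sketch of an exhaustive small-case generator for a hypothetical problem whose test case is an array of $$$n$$$ integers in $$$[1, V]$$$ (the input format and limits are invented for the example). With $$$n \le 3$$$ and $$$V = 3$$$ there are only 39 cases, so all of them fit into a single multitest alongside whatever large and random tests the author also prepares:

```cpp
// Exhaustive small-case generator sketch for a hypothetical array problem.
// Prints one multitest containing every array with n <= MAXN and values in [1, V].
#include <bits/stdc++.h>
using namespace std;

int main() {
    const int MAXN = 3, V = 3;
    vector<vector<int>> allCases;
    for (int n = 1; n <= MAXN; n++) {
        vector<int> a(n, 1);
        while (true) {
            allCases.push_back(a);
            int i = n - 1;                        // advance like an odometer
            while (i >= 0 && a[i] == V) a[i--] = 1;
            if (i < 0) break;
            a[i]++;
        }
    }
    cout << allCases.size() << "\n";              // t = number of test cases
    for (auto &a : allCases) {
        cout << a.size() << "\n";
        for (size_t i = 0; i < a.size(); i++)
            cout << a[i] << " \n"[i + 1 == a.size()];
    }
    return 0;
}
```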


Proposal Four

Assuming multiple meaningfully distinct test case generators are used, at least one test from each generator should be included in pretests. Authors should prepare solutions meant to fail on each category of test case and should ensure that each solution actually does fail pretests.

Case study: Global Round 19 E. In this problem, the answer is immediately NO whenever $$$n$$$ is not a Fibonacci number. However, no such cases were included in the pretests, and my solution FSTed because it did not immediately return after outputting NO. There was a generator in place that produced non-Fibonacci numbers, but no test from it was included in pretests (I think this was the result of last-minute communication issues during the preparation of the round).
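
To illustrate why a non-Fibonacci test is exactly what this bug class needs, here is a hypothetical reconstruction (not my actual contest code, and the construction for the YES case is omitted) of a solution that prints NO but forgets to stop, so its output is corrupted only on inputs where $$$n$$$ is not a Fibonacci number:

```cpp
// Hedged sketch of the bug class described above, not the real solution.
#include <bits/stdc++.h>
using namespace std;

bool isFibonacci(long long n) {
    long long a = 1, b = 1;
    while (b < n) { long long c = a + b; a = b; b = c; }
    return b == n;
}

void solve(long long n) {
    if (!isFibonacci(n)) {
        cout << "NO\n";
        // BUG: a `return;` is missing here, so execution falls through and
        // also prints the YES branch, producing extra garbage output.
    }
    cout << "YES\n";
    // ... the construction for the YES case would follow here ...
}

int main() {
    long long n;
    cin >> n;
    solve(n);
    return 0;
}
```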

I'm pretty sure that following this proposal is standard practice already. Aside from rounds where pretests were intentionally weak, I can't recall any author deliberately skipping this step. The point of this proposal is not to substantially change the process of test case generation but to make sure that this condition is checked before the start of each round.
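
One lightweight way to make this pre-round check routine is a script that runs every intentionally-wrong solution against every pretest and complains if any of them passes. The sketch below assumes a hypothetical layout (compiled wrong solutions named wrong1, wrong2, ..., pretests in pretests/1.in with reference answers in pretests/1.out, and exact-output comparison); in a real round this role would normally be played by the judging system's own invocation tools and checkers:

```cpp
// Rough pre-round sanity check: every intentionally-wrong solution must
// fail at least one pretest. File names and directory layout are hypothetical.
#include <cstdio>
#include <cstdlib>

int main() {
    const int WRONG_SOLUTIONS = 3;  // ./wrong1 ... ./wrong3
    const int PRETESTS = 6;         // pretests/1.in ... pretests/6.in
    for (int s = 1; s <= WRONG_SOLUTIONS; s++) {
        bool fails_somewhere = false;
        for (int t = 1; t <= PRETESTS && !fails_somewhere; t++) {
            char cmd[256];
            // run wrong solution s on pretest t and compare with the reference answer
            std::snprintf(cmd, sizeof(cmd),
                "./wrong%d < pretests/%d.in > tmp.out && "
                "diff -q tmp.out pretests/%d.out > /dev/null",
                s, t, t);
            if (std::system(cmd) != 0) fails_somewhere = true;
        }
        if (!fails_somewhere)
            std::printf("WARNING: wrong%d passes every pretest!\n", s);
        else
            std::printf("ok: wrong%d fails some pretest\n", s);
    }
    return 0;
}
```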


Thanks for reading! If you have feedback, please feel free to write a comment; I'd be happy to hear input from people with more problem preparation experience than me.


»
15 months ago, # |

I think you should pay for better computers for MikeMirzayanov so they can just run the system tests on every submission.

  • »
    »
    15 months ago, # ^ |

    We are not all as rich as you sir :(

  • »
    »
    14 months ago, # ^ |

How to send money to Russia now? (Context: it was a nightmare for me to pay the registration fee for the Petrozavodsk camps...)

»
15 months ago, # |

In fact, for Global Round 19 E, the authors didn't think of creating tests where "n is not a Fibonacci number", so there were no such cases in the system tests before the contest, but someone submitted a hack with this case during the contest.

  • »
    »
    15 months ago, # ^ |

    Interesting, thanks for sharing! Originally, my sense from this comment was that the test data was updated at the last minute due to a change in the generator, but that such a test case did exist pre-contest. I wonder if tester solutions that did not account for this error were marked as AC before this test was generated and then weren't checked again later, so that the hack gave an unexpected verdict when it broke those solutions?

»
15 months ago, # |

Another version of proposal one: apart from the sample input and output, authors provide a Python script that generates ALL pretest inputs.

The onus would be on the participants to judge whether the pretests are weak. This could be used to discourage proof by AC.

  • »
    »
    15 months ago, # ^ |

    This isn't an intrinsically unreasonable proposal (in fact, it's not too dissimilar to the TopCoder format, where all pretests are public), but I think in practice, analyzing the tests would require contestants to spend their limited contest time on a relatively uninteresting task (assuming the number of pretests is not tiny). Given how rarely authors currently write blatantly weak pretests (e.g. no max test), I wouldn't find it worthwhile to analyze the pretests on the off chance I catch a point of weakness, decide to analyze my solution in more depth, and find an error.

    The other potential problem is that if people e.g. hardcode answers to pretests, their solutions can easily be hacked, which would make hacking a disproportionately significant component of one's overall score. There could even be a strategy of hardcoding answers to a problem you can't solve, submitting and locking, and then hacking other solutions that do the same thing in order to gain points. As a result, I think this idea is really only feasible in ICPC-format contests where hacking doesn't contribute to scoring.

    • »
      »
      »
      15 months ago, # ^ |

      If you FST, you should also get 0 for the hacks you made.

      Does it help?

      • »
        »
        »
        »
        15 months ago, # ^ |

        That would at least eliminate the strategy described in the second paragraph, though I'd still be worried about people submitting to problems they can't solve just to appear at the top of the leaderboard temporarily (the same reason people cheat in virtual contests...), leading to free hacks for whoever is the first to actually solve the problem.

        (There are always fixes like "provide a generator for a randomly selected subset of the pretests" or something, but past a certain point this approach starts to feel a little hacky.)

        I also realized I didn't finish my first paragraph--I'll go back to edit it now, but the general idea is that I'm not sure this feature would see much genuine use in-contest now, so I doubt if it would actually lead to much improvement over the status quo.

      • »
        »
        »
        »
        6 months ago, # ^ |

        what??

»
15 months ago, # |

One question regarding the first proposal — how strong is strong enough? Say a task, likely Div1E~Div1F, admits a heuristic solution. The solution of course has a countertest, but it is not trivial to realize that such a countertest exists. If the authors do not include this countertest in the pretests (because they may want the participants to discover it themselves, perhaps allowing some chance for hacks), is the pretest set considered weak?

  • »
    »
    14 months ago, # ^ |

    The first proposal applies to any situation where the author is aware of a solution class that participants might plausibly submit that will pass pretests but fail system tests. There's some ambiguity here regarding what it means for a solution to be plausible: this restriction is necessary to indicate that e.g. a solution that includes assert(N != 58234) can FST even if pretests are strong, but FSTing solutions for e.g. int vs long long issues indicates that pretests are weak. The case you described is exactly the sort of situation where this proposal is meant to apply.

    (In fact, I think my proposal makes the problemsetting decision you described much more likely to be successful. The warning that pretests are weak will indicate to contestants that it may be a good idea to think deeply about possible cases that might not be covered by pretests, making it much more likely that someone will actually discover the countercase and use it to hack solutions.)

»
15 months ago, # |

In "Proposal One" there's a slight inaccuracy. "most contestants have come to assume that pretests for late problems in Div. 1 are strong." should be "most contestants have come to assume that pretests for late problems in Div. 1 are as strong as the systest." because often it's in later problems that people are trying last resort strategies as "guess that the tests are actually weak and spam some constant optimizations + heuristics". As you've cited maroonrk as a source, I'll do it as well with this comment.

As it is, I think the current state of Codeforces should be fine for you. Usually people guess solutions and don't FST due to corner cases; when something like that happens, I feel it's more often due to a hack.

The only thing I agree should ALWAYS be accounted for is TLE testing: max tests and min tests, rather than WA, because it's sometimes hard to judge whether a solution will fit in the time limit, and people getting FSTed because the pretests lack a test with max N is a bit weird.

»
15 months ago, # |

How about having no pretests at all except the examples? It would be fun to see 50%+ of submissions FST after the end of the contest :D

»
15 months ago, # |

The proposals make sense. What I don't like is that they assume the hacking system is dead (for example, the word "hack" doesn't appear anywhere in the post).

While that's the current sad reality, I believe Codeforces system change proposals must not take it for granted, and then build on it.

Come on people, there are already quite a number of online judges without hacks. No need to stomp it further into the ground.

  • »
    »
    15 months ago, # ^ |

    In-contest hacking is pointless and adds unnecessary multidimensionality in my opinion. Maybe it's because I wasn't around during the hacking heydays on codeforces.

    Mainly because it's just a matter of dumb luck: how your room turns out, whether there are hackable people in it, or even how many people from the room end up participating.

    If everything else on CF is mostly a function of your skill, awarding 100 points subject to luck seems unfair to me.

    • »
      »
      »
      15 months ago, # ^ |

      In-contest hacking is pointless

      Well then, why make it more so? The flaws you pointed out about hacks are, of course, flaws, but they need their own discussion (which, if I remember correctly, has come up in a few posts before). I still consider the hack system important, and if the contest organizers decide so too, I don't see why they wouldn't foster its use (in ways like the one I suggested in the comment above).

  • »
    »
    15 months ago, # ^ |

    I really hope people don't let the hacking system go to waste.

  • »
    »
    14 months ago, # ^ |

    Actually, Proposal One is, in my opinion, the most direct path to revitalizing the hacking system. In the current problemsetting environment, the vast majority of problem authors try to write strong pretests. Because it's usually impossible to know in advance that pretests will be weak, looking for hacks isn't worthwhile unless you know that you won't be able to solve any more problems. If Proposal One was implemented, however, contestants would know which problems are meant to support hacking, creating a strategic incentive to consider making hacks rather than attempting the next problem. The only other way I can think of to make hacking a significant focus of CF rounds again would be to make weak pretests very commonplace, which I think is unlikely because the opinion of the community has shifted strongly in favor of strong pretests.

    Personally, I'm not a fan of hacking-centric contests because I find solving problems more interesting than looking for bugs in other people's code, but if you prefer contests that involve some element of hacking, I think you should support Proposal One.

    Likewise, for those who prefer contests with weak pretests, Proposal One is a way to give authors the opportunity to write weak pretests without getting flamed by the community for violating what seems to be an unspoken norm of strong pretests. I think a lot of frustration with FSTs comes from the fact that contestants now expect strong pretests (because the vast majority of authors try to write comprehensive pretests), so they don't think it's necessary to spend time checking the details of their code (e.g. for int vs long long issues) because they assume pretests will check for them. Proposal One solves the hidden information problem by putting contestants on notice that they need to check their code carefully, meaning that FSTing represents a failure to check your code for errors or to rigorously justify your solution, rather than not knowing that you were expected to do so.

»
15 months ago, # |

As a frequent tourist of FSTland, almost all of my FSTs are likewise just a bad initialization or a small implementation bug.
This is because pretests are so strong nowadays: if all logic-related bugs get caught by pretests, then implementation-related bugs, which in my opinion deserve even less to cost you the whole problem, should also be caught by pretests.

»
15 months ago, # |

I completely agree with all four proposals.

»
15 months ago, # |

Most of your justifications are personal opinions.

I care very much about penalizing people who don't rigorously prove their solutions (it is barely a solution if you didn't prove it) before submitting. I think that is a far more significant issue than FSTs.

Proposals 2-4 are good, thanks to being "specific to authors who intend to write strong pretests".

Proposal 1 is way too reliant on your belief that everyone shares your personal opinions. As a person who wants to penalize people who submit unproven code, if this proposal were implemented, I would just write "Note that the pretests for this problem are intentionally made weak." in every problem, because I want those believers to live in fear. As an author, I will use the statement to convey the point I want to make, even by (ab)using "technical messages", even if they were created in the first place to achieve the opposite effect.

I see this blog as misleading. It merges two purposes: giving advice on how to make strong pretests, and convincing people that strong pretests are the right way. Giving good advice is much less controversial and gets you positive feedback and upvotes, and you use that to create a favorable context for a debatable point. You also make me (a person with opposing views) agree with 3 of your 4 proposals, which might make my disagreement look like nitpicking and make it seem like you are already winning the argument, while in reality those 3 proposals are largely orthogonal to the one I disagree with.

I'm not saying you did that on purpose, I expect you just didn't think that someone would disagree with any of the points. Also, we all use these tricks subconsciously because we want people to agree with us.

»
15 months ago, # |

Here's a controversial take — either pretests should be systests, or pretests should be only samples.

This is largely rooted in the belief that uncertainty in a contest is bad, and the two cases presented here remove a lot of meta-uncertainty about how likely your solution is to be correct based on the strength of the pretests.

It's not even that unprecedented to do this; for example, you can argue GCJ does something similar, since its small/visible data sets cannot test things such as maximal test cases. There are plenty of full feedback contests, and it would be nice to see CF lean more into its volatile rating system and spice things up with more volatile contests.

  • »
    »
    14 months ago, # ^ |

    This is largely rooted in the belief that uncertainty in a contest is bad, and the two cases presented here removes a lot of meta-uncertainty about how likely your solution is to be correct based on the strength of the pretests.

    I agree that pretests = systests and pretests = samples are the only two formats that completely resolve the hidden information problem described in the blog. I think pretests = samples is a boring format (I want to focus on thinking about ideas to solve problems rather than careful implementation), but from a competitive standpoint I agree that it has more integrity than e.g. weak pretests. (That said, I can accept some deviation from these formats due to logistical constraints, e.g. setting strong pretests that don't include all systests because running all systests in-contest would overwhelm the servers).

    There are plenty of full feedback contests, it would be nice to see CF lean more into its volatile rating system and spice things up with more volatile contests.

    I don't agree with this logic. The volatility of a rating system should be inversely proportional to the volatility of the underlying contests. As a toy example, suppose that each contest was a perfect indication of every participant's underlying skill (with skill changing over time, but no error term reflecting performance on the day of the contest). In this case, the best rating system would be as volatile as possible, fully updating each contestant's rating to their current performance. In contrast, if contest performance is highly volatile, each individual contest gives you limited information (relative to past history) on a contestant's skill, meaning that the best prediction of an individual's ability places a smaller weight on the most recent contest (and thus, rating updates should generally be small).

  • »
    »
    14 months ago, # ^ |

    GCJ is not doing anything*

»
15 months ago, # |

Strong agree.

While we are on this subject, I think hacks in contests should also be reevaluated. This is vestigial from TopCoder, and no OIs have it. Most successful hacks nowadays exploit hashtables (since pre/sys tests are generally strong), and I think it's lame to WA an otherwise AC submission because someone didn't include specific hashtable code. Such code is also harder in non-C++ languages, and IMO this is overall just an unfair/bad experience for noobs (especially sub-1600s losing rating for getting hacked during the 12-hour hack period of a Div3/4 contest). IOI, ICPC, Code Jam back in the day, and other premier contests have feedback, so I think this choice is already on firm ground.

A format of "full feedback" kills two birds with one stone: problems with pre vs sys tests, and weird hacking related judgments. No need to overcomplicate things.

  • »
    »
    15 months ago, # ^ |

    Don't you think a hashtable hack can help people think more about hashtables?
    Not so related to CP, but hash flooding is also a critical issue in the real world (e.g. the hash flooding bug in V8, a $30,000 hash flooding bug).

  • »
    »
    15 months ago, # ^ |

    A format of "full feedback" kills two birds with one stone: problems with pre vs sys tests, and weird hacking related judgments. No need to overcomplicate things.

    Yes, except that on AtC sometimes THE ENTIRE TEST SET is weak. It's already more complicated than things are supposed to be. While I would argue that hash-related tests belong in the pretests, the hack system really serves a purpose regardless.

  • »
    »
    14 months ago, # ^ |

    Agreed that in my ideal world we would have full feedback (and, barring that, pretests would always be as strong as possible given server constraints). I'm not an expert on breaking hashes, but I think anti-hash tests should be in pretests when possible. However, I'm not sure how much generating such tests depends on the individual solution at hand, so this may not be feasible. (If anti-hash tests are unique to the specific solution, I think this is a very strong argument against the existence of hacks, because hash-based solutions in rooms with active hackers get penalized more than others.)

»
14 months ago, # |

I am a fan of gojo satoru, he is my sensei

»
9 months ago, # |

We have no pretests in Chinese OI contests.

»
5 months ago, # |

Great