Codeforces contest results and Data Analysis

Introduction

In tourist's recent blog post about a new strategy for some types of contests, there was an interesting comment from ftiasch. Apparently, some participants try to artificially boost their rating with the following strategy: start with problem C; if it goes quickly and smoothly, submit it and continue with the contest, otherwise don't submit anything. The assumption is that the other problems in the contest are created by the same author, so if one performs well on problem C, they are more likely to perform well on the other problems.

In the comment thread, some people expressed doubts about this strategy. Reading their comments, I wondered: is there any point in debating this with groundless arguments when there is a large volume of historical contest results that we can simply analyse to see whether the strategy really works? Luckily, Codeforces even provides an API, so there is no need to implement custom scrapers.

Collecting and cleaning up data

For the dataset, I've used all the contests with CF rules (it doesn't make sense to evaluate the strategy on other rulesets). Even among CF-rule contests there is a lot of variety; in particular, sometimes there are more than 5 problems. For the sake of the analysis, I've truncated every problemset to the standard "A-B-C-D-E" format (discarding F, G, H when they are present). Also, I've removed from the standings all users with 0 points, to reduce noise.
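For reference, here is a minimal sketch of how such a dataset can be pulled through the official API (the contest.list and contest.standings methods); the helper names and the exact cleanup steps are my own illustration, not the code actually used for this post:

```python
import time

import requests

API = "https://codeforces.com/api"

def api_call(method, **params):
    """Thin wrapper around the Codeforces REST API with naive rate limiting."""
    resp = requests.get(f"{API}/{method}", params=params)
    resp.raise_for_status()
    payload = resp.json()
    if payload["status"] != "OK":
        raise RuntimeError(payload.get("comment", "Codeforces API error"))
    time.sleep(0.5)  # be polite to the API
    return payload["result"]

# Keep only finished contests played under CF (Codeforces) rules.
cf_contests = [c for c in api_call("contest.list", gym="false")
               if c["type"] == "CF" and c["phase"] == "FINISHED"]

def clean_standings(contest_id):
    """Standings truncated to problems A-E, with 0-point rows dropped."""
    st = api_call("contest.standings", contestId=contest_id, showUnofficial="false")
    # problemResults in each row follow the same order as st["problems"].
    keep = [i for i, p in enumerate(st["problems"])
            if p["index"] in {"A", "B", "C", "D", "E"}]
    rows = []
    for row in st["rows"]:
        if row["points"] <= 0:
            continue  # users with 0 points only add noise
        handle = row["party"]["members"][0]["handle"]
        results = [row["problemResults"][i] for i in keep]
        rows.append({"handle": handle, "results": results})
    return [st["problems"][i] for i in keep], rows
```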

The next question is which users to compute statistics over. It is very likely that things work differently for Div2 users than for Div1 ones. Also, orange users are likely to have noisier and less consistent results than red and nutella ones. That's why I restricted the per-user statistics to users with rating >= 2400 at the time of scraping (note, though, that all users are considered when computing contest-level stats).
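One possible way to build that filter, reusing the api_call helper from the sketch above (user.ratedList returns the current rating of every rated user; again, just an illustration):

```python
# Restrict per-user statistics to "red" handles: rating >= 2400 at scraping time.
# Contest-level stats (e.g. top-10% cutoffs below) still use all participants.
rated_users = api_call("user.ratedList", activeOnly="false")
red_handles = {u["handle"] for u in rated_users if u.get("rating", 0) >= 2400}
```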

Verifying if the strategy works

First look at the data

Let's look at the scraped dataset. We have 695 contests (though many of them are Div2, so they don't affect our red-user stats). We have 107975 users in total, but only 383 of them are red.

Let's see how red coders perform on different problems. First, what are the success rates for each problem? Here is the data (95% confidence intervals are shown):

A            B            C            D            E
0.91 ± 0.00  0.80 ± 0.01  0.59 ± 0.01  0.35 ± 0.01  0.15 ± 0.00
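As an aside, the post doesn't say how these intervals were computed; a plain normal-approximation interval over all (red user, contest) pairs is one assumption that yields half-widths of this magnitude:

```python
import math

def rate_with_ci(successes, attempts, z=1.96):
    """Success rate with a normal-approximation 95% confidence interval."""
    p = successes / attempts
    half_width = z * math.sqrt(p * (1 - p) / attempts)
    return p, half_width

# e.g. a rate near 0.5 over ~10,000 observations gives a half-width of about 0.01
```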

As we can see, the rates are even monotonic (so the popular claim "E is much easier than D!11" isn't true on average). What other stats are there besides success rates? An interesting one is how often a user's solution places among the top solutions in the contest (e.g. by score). The data looks as follows (with "top" defined as top-10%):

A            B            C            D            E
0.38 ± 0.01  0.38 ± 0.01  0.37 ± 0.01  0.29 ± 0.01  0.14 ± 0.00

An interesting observation is that for problem E the top-10% rate is almost the same as the accept rate, while for problem A these two rates are very different.
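A sketch of how such a flag can be computed per problem and per contest; the exact tie handling and whether non-solvers count towards the 10% are not spelled out above, so this is just one plausible reading:

```python
def top_score_flags(scores, top_fraction=0.1):
    """Flag the top-10% scores for one problem in one contest.

    `scores` maps handle -> score on the problem; only positive scores
    (i.e. accepted solutions) compete for the top decile.
    """
    solved = {h: s for h, s in scores.items() if s > 0}
    if not solved:
        return {}
    cutoff = max(1, int(len(solved) * top_fraction))
    ranked = sorted(solved, key=solved.get, reverse=True)
    top = set(ranked[:cutoff])
    return {h: h in top for h in solved}
```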

A very naive attempt

The first idea for evaluating the strategy is to estimate the following: given that someone solves problem C, how much more likely are they to also solve some other problem? We could think for a moment and discard this idea right away, but for the sake of the analysis, let's compute the stats anyway. Here are the delta-rates (i.e. P[one solves problem X | problem C is solved] - P[one solves problem X]):

A             B             C             D             E
+0.01 ± 0.00  +0.05 ± 0.01  +0.41 ± 0.00  +0.09 ± 0.01  +0.06 ± 0.01

As we can see, all the lifts are above the confidence intervals (the lift for C is huge, which is expected: its rate is lifted to 1.0). But does this mean we've established that the strategy indeed works? One can say the answer is obviously NO: there is a strong bias towards contest difficulty. That is, for easier problemsets we are more likely to solve C as well as the other problems. So this data merely shows that "solving problem C" might be a proxy for "the problemset is easy", which indeed means we solve more problems, but doesn't mean we perform better.
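For concreteness, the lift itself is simple to compute once we have a solved/not-solved flag per (user, contest, problem); the record layout below is my own assumption:

```python
def delta_rate(records, target, cond="C"):
    """P[target solved | cond solved] - P[target solved].

    `records` is a list with one dict per (user, contest) pair mapping
    problem index -> True if that problem was solved, e.g.
    {"A": True, "B": True, "C": False, "D": False, "E": False}.
    """
    overall = [r[target] for r in records]
    conditioned = [r[target] for r in records if r[cond]]
    return sum(conditioned) / len(conditioned) - sum(overall) / len(overall)
```

The same helper works for any boolean per-problem flag, which is exactly what the next two sections rely on.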

Slightly less naive attempt

As we saw in the previous section, estimating our performance on problems without taking other participants into account isn't what we want. But what about comparing ourselves with everyone else for each problem separately? We've already computed the data for "being in the top 10% of scores for a given problem"; now let's compute the deltas conditioned on "we're in the top 10% of scores for problem C":

A             B             C             D             E
+0.15 ± 0.01  +0.19 ± 0.01  +0.63 ± 0.00  +0.18 ± 0.01  +0.10 ± 0.01

So, the strategy works after all? Not necessarily: one can argue that we're hitting a different bias. If a contestant solves the problems sequentially, then quickly solving A and B (and therefore getting good scores on them) automatically increases the score on C (as well as on D, E, F), simply because C gets submitted earlier and loses fewer points to the time-based decay. So these huge lifts are expected, and we again estimated not what we wanted.

Proper attempt

Finally, let's estimate what we actually want. From the very beginning, we needed to understand what happens to "performance" on problem X given good "performance" on problem C. Here, "performance" should be independent of the other problems (as we saw in the previous section, score doesn't work for this). A natural measure is the time spent on this particular problem, but we don't know it directly. An easy way to estimate it is to sort a participant's successful submission times and take the differences between consecutive ones. We assume that participants solve problems one by one in some order. Of course, sometimes this isn't true (I've seen successful submissions 5 seconds apart in the dataset), but it should hold most of the time.
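A sketch of this estimation, taking each participant's accepted-submission times per problem (e.g. from the bestSubmissionTimeSeconds fields of the standings rows) and differencing consecutive ones:

```python
def estimate_solve_times(ac_times):
    """Rough per-problem solve times for one participant in one contest.

    `ac_times` maps problem index -> time of the accepted submission,
    in seconds from the start of the contest. Assuming problems are
    solved one at a time, the time spent on a problem is the gap between
    its accepted submission and the previous accepted one.
    """
    times = {}
    prev = 0
    for index, t in sorted(ac_times.items(), key=lambda kv: kv[1]):
        times[index] = t - prev
        prev = t
    return times

# Example: accepted A at 0:07, C at 0:30, B at 0:41 (from contest start)
# estimate_solve_times({"A": 420, "C": 1800, "B": 2460})
# -> {"A": 420, "C": 1380, "B": 660}
```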

OK, now we define "performance on a problem" as the time to solve that particular problem, estimate it from the submission log, and define "good performance" as being in the best 10% of times for that problem. Without conditioning on problem C, the "good performance" rates look as follows:

A            B            C            D            E
0.34 ± 0.01  0.37 ± 0.01  0.36 ± 0.01  0.28 ± 0.01  0.14 ± 0.00

And the deltas look as follows:

A             B             C             D             E
+0.14 ± 0.01  +0.19 ± 0.01  +0.64 ± 0.00  +0.19 ± 0.01  +0.11 ± 0.01

So, one can finally see that the strategy indeed works! The effect looks rather significant, especially for the harder problems (almost doubling the probability of performing well on E). How ethical it is to use such a strategy is a separate question (one gets some constant increase to their rating at the cost of not having the fun of solving more contests, and also likely feeling guilty), but that is outside the scope of this post.

Conclusion

Using some basic data analysis, we determined that the described strategy indeed works quite well. There are many open questions though:

  • Is the explanation "it is easier to solve problems from certain authors" correct? Another potential explanation is that someone is simply having a good day (slept well, etc.), and "quickly solving C" is a proxy for that;
  • Does this generalize to div-2 participants?
  • Is it true that top participants perform worse on "purple" rounds?
  • Is there such thing as "affinity" of participant and problem author? Is it correlated with rating difference?
  • ... and many other questions.

It would be very interesting to see more data analysis of contest results — it might help to understand problem-solving process better and develop more interesting strategies.

Tags: strategy, data analysis, api
