nikgaevoy's blog

By nikgaevoy, history, 5 years ago, In English

Finally, I finished the implementation of an improved version of TrueSkill rating system that EbTech named "TrueSkill from St.Petersburg".

TL;DR

Results are here. Full repository is here.

Important notes on the results

Note that those results were obtained by running on the EbTech's testset (rounds only before Codeforces Round #645 and team contests excluded) and thus do not represent the current standings on Codeforces. Also, the file contains only users with at least 10 contests and at least one contest in 2020 (otherwise, the file would be very large). However, you may obtain full rating by running the algorithm on your computer, it should take about a minute. Also note that the results may seem not very impressive since I chose the first parameters out of the blue and they should be changed of course.

What is this rating system?

It is like original TrueSkill, but without its problems, like multiway ties. What is TrueSkill? It is a generalization of Elo/Glicko to the case when many teams compete simultaneously. However, the original TrueSkill was made to deal with many players, large teams and not so many ties, and therefore should not work correctly in our case. The improved version of TrueSkill was designed for the sport version of "What? Where? When?", thus it solves the original TrueSkill problems and perfectly match Codeforces-like competitions.

Pros

  • It is mathematically perfect. It does exactly Bayesian inference on the model with many simultaneous players (just like Codeforces), no approximate inference or any other tweaking of the results. Therefore, no other rating system could be more precise. However, there is a room for customisation to fit your personal preferences (the stability of the rating, etc.).

  • It is very fast. As I have already mentioned, it processes almost all Codeforces history in less than a minute. The complexity of processing contest with $$$n$$$ players is $$$\mathcal{O}\left(n \log \frac{1}{\varepsilon} \right)$$$ where $$$\varepsilon$$$ is the designated precision of users' ratings.

  • Teams out-of-the-box. Despite the fact that results above were obtained on tests with team contests excluded, it is not the rating system issue. The only reason why I excluded team contests from the testset is that CF-loader I took from EbTech's repository doesn't process team contests and I don't want to spend time for making a loader by myself. However, now team performance is computed as sum of player performances, but it could be easily changed to any other formula.

  • Reduced rating inflation, explicit performances, visible rating feature and more.

  • tourist on the top of the rating!

Cons

  • Some formulas lack of numerical stability. That happened because formulas from the original article contain some mistakes (at least they fail on some obvious tests like $$$(-m) \cdot \chi_{[-\varepsilon, \varepsilon]} = -\left(m \cdot \chi_{[-\varepsilon, \varepsilon]}\right)$$$) and I failed to retrace the author's steps in the computations. However, they seem to be somewhere near the truth and they are numerically stable, so I believe that it could be fixed.

Further discussion

Rating system parameters

As I already mentioned, I haven't spent lots of time choosing the best parameters for Codeforces case, so there is still some work with data ahead.

Normal vs logistic distribution

Which distribution expresses player rating and performance better? Any choice is very debatable. Currently my implementation uses normal distribution, but it could be easily replaced with logistic or any other distribution, the only thing you need to do is to implement operators like multiplication, division, addition and two multiplication on indicator operators (last two are more tricky than the others, but still not very hard, see article appendix for explanation what to do and this file as an example for normal distributions).

Large multiway tie on the last place

Codeforces standings have an unusual artifact of a large multiway tie of a players with zero score which does not perfectly match the theory and possibly affect those players' ratings too much. There are many ways to deal with this problem, e.g. exclude them from rating at all, make separate tie-$$$\varepsilon$$$ for them, etc.

Acknowledgements

I'd like to thank EbTech for starting this discussion that led us here, S. Nikolenko for pointing the solution out to me and MikeMirzayanov for great Codeforces system that crashed during my experiments only twice.

  • Vote: I like it
  • +317
  • Vote: I do not like it

| Write comment?
»
5 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by nikgaevoy (previous revision, new revision, compare).

»
5 years ago, # |
  Vote: I like it +22 Vote: I do not like it

I could not find Semenar in your list.. If your rating system is as good as you say, then why Semenar is not in the first place?

»
5 years ago, # |
Rev. 2   Vote: I like it +16 Vote: I do not like it

The most interesting question (to me) is: how different are the standings, have you tried to compare that? I don't really know what to make of the numbers because they're on a different "scale" (almost everyone has rating in $$$[1480, 1500]$$$ :P), but how much do ranks differ on average in your rating system vs the current (obviously we can't just look at the average because near the median, even small changes can make a big difference)? How many players have radically different rank in your system?

  • »
    »
    5 years ago, # ^ |
      Vote: I like it +10 Vote: I do not like it

    In short: I haven't.

    The underlying truth is that we should find the best parameters first. The problem with this rating system is that it is extremely precise. So rating sigmas converge too fast which cause this kind of Elo-hell in the middle. It should be fixed by choosing the appropriate parameters or any other way (e.g. maybe there is a bug in my code, but I doubt it) before comparing to other rating systems. Also, as far as I know, the original implementation that is mentioned in the section 4 of the article also contained some tweak designed to solve this problem. However, the ratings of those users who made their way to the top are seem to be pretty reasonable.

»
5 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Auto comment: topic has been updated by nikgaevoy (previous revision, new revision, compare).

»
5 years ago, # |
  Vote: I like it +54 Vote: I do not like it

I hope you are not telling the truth cuz I am 31240 out of 32889. Guess I should go back to selling watermelons :(

  • »
    »
    5 years ago, # ^ |
    Rev. 2   Vote: I like it +8 Vote: I do not like it

    Well, I grepped the history of your ratings and it says that your first performances made the rating system be confident that your rating is pretty low and not so much time passed after to make the rating system think that your initial rating has changed. It is not the rating system issue, it happened because of the parameters I set (the rating is too stable).

    Try to run it with larger SIGMA_GROWTH, it should help.

    • »
      »
      »
      5 years ago, # ^ |
        Vote: I like it -8 Vote: I do not like it

      Nice initiation! CF needs a solution to rating inflation in my not so important opinion.

    • »
      »
      »
      5 years ago, # ^ |
      Rev. 2   Vote: I like it 0 Vote: I do not like it

      Can you please elaborate? My rank is around 19,000 while someone who I 100% beat him all the time who is: GayLord1.1 (This person is chosen at randomly. Sorry for tag) is around from top 3,000. First contest, I performed better than him by a little bit if you are talking about first contest. If you are talking about first few, then I dropped more than him in the first 3 contests as for him, he dropped more than my minimum in the 4th contest. Does that make a Huge rank difference? Just because of 1 contest in the first few?

      • »
        »
        »
        »
        5 years ago, # ^ |
          Vote: I like it +3 Vote: I do not like it

        Current parameters were chosen almost randomly and thus rating system do not expect that your rating could change with time. At this point your results seem more or less reasonable.

        Again, you can run the algorithm with larger SIGMA_GROWTH, and it should fix this artefact.

»
5 years ago, # |
  Vote: I like it +10 Vote: I do not like it

I can't find myself on the list. Is there some sort of error?

  • »
    »
    5 years ago, # ^ |
    Rev. 2   Vote: I like it +10 Vote: I do not like it

    The dataset that was used includes only rounds up to $$$645$$$. You participated in less than $$$10$$$ contests before round 645, so you were omitted from the list because of space restrictions (the list only includes people who participated in at least $$$10$$$ contests before round $$$645$$$ and in at least one contest in $$$2020$$$).

»
5 years ago, # |
  Vote: I like it +8 Vote: I do not like it

Can you elaborate on "mathematically perfect"? I guess explaining the underlying model would help me understand.

  • »
    »
    5 years ago, # ^ |
    Rev. 3   Vote: I like it +13 Vote: I do not like it

    Sure! The full explanation isn't very short (see this and it's most important references for more), so I'll explain only the general idea.

    Firstly, assume that everything is Gaussian (or any other class of distributions you like, but normal distributions have some cool properties that facilitate calculation). It is the first and only assumption we have to make though. Then our problem is given a contest standings and prior team performances to compute posterior team performances. Note that I replaced players with teams, it isn't the same even when all teams have size 1, but it is pretty straightforward. The second assumption is that teams that tied teams have their performances almost equal (absolute difference is less than $$$\varepsilon$$$) while performances of the not tied teams on neighbouring places are differ on at least $$$\varepsilon$$$. And here comes the difference. The original TrueSkill(tm) is just uses these restrictions on neighbouring teams and hopes that multiway tie would not be a problem, while the rating system I implemented places tied teams into groups. Each group has a group-performance $$$l_{i}$$$ such that absolute difference between $$$l_{i}$$$ and the group teams' performances is at most $$$\varepsilon$$$ and $$$l_{i} - l_{i + 1} > 2 \varepsilon$$$. Now it is easy to see that all our restrictions form a tree (see figures on page 3 of this article) and thus we can run the message passing algorithm to make a Bayesian inference on a tree. However, we have a problem. The normal distribution after processing a node of the form $$$\cdot \chi_{[-\varepsilon; \varepsilon]}$$$ or $$$\cdot \chi_{> \varepsilon}$$$ ceases to be normal. However, we don't need to find exact posterior distribution and only have to find closest Gaussian (namely, we are lowering KL-divergence), thus we can make an approximate inference for nodes of this kind and repeat the algorithm until convergence (number of steps adds $$$\log \frac{1}{\varepsilon}$$$ to complexity, note that it is not the same $$$\varepsilon$$$ as above). Also note that despite the fact we approximate message when multiplying to an indicator function, we still can obtain posterior performances (closest Gaussians TBH) with arbitrary precision.

    Well, I don't expect that the explanation above would be very clear, but two things you should note are

    1. We don't make any additional assumptions except for assumptions that Elo/Glicko does and slightly different tie-model.

    2. We compute posterior distribution (inside the class of the normal distributions) with arbitrary precision.

    And it is the only rating system I know with those two properties! Moreover, as far as I understand, most "Elo-generalisations" like current CF rating system, AtCoder rating system, etc. assume that all player pairs are independent or something similar, which is far from truth. So the difference is that other rating systems promise to converge to the true rating after some number of contests while both TrueSkill-like rating systems provide the most accurate rating possible at any moment.

    • »
      »
      »
      4 years ago, # ^ |
        Vote: I like it +5 Vote: I do not like it

      Hey, sorry for the delay! To continue our discussion, most of the Elo-R system's approximations exist to deal with logistic distributions, whose heavier tails I preferred. If we instead use Gaussians, then Elo-R also comes close to being "mathematically perfect" in the Bayesian sense.

      I believe our systems can compared by their respective modelling assumption: TrueSkill+ has a specific $$$\epsilon$$$-parametrized model for ties, whereas Elo-R assumes that individual rounds have a very large number of players. It remains to be seen just how much damage is done when either of these assumptions fails. We also found that Elo-R has simpler human-interpretable formulas, while on the other hand TrueSkill+ computations are an order of magnitude faster.

      • »
        »
        »
        »
        4 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        But $$$\varepsilon$$$-parameterized ties is exactly the way CF compares player performances, right? The only assumption I made (which is not really necessary, only for simplicity) is that $$$\varepsilon$$$ is constant.

        On the other hand, I believe that there is a huge difference between the best possible approximation of the real rating and the approximation that converges to the real rating. Moreover, I believe that second approach leads to rating inflation and similar things because it is more biased for one-contest users.

        However, I tried to architect my code for easy replacement of distributions, formula for team performance, etc., so, if you have necessary formulas for logistic distributions, you could implement it and compare everything explicitly (be aware of current parameters, they are very stupid and should be changed).

        Anyway, I disagree that TrueSkill is less human-interpretable than [any other Elo-like system] since Glicko (and therefore, Elo) is corner case of TrueSkill itself and thus you can just interpret TrueSkill rating as equivalent Glicko rating when you want to compare two players.

        The only real problem I found is that top TrueSkill players are usually players who make a two year breaks between contests and win (or almost win) all the contests they participate. It is definitely reasonable, but in convergence-based systems they have pretty average rating (approx. CM or master) because they just don't have enough contests to converge their rating to their roughly infinite true posterior rating (note that it is different from just true rating). And I have reasons to strongly believe that it could be fixed with some more clever handling of $$$\sigma$$$ growth (or maybe I should set the lower bound for $$$\sigma$$$? Sounds promising). Anyway, there still remains a lot of work, mostly with data and CF-reader, and I don't want to do all this stuff, especially if noone is interested in practical use. But I encourage everyone who interested in underlying science (or, less probable, practical use) to try to tune this rating system and send me your observations (better in pull request :)).

        • »
          »
          »
          »
          »
          4 years ago, # ^ |
            Vote: I like it 0 Vote: I do not like it

          Yeah, $$$\epsilon$$$ being constant is exactly the assumption I'm talking about. On Codeforces, I might argue this is violated more severely than the assumption of highly populated rounds, though I don't know how much it matters in practice.

          What's your concern about convergence? Both algorithms are Bayesian, so if the priors and assumptions were correct, they should produce the Bayesian "best possible ratings" at all times.

          As for human-interpretability, I think the formulas on page 8 are pretty meaningful even without the Bayesian theory that was used to derive them. I provided my intuitions earlier in the paper. They may be a useful fallback if you don't fully believe the model.

          A few questions about SPb TrueSkill:

          • Are you saying the formulas are all exact? Page 4 contains the phrase "approximating the result with a normal distribution", suggesting that some intermediate results don't have a nice form.

          • How exactly are the unobserved latents $$$l_i$$$ modelled probabilistically? It seems to me that the ratings $$$t_i$$$ don't fully determine which players are tied, so I didn't quite catch how the paper makes a Bayesian model of that.

          • »
            »
            »
            »
            »
            »
            4 years ago, # ^ |
              Vote: I like it 0 Vote: I do not like it

            Well, if you really want, you can choose $$$\varepsilon$$$ based on native players' points in any way you want. There is no need for $$$\varepsilon$$$ to be constant. You could even infer it automatically if you add it into the graph! So, I don't see any specific problem with $$$\varepsilon$$$.

            About convergence. Suppose that we have a game with no ties and no teams. Also forget about distributions, it is not that important at the moment. Then I claim that any "generalisation of Elo/Glicko" should simulate TrueSkill with $$$\varepsilon = 0$$$ (or at least it's some corner case). Why? Because when we say that we have Glicko model with many players, we basically say that we have prior players' ratings, performances are just ratings with added noise, and players are compared based on their performances. So we have already set the Bayesian model. The only thing remains to choose is the class of distributions used. Now, two observations:

            1. This model is exactly the graph used in TrueSkill.

            2. True posterior distributions can be ugly, so, as always when we talk about Bayesian inference, we minimize KL-divergence.

            After that, we have two paths:

            1. To run something equivalent to TrueSkill.

            2. To find some formulas under stronger assumptions and to hope (or to prove) that it converges to the truth.

            The first is TrueSkill, the second is better when you need some tricky additional properties.

            About questions:

            • I am saying that posterior ratings are exact (to be precise, we can obtain them with any designated precision). It is easy to see that inferred distributions ceases to be normal. However, it is allowed to find closest distribution and loop until convergence (it is not easy to prove, here we use some properties of Gaussians, see references).

            • We want teams' performances $$$t_i$$$ to have the property $$$\left\lvert t_i - t_j \right\rvert \leq 2 \varepsilon$$$ for tied teams and correct monotonicity for untied. $$$l_i$$$ and $$$d_i$$$ are used to split $$$t_i$$$ in groups such that each group corresponds to a segment of length $$$2 \varepsilon$$$ and no two segments intersect.

            • »
              »
              »
              »
              »
              »
              »
              4 years ago, # ^ |
                Vote: I like it 0 Vote: I do not like it

              About the same questions again:

              • Ok so although the paper chooses not to get such high precision, you're saying that it's possible using methods in the references. Would you somehow still get a Gaussian end result, or does it become necessary to store more complicated distributions?

              • I understood how teams (or individuals) are split into segments of length $$$2\epsilon$$$, but there is often more than one way to do so. You can choose to make AB tied and CD tied, or you can make BC tied but put A and D into their own bands. And the band itself has some room to shift. If we view the round standings as an observation that the bands ended up a certain way, what's the likelihood?

              In summary, it sounds to me like yes indeed arbitrarily precise computations are possible, and doing so would involve factor graphs, but that method is a bit more involved than what TrueSkill does. You'd have to endogenize the tie model to better match Codeforces (a nontrivial extension), and represent complicated non-Gaussian distributions.

              Essentially, I'm arguing that both systems take some small shortcuts to make the math tractable. Elo-R (with Gaussians and no ties) can be thought of as an exact inference algorithm in the case where the number of contestants is practically infinite. I think this is actually pretty close to the truth on Codeforces. You can still draw the same factor graphs, but the result simplifies in the limit.

              • »
                »
                »
                »
                »
                »
                »
                »
                4 years ago, # ^ |
                Rev. 2   Vote: I like it 0 Vote: I do not like it
                • The paper chooses to get such high precision! The precision is hidden inside "loop until convergence". You don't need any additional methods to make it work. If you want to find closest Gaussian, you don't need any additional distributions, all the magic is hidden inside $$$\cdot \chi_{[-\varepsilon, \varepsilon]}$$$ and $$$\cdot \chi_{\geq \varepsilon}$$$ operators.

                • Still don't see any problems. We have an explicit graph, so we can easily compute likelihood up to some multiplicative constant. You don't need this constant, but you really want, it still seems to be solvable.

                In summary, everything works out-of-the-box, but it is hard to prove. Eventually, the proof has two parts. Firstly, you have to prove that matching first two moments minimizes KLD, and then you should explain why everything converges. See references for details. Of course, if you want to change Gaussians to Logistic distributions, you have to find the way to minimize KLD for them, but I guess that moment matching works.

                TBH, I don't see any advantages of an algorithm that needs additional assumptions and works a way slower ($$$\mathcal{O}(N^2 \cdot f(\varepsilon))$$$ vs $$$\mathcal{O}(N \cdot f(\varepsilon))$$$), sorry.

                P.S. One of my rated contests is the round with only 20 teams, is it still practically infinite?

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  4 years ago, # ^ |
                  Rev. 4   Vote: I like it 0 Vote: I do not like it

                  No, I don't consider 20 close to infinite. It's an exceptional case where the algorithm still seems to behave reasonably, but I think it overweights the evidence a bit. For similar reasons, the evidence is overweighted for the top and bottom players. As I said, we'll eventually have to study how much damage this assumption makes, and whether some bias-correction is needed. Same with any exogenous or endogenous model for ties (mine isn't perfect either).

                  Ok you're being a bit unclear (or is it the paper?), since you said earlier that true posteriors are ugly non-Gaussians. Arguably, minimizing KL after each round is a more rigorous approximation, but we wouldn't call it exact inference. Still, I'm very interested if it's possible, in the logistic case and with reasonable complexity in terms of $$$\epsilon$$$, to compute the posterior that minimizes KL to the truth!

                  P.S. my dataset is updated if you'd like to run your code again. It also auto-fetches new data if you add the contests to contest_ids.json. Since both systems are based on similar Bayesian models, you can probably copy my parameters.

                • »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  »
                  4 years ago, # ^ |
                  Rev. 2   Vote: I like it 0 Vote: I do not like it

                  First of all, the rating system that requires some bias correction is bad rating system (or at least not better than current). The idea of a rating system is to choose some reasonable parameters and leave everything as is.

                  Ok, I already mentioned it above, but I think I should elaborate. What do I mean when I say that our inference is exact? Let's start from scratch.

                  We have some Bayesian model and a family of distributions. It is easy to see that posterior distribution ceases to be in this family (I guess it could be proved that there is no finite-dimensional family of distributions that is closed under inference in this model except for something stupid). So we formulate our problem as a problem of finding closest distribution to the family. More formally, if $$$Q$$$ is a true posterior distribution, our task is to find the following

                  $$$ \operatorname{argmin}_{P \text{ in family}} D_{KL}(P || Q) $$$

                  or

                  $$$ \operatorname{argmin}_{P \text{ in family}} D_{KL}(Q || P) $$$

                  Note that these two things are different in general case (and moreover, have different meanings). But for family of Gaussians they are the same. And when I say that we solve the problem exactly, I mean exactly this problem (any of them). Sorry if it wasn't very clear.

»
4 years ago, # |
Rev. 2   Vote: I like it -10 Vote: I do not like it

if i don't know a easy thing(we are human) and asked for that..that a great mistake ?? i asked cause of downvote..

»
4 years ago, # |
Rev. 2   Vote: I like it +44 Vote: I do not like it

Jan 2021 update!

There's more I'd like to say about the Elo-R (now named Elo-MMR) system. I teamed up with inutard to write this publication. It contains experiments showing better accuracy than competing systems, and an optimization which makes it faster too (at a small cost to precision).

Based on the results in the TrueSkill-SPb paper you mention, it's likely that TrueSkill-SPb can be made to perform very well if implemented and tuned optimally. I don't have access to such an implementation, but if anyone would like to help, the codebase is open-source. I'd love to get a fair comparison! In the meantime, I'd like to point out two advantages of Elo-R:

  • Due to its simplicity, I can rigorously prove properties such as monotonicity (so the player's incentive is always to win), robustness (bounds on maximum rating change), and runtime (I haven't seen any proofs on how many iterations TrueSkill requires).

  • Elo-R can handle a variety of log-concave performance distributions, and has been tested for Gaussians and logistics. In principle, TrueSkill should be able to do so as well, but we'll have to wait for a logistic implementation to see how its speed and accuracy compares. One issue is that, when performances are non-Gaussian, the posterior can be rather complex. Elo-R tracks a more complex posterior in order to retain more information.

Elo-R lacks a realistic tie model, though I would argue the same is true of the alternatives. With Elo-R, we have two approaches: 1) take the limit as $$$\epsilon\rightarrow 0$$$, or 2) treat it as half-win half-loss, as Codeforces does. I haven't thought much about team models yet, though it would be a useful extension if done right!