Finally, I finished the implementation of an improved version of TrueSkill rating system that EbTech named "TrueSkill from St.Petersburg".
TL;DR
Results are here. Full repository is here.
Important notes on the results
Note that those results were obtained by running on the EbTech's testset (rounds only before Codeforces Round #645 and team contests excluded) and thus do not represent the current standings on Codeforces. Also, the file contains only users with at least 10 contests and at least one contest in 2020 (otherwise, the file would be very large). However, you may obtain full rating by running the algorithm on your computer, it should take about a minute. Also note that the results may seem not very impressive since I chose the first parameters out of the blue and they should be changed of course.
What is this rating system?
It is like original TrueSkill, but without its problems, like multiway ties. What is TrueSkill? It is a generalization of Elo/Glicko to the case when many teams compete simultaneously. However, the original TrueSkill was made to deal with many players, large teams and not so many ties, and therefore should not work correctly in our case. The improved version of TrueSkill was designed for the sport version of "What? Where? When?", thus it solves the original TrueSkill problems and perfectly match Codeforces-like competitions.
Pros
It is mathematically perfect. It does exactly Bayesian inference on the model with many simultaneous players (just like Codeforces), no approximate inference or any other tweaking of the results. Therefore, no other rating system could be more precise. However, there is a room for customisation to fit your personal preferences (the stability of the rating, etc.).
It is very fast. As I have already mentioned, it processes almost all Codeforces history in less than a minute. The complexity of processing contest with $$$n$$$ players is $$$\mathcal{O}\left(n \log \frac{1}{\varepsilon} \right)$$$ where $$$\varepsilon$$$ is the designated precision of users' ratings.
Teams out-of-the-box. Despite the fact that results above were obtained on tests with team contests excluded, it is not the rating system issue. The only reason why I excluded team contests from the testset is that CF-loader I took from EbTech's repository doesn't process team contests and I don't want to spend time for making a loader by myself. However, now team performance is computed as sum of player performances, but it could be easily changed to any other formula.
Reduced rating inflation, explicit performances, visible rating feature and more.
tourist on the top of the rating!
Cons
- Some formulas lack of numerical stability. That happened because formulas from the original article contain some mistakes (at least they fail on some obvious tests like $$$(-m) \cdot \chi_{[-\varepsilon, \varepsilon]} = -\left(m \cdot \chi_{[-\varepsilon, \varepsilon]}\right)$$$) and I failed to retrace the author's steps in the computations. However, they seem to be somewhere near the truth and they are numerically stable, so I believe that it could be fixed.
Further discussion
Rating system parameters
As I already mentioned, I haven't spent lots of time choosing the best parameters for Codeforces case, so there is still some work with data ahead.
Normal vs logistic distribution
Which distribution expresses player rating and performance better? Any choice is very debatable. Currently my implementation uses normal distribution, but it could be easily replaced with logistic or any other distribution, the only thing you need to do is to implement operators like multiplication, division, addition and two multiplication on indicator operators (last two are more tricky than the others, but still not very hard, see article appendix for explanation what to do and this file as an example for normal distributions).
Large multiway tie on the last place
Codeforces standings have an unusual artifact of a large multiway tie of a players with zero score which does not perfectly match the theory and possibly affect those players' ratings too much. There are many ways to deal with this problem, e.g. exclude them from rating at all, make separate tie-$$$\varepsilon$$$ for them, etc.