As we all know, a coder with rating A has a probability of 1/(1+10^((B-A)/400)) of solving a problem of difficulty B.
At present, the following situations can occur: two problems displayed with the same difficulty differ by 14.3% in acceptance rate, while two problems whose displayed difficulties differ by 100 have the same acceptance rate. Making difficulty accurate to units would reduce this error to about 0.1%.
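For reference, here is a quick Python sketch of where those numbers could come from. I'm assuming the 14.3% is the limiting case of two problems just under 100 apart in true difficulty that still display the same after rounding to hundreds, judged by a coder whose rating equals the displayed value:

```python
# Quick check of the figures above (a sketch under the stated assumptions,
# not anything official). solve_prob is the formula from the top of the thread.
def solve_prob(coder_rating, problem_difficulty):
    """Probability that a coder of the given rating solves a problem of the given difficulty."""
    return 1.0 / (1.0 + 10 ** ((problem_difficulty - coder_rating) / 400.0))

coder = 1500
# Rounded to hundreds: the limiting case is two problems just under 100 apart
# that still show the same displayed difficulty.
print(solve_prob(coder, 1450) - solve_prob(coder, 1550))  # ~0.143 -> the 14.3% gap
# Rounded to units: displayed-equal problems differ by at most ~1 point.
print(solve_prob(coder, 1500) - solve_prob(coder, 1501))  # ~0.0014 -> on the order of the 0.1% above
```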
Codeforces Beta Round #23 had only 765 participants, while Codeforces Global Round 9 had 21150; with more than 25 times as many participants, we can certainly estimate difficulties more accurately.
By the way, wouldn't it be better for problem ratings and coder ratings to have the same precision?
One reason is for visual purposes. Who wants to see every digit? Rounding to the hundreds is usually enough to get the difficulty level. People say, "oh, that's a 1500 problem". They don't say, "Ah, the classic difficulty of 1548!"
Think about why they display the rating as 1500, rather than 15.
However, I know some do prefer to know the exact rating, so maybe cf could allow some sort of extension to track it.
In fact, people can round the information themselves according to their own needs. It's like when people describe a coder's rating: “That is a 2300 coder”, not “That is a 2288 coder”.
I don't think anyone really says "2288 rating" over just something like 2200 or 2300.
Besides, problem ratings are still different from cf ratings because they don't have to be terribly precise. Actually, many problems are probably not rated correctly — in each contest there are factors like luck, the queue, pretests, etc., that affect the problem rating quite drastically. So it is hard to place an overly specific rating on a problem.
But again, I do realize that some would like to see the exact rating, whether it is accurate or not, so I think it would be a good idea if cf released exact ratings and maybe allowed users to have the option to "show exact rating" or "show approximate rating".
In my opinion, the rating of coders is the more inaccurate of the two. Each problem has been solved by tens of thousands of coders, while each coder has only solved hundreds of problems.
I think the difficulty ratings of problems are strongly skewed by their placement and the other problems in the set. Two examples:
If GR9 problem G were problem C or D instead, I think it would have twice as many solves.
Often in implementation-heavy educational rounds, the problems at the end of the set aren't nearly as hard as their rating would make you think they are. Lots of people just don't solve them because they run out of time implementing them. If those were the Div1B or Div1C of their own rounds, many of them would likely have a much lower difficulty rating.
This would at least make the ratings of the first four problems more accurate, benefiting most coders. You know, only 10% of GR9 participants solved Problem D. Moreover, I feel that the bias introduced by problem order could be measured through comparative experiments.
Sorry, you are right that the ratings of coders are not accurate either.
However, this doesn't mean ratings of problems are accurate. One of the reasons it is so hard to determine the rating of a problem is that everyone is different — person A could think a problem is very hard, and person B of the same skill level could think the same problem is very easy. This variation makes it hard to put a tens and a ones place on problem ratings — I feel it makes them too specific.
There are also other factors that help determine problem rating, some of which I am not aware of (I just made a blog post about this). Some problems are overrated, some are underrated. Mike actually addressed this issue by changing the problem rating formula a couple of weeks ago, which goes to show the inaccuracy of the ratings. Still, it is far from perfect.
That being said, I'll repeat that I am not inherently against this — I would support an option to show exact ratings of problems. The bottom line is, because of this variation, I don't really think changing a problem's rating by, say, 40 matters all that much.
It would not need to be an extra option if ratings were accurate to units. If you want to find the problems currently shown as 1500 difficulty, just use the filter and keep [1450,1549].
I guess the point I'm trying to make is,
Say you were given two problems, one of rating 1480 and the other of rating 1520. Could you tell which one is more difficult? I don't think anyone on this site can be sure which one is the higher-rated problem.
That means that there really isn't any need to differentiate on a scale that small.
I don't really have an opinion here, but I think it is important to note that it doesn't make sense to me to want a 15% variation in difficulty among problems with the same rating, but still want users' ratings shown to the units digit. Especially because a problem goes through thousands of tests before getting a rating (thousands of contestants attempt it), while a user only has a couple dozen contests.
So, if anything, the problem's rating is likely more accurate than the user's rating. If we should show the user's rating to the units, why not the problem's as well?
I know what you are saying: we cannot measure problem difficulty accurately. In fact, my point is that the ratings of both coders and problems are inaccurate. Why, then, are only coders' ratings shown to the units digit?
The rating of coders is not inaccurate. That value being inaccurate would mean there is some kind of "real rating" or something. But there is none.
The rating of the participants is kind of self-defining. It is what it is, determined by the results of the participant in contests. There is no right or wrong about it.
The error inherent in the Elo rating is independent of the error introduced when calculating (and rounding) the Elo rating, and the second error is the one I want to reduce.
No one knows more about the inaccuracy of the Codeforces problem rating system than I do. Two problems with a rating difference of 200 can actually be of the same difficulty to me. These ratings can be really subjective because we are good at different stuff. The system's aim is just to provide an approximate sense of the difficulty level.
To specify these ratings to units is like the old joke about the 20,000,006-year-old dinosaur specimen. Using seemingly accurate numbers for something that is already too inaccurate is meaningless because it does not change anything. Two problems with a difficulty rating of, say, 2147, can still have "a difference of 14.3% in acceptance rate". My suggestion is: if adopting the new rating system sounds like a joke, do not do it.
BTW, the rating system for coders is the way it is because ratings rise and fall; it cannot work by just giving +100, -100, or +0 for each contest. However, a problem's rating does not change. A dynamic rating system for problems is also impossible, because thousands of 1500-rated coders just like me practice on 2300-rated problems, which would cause even more inaccuracy.
Codeforces probably has more important things to do than making these meaningless and misleading changes.
There are two sources of variance: the first is the error of the formula itself, the second is the error from rounding to the nearest 100, and the 14.3% comes from the second one. Since there is almost no covariance between them, this change is equivalent to removing the second variance. So no, the variance of the acceptance rate of problems rated 2147 would not stay the same; it would shrink.
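To spell it out, under the simplifying assumptions that the two error sources are independent and that the rounding error is roughly uniform across a 100-point bucket:

$$\operatorname{Var}(\text{displayed rating}) \approx \operatorname{Var}(\text{formula error}) + \operatorname{Var}(\text{rounding error}), \qquad \sigma_{\text{rounding}} \approx \sqrt{100^2/12} \approx 28.9 \text{ points}.$$

That roughly 28.9-point component is what showing units would remove; the formula error, of course, stays.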
Well, I don't think you get what I've been meaning to say.
You seem to believe that we can create some sort of accurate rating system and help coders by specifying problem ratings to units. However, is there anyone who only does problems with a rating in [2555,2565]? I believe most of us look for problems in a range like [2400,2600] and pick from them. This is our way to eliminate the variance you are talking about. Specifying ratings to units does not change anything at all. Since not many coders care about this, the Codeforces team must have much more important things to do.
For example, say 12000 people try to solve a problem with true difficulty X. The problem is currently given a difficulty of X±42.1 (standard deviation). When accurate to units, the difficulty becomes X±13.2 (standard deviation), and everyone can find more suitable problems.
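To be clear, those exact figures are illustrative. Here is a minimal sketch of the kind of simulation I have in mind; the participant pool, the Elo solve model from the top of this thread, and the simple method-of-moments fit are all my own assumptions, not the actual Codeforces formula:

```python
# Sketch: how much does a difficulty estimate wobble with 12000 attempts?
# Assumptions: normally distributed participant ratings, Elo solve probability,
# and a method-of-moments fit for the difficulty. Not the real Codeforces method.
import numpy as np

def solve_prob(coder_rating, problem_difficulty):
    return 1.0 / (1.0 + 10.0 ** ((problem_difficulty - coder_rating) / 400.0))

def estimate_difficulty(ratings, n_solved):
    # Binary search for the difficulty D at which the expected number of solvers
    # (sum of per-coder solve probabilities) matches the observed count.
    lo, hi = 0.0, 4000.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if solve_prob(ratings, mid).sum() > n_solved:
            lo = mid  # too many expected solvers: the problem must be harder
        else:
            hi = mid
    return (lo + hi) / 2

rng = np.random.default_rng(0)
true_difficulty = 1800  # the "X" above (arbitrary choice)
estimates = []
for _ in range(200):  # replay the contest many times to see the estimate's spread
    ratings = rng.normal(1700, 300, size=12000)  # assumed participant pool
    solved = rng.random(12000) < solve_prob(ratings, true_difficulty)
    estimates.append(estimate_difficulty(ratings, solved.sum()))

# With 12000 attempts the statistical spread is only a few rating points,
# i.e. far smaller than the 100-point display rounding.
print(round(float(np.mean(estimates))), round(float(np.std(estimates)), 1))
```

In this toy setup the statistical uncertainty of the estimate is much smaller than the 100-point display rounding, which is the whole point.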
By the way, the change only takes 5 minutes.
I don't think problemsetters have been taking this seriously. Maybe it has been an approximate value all along.