Open Dataset of Codeforces Submissions (End of 2024)

Hello, Codeforces community!

I am currently working on a project called "Codeforces User Analysis System for Generating Individual Training Recommendations". The goal of this project is to create a tool that recommends tasks to users, helping them improve their skills through solving targeted problems.

As the first step, I decided to collect data using the open Codeforces API. After spending about 6–7 hours gathering and processing the data, I thought it would be a good idea to share the dataset with the community. This way, anyone working on similar projects can save some time.
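For readers who want to reproduce a similar collection step, here is a minimal sketch of how one Submission object from the API's user.status method could be flattened into a row. The API-side field names (creationTimeSeconds, problem.contestId, problem.index, problem.rating, verdict, and ratingUpdateTimeSeconds / newRating from user.rating) follow the public API documentation. The helper names, the sample values, the "contest id + index" form of the problem identifier, and the idea of looking up the rating from the user's rating history are assumptions made for illustration, not a description of how this dataset was actually built.

```python
from bisect import bisect_right


def rating_at(rating_changes, t):
    """Return the rating in effect at Unix time t, or None if still unrated.

    rating_changes: list of dicts with 'ratingUpdateTimeSeconds' and
    'newRating', sorted by time (the shape returned by user.rating).
    """
    times = [c["ratingUpdateTimeSeconds"] for c in rating_changes]
    i = bisect_right(times, t)  # number of rating updates at or before t
    return rating_changes[i - 1]["newRating"] if i > 0 else None


def flatten_submission(sub, rating_changes, anon_handle):
    """Hypothetical helper: map one API Submission dict to a flat record."""
    problem = sub["problem"]
    t = sub["creationTimeSeconds"]
    return {
        "handle": anon_handle,
        "rating_at_submission": rating_at(rating_changes, t),
        "problem_rating": problem.get("rating"),  # some problems have no rating
        # Assumed identifier format: contest id followed by problem index.
        "id_of_submission_task": f'{problem.get("contestId")}{problem["index"]}',
        "verdict": sub.get("verdict"),
        "time": t,
    }


# Made-up sample data, shaped like API responses.
sample_submission = {
    "creationTimeSeconds": 1700000500,
    "problem": {"contestId": 1900, "index": "A", "rating": 800},
    "verdict": "OK",
}
sample_rating_changes = [
    {"ratingUpdateTimeSeconds": 1690000000, "newRating": 1350},
    {"ratingUpdateTimeSeconds": 1700000000, "newRating": 1420},
    {"ratingUpdateTimeSeconds": 1710000000, "newRating": 1390},
]

row = flatten_submission(sample_submission, sample_rating_changes, "user0")
print(row)
```

The binary search keeps the lookup cheap when a user has thousands of submissions but only a few hundred rating changes.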

What is this dataset about?

This dataset includes submissions from ≈15,000 active Codeforces users over the entire history of the platform, up to the end of November 2024. The dataset consists of 17.6 million records, with the following details for each submission:

handle: An anonymized and shuffled user nickname (e.g., user{i}).
rating_at_submission: User's rating at the time of submission.
problem_rating: Problem difficulty rating.
id_of_submission_task: Unique problem identifier on Codeforces.
verdict: Result of the submission (e.g., OK, WRONG_ANSWER).
time: Time of submission (in seconds since the Unix epoch).
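
As a rough illustration of working with these six fields, the sketch below parses records using only the Python standard library. It assumes a CSV layout with a header row matching the field names above; the actual file format on Hugging Face may differ, and the sample rows, the identifier values, and the helper names are made up for the example.

```python
import csv
import io
from datetime import datetime, timezone

# Made-up rows in an assumed CSV layout (header = the six field names).
SAMPLE_CSV = """handle,rating_at_submission,problem_rating,id_of_submission_task,verdict,time
user0,1420,800,1900A,OK,1700000500
user0,1420,1500,1900C,WRONG_ANSWER,1700001000
user1,1890,,1900C,OK,1700002000
"""


def to_int_or_none(value):
    """Convert a CSV cell to int, treating an empty cell as missing."""
    return int(value) if value not in ("", None) else None


def read_submissions(text_stream):
    """Parse submission records and convert 'time' to an aware UTC datetime."""
    rows = []
    for rec in csv.DictReader(text_stream):
        rows.append({
            "handle": rec["handle"],
            "rating_at_submission": to_int_or_none(rec["rating_at_submission"]),
            "problem_rating": to_int_or_none(rec["problem_rating"]),
            "id_of_submission_task": rec["id_of_submission_task"],
            "verdict": rec["verdict"],
            # 'time' is seconds since the Unix epoch.
            "time": datetime.fromtimestamp(int(rec["time"]), tz=timezone.utc),
        })
    return rows


submissions = read_submissions(io.StringIO(SAMPLE_CSV))
accepted = [s for s in submissions if s["verdict"] == "OK"]
print(len(submissions), "rows,", len(accepted), "accepted")
print(submissions[0]["time"].isoformat())
```

With the full 17.6 million records, streaming the file row by row (or using a dataframe library) is likely more practical than building one large list.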

Where to download the dataset?

I have uploaded the dataset to Hugging Face:
UsersCodeforcesSubmissionsEnd2024

How can this dataset help?

Save time: No need to spend hours collecting data. It’s already processed and available in a ready-to-use format.
Support AI projects: This dataset can be used to develop training systems, analyze user behavior, study problem difficulties, and more.
Inspire new ideas: Perhaps this dataset will inspire you to start your own projects.
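
As one example of the kind of analysis mentioned above, the sketch below groups submissions by the gap between problem_rating and rating_at_submission and computes the share of OK verdicts in each group. The function name, the 200-point bucket width, and the sample rows are assumptions chosen for illustration. Note that this is a per-submission acceptance rate, not a per-user solve rate.

```python
from collections import defaultdict


def accept_rate_by_gap(rows, bucket=200):
    """Share of OK verdicts per bucket of (problem_rating - rating_at_submission).

    Each key is the lower bound of a bucket, e.g. 0 covers gaps 0..199
    and -600 covers gaps -600..-401 when bucket=200.
    """
    totals = defaultdict(int)
    accepted = defaultdict(int)
    for r in rows:
        gap = r["problem_rating"] - r["rating_at_submission"]
        key = (gap // bucket) * bucket  # floor division also works for negatives
        totals[key] += 1
        if r["verdict"] == "OK":
            accepted[key] += 1
    return {k: accepted[k] / totals[k] for k in sorted(totals)}


# Made-up sample rows using the dataset's field names.
sample_rows = [
    {"rating_at_submission": 1400, "problem_rating": 800, "verdict": "OK"},
    {"rating_at_submission": 1400, "problem_rating": 900, "verdict": "OK"},
    {"rating_at_submission": 1400, "problem_rating": 1500, "verdict": "WRONG_ANSWER"},
    {"rating_at_submission": 1400, "problem_rating": 1500, "verdict": "OK"},
    {"rating_at_submission": 1400, "problem_rating": 2000, "verdict": "WRONG_ANSWER"},
]

rates = accept_rate_by_gap(sample_rows)
for gap, rate in rates.items():
    print(f"gap >= {gap:+d}: {rate:.0%} accepted")
```

Rows with a missing rating on either side would need to be skipped before calling this function.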

Wishing you productive learning and good luck with your projects! :)

Comments (11)

denk

7 weeks ago, # |

Auto comment: topic has been updated by denk (previous revision, new revision, compare).

naivedyam

Don't you think it could be full of errors if you specifically want to use it to decide which problems to recommend to people in a particular rating range? Many people use multiple platforms, and some even quit Codeforces, practice somewhere else, and return as Masters or Candidate Masters.

7 weeks ago, # ^ |

Yes, I've thought about that, and I have some ideas on how to mitigate the impact of such errors. However, this blog isn't about my project but about the dataset itself. Of course, this data can be used not only for machine learning but also for gathering other types of statistics.

Agarwal

Oh, so that's why we're all facing the queue...

SomethingNew

6 weeks ago, # |

interesting stuff

jagan028

5 weeks ago, # |

Do you mind if I try the same as a hobby project?

5 weeks ago, # ^ |

Of course, I don't mind; on the contrary, I'll only be glad!

Thank you so much :D, will share if I get good results!

AkiLotus

This dataset includes submissions from ≈15,000 active Codeforces users, over the entire history of the platform, up to the end of November 2024

Tfw I found myself in the dataset... so I am active, in a way?

For the sake of things, kudos for actually censoring the handles in the dataset. Not sure revealing would harm anyone, but I think we all do prefer a bit of privacy.

How did I figure myself out, you ask?

Apologies for not clarifying what is meant by "active" in this context. This dataset considers submissions from users who were included in the leaderboard rankings at the time of data collection (i.e., those who participated in at least one contest in the last six months).

Woah, I see. I thought "active" at first meant those with top activity count of all time, turns out it was just current active ones with data spanning through "all time". Punctuation matters, I guess. It's good to know now though. ;)

denk's blog
