Блог пользователя Um_nik

Автор Um_nik, история, 3 года назад, По-английски

I can sometimes see people doing some CP-related projects for their courses/portfolio/community, and I want to suggest an idea of such a project that would benefit me as a CodeChef Admin and all the problemsetters, I believe.

Issue: There are a lot of problems on different platforms, and it is really hard to check if some problem was already used somewhere or not. Sometimes googling works, but most of the time I have to rely on my own experience and memory, which are far from perfect. It's hard to search for problems, because they usually have some kind of story, and the same problem can be formulated in a lot of different ways.

Proposed solution: Create some kind of smart database, that will collect the problems from all known sources and will have some very smart search engine that would focus on the internal meaning of the problem, instead of how it is formulated.

I have no idea how to do any of that :) I would assume that it would require some kind of very smart ML and a lot of community effort. But I thought that it might be an interesting project for someone and I really hope that it will be done by some amazing person.

Just a database of problems from different sources with any kind of full-text search would also be nice.

I think I'm ready to pay some reasonable money ($1000? I understand that it's small money for a high-level ML person, but I'm not rich) for a well-done job, and maybe we can crowdfund it even more.

  • Проголосовать: нравится
  • +470
  • Проголосовать: не нравится

»
3 года назад, # |
Rev. 3   Проголосовать: нравится -98 Проголосовать: не нравится

Isn't that against Codeforces policies? Maybe not.

UPD: It seems I'm wrong.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +82 Проголосовать: не нравится

    Um... what here can be interpreted as against Codeforces policies? Proposing paid job?

»
3 года назад, # |
  Проголосовать: нравится +133 Проголосовать: не нравится

I would say it's very hard. Even the subproblem to parse two statements and say if they are similar already requires shooting some state-of-the-art cannon at it. Not to mention the software engineering effort required to create such a database of problems...

The only way in which I see it remotely doable, is to manually categorize thousands of problems based on difficulty, algorithms / operations used and give the admin option to search through some problems which roughly fall in the same bucket.

Also, if this tool became publicly available it would have some serious implications for the entire competitive programming...

  • »
    »
    3 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +35 Проголосовать: не нравится

    Yes, but I thought "shooting some state-of-the-art cannon" is not that hard in ML world, but I know nothing about ML and very much might be wrong. There already were some attempts (actually, let's ping AlexSkidanov) in SOLVING cp problems using ML, categorizing sounds much easier and much more ML-like for me, but again, I know nothing about this field.

    to manually categorize thousands of problems based on difficulty, algorithms / operations used

    We already have huge amounts of data — problemset of CodeForces and other platfroms with manually set tags and editorials.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +87 Проголосовать: не нравится

      NEAR failed with that and pivoted. They are now yet another crypto startup.

      • »
        »
        »
        »
        3 года назад, # ^ |
          Проголосовать: нравится +23 Проголосовать: не нравится

        I know that they are doing blockchain now, my point is that 3 years ago people with some real knowledge in ML tried to do something harder, so maybe it's not that impossible as one might think.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +31 Проголосовать: не нравится

      It's a common misconception. Either you can use something fully out-of-the-box, or you're going to suffer some serious pain.

      In the world of transfer learning, you'd start with a huge pretrained model that's capable of doing something (like recognizing objects), and your goal as an ML engineer is to fine-tune it to your application. But this isn't easy in our case, as the ability of reading is insufficient to solve competitive programming problems as I demonstrate by not being red.

      Starting with Codeforces database and our tags, one could create a fairly simple system which searches in some range of problem difficulties for tasks which have the largest number of similar tags. But the sad truth is that 'DP + 2200' is probably not going to be useful to you as a contest admin, because you'd have to manually search through a ton of problems (so the problems would have to be more carefully categorized for it to be useful). For anything beyond that, the difficulty of creating such a system rises exponentially, I'm afraid :(

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится -94 Проголосовать: не нравится

      if you dont know anything about ML then you might need to think twice before suggesting projects related to it

      • »
        »
        »
        »
        3 года назад, # ^ |
          Проголосовать: нравится 0 Проголосовать: не нравится

        and you should think twice before commenting something

        • »
          »
          »
          »
          »
          3 года назад, # ^ |
            Проголосовать: нравится -47 Проголосовать: не нравится

          oh really. my downvotes are from umnik idol worshippers, not from level headed people

          • »
            »
            »
            »
            »
            »
            3 года назад, # ^ |
              Проголосовать: нравится +42 Проголосовать: не нравится

            Fun fact: almost every time I have seen someone saying "I got downvoted because X" it's fairly clearly at least partially wrong.

            • »
              »
              »
              »
              »
              »
              »
              3 года назад, # ^ |
                Проголосовать: нравится -32 Проголосовать: не нравится

              cant resist the need to seek contribution at any opportunity can u

              • »
                »
                »
                »
                »
                »
                »
                »
                3 года назад, # ^ |
                Rev. 2   Проголосовать: нравится +21 Проголосовать: не нравится

                case in point (more or less).

                A comment that gets about +15 points has basically zero effect on my contribution. So no, I write such comments only because I want to say something.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +66 Проголосовать: не нравится

      One thing we've done back then was pay people to rewrite statements into short mathematical form. I might be able to excavate that dataset.

      On the full descriptions with "Anna was given a string for her birthday" what you want is probably not possible. With short mathematical descriptions something like finding 10 most similar problems could be within reach.

      But to your point, back then we might have been somewhat dilusional about what's possible, so our attempts from 2017 are no indication of what is possible today.

  • »
    »
    3 года назад, # ^ |
    Rev. 3   Проголосовать: нравится +69 Проголосовать: не нравится

    Maybe it should parse not statements, but solutions. This way no experts are required to manually label problems.

    When statements are similar, problems can still be quite different. However, similar solutions mean similar problems.

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +39 Проголосовать: не нравится

Something like OEIS for CP-Problems would be really cool!

The defining structure of a problem is some function $$$P$$$ with $$$P(input)=output$$$.

I imagine a database of problems and solutions. We feed it some $$$input$$$ and some $$$output$$$ and it checks and returns all problems $$$P$$$ for which $$$P(input)=output$$$.

Spoiler
  • »
    »
    3 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +19 Проголосовать: не нравится

    This has a fundamental problem of assuming that the equivalent tasks have the same input and output. I'd say only in few cases it will be true. But predominantly, only the idea to solve it will be the same.

    • »
      »
      »
      3 года назад, # ^ |
      Rev. 3   Проголосовать: нравится +8 Проголосовать: не нравится

      How do you mean? Do you mean the general formatting of the input and output or do you mean the explicit content of some input and output? The former I agree, that's what I commented in the spoiler. For the latter, I don't propose matching the content of inputs and outputs in the database. I propose a database of functions $$$P$$$ and we check for each $$$P$$$ in the database whether it yields $$$P(input)=output$$$. (Which is of course computation-intensive. But after all, I'm just daydreaming.)

      Or maybe I just misunderstand your concern?

      • »
        »
        »
        »
        3 года назад, # ^ |
          Проголосовать: нравится +5 Проголосовать: не нравится

        Ah, sorry, I didn't see your spoiler :)

      • »
        »
        »
        »
        3 года назад, # ^ |
          Проголосовать: нравится +5 Проголосовать: не нравится

        Maybe this is where categorisation can help. You can reduce the computation by tag filtering. Also if someone is trying to steal a problem, they can always shuffle up the input variables. That will be hard to resolve.

  • »
    »
    3 года назад, # ^ |
    Rev. 2   Проголосовать: нравится +34 Проголосовать: не нравится

    I believe this would be a good way to create the database, and not too difficult to implement well. Indeed, OEIS would be a good model for this project.

    To make the search more efficient, we could also have some canonical inputs, such as some fixed list of numbers for all problems whose input is a list of numbers.

»
3 года назад, # |
Rev. 3   Проголосовать: нравится -96 Проголосовать: не нравится

are you joking? 1000$?

first of all the projects in nlp are pretty hard and that is why there are lots of companies which pay way way more to help them with nlp problems , and secondly if some one can create such thing why in the world he/she should waste it on CP? you can use such tool for way way more things and become rich. at this point and with the available methods doing this job with 1k$ is just a joke. also just considering required hardware make it even more like a dream.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +48 Проголосовать: не нравится

    he is not rich

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +51 Проголосовать: не нравится

    I'm saying that I'm ready to pay $1000 with my own money. Maybe platforms with more funds will also be willing to pay for such a project, I can't say for them.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +33 Проголосовать: не нравится

    You can start from simpler things. And no, no one will be doing this for 1k$, you can look at um_nik's offer as some indicator that people need it.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +45 Проголосовать: не нравится

    Imaging registering a new account just to critize somebody.

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +17 Проголосовать: не нравится

If the solutions/submissions are publicly available, I'd bet it's much easier to detect isomorphic AC solutions than the texts. Do we have formal verification experts here?

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    I would prefer not to implement model solution to a problem before approving it, and then I would prefer not to send it to open project for everyone to see.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится 0 Проголосовать: не нравится

      The first issue is a tough one :)

      The second issue can be solved by running the queries locally. Probably even without downloading the whole database, and probably even without having interpretable codes in the database in the first place.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +8 Проголосовать: не нравится

      Ask the problem proposer to write the model solution

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +8 Проголосовать: не нравится

If some machine can interpret the intrinsic meaning of the problem, then it almost equals to that machine will be able to solve the problem.

Comparing the solutions (source code of different problem statements), I think it will be more feasible.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +11 Проголосовать: не нравится

    tourist is the name of that machine

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    If some machine can interpret the intrinsic meaning of the problem, then it almost equals to that machine will be able to solve the problem.

    Not really, second is way much harder than a first, but I definitely agree that even first is already far beyond what we can expect from AI in a near future.

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +148 Проголосовать: не нравится

I teach machine learning to undergrad students and I've actually worked in the ranking team of a search engine. I've had a lot of students with CP background, and naturally I tried to nudge them towards picking a CP-related ML course project like this one. This has never worked a single time in 3 years, but I don't lose hope :)

So I've thought about this problem for a while. For a search engine, there are three types of data that can be used:

  • Problem statements are freely available, but I'm fairly certain that using just them won't work, at least not with reasonable accuracy (maybe with a $10M budget for model development and data labeling it will).
  • Short atCoder-style formal statements and editorials may actually work, but they are not always available and in most cases need to be processed by hand.
  • Problem solutions are probably the best approach here as it is fairly easy to check if two programs solve the same or similar problems, and so the search can be done very accurately with the source code instead of the problem statement. I think it is possible to negotiate with main OJ admins to get them (on the terms that they become available to only a limited group of people developing the search engine and not become public).

Creating a model for searching by a solution source should be relatively easy, and what we can do to generalize it to search by a problem statement is to train a separate model that (inaccurately, because the data is scarce) turns a short problem statement into the same vector representation as the average source code of its solution.

In terms of funding, I think it's a bad idea to offer a small amount. In my opinion, the project should be either crowdfunded with a reasonable market-salary budget (in the 10-20k range) or fully crowdsourced (that is, nobody gets paid, but a lot of people need to help with data labeling and other maintenance stuff).

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +3 Проголосовать: не нравится

    it is fairly easy to check if two programs solve the same or similar problems

    Can you point me to some resources or what keywords to google? I only found results that states that the problem is undecidable.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится 0 Проголосовать: не нравится

      I believe sslotin meant that it's easy to check in a similar way to the plag checker. We're not checking that they do the exact same thing, just that the solutions are similar enough. For example, taking in input $$$K\ N$$$ instead of $$$N\ K$$$, the overall solution could be exactly the same, but with

      std::cin >> N >> K;
      

      vs

      std::cin >> K >> N;
      

      Also, there can be very different-looking problems that have very similar solutions. I remember this happened in a round a few months ago, but I can't recall which one. Other than the input/output format, one problem could be directly translated into the other. However, if you just look at the problem statement / input / output, you won't see any correlation.

»
3 года назад, # |
  Проголосовать: нравится +19 Проголосовать: не нравится

You want to automate part of your work as cp admin. Understandable, but if you think, there are not a lot of other cp admins who would also benefit from this solution, for other people like coaches or teachers it might help, but wouldn't recude their work dramatically.
If you're determined with this idea, I'd suggest to look into more generic solution (for papers, patents, news?), with a custom application for cp problem search. In this case it would be possible to get founding and most likely such tech would worth millions.

»
3 года назад, # |
  Проголосовать: нравится +13 Проголосовать: не нравится

It would most likely be a multi hop reasoning model, where each solution (algorithm) will be represented by a vector. Look into OpenAI GPT-3 model and its capabilities Link to GPT-3. Also It would be a good idea to check Neural Network Theorem solver Link to the article about theorem solver. Maybe a combination of the two would be suffice to represent each solution as a vector. Similar solutions would be closer in the vector space.

»
3 года назад, # |
  Проголосовать: нравится +66 Проголосовать: не нравится

Reading some of the comments, I think any approach that involves making an AI that can read and understand problems is right out. That's simply not feasible.

The best that can be done is a respectable search engine with some domain-specific optimizations (modern search engines contain ML components of course but it is not really what I mean above). We have some advantages here. There is a lot of standard terminology and the search space is small, we can probably get hands on about 20k problems. If there is any more-or-less good open-source search toolkit (we don't need anything close to the state of the art), then this might actually be doable, but still require a dedicated team and a lot of time and money. But at this rate, I'm not sure if you'd actually get better results than googling and heavy filtering.

I think editorials are the best thing to search. Editorials often contain relatively formal and simple problem statements (i.e. no Vasya finding queries in the left pocket).

Here's the thing though. Very often we have problems that haven't exactly been proposed before, but you read it and you kind of have seen everything before. Here's an example of such a problem (and I felt the same way solving this problem). You'd want to avoid creating this feeling, but now you need to have some level of intelligence to be able to distill this problem to classical subproblems and somehow understand that the transition is simple enough for it to be considered a repeat of tha problem.

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +89 Проголосовать: не нравится

Creating a CP problem aggregator was one of my biggest personal projects, and one idea of mine was to do that as well (amongst others like problem recommender, contest creator, custom interesting contest formats like lockout, etc.)

Contrary to what others believe, I actually think it’s doable; however, my idea would be to use submissions instead of statements to infer similarity between problems. I have done some work in the past years in that sense, but nowadays I just don’t have the time to integrate it into the platform. I might do that once I free myself some time. In the mean time, if someone is willing to put effort in the project and wants to discuss ideas and implementation, feel free to PM me.

The platform at its current state is at https://binarylift.com (it has some bugs, so you might prefer to check out the old slower version at https://competitive.herokuapp.com)

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится 0 Проголосовать: не нравится

    where is the option to add CF handle in binarylift.com ?

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +8 Проголосовать: не нравится

      Once you’ve created an account, you can go to your profile and link your OJ accounts there. The submissions should be updated in some time (it might take a while nowadays as there are quite a few accounts on the platform and it doesn’t scale extraordinarily well).

»
3 года назад, # |
  Проголосовать: нравится +30 Проголосовать: не нравится

I'm interested in supporting such a project as well (financially too!).

»
3 года назад, # |
  Проголосовать: нравится +5 Проголосовать: не нравится

I guess this might be possible to implement such a model only in a world, where authors don’t write problem legends… Of course, such a world doesn’t exist, as problem legends are great and essential.

»
3 года назад, # |
  Проголосовать: нравится +18 Проголосовать: не нравится

Isn't that the type of problem that the tags are proposed to deal with? It is much easier to make better tags in, say, CF, than to make a full search engine of this kind.

Also, if you want just a full-text search engine, without the internal meaning thing, then in that case, you can use any search engine, like Google, really. Recall how search in cplusplus.com works, for example. If you want a full-text search over problems with some specific tag, which seems to be the main use case, it is also possible with proper database design and some search line magic.

And, tbh, I doubt that it is possible to do much more than that.

I also feel that we really need something like new e-maxx, with good modern algorithms, good reliable code in modern C++, where usability, extensibility and readability are preferred over brevity (not the case of e-maxx at all), and example problems on each topic (just like in e-maxx).

I think this is the thing that may be worth paying for, because it is clear for me that crowdsourcing is simply impossible here (well, except for some basic topics, but they are not the point), and that no one would really do it all by himself.

  • »
    »
    3 года назад, # ^ |
      Проголосовать: нравится +22 Проголосовать: не нравится

    I think tags (even if significantly improved) still cover way too many problems to reasonably allow you to search if a problem has already been used. (On a side note, Codeforces tags are pretty bad and I'd love to see a revamp, but I'm out of ideas how to actually improve them).

    The latter thing could probably be reasonably achieved. After all, cp-algorithms is open for contributors as far as I can tell, and with enough support, you could start a campaign to rewrite code examples to have higher quality.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +29 Проголосовать: не нравится

      Codeforces tags are pretty bad and I'd love to see a revamp

      It would be cool if there were user-defined tags. Tag editing is already gated to div 1, so spam shouldn't be an issue. The bigger problem might be getting people to agree on canonical names for various techniques or when certain tags are even appropriate. This will naturally force the community to make and maintain a "Ultimate Topic List"-like wiki for the tags, which would be an amazing side effect in itself.

      I would say image board sites like danbooru.donmai.us or e-hentai.org are good examples of these kinds of tagging system.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится 0 Проголосовать: не нравится

      I think, tags + regular full text search on editorials should do the trick. However, I definitely agree that current CF tags are not really capable of dealing with anything.

      As for cp-algorithms, as far as I see, it is just a straightforward English translation of e-maxx (even not full), even with the same bugs (e.g. the well-known comparator that is not a comparator) and mostly the same problems with the code style. It also doesn't cover any more or less advanced topics, and even some not really complicated algorithms like halfplane intersection are out of scope.

      And even with all that, rewriting all code examples is simply too much work for one person. In fact, the moderators are needed to maintain the quality of articles and code. Another solution is to find someone (like, a big tech company?) who would sponsor such kind of campaign and new articles.

  • »
    »
    3 года назад, # ^ |
    Rev. 3   Проголосовать: нравится +16 Проголосовать: не нравится

    I know it's offtopic, but I'd be pleased to learn more about your (and others') ideas about improving the source code at e-maxx.ru. You're right that the original goal behind e-maxx.ru's snippets was brevity and, to the extent possible, the clarity of the original idea (i.e., not hiding the algorithm's core idea under 10 layers of abstraction — both because pre-written code was very uncommon 10-15 years ago, and also with the teaching aspect in mind).

    Do you mean to create a prewritten library that would provide, say, a generic "graph" primitive with all algorithms applicable to it? Is the main point the ease of reuse as a pre-written code or something else?

    P.S. I'm asking because I feel like there's a whole bunch of different directions in which one can go, and the aspects you mention seem contradicting at first glance. E.g., there's a good C++ algorithm library with great extensibility and flexibility — the Boost Graph Library — but the price is low readability (especially of the internal implementation) and complex usage (one really needs to sit down and learn it before being able to use it). I'm aware of several huge commercial projects that switched away from Boost because of these and other complexities caused by its flexibility, and I would definitely prefer e-maxx.ru/algo not going into that direction (I mean being a weak half-complete clone of Boost :D) But I probably misunderstood you.

    • »
      »
      »
      3 года назад, # ^ |
        Проголосовать: нравится +4 Проголосовать: не нравится

      I am not very familiar with the Boost Graph Library, I never used it before, but I spent a couple of hours reading code and my first impression was that its main goal is to make sure that any algorithm can be efficiently programmed without extending the library. It is a good approach and I genuinely like it, but it is not what I meant.

      My issues with e-maxx.ru and its successors are connected not with global architecture of the library, but rather with its code style. As the one who mainly focuses on ICPC style contests, I don't normally use any prewritten code (except for short 40-line template), so I don't really mind if every time I write some data structure it is different a bit. But only a bit! I don't use or write code that relies on how I would use it. For example, many containers (like, segment tree) currently use global variables to store its values. It immediately disallows me to use more than one segment tree at once, as well as leads to other bugs specific to usage of global variables. And when I say that the code is not flexible or extensible, I mostly mean these types of problems that are needed to be fixed.

      The way I see how it could be done is to choose and then follow some simple code style, that would also show some proper things and be nice for teaching (from the programmer's point of view) purposes. Basically, I see such a code style as a set of rules like this (of course, it is not complete, I will include only things that annoy me the most).

      • No global variables, except for constants and no magic numbers. goto is also forbidden.

      • Every container should be a class that encapsulates what it contains and provide an interface to work with. It is useful every time you need to reuse a data structure from one problem into another.

      • Since C++ is effectively duck typed, it is also a wise idea to make the type of the elements a template. It is useful every time you realise that you need long long instead of int. Just for the record, #define int long long is not a real solution.

      • Similarly to the containers, every algorithm should be a separate function. If it can take iterators instead of explicit vector or string, it is also good to do. In general, doing things STL-way is preferable. The same applies to the types.

      Example of Lomuto partition
      • If you need a mergeable container, it is not an excuse to give up writing a class, because you can use move semantics.

      • Every time you write a data structure for tree, it is better to write a class for tree and an inner structure for node, that would not be accessible from outside. Why we need two classes while the outer usually has only one field? Two reasons. The first is proper interface, and the second is that it is not always true. The good example is skew heap, where you can add field size only to the outer class.

      Example of skew heap
      • etc.

      So, in general, I think that it is better not to make Boost, but a kind of library that can be used in scenarios like this.

      1. I see a problem and I decide that I need RMQ.

      2. I go to the site, copy-paste the segment tree example, and then replace each + with min.

      3. I use it similarly to any other container from STL and get AC.

      I guess, on one hand, it is good for people who wants just to learn stuff and need simple examples (so, the operation in the segment tree is simple sum and not something template), and on the other hand, it should also be quite good for hardcore competitors who want to make their own little Boost (or, maybe, AtCoder Library) to use in the contests. Maybe it is also good to have everything in two versions: more simple version for the article and more template-heavy ACL-like version for the practical usage.

      P.S. Sorry for late reply, weekend, you know.

»
3 года назад, # |
  Проголосовать: нравится +9 Проголосовать: не нравится

Maybe for each specific problem, the relationship between input and desired output can be described in some formal expression (for example), which is STRUCTRUED / HIERARCICAL / DETERMINATE by any chance. Then it can be stored and queried.

By the way since the expression is structrued, subtask (solved by common well-known algorithm template) may be identified, clustered, and queried as well.

»
3 года назад, # |
  Проголосовать: нравится +5 Проголосовать: не нравится

Um, some thinking problems without algorithms seems hard to check because it's hard to express the main idea with format expression that can match. For example, *1500- is hard to check.

However, a perfect tag system can provide a nice search engine for algorithmic problems, I think it will be helpful.

By the way, CF tags do need revamp.

»
3 года назад, # |
Rev. 2   Проголосовать: нравится +17 Проголосовать: не нравится

I suggest creating a competition in Kaggle (the codeforces of data scientists). It is worth emphasizing that what you propose is extremely difficult. There would be very noisy data with the statements of the problems, since problems with similar stories and different solutions can be confused as similar, perhaps a good idea would also be to use the solutions of the problems (and the code) to look for similarities between them.