Machine learning everywhere, why not in Competitive programming?

After reading the title, if your first thought was "an ML model that is capable of solving competitive programming tasks?", then this is going to be disappointing – we are not there yet 🙂

In this post I will write about (and sort of prove) some ideas that use ML to help us build better competitive programming platforms and learn better. I will end the post with a bunch of ideas that are worth studying in depth. Let's first talk about some of the current issues in CP platforms:

  1. There is no recommendation system – would it not be amazing to get recommendations for problems similar to the ones you have already solved?
  2. Tags are mostly broken – currently problems are tagged manually, and there are multiple issues with this. Tags are high level: they tell you whether a problem involves math or binary search, but not which part of math, or how complex the problem is. And a lot of problems have no tags at all.
  3. It is very common for beginners to get stuck on problems. Currently they search for an editorial or a short explanation, but many problems have neither (SPOJ, lots of regional contests, various coding camp problems, etc.).
  4. Even where an editorial exists, the programmer reads it and gets the whole solution at once – a jump from 0 to 1. It would be better to give progressive hints on how to solve the problem; this would make them think more about it.

I believe 3 and 4 can bring a big change. A lot of programmers give up when they get stuck on problems or when they are not able to learn. Likewise, 1 can help them solve more similar problems and understand the concepts better. In this post I will propose possible solutions for all of these.

Auto Tagging CP Questions

Humans are amazing. We put a lot of logical thinking and intuition into understanding a CP question. As of today, computers are not advanced enough to understand a problem statement (in whatever language) and come up with a solution, or even categorize the problem. However, machine learning models are very good at recognizing patterns. The idea comes from this observation: give a binary search solution written by programmer X to another programmer Y. Without knowing the problem statement, or even what the code is trying to do, Y will be able to tell that there is a binary search in it. That is because we have seen it so many times that our brains have learned the pattern. Here are three examples of how it might look:

// Example 1: classic binary search over a sorted array.
while (Left <= Right)
{
  int Mid = (Left + Right) / 2;
  if (A[Mid] < X) Left = Mid + 1;
  else Right = Mid - 1;
}

// Example 2: binary search via the standard library.
index = lower_bound(a, a + n, required) - a;

// Example 3: binary search on the answer.
while (r - l > 1)
{
  long long c = (l + r) / 2;
  if (count(n, m, c) < k)
    l = c;
  else
    r = c;
}

We can recognize many such patterns. Most people use "dfs" as the function name when they implement depth-first search. They have an array named "dp" and/or "mem" in dynamic programming problems. As you read this, you should be able to recall some patterns for exponentiation by squaring, Dijkstra's, segment trees, binary indexed trees, the convex hull trick, square root decomposition, and so on. My idea is based on the fact that such patterns exist in the solutions to a given problem. The idea is to take the submitted solutions for a problem, figure out which common patterns exist across the accepted ones, and, based on prior knowledge, predict the categories for that problem. We can use these categories as tags (and for much more, as you will see in a bit). We should be able to efficiently predict tags for any problem that has at least 30 accepted solutions.

I implemented a basic model to back up the above claims; due to time constraints I kept it as simple as possible. The model detects whether a solution uses binary search or not. We shall now look at the implementation in detail and see how it performed. Skip to "Recommendation system" if you want to avoid the technical details. Note that this simple model was built only to show that the patterns exist and that a model can recognize them; for that reason I ignored a lot of things, kept the model simple, and did not spend much time trying to improve it further.

Why binary search?

Binary search is one of the most common techniques, and it can be implemented in many ways. It does not have strong patterns the way segment trees, BITs, etc. do. The amount of code that carries the binary search pattern is very small (4-5 lines). Also, binary search solutions can often be replaced by other approaches – sort and two pointers, square root decomposition search, etc. – which means the data is going to be noisy. All of this makes it one of the hardest patterns to recognize (from an ML point of view), hence I chose it for our toy model.

Data and preprocessing

I selected 14 problems from Codeforces, some of them tagged with binary search. I wrote a script to download all accepted solutions (in all languages) and labeled them True or False depending on whether the binary search tag was present. I did some processing to reduce the size of the source code and fix some other irregularities. I fixed the source size at 800 characters: I ignored longer solutions and padded shorter ones. I mapped the characters to the range [0, 96). At this point I had 20000 vectors, each of length 800, with each value in the range [0, 96). Each of the 20000 vectors represents one program's source code and is labeled 0 or 1. Note that many solutions to problems tagged with binary search might not actually use it, so the data is in fact noisy.
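For concreteness, here is a minimal Python sketch of that preprocessing step. The helper name encode_source and the exact character mapping are my own illustrative choices, not the script I actually used:

import numpy as np

MAX_LEN = 800   # fixed source size in characters
VOCAB = 96      # characters are mapped into [0, 96)

def encode_source(code):
    # Map printable ASCII (32..126) to [0, 95); everything else to 95.
    vals = [ord(c) - 32 if 32 <= ord(c) <= 126 else VOCAB - 1 for c in code]
    if len(vals) > MAX_LEN:
        return None                        # ignore solutions longer than 800 chars
    vals += [0] * (MAX_LEN - len(vals))    # pad shorter ones with zeros
    return np.array(vals, dtype=np.int32)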

Model

Ignore this section if you want to avoid technical details; you should know a bit about convolutional neural networks for this paragraph to make sense. The model is a CNN with 2 convolution layers, each followed by a pooling layer. The first kernel is of size 50 by 1, moving with a 1 by 1 stride, with 16 output channels. The second kernel is of size 25 by 1, moving with a 1 by 1 stride, with 32 output channels. Both pooling layers are max-pool layers with a 4 by 1 window moving with a 4 by 1 stride. Starting from an 800 by 1 vector, after the first pooling layer we have 16 vectors of size 200 by 1, and after the second pooling layer we have 32 vectors of size 50 by 1. Then comes a fully connected layer with 256 neurons, followed by a softmax layer with 2 outputs. I trained the model for 20 epochs.
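For the curious, a Keras sketch of this architecture might look as follows. The layer shapes follow the description above; the framework choice, activations, padding, and optimizer are my assumptions:

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(800, 1)),                  # one solution = 800 char codes
    layers.Conv1D(16, 50, strides=1, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=4, strides=4),   # 800 -> 200
    layers.Conv1D(32, 25, strides=1, padding='same', activation='relu'),
    layers.MaxPooling1D(pool_size=4, strides=4),   # 200 -> 50
    layers.Flatten(),                              # 32 * 50 = 1600 features
    layers.Dense(256, activation='relu'),
    layers.Dense(2, activation='softmax'),         # [no binary search, binary search]
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val))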

Results

I made sure that the model was not overfitting the training data. The model predicted whether a source file contains binary search correctly 86% of the time when I split the data into training and validation sets at random. If I instead split the data so that the training set contains no solutions from 4 held-out problems (2 positive, 2 negative), and use those solutions as the validation set, the accuracy was 78%. So in the latter case we are showing the model solutions to problems it has never seen, and it still classifies most of them correctly. 78% per-solution accuracy is good enough that we could actually deploy this model to decide whether a problem should be tagged with binary search, and it would almost never get that decision wrong: each problem has at least 30 solutions (the assumption I made above), our model correctly classifies about 78% of them, and we can aggregate those predictions – for example by majority vote – to decide whether to tag the problem with binary search.
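As a quick sanity check of the "almost never get it wrong" claim: if each solution is classified correctly with probability 0.78 and we assume the errors are independent (a simplification – errors on the same problem will be correlated in practice), a majority vote over 30 solutions errs only very rarely:

# Majority vote over n = 30 solutions, each classified correctly with p = 0.78.
# Independence is a simplifying assumption; real errors will be correlated.
from math import comb

p, n = 0.78, 30
err = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1))
print(f"P(majority vote is wrong) ~= {err:.5f}")   # on the order of 1e-4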

There are many ways to do better and tune the model. For example, each character is currently mapped to a scalar in [0, 96); instead, I should have mapped each one to a 96-dimensional vector, which would remove the artificial similarity between, say, the variables "i" and "j". This might or might not help, but it is worth trying. I should also have fed the language of the solution into the model.
Also note that I trained it on solutions to just 14 problems; the more training data, the better the model will be. I am confident we can achieve much better accuracy with more effort.
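As a sketch, the first improvement amounts to replacing the raw scalar input with a learned per-character vector. In Keras (again an assumption – any framework with an embedding layer would do):

from tensorflow.keras import layers

inputs = layers.Input(shape=(800,))                        # raw char codes in [0, 96)
x = layers.Embedding(input_dim=96, output_dim=96)(inputs)  # learned char vectors
# x has shape (800, 96) and can feed the same convolution stack as before.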

Conclusion

The model was able to learn the binary search patterns. We can extend it to predict a wide range of categories, and hopefully we will no longer need to tag problems manually.

Recommendation system

Idea 1: We again use the same concept as above. Let's drop the notion of tags and bring in classes. A class can be a category (DP, segment trees, etc.), a low-level trick (convex hull trick, sorting the input), or an implementation detail (lots of math, big integers). Instead of assigning tags to a problem, we predict a probability for each class (these need not sum to 1). Now we have a lot of information about what the solution looks like: we know, say, that it uses dynamic programming with the convex hull trick. Given this data, we can find similar problems much more accurately. Recommending by tags often fails because tags are high level – two problems using segment trees can be totally different. By capturing far more detail, we can recommend better.
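A minimal sketch of how such class-probability vectors could drive recommendations (everything here – the vectors, the class order, the scoring – is hypothetical):

# Recommend problems whose predicted class-probability vectors are closest
# to those of a problem the user has already solved. Hypothetical sketch.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def recommend(solved_vec, all_problems, k=5):
    # all_problems: list of (problem_id, class_probability_vector) pairs.
    scored = [(pid, cosine(solved_vec, vec)) for pid, vec in all_problems]
    scored.sort(key=lambda t: -t[1])
    return scored[:k]

# Example: a solved problem that is mostly "DP with convex hull trick".
solved = np.array([0.9, 0.8, 0.1])   # [DP, convex hull trick, math]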

Idea 2: Suppose we represent each solution as a point in some high-dimensional space (100-D?) such that similar solutions are close to each other. Each dimension loosely represents a feature of the solution. For instance, if the X axis represents math and the Y axis represents DP, then a point close to the origin represents a solution with neither much math nor DP, while a point with a high X value and a low Y value represents a solution with a lot of math but no DP. Representing solutions in 100 or more dimensions just means we capture many more such details. Once similar solutions are close to each other, we can suggest similar problems very effectively. But how do we get that representation? It turns out we can actually learn it. This is very similar to word2vec [by the way, do read about word2vec if you are interested in ML – it is neat].
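Once every solution has such an embedding, recommending similar problems reduces to a nearest-neighbour query. A sketch with scikit-learn, assuming the 100-D vectors are already produced by some word2vec-style model:

# Nearest-neighbour lookup in the learned solution space. The embeddings
# themselves are assumed to come from a word2vec-style model.
import numpy as np
from sklearn.neighbors import NearestNeighbors

embeddings = np.random.rand(1000, 100)   # placeholder: 1000 solutions in 100-D
index = NearestNeighbors(n_neighbors=6, metric='cosine').fit(embeddings)
dist, idx = index.kneighbors(embeddings[42:43])   # 5 neighbours + the query itself
print(idx[0][1:])                                 # most similar solutions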

Hints Generation

By now you should have guessed how this works: it simply extends the concepts from recommendation idea 1 above. Once we have that knowledge (low-level implementation details and high-level categories), we can try to generate useful hints. For example, for a given problem, by looking at the solutions the model recognizes that they "sort the input array", "use DP", and "use the long long data type a lot". We can then generate "Try to sort the input and solve it" as the first hint and "Try to use DP" as the second. Later, suppose someone submits a solution, gets WA, and has no clue what might be wrong, but our model notices that their solution mostly uses the int data type; we can hint "Maybe you are running into overflow issues". This approach feels very natural to me – it is how we help fellow programmers once we have solved a problem ourselves. And unlike an editorial, where the reader jumps from 0 to 1 (one moment he does not know the solution, the next he knows it all), progressive hints let him keep thinking and solve the problem as it gets a bit easier with each hint.
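A toy sketch of how recognized classes could be turned into progressive hints; the class names, threshold, and templates are all made up for illustration:

# Turn predicted classes into ordered hints, most superficial first.
# Class names, thresholds, and templates are illustrative assumptions.
HINT_TEMPLATES = {
    "sorts_input":       "Try to sort the input and solve it.",
    "dp":                "Try to use DP.",
    "convex_hull_trick": "The DP can be sped up with the convex hull trick.",
}

def generate_hints(class_probs, threshold=0.7):
    # class_probs: dict class_name -> predicted probability for this problem.
    # Emit hints in template order, so the most superficial one comes first.
    return [hint for cls, hint in HINT_TEMPLATES.items()
            if class_probs.get(cls, 0.0) >= threshold]

print(generate_hints({"sorts_input": 0.9, "dp": 0.8, "convex_hull_trick": 0.75}))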

Actually, this idea has great scope beyond CP. Imagine a model good enough to recognize that "for(int i=0; i<n; j++)" or "if(x=0)" is a common typo and alert the programmer – it could power a smart code editor.

And more..

In this section I will collect some loose thoughts related to what we discussed so far.

  1. As I said above, the hint generation system can be a great tool beyond CP.
  2. I used a CNN for this task, but I believe recurrent neural networks would outperform CNNs here.
  3. A lot more can be done in the preprocessing stage. I did not bother to strip comments or unused code (like Topcoder does). We could also use a language-specific parser to do a better job: remove common headers, standardize spacing, rename all variables to the same name, or rename them to a standard format (a-z, a01-a99, or so); see the sketch after this list.
  4. We could feed the model something other than raw source code. Maybe compiled code? Or low-level instructions?
  5. How about patterns in the way memory is read and written? This comes from how bottom-up DP fills its arrays – there is a pattern. The same goes for how a segment tree accesses memory when answering queries.
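As a rough illustration of point 3, a tiny normalization pass might look like this. It is regex-based and deliberately naive; a real implementation would use a proper language-specific parser:

# Naive source normalization: strip comments and collapse whitespace.
# Regex-based and simplistic on purpose; a real pass would use a parser.
import re

def normalize(source):
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)               # line comments
    source = re.sub(r"\s+", " ", source)                    # standardize spacing
    return source.strip()

print(normalize("int main(){ // entry\n  return 0; /* done */ }"))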

Conclusion

In this post I am not really solving these issues; I wanted to put the ideas out there and give a simple proof of concept. It is nice how all the issues are wired together – we could really build just one good model that serves all of the above. Hopefully the big programming platforms will allocate resources to solving them. Competitive programming is an amazing way to shape one's future, and making it better in any way is surely worth it. People often start CP but give up quickly due to a lack of resources or guidance; I believe building smart tools that help them in the initial stages is time well spent. I will try to push some programming platforms to take a deeper look at this, and let's hope we will soon have an intelligent competitive programming platform 🙂

I recently started learning ML, and it sure is an interesting space. As my knowledge is very limited, I might have missed a lot of obvious things in this post. Do comment on what you think about the post, the ideas listed here, and of course about ML. Also, if you are a student, consider whether any of these could make an interesting project.

  • Akshay khanna

    I think RNNs (LSTMs) would do a better job, as they remember the preceding code to a greater extent and hence could be more useful or accurate in recommendations/suggestions.

  • Mudit Jain

    This is a way better explanation of conv networks http://cs231n.github.io/convolutional-networks/

  • Very nice idea. It would also be beneficial to include the opposite of a recommendation system, so that the user encounters new problems and then, if he/she wishes to, can solve similar problems in that category.

  • Aniket Bhatnagar

    Hello sir
    Can you please tell how you came up with the CNN model specifications – how did you decide that 16 output channels would be needed for the first filter and 32 for the second? I am also a machine learning enthusiast and have a basic idea of CNNs and how to apply them to images, but I couldn't figure out how to apply the same to code, since here we cannot use word2vec or similar word vectorizers.

  • Prashant Yadav

    From the view point of a learner, it looks awesome.

  • Hello Anudeep,
    Just to bring something similar to your notice: I have made a platform (StopStalk) that accumulates submissions from various competitive platforms into one place, along with a lot more features. The future goal is to extend it to **exactly** what you described.
    Note: I am not commenting here to promote my platform (in a way I am, but the point here is totally different).

    Here is the link to the issue – https://github.com/stopstalk/stopstalk-deployment/issues/30

    I cannot agree with you more that a lot of interesting features can be added using ML for recommendations.
    To explore what StopStalk actually is, have a look – https://www.stopstalk.com (I just ask for 5 minutes to find out what it provides to make you more and more active in competitive programming).
    We can have a discussion somewhere if you are interested 🙂
    Since it is open source, I would really appreciate it if someone were interested in sending a PR!
    Great post!
    Thanks,
    Raj

    • Is this site for stopping Stalk… or particularly for stalking ?

      • Stop stalking and start StopStalking 😉
        PS: Please don't spam any more about StopStalk on this thread, since I think the post was written for a genuine reason which is getting diverted by my comment (which also was not meant to promote it – but anyway..)

  • DS

    Interesting approach. Have you considered using collaborative filtering to solve (1)?