
How to generate a code dataset for machine learning applications

Pitfalls to avoid

Source Code Datasets for AI – Why and How

Photo by daniel wildman from FreeImages

Why code datasets

Code AI has recently become a common topic, whether the task is classifying a code snippet’s topic, finding vulnerabilities in it, summarising it, or even guessing the next token given the previous ones. What enables it are recent NLP advances together with the boost in availability of rich and varied source code datasets. With services like Github, Gitlab and Bitbucket, generating a source code dataset may seem like a trivial task – ‘crawl Github and start training the model’. In truth, this apparent simplicity hides quite a complex domain with many pitfalls to avoid. A detailed explanation below.

Case study – topic prediction

Let’s assume our task is to build a model that predicts whether a source file is server related. There are many ways to approach it, from manually collecting features (like the file type, whether it is test code, what its imports are, etc.) to deep neural networks that auto-generate the set of features to look at. No matter which path we take, the first step is to gather a relevant dataset. A common (naive!!) plan would be to randomly sample repositories from a public code hosting service (like Github) using some relevant search criteria (like Node.js repositories) and to label each file based on the existence of relevant terms (‘server related’ vs ‘not’). Following that path, we are likely to find our model overfitting to the dataset’s shallow characteristics. What could go wrong?
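To make the naive plan concrete, here is a minimal sketch of that keyword-based labelling, assuming a hypothetical list of crawled file records; the SERVER_TERMS set and the example records are purely illustrative:

```python
# Naive labelling pass: mark a file as 'server related' if it mentions any of a
# few hand-picked terms. The terms and the file records below are illustrative.
SERVER_TERMS = {"server", "tomcat", "flask", "express"}

crawled_files = [
    {"repo": "org/app", "path": "src/Main.java", "language": "Java",
     "content": "import org.apache.catalina.startup.Tomcat; ..."},
    {"repo": "org/site", "path": "index.js", "language": "JavaScript",
     "content": "const express = require('express'); ..."},
]

def naive_label(file_record):
    text = file_record["content"].lower()
    return int(any(term in text for term in SERVER_TERMS))

for f in crawled_files:
    f["label"] = naive_label(f)  # 1 = 'server related', 0 = 'not'
```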

Why random sampling of code is not trivial

The main issue with our initial plan is the inherent, implicit biases of source code datasets, which can hurt the model’s generalisation across languages and code types. Github, one of the most common code hosting services, reportedly suffers from a languages’ long tail; a few are extremely common (like Javascript and Java) while most (like C) are quite rare. Revisiting our plan: randomly sampling Github will hand us a proxy of that long tail distribution. It can get even worse because some topics are language related; Java, for example, is more relevant to server applications than Clojure. Since Java is also more common in our dataset, the model can conclude that Clojure is never related to servers (always predicting negative, which is not correct). The solution should be to make sure our dataset is more uniformly distributed across languages and classes.
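Before any modelling, it is worth verifying this bias on the actual crawl output; a small sketch with pandas, assuming a hypothetical crawled_files.csv with one row per file and language and label columns:

```python
import pandas as pd

# One row per sampled file; the 'language' and 'label' columns and the file
# path are assumptions about the crawl output, used here for illustration.
df = pd.read_csv("crawled_files.csv")

# Language long tail: a handful of languages will dominate the counts.
print(df["language"].value_counts(normalize=True))

# Class balance per language: languages with (almost) no positives are the ones
# the model may learn to always predict as negative.
print(df.groupby("language")["label"].mean())
```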

Why stratified sampling of code is not trivial

The common way to achieve a uniform distribution is stratified sampling. On code datasets it is common to apply the sampling at the repo level, in order to keep the generated train-test-validation splits mutually exclusive (avoiding the risk of overfitting to files from the same repo). But sampling languages at the repo level can be difficult: some languages, like Json and Yaml, are common everywhere, so no matter what sampling policy is enforced they are likely to end up (too) common in our dataset. Another question is how to target relevant files. To make sure the dataset has enough server related examples, we suggested searching Github with server related terms (like ‘tomcat’ for Java, ‘flask’ for Python and ‘express’ for Javascript), but such a search is too technology specific (and may lead to overfitting), and in general it is tricky to guarantee the same amount of examples per language. Relying on normalised search terms (like the word ‘server’ itself) would miss many of the potentially relevant examples. How can we make sure our dataset is evenly distributed while keeping it rich and varied?
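For the repo-level part, one possible way to keep the train-test-validation splits mutually exclusive is to group the split by repository; a sketch using scikit-learn’s GroupShuffleSplit, again assuming a hypothetical crawled_files.csv with a repo column:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_csv("crawled_files.csv")  # one row per file, with a 'repo' column

# First carve out a held-out test set made of whole repositories...
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_val_idx, test_idx = next(outer.split(df, groups=df["repo"]))
train_val, test = df.iloc[train_val_idx], df.iloc[test_idx]

# ...then split the remainder into train and validation, again by repo,
# so no repository's files leak across splits.
inner = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, val_idx = next(inner.split(train_val, groups=train_val["repo"]))
train, val = train_val.iloc[train_idx], train_val.iloc[val_idx]
```

Grouping by repo guarantees that every repository lands in exactly one split, which is what keeps the splits mutually exclusive.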

Sampling in phases to tune the distribution

Given the code dataset characteristics described above, our dataset is likely to suffer from biases, and stratified sampling can only partially solve them due to the implicit class-to-language relations. The solution should be to sample in phases. On the positive class (which is commonly scarcer), plain under-sampling wastes expensive examples, while plain over-sampling can make the model memorise the less common ones. On the negative class (which is commonly more available), stratified sampling may hide the class variety. Our solution is to start by under-sampling the positive class’s too common languages, then over-sample its long tail languages, and finally under-sample the negative class to match the positive languages distribution we just generated. This results in a data sample that keeps as many positive examples as possible while making the overall population almost uniformly distributed. An important side note: no matter which sampling strategy we apply, we should always monitor both the overall and the per-language performance, to gain visibility into potential hidden biases.
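A rough sketch of this phased sampling with pandas, under the same illustrative assumptions; the CAP and FLOOR values are arbitrary placeholders to tune against your own distribution:

```python
import pandas as pd

df = pd.read_csv("crawled_files.csv")  # 'repo', 'language', 'label' columns assumed
pos, neg = df[df["label"] == 1], df[df["label"] == 0]

CAP = 2000    # max positives kept per language (under-sample the head)
FLOOR = 500   # min positives per language (over-sample the long tail)

def resample_language(group):
    if len(group) > CAP:                       # too common: under-sample
        return group.sample(CAP, random_state=42)
    if len(group) < FLOOR:                     # long tail: over-sample with replacement
        return group.sample(FLOOR, replace=True, random_state=42)
    return group

pos_balanced = pos.groupby("language", group_keys=False).apply(resample_language)

# Under-sample the negatives to match the per-language counts of the positives.
target_counts = pos_balanced["language"].value_counts()
neg_balanced = pd.concat([
    lang_group.sample(min(len(lang_group), target_counts.get(lang, 0)), random_state=42)
    for lang, lang_group in neg.groupby("language")
])

dataset = pd.concat([pos_balanced, neg_balanced]).sample(frac=1, random_state=42)
```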

Active learning to fill the gap

In our suggested solution some examples are ignored (when under-sampling the too common positives and the negatives in general) and some are duplicated (when over-sampling the positive long tail). To make the sampling smarter than simple random selection, we can use a weak model to decide which negative examples to prioritise (ones the weak model already classifies correctly matter less) and to replace the over-sampling duplications with actively targeting the strata of interest. Concentrating on the dataset’s ‘weak’ areas lets us spend our labelling effort where it improves the model most.
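A minimal sketch of such a weak model used to prioritise negatives, assuming simple bag-of-words features; the feature choice, the model and the cutoff are all illustrative:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# 'content' and 'label' columns are assumed; the path is illustrative.
df = pd.read_csv("crawled_files.csv")

# A deliberately weak model: TF-IDF over the raw file text + logistic regression.
vec = TfidfVectorizer(max_features=5000)
X = vec.fit_transform(df["content"])
weak_model = LogisticRegression(max_iter=1000).fit(X, df["label"])

# Score the negative pool: negatives the weak model already classifies correctly
# with high confidence add little, so prioritise the ones it finds hard.
neg = df[df["label"] == 0].copy()
neg["p_positive"] = weak_model.predict_proba(vec.transform(neg["content"]))[:, 1]
hard_negatives = neg.sort_values("p_positive", ascending=False).head(10_000)
```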

Build with caution

Even though it looks simple, generating a reliable code dataset can be a difficult task, sometimes even harder than training the model itself. Paying attention to the characteristics of the training population up front can prevent many headaches later.

