An easy guide to “The hardest data science tournament on the planet”

Data science for fun? for crypto? Why not both? 😀

Suraj Parmar
Towards Data Science

--

Source: Numerai blog

Update (DEC 01, 2020): The notebook has been updated for the new target “Nomi”. TARGET_NAME is now “target” instead of “target_kazutsugi”.

Update (SEPT 2021): The data set used in this notebook is now legacy, and a new, much larger data set is live. The code should still work with the new data after a few modifications.

Just Give Me The Code:

Make sure you have signed up on numer.ai, as you’ll need to set up your API keys to make submissions directly from Colab.

💡 The Numerai tournament problem

The Numerai data science problem resembles a typical supervised machine learning problem: the data has several input features and corresponding labels (or targets), and our goal is to learn a mapping from inputs to targets using various techniques. We usually split the data into training and validation parts, and most of the time is spent on cleaning the data.

Left: Sample of training data. Right: Sample submission

However, Numerai data is different. The problem is to predict the stock market, but what makes it unique is that the data is obfuscated and already cleaned! We don’t know which row corresponds to which stock. Moreover, the rows are grouped into eras that represent different points in time. Still, as long as the data has structure, we can certainly try to learn patterns from it.

Numerai gives this cleaned data to data scientists and asks them to provide better estimates for it. These crowd-sourced predictions are used to build a meta-model and to invest in real stock markets around the world. The incentives are based on the quality of your predictions and the amount of NMR you stake. You earn a percentage of your stake if your predictions help to make a profit; otherwise, part of your stake gets burned. This earn/burn system keeps motivating participants to submit better and more unique predictions: the more accurate and/or unique the predictions, the higher the returns. This is what makes the problem interesting and complex (hence “the hardest data science tournament”).

Let’s address this problem on Google Colab with an end-to-end walk-through using a simple yet very effective technique: CatBoost. I’ll be explaining the Colab snippets here, so it helps to open the notebook link in a new tab alongside this article.

Pipeline ➿

  1. Load the data set (and some operations that you’ll need)
  2. Define a model
  3. Train a model
  4. Validate
    4.1 Tweak something (back to step 1)
  5. Predict and submit
    5.1 Observe the performance over 4 weeks

Setting up Colab

We’ll need to switch the runtime to use GPU by going to

Runtime -> Change runtime type -> GPU -> Save

Colab comes preinstalled with many data science libraries, but we’ll need to install CatBoost and numerapi.
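In Colab, the install step is usually a single cell such as `!pip install catboost numerapi` (both are the package names on PyPI). As a small optional sketch, the check below reports which of the two are still missing from the runtime. The helper name `missing_packages` is my own and is not part of the notebook.

```python
import importlib.util

# Packages this walk-through relies on beyond Colab's defaults.
REQUIRED = ["catboost", "numerapi"]


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this runtime."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # In a Colab cell, the leading "!" runs the command in the shell.
    print("Run in a cell: !pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```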

We’ll go through setting up your pipeline in Colab and making it flexible enough to run experiments there and submit predictions using API keys. Once you have set up the keys and finalized a model, all you need to do is press Run all in Colab.

Again, make sure you have opened the notebook alongside this article.

Loading data 📊

Downloading data using numerapi and loading into memory

The tournament data already contains validation sets (val1 and val2). We usually evaluate our model’s predictions on this subset with the goal of performing well on unseen data.
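The loading step can be sketched roughly as follows. This is a minimal illustration, not the notebook’s exact code. It assumes the legacy layout in which each row has `id`, `era`, `data_type`, a set of `feature_*` columns and a target column, and it builds a tiny synthetic frame instead of downloading anything (the download call in numerapi has changed across versions, so check its documentation for the current one). The helper names and all values are hypothetical.

```python
import pandas as pd

TARGET_NAME = "target"  # per the DEC 2020 update; previously "target_kazutsugi"


def feature_columns(df):
    """Pick the input columns, assuming they all start with 'feature'."""
    return [c for c in df.columns if c.startswith("feature")]


def split_by_data_type(df):
    """Separate the labelled validation rows from the remaining rows."""
    validation = df[df["data_type"] == "validation"]
    other = df[df["data_type"] != "validation"]
    return validation, other


# Tiny synthetic stand-in for the tournament file (values are made up).
tournament = pd.DataFrame(
    {
        "id": ["a", "b", "c", "d"],
        "era": ["era121", "era121", "eraX", "eraX"],
        "data_type": ["validation", "validation", "live", "live"],
        "feature_one": [0.0, 0.5, 1.0, 0.25],
        "feature_two": [0.75, 0.25, 0.5, 0.0],
        TARGET_NAME: [0.5, 0.75, None, None],  # live rows have no target yet
    }
)

features = feature_columns(tournament)
validation, other = split_by_data_type(tournament)
print(features, len(validation), len(other))
```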

Defining and training a model 🤖⚙️

Defining and training a model

This is probably the part where most of your observations and tuning will happen. You should experiment with other types of modeling algorithms.
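A model definition along these lines might look like the sketch below. The hyperparameter values are placeholders I picked for illustration, not the notebook’s tuned settings; `task_type="GPU"` matches the GPU runtime selected earlier. The import sits inside the function so the configuration can be inspected even where CatBoost is not installed.

```python
# Illustrative hyperparameters only; tune these in your own experiments.
MODEL_PARAMS = {
    "iterations": 1000,
    "learning_rate": 0.05,
    "depth": 6,
    "task_type": "GPU",  # switch to "CPU" if no GPU runtime is available
    "verbose": 100,      # print progress every 100 iterations
}


def train_model(X_train, y_train, X_val=None, y_val=None, params=None):
    """Fit a CatBoostRegressor, optionally tracking a validation set."""
    from catboost import CatBoostRegressor  # requires `pip install catboost`

    model = CatBoostRegressor(**(params or MODEL_PARAMS))
    eval_set = (X_val, y_val) if X_val is not None else None
    model.fit(X_train, y_train, eval_set=eval_set)
    return model
```

Given a training frame `train` and a list of feature names `features` (both hypothetical names), a call would look like `model = train_model(train[features], train["target"])`, followed by `model.predict(...)` on the rows you want to score.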

Making and evaluating predictions 📐

Don’t get overwhelmed by the amount of code here. It is mostly boilerplate that helps in evaluating predictions, and you probably won’t need to change much. However, you might want to add more metrics for better evaluation once you feel comfortable with the tournament.

Predict and validate
Evaluation results on the training and validation set
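One common way to evaluate predictions on era-grouped data is to score each era separately and then summarize across eras. The sketch below does that with a rank correlation per era. It reflects my own assumptions (column names `era`, `target` and `prediction`, and a Sharpe-like ratio of mean to standard deviation) rather than reproducing the notebook’s boilerplate, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def era_score(group, pred_col="prediction", target_col="target"):
    """Correlation between targets and percentile-ranked predictions for one era."""
    ranked = group[pred_col].rank(pct=True, method="first")
    return np.corrcoef(group[target_col], ranked)[0, 1]


def evaluate(df, era_col="era"):
    """Score each era separately, then summarize across eras."""
    scores = pd.Series({era: era_score(g) for era, g in df.groupby(era_col)})
    mean, std = scores.mean(), scores.std(ddof=0)
    return {
        "per_era": scores,
        "mean": mean,
        "std": std,
        "sharpe": mean / std if std > 0 else float("nan"),
    }


# Synthetic demo: predictions follow the target in era1 and oppose it in era2.
demo = pd.DataFrame(
    {
        "era": ["era1"] * 5 + ["era2"] * 5,
        "target": [0.0, 0.25, 0.5, 0.75, 1.0] * 2,
        "prediction": [0.1, 0.2, 0.3, 0.4, 0.5, 0.5, 0.4, 0.3, 0.2, 0.1],
    }
)
result = evaluate(demo)
print(result["per_era"])
```

Scoring per era rather than over all rows at once shows whether a model is consistent across time periods, which a single pooled correlation can hide.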

Once your predictions meet your goals, you can save them and upload them with numerapi using your secret keys.
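Saving the predictions can be sketched as below. I am assuming the submission file needs an `id` column and a `prediction` column; compare this against the sample submission before relying on it. The helper name `build_submission` and the example values are hypothetical.

```python
import io

import pandas as pd


def build_submission(ids, predictions):
    """Assemble the two-column frame that gets written to predictions.csv."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": list(ids), "prediction": list(predictions)})


submission = build_submission(["a", "b", "c"], [0.48, 0.51, 0.50])

# Written to an in-memory buffer here; in the notebook you would pass a path
# such as "predictions.csv" to to_csv instead.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```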

Submitting the predictions📤

Settings menu on top right

Although you can upload predictions.csv manually, we’ll use the API for hassle-free submissions. Numerai lets you create keys for different purposes, but we’ll create a key for uploading predictions only.

Options under Automation for creating keys
Key options for different purposes

To create new secret keys, go to

Settings -> Create API key -> select "Upload Predictions" -> Save

You’ll be shown your keys; save them somewhere safe.

Below is a sample key for submitting predictions.

A sample key for submitting your predictions

You can have 10 models in one Numerai account, so feel free to experiment with new techniques while keeping your well-performing models unchanged. You can use numerapi to submit predictions for different models. You can see a list of your models in the options above Settings; just copy the model_id and paste it into the notebook.
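The upload itself can be sketched with numerapi’s `NumerAPI` client and its `upload_predictions` method. Reading the keys from environment variables is my own choice here, as are the variable names and helper names; the notebook may simply paste the keys into a cell. The numerapi import sits inside the function so the rest of the code runs without the package installed.

```python
import os


def load_keys(env=os.environ):
    """Read the API key pair from environment variables (names are my choice)."""
    return env["NUMERAI_PUBLIC_ID"], env["NUMERAI_SECRET_KEY"]


def submit(path, model_id, public_id, secret_key):
    """Upload a predictions file for one model and return the submission id."""
    import numerapi  # requires `pip install numerapi`

    napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
    return napi.upload_predictions(path, model_id=model_id)
```

Typical use would be `public_id, secret_key = load_keys()` followed by `submit("predictions.csv", model_id, public_id, secret_key)`, where `model_id` is the value copied from your models list. Keeping the keys out of the notebook body makes it safer to share the notebook.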

Getting the model key
Submitting predictions
Stats for one of my models, not from this one.

After uploading the predictions, you’ll see some metrics and information about your submission.

From my experience, it takes a couple of submissions to get up and running in the tournament. Once you have set up your workflow, all you need to do is press Run all in Google Colab.

Your predictions will be tested on live data and given scores:

  1. CORR: Correlation between your predictions and live data
  2. Meta Model Contribution (MMC): An advanced staking option which incentivizes models that are unique in addition to high performing

You can stake your NMR on either CORR or CORR+MMC.

What’s next? 💭

There are a couple of things you can do to improve your performance. You get paid for the uniqueness of your predictions too.

  1. Play with data
  2. Tune model parameters
  3. Change model architecture
  4. Ask on RocketChat or Forum
  5. Join the weekly Office Hours — details at RocketChat
