Numerai Tournament: Blending Traditional Quantitative Approach & Modern Machine Learning

How to boost intelligence on financial machine learning

uki

Published in

Towards Data Science

11 min read · Jun 23, 2020

Introduction

Numerai Tournament

Numerai is a crowdsourced hedge fund: it trades on stock price predictions submitted by an open pool of participants. Numerai runs a tournament in which participants compete on forecasting performance. Each participant builds a predictive model on an encrypted dataset provided by Numerai and uses it to create a submission. Participants are ranked and paid (and sometimes burned) according to their predictive performance. Numerai’s backers include Howard Morgan (a co-founder of Renaissance Technologies), Paul Tudor Jones, Union Square Ventures, and other prominent VCs and people with significant hedge fund experience. The Numerai dataset is supervised by an advisor specializing in financial machine learning. Total payouts to participants to date exceed $34 million, and the project appears to be making good progress.

About the author

The author invests in the Japanese stock market using a market-neutral strategy. A market-neutral strategy aims for an absolute return that is independent of overall market movements by combining long and short positions, based on predictions of the relative rise and fall of stock prices within the universe (the group of stocks eligible for investment). The author’s predictive model is built with machine learning on a foundation of traditional quantitative methods and statistics. The results have been good, with a return of around 40%.

Purpose of this article

In this article, I share the insights gained while building that model. I first explain the traditional quantitative approach and then discuss how to blend it with machine learning to build a modern predictive model.

Notes

Numerai’s dataset is encrypted, and the author has no inside knowledge of it. This article reflects only the author’s own viewpoint, based on investing and modeling experience.

Traditional quantitative approach

The study of predicting stock returns has been around for a long time. Let’s start with an explanation of what the traditional quantitative method is and its origins.

BARRA’s risk model

The prototype of today’s quantitative methods is probably the risk model proposed by Barr Rosenberg [1]. Accounts of its origins vary, but for a history of Wall Street in this area, Peter Bernstein’s book “Capital Ideas” [2] is well worth reading.

In the 1960s, building on Markowitz’s covariance model, Rosenberg devised a method to explain the risk of individual companies using a variety of factors. He also found that these risk factors were related to the excess return on stock prices (the risk premium). In 1975, Rosenberg founded a consulting firm, Barr Rosenberg Associates, Inc., which became known to asset management firms around the world as BARRA.

Today the BARRA model is the best-known risk model, and MSCI offers it as a vendor. Other risk models include Axioma. There are several BARRA models; the BARRA Global Equity Model (GEM) is a risk model for stocks in major equity markets around the world [3]. This model decomposes equity returns into country factors, industry factors, risk factors, and individual factors as follows.

This can be described in a multiple regression model as follows. R_n is the excess return (relative to the risk-free interest rate) of stock n, x is the factor exposure of stock n to each factor (k, j, and i), f is the factor return, and e_n is the specific return. The key here is the concept of factor returns.
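Written out from the definitions in the paragraph above, the regression has the following general form. This is a sketch rather than a quotation from the handbook: it assumes that k indexes country factors, j industry factors, and i risk index factors.

```latex
R_n = \sum_{k} x_{nk} f_k + \sum_{j} x_{nj} f_j + \sum_{i} x_{ni} f_i + e_n
```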

Factor returns

For simplicity, I will use a single-factor model rather than a multi-factor model to explain. I will also proceed with the Numerai dataset structure as a concrete example. Factor returns are the regression coefficients f in the following cross-sectional regressions. Here r is the target vector in eraX and x is the vector of featureA in eraX.
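In symbols, the single-factor cross-sectional regression for one era can be sketched as below, where r_n is the target of stock n in eraX and x_n is its featureA value. The intercept a is an assumption added here for generality; it drops out if r and x are centered within the era.

```latex
r_n = a + f \, x_n + e_n, \qquad n = 1, \dots, N_{\mathrm{eraX}}
```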

A factor return measures how much return is expected from betting on that risk factor within the universe. Factor exposure is how strongly a stock is exposed to that risk factor: the greater the exposure, the greater the benefit from the factor return. As can be seen from the equation above, the regression is a cross-sectional model over one specific period (eraX). In actual testing, we accumulate the per-period factor returns over time (e.g., monthly) and observe how they behave.
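The per-era regression and its accumulation over time can be sketched as follows. This is a minimal illustration on synthetic data, not Numerai data: the column names `era`, `featureA`, and `target`, the helper `factor_returns`, and the simulated signal strength are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def factor_returns(df, feature, target="target", era_col="era"):
    """Per-era OLS slope of `target` on `feature` (regression with intercept)."""
    out = {}
    for era, group in df.groupby(era_col, sort=False):
        x = group[feature].to_numpy(dtype=float)
        r = group[target].to_numpy(dtype=float)
        xc = x - x.mean()
        out[era] = float(xc @ (r - r.mean()) / (xc @ xc))
    return pd.Series(out, name=feature)


# Synthetic data (NOT Numerai data): 24 eras, 500 stocks, one quintile feature
rng = np.random.default_rng(0)
frames = []
for e in range(1, 25):
    x = rng.integers(0, 5, size=500) / 4.0          # values 0, 0.25, ..., 1
    r = 0.02 * x + rng.normal(0.0, 0.1, size=500)   # small true factor return
    frames.append(pd.DataFrame({"era": f"era{e}", "featureA": x, "target": r}))
df = pd.concat(frames, ignore_index=True)

f = factor_returns(df, "featureA")   # one factor return per era
cumulative = f.cumsum()              # the curve one would plot over time
print(cumulative.tail())
```

Each era contributes one regression coefficient, and the cumulative sum is the kind of curve shown in factor return charts.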

Below is an example of factor returns from the BARRA GEM document. If a cumulative factor return trends steadily upward, betting on that factor has earned a stable return. If it trends steadily downward, you can bet against the factor (swapping long and short). As of 2020, few factors show returns that trend clearly in one direction. One should therefore construct a portfolio that is diversified across a variety of factors, taking each stock’s factor exposures into account.

Factor returns of various risk factors. Source: [3]

Factor returns and correlations

Since the factor returns are regression coefficients, they can be converted to correlation using the volatility of the objective and explanatory variables. In the equation below, b is the regression coefficient of the explanatory variable x on the objective variable y, σxy is the covariance of x and y, and σx and σy are the standard deviations of x and y, respectively. Correlation, so to speak, is a factor return standardized between -1 and 1, corrected by volatility.
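The conversion follows from the standard definitions of the ordinary least squares slope and the Pearson correlation, written here as ρ (a symbol introduced for this illustration):

```latex
b = \frac{\sigma_{xy}}{\sigma_x^{2}}, \qquad
\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} = b \, \frac{\sigma_x}{\sigma_y}
```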

Correlation is a very important indicator in risk models, and thus in active portfolio management theory. In active portfolio management theory, correlation is called Information Coefficient and it is an indicator of a fund manager’s skill. I will not go into a detailed explanation of this area. If you are interested, you can refer to the most famous book on active management theory [4].

Here I plot the factor returns (computed as correlations) for each Numerai feature, using a single-factor model for each feature. The figure shows at a glance how each feature behaves and how much explanatory power it has on its own.

Factor returns of Numerai features. Source: by author
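A chart of this kind can be produced along the following lines. This is a sketch on synthetic stand-in data: the feature names `feature_a` and `feature_b`, the helper `era_correlations`, and the use of Pearson correlation are assumptions for illustration (a rank correlation could be substituted).

```python
import numpy as np
import pandas as pd


def era_correlations(df, features, target="target", era_col="era"):
    """Pearson correlation of each feature with the target, one row per era."""
    rows = {}
    for era, group in df.groupby(era_col, sort=False):
        y = group[target].to_numpy(dtype=float)
        rows[era] = {
            name: float(np.corrcoef(group[name].to_numpy(dtype=float), y)[0, 1])
            for name in features
        }
    return pd.DataFrame.from_dict(rows, orient="index")[features]


# Synthetic stand-in data: feature_a carries a weak signal, feature_b is noise
rng = np.random.default_rng(1)
frames = []
for e in range(1, 25):
    a = rng.integers(0, 5, size=500) / 4.0
    b = rng.integers(0, 5, size=500) / 4.0
    target = 0.02 * a + rng.normal(0.0, 0.1, size=500)
    frames.append(pd.DataFrame({"era": f"era{e}", "feature_a": a,
                                "feature_b": b, "target": target}))
df = pd.concat(frames, ignore_index=True)

corr_by_era = era_correlations(df, ["feature_a", "feature_b"])
cumulative = corr_by_era.cumsum()    # one cumulative curve per feature
print(cumulative.iloc[-1])
```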

Note that these factor returns include variation due to randomness. The following Monte Carlo simulation shows cumulative factor returns for true correlations of 0.0 and 0.005 (100 trials each). We should always keep in mind that this much variability can arise by chance alone. Determining statistical significance from a sample of only about 120 periods is very difficult. With that caveat, the most significant factor returns appear to come from dexterity 4 and 7.

Randomized factor returns. Source: by author
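A simulation of this type can be sketched as below. The setup is an assumption, not the author’s exact procedure: each per-era correlation is drawn from a normal distribution centered on the true value, with a standard error of 1/sqrt(n_stocks), and `n_stocks=5000` is a hypothetical universe size. Real per-era dispersion may well be larger.

```python
import numpy as np


def simulate_cumulative_corr(true_corr, n_eras=120, n_trials=100,
                             n_stocks=5000, seed=0):
    """Cumulative per-era correlation paths under a constant true correlation."""
    rng = np.random.default_rng(seed)
    # Approximate standard error of a sample correlation when the true value is near 0
    std_err = 1.0 / np.sqrt(n_stocks)
    per_era = rng.normal(true_corr, std_err, size=(n_trials, n_eras))
    return per_era.cumsum(axis=1)


paths_zero = simulate_cumulative_corr(0.0, seed=0)
paths_small = simulate_cumulative_corr(0.005, seed=1)

for label, paths in [("corr=0.000", paths_zero), ("corr=0.005", paths_small)]:
    final = paths[:, -1]
    print(f"{label}: mean={final.mean():.3f}, 5%..95% = "
          f"{np.percentile(final, 5):.3f}..{np.percentile(final, 95):.3f}")
```

Comparing the spread of final values under the two settings gives a feel for how much of an observed cumulative factor return could be noise.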

Evaluation by correlation

Seen this way, it is clear why Numerai uses correlation for evaluation. The prediction submitted by each tournament participant is itself a rich super-factor, a signal containing much more information than a typical factor. Numerai then seeks excellent returns from these super-factors, which participants generate independently. If the factor returns are excellent, Numerai can operate simply by combining them or, in some cases, can train further models on the collected factors to improve performance.

Risk factor as a feature

In this chapter, we discuss how traditional risk factors can be incorporated as features for machine learning. First, the country and industry features are important.

Country feature

Numerai’s universe is thought to cover stocks in all major markets around the world. In the tournament dataset the IDs of individual stocks are encrypted, so we have no way to confirm this. However, Numerai Signals has published a list of target stocks, which I compiled. Judging from the total number of stocks, I suspect it is the same universe as the current Numerai tournament. There are 41 countries in the Numerai Signals list, with the US having the most stocks, followed by Japan, Korea and the UK.

These may be incorporated not by individual country but by region (North America, South America, Pacific, etc.).

Number of stocks in each market. Source: by author

In the usual risk model, a country feature is introduced as a 0/1 categorical variable. However, Numerai features are basically binned into five quantiles, and for most features each quantile contains the same number of stocks. Therefore, if I were to create such a feature, I would run a multiple regression of each stock’s returns on the index of each country (or region), and then convert the resulting beta into a quantile to use as the feature.

If we do this, Japanese stocks, for example, would have a higher beta to the Tokyo market index and would cluster in the largest (or smallest, depending on the sign convention) quantile of the feature. So if Numerai has a country feature, only the largest quantile would be informative and the rest would carry no information. Numerai’s analysis_and_tips reported that some features show remarkable characteristics when the feature value is 0 or 1, which I suspect may be this case.
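The construction described above can be sketched as follows. Everything here is hypothetical: the data are synthetic, there are only two country indices (`US` and `JP`), and the first 30 of 200 stocks are defined to be Japanese. It illustrates the idea and is not a reconstruction of how Numerai builds its features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_days, n_stocks = 250, 200
index_names = ["US", "JP"]

# Synthetic daily returns of two country indices (hypothetical data)
index_ret = rng.normal(0.0, 0.01, size=(n_days, len(index_names)))

# The first 30 stocks are "Japanese": they load on the JP index, the rest on US
is_jp = np.arange(n_stocks) < 30
true_beta = np.vstack([~is_jp, is_jp]).astype(float)        # shape (2, n_stocks)
stock_ret = index_ret @ true_beta + rng.normal(0.0, 0.01, size=(n_days, n_stocks))

# Multiple regression of every stock on all indices at once (with intercept)
X = np.column_stack([np.ones(n_days), index_ret])
coef, *_ = np.linalg.lstsq(X, stock_ret, rcond=None)         # shape (3, n_stocks)
betas = pd.DataFrame(coef[1:].T, columns=index_names)

# Rank the JP beta across stocks and cut into five equal-sized bins: 0, 0.25, ..., 1
jp_rank = betas["JP"].rank(method="first")
jp_feature = pd.qcut(jp_rank, 5, labels=False) / 4.0

print(jp_feature.value_counts().sort_index())
print("JP stocks in top bin:", int((jp_feature[is_jp] == 1.0).sum()), "of", int(is_jp.sum()))
```

Because every bin must hold the same number of stocks, the Japanese stocks concentrate in the top bin while the ordering among the remaining bins is driven mostly by estimation noise, which matches the point that only the extreme quantile is informative.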

For reference, I show the trend of relative returns for each country since 2010.

Cumulative relative return of each country. Source: by author

Industry feature

Second, the industry feature is important. In Stock Market Wizards, Steve Cohen states that 40% of a stock’s price movement is driven by the market, 30% by its industry, and the remaining 30% by stock-specific factors. There is no reason not to incorporate an industry feature. Definitions of industry vary: BARRA GEM defines 38 sectors, GICS defines 60 sectors, and FactSet’s RBICS defines 12 Economies, 31 Sectors and 89 Subsectors. For reference, the number of stocks by economy in the US market is shown below.

Number of stocks in each industry of US market. Source: by author

The industry feature may also be built by taking multiple regression betas on industry indices and binning them into quantiles, as with the country feature. In this case too, only the largest quantile is informative and the rest carry no information.

For reference, I show the trend of relative returns for each industry in the US market since 2010.

Cumulative relative return of each industry. Source: by author

Risk Index feature

The Risk Index features are likely to include those used in BARRA: size, value, success (momentum) and volatility. These can be used as they are, but they are often normalized within each category (country or industry) to remove country and industry bias.

For the size index, factors such as sales, total assets and number of employees could be considered in addition to market capitalization. For the value index, price-to-book, price-to-earnings and price-to-cash-flow could be considered. Other Risk Indices include liquidity, growth, dividends, and financial leverage. In addition to these traditional Risk Indices, alternative variables such as analyst revisions and sentiment indices extracted from news can also be incorporated.
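Normalizing within a category can be sketched as below. The z-score within each country is one common choice and an assumption here, since the normalization method is not specified; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "US", "JP", "JP", "JP"],
    "book_to_price": [0.2, 0.4, 0.6, 0.8, 1.0, 1.2],   # made-up values
})

# Subtract the country mean and divide by the country standard deviation
by_country = df.groupby("country")["book_to_price"]
df["value_neutral"] = (
    (df["book_to_price"] - by_country.transform("mean")) / by_country.transform("std")
)
print(df)
```

After this step, a Japanese stock and a US stock with the same normalized score are equally cheap relative to their own market, even though their raw ratios differ.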

For reference, I show the trend of relative returns for each risk index in the US market since 2010.

Cumulative relative return of each risk factor. Source: by author

Blending traditional quantitative approach and modern machine learning

In this chapter, we discuss how machine learning can be used to improve on the performance of traditional quantitative models.

Tree based model

The BARRA model is simply a weighted combination of the individual risk factors. A simple way to improve on it is to consider interactions between the individual risk factors.

To give a simple example, the value factor tends to work in some industries and not in others. Likewise, some factors work best for large stocks and others work best for small stocks. Furthermore, different industries outperform in different countries.

A linear model is inadequate for capturing such interactions, because the interaction variables must be specified by a human and added as features. A tree-based model can learn interactions on its own without being told which ones to look for. On the other hand, tree-based models are poor at capturing the linear risk premia of the original BARRA model, because their grid-like, axis-aligned splits approximate a linear relationship only coarsely.

The solution is to ensemble or stack linear and tree-based models. In fact, in the Two Sigma financial competition on Kaggle, an ensemble of Ridge regression and ExtraTrees was among the prize-winning solutions [5].

5th Place Winners’ model in the Two Sigma financial competition on Kaggle. Source: [5]
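The idea of combining a linear model with a tree-based model can be sketched with scikit-learn as below. This is an illustration on synthetic data, not the winning team’s code: the data-generating process (a linear premium plus one interaction), the hyperparameters, and the equal blend weights are all assumptions.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 4))
# Target: linear premium on feature 0 plus an interaction of features 1 and 2
y = 0.5 * X[:, 0] + 0.5 * X[:, 1] * X[:, 2] + rng.normal(0.0, 0.5, size=n)
X_tr, X_te, y_tr, y_te = X[:3000], X[3000:], y[:3000], y[3000:]

ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
trees = ExtraTreesRegressor(
    n_estimators=200, min_samples_leaf=20, random_state=0
).fit(X_tr, y_tr)

pred_ridge = ridge.predict(X_te)
pred_trees = trees.predict(X_te)
pred_blend = 0.5 * pred_ridge + 0.5 * pred_trees   # simple equal-weight ensemble


def corr(a, b):
    """Pearson correlation between predictions and realized targets."""
    return float(np.corrcoef(a, b)[0, 1])


for name, pred in [("ridge", pred_ridge), ("trees", pred_trees), ("blend", pred_blend)]:
    print(f"{name}: corr with target = {corr(pred, y_te):.3f}")
```

Ridge can only pick up the linear term, while the trees can also pick up the interaction, so comparing the three correlations shows what each model contributes. Stacking would replace the fixed weights with a second-stage model fitted on out-of-fold predictions.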

Deep factor model

Deep learning can also be used in the model, a technique known as the deep factor model [6]. In conventional quantitative management, the fund manager creates and selects factors based on his or her experience. The deep factor model replaces that human judgment with deep learning, aiming to capture the nonlinearity of individual factors.

The method uses 80 factors to predict monthly returns and has been confirmed to be able to outperform predictions made by linear models and other machine learning methods (SVR and Random Forest).

I believe that it is relatively easy to outperform traditional quantitative models by blending in machine learning in this way. However, the added model complexity may reduce explainability, and there are traps such as overfitting and data-snooping bias, so building the machine learning model requires knowledge and intuition specific to finance. For more on techniques in this area, see the book Advances in Financial Machine Learning by Marcos Lopez de Prado, Chief Scientific Advisor to Numerai [7].

Conclusion

In this article, I explained the traditional quantitative approach, described how traditional risk factors can be incorporated as features, and showed how the traditional quantitative approach and modern machine learning can be blended. Blending them in this way has the potential to improve predictive performance significantly.

I also hope that learning how to observe the market through traditional quantitative methods will make readers more interested in the actual market, and make analysis on Numerai even more enjoyable.

I hope this article sparks readers’ curiosity and gives them ideas for their own predictive models. Thank you for reading to the end.

UKI

References

[1] Barr Rosenberg, Vinay Marathe, “The Prediction of Investment Risk: Systematic and Residual Risk”, 1975
[2] Peter Bernstein, “Capital Ideas: The Improbable Origins of Modern Wall Street”, 1992
[3] Barra Global Equity Model Handbook
[4] Richard Grinold, Ronald Kahn, “Active Portfolio Management”, 1995
[5] Team Best Fitting, “Two Sigma Financial Modeling Code Competition, 5th Place Winners’ Interview”, 2017
[6] Kei Nakagawa, Takumi Uchida, “Deep Factor Model: Explaining Deep Learning Decisions for Forecasting Stock Returns with LRP”, 2018
[7] Marcos Lopez de Prado, “Advances in Financial Machine Learning”, 2018