Deep Learning and Momentum Investing

Discover how to apply deep learning models to financial data in a disciplined and interpretable way

Dmitry Borisenko
Towards Data Science


In this post I provide an overview of my new working paper on deep learning and momentum in U.S. equities. I begin with a short summary of the paper highlighting the research question and main results; after that I deliberately put on my practitioner’s hat (buy-side quant / PM, currently in transition between jobs and open to new opportunities, hint hint) and focus on the following practical aspects of disciplined quantitative research:

  1. Motivated choice of features and feature engineering
  2. Systematic approach to selecting optimal network architectures and building ensembles
  3. Interpretation of the model’s predictions

The paper is available on SSRN and a condensed summary of the paper can be found here. Comments, suggestions and feedback are welcome and eagerly anticipated.

The post is quite lengthy so here are the key takeaways:

  • I investigate the predictive power of a broad set of momentum-related variables in a deep learning framework and document rich, nonlinear, time-varying structure in their impact on expected returns. Investment strategies built on the predictions of the deep learning model actively exploit the non-linearities and interaction effects, generating high and statistically significant returns with a robust risk profile; their performance is virtually uncorrelated with the established risk factors.
  • A thoughtful approach to the model’s inputs is paramount: the usual problems with financial data, e.g. scarcity, non-stationarity, and state dependence, can be alleviated by a motivated choice of features and feature engineering.
  • Automated hyperparameter optimization is important not only because it leads to better model architectures but also because it adds an additional layer of modelling discipline, enhancing the reproducibility of the results.
  • Interpretability of a machine learning model’s predictions is crucial. The ability to link the predictions to established facts about asset behavior serves as a sanity check on the results.

I. Introduction and Summary

In finance, momentum refers to the phenomenon of cross-sectional predictability of returns by past price data. A standard example is the well-documented tendency of stocks that have had high returns over the past one to twelve months to continue to outperform stocks that have performed poorly over the same period. Positive returns from buying past winners and selling past losers are a long-standing market anomaly in financial research, documented for basically every asset class and literally for hundreds of years. Note that since the stocks are compared to their peers, we talk about cross-sectional predictability, in contrast to time-series momentum, or trend following, where the decision to buy or sell a stock depends only on its own past performance. Over the past quarter of a century the finance literature has proposed numerous ways to measure momentum, e.g. in terms of lookback horizon, and has identified a host of confounding variables, like market volatility, that predict its performance as an investment strategy. The emerging field of financial machine learning further finds past price data to be among the strongest predictors of future returns, dominating fundamental variables like the book-to-market ratio.

In the paper I investigate the predictive power of a broad set of price-based features, measured over various time horizons, in a deep learning framework. My results and contributions are as follows:

  • Empirical: I document rich non-linear structure in the impact of these features on expected returns in the U.S. equity market. The magnitude and sign of the impact exhibit substantial time variation and are modulated by interaction effects among the features. The degree of non-linearity in expected returns also varies substantially over time and is highest in distressed markets.
  • Methodological: I leverage the differentiability of neural networks’ outputs with respect to their inputs to study the directional effects of features on the model’s predictions, their evolution over time, and their interactions with other variables. This analysis makes it possible to explicitly relate the predictions to stylized facts about momentum, thus increasing the transparency of the results and showcasing the interpretability of an infamously black-box algorithm. I further demonstrate how to utilize hyperparameter optimization and ensemble construction methods to choose the best performing models in a systematic way.
  • Practical: investment strategies built on the out-of-sample predictions of the deep learning model actively exploit the non-linearities and interaction effects, generating high and statistically significant returns with a robust risk profile; their performance is virtually uncorrelated with the established risk factors, including momentum, and with machine learning portfolios from the current literature.

The rest of the post is organized as follows:

II. Importance of Feature Engineering
III. Model and Data
IV. Hyperparameter Optimization and Ensemble Construction
V. Test Set Results and Interpretability of Predictions
VI. Concluding remarks

II. Why Are the Choice of Features and Feature Engineering Important?

Financial data are special and require a thoughtful approach. First, in comparison with datasets standard in machine learning applications, they are rather limited in scope and availability; indeed, for the vast majority of markets we do not have high quality data prior to the 1990s.

Second, this problem is further aggravated by the fact that financial data are not stationary. Broadly speaking, the rules governing the data generating process can change over time. For example, thirty years ago trading costs were an order of magnitude higher than now, especially for smaller stocks, thus, everything else equal, commanding a higher expected return for them relative to their larger and more liquid counterparts. Furthermore, many important variables driving returns can be omitted from a model or may not experience the full range of their values over the training set. A concrete example of this problem is the vastly different behavior of asset prices during high and low market volatility periods, or regimes: training the model on a subset of the data that covers a time period corresponding to only one of the regimes would impair the ability of the model to generalize on the test set. Of course, not only market volatility but also other variables like macroeconomic statistics, monetary policy announcement sentiment, etc. can summarize the overall state of the market.

Third, financial data exhibit a very low signal-to-noise ratio. Given that the most powerful models like neural networks are low-bias, high-variance learners, this means that models will overfit the noise in the data. To sum up, in finance we do not have several billion cat pictures to train models on; the finance cats also mostly look like noise and can transform into a tapeworm or an owl with unknown probabilities on the third Thursday of each leap year if the ambient temperature is below zero.

All of the issues above mean that we cannot rely on the meat grinder approach and simply plug in all the available raw data, hoping that the algorithm will pick up important features without severely overfitting the inherent noise. Fortunately, we have several decades of research in return predictability that has essentially been doing feature engineering. More importantly, these engineered features have some theoretical or empirical justification behind their ability to predict returns.

For example, the figure at the top of the post shows the crash in returns on the long-short standard momentum strategy (i.e. buying past winners and selling short past losers) in early 2009. The standard momentum strategy buys or sells stocks based on their performance over the past 12 months, skipping the most recent month; it is often denoted 12–1 momentum. The mechanics of the crash are well understood: the lion’s share of the variation in returns on individual stocks can be attributed to a single factor, the market. Stocks that exhibit a higher degree of co-movement with the market outperform stocks that are less correlated with the market when the market goes up. The reverse holds during market downturns. We can measure the normalized degree of co-movement by estimating the slope coefficient, or beta, in the following simple linear regression of the stock return on the market return, also referred to as the market model:

r(i, t) = α(i) + β(i) × r(m, t) + ε(i, t),

where the intercept of the regression, or alpha, measures the component of stock i’s return which is orthogonal to the market. Toward the end of 2007 the U.S. equity market entered a downturn, losing more than half of its value by the beginning of 2009, so the 12–1 momentum strategy was investing in low-beta stocks that had emerged relatively unscathed from the market collapse of 2008 and was shorting high-beta stocks which had suffered the largest losses. Note that the strategy, or portfolio, is a linear combination of its constituents: the low beta of the long positions minus the high beta of the short positions resulted in an overall negative beta of the portfolio. In other words, in the beginning of 2009 the 12–1 momentum was betting against the market. When the market rapidly rebounded, momentum suffered its second worst crash in history after the Great Depression, during which the 12–1 momentum had stepped on the same rake. Figure 1 provides a visual illustration by plotting the values of the S&P 500 (in the top panel) and the difference in average one-year market betas between past winners and past losers (in the bottom panel). The average beta of the momentum strategy became positive again only in June 2009, after the market had already regained about 25% compared with its lowest point.

Figure 1: S&P 500 and market beta of momentum strategy

See Daniel and Moskowitz (2016) for more information on momentum crashes.
See Section 2.1 of this post by Boris B for an example of reasoning about why certain features make sense as input variables of an ML algorithm.

Indeed, the existing research can provide powerful insights into what a good model should be capable of accounting for. A couple of paragraphs of reasoning immediately give us the intuition that the market return, beta, and interactions thereof can be crucial features. As a matter of fact, the deceptively simple univariate regression above provides two additional features with empirical and theoretical support: the estimate of the intercept, or alpha, and the standard deviation of the residual, which measures idiosyncratic volatility. The key point here is that instead of feeding the whole time series of returns into the model, hoping that it will figure out the estimates on its own and risking running into the problems outlined in the beginning of the section, we can exploit our ex-ante knowledge of the features and their transformations that predict returns and design a more parsimonious and interpretable set of variables.
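To make this concrete, here is a minimal sketch of estimating the market model features (alpha, beta, and idiosyncratic volatility) from rolling regressions. The 252-day window and the function and column names are illustrative assumptions, not the exact implementation from the paper:

```python
import numpy as np
import pandas as pd

def market_model_features(stock_ret: pd.Series, mkt_ret: pd.Series,
                          window: int = 252) -> pd.DataFrame:
    """Rolling market-model regression r_i = alpha + beta * r_m + eps.

    Returns per-date estimates of alpha, beta, and idiosyncratic
    volatility (standard deviation of the residuals) over a trailing window.
    """
    out = {}
    for t in range(window, len(stock_ret) + 1):
        y = stock_ret.iloc[t - window:t].to_numpy()
        x = mkt_ret.iloc[t - window:t].to_numpy()
        beta, alpha = np.polyfit(x, y, deg=1)   # slope and intercept
        resid = y - (alpha + beta * x)
        out[stock_ret.index[t - 1]] = {
            "alpha": alpha,
            "beta": beta,
            "ivol": resid.std(ddof=2),          # two estimated parameters
        }
    return pd.DataFrame.from_dict(out, orient="index")
```

All three outputs can then be normalized cross-sectionally (see Section III.B) and fed to the model as separate features.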

I have not even touched on the data quality issue, which is an important topic on its own, but less of an issue for momentum research because price data are rather clean in comparison with, for instance, fundamental accounting ratios, which can be reported with a lag and are subject to revisions by data providers, meaning that what we have for training a model can differ significantly from what we would have had in a live application.

I refer the readers who are interested in the caveats of financial data in machine learning applications to Arnott, Harvey and Markowitz (2019).

III. Model and Data

A. Model

I specify the return prediction task as a classification problem and estimate the probability of the next-month return of a stock being above or below the median return of the entire cross-section. The prediction targets, or labels, for stock i are defined as follows:

y(i, t+1) = 1 if r(i, t+1) is above the cross-sectional median return at t+1, and y(i, t+1) = 0 otherwise.

Given stock i’s vector of features at month t, X(i, t), the predicted probability is a function of features and weights:

p̂(i, t) = f(X(i, t); W),

where f is the network and W collects its weights.

I choose the simplest architecture, the multilayer perceptron, and train the model by minimizing binary cross-entropy using the Adam optimizer. I further enforce equal representation of classes within each minibatch, finding that it greatly improves the stability of training. For regularization I employ early stopping and dropout.
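Here is a minimal sketch of this setup in Keras, assuming feature and label arrays X_train, y_train, X_val, y_val. The layer sizes, dropout rate, and learning rate are placeholders (the actual values come out of the hyperparameter optimization in Section IV), and the class-balanced generator is one simple way to enforce equal class representation per minibatch:

```python
import numpy as np
from tensorflow import keras

def build_mlp(n_features: int, hidden=(64, 32), dropout=0.3) -> keras.Model:
    """MLP estimating P(next-month return above the cross-sectional median)."""
    model = keras.Sequential([keras.Input(shape=(n_features,))])
    for units in hidden:
        model.add(keras.layers.Dense(units, activation="relu"))
        model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy")
    return model

def balanced_batches(X, y, batch_size=256, seed=0):
    """Yield minibatches with an equal number of examples from each class."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    half = batch_size // 2
    while True:
        idx = np.concatenate([rng.choice(pos, half), rng.choice(neg, half)])
        yield X[idx], y[idx].astype("float32")

model = build_mlp(n_features=X_train.shape[1])
model.fit(balanced_batches(X_train, y_train),
          steps_per_epoch=200, epochs=100,
          validation_data=(X_val, y_val),
          callbacks=[keras.callbacks.EarlyStopping(
              patience=10, restore_best_weights=True)])  # early stopping
```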

Classification offers several advantages over regression, which is a staple in empirical asset pricing. First, by construction, the labels have the same distribution and magnitude over time, and thus simplify training by alleviating the problem of time variation in the cross-sectional distribution of returns. Second, since the binary classification covers the whole sample space, the estimated probability of the return being above some constant, in our case the cross-sectional median, is directly proportional to the expected return, assuming the measurement error is random. Third, in practice we often care much more about the expected performance of an asset relative to its peers than about a point estimate of its future return.

B. Data

I use the standard CRSP dataset of U.S. equities for the period from January 1965 to December 2018, which leaves me with about 20,000 unique stocks in the sample after data availability filters are applied. I further deliberately focus the main part of my analysis on a subsample of the largest equities, selecting the top 500 stocks by market capitalization each month, and leave the rest of the stocks for a robustness check. The resulting subsample covers on average three quarters of the total U.S. equity market capitalization and is statistically indistinguishable from the S&P 500 in terms of daily returns. The main reason to focus on large caps is the compelling evidence from recent replication studies that the bulk of the forecasting power of the vast majority of return-predicting variables is concentrated in small and micro caps, which is extremely relevant in practice, where transaction costs and price impact from trades are a reality.

I set the forecasting horizon at one month and split the sample as follows: the training set covers the period from January 1965 to December 1982 and includes 105,177 stock-month examples; the validation set runs from January 1983 to December 1989 (41,408 examples); and the test set from January 1990 to December 2018 (170,385 examples). The additional test set which includes all stocks contains over 1,200,000 examples.

Following the discussion in the previous section, I construct a set of features that are motivated by findings in previous research and possess some rationale for why or how they are associated with expected returns. For example, I include the market return and volatility measured over horizons of 10 days and 1, 2, 3, 6, 12, 18, and 24 months, and the alpha and beta from the market model regression in the previous section estimated over horizons from 10 days to 12 months. The complete list of features, the rationale behind them, and references to the corresponding studies can be found in Section II of the paper.

To facilitate training I normalize the stock-specific variables, i.e. the past return of a stock, beta, alpha, etc., with respect to the cross-section by computing z-scores every time period. For the time series of market volatility and return I compute z-scores relative to their own history up to the estimation date to avoid look-ahead bias.
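The two normalizations can be sketched as follows, assuming a long-format DataFrame with a 'month' column for the stock-specific features (the column names are illustrative):

```python
import pandas as pd

def cross_sectional_zscore(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Z-score stock-specific features within each month, across stocks."""
    g = df.groupby("month")[cols]
    return (df[cols] - g.transform("mean")) / g.transform("std")

def expanding_zscore(series: pd.Series, min_periods: int = 60) -> pd.Series:
    """Z-score a market-level series against its own history up to each
    date (expanding window), which avoids look-ahead bias."""
    mean = series.expanding(min_periods).mean()
    std = series.expanding(min_periods).std()
    return (series - mean) / std
```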

IV. Selecting Optimal Neural Network Architectures and Building Ensembles

A. Hyperparameter Optimization

Deep learning models are very sensitive to the choice of hyperparameters that determine the model architecture and guide the estimation process. The performance of a model often depends more on its hyperparameters than, for instance, on how sophisticated the model itself is. Bergstra et al. (2013) argue that hyperparameter tuning should be a formal, quantified, and reproducible part of model evaluation. Bergstra et al. (2011) introduce the TPE algorithm, a sequential Bayesian optimization technique that formalizes the hyperparameter tuning task as an outer optimization problem, and demonstrate that TPE outperforms both manual and random search. The idea of the algorithm is to start from prior distributions of the hyperparameters θ and model the loss as a stochastic function of θ, then sample hyperparameters from a ‘good’ distribution corresponding to loss values below a certain threshold, and pick the hyperparameter values that maximize the expected improvement in the loss for the next optimization step. As the optimization progresses, the sampled hyperparameters converge toward their optimal values. A formal mathematical description of the algorithm can be found in the appendix of my paper. Here are a couple of links with examples of how to use TPE and other hyperparameter optimization algorithms in Python:

A tutorial by Vooban
Application of a closely related approach — the Gaussian process by Boris B
Bayesian optimization algorithms by Yurii Shevchuk at NeuPy
A tutorial by Dawid Kopczyk

I define the hyperparameter optimization objective function as follows: for a given set of hyperparameter values I estimate the model five times and pick the five best values of the validation loss achieved by each run; the value of the objective is the average over these 25 values. Since the training is largely stochastic, I am explicitly looking for architectures that can consistently achieve a lower loss both within each estimation run and across different runs. Table 1 reports the prior distributions of the hyperparameters; a code sketch of this objective follows the table:

Table 1: Hyperparameter priors
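As an illustration, here is how this objective could be wired into the hyperopt implementation of TPE. The search space below is a simplified stand-in for the priors in Table 1, and train_model is a hypothetical function standing in for the full training loop (it should return the per-epoch validation losses of one run):

```python
import numpy as np
from hyperopt import Trials, fmin, hp, tpe

# Simplified stand-in for the priors in Table 1.
space = {
    "log10_lr": hp.uniform("log10_lr", -5, -2),
    "dropout": hp.uniform("dropout", 0.0, 0.7),
    "n_layers": hp.choice("n_layers", [1, 2, 3]),
    "units": hp.quniform("units", 16, 256, 16),
}

def objective(params):
    """Average of the 5 best validation losses over 5 runs (25 values)."""
    best_losses = []
    for seed in range(5):
        val_losses = train_model(params, seed=seed)  # hypothetical training loop
        best_losses.extend(sorted(val_losses)[:5])
    return float(np.mean(best_losses))

trials = Trials()
# hyperopt runs a number of random start-up evaluations before TPE takes over.
best = fmin(objective, space, algo=tpe.suggest, max_evals=725, trials=trials)
```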

I initialize the algorithm with 25 evaluations of the objective using random search and then perform 700 TPE iterations. Figure 2 plots the average losses over the course of the TPE optimization: the blue dots are the losses of the best 50% of evaluations; the solid red line is the expanding first decile of losses, and the dashed black line tracks the minimum loss achieved for each TPE step.

Figure 2: TPE optimization progress

Over time the algorithm consistently proposes better hyperparameter configurations. The next figure shows how TPE adjusts the distributions of the hyperparameters over iterations, plotting the prior (dashed black line) and empirical densities of the decimal logarithm of the learning rate (left plot) and the dropout rate (right plot) for the first (in blue) and second (in red) halves of the TPE optimization: the distributions converge toward lower values of the learning rate and higher dropout probabilities.

Figure 3: Distributions of hyperparameters during TPE optimization

B. Optimal Ensemble

Neural networks are low-bias and high-variance algorithms, therefore variance reduction techniques like ensembles of models offer a tremendous advantage while being computationally inexpensive. More importantly, instead of simply averaging the predictions of several models, we can assign an optimal weight to the predictions of each constituent of an ensemble in a straightforward fashion. I pick the 20 model specifications that achieved the lowest validation loss during the hyperparameter optimization as initial candidates for the ensemble and then follow the Caruana et al. (2004) algorithm: starting from an ensemble of size one containing the best model, at each iteration I add a new model from the model pool (with replacement) such that the average prediction of the ensemble yields the lowest validation loss; a sketch of the selection loop is given below. Figure 4 plots the ensemble’s validation loss over the algorithm iterations: the black dashed line corresponds to the loss of the best model and the blue solid line plots the ensemble loss as the optimization progresses.

Figure 4: Validation loss during ensemble optimization

After approximately twelve iterations the algorithm stops considering new models and continues to adjust the weights of the existing constituents instead. Since the ensemble optimization is computationally cheap, I re-optimize the ensemble before every prediction on the test set, using the newly available information.
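The greedy selection loop can be sketched as follows; the number of iterations and the log-loss helper are illustrative, not the paper’s exact implementation:

```python
import numpy as np

def greedy_ensemble(preds: np.ndarray, y_val: np.ndarray,
                    n_iter: int = 50) -> np.ndarray:
    """Greedy ensemble selection with replacement (Caruana et al., 2004).

    preds: array of shape (n_models, n_obs) with each candidate model's
    validation-set probabilities. Returns ensemble weights proportional
    to how often each model was selected.
    """
    def log_loss(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))

    counts = np.zeros(len(preds), dtype=int)
    counts[int(np.argmin([log_loss(p) for p in preds]))] = 1  # best single model
    for _ in range(n_iter):
        total = counts @ preds          # sum of the selected models' predictions
        n = counts.sum()
        # add (with replacement) the model that lowers the ensemble loss most
        losses = [log_loss((total + p) / (n + 1)) for p in preds]
        counts[int(np.argmin(losses))] += 1
    return counts / counts.sum()
```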

V. Test Set Results and Interpretability of Predictions

A. Out-of-Sample Results

First, to gauge the model’s ability to generalize to unseen data, let’s have a look at the test set loss. Figure 5 plots the ensemble loss relative to its validation loss (dashed black line normalized to 1). The red line draws the average loss on the test set, the gray line plots the average loss over all stocks for each month, and the blue line shows a 12-month rolling mean of this average. In terms of the loss, the test set performance of the model deteriorates, on average, by about a third of a percent in comparison with the validation set. This is hardly surprising given that I estimated hundreds of specifications using the same validation data: the more specifications are tried on the validation set, the higher the probability that the best models overfit the validation set by chance; something to always keep in mind. The discrepancy is nevertheless small. More importantly, the test loss is stable over time, fluctuating around its long-run mean.

Figure 5: Test set loss

Recall that the model outputs estimated probabilities of the next month’s return being above the cross-sectional median return. Thus we can directly translate the predictions into investment strategies that buy stocks within a given predicted probability range. Table 2 reports descriptive statistics of returns (in excess of the risk-free rate) on equally-weighted portfolios for the largest-500-stocks sample. For example, the first triplet of columns reports statistics for the median sorts: a portfolio investing in stocks with predicted probabilities below the median predicted probability (first column); one investing in stocks above this probability (second column); and a long-short portfolio selling the stocks in the first portfolio and buying the stocks in the second (third column). Similarly, the second and third triplets report statistics for the lowest and highest predicted probability portfolios, with the probabilities split into quintiles and deciles respectively. Mean and median returns and their standard deviations are in percent p.a.; Sharpe ratios, measuring return per unit of risk taken, are annualized; the maximum drawdown (the worst streak of returns an investment has ever experienced), maximum single-month loss, and average monthly turnover are in percent. The numbers in brackets are HAC t-statistics for the null hypothesis of the mean return being equal to zero. A sketch of the sorting procedure follows the table.

Table 2: Descriptive statistics of ensemble portfolios: 500 largest stocks
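For concreteness, here is a minimal sketch of how such sorts translate predictions into portfolio returns, assuming a long-format DataFrame with hypothetical columns 'month', 'prob' (predicted probability), and 'fwd_ret' (realized next-month return):

```python
import pandas as pd

def long_short_sort(df: pd.DataFrame, n_bins: int = 10) -> pd.Series:
    """Equal-weighted long-short portfolio from predicted probabilities:
    each month, buy the top probability bin and sell the bottom one."""
    def one_month(g: pd.DataFrame) -> float:
        bins = pd.qcut(g["prob"], n_bins, labels=False, duplicates="drop")
        return (g.loc[bins == bins.max(), "fwd_ret"].mean()
                - g.loc[bins == bins.min(), "fwd_ret"].mean())
    return df.groupby("month").apply(one_month)

# n_bins=2 gives the median sort, 5 the quintile sorts, 10 the decile sorts.
```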

The portfolio effectively shorting half of the S&P 500 constituents and investing in the other half earns on average 7.2% p.a., which is statistically significant at any conventional level. The spread between the high and low portfolios increases as the sorts become more aggressive, up to 17.7% p.a. for the difference between the extreme decile portfolios. This increase comes simultaneously from higher (lower) returns on the ‘high’ (‘low’) portfolios, providing evidence of the model capturing the cross-sectional distribution of expected returns. The annualized Sharpe ratio rises from 1.11 to 1.34 between the median and decile sorts. Returns on the portfolios also become more positively skewed for the more concentrated sorts. For comparison, the excess return on the U.S. stock market over the same period is 7.1% p.a. with a Sharpe ratio of 0.48 and a maximum drawdown of over 50%.

The following figures plot the value of $1 invested in the ensemble portfolios (in natural logs). The top and bottom subplots show the results for the largest 500 stocks and all stocks respectively. The left panels depict the returns on the high and low portfolios sorted by the median predicted probability (solid and dashed blue lines) along with the excess return on the entire equity market (in black). The right panels display the returns on the long-short portfolios: median, quintile, and decile sorts are drawn in blue, red, and gray respectively. In the ‘all stocks’ sample, the performance improves even further: for example, the average spread in returns between the decile portfolios (the gray line in the bottom right panel) increases to 22.3% p.a. (t-statistic of 10.5) and the Sharpe ratio rises above 2. However, these raw numbers are to be taken with a g̵r̵a̵i̵n̵ truckload of salt without careful analysis of transaction costs and tradability issues.

Figure 6.a: Out-of-sample performance of ensemble portfolios: 500 largest stocks
Figure 6.b: Out-of-sample performance of ensemble portfolios: all stocks

B. Can Existing Risk Factors Explain Returns on the Ensemble Portfolios?

No, they cannot. Table 3 reports the results of time-series spanning tests, i.e. regressions of the excess returns on the ensemble portfolios on the five factors of Fama and French (2015) plus a momentum portfolio (a brief discussion of the Fama-French factors can be found here). The goal of these regressions is to determine whether the return on the test assets (neural network portfolios in our case) can be represented as a linear combination of the factors (usually previously established investment strategies with non-zero returns, like the market risk premium). Under the null hypothesis the test assets are spanned by the factors and the intercepts of the regressions, or alphas, are zero; the slope coefficient estimates gauge how correlated the returns on a test asset are with the returns on a given factor.

Table 3: Spanning tests, 6-factor model

For each test asset (across rows) in Table 3, the first column reports the estimated intercept, α, in percent p.a., the next six columns report the coefficients on the factors, and the last column shows the adjusted R² of the regression. The key takeaway from this exercise is that the returns on the ensemble portfolios cannot be captured by the other risk factors: the portfolios deliver large and statistically significant alphas. The returns on the long-short portfolio P2-P1 are essentially uncorrelated with any of the explanatory variables. In the paper I further show that this result holds for the other portfolio sorts, and is robust when the set of factors includes hedge fund returns and returns on other portfolios from the current financial machine learning literature.
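Such a spanning test is straightforward to run; here is a sketch using statsmodels with Newey-West (HAC) standard errors, assuming aligned monthly return series for the portfolio and the factors (the 12-lag choice is illustrative):

```python
import statsmodels.api as sm

def spanning_test(portfolio_excess, factors):
    """Regress portfolio excess returns on factor returns. A significant
    intercept (alpha) means the factors fail to span the portfolio.

    portfolio_excess: pd.Series of monthly excess returns.
    factors: pd.DataFrame of factor returns (e.g. Mkt-RF, SMB, HML,
    RMW, CMA, Mom), aligned on the same dates.
    """
    X = sm.add_constant(factors)
    # HAC (Newey-West) standard errors with 12 monthly lags
    return sm.OLS(portfolio_excess, X, missing="drop").fit(
        cov_type="HAC", cov_kwds={"maxlags": 12})

# res = spanning_test(ls_returns, ff5_mom); print(res.summary())
```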

C. Which Features Drive the Performance?

Let’s recap what we have so far:

1. The model generalizes well out-of-sample and captures the cross-sectional distribution of expected returns.
2. Portfolios based on the ensemble’s predictions generate high and statistically significant returns.
3. Performance of the ensemble portfolios is uncorrelated with other investment strategies and established risk factors.

Leaping over the important topics of position sizing and transaction costs, let’s focus on which features drive the predictions, how they do it, and whether it makes sense.

Figure 7 shows the partial derivatives of the predicted probabilities of a stock’s return being above the cross-sectional median return in the next month with respect to the model’s inputs. The top and bottom 10 input variables ranked by their average gradient are along the vertical axis. For a given feature, the colored bars and whiskers represent respectively the interquartile and the 5–95% range of all gradient evaluations on the test set. The solid black lines and dots inside each bar show the median and mean values of the gradients. Since the variables are normalized to have a mean of zero and a standard deviation of one, the interpretation is as follows: keeping other things equal, an increase in the one-year alpha of a stock by a small Δ relative to the cross-section increases the predicted probability of the stock’s return being above the cross-sectional median in the next month by approximately 100×Δ%.

Figure 7: Gradients of predicted probabilities w.r.t. inputs
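Computing these quantities is cheap because the network is differentiable end to end. A sketch with TensorFlow’s GradientTape, assuming the Keras model and test features from earlier:

```python
import numpy as np
import tensorflow as tf

def input_gradients(model: tf.keras.Model, X: np.ndarray) -> np.ndarray:
    """Partial derivatives of the predicted probability w.r.t. each input."""
    x = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        p = model(x)                    # P(return above the median)
    return tape.gradient(p, x).numpy()  # shape: (n_obs, n_features)

# grads = input_gradients(model, X_test)
# grads.mean(axis=0) ranks features by their average directional effect.
```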

Essentially, Figure 7 reports the unconditional predictors of expected returns. The most salient cross-sectional features predicting positive returns are the market model alpha over horizons from nine months to one year, along with the six-month and one-year price momentum. In fact, the one-year alpha is extremely robust: out of more than 170,000 observations, only two have negative gradients with respect to this variable. The only market state variable among the unconditional predictors is the 2-year market return. The major predictors of low expected returns are the short-horizon price features, consistent with short-term reversal; information discreteness, fip (a feature aiming to gauge whether return accumulates in a few large jumps or in many small increments); and the price momentum and market model alpha at horizons of seven to eight months. In fact, the largest positive contributions of the price momentum to the predicted probability are at the six- and twelve-month lookback horizons, which is illustrated by the next figure (Figure 8), plotting the average partial derivatives of the predicted probabilities with respect to the price momentum (left plot) and the market model alpha (right plot) against their lookback horizons. The horizons shorter than one month are aggregated into one-month bins. The contribution of the alpha, on the other hand, reverts to negative values at horizons shorter than nine months. A peculiar feature is that at the six-month horizon the gradients of both variables are much higher than those at the surrounding lookback periods. Overall, the effects outlined above are consistent with empirical evidence from the finance literature (I provide a detailed discussion with references to relevant studies in the paper).

Figure 8: Gradients of predicted probabilities w.r.t. to inputs by lookback horizon

If the main features predicting higher expected returns are alpha and momentum, how come the long-short ensemble portfolios are uncorrelated with investment strategies using these variables, e.g. momentum, which uses past returns, and how did the portfolios manage to demonstrate extraordinary performance in the 2008–2009 period, when momentum got literally steamrolled? The answer is: feature interactions with the market state variables.

Figure 9 shows the ten largest and smallest average gradients for the long-short decile portfolio sorted on predicted probabilities. The interpretation of the gradients becomes a bit more cumbersome, for example: keeping everything else equal, a small Δ change in the 10-day market return on average increases the predicted probability of a stock in the long leg of the portfolio (or decreases the probability of a stock in the short leg) by 0.2×100×Δ%. Although many variables are rather strong unconditional predictors of returns, the long-short neural network portfolio does not simply buy one-year alpha and sell short-term momentum. In fact, the portfolio does not exhibit any systematic exposures to the cross-sectional features with magnitudes similar to those in the long-only case. On average the portfolio is tilted toward betting against beta, information discreteness, and short-term idiosyncratic volatility, but in around 25% of stock-months these bets are reversed. The portfolio is also on average long market volatility and market return, but once again the variation in these gradients is significant. Given the dominance of the market state variables together with the dispersed bets in terms of the gradients, it is hardly surprising that the classical static asset pricing factors in the time-series tests of Table 3 possess virtually no explanatory power for the variation in the returns on the neural network portfolios.

Figure 9: Gradients of predicted probabilities of long-short portfolio

To demonstrate how the market state features modulate the importance of the cross-sectional characteristics, I plot partitions of the Hessian of the long-short decile portfolio for alpha (in Figure 10) and beta (in Figure 11) against the market state features (across the horizontal axes). As market volatility rises, the importance of alpha (and of price momentum, since the two are strongly correlated) measured over longer horizons goes down, and the gradients of shorter-term alphas go up. The reverse applies to market returns at horizons of up to three months. In other words, in distressed markets, when returns are low and volatility is high, the model dynamically allocates more importance to recent performance.

Figure 10: Second order effects, impact of market state features on alphas
Figure 11: Second order effects, impact of market state features on betas
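These second-order effects come from the same differentiability argument, one derivative deeper. A sketch with nested GradientTapes, where the indices i and j are illustrative (e.g. the one-year alpha and a market volatility feature):

```python
import numpy as np
import tensorflow as tf

def cross_effect(model: tf.keras.Model, X: np.ndarray, i: int, j: int) -> float:
    """Average second derivative d²p / dx_i dx_j over a batch: how feature j
    (e.g. market volatility) modulates the gradient of feature i (e.g. alpha)."""
    x = tf.convert_to_tensor(X, dtype=tf.float32)
    with tf.GradientTape() as outer:
        outer.watch(x)
        with tf.GradientTape() as inner:
            inner.watch(x)
            p = model(x)
        g_i = inner.gradient(p, x)[:, i]    # gradient w.r.t. feature i
    h_ij = outer.gradient(g_i, x)[:, j]     # its sensitivity to feature j
    return float(tf.reduce_mean(h_ij))
```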

For the betas, the impact of increasing market volatility is generally positive except at the shortest horizons. The beta gets higher gradients at longer horizons if the market return over similar horizons is above its long-term average. The short-term market return modulates the beta quite aggressively, allowing the model, for example, to directly exploit the pathological momentum behavior discussed in Section II: when the market trend reverts upwards, i.e. the short-horizon market returns increase, the gradients of the predictions with respect to the betas increase as well, which is especially prominent for the 10-day market return.

VI. Concluding Remarks

Sure, it is great to have models which beat traditional financial forecasting approaches to a pulp, but the message of the post is more subtle:

First and foremost, proper handling of data is of utmost importance in financial machine learning: apart from ensuring the quality of the dataset, it is useful to come up with a hypothesis or mental model about which features should work and why. This serves at least two purposes: (i) engineered features make it easier for the algorithm to learn the association between inputs and predictions by eliminating the noise which dominates the raw data; (ii) it alleviates the problem of HARKing (hypothesizing after the results are known): humans are extremely good at fooling themselves.

Second, automated hyperparameter optimization not only allows searching for the best performing architectures in a systematic way, but, more importantly, it also contributes to the reproducibility of the results.

Third, interpretability of the predictions is crucial to understanding how the model works and whether the results pass a sanity check. It is furthermore critical for grasping under which conditions the model can fail.

Comments and feedback are welcome. Thank you for reading.

References

Arnott, R., Harvey, C. R., & Markowitz, H. (2019). A backtesting protocol in the era of machine learning. The Journal of Financial Data Science, 1(1), 64–74. Available at https://ssrn.com/abstract=3275654

Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. In Advances in neural information processing systems (pp. 2546–2554).

Bergstra, J., Yamins, D., & Cox, D. D. (2013). Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning.

Caruana, R., Niculescu-Mizil, A., Crew, G., & Ksikes, A. (2004, July). Ensemble selection from libraries of models. In Proceedings of the twenty-first international conference on Machine learning (p. 18). ACM.

Daniel, K., & Moskowitz, T. J. (2016). Momentum crashes. Journal of Financial Economics, 122(2), 221–247. Available at https://ssrn.com/abstract=2371227

Fama, E. F., & French, K. R. (2015). A five-factor asset pricing model. Journal of financial economics, 116(1), 1–22.
