(Mis)adventures in using machine learning for equity portfolio selection

The pains of preventing data leakage in multi-period forward forecasts

Raphael Lee Cher Hern
Towards Data Science


Preface

As a fledgling data scientist, I recently built a machine learning model to predict and select a portfolio of outperforming equities within the CSI 300 Index (the top 300 equities by market capitalisation on the Shanghai and Shenzhen stock exchanges).

The results were promising, but not good enough. My model managed to select a portfolio of equities that consistently outperformed the benchmark index (accumulated excess returns of at least 5% over a 200-day period), but once brokerage costs and slippage were factored in… well, let’s just say I’m not putting my money on my model’s predictions for now!

So why am I writing this article? Because I searched in vain for articles or advice on some of the challenges I faced while building this model (hence the (mis)adventures), and I sincerely hope that somewhere, someday, someone in the same situation will get some use out of this article. Or I’ll find out that I’m really bad at online searches. Either way, learning happens!

Problem Statement

To use a machine learning model, fed with fundamental and technical indicators, to predict and select a portfolio of equities from the CSI 300 Index that will outperform the benchmark index.

Data

The primary source of the data was a trial subscription to Morningstar.

Technical features used were:

  1. Percentage change in Close prices between two periods
  2. Percentage change in exponentially-weighted moving averages of Close prices between two periods
  3. Variance of day-on-day Close prices
  4. Average turnover
  5. Percentage change in exponentially-weighted moving averages of daily volume traded between two periods
  6. Variance of daily volume traded

These were all calculated over different periods, or permutations of periods.
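For concreteness, here’s a minimal sketch of how features like these might be computed with pandas. The column names (`close`, `volume`, `turnover`) and the window lengths are my own illustrative assumptions, not the exact ones used in the project.

```python
import pandas as pd

def technical_features(df: pd.DataFrame, fast: int = 5, slow: int = 20) -> pd.DataFrame:
    """Illustrative technical features for a single equity.

    Assumes `df` has a DatetimeIndex and columns close, volume, turnover.
    The window lengths are placeholders; the project used several
    permutations of periods.
    """
    feats = pd.DataFrame(index=df.index)

    # 1. Percentage change in Close prices between two periods
    feats["close_pct_chg"] = df["close"].pct_change(periods=fast)

    # 2. Percentage change between the fast and slow EWMAs of Close
    ewma_fast = df["close"].ewm(span=fast).mean()
    ewma_slow = df["close"].ewm(span=slow).mean()
    feats["close_ewma_pct_chg"] = (ewma_fast - ewma_slow) / ewma_slow

    # 3. Variance of day-on-day Close price changes over a rolling window
    feats["close_var"] = df["close"].pct_change().rolling(slow).var()

    # 4. Average turnover over a rolling window
    feats["avg_turnover"] = df["turnover"].rolling(slow).mean()

    # 5. Percentage change between the fast and slow EWMAs of daily volume
    vol_fast = df["volume"].ewm(span=fast).mean()
    vol_slow = df["volume"].ewm(span=slow).mean()
    feats["volume_ewma_pct_chg"] = (vol_fast - vol_slow) / vol_slow

    # 6. Variance of daily volume traded over a rolling window
    feats["volume_var"] = df["volume"].rolling(slow).var()

    return feats
```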

Fundamental features used were:

  1. Return on Equity
  2. Profit Margin
  3. Change in Profit
  4. Change in Revenue
  5. Price-to-Earnings ratio
  6. Price-to-Book ratio
  7. Price-to-Sales ratio

These were all derived from quarterly financial reports and normalised against their sector peers (using Morningstar’s sector designations).

To prevent data leakage, quarterly financial statement figures were only used from one month after the end of that quarter, as that is the latest date by which a listed company must publish its quarterly financial statement.
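A sketch of how this might look in pandas; the column names and the `merge_asof` join are my own assumptions about the setup, not the project’s actual code:

```python
import pandas as pd

def attach_fundamentals(fundamentals: pd.DataFrame,
                        daily: pd.DataFrame) -> pd.DataFrame:
    """Sector-normalise quarterly ratios, then join them onto daily data
    without leaking unpublished reports.

    Assumed (illustrative) columns: `fundamentals` has ticker, sector,
    quarter_end plus the ratio columns; `daily` has ticker and date.
    """
    f = fundamentals.copy()
    ratio_cols = ["roe", "profit_margin", "pe", "pb", "ps"]

    # z-score each ratio against same-sector peers in the same quarter
    grouped = f.groupby(["sector", "quarter_end"])[ratio_cols]
    f[ratio_cols] = grouped.transform(lambda s: (s - s.mean()) / s.std())

    # A report only becomes usable one month after the quarter ends,
    # the deadline for publishing it
    f["available_from"] = f["quarter_end"] + pd.DateOffset(months=1)

    # Each trading day gets the latest report already available to it
    return pd.merge_asof(
        daily.sort_values("date"),
        f.sort_values("available_from"),
        left_on="date",
        right_on="available_from",
        by="ticker",
    )
```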

Additionally, key economic indicators used were:

  1. China Caixin Manufacturing PMI
  2. China Caixin Services PMI
  3. China NBS Manufacturing PMI
  4. China NBS Non-Manufacturing PMI
  5. China Consumer Price Index
  6. China GDP year-on-year

To prevent data leakage, economic indicator figures were only used from the next trading day after their publication.
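The same `merge_asof` trick can enforce this rule as well; a sketch, with `releases` and `trading_days` as assumed inputs:

```python
import pandas as pd

def align_indicator(releases: pd.DataFrame,
                    trading_days: pd.DatetimeIndex) -> pd.DataFrame:
    """Give each trading day the latest indicator value released strictly
    before that day, i.e. a figure is first used on the next trading day
    after its publication.

    Sketch only; `releases` is assumed to have columns release_date and
    value, sorted by release_date.
    """
    return pd.merge_asof(
        pd.DataFrame({"date": trading_days}),
        releases,
        left_on="date",
        right_on="release_date",
        direction="backward",
        allow_exact_matches=False,  # a same-day release is not visible yet
    )
```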

Quick peek at the feature space

With only 10 quarters of financial statements to work with, the timeframe of data I could use for modelling ran only from October 2017 to November 2019.

Moreover, with missing fields in the financial statements of some companies, I was left with only 225 equities out of the 300 originally intended.

Still, it was a decent chunk of data to work with. No issues! Yet…

The (mis)adventure begins…

Challenge: Preventing data leakage in multi-period forward forecasts

Most, if not all, of what I managed to find on equity prediction tended towards one-period-ahead predictions. Those that looked multiple periods ahead were generally of the autoregressive variety.

Neither was what I had in mind for this model.

I really wasn’t looking to create a high- or mid-frequency algorithmic strategy with this model, as I have generally taken a fundamental approach to trading. The primary intent of the model is to identify significant divergences between an equity’s actual traded price and its supposed intrinsic value, and these corrections take multiple periods (days, at least) to materialise, if they do at all.

So, assuming a holding period of 5 periods, it would make sense to base the target variable on the Close price 5 periods ahead, right…?
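In pandas terms, the naive version is a one-liner. A toy example (the data here is made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Toy example: 10 trading days of Close prices for one equity
dates = pd.bdate_range("2019-10-15", periods=10)
close = pd.Series(np.linspace(100.0, 109.0, 10), index=dates)

HOLDING_PERIOD = 5
target = close.shift(-HOLDING_PERIOD)  # Close price 5 periods ahead
print(target.tail())  # the last 5 values are NaN: they don't exist yet!
```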

All done!

Alas, it is not so.

For example, if the training split ends on 28/10/2019, the 5-period-ahead Close prices for 22/10/2019 to 28/10/2019 are actually still unknown.

The target Close prices boxed in red cannot be used!

Although I was using 240 periods for training and my X variables did not actually include the dates, this seemingly small leakage resulted in a massive (and wholly unwarranted) outperformance in my prediction results.

Why not just drop the last 5 periods of X variables, then? Because that resulted in a massive (and this time, wholly warranted) underperformance in my prediction results.

Solution: Slice the data at each possible cut-off date for a training period, and use the last Close price of that period as a fill value when shifting.

Slicing the data for each training period ensures that Close prices that ought to be in the ‘future’ are left out. We can then shift the Close prices forward, using the last Close price of that period as the fill value.
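A minimal sketch of that idea for a single equity’s Close series (the function name and signature are mine, not the project’s actual code):

```python
import pandas as pd

def make_target(close: pd.Series, cutoff, hold: int = 5) -> pd.Series:
    """Shifted Close prices for a training window ending at `cutoff`.

    Slicing first guarantees that no price beyond the cut-off date can
    leak in; the window's last known Close then serves as the fill
    value for the final `hold` rows.
    """
    window = close.loc[:cutoff]        # drop all 'future' prices
    last_known = window.iloc[-1]       # fill value for the tail
    return window.shift(-hold, fill_value=last_known)
```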

If it’s stupid but it works… it might still be stupid.

Here, observations are labelled as the positive class (1) if the equity outperforms the benchmark index by more than 5% at the end of the 5-day holding period.

These sliced sets of target variables are then stored in a dictionary, from which the model can retrieve the right set for each specific training period.
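Putting the labelling and the bookkeeping together, reusing the `make_target` sketch above (`close` and `benchmark` are assumed to be full-length Close series for an equity and the index, and the 240-period history matches the training window mentioned earlier):

```python
def make_labels(close: pd.Series, benchmark: pd.Series,
                cutoff, hold: int = 5, margin: float = 0.05) -> pd.Series:
    """Label 1 if the equity beats the index by more than 5% over the
    holding period, using only data available up to `cutoff`."""
    equity_ret = make_target(close, cutoff, hold) / close.loc[:cutoff] - 1
    bench_ret = make_target(benchmark, cutoff, hold) / benchmark.loc[:cutoff] - 1
    return (equity_ret - bench_ret > margin).astype(int)

# One set of target variables per possible training cut-off date
targets_by_cutoff = {
    cutoff: make_labels(close, benchmark, cutoff)
    for cutoff in close.index[240:]  # cut-offs with a full 240-period history
}
```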

I don’t think this is an elegant solution by any measure, as there is a real concern that the target variable might be mislabelled when these fill values are used. But weighing that consideration against the information gained from the last few days of the training period, I think it was worth it: it definitely improved my model’s results.

Of course, if anyone can think of a more elegant solution, please do share!

Conclusion

This seems like a good place to break, wrap up this article for now, and gauge whether it has been helpful for or of interest to anyone. If so, I would be glad to regale you with the subsequent misadventures I had in the course of this project, such as trying to use GridSearch to tune hyperparameters for a classification model where the accuracy of the model’s (discrete) predictions is not truly indicative of the actual (continuous) price of the equity portfolio. And maybe even some of the methodologies that worked.

And because ultimately every equity selection model is evaluated on its performance, here’s mine thus far. The model’s predictions (blue line) yield 10% accumulated excess returns above the benchmark index (orange line) over a 200-trading-day period, but this assumes no brokerage costs or trading slippage… so yeah, still a long way to go.

The model’s portfolio generates 10+% accumulated excess returns over the benchmark index across a 200-trading-day period

Once more, I’m just a fledgling data scientist and I’m always keen to hear and learn from all of you here. This is still a work-in-progress, and I’d definitely welcome any input or suggestions that the community has!

Thanks for reading, and feel free to reach out to me on LinkedIn anytime!
