Beating Stock Market | Towards Data Science

Personal Projects

Note from Towards Data Science’s editors: While we allow independent authors to publish articles in accordance with our rules and guidelines, we do not endorse each author’s contribution. You should not rely on an author’s works without seeking professional advice. See our Reader Terms for details.

Summary

This is a personal project in which I set out to build a trading application using machine learning tools. Starting from data modelling, a categorisation based on the distribution of price changes, and machine learning techniques, I have developed a trading strategy that aims to help beginner investors generate low-risk profit.

Introduction

Market analysis is both interesting and complex, as can be seen in [1]. Nevertheless, several works based on machine learning try to shed light on this field.

In this piece of work, I have created an application consisting of two main screens:

A screen where a stock market index can be analysed over different time horizons. It contains a candlestick chart; a chart for analysing technical indicators [2]; a line chart showing the percentage price change between days; and a box plot of those changes, which helps to understand their distribution.
A screen for analysing the trading strategy I have developed (Strategyone). The strategy has two parts: the first predicts stock market index movements by means of machine learning, and the second compares the data vector behind the current prediction with what happened in the past. The available time horizons are 7, 14, 21 and 28 days.

This second screen is explained in detail in "How to beat the market" and "Trading Strategy".

Price data is obtained through the Alpha Vantage API [3], and the list of stock market indices comes from the Finnhub API [4].

Context

As a physicist, I have always been fascinated by the world of complex systems: how certain formulae can be applied, with interesting results, to biological or financial systems as well as to a group of electrons.

Likewise, I am fascinated by how an element studied in isolation can behave differently from the same element studied within the system.

Consequently, this project stems from curiosity about the stock market, together with the software and intellectual challenge of understanding a system as complex as the market.

The project has gone through three stages. The first version was developed as the final thesis of the Master’s Degree in Data Science which I attended in [5]; its only aim was to create a classification model that could predict the future movement of a stock using machine learning. The second version was developed outside the Master’s and tried to improve on the first. The third version, discussed here, adds a significant improvement: a trading strategy.

How to beat the market

In order to use a classification model to predict market movements, I needed to categorise the data. The prediction categories are "Strong bull", when the price increase is significant; "Bull", when the price increases; "Keep", when the price remains the same; "Bear", when the price decreases; and "Strong bear", when the price decrease is significant [6].

How are the stock market index categories chosen?

This is done through the distribution of the percentage variation in the stock price. As our aim is to predict the future, the percentage variation column of each record must hold, for each day, how much the price changes between that day and the end of the time horizon we want to predict.
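As a minimal sketch of that forward-looking column, the function below computes the percentage change between each closing price and the close `horizon` rows later. The function name and the use of plain lists are assumptions made for illustration; the project itself may build this column differently, and whether the horizon counts calendar or trading days is not stated in the article.

```python
def forward_pct_change(closes, horizon):
    """Percentage change from each row to the close `horizon` rows later.

    The last `horizon` entries are None because their future price
    is not known yet.
    """
    changes = []
    for i, price in enumerate(closes):
        j = i + horizon
        if j < len(closes):
            changes.append((closes[j] - price) / price * 100.0)
        else:
            changes.append(None)
    return changes


# Made-up closing prices, purely for illustration
closes = [100.0, 102.0, 101.0, 105.0, 110.0]
changes = forward_pct_change(closes, horizon=2)
```

The trailing `None` values are the records that cannot be labelled yet, which is why the most recent days can be predicted but not used for training.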

The percentage variation to be categorised is then compared with the distribution of the last 4 months, and one of the abovementioned categories is selected according to the percentile range in which it falls within that distribution.

In this way, all the data can be categorised for a given time horizon, and the category always refers to the future.

Once the categorisation was done, the next step was to find the most precise way of applying a classification algorithm. After a number of trials and different ideas, the selected process was to scale the data with a robust scaler and to use Random Forest as the classification algorithm. These were chosen because they give the highest average precision across all the categories.
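The percentile-based labelling can be sketched as follows. The cut points (10, 40, 60, 90) and the window of 80 rows (roughly four months of trading days) are hypothetical values chosen for the example; the article does not state the percentiles actually used.

```python
import numpy as np

LABELS = ["Strong bear", "Bear", "Keep", "Bull", "Strong bull"]
# Hypothetical percentile cut points, not taken from the article
CUTS = (10, 40, 60, 90)


def categorise(value, reference, cuts=CUTS):
    """Label `value` by the percentile band it falls into within `reference`."""
    thresholds = np.percentile(reference, cuts)
    # searchsorted counts how many thresholds lie strictly below `value` (0..4)
    return LABELS[int(np.searchsorted(thresholds, value))]


def categorise_series(variations, window=80, cuts=CUTS):
    """Label each variation against the `window` observations preceding it."""
    labels = []
    for i, value in enumerate(variations):
        if i < window:
            labels.append(None)  # not enough history yet
        else:
            labels.append(categorise(value, variations[i - window:i], cuts))
    return labels
```

Because the thresholds are recomputed on a rolling window, the same percentage change can be "Bull" in a calm period and "Keep" in a volatile one.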

Just by following these steps, we obtain a model that is able to predict "Strong bull" with a 40 % level of accuracy.
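A minimal sketch of this scaling and classification step with scikit-learn is shown below. The features and labels are synthetic random data, so the score it prints says nothing about real performance; the number of trees, the split point and the variable names are likewise assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

labels = np.array(["Strong bear", "Bear", "Keep", "Bull", "Strong bull"])

# Synthetic stand-ins for the indicator features and the categories
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = rng.choice(labels, size=300)

# Keep chronological order: train on earlier rows, evaluate on later ones
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

model = make_pipeline(
    RobustScaler(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Share of "Strong bull" predictions that were actually "Strong bull"
mask = predictions == "Strong bull"
precision = float((y_test[mask] == "Strong bull").mean()) if mask.any() else float("nan")
print(f"Strong bull precision on synthetic data: {precision:.2f}")
```

Wrapping the scaler and the classifier in one pipeline ensures the scaler is fitted only on the training rows, which avoids leaking information from the evaluation period.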

Trading Strategy

The trading strategy is based on what happened in the past and on the idea that a prediction counts as correct whenever the operation wins, even if the exact predicted category is not the one that occurs.

That is, if the prediction is "Bull", we open a long position and the actual outcome is "Strong bull", our prediction is considered accurate. The same holds if we predict "Strong bull" and the result is "Bull". Likewise, if the prediction is "Strong bear", we open a short position and the outcome is "Bear", or the other way round, the prediction is considered accurate.

If none of the abovementioned cases takes place, the operation is considered a failure.

With this in mind, the strategy consists only of long positions, opened when the model predicts "Strong bull", since that is the category with the highest accuracy in the classification model.
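The win/fail rule described above can be written as a small function. This is a sketch with hypothetical names; treating a "Keep" prediction as opening no position (and therefore never winning) is an assumption, since the article does not cover that case.

```python
LONG = {"Bull", "Strong bull"}
SHORT = {"Bear", "Strong bear"}


def is_win(predicted, actual):
    """True when the position implied by the prediction matches the actual direction."""
    if predicted in LONG:
        return actual in LONG
    if predicted in SHORT:
        return actual in SHORT
    return False  # assumption: a "Keep" prediction opens no position


print(is_win("Bull", "Strong bull"))   # long position, price went up
print(is_win("Strong bear", "Bear"))   # short position, price went down
print(is_win("Strong bull", "Keep"))   # long position, price did not move
```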

How does the strategy work?

Once the robust scaler has been applied to all the records, and each record has both its predicted category and its actual category, a PCA is applied to reduce the number of dimensions to 4 while maintaining 95 % of the data variability. We therefore have 4 new variables together with the prediction linked to each record and its actual category. This lets us see what the variables look like when a given category is predicted and a given category actually occurs: we group the records by prediction and actual category, and calculate the median of each variable in each group to obtain a profile curve that describes it.

As a result, we have a description of the variables for the cases in which "Strong bull" is predicted and the actual outcome was "Strong bull" or any other category.

All of this is limited to the 6 months of data preceding the prediction day, in order to avoid the influence of an old market state on the strategy. The results obtained are summarised below:

Description of the variables for each prediction-category after the PCA.

The table is read as follows: for the records in the last 6 months in which "Strong bull" was predicted and the category was guessed correctly, the principal component variables had the listed values as their medians.

Consequently, in order to carry out an operation, we must apply the robust scaler and the PCA to the data of the day on which we are making the prediction.
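One way to build such median profiles is sketched below with scikit-learn and NumPy. The data is synthetic, the 120 rows stand in for a trailing 6-month window, and all names are hypothetical; on real data one would also inspect `pca.explained_variance_ratio_.sum()` to check how much variability the 4 components retain.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import RobustScaler

labels = np.array(["Strong bear", "Bear", "Keep", "Bull", "Strong bull"])

# Synthetic stand-ins for the trailing window of records
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
predicted = rng.choice(labels, size=120)
actual = rng.choice(labels, size=120)

# Scale first, then project onto 4 principal components
scaler = RobustScaler().fit(X)
pca = PCA(n_components=4).fit(scaler.transform(X))
components = pca.transform(scaler.transform(X))

# One median vector ("profile") per (predicted, actual) pair seen in the window
pairs = sorted({(str(p), str(a)) for p, a in zip(predicted, actual)})
profiles = {}
for pred_label, actual_label in pairs:
    mask = (predicted == pred_label) & (actual == actual_label)
    profiles[(pred_label, actual_label)] = np.median(components[mask], axis=0)
```

Keeping the fitted `scaler` and `pca` objects matters: the data of the prediction day has to go through exactly the same transformations before it can be compared with these profiles.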

If the prediction obtained is "Strong bull", the first condition for carrying out the operation is met. The second step is to check which of the previous profile curves is most similar to the data being predicted. This is done with cosine similarity, which identifies the profile vector most similar to the data. If it corresponds to "Strong bull-Strong bull", we have the signal to perform a safer operation.

Following this trading strategy, we obtain a level of accuracy of almost 50 %, but, as mentioned at the beginning, guessing correctly does not require guessing the exact category.
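The similarity check can be sketched as follows. The profile values and the vector for the prediction day are invented for the example, and the function names are hypothetical; the sketch also assumes no vector has zero length, since the cosine is undefined in that case.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def closest_profile(vector, profiles):
    """Return the (predicted, actual) key whose profile is most similar to `vector`."""
    return max(profiles, key=lambda key: cosine_similarity(vector, profiles[key]))


# Invented median profiles for two (predicted, actual) pairs
profiles = {
    ("Strong bull", "Strong bull"): [1.0, 0.5, -0.2, 0.1],
    ("Strong bull", "Bear"): [-0.8, 0.3, 0.4, -0.1],
}
today = [0.9, 0.6, -0.1, 0.0]  # scaled and PCA-transformed data for the prediction day

best = closest_profile(today, profiles)
take_trade = best == ("Strong bull", "Strong bull")
```

Cosine similarity compares only the direction of the vectors, not their magnitude, so a day that resembles a profile in shape still matches it even if its values are larger or smaller overall.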

Guessing correctly does not require guessing the exact category

In our setting, predicting "Strong bull" and obtaining "Bull" as the final result also counts as a correct guess. When this is taken into account, the accuracy of the strategy reaches 58 %.
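The difference between the two ways of counting can be illustrated with a short sketch. The list of outcomes is invented for the example and does not reproduce the figures above; the function name is hypothetical.

```python
def hit_rates(actual_outcomes):
    """Strict and relaxed hit rates for trades opened on a "Strong bull" prediction."""
    total = len(actual_outcomes)
    strict = sum(a == "Strong bull" for a in actual_outcomes) / total
    relaxed = sum(a in ("Strong bull", "Bull") for a in actual_outcomes) / total
    return strict, relaxed


# Invented actual categories for five trades
outcomes = ["Strong bull", "Bull", "Bear", "Strong bull", "Keep"]
strict, relaxed = hit_rates(outcomes)
print(f"strict: {strict:.0%}, relaxed: {relaxed:.0%}")
```

The relaxed rate can never be lower than the strict one, because every exact match is also a winning long position.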

Conclusion

The aim of this piece of work was to develop a strategy that allows a beginner investor to generate low-risk profit without suffering a total loss. As mentioned, the strategy reaches a 58 % level of accuracy under the described conditions but, on a personal note, it is not a strategy to be implemented automatically, because the assumed error level is still around 40 %.

However, it is interesting to see that a level of accuracy above 50 % is obtained in the operations performed, following a strategy based only on data and with minimal knowledge of the stock market.

All the project code can be read on: GitHub/esan94/bsm03