Implementation of Technical Indicators into a Machine Learning framework for Quantitative Trading

Building an ML forecasting tool to predict stock price movement using Technical Indicators on S&P100 companies

Modishubham
Towards Data Science



This article is a continuation of my previous article (https://towardsdatascience.com/building-a-comprehensive-set-of-technical-indicators-in-python-for-quantitative-trading-8d98751b5fb), where I developed technical indicators that will now be used in a Machine Learning framework to form directional predictions for stocks. To make the model more robust, this model is developed on the entire S&P100 index rather than just the two stocks used in the last article; the previous code was flexible enough to handle any number of stocks.

The article covers several data analysis and extraction techniques commonly used when developing a quantitative trading strategy, including web scraping, winsorizing, clustering, feature engineering, hyperparameter tuning, validation curves, saving pickle files, and final prediction.

The trading framework will follow the below steps:

1. Use web scraping to obtain the S&P100 constituents as of the date the model is run

2. Download data for the S&P100 companies from Yahoo Finance

3. Feature Engineer the data to create Technical Indicators (as done in the last article)

4. Cluster the S&P100 companies to create unique random forest models for each cluster

5. Use the random forest for the corresponding cluster to predict 7-day movement for each stock

6. Obtain predictions for all stocks and select those with the highest probability of an up move.

Current constituents of the S&P 100

Python offers a convenient way of scraping web data using the Beautiful Soup package along with the requests package, which together allow extraction of HTML data from websites. Writing code to extract the constituents, rather than manually maintaining a list of 100 companies, not only saves a lot of time but also keeps the list up to date with a single run whenever the index constituents change.

Firstly, we find a list of the S&P100 constituents on the internet. The most obvious source is the Wikipedia page for the S&P100 (https://en.wikipedia.org/wiki/S%26P_100). Wikipedia also structures its HTML pages consistently, which makes the data easy to scrape.

We then obtain the list of all tickers in the S&P100 using the code below:
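A minimal sketch of this step is shown below. It assumes the tickers sit in the first column of the constituents table on the Wikipedia page, that the yfinance package is used for the Yahoo Finance download, and that the downloaded history is reshaped into a long all_data DataFrame with one row per date and ticker (the layout the indicator code from the previous article is assumed to expect).

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup
import yfinance as yf  # assumed here for the Yahoo Finance download

# Scrape the constituents table from the Wikipedia page for the S&P 100
url = "https://en.wikipedia.org/wiki/S%26P_100"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Assume the constituents table is the wikitable whose header row contains "Symbol"
tables = soup.find_all("table", {"class": "wikitable"})
constituents = next(t for t in tables if "Symbol" in t.find("tr").text)
tickers = [
    row.find_all("td")[0].text.strip()
    for row in constituents.find_all("tr")[1:]
    if row.find_all("td")
]

# Yahoo Finance writes tickers such as BRK.B as BRK-B
tickers = [t.replace(".", "-") for t in tickers]

# Download daily OHLCV history for every constituent
raw = yf.download(tickers, start="1990-10-15", group_by="ticker", auto_adjust=False)

# Reshape to one row per (date, ticker); this long layout is an assumption
all_data = raw.stack(level=0).rename_axis(["Date", "Ticker"]).reset_index(level="Ticker")
```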

Creating Technical Indicators

Using the all_data DataFrame from above, we can reuse the code from my previous article as is to create the Technical Indicators for all the stocks (https://towardsdatascience.com/building-a-comprehensive-set-of-technical-indicators-in-python-for-quantitative-trading-8d98751b5fb).

Creating Prediction variable

Now that we have formed all the variables that will be used in predicting the stock movement, we need to define the prediction variable.

Technical Indicators generally work well for short-horizon predictions, and since our indicators are based on 5-day and 15-day periods, I use a 7 (trading) day prediction interval. The idea is to observe the technical indicators today and use them to predict the direction of each stock's movement 7 days later. If the stock went up over those 7 days, we denote it by 1; if it went down or did not change, we denote it by 0 ("Target_Direction"). To make this concrete from a real-world point of view: if we look at the indicators today, we buy at the next day's opening price, hold for 7 trading days including the day we bought, and sell at the Close of the 7th trading day. In the code, this means shifting the price by just 6 days relative to the buy day. Our profit is the difference between our buy price (the Open of day 1) and our sell price (the Close of day 7).

In the code below, we define the Target variable as the percentage profit described above. This is then transformed into a Target_Direction (1 or 0) variable, which forms our prediction variable.
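A sketch of this step, assuming all_data has one row per date and ticker with Open and Close columns (the exact alignment of the shift is a modelling choice):

```python
# Buy at the next day's Open (day 1 of the holding period) and sell at the Close
# of day 7, i.e. 6 trading days after the buy day, computed separately per ticker
def add_target(df):
    df = df.copy()
    buy_price = df["Open"].shift(-1)    # tomorrow's Open, our assumed entry price
    sell_price = df["Close"].shift(-7)  # Close 7 rows ahead = 6 days after the buy day
    df["Target"] = (sell_price - buy_price) / buy_price          # percentage profit
    df["Target_Direction"] = (df["Target"] > 0).astype(int)      # 1 = up, 0 = down/flat
    return df

all_data = all_data.groupby("Ticker", group_keys=False).apply(add_target)
```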

Winsorizing the Indicators

Before we begin to develop our prediction model, it is important to deal with the outliers that exist in our explanatory variables, i.e. our Technical Indicators. There are multiple ways to treat outliers, one of which is winsorizing the data. The idea of winsorizing is to bring extreme outliers to the closest value that is not considered an outlier. For example, here we winsorize the lowest 10% and highest 10% of values, so the extreme values are set to the 10th and 90th percentile values respectively. A key advantage of winsorizing is that the information contained in the extreme outliers is not lost; only their extreme magnitudes are capped.
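One way to implement this is scipy's winsorize helper, sketched below. Here indicator_cols is an assumed list holding the Technical Indicator column names.

```python
from scipy.stats.mstats import winsorize

# Cap each Technical Indicator at its 10th and 90th percentile values;
# NaNs are assumed to have been dropped or filled beforehand
for col in indicator_cols:
    all_data[col] = winsorize(all_data[col].to_numpy(), limits=[0.10, 0.10])
```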

Clustering companies

This is where things start to get interesting. A simple approach would be to assume that all companies follow the same ML model and create one global model to predict returns for every company. However, different companies and industries may react differently to a given set of Technical Indicators. One way to address this is to create a different ML model for each cluster of companies that are expected to behave similarly, perhaps because they belong to the same industry, where "behavior" is captured by their returns.

Thus, to make our model even more sophisticated, we will create different ML models for each cluster.

Identify number of clusters

The idea is to use the returns of these companies and create an elbow curve to determine the number of clusters that balances a low within-cluster sum of squares against the total number of clusters. For 100 companies, creating 100 clusters would give a within-cluster sum of squares of 0, but such a clustering would not make sense, i.e. it would not be parsimonious. On the other hand, a single cluster would be maximally parsimonious but would give a very high within-cluster sum of squares. An elbow curve helps determine the approximate point at which the marginal decrease in the sum of squares becomes small.

We use K-means clustering to create the elbow curve. K-means minimizes the inertia, i.e. the within-cluster sum of squares, while clustering. By fitting the algorithm for every number of clusters from 1 to 50, we create the required elbow curve.
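A sketch of the elbow curve, assuming returns_wide is a DataFrame of daily returns with one column per ticker:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Transpose so that each row is a company and each column is a date of returns
X = returns_wide.dropna().T

inertias = []
ks = range(1, 51)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Within-cluster sum of squares (inertia)")
plt.title("Elbow curve")
plt.show()
```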

Elbow Curve for different number of clusters — Image by author
Change in Sum of Squares between clusters — Image by author

The curves and the table above suggest that 17 clusters serve our purpose best: the decrease in the sum of squares beyond 17 is not large enough to justify more clusters. Hence, we select 17 clusters for our analysis.

Creating clusters

We now use the Gaussian Mixture clustering algorithm to assign the companies to 17 clusters based on their returns. Gaussian Mixture uses a probabilistic method to determine the appropriate cluster for a series of observations, assuming the universe is generated by a mixture of different Gaussian distributions.

The code below gives us a dataframe with the different clusters and the companies that fall into each one.
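A sketch, reusing the same company-by-returns matrix X from the elbow step:

```python
import pandas as pd
from sklearn.mixture import GaussianMixture

# Fit a 17-component Gaussian Mixture on the company-by-returns matrix X
gmm = GaussianMixture(n_components=17, random_state=42).fit(X)

# One row per ticker with its assigned cluster
clusters = pd.DataFrame({"Ticker": X.index, "Cluster": gmm.predict(X)})
print(clusters.groupby("Cluster")["Ticker"].apply(list))
```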

Let us look at a few clusters to understand if the clustering makes sense.

Sample companies in clusters — Image by author

The clustering resulted, to a large extent, in an industry-wide grouping of stocks, which is in line with our initial intuition.

Random Forest model

Random Forest is a commonly used Machine Learning model for Regression and Classification problems. However, given the complexity of the model, it is important to carefully understand the parameters that go into the model to prevent in-sample overfitting or underfitting, a standard bias-variance tradeoff.

There are quite a few things to consider while forming a Machine Learning model. Let us go through one by one.

Training period

We will first separate our data into a train and a test sample. While scikit-learn offers a train_test_split function, we will not use it here: our data is time-series data, so we want to split train and test along the timeline rather than randomly assigning observations to each set. We first convert our index into a datetime index and split the data into periods before and after 31st December 2018.

Train period: 15th October 1990 – 31st December 2018

Test period: 1st January 2019 – 16th November 2020
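A sketch of the date-based split, assuming all_data is indexed by date:

```python
import pandas as pd

# Convert the index to datetimes and split along the timeline at the cut-off date
all_data.index = pd.to_datetime(all_data.index)
train = all_data[all_data.index <= "2018-12-31"]
test = all_data[all_data.index > "2018-12-31"]
```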

Type of Random Forest model

Random Forest has primarily two types: a Regressor and a Classifier. The Regressor predicts a continuous target, analogous to a non-linear regression, whereas the Classifier predicts class membership, analogous to a non-linear logistic model. We will use the Classifier because we want the probability of an up move (the probability of 1) for every stock; similar to a logistic regression, the Random Forest Classifier gives us exactly that.

Hyperparameter tuning

One of the key steps of a random forest model is fine-tuning its parameters. Random Forest has a handful of key parameters that define how the forest is built: max_depth, max_features, min_samples_leaf, n_estimators, and min_samples_split. Since these parameters can take a wide range of values, we will use GridSearchCV to obtain the best combination of them for each cluster.

GridSearchCV tries the combinations of the parameter values that we provide and returns the combination with the lowest out-of-sample cross-validation error. These parameters are then used for prediction.
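A sketch of the grid search for one cluster, with an illustrative (assumed) parameter grid, where X_train holds the winsorized indicators and y_train the Target_Direction values for the companies in that cluster:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid; the ranges are centred on the validation curves below
param_grid = {
    "n_estimators": [5, 7, 10],
    "max_depth": [3, 5, 7],
    "max_features": ["sqrt", 0.5],
    "min_samples_leaf": [50, 100],
    "min_samples_split": [100, 200],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)
grid.fit(X_train, y_train)
best_params = grid.best_params_
```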

Validation Curves

To determine the initial values handed to GridSearchCV, around which it searches for the best combination, we can use Validation Curves for each parameter. Validation curves also use cross-validation and provide a prediction score both in sample and out of sample. This gives us a good idea of the value around which to centre the range supplied to GridSearchCV.

Below is an example of the validation curve for n_estimators. At the lowest values of n_estimators we see underfitting, while at higher values we see overfitting. Underfitting corresponds to high bias and low variance, whereas overfitting corresponds to low bias and high variance. At n_estimators = 7 we obtain a good tradeoff between bias and variance, so in our GridSearchCV we centre the n_estimators range around 7.

Validation curve for n_estimators — Image by author
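A curve like the one above can be produced with scikit-learn's validation_curve helper; a sketch for n_estimators, using the same assumed X_train and y_train as before:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

param_range = np.arange(1, 21)
train_scores, test_scores = validation_curve(
    RandomForestClassifier(random_state=42),
    X_train,
    y_train,
    param_name="n_estimators",
    param_range=param_range,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
)

# Compare mean in-sample vs cross-validated scores to spot under/over-fitting
print(train_scores.mean(axis=1))
print(test_scores.mean(axis=1))
```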

Pickle files

Python offers a very convenient way of saving fitted model objects using the pickle package. The idea is to fit the model on the data and save the fitted model into a pickle file for each cluster. This way, a different model is saved for each cluster, and when predicting for a particular company we load the model from the corresponding cluster's pickle file and make our prediction.

Building Random Forest Model
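A sketch of the per-cluster fit-and-save loop. Here feature_cols is an assumed list of indicator column names and best_params_by_cluster an assumed dictionary of GridSearchCV results keyed by cluster id.

```python
import pickle
from sklearn.ensemble import RandomForestClassifier

for cluster_id, members in clusters.groupby("Cluster")["Ticker"]:
    # Training rows for the companies in this cluster, with unusable rows dropped
    cluster_train = train[train["Ticker"].isin(members)].dropna(subset=["Target"])

    # Fit a Random Forest with the tuned parameters for this cluster
    model = RandomForestClassifier(random_state=42, **best_params_by_cluster[cluster_id])
    model.fit(cluster_train[feature_cols], cluster_train["Target_Direction"])

    # Save the fitted model so it can be reloaded later for prediction
    with open(f"rf_cluster_{cluster_id}.pkl", "wb") as f:
        pickle.dump(model, f)
```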

Final prediction

Now that we have a saved model for each of the clusters, we can use these models to get predictions for the stocks. The saved models are trained on daily data from 15th October 1990 to 31st December 2018, which is a significant number of data points for a good model. One can also fit the model up to the most recent date. The reason I used 31st December 2018 is to back-test the model from 1st January 2019 to 16th November 2020.

Trading: Back-test

In order to trade using this model, we obtain a probability of an upward move for each stock. We aim to go long the stocks with the highest probability of an up move. Let us look at an example.

Let us assume that we are currently at 31st December 2018 and have created the model files. The next trading day is 2nd January 2019. At the end of 2nd January, we have values for all the indicators, which we can use to predict each stock's movement. We put these values into our models and get, for each stock, the probability of a 1 (an up move) over the next 7 trading days.
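A sketch of this prediction step. Here latest_features is a hypothetical helper returning a one-row DataFrame of winsorized indicator values for a ticker as of the close on 2nd January 2019.

```python
import pickle
import pandas as pd

predictions = {}
for _, row in clusters.iterrows():
    ticker, cluster_id = row["Ticker"], row["Cluster"]

    # Load the saved model for this ticker's cluster
    with open(f"rf_cluster_{cluster_id}.pkl", "rb") as f:
        model = pickle.load(f)

    # Probability of class 1, i.e. an up move over the next 7 trading days
    predictions[ticker] = model.predict_proba(latest_features(ticker))[0, 1]

pred_df = pd.DataFrame.from_dict(predictions, orient="index", columns=["Prob_Up"])
```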

We now have a dataframe with the probability of an up move for each stock, which can be used for trading on 3rd January.

Probability of up move for each company — Image by author

To find the stocks with the highest probability of an up move, we sort the prediction column in descending order and pick the top 10 stocks.
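For example:

```python
# Ten tickers with the highest predicted probability of an up move
top10 = pred_df.sort_values("Prob_Up", ascending=False).head(10)
```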

Once we have the 10 stocks, we wait for 3rd January 2019, buy at the Open price, hold for 7 trading days, and sell at the Closing price at the end of the 7th trading day. The main assumption here is that we can trade at the Open price and sell at the Close price. This is not too unrealistic, given that we know the market's opening and closing times and can code the execution for 1 minute after the Open and 1 minute before the Close.

The table below shows the probability of the prediction, the actual movement after 7 days, and the percentage change over the next 7 days.

Top 10 companies in our prediction — Image by author

We can see that for 9 of the top 10 firms, the prediction of an up move over the next 7 trading days was correct. One could therefore have traded on this information at the beginning of 3rd January 2019: if all 10 stocks were bought and held for 7 days, a profit would have been realized on 9 of them.

Example of Final Trade on 3rd January

To make things clear, let me show an example of how we can trade our top prediction, BIIB, in real life.

Step 1: Run the model for all the stocks using the indicators calculated after the close on 2nd January 2019.

Step 2: Here we know BIIB has the highest probability of up movement. We wait until 3rd January 2019 and buy BIIB at the Open price.

Step 3: We hold the stock for 7 trading days (until 11th January 2019) and sell it at the Close price on 11th January.

BIIB movement for 7 days — Image by author

Thus, we would have purchased the stock at $306.76 and sold it at $333.21, making an 8.62% profit in 7 days.

The cumulative returns for all 10 stocks over the next 7 days are:

Cumulative returns of all top 10 firms over 7 day holding period — Image by author

We see that 9 out of the 10 stocks gave positive and decent returns over the 7-trading-day period.

Trading in out of sample period

Once we liquidate all these holdings on 11th January 2019, we run the model again on the data obtained up to 11th January to get new prediction probabilities, buy the stocks with the highest probability at the Open of the next trading day, and repeat the process.

Using the process described above, the chart below lays out the cumulative returns from following the strategy from January 2019 to November 2020.

Cumulative returns from strategy back-test — Image by author

Back-testing

Before implementing any strategy, it is important to thoroughly back-test it under different market conditions and at different time periods, and to do so in a systematic way that accounts for transaction costs, the realistic nature of the trades, and other practical factors. The above is a simplistic back-test assuming no transaction costs and perfect execution of trades.

To understand whether the strategy actually produces good results, we would look at the cumulative return it generates over the roughly year-and-a-half test period, along with the Sharpe Ratio, the maximum drawdown, and the Value at Risk (VaR), among other statistics. We could also regress the strategy's returns on different factors to understand the factor loadings our portfolio would have. Depending on the factor, this portfolio could potentially be used for factor hedging or for increasing factor exposure.
