Application of Gradient Boosting in Order Book Modeling

Sergey Malchevskiy
Towards Data Science
Jun 18, 2019 · 8 min read



Today we are going to create an ML model that forecasts the price movement in the order book. This article contains a full-cycle of research: getting data, visualization, feature engineering, modeling, fine-tuning of the algorithm, quality estimation, and so on.

What is an Order Book?

An order book is an electronic list of buy and sell orders for a specific security or financial instrument organized by price level. An order book lists the number of shares being bid or offered at each price point, or market depth. Market depth data helps traders determine where the price of a particular security could be heading. For example, a trader may use market depth data to understand the bid-ask spread for a security, along with the volume accumulating above both figures. Securities with strong market depth will usually have strong volume and be quite liquid, allowing traders to place large orders without significantly affecting market price. More information is here.

Pricing scheme

Market depth looks like this; the visualization can differ depending on the software.

BTC market depth on GDAX

Another way to visualize an order book is as a list of bids and offers

Order book list

The mid-price is the price between the best ask (the lowest price at which sellers are willing to sell) and the best bid (the highest price buyers are willing to pay). It can simply be defined as the average of the current best bid and best ask quotes.
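In formula form:

$$p_{\text{mid}} = \frac{p_{\text{best bid}} + p_{\text{best ask}}}{2}$$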

Our goal is to create a model that forecasts the mid-price.

Getting the Data

Let’s download the data samples from LOBSTER. This service provides sample data for Google, Apple, Amazon, Intel, and Microsoft at three market depths (1, 5, and 10 levels).

First of all, I suggest visualizing the mid-price and the ask-bid volume difference for all available assets. We need to import the necessary libraries
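For instance, the imports used in the rest of this walkthrough might look like this (the original notebook may use a slightly different set):

```python
# Assumed imports for data handling, plotting, modeling, and tuning;
# the exact list in the original code may differ.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from catboost import CatBoostRegressor
from hyperopt import fmin, tpe, hp, Trials
from sklearn.metrics import mean_squared_error
```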

The next code loads the data for a given asset and level from a file
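As a sketch, assuming the standard LOBSTER sample-file layout (4 × level columns ordered ask price, ask size, bid price, bid size; prices quoted in units of 10⁻⁴ dollars), such a loader could look like this. The file naming and the load_lobster name are illustrative, not the ones from the repository:

```python
def load_lobster(asset: str, level: int, path: str = "data") -> pd.DataFrame:
    """Load a LOBSTER sample order book file into a tidy DataFrame.

    Assumes the standard LOBSTER column order:
    ask_price_1, ask_size_1, bid_price_1, bid_size_1, ask_price_2, ...
    """
    cols = []
    for i in range(1, level + 1):
        cols += [f"ask_price_{i}", f"ask_size_{i}",
                 f"bid_price_{i}", f"bid_size_{i}"]
    df = pd.read_csv(f"{path}/{asset}_orderbook_{level}.csv",
                     header=None, names=cols)
    # LOBSTER quotes prices in units of 10^-4 dollars
    price_cols = [c for c in cols if "price" in c]
    df[price_cols] = df[price_cols] / 10000.0
    df["mid_price"] = (df["ask_price_1"] + df["bid_price_1"]) / 2.0
    return df
```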

After that, we can visualize each asset
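A rough sketch of that visualization, reusing the hypothetical load_lobster helper above (the article's figures may be time-series plots rather than histograms):

```python
assets = ["AAPL", "AMZN", "GOOG", "INTC", "MSFT"]

fig, axes = plt.subplots(len(assets), 2, figsize=(12, 3 * len(assets)))
for (ax_mid, ax_vol), asset in zip(axes, assets):
    book = load_lobster(asset, level=1)
    # distribution of the mid-price
    ax_mid.hist(book["mid_price"], bins=100)
    ax_mid.set_title(f"{asset}: mid-price")
    # distribution of the level-1 ask-bid volume difference
    ax_vol.hist(book["ask_size_1"] - book["bid_size_1"], bins=100)
    ax_vol.set_title(f"{asset}: ask-bid volume difference")
plt.tight_layout()
plt.show()
```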

Mid-price and ask-bid volume difference

MSFT and INTC have slightly strange distributions that differ from the other assets. The mid-price graph doesn’t show a single bell curve; it looks like a mixture of two distributions. Also, the volume difference is suspiciously symmetric and differs from the other assets.

Feature Engineering

This part is very important because the quality of the model depends directly on it. The new features should reflect a wide range of relationships between bids, asks, and volumes, and also between the different depths of the book.

The next formulas are used to create these features

Features

These features are the first part of the feature engineering. The second part is adding lag components: we shift the given features back in time by several lags and add them as new columns. This example shows how it works on the raw dataset (not on the new features).

Lag components example

The next code provides these two parts of feature engineering and adds the target column log_return_mid_price.
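The full implementation is in the linked repository; a minimal sketch of the idea (the exact formulas, column names, and the build_features name here are illustrative) might look like this:

```python
def build_features(df: pd.DataFrame, level: int, n_lags: int) -> pd.DataFrame:
    """Illustrative feature construction; the original code may differ in details."""
    feats = pd.DataFrame(index=df.index)
    for i in range(1, level + 1):
        # log returns of prices and log ratios between the two sides of the book
        feats[f"log_return_ask_{i}"] = np.log(df[f"ask_price_{i}"]).diff()
        feats[f"log_return_bid_{i}"] = np.log(df[f"bid_price_{i}"]).diff()
        feats[f"log_ask_div_bid_{i}"] = np.log(df[f"ask_price_{i}"] / df[f"bid_price_{i}"])
        feats[f"log_ask_div_bid_size_{i}"] = np.log(df[f"ask_size_{i}"] / df[f"bid_size_{i}"])

    # lag components: shifted copies of every feature
    lagged = {f"{col}_lag_{k}": feats[col].shift(k)
              for col in feats.columns for k in range(1, n_lags + 1)}
    feats = pd.concat([feats, pd.DataFrame(lagged, index=feats.index)], axis=1)

    # target: here taken as the next-step log return of the mid-price
    # (the original may define it slightly differently)
    feats["log_return_mid_price"] = np.log(df["mid_price"]).diff().shift(-1)
    return feats.dropna()
```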

Usually, the features look like this

Features example

Modeling via Gradient Boosting and Fine-Tuning

Our goal is to show that training a GBM is performing gradient-descent minimization on some loss function between our true target, $y$, and our approximation, $\hat{y} = F_M(x)$. That means showing that adding weak models, $\Delta_m(x)$, to our GBM additive model

$$F_M(x) = f_0(x) + \sum_{m=1}^{M} \Delta_m(x)$$

is performing gradient descent in some way. It makes sense that nudging our approximation closer and closer to the true target $y$ would be performing gradient descent; for example, at each step the residual gets smaller. We must be minimizing some function related to the distance between the true target and our approximation. Let’s revisit our golfer analogy and visualize the squared error between the approximation and the true value
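For the squared-error case this connection can be written out explicitly: the negative gradient of the loss with respect to the current approximation is exactly the residual, so fitting the next weak model to the residual is a gradient-descent step:

$$L\bigl(y, F_m(x)\bigr) = \tfrac{1}{2}\bigl(y - F_m(x)\bigr)^2, \qquad -\frac{\partial L}{\partial F_m(x)} = y - F_m(x)$$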

You can find more information here.

We will use Yandex’s implementation of gradient boosting, called CatBoost. In most benchmark cases this library outperforms the others in speed and quality

Libraries performance

This algorithm has a few parameters that have a huge impact on the quality:

  • n_estimators — the maximum number of trees that can be built when solving machine learning problems;
  • depth — the maximum depth of the trees;
  • learning_rate — this setting is used for reducing the gradient step. It affects the overall time of training: the smaller the value, the more iterations are required for training;
  • l2_leaf_reg — coefficient of the L2 regularization term of the cost function. Any positive value is allowed.

Also, we have parameters of the features:

  • level — market depth;
  • number of time-steps — how many lags to build.

Theoretically, each asset could have its own unique set of parameters. For this task, we should define an objective function that estimates the quality of the model
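The real objective function lives in the repository; a minimal sketch of the idea, reusing the hypothetical helpers above, could be:

```python
def objective(params):
    """Train on the first 50% of rows, return validation RMSE on the next 25%."""
    # df: the order book DataFrame for one asset, loaded with load_lobster(...)
    feats = build_features(df, level=int(params["level"]),
                           n_lags=int(params["n_lags"]))
    X = feats.drop(columns="log_return_mid_price")
    y = feats["log_return_mid_price"]
    n = len(X)
    train, val = slice(0, n // 2), slice(n // 2, 3 * n // 4)

    model = CatBoostRegressor(
        n_estimators=int(params["n_estimators"]),
        depth=int(params["depth"]),
        learning_rate=params["learning_rate"],
        l2_leaf_reg=params["l2_leaf_reg"],
        loss_function="RMSE", verbose=False,
    )
    model.fit(X.iloc[train], y.iloc[train])
    return mean_squared_error(y.iloc[val], model.predict(X.iloc[val])) ** 0.5
```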

One of the best ways to define the optimal parameters is Bayesian optimization. I described this approach in the previous article.
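For example, with hyperopt the combined search space and optimization loop might be sketched like this (the bounds are illustrative, not those used in the article):

```python
space = {
    "level": hp.choice("level", [1, 5, 10]),        # LOBSTER sample depths
    "n_lags": hp.quniform("n_lags", 1, 10, 1),      # number of time-steps
    "n_estimators": hp.quniform("n_estimators", 100, 1000, 100),
    "depth": hp.quniform("depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.005), np.log(0.3)),
    "l2_leaf_reg": hp.uniform("l2_leaf_reg", 1.0, 10.0),
}
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
```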

The loss function is RMSE, which looks like this
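$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2}$$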

The train set consists of the first 50% of the data. The next part is the validation set, used for fine-tuning the model. The last 25% of the data is hold-out data, needed only to test the final result.

After the fine-tuning step, we train the final model on both parts (the train and validation sets) and test it on the last part. Let’s code this
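As a sketch of that final step (best_level, best_n_lags, and best_catboost_params stand in for whatever the tuning step returned; the actual do_experiment function is in the repository):

```python
# Refit on train + validation (first 75%) with the tuned parameters,
# then evaluate once on the untouched last 25%.
feats = build_features(df, level=best_level, n_lags=best_n_lags)
X = feats.drop(columns="log_return_mid_price")
y = feats["log_return_mid_price"]
n = len(X)
X_fit, y_fit = X.iloc[: 3 * n // 4], y.iloc[: 3 * n // 4]
X_test, y_test = X.iloc[3 * n // 4:], y.iloc[3 * n // 4:]

final_model = CatBoostRegressor(**best_catboost_params,
                                loss_function="RMSE", verbose=False)
final_model.fit(X_fit, y_fit)
test_rmse = mean_squared_error(y_test, final_model.predict(X_test)) ** 0.5
```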

The do_experiment function is the main one in this research. It additionally builds the feature importance of the best model and estimates the model’s quality.

Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.

This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

Importance is calculated for a single decision tree by the amount that each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points or another more specific error function.

The feature importances are then averaged across all of the decision trees within the model. Source here.
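With CatBoost, these scores can be extracted directly via get_feature_importance; the plotting below is just an illustrative way to show the top features:

```python
# Rank features by CatBoost's built-in importance scores
importances = pd.Series(final_model.get_feature_importance(),
                        index=X_fit.columns).sort_values(ascending=False)
importances.head(15).plot.barh(figsize=(8, 6), title="Top features")
plt.gca().invert_yaxis()
plt.show()
```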

Analysis of the Results

The basic criterion of success is getting an error lower than the baseline; that would mean the final model has decent quality.

The first question is how to measure quality. We can use squared errors, and then estimate a confidence interval for them with the bootstrap method. The bootstrap sampling, statistic calculation, and interval estimation are implemented in the bs_interval function (see the sketch below).
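The actual bs_interval is in the repository; a minimal version of the idea (resample the squared errors with replacement and take percentile bounds) could look like this:

```python
def bs_interval(errors: np.ndarray, n_samples: int = 1000,
                alpha: float = 0.05, seed: int = 0) -> tuple:
    """Percentile bootstrap interval for the mean squared error."""
    rng = np.random.default_rng(seed)
    stats = [rng.choice(errors, size=len(errors), replace=True).mean()
             for _ in range(n_samples)]
    return (np.percentile(stats, 100 * alpha / 2),
            np.percentile(stats, 100 * (1 - alpha / 2)))

# model vs. zero-return baseline on the hold-out set
model_errors = ((y_test - final_model.predict(X_test)) ** 2).to_numpy()
baseline_errors = y_test.to_numpy() ** 2   # "next return = 0" baseline
print(bs_interval(model_errors), bs_interval(baseline_errors))
```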

Bootstrapping

The second question is what values should be used as the baseline forecast. A lot of research claims that markets are unpredictable, and the forecasted next price often ends up being just the last price plus some noise. It looks like this

Bad stock prediction result

It means that if we want to forecast the return instead, the prediction will be around 0 plus noise. You can find this result in this article by Rafael Schultze-Kraft.

Our baseline is similar. This approach is implemented in the do_experiment function. Let’s run the experiment as do_experiment(asset_name), where asset_name is taken from the list (AAPL, AMZN, GOOG, INTC, MSFT).

The important parameters and metrics are collected in this table

Final result table

AMZN and GOOG have the same optimal parameters. Often, level and depth take the maximum or a near-maximum value.

As we remember, in the exploratory step at the beginning, the first three assets (AAPL, AMZN, GOOG) had good distributions of ask-bid prices and volumes. The last two assets (INTC, MSFT) had strange distributions.

This table shows that we got a statistically significant difference in error for AAPL, AMZN, and GOOG, and the baseline has been beaten (green color): the upper bound of the model’s interval is lower than the lower bound of the baseline’s.

For INTC we don’t have a significant result: the intervals overlap (grey color). In the MSFT case, the result is worse than the baseline (red color). The cause is probably the pattern we detected in the distributions (maybe some market-maker activity or something similar).

Let’s look at the most important features of the models

Top features for AAPL
Top features for AMZN
Top features for GOOG
Top features for INTC
Top features for MSFT

As we can see, for the successful models the most important features are related to recent values of log_return_ask, log_return_bid, log_ask_div_bid, and so on.

Conclusions

  1. An approach to order book modeling via gradient boosting was suggested. You can find the code on GitHub.
  2. The feature engineering method was described and formalized, and the feature importances were shown.
  3. The quality estimation was demonstrated; for some assets a good result was obtained.

How to improve the result:

  1. Change the number of max_evals in the optimization.
  2. Change max_depth and n_estimators in the fitting.
  3. Add new features that work better than the current ones, or combinations of the given features.
  4. Carry out the experiments on more data to get a better model.
  5. Find history with a greater number of levels in the order book.
  6. Use a model that was specifically developed for time series (e.g. LSTM, GRU, and so on).

Best regards,

Sergey

