Research Digest

Using AI to Trade? Here is JP Morgan’s Approach

Using Reinforcement Learning to Get Optimal Execution Strategy

Alex Chen
Towards Data Science
8 min read · May 2, 2020


In this article, we focus on the paper "Risk-Sensitive Compact Decision Trees for Autonomous Execution in Presence of Simulated Market Response" to discuss how machine learning can produce better execution strategies. Along the way, we draw on other papers that provide the underlying theory and additional context for the problem.

Problem: How to execute orders?

Imagine a trader has a certain amount of money and wants to buy a block of stock. The trader can use two order types to buy it:

  • Market order: buy or sell at the current market price. The trader submits the number of shares to buy, and the exchange automatically sets the price based on the current market.
  • Limit order: buy or sell at a specific price that the trader sets in advance. When someone else wants to sell or buy at that price, the exchange matches the two orders and completes the trade.

As the image below shows, if the trader submits a limit order, the order joins the queue; if the trader submits a market order, it is executed immediately at the current market price.

Limit Book Order Example (Credit to Vyetrenko and Xu, 2019)
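To make the two order types concrete, here is a toy sketch (the ToyOrderBook class and the price levels are invented for illustration; real exchange matching is far more involved):

    # Toy illustration: limit orders rest in a queue at a price level,
    # while a market order executes immediately against the best ask.
    from collections import deque

    class ToyOrderBook:
        def __init__(self):
            # ask side: price -> queue of resting sell sizes (limit orders)
            self.asks = {}

        def add_limit_sell(self, price, size):
            self.asks.setdefault(price, deque()).append(size)

        def market_buy(self, size):
            """Walk the ask side from the best (lowest) price until filled."""
            filled, cost = 0, 0.0
            for price in sorted(self.asks):
                queue = self.asks[price]
                while queue and filled < size:
                    take = min(queue[0], size - filled)
                    queue[0] -= take
                    filled += take
                    cost += take * price
                    if queue[0] == 0:
                        queue.popleft()
            return filled, cost / filled if filled else 0.0

    book = ToyOrderBook()
    book.add_limit_sell(100.0, 300)   # a passive (limit) sell resting in the book
    book.add_limit_sell(100.5, 500)
    print(book.market_buy(400))       # an aggressive (market) buy: (400, 100.125)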

Limit Order vs. Market Order

When you execute a market order, you are prioritizing execution over price.

  • Advantage: you can buy or liquidate the asset immediately, because the exchange automatically matches the price and executes the order.
  • Disadvantage: the execution may be costly. You might buy a stock at $101 immediately when you could have bought it at $100 by waiting five minutes.

When you execute a limit order, you are prioritizing price over execution.

  • Advantage: the order executes at the price you set, whether you are buying or selling. If you want to buy 100 shares at $105, you set that price first; when the system finds a seller offering 100 shares at $105, it matches the two of you and executes the trade. So you get certainty about the cost of your purchase.
  • Disadvantage: the limit order may not be executed for a long time, if ever. For example, you can set a buy price of $50 when the current market price is $80; until the price drops to $50 (which is unlikely), you will not get an execution.

Market Impact

Another problem is market impact. Stock trading always has a supply side and a demand side that must stay in balance. The problem is that big institutions tend to buy many shares in a short period (for example, within one day). Suppose the market price is $100 and an institution wants to spend $1,000,000 to buy 10,000 shares. If it submits the order in one lump, everyone in the market will see that someone wants to buy a massive number of shares, and sellers will naturally raise the price; this effect is called market impact. To reduce this cost, the institution can split the large parent order into smaller child orders, as sketched below.
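As a rough illustration of the splitting idea (a naive even, TWAP-style split; real institutional schedules are more sophisticated):

    # Split a 10,000-share parent order into equal child orders over the day.
    def split_parent_order(total_shares, n_slices):
        base = total_shares // n_slices
        child_orders = [base] * n_slices
        child_orders[-1] += total_shares - base * n_slices  # remainder goes to the last slice
        return child_orders

    print(split_parent_order(10_000, 13))  # 13 child orders summing to 10,000 shares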

Execution Strategy: How to Achieve the Optimum?

The goal is to acquire a certain number of shares by the end of the day, and traders want an optimal strategy for doing so. Three questions remain:

1. What proportion of the parent order should go into each child order?

2. When should each child order be executed?

3. For each child order, should they place a market order or a limit order?

For each child order, we call a limit order a passive order, because the buyer sets the price and waits passively for it to be matched. We call a market order an aggressive order, because the buyer wants the fill aggressively, regardless of price.

Optimal execution problem vs. optimal placement problem

Everyone asks the same question: how can I reduce my trading costs? Following the previous section, suppose a trader wants to buy S shares of a stock within a time window [0, T]. How should the trader minimize the cost of acquisition? We call this the optimal execution problem.

Bertsimas and Lo (1998) derived an optimal execution strategy based on the Dynamic Programming Principle (DPP), assuming a linear price impact:

Linear Market Impact
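In rough form (my reconstruction; Bertsimas and Lo's full model also includes an information term), the linear impact assumption reads:

    P_t = P_{t-1} + θ·S_t + ε_t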

It means that the current price (P_t) is a function of the previous price (P_{t-1}), the shares bought in that period (S_t), and random noise. They concluded that the optimal execution strategy is simply to divide the total shares evenly across equal time intervals.

Optimal Execution Strategy
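In symbols (again my paraphrase of the result rather than the paper's exact statement), the optimal schedule is the even split

    S_1 = S_2 = … = S_T = S / T

where S is the total number of shares to buy and T is the number of equal time intervals.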

Guo et al. (2013) developed a theory for the optimal placement problem. Compared to optimal execution, the optimal placement problem focuses on trading that happens within a much smaller time window (10–100 seconds), especially in the HFT (high-frequency trading) field.

Reinforcement Learning Approach

There are many ways to approach this problem. This paper uses Markov decision processes (MDPs) and reinforcement learning (RL) to address it. The authors considered the optimal execution strategy, price impact, and multi-agent interaction (a simulation of the market). In this paper, three components work together:

  1. Limit order book data simulator
  2. Reinforcement Learning Algorithm: Risk-Sensitive Q-Learning
  3. Feature selection

Market Simulation

Simulator interaction

Two components work together to run the strategy:

  1. Market simulator: a playground that gives the RL algorithm enough data and interaction to train itself. The simulator models the market impact of each order we execute.
  2. Reinforcement learning agent: the agent interacts with the market simulator. It places orders, which create market impact, and then receives the simulator's response as a reward.

Training Process (Credit to Vyetrenko and Xu, 2019)

As you can see in this image, they trained the reinforcement learning agent in a simulated market environment, which follows a scheme similar to that of the real-time execution agent.
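A bare-bones sketch of that interaction loop, with agent and simulator standing in as placeholders for the paper's components (the actual simulator and learner are far more detailed), might look like this:

    # Minimal sketch of the simulator / agent training loop described above.
    # `agent` and `simulator` are placeholders for the paper's components.
    def train(agent, simulator, n_episodes=1000):
        for _ in range(n_episodes):
            state = simulator.reset()            # start a fresh simulated trading day
            done = False
            while not done:
                action = agent.act(state)        # wait, passive limit order, or aggressive market order
                next_state, reward, done = simulator.step(action)  # simulator applies market impact
                agent.update(state, action, reward, next_state)    # e.g. a Q-learning update
                state = next_state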

How to define the market impact?

Recall that earlier work (Bertsimas and Lo, 1998) used a linear function to model market impact.

Linear Market Impact

But in this paper, the market response is modeled as a function of the LOB (limit order book) data and the aggressive order size. This means they can learn the response directly from LOB data instead of relying on an assumed formula.
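One simple way to picture "market response as a function of LOB data and aggressive order size" is to walk the visible ask levels of a book snapshot (purely illustrative; the paper learns the response from historical LOB data rather than assuming this mechanical form):

    # Estimate the average fill price of an aggressive buy of a given size
    # by walking the visible ask levels of a limit order book snapshot.
    def estimated_fill_price(ask_levels, order_size):
        """ask_levels: list of (price, available_size), best price first."""
        remaining, cost = order_size, 0.0
        for price, available in ask_levels:
            take = min(available, remaining)
            cost += take * price
            remaining -= take
            if remaining == 0:
                break
        filled = order_size - remaining
        return cost / filled if filled else float("nan")

    snapshot = [(100.0, 200), (100.1, 300), (100.2, 1000)]
    print(estimated_fill_price(snapshot, 600))  # larger orders walk deeper into the book -> worse price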

Reinforcement Learning

A reinforcement learning formulation has three components:

1. States (s): There are two kinds of state. For the agent, the state describes its position: how much money is left and how many shares the trader still wants to buy. For the market, the state includes the current market price and the limit order book sizes at that moment. They then discretize the state variables into bins: if there are 5 variables with 2, 3, 4, 5, and 6 bins respectively, there are 2 × 3 × 4 × 5 × 6 = 720 possible states.

2. Actions (a): At each moment, the agent chooses an order size o to place either in the limit order book (passively) or as a market order (aggressively). If o = 0, no order is executed at that moment.

3. Rewards (R): The reward is the cost saved at each step, defined as:

Rewards for each execution

In this formula, the reward is a function of the execution price (P_t) at time t, the price the trader would have paid by executing the whole parent order at once (P_σ), and the order size (f_t).
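Based on that description, a plausible form of the reward (my reconstruction, not the paper's exact expression) is something like

    R_t = f_t · (P_σ − P_t)

i.e., for a buy, the cost saved by filling f_t shares at P_t instead of at the reference price P_σ.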

In Lo's paper, an MDP was also used to model the optimal execution strategy, but the model relied on fixed assumptions rather than dynamic data.

Therefore, in this paper, they used Q-learning with an MDP to learn from a realistic, dynamic simulated environment. The Q-value in Q-learning is defined in the usual way.
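In standard notation (the textbook definition; the paper's risk-sensitive variant modifies it with a risk term controlled by the parameter β discussed below), the Q-value of taking action a in state s is the expected discounted sum of future rewards:

    Q(s, a) = E[ Σ_{t≥0} γ^t · R_t | s_0 = s, a_0 = a ]

with discount factor γ ∈ (0, 1].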

In this formula, the Q-value is a function of the sum of rewards (R), the state (s), and the action (a). After defining the Q-values, the goal is to find the optimal policy that maximizes the expected total discounted reward.
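In the usual notation (again the standard form rather than the paper's risk-sensitive objective):

    π* = argmax_π E[ Σ_{t≥0} γ^t · R_t ]

and the learned policy acts greedily with respect to the Q-values, choosing a(s) = argmax_a Q(s, a) in each state.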

Since there are many state variables available as features, they also use a least-squares policy iteration method to select a subset of features from the full feature pool.

Result represented by Decision Tree

In the reinforcement learning algorithm, they expose a tunable parameter β to express different risk appetites.

After the Q-learning training process, they obtain a learned tabular execution policy, which they then represent as a decision tree. The advantage of the decision tree is that it makes the role of each feature explicit. Based on it, traders can trade aggressively (market order) or passively (limit order) depending on the values of the state variables.

Decision tree based policy (Credit to Vyetrenko and Xu, 2019)
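As a sketch of how a learned tabular policy can be distilled into an interpretable tree (using scikit-learn purely for illustration; the paper does not specify this tooling, and the tiny dataset below is invented):

    # Fit a small decision tree that maps discretized state features to the
    # action chosen by the learned tabular policy, so the policy can be read
    # off as explicit if/else rules on the state variables.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # states: one row per visited state (e.g. inventory bin, time bin, spread bin)
    # actions: the action the learned Q-table picks in that state
    #          (0 = wait, 1 = passive limit order, 2 = aggressive market order)
    states  = [[0, 0, 1], [0, 1, 1], [1, 0, 0], [1, 1, 0], [2, 1, 0]]
    actions = [1, 1, 2, 2, 2]

    tree = DecisionTreeClassifier(max_depth=3).fit(states, actions)
    print(export_text(tree, feature_names=["inventory_bin", "time_bin", "spread_bin"]))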

The results beat the benchmark: the reinforcement learning tree-based agent saves 32% of the total execution cost. However, the strategy increases the standard deviation of the cost by 27%, reflecting the trade-off between cost saving and risk that is inherent to limit order book trading.

Result for Reinforcement Learning Model (Credit to Vyetrenko and Xu, 2019)

Conclusion

In this paper, the authors showed that a reinforcement learning algorithm can be applied to the optimal execution problem. It considers a dynamic environment instead of relying on fixed assumptions, and it performed well on limit order book data. The authors also showed that the learned policy can be represented as a decision tree that guides traders' decisions.


References

[1] Bertsimas, Dimitris, and Andrew W. Lo. “Optimal Control of Execution Costs.” Journal of Financial Markets 1, no. 1 (1998): 1–50. https://doi.org/10.1016/s1386-4181(97)00012-8.

[2] Gatheral, Jim, and Alexander Schied. “Dynamical Models of Market Impact and Algorithms for Order Execution.” SSRN Electronic Journal, 2012. https://doi.org/10.2139/ssrn.2034178.

[3] Guo, Xin, Adrien De Larrard, and Zhao Ruan. “Optimal Placement in a Limit Order Book.” SSRN Electronic Journal, 2013. https://doi.org/10.2139/ssrn.2318220.

[4] Guo, Xin, Adrien De Larrard, and Zhao Ruan. “Optimal Placement in a Limit Order Book: an Analytical Approach.” Mathematics and Financial Economics 11, no. 2 (2016): 189–213. https://doi.org/10.1007/s11579-016-0177-5.

[5] Vyetrenko, Svitlana, and Shaojie Xu. “Risk-Sensitive Compact Decision Trees for Autonomous Execution in Presence of Simulated Market Response.” arXiv preprint arXiv:1906.02312 (2019).
