A review of recent reinforcement learning applications to healthcare

Taking machine learning beyond diagnosis to find optimal treatments

Published in

Towards Data Science

10 min readDec 10, 2018

The application of machine learning to healthcare has yielded many great results. However, the vast margin of these focus on diagnosing conditions or forecasting outcomes, and not explicitly on treatment. Although these can indirectly help at treating people (for example diagnosis is the first step to finding treatment), in many cases, particularly where there are many available treatment options, figuring out the best treatment policy to use for a particular patient is challenging for human decision makers. While reinforcement learning has grown quite popular, the majority of papers focus on applying it to board or video games. RL has performed well at learning the optimal policies in these (video/board games) contexts, but has been relatively untested in real world environments like healthcare. RL is a good candidate for this purpose, however there are many barriers for it to work in practice.

In this article, I will outline some more recent approaches as well as the largest barriers that exist with the application of RL to healthcare. If this topic interests you, I will detail a few models at the PyData Orono Meetup on Reinforcement Learning in the Real World which will be broadcast on Zoom this Wednesday 7–9:30 EST. This article assumes that you have a basic knowledge of reinforcement learning. If you don’t, I suggest reading one of the many articles already on Towards Data Science on the subject.

Fundamental challenges:

Learning and evaluating on purely (or primarily) observational data

Unlike with AlphaGo, Starcraft, or other board/video games, we cannot play out a large number of scenarios where the agent makes interventions to learn the optimal policy. Most importantly, it would be unethical to utilize patients for the purposes of training the RL algorithm. In addition, it would be costly and would likely take years to complete. Therefore, it is necessary to learn from observational historical data. In RL literature this is referred to as “Off-Policy Evaluation”. Many RL algorithms such as Q-learning can, “in theory,” learn the optimal policy effectively in the off-policy context. However, as Gottesman et al. point out in their recent article, “Evaluating Reinforcement Learning Algorithms in Observational Health Settings”, accurately evaluating these learned policies is tricky.

In a normal RL context to evaluate a policy we would simply have the agent make decisions then compute the average reward based on the outcome. However, as mentioned above, this is not possible due to ethical and logistical reasons. How do we evaluate these algorithms then? As of now there is no one satisfactory answer. Gottesman et al. describe this in great detail in their report and state possible scenarios, but they don’t come to a concrete conclusion about which metric to use. Here is a brief breakdown of commonly used metrics. I will go into more detail when discussing individual papers as well.

Importance sampling

Gottesman et al. *Note don’t confuse “weight” here with the weight of neural network. Here you can think of weight almost as a measure of similarity between the histories of the patient under the clinical policy versus what it would be under the learned policy.

To simplify the math jargon this approach essentially involves finding treatment scenarios made by the physician that match or nearly match the learned policy. We then calculate the reward based on the results of these actual treatments. A problem with this approach is that, in many cases, the actual number of “non-zero importance weights is very small.” What this means is essentially, if the learned policy suggests treatments (or lack thereof) that physicians would never do then you will have problems evaluating the policy because there will be no similiar histories where we actually have the outcomes with which to compare.

Gottesman et. al. don’t offer a solution to this problem. Instead they state that one should “always examine the distribution of the importance weights” because the majority will often sit around zero. For this reason Gottesman notes “we should therefore make sure the effective sample sizes used to evaluate policies is large enough for our evaluations to be statistically significant. Limiting ourselves to policies that are similar to physicians’, as discussed above, will also be beneficial for increasing the effective sample size.”

U-Curve

Examples of the U-Curve from Gottesman et al.

This method focuses on comparing the difference between the learned policy and the physician policy with respect to an outcome. This evaluation method contains problems of its own because it can easily cause bad policies to look as if they outperform clinician policies. At the core of the method is the underlying assumption that if when the policy recommended dosage and the physician dosage match the mortality is low then the policy must be good. However, Gottesman et al. found that a random or a no treatment policy would look like it would outperform a physician policy due to the variances found in the data.

Selecting good evaluation metrics is important because, in certain scenarios, an agent may learn to associate treatments with negative consequences due to the majority of treated patients having adverse outcomes. This is due to the lack of data on non-treated patients. As Gottesman et al. state with respect to Sepsis:

“We observed a tendency of learned policies to recommend minimal treatment for patients with very high acuity (SOFA score). This recommendation makes little sense from a clinical perspective, but can be understood from the perspective of the RL algorithm. Most patients who have high SOFA score receive aggressive treatment, and because of the severity of their condition, the mortality rate for this subpopulation is also high. In the absence of data on patients with high acuity that received no treatment, the RL algorithm concludes that trying something rarely or never performed may be better than treatments known to have poor outcomes.” page 4

This goes back to the classic problem of correlation not being equal to causation. While, here it is rather obvious that the model has problems in other settings it may be much more subtle and not easily detectable without proper evaluation.

2. Partial observability

Unlike in many games in medicine we are almost never able to observe everything going on in the body. We can take blood pressure, temperature, SO2, and simple measures at almost every interval but these are all signals and not the ground truth about the patient. Additionally, there might be times when we have data at certain time steps but not others. For example, Chest X-Rays to treat a pneumonia patient a doctor might only give before and after. The model, therefore, has to estimate the state of the condition without all the data present. This is a difficult problem in healthcare where there is a lot unknown about the patient at every time step.

Reward function

Finding a good reward function is challenging in many real world problems. Healthcare is no exception to this as it is often hard to find a reward function that balances short-term improvement with overall long-term success. For instance, periodic improvements in blood pressure may not cause improvements in outcome in the case of sepsis. In contrast, having just one reward given at the end (survival or death) means a very long sequence without any intermediate feedback for the agent. It is often hard to determine which actions (or lack thereof) resulted in the reward or penalty.

RL is data hungry

Almost all the major breakthroughs in deep RL have been trained on years worth of simulated data. Obviously this is less of a problem when you can generate data easily through simulators. But as I have described in many of my previous articles data for specific treatments is often scarce to begin with, the data that is there takes tremendous effort to annotate, and due to HIPPA compliance and PHI hospitals/clinics are very wary of sharing their data at all. This creates problems for applying deep RL to healthcare.

Non-stationary data

Healthcare data by nature is non-stationary and dynamic. For instance, patients will likely have symptoms recorded at inconsistent intervals and some patients will have their vitals recorded more than others. Treatment objectives may change overtime as well. While, most papers focus on lowering overall mortality for instance, if a patient’s condition improved the focus could shift to reducing length of stay or another objective. Additionally, viruses and infections themselves could likely rapidly change and evolve in a dynamic way not observed in the training data.

Interesting recent studies

Now that we have addressed a few of the biggest challenges regarding reinforcement learning in healthcare lets look at some exciting papers and how they (attempt) to overcome these challenges.

Deep reinforcement for Sepsis Treatment

This article was one of the first ones to directly discuss the application of deep reinforcement learning to healthcare problems. In the article the authors use the Sepsis subset of the MIMIC-III dataset. They choose to define the action space as consisting of Vasopressors and IV fluid. They group drugs doses into four bins consisting of varying amounts of each drug. The core network is a Double-Deep Q Network which has separate value and advantage streams. The reward function is clinically motivated based on the SOFA score which measures organ failure. For evaluation they use what Gottesman termed as the U-Curve. Specifically, they look at the mortality rate as a function of the difference in dosage of the prescribed policy versus the actual policy.

Reinforcement Learning with Action-Derived Rewards for Chemotherapy and Clinical Trial Dosing Regimen Selection

This paper describes a method to find the optimal policy for treating patients with chemo via reinforcement learning. This paper also uses Q-Learning as the underlying model. For an action space they formulate a quantity of doses for a given duration that an agent is able to choose from. Dose cycles are only initiated with a frequency determined by experts. Transitions states are computed at the end of each cycle. The reward function is defined as the mean reduction in tumor diameter. The evaluation is done using simulated clinical trials. It is unclear exactly how these simulations were setup but appearantly simulations of this type generally incorporate both pathological and statistical data.

Supervised Reinforcement Learning with Recurrent Neural Network for Dynamic Treatment Recommendation

This paper uses supervised RL (via the actor critic methods) with RNNs in connection with learning an overall treatment plan. This paper is notable in its attempt to utilize the full MIMIC-III dataset and provide treatments to all patients rather than just a subset. This setup essentially contains three main components: the actor which recommends medications based on patient state, the critic network which estimates the value of these medications in order to encourage or discourage them, and the LSTM which is used to help overcome the problem of partial observability by summarizing the historical observations. The action space is 1,00 exact medications or 180 drug categories. The authors evaluate their method based on estimated in-hospital mortality.

Deep Reinforcement Learning for Dynamic Treatment Regimes on Medical Registry Data

Image from article detailing using RL to prevent GVHD (Graft Versus Host Disease).

This is an interesting paper that aims to provide a framework for a variety of dynamic treatment regimes without being tied to a specific individual type like the previous papers. The authors state “The proposed deep reinforcement learning framework contains a supervised learning step to predict the most possible expert actions; and a deep reinforcement learning step to estimate the long-term value function of Dynamic Treatment Regimes.” The first step is supervised learning to predict a set of possible expert treatments for the given patient to prevent Graft Versus Host Disease (GVHD), which is a common complication after Bone Marrow Donation where donor’s immune cells attack the host’s cells. The second step is the RL step where the agent seeks to minimize the possibility of complications.

There are a few other papers that for reasons of length I will not go into long descriptions of but are still interesting.

A Reinforcement Learning System to Encourage Physical Activity in Diabetes Patients

This is a sighly different paper that covers using RL to encourage healthy habits instead of direct treatment.

Representation and Reinforcement Learning for Personalized Glycemic Control in Septic Patients

This is another paper focused on Sepsis and RL. However, it takes a slightly different approach and looks at only glycemic control.

Learning Optimal Policies from Observational Data

This actually isn’t a reinforcement learning paper, however good nonetheless. It focuses on counterfactual inference and using domain adversarial NN.

Conclusion

There remain many challenges in the application of reinforcement to healthcare. The hardest and most prominent is the problem of evaluating RL algorithms effectively in healthcare scenarios. Other challenges relate to the amount of data needed for training, the non-stationary nature of the data, and the fact that it is only partially observable. That said, there is a lot of recent emerging literature that could help solve some of the issues. As stated above I intend to dive deep into a couple of these approaches at the upcoming meetup. Also, I’m hoping to revive the Slack channel on machine learning for healthcare so please join if you are interested.

Additional Annotated References

Digital Doctor Symposium Reinforcement Learning in Healthcare

This is a very good talk by one of the of the organizers of MLHC on applying RL to HIV treatment. Here she discusses many of the issues in a very clear and non-confusing manner.

Continuous Adaptation Via Meta-Learning In Non-Stationary And Competitive Environments

This paper is not related to healthcare but I think that it provides a good framework with dealing with non-stationary issues (that could come up) with MAML. Moreover, (just hunch) I think that it could be useful for viruses that evolve very quickly and other cases where a RL will have to adapt fast and based on small amounts of data.

Evaluating Reinforcement Learning Algorithms in Observational Health Settings

I already sampled this heavily for this article but I think if you are serious about understanding evaluation you should read through this article fully.