
Postmarket Responsibilities for Medical AI

Part II — What Industry Should Do in Commercializing Medical AI

Bradley Merrill Thompson
Towards Data Science
20 min read · Jun 25, 2020


In Part I of this series of articles, I described the tension that exists between FDA and industry with regard to the appropriate level of FDA oversight of medical AI. I explained that it presently seems likely that FDA will ask Congress for more postmarket authority so that it can oversee the commercialization of medical AI in a much deeper way, despite the fact that FDA already has substantial authority.

In this, Part II, I propose an alternative. I propose that industry adopt best practices to assure that patients are getting the best possible care, self-regulating as it were. In Part III, I will make the case for limiting the FDA’s oversight to its existing statutory authority.

There are four areas where industry should actively take measures during the commercialization phase to assure the safety and effectiveness of its AI-based medical devices.

A. Pursue Enhanced Postmarket Vigilance

Deploying adaptive and especially autonomous AI algorithms into the marketplace requires a substantially heightened postmarket vigilance by the companies that deploy them. Adaptive algorithms, by definition, adapt while in use. As a matter of quality assurance and risk mitigation, but also as a matter of research on ways to continuously improve products, companies should closely monitor the performance of their algorithms in the marketplace.

How should it be done?

Before answering that question, it’s important to understand why monitoring is important. Identifying things that might go wrong tells us what kind of monitoring is necessary.

Here I'm borrowing heavily from an excellent article by Om Deshmukh, "Deployed your Machine Learning Model? Here's What you Need to Know About Post-Production Monitoring," published October 7, 2019.

The following are a few different scenarios that can significantly alter the performance of a trained algorithm.

· In-Domain But Unseen Data. We are beginning to appreciate just how different every human being is. Personalized medicine is based on our deeper understanding of the human genome, and on how factors like race, ethnicity, sex and so forth have a big impact on both diagnostic and therapeutic decision-making. This makes it especially important that training sets include data from all of the populations on which the algorithm will ultimately be used. It's quite possible, for example, that the training set for an otherwise well-trained algorithm might not include Native American populations, and that when the algorithm is used in those populations its performance may decline. In a sense, the training in such a situation would be biased against Native Americans.

· The Changing Nature of Input Data. An adaptive algorithm necessarily needs new data, and it is fairly easy to predict that new data may not always be created in the same way as the data on which the algorithm was originally trained. If an adaptive AI analyzes radiological images and the technology that produces those images changes, those changes may result in data that the adaptive algorithm interprets differently. It's easy to imagine both technology-induced changes and social changes in data production: the way medical professionals, or even patients themselves or their caregivers, collect data may change in ways that affect how the algorithm interprets the data. Anyone who has spent a lot of time cleaning data understands how little things like a field changing from strings to integers can have a profound impact on how well the algorithm works.

· Changes in data interpretation. The meaning of existing data changes over time. Even though the data points themselves don’t change, their meaning changes due to social or environmental factors. In the medical context, one can imagine that the meaning of words in electronic health records may gradually change over time with changing practices by physicians and other health caregivers. Over time physicians might use the same words but with a new intended meaning. This is different from the issue above where the data change.

· Machine learning systems deployed in unexpected contexts. This is very common in the medical technology field, as doctors and other healthcare professionals experiment with different uses for existing technology. You can imagine that software originally intended for tracking flu symptoms might, in the hands of certain physicians, be used in an attempt to track COVID-19 symptoms. As health professionals explore utility in other fields, the data on which the algorithm operates may change, affecting the overall performance of the product.

· Missing data. For algorithms that use a wide variety of data, it’s possible that certain data streams would be cut off or interrupted.

· Model drift. Without focusing on the cause, it’s simply possible that a model doesn’t perform as well over time for any number of reasons.

· Malicious Behavior. Once an AI device is 'in the wild,' a malicious actor may deliberately introduce data sets designed to skew the device's behavior, an attack potentially less detectable than traditional 'hacking' (e.g., breaking into a system and changing parameters).

The point is that when an algorithm is used in the real world, things can happen that will affect its performance. It's not that hard to imagine.

So what exactly should companies do in the way of postmarket monitoring? Probably the one thing we can all agree on is that one size does not fit all. However, there are some general principles that these software developers can follow to make sure that they are adequately monitoring their products.

Broadly speaking, we can divide monitoring into two different types: active and passive. Let’s talk about active first.

1. Proactive Model Monitoring

Mr. Deshmukh suggests that we continuously track the performance of the machine-learning model against a key set of indicators and generate specific event-based alerts. While Mr. Deshmukh addresses that topic in general, I want to focus on the medical context.

The goal of this technique is “to identify which input samples deviate significantly from the patterns seen in the training data and then have those samples closely examined by a human expert.” The problem is that patterns of interest largely depend on the domain of the data, the nature of the problem, and the machine-learning model used.

Let’s say we have an algorithm that is watching 10 different vital signs for a patient in a hospital, with an eye to identifying when those vital signs reach a point that suggests the patient is becoming septic. One way to accomplish this monitoring is to look statistically at how those vital signs normally flow and set statistical distribution or other parameters that, if exceeded, will generate an alert to the algorithm developer. Note these are different from alerts set for purposes of determining when a particular patient might need medical care. These are statistically based alerts that indicate that there has been some shift in the type of data the algorithm is analyzing. Such an alert would then give a human working for the developer the opportunity to investigate to see if there is something that needs to be updated or corrected.
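To make this concrete, here is a minimal sketch of that kind of statistically based alert, assuming tabular vital-sign data: it compares a recent window of live data against a reference sample drawn from the training data and flags features whose distributions have drifted, using a two-sample Kolmogorov–Smirnov test. The feature names, thresholds, and alert format are illustrative assumptions, not any particular vendor's implementation.

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative threshold -- in practice this would be tuned and documented
# as part of the developer's postmarket surveillance plan.
P_VALUE_THRESHOLD = 0.01

def check_input_drift(reference: dict, live_window: dict) -> list:
    """Compare live vital-sign data against a training-time reference sample.

    reference, live_window: dicts mapping a vital-sign name (e.g. "heart_rate")
    to a 1-D numpy array of observed values.
    Returns a list of (vital_sign, p_value) pairs that warrant human review.
    """
    alerts = []
    for vital, ref_values in reference.items():
        live_values = live_window.get(vital)
        if live_values is None or len(live_values) == 0:
            # A missing data stream is itself an alert (see "Missing data" above).
            alerts.append((vital, None))
            continue
        stat, p_value = ks_2samp(ref_values, live_values)
        if p_value < P_VALUE_THRESHOLD:
            alerts.append((vital, p_value))
    return alerts

# Hypothetical usage: reference drawn from the training set, live_window from
# the most recent 24 hours of deployed use (simulated here with a shift).
rng = np.random.default_rng(0)
reference = {"heart_rate": rng.normal(80, 12, 5000)}
live_window = {"heart_rate": rng.normal(95, 12, 500)}
for vital, p in check_input_drift(reference, live_window):
    print(f"ALERT: distribution shift in {vital} (p={p})")
```

An alert like this doesn't mean anything is wrong with patient care; it simply tells a human at the developer that the data flowing into the algorithm no longer look like the data it was trained on.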

Beyond checking for statistical anomalies, we might do other forms of live data checks, for example, to ensure that the model is still receiving all the different data it requires. This is true for both categorical variables as well as numerical ones. The company can examine the data to determine among other things if any feature dependencies exist in the live model.
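A sketch of what such live data checks might look like appears below, assuming a simple tabular input; the expected schema, allowed category values, and missing-data threshold are hypothetical.

```python
import pandas as pd

# Hypothetical schema the model was trained on.
EXPECTED_COLUMNS = {"age": "numeric", "sex": "categorical", "lactate": "numeric"}
ALLOWED_CATEGORIES = {"sex": {"F", "M"}}

def check_live_batch(batch: pd.DataFrame) -> list:
    """Return a list of human-readable findings about a batch of live inputs."""
    findings = []
    for col, kind in EXPECTED_COLUMNS.items():
        if col not in batch.columns:
            findings.append(f"missing expected feature: {col}")
            continue
        missing_rate = batch[col].isna().mean()
        if missing_rate > 0.10:
            findings.append(f"{col}: {missing_rate:.0%} missing values")
        if kind == "categorical":
            unseen = set(batch[col].dropna().unique()) - ALLOWED_CATEGORIES[col]
            if unseen:
                findings.append(f"{col}: unseen categories {unseen}")
        elif kind == "numeric" and not pd.api.types.is_numeric_dtype(batch[col]):
            # e.g. strings arriving where numbers are expected
            findings.append(f"{col}: non-numeric dtype {batch[col].dtype}")
    return findings

# Hypothetical usage on a small batch of live inputs.
live = pd.DataFrame({"age": [54, None], "sex": ["F", "unknown"], "lactate": ["2.1", "3.4"]})
print(check_live_batch(live))
```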

The company could also perform model prediction and performance testing with the cooperation of its customers. In this instance, the company needs to negotiate access from the end user to data on whether the predicted events actually come to pass. If the software developer can obtain such access, it can continuously evaluate the performance of its model. This is often difficult in medicine, where the predicted outcome may not occur for months if not years. In some cases, however, the expected event is likely to occur (or not) within a short period, so performance can be continuously evaluated. If the company cannot obtain access, it becomes important to simply track the overall distribution of the predictions to see if that distribution changes over time.
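When outcome data cannot be obtained, one commonly used way to track the overall distribution of predictions is the population stability index (PSI) computed over the model's predicted probabilities. The sketch below, and the rough 0.2 alert threshold, are illustrative conventions, not a regulatory standard.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between baseline predicted probabilities (e.g., at launch) and recent ones.

    Both inputs are arrays of scores in [0, 1].
    """
    edges = np.linspace(0.0, 1.0, bins + 1)
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)   # avoid log(0) and division by zero
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# A PSI above roughly 0.2 is often treated as a shift worth investigating.
```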

2. Reactive Model Monitoring

Reactive model monitoring is very much akin to what medical device manufacturers already do. As required by law, these manufacturers have systems in place to collect and then analyze problems that users report to the company. However, in the AI context, reactive model monitoring can be conducted on an even more systematic basis.

There are likely to be reports from customers of inaccurate results. A certain number of these have to be expected. After training and validating the model, the company likely learned that the model performs at a certain, less-than-perfect level of accuracy. Consequently, the company knows there will be times when the model produces what turns out to be an incorrect prediction.

While these customer complaints are to be expected, they also create an opportunity for continuous improvement. By collecting and analyzing these reports, the company can look for systematic root causes and ways to improve the accuracy in the future.
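One simple, hedged way to decide whether the volume of confirmed inaccurate-result complaints is consistent with the validated performance, or is a signal worth a root-cause investigation, is a one-sided binomial test against the error rate established during validation. All the numbers below are hypothetical, and real complaint counts will understate true errors, which makes any statistically significant excess all the more worth investigating.

```python
from scipy.stats import binomtest

validated_error_rate = 0.08   # error rate established during premarket validation (hypothetical)
confirmed_errors = 19         # complaints confirmed as incorrect predictions this quarter
predictions_made = 150        # total predictions delivered in the same period

result = binomtest(confirmed_errors, predictions_made,
                   validated_error_rate, alternative="greater")
if result.pvalue < 0.05:
    print("Observed error rate exceeds validated performance -- open an investigation.")
else:
    print("Complaint volume is consistent with the validated error rate.")
```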

Having said that, companies need to be mindful not to chase performance through what amounts to overfitting. It's quite easy to imagine a company, through well-intentioned actions, believing it is making improvements based on feedback about a certain set of inaccurate predictions while inadvertently compromising the overall future performance of the algorithm, reducing its accuracy in areas where it previously performed well.

Companies can unquestionably learn much from failures, and an ongoing analysis of failures for trends as well as the overall level of failures is a productive exercise. It is simply important, and more easily said than done, to dig for the root cause.

If substantial changes to the algorithm are required, the company would need to bring FDA back into the discussion because changes to products that pass a certain threshold are not permitted without FDA prior review and blessing. This may include, for example, significant retraining of the algorithm.

B. Share information with users

Many terms are thrown around that involve the general topic of AI understandability. People talk about “transparency,” a “black box,” “explainability,” “interpretability” and so forth and my own perception is that those words are not used consistently. Indeed, over the last several years of my career, I have not been consistent in how I use these words. So I now want to be clearer in my own language.

Of relevance to FDA law, I think there are two closely related ideas that are also clearly distinguishable. The words I'm choosing below to describe those two different ideas are a little bit arbitrary, because frankly there are no precise English-language labels available.[1] I'm going to base those distinctions on principles that are important in FDA law.

1. Framework

a. Transparency

There is in FDA law a focus on information provided to users at the point when they are deciding whether a product is suitable for a given use. FDA places high value on this decision because using FDA-regulated products appropriately is a key determinant of their safety and effectiveness. Indeed, one of the most frequently brought FDA enforcement actions is a claim of off-label promotion, a legal charge that a seller is providing information that encourages a broader use of a product than FDA has authorized.

To be clear, healthcare professionals can use products however they wish. FDA’s primary focus is on ensuring that healthcare professionals receive from manufacturers information that is truthful and not misleading. In FDA’s view of the world, however, this also means limiting the information provided by the manufacturer to that information that FDA has legally authorized.

I’m going to call the obligation of an AI developer to share information the company has relevant to this use decision: “transparency”. I use the word “transparency” here because transparency in this context means sharing what you know. As I’ll explain in a little bit, this contrasts with an obligation to develop new information to address a knowledge gap. Transparency is largely passive, not requiring information that is difficult to gather. It literally means not keeping information to yourself.

For AI-based medical products, as with any medical device, information on the safety and effectiveness of a product is important for a potential user to know when deciding whether the product is suitable for a given, planned use. But unlike other medical devices, this information will constantly change as the algorithm adapts based on new data and its performance evolves. So in contrast to traditional medical devices, AI developers need to keep this public information up to date.

b. Explainability

Apart from the pre-use suitability decision, there is in FDA law a relatively new emphasis on whether a healthcare professional can understand the basis for software generated advice with regard to a particular patient post-use. This new focus comes courtesy of a 2016 law, the 21st Century Cures Act. Section 3060 of that act specifically calls out as unregulated any software that empowers a health professional to independently review the basis for diagnostic or treatment recommendations, so long as the professional does not have to rely primarily on the software recommendation to make a clinical diagnosis or treatment decision regarding an individual patient. In a world where FDA regulation may mean years of delay and perhaps hundreds of thousands of dollars in added cost, this provision creates quite an incentive to ensure that the user can review the basis for a recommendation instead of relying upon it.

I’m going to call the opportunity to avoid FDA regulation by ensuring that the user can adequately understand the basis for a particular recommendation: “explainability.” As I’m using the term, explainability is different from transparency because where transparency is passive and focused on the pre-use suitability decision (indeed sharing information that already exists), explainability in the case of AI typically calls for the active creation of new information to be used after the algorithm produces a result. I worded that carefully to focus on results, recognizing that the groundwork for explainability has to be created obviously well before the software is used.

The statute is not limited to software that makes use of AI, and explainability is easier to achieve for expert systems that, for example, merely apply standard knowledge in the form of clinical guidelines to make recommendations. Explainability in non-AI software typically means simply structuring the software so that the user can easily work back to the original source of the knowledge to understand the basis for the recommendation. Unfortunately, in the world of AI, explainability can be extremely difficult.

Put another way, explainability is the information, after the fact, that I should consider to assess whether I believe the algorithm got the answer right. If the algorithm says that I, Brad, likely have COVID-19, should I believe it?

The specific questions that a patient and clinician might have concerning the basis for an AI-generated recommendation include such things as:

  • How did the algorithm arrive at that conclusion?
  • What information did the algorithm consider in my particular case?
  • What symptoms and physiological measurements did the algorithm rely on most as a basis for its conclusions?
  • Why does the algorithm place such weight on those symptoms in arriving at its conclusion with regard to me?
  • Are there symptoms that I have that do not fit the diagnosis that were for some reason discounted?
  • What possibilities in the form of an ultimate recommendation did the algorithm consider and reject?

It’s important to understand that transparency and explainability are cousins. Good transparency lays the foundation for more effective explainability. But while they are related, they are clearly different. They serve different purposes, have different regulatory consequences and they come about in distinctly different ways.

Certainly, over the last five years in particular, there has been a tremendous amount of research into explainability in medical AI. The core challenge in that research is that an AI algorithm doesn't think the way a human does. A human looks for causation, where an algorithm looks for association. In this sense, it is not adequate for explainability purposes to merely recite that the algorithm found an association between X and Y. Humans want a deeper explanation, one that ideally addresses causation or at least the questions recited above.

Importantly, the statute does not require an algorithm to be explainable. I've noticed in the literature some confusion on that point. Some commentators complain that FDA, as part of its regulatory process, is insisting on explainability. It is not. Rather, as explained above, one incentive to achieve explainability is the desire of AI developers to avoid FDA regulation and all the attendant costs and delays. If a software developer wants to make a black box that is not explainable, it may. The company just has to go through the FDA process with sufficient evidence to prove the safety and effectiveness of the algorithm, an algorithm for which the user won't be able to understand the basis for a recommendation. So even for regulated products, there is an incentive to provide at least some explanation to the user, in labeling or elsewhere, as a risk mitigation tool that would reduce the evidence of safety and effectiveness required. It should be noted, though, that even if FDA does not insist on explainability to the ultimate user, FDA as a scientific agency will have a hard time authorizing the marketing of AI without itself having some understanding of causation.

2. Factors to Be Balanced in Selecting the Best Policy Option

As explained more fully in Part I of this series, figuring out what's best for patients requires balancing numerous factors that sometimes conflict. We want to make sure that users — typically healthcare professionals (and increasingly patients) — can understand the strengths, weaknesses, and appropriate use of an AI algorithm before using it, so that they can employ it on the right patients in the right circumstances. To ensure informed consent, we also need to make sure that patients understand the nature of the algorithm and, if there is necessary reliance on a black box algorithm, that they understand the risks. At the same time, as a conflicting priority, the information we share for both transparency and explainability needs to respect the privacy interests of those on whose data the algorithm was trained.

There are other areas where conflicts arise. For example, we need to balance the burdens of this postmarket vigilance against the need for innovation and access to potentially life-saving AI-based technologies. That means not demanding more monitoring and more corrective action than is necessary, but it also means not requiring developers to share more information with the public than is necessary, particularly trade secret information that might compromise their ability to compete in commercializing the product. Healthcare is an industry, and it follows the same economic rules as every other sector. If we force AI developers to share their trade secrets, they may be unable to survive competition and therefore unwilling to invest in future development. Finally, all development needs to proceed in a way that produces a cost-effective product, one that will be available within our collective economic resources.

3. Proposal: Living Labeling

Based on these competing factors, I propose we pursue a concept of "living labeling." Living labeling, as I'm using that term, would mean labeling that is constantly being updated with information relevant to both transparency and explainability. Ordinarily, for medical devices, labeling is a fixed set of words that FDA agrees to during its premarket review. There is some flexibility to change them over time, but not a lot. With this concept of living labeling, there would be a pre-clearance agreement with FDA that the developer would update the labeling to reflect the most current X, Y and Z at some agreed interval, without requiring a new regulatory submission.
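There is no established format for such labeling, but purely to illustrate the concept, here is a sketch of what a machine-readable "living label" might contain, refreshed by the developer at the agreed interval. Every field name and value below is an assumption for illustration, not an FDA-specified schema or a real product.

```python
import json
from datetime import date

# Entirely hypothetical content illustrating the "living labeling" idea.
living_label = {
    "device": "SepsisWatch AI (hypothetical)",
    "model_version": "2.3.1",
    "label_updated": date.today().isoformat(),
    "intended_use": "Early warning of sepsis in hospitalized adults",
    "training_population_summary": {
        "n_patients": 48210,
        "age_range": "18-89",
        "sites": 12,
        "known_gaps": ["pediatric patients", "home-care settings"],
    },
    "current_performance": {  # refreshed from the postmarket monitoring described above
        "sensitivity": 0.87,
        "specificity": 0.91,
        "evaluation_window": "2020-04-01 to 2020-06-01",
    },
    "update_policy": "Algorithm retrained quarterly within pre-specified constraints",
}

print(json.dumps(living_label, indent=2))
```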

AI-based medical devices require more effort to be put into continuously updating labeling information. Overall, for an adaptive algorithm that is continuously evolving, the metrics tracked under transparency need to be continuously updated. And while explainability may be static in the sense that the software has to be programmed to explain itself each time it is used, if explainability includes dynamic elements, those dynamic elements need to be kept up to date.

In the FDA reports I quoted in Part I, you saw the FDA proposing essentially the same thing, where developers are constantly sharing updated information. So this concept isn’t new or unique to me. However, I do have a different implementation in mind than FDA. There seem to be two differences in our viewpoints. First, FDA seems to have in mind sharing at least some of this information with the agency on an ongoing basis. Second, FDA likely has in mind a much deeper and broader sharing of information, rather than summaries of key statistics.

I am proposing that software developers share information in summary form without (1) breaching patient confidentiality, (2) sharing information that would damage their trade secret positions or (3) requiring overly costly information gathering and processing. The exact approach is going to vary widely depending on the type of AI and the intended use of the software product. Here are some general principles the companies might consider.

a. Transparency

Developers with medical AI-based solutions on the market should keep existing and prospective customers informed as to current performance. That means sharing an appropriate summary of key metrics tracked as a consequence of the monitoring described above. This is not just an occasional updating when the company deems that something interesting happened. This means as a matter of course at some appropriate interval updating key information through some mechanism such as the company’s website or more directly through the product itself.

This is in addition to the information relevant to understanding the appropriate use of the AI model, which likely doesn't change but nonetheless needs to be shared. Information that doesn't change includes how the algorithm was historically trained, including details about how the data were acquired and used. This is key because each user of the AI-based product needs to make a judgment about whether there is potentially any bias in the original training set that would affect a given use. If I recognize clearly that the algorithm was not, for example, trained on adolescents, I know that if I want to use the product with an adolescent there is some unquantified risk that the results will be inaccurate.

A user might also want to know: How were the data cleaned? What are the features or dimensions on which the model was trained? Were the data augmented? While these details are obviously a bit more technical, an informed user might very well draw conclusions from them about the suitability of the algorithm for her particular use.

It’s also important that the company be transparent about what, qualitatively, it is doing to monitor and manage the quality of the product postmarket. Users would be highly interested in how the company is permitting the algorithm to adapt, what constraints are on that adaptation, and what quality checks and testing the company is performing. These all provide insight into the risks associated with using the product as it evolves.

For example, if I know that the product is labeled for use with adults, but I see that procedurally the company is not limiting new data to adults only, I know there is some risk over time that adolescent data will be added by other users and will shift the performance of the product. As another example, how often is the algorithm allowed to adapt? At the onset of a new communicable disease like COVID-19, knowing how quickly the data set may be influenced by COVID-19 patients helps me understand how, or at least how quickly, the performance of the algorithm may change during a pandemic.

As suggested by one commentator, Ron Schmelzer, who writes for Forbes, it may also be important to include model versioning. He suggests that "those producing models should also make it clear and consistent how models will be versioned, model versioning frequency, and the ability to use older model versions if new models start to perform poorly."
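Again purely as an illustration of the versioning idea Schmelzer describes (none of these version numbers or metrics are real), a developer might maintain a simple registry so that users can see which model version produced a given result and fall back to an earlier version if a new one underperforms.

```python
# Hypothetical model registry illustrating versioning and rollback.
MODEL_REGISTRY = {
    "2.3.0": {"released": "2020-03-15", "auc": 0.91, "status": "available"},
    "2.3.1": {"released": "2020-06-01", "auc": 0.89, "status": "current"},
}

def roll_back(registry: dict, to_version: str) -> None:
    """Mark an older, better-performing version as the current one."""
    for meta in registry.values():
        if meta["status"] == "current":
            meta["status"] = "available"
    registry[to_version]["status"] = "current"

roll_back(MODEL_REGISTRY, "2.3.0")   # e.g., if 2.3.1 starts to perform poorly
print(MODEL_REGISTRY)
```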

To support decisions regarding appropriate use of the algorithm, FDA has already specified a good deal of information to be included in labeling, and I won't repeat all of that here. But it includes such things as the developer's intended use, as well as the results of the clinical trials used to validate the technology prior to FDA approval.

b. Explainability

The exact mechanisms used to achieve explainability, or XAI, vary widely depending on the data science domain. For example, explainability may be comparatively easy to achieve with image analysis, where the software can be programmed to highlight a region of interest even if it can’t precisely articulate why that region is interesting.

In contrast, with data science models that make use of electronic health record data, XAI approaches may be a bit harder and include strategies such as:

  • Feature interaction and importance, which includes an analysis of feature importance and of pairwise feature interaction strengths.
  • Attention mechanisms, which can identify the positions in a sequence carrying the information most relevant to a prediction task.
  • Data dimensionality reduction, which, as you might guess, focuses on the most important features.
  • Knowledge distillation and rule extraction, which transfer knowledge from a complex and accurate model to a smaller, less complex one that is faster but still accurate.
  • Intrinsically interpretable models, which rely on preserving the interpretability of less complex machine learning methods while enhancing their performance through boosting and optimization techniques.[2]

Pointing out the key statistically significant features is similar to a radiologist who points to a place on an image and says, “I’ve seen that before, and it’s been malignant.” Much of what we “know” in medicine is just associations without a deeper understanding of a causal relationship. Innovative data scientists are also developing completely new approaches such as visual models to explain more accurately data science results.[3]
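As a concrete, hedged example of the first strategy in the list above, the sketch below computes permutation importance with scikit-learn on a toy model trained on synthetic data; in a real device the features would be clinical variables, and the resulting rankings would feed the explanation shown to the clinician. The feature names here are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Toy stand-in for EHR-derived features (entirely synthetic).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
feature_names = ["lactate", "heart_rate", "temperature", "wbc_count"]
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Permutation importance: how much does performance drop when each feature is shuffled?
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, score in sorted(zip(feature_names, result.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```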

In summary, to help users interpret the output of AI, developers can:

· Explain what can be explained. Don’t make the problem bigger than it has to be. If the software is actually a blend of expert systems and machine learning, and if a particular recommendation is based on expert systems, such as simply looking up the drug allergy in the patient’s EHR, following a simple computational model or recommending a treatment because it is cheaper, the recommendation ought to reveal that reason.

· For that software output which truly relies on AI, state the particular association that underlies a particular recommendation as precisely as possible. As already explained, based on new research, in some cases it’s possible to identify what features statistically affected the recommendation the most.

· Convey the confidence level for the particular recommendation. Uncertainty is always important to communicate, both to allow the best decisions and, legally, to allow informed consent. (A sketch combining these last two points appears below.)
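Putting the last two points together, here is a minimal sketch of how a per-recommendation message could state the dominant association and a confidence level. The logistic model, the synthetic data, the feature names, and the wording of the message are all illustrative assumptions, not a prescribed format.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training data standing in for a real clinical dataset.
rng = np.random.default_rng(1)
feature_names = ["fever_days", "oxygen_saturation", "age"]
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)
model = LogisticRegression().fit(X, y)

def explain_recommendation(x: np.ndarray) -> str:
    """Build a per-patient message with a confidence level and the dominant association."""
    prob = model.predict_proba(x.reshape(1, -1))[0, 1]
    # Per-feature contribution to the log-odds for this individual patient.
    contributions = model.coef_[0] * x
    top = feature_names[int(np.argmax(np.abs(contributions)))]
    return (f"Estimated probability: {prob:.0%} "
            f"(association driven most strongly by '{top}' for this patient; "
            f"this is a statistical association, not a causal finding).")

print(explain_recommendation(np.array([2.1, -1.3, 0.4])))
```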

C. Report to FDA Problems That Meet the Legal Reportability Test

Companies selling FDA-regulated medical devices must have a quality system in place. That quality system includes a requirement that such companies have complaint handling procedures. User complaints may lead to identification of problems. If those complaints identify circumstances that trigger a reporting obligation under the Medical Device Reporting (MDR) regulation (21 CFR Part 803), the company needs to make those reports to FDA.

This reporting obligation exists because, at a certain level of seriousness, FDA needs to be kept informed in case it believes more significant remediation is required than the company itself plans, and so that it can monitor for the prevalence of similar issues in other products regulated under the same regulation or product code. FDA wrote the MDR regulation before AI was deployed in healthcare, so there may be a need for updated guidance to clarify exactly when the reporting obligations are triggered in the case of AI. However, as I will discuss more in Part III, FDA should not expand the MDR obligations.

The MDR reporting thresholds reflect a careful balance between signal and noise. The software developer has the burden of distilling the signal from the noise and reporting sufficiently serious signals to the agency. This helps ensure that the agency receives enough information, but not so much that it overburdens the agency's systems and, frankly, causes FDA to miss the signal.

As described above, there are also reports to FDA for changes to commercialized products that are material but don’t rise to the level of requiring premarket FDA review. Companies obviously should make those reports as well.

D. Permit FDA to Inspect the Company’s Data and Records

Apart from reporting, the quality system requires that manufacturers document essentially every change to a device and document such things as investigations into complaints and other experience data. Indeed, it appears that virtually all of the information that interests FDA can be found in records that exist under the quality system.

Section 704 of the Federal Food, Drug & Cosmetic Act authorizes FDA to conduct inspections of organizations that produce medical devices at reasonable times, within reasonable limits, and in a reasonable manner. The scope of these on-site visits permits the agency to examine quality system information as well as physical manufacturing and quality systems to ensure compliance. Thus, FDA can physically inspect the quality system records that contain the relevant information. Under the law, medical AI developers need to permit these on-site FDA inspections. However, as I will explain more in Part III, FDA should not try to expand what the statute allows.

In Part III, I will lay out the case for the damage that would be done by allowing FDA to become more involved in the oversight of the commercialization of AI. FDA is certainly an important regulatory body, but as a regulator with a naturally conservative bent, FDA is not designed to be involved in the day-to-day decisions necessary for the commercialization of AI.

[1] I would note that while I approach this primarily as an attorney, and only secondarily as a data scientist, data scientists see and appreciate this difference independently. See page 4 of Holzinger et al., "Causability and Explainability of Artificial Intelligence in Medicine," WIREs Data Mining and Knowledge Discovery, 2019.

[2] Payrovnaziri et al., "Explainable Artificial Intelligence Models Using Real-World Electronic Health Record Data: A Systematic Scoping Review," Journal of the American Medical Informatics Association, 2020.

[3] Lamy et al., "Explainable Artificial Intelligence for Breast Cancer: A Visual Case-Based Reasoning Approach," Artificial Intelligence in Medicine, 2019.
