Vendors everywhere, please don’t tell me your Machine Learning (or better yet, ‘AI’) model is 99% accurate. Don’t tell me that all we need to do is ‘plug in your data’ and some shiny magic happens. I am not the only one skeptical: recently, 31 scientists challenged a study by Google Health because insufficient evidence was made public to determine the credibility of the proposed AI models. We all need to do our part to build trustworthy models that power our advancing applications. Being informed is the first step in mitigating model risk.
Provide me context
When examining internally created models or canned vendor products, there are some necessary steps I follow. There are many other great questions. Please feel free to share them in the comments section.
First of all, we need to understand whether what you are doing behind the scenes is actually machine learning. If you can’t tell me what types of models generate the results, get your engineer on the call.
Without context, accuracy metrics are worthless. Ninety-nine percent accuracy is terrific if you are labeling fish species for a children’s zoology app, not so much if you are identifying incoming missiles.
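A quick sketch makes the point concrete. All numbers here are hypothetical: when the event of interest occurs in only 1% of cases, a model that never flags it at all still scores 99% accuracy.

```python
# Sketch (hypothetical numbers): a "model" that never flags the rare event
# still reaches 99% accuracy when the event occurs in only 1% of cases.
positives = 10        # actual rare events (e.g., incoming missiles)
negatives = 990       # normal cases
total = positives + negatives

# "Always predict negative" baseline: every real event is missed.
correct = negatives   # all negatives right, all positives wrong
accuracy = correct / total
print(f"Accuracy: {accuracy:.1%}")   # 99.0%, yet recall on the event is 0%
```

That is why the base rate of the event has to be on the table before any accuracy number means anything.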
Before any metric is even whispered, the following three questions need to be examined.
What is the goal?
- Are you promoting something or preventing something?
How frequent or rare is the event?
- Give me stats. ‘Really rare’ isn’t a good response.
What is the worst-case scenario?
- What happens if the model is wrong, both as a False Positive and as a False Negative?
- If you don’t know, you don’t understand the use case. You have some work to do.
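The three questions above can be folded into a rough expected-cost calculation. This is only a sketch, and every rate and cost below is a hypothetical placeholder you would replace with real numbers from your use case:

```python
# Hypothetical sketch: combine event rate and error costs to judge a model.
base_rate = 0.01          # how frequent is the event?
fp_rate = 0.05            # model flags a non-event (False Positive)
fn_rate = 0.10            # model misses a real event (False Negative)

cost_fp = 50.0            # e.g., cost of a needless manual review
cost_fn = 10_000.0        # e.g., cost of a missed incident

# Expected cost per case = P(False Positive) * its cost + P(False Negative) * its cost
expected_cost = (1 - base_rate) * fp_rate * cost_fp + base_rate * fn_rate * cost_fn
print(f"Expected cost per case: ${expected_cost:.2f}")
```

Even this crude arithmetic forces the conversation the questions are after: you cannot fill in the variables without knowing the goal, the event frequency, and the worst-case scenario.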
Provide me proof and details
Once I understand those basics, then we can get to the details.
Where did you get the training data?
- If purchased, who was the vendor?
- Are you building models on top of modeled data?
- How was the data collected and prepared?
- Transactional, phone survey, web app, phone app
Types of models trained:
- What kinds of models?
- How did you arrive at those particular models?
- Were you reliant on a specific platform or architecture when making that decision?
Metrics:
- Accuracy is good, but there is so much more than accuracy. Please show me the confusion matrix. Where did the model do well, and where did it not?
- Could you show me the top predictors? I need to verify that they ‘make sense.’ What was the association value? Is there possible data leakage?
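A confusion matrix is cheap to produce, so there is no excuse not to show one. Here is a minimal sketch, using made-up spam/ham labels, of how the matrix reveals where a model fails rather than just how often:

```python
from collections import Counter

# Sketch with hypothetical labels: build a confusion matrix from
# (actual, predicted) pairs to see WHERE the model fails.
actual    = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "ham"]
predicted = ["spam", "ham",  "ham", "ham", "spam", "spam", "ham", "ham"]

counts = Counter(zip(actual, predicted))
tp = counts[("spam", "spam")]   # correctly flagged
fn = counts[("spam", "ham")]    # missed spam (False Negative)
fp = counts[("ham", "spam")]    # good mail flagged (False Positive)
tn = counts[("ham", "ham")]     # correctly passed

print("          pred spam  pred ham")
print(f"spam         {tp}          {fn}")
print(f"ham          {fp}          {tn}")
print(f"Accuracy: {(tp + tn) / len(actual):.0%}, recall on spam: {tp / (tp + fn):.0%}")
```

On this toy data the headline accuracy is 75%, but the matrix shows the model misses a third of the spam, which is exactly the kind of detail a single accuracy number hides.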
Bias:
- What bias (demographics, region, collection method such as iPhone vs. Android, ethnicity, health and wellness) do you understand to be present in your training data? If bias is unavoidable, what have you done to mitigate it?
- Have you assessed the possible bias impact by analyzing the results of the model implementation? A model can have absolutely ‘clean’ data and still have unintended consequences. Gender, race, ethnicity, and socio-economic status easily leak into databases even when those columns are not in use. For a detailed example, please read "Dissecting racial bias in an algorithm used to manage the health of populations" by Obermeyer et al., Science, 25 Oct 2019: 447–453. It nicely details how the choice of a seemingly reasonable optimization target (lower health care costs for chronic patients) resulted in underserving worthy wellness-program candidates in the Black community.
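Proxy leakage is easy to demonstrate. Below is a sketch on synthetic, entirely hypothetical data: the protected attribute is dropped from the training table, yet a correlated feature (here, zip code standing in for residential segregation) still recovers it most of the time.

```python
# Sketch (synthetic, hypothetical data): a protected attribute left OUT of
# the training data can still leak in through a correlated proxy like zip code.
import random
random.seed(0)

rows = []
for _ in range(1000):
    group = random.choice(["A", "B"])
    # Residential segregation makes zip code a strong proxy: each group lands
    # in "its" zip code about 90% of the time in this synthetic setup.
    zip_code = "10001" if (group == "A") == (random.random() < 0.9) else "20002"
    rows.append((group, zip_code))

# Even with the 'group' column dropped, zip code predicts group quite well.
match = sum(1 for g, z in rows if (g == "A") == (z == "10001"))
print(f"Zip code alone recovers group membership {match / len(rows):.0%} of the time")
```

So "we removed the sensitive columns" is not, on its own, evidence that the model is free of demographic bias; you have to test the outputs.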
Conclusion
As quickly as machine learning and AI evolve, so should our expectations of transparency in model design and verification. Without the ability to thoroughly understand models and the outcomes they produce, distrust in ALL models may arise, stifling growth and creativity. If you aren’t asking questions already, please start doing so. All of our success depends on it.
References
AI is wrestling with a replication crisis
Obermeyer et al., "Dissecting racial bias in an algorithm used to manage the health of populations," Science, 25 Oct 2019: 447–453