Explainability Matters — Here’s Proof

Averting Algorithm Aversion through Explainability

Sanghamesh Vastrad
Towards Data Science


Imagine you’re applying to grad school and the admissions committee at your dream university decides that admission decisions this year will be made by Machine Learning (ML) algorithms instead of human reviewers. Would you be comfortable with ML algorithms evaluating you and making a decision? Some of us probably wouldn’t want that. But why?

Research shows that evidence-based algorithms (ML algorithms) predict the future more accurately than human forecasters. Algorithmic forecasts have proved superior to human forecasts in an array of applications, from stock market forecasting to game playing (as in AlphaGo). Admission decisions can also be seen as a forecasting task: they are essentially a prediction of how good a fit a candidate is for a particular program, or of how successful the candidate will be. Yet, why do some of us still want a human to evaluate us?

If algorithms are better forecasters than humans, then people should choose algorithmic forecasts over human forecasts. However, they often don’t. This phenomenon, which we call algorithm aversion, is costly, and it is important to understand its causes (Dietvorst, Simmons, and Massey, 2014).

We know very little about when and why people exhibit algorithm aversion. There is no agreed-upon mechanism explaining when people prefer human forecasters to superior algorithms, or why they fail to use algorithms for forecasting. With the amount of data we produce daily, almost every forecasting task now involves an algorithm in some form, so it is important to tackle algorithm aversion so that more of us can rely on better-performing algorithms to forecast the future for us.

Dietvorst, Simmons, and Massey (2014 and 2016) carried out several studies to find the causes of algorithm aversion. They found that:

  • People tend to more quickly lose confidence in algorithmic forecasters than human forecasters after seeing them make the same mistake.
  • People will use imperfect algorithms if they can (even slightly) modify the results. Hence, giving control can be a way of overcoming algorithm aversion.

However, we know that providing control may not be possible in many cases. Therefore, we need to look at other options for overcoming or averting algorithm aversion.

[Figure: What is a black box algorithm? (Source)]

Modern forecasting algorithms are mostly seen as black boxes by the majority of the population because they involve complex machine learning models that very few understand. Add to this the fact that a model's complexity and performance are often seen as inversely proportional to its explainability. For example, a linear regression model might be easy to interpret but have poor performance; a neural network, on the other hand, could have great performance but be difficult to interpret. So, can explaining the model's predictions, or understanding what the model has learned, help overcome algorithm aversion? Let's find out!

I conducted an online experiment combining these two areas, model explainability and algorithm aversion, to explore a possible mechanism behind algorithm aversion. In particular, I wanted to answer the question: what role does model explainability play in algorithm aversion, and can explanations help overcome aversion towards algorithms? I operationalized this by observing whether people choose the same algorithm over human forecasters (themselves) more frequently, or rate it higher, when it comes with explanations.

Dataset

Before beginning my experiment, I needed to choose a machine learning algorithm to act as my forecaster/predictor. Training any machine learning algorithm requires data, in our case labeled data. For this purpose, I used an open dataset from Kaggle that mirrors the admissions scenario discussed above.

To make sure participants weren't overwhelmed by numbers, I gave special importance to keeping the number of features/predictors under ten when selecting the dataset. The Graduate Admissions dataset has a 'chance of admit' measure ranging from 0 to 1 for each student profile, along with the following 7 parameters:

  • GRE Score (out of 340)
  • TOEFL Score (out of 120)
  • University Rating (out of 5) of the applicant's undergraduate institution, 1 being the lowest and 5 the highest.
  • Statement of Purpose Strength (out of 5), 1 being the lowest and 5 the highest.
  • Letter of Recommendation Strength (out of 5), 1 being the lowest and 5 the highest.
  • Undergraduate GPA (out of 10)
  • Research Experience (either 0 or 1), 0 indicating no previous research experience and 1 indicating at least some research experience.

I converted the 'chance of admit' measure into an 'Admission Score' by multiplying it by 100 so that it would be easier for participants to work with, i.e., they could enter whole numbers as predictions. The 'Admission Score' can be thought of as a prediction of the student's success or profile strength. The score ranges from 0 to 100, and a higher score indicates a higher chance of admission / stronger profile. The dataset had 500 entries in total, was clean, and didn't require any other major preprocessing or data-wrangling steps.
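For anyone who wants to reproduce this step, here is a minimal sketch of the conversion. The file and column names (e.g. 'Chance of Admit', 'Serial No.') are assumptions based on the public Kaggle Graduate Admission dataset and may differ slightly in your copy:

```python
import pandas as pd

# Assumed local file name for the Kaggle Graduate Admission dataset
df = pd.read_csv("Admission_Predict_Ver1.1.csv")
df.columns = df.columns.str.strip()  # some versions have trailing spaces in column names

# Convert the 0-1 'Chance of Admit' into a 0-100 'Admission Score'
df["Admission Score"] = (df["Chance of Admit"] * 100).round()

# Drop the serial-number column if present, keeping the 7 predictors and the target
df = df.drop(columns=["Serial No."], errors="ignore")
print(df.head())
```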

Model and Explainer

I trained several models on the dataset, and XGBoost was one of the best performing. I decided to stick with XGBoost as the graduate admission predictor since I got good enough results even with minimal parameter tuning and preprocessing. With the machine learning model ready, I had to choose a library to generate explanations for the algorithm's predictions. Thankfully, the machine learning community has been receptive to the problem of model explainability and has developed several methodologies to explain machine learning models.
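Below is a rough sketch of what training such a regressor could look like, continuing from the dataframe above. The hyperparameters and the train/test split are illustrative, not the exact configuration used for the experiment:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

# Features = the 7 applicant parameters; target = the 0-100 Admission Score
X = df.drop(columns=["Admission Score", "Chance of Admit"], errors="ignore")
y = df["Admission Score"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative hyperparameters; minimal tuning was needed in practice
model = XGBRegressor(n_estimators=200, max_depth=3, learning_rate=0.1)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print(f"MAE on held-out profiles: {mean_absolute_error(y_test, preds):.2f}")
```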

One such explainer is SHAP (SHapley Additive exPlanations), a game-theoretic approach to explaining the output of any machine learning model. SHAP can produce explanations for individual rows, showing how each feature contributes to pushing the model output away from the base value. Summary plots and contribution dependence plots, which give an overall explanation of the model, can also be produced with the SHAP library. Resources such as the SHAP documentation and the paper by Lundberg et al. [3] were very helpful in understanding model explainability with SHAP, and I recommend checking them out.
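As an illustration (not the exact plots shown to participants), here is how SHAP's TreeExplainer can produce both the per-applicant and the overall explanations for a tree-based model like the XGBoost regressor above; the feature name 'CGPA' is an assumption from the Kaggle dataset:

```python
import shap

# TreeExplainer is suited to tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Local explanation: how each feature pushes one applicant's predicted score
# away from the base value (the average prediction over the training data)
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0],
                matplotlib=True)

# Global explanation: feature importance / summary across all applicants
shap.summary_plot(shap_values, X_test)

# Contribution dependence plot for a single (assumed) feature name
shap.dependence_plot("CGPA", shap_values, X_test)
```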

Experiment Design

The experiment was a simple randomized controlled experiment to check whether adding explanations had any effect on the choices people made or on how they perceived/rated the algorithm. It was built on the Qualtrics survey platform, and participants were recruited on Amazon MTurk as well as outside crowdsourcing platforms through friends and connections. There were 445 participants in total: 350 from MTurk and the remaining 95 from various sources. The participants had an average age of 30.76 years, with around 43% of them female and 57% male.

The survey started by asking participants about their age, gender, and familiarity with probability, computer science, and machine learning on a scale of 1–4. This was for exploring the heterogeneity of treatment effects once the experiment was completed. All participants were then familiarized with an imaginary scenario of a Canadian university using a machine learning algorithm on its admissions committee. Following that, participants were presented with the feature values of 10 applicants and asked to predict an Admission Score for each. After the participant scored an applicant, the participant's prediction was displayed alongside the algorithm's prediction if the participant belonged to the control group. For the treatment group, an explanation of which features were driving the model's prediction was displayed in addition to both the participant's and the algorithm's predictions.

[Figure] The question (left) presents the applicant's profile feature values to the participant. Control group (top right): after predicting a score, the participant is shown their prediction along with the actual score and the algorithm's predicted score. Treatment group (bottom right): the participant is shown everything the control group sees, plus an explanation of the algorithm's predicted score for that applicant.

A participant in the treatment group was also shown a page explaining Shapley values before the 10 questions, and a page with the feature importance and contribution dependence plots, i.e., the overall model explanation, after the 10 questions.

The control group (without explanations) and the treatment group (with explanations) had 227 and 218 participants respectively, each of whom was randomly assigned at the start of the survey. Once these questions were completed, participants were asked a set of follow-up questions about what they saw. Three questions in this section were of particular interest:

  • You are applying to the same MSc/MBA program this year and the admissions committee is experimenting with the algorithm you saw in the previous 10 questions. Which methodology of review would you be in favour of?
  • Now imagine that you have been appointed as an admissions committee lead for the MBA program. Would you use the algorithm to make decisions on your behalf? Note that you have to deal with a lot of applicants and a limited workforce, but you’re also accountable for the decisions you make!
  • How well do you think the Machine Learning algorithm did? Score the algorithm on a scale of 0–100.

These questions represented, respectively, scenarios where (i) the algorithm's outcome had a direct impact on the participants, (ii) the participants were accountable for the algorithm's outcomes, and (iii) the participants gave a general rating with no personal stake.

Results

The following results were observed for the three follow-up questions described in the experiment design section:

1. When applying for admissions, i.e., when the algorithm's forecast had a direct impact on them, 48.13% of participants in the control group (without explanations) chose the ML algorithm, compared to 49.78% in the treatment group (with explanations).

2. When sitting on the admissions committee, i.e., when the algorithm eased their job but held them accountable, 47.95% of participants in the control group chose the ML algorithm, compared to 49.50% in the treatment group.

3. When asked to rate the algorithm in general, participants in the control group rated it 70.81 out of 100 on average, while participants in the treatment group gave it an average score of 71.07. The Cohen’s d value for the two groups was 0.016.

From the above results, we can see that the differences between the two groups are small but consistently favour the group with explanations. In questions 1 and 2, a slightly higher percentage of participants chose to use the algorithm when it came with SHAP explanations. In question 3, the Cohen's d value of 0.016 shows that the difference between the means is positive and in favour of the group with explanations, but the effect size is very small. A larger sample size would make the conclusions and effect size much clearer.
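For reference, Cohen's d is the standardized difference between the two groups' mean ratings. A minimal sketch of the computation, using hypothetical arrays of the 0–100 ratings collected from each group, could look like this:

```python
import numpy as np

def cohens_d(control, treatment):
    """Standardized mean difference using the pooled standard deviation."""
    a = np.asarray(control, dtype=float)
    b = np.asarray(treatment, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

# d = cohens_d(control_scores, treatment_scores)  # hypothetical arrays; ~0.016 in this experiment
```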

Heterogeneity of Treatment Effects: As with any experiment, I explored the heterogeneity of treatment effects among the participants on three grounds: age, gender, and familiarity with probability, computer science, and machine learning. The treatment effect was the same across all age groups, and gender didn't play a role in determining the algorithm score either.

When we look at the algorithm score, i.e., the way participants perceived the algorithm, the overall difference wasn't large. However, when we look at subgroups based on their familiarity with probability, computer science, and machine learning, the results were as follows.

An important observation was that the algorithm score given by participants without a technical background (i.e., those who had never heard of probability, computer science, or machine learning) was always higher in the treatment group with explanations than in the control group without explanations. The Cohen's d value also increased considerably for this subgroup, suggesting a much larger positive effect on how people rate algorithms when explanations accompany the prediction rather than just a bare number.

Through this experiment, I tried to find empirical evidence that algorithm aversion can be reduced by explainability. This held both in cases where the participants were the ones affected by the algorithm's predictions and in cases where the participants were the ones accountable for its accuracy. In further experiments, the heterogeneity of treatment effects should be the primary focus. We could focus specifically on non-STEM (Science, Technology, Engineering, and Math) participants, with additional questions to understand why they chose the way they did. If the external validity of this experiment can be established on a much larger sample, simple methodologies like SHAP could be incorporated into our everyday machine learning forecasters to help people overcome algorithm aversion!

Conclusion

The world is moving towards 'Explainable AI', and many now see it as a necessity for deep learning algorithms. This experiment suggests that such steps are a necessary part of building trust in algorithms.

Even though algorithm aversion isn't a big problem for now, it is one we need to tackle seriously. We might think of today's forecasting tasks as simply a grad school admission problem or a house price estimation task, but tomorrow it will be a self-driving car forecasting the speed of the truck in front of you. It is inevitable that algorithms will take over tasks where human lives are at stake, be it in healthcare or space exploration. No matter the scenario, we need to find a way to build trust in these algorithms. It is therefore very encouraging to see the machine learning community working on this problem by taking explainability as seriously as complexity on our path of innovation.

References

[1] Dietvorst, Berkeley and Simmons, Joseph P. and Massey, Cade, Algorithm Aversion: People Erroneously Avoid Algorithms after Seeing Them Err (July 6, 2014). Forthcoming in Journal of Experimental Psychology: General. Available at SSRN: http://dx.doi.org/10.2139/ssrn.2466040

[2] Dietvorst, Berkeley and Simmons, Joseph P. and Massey, Cade, Overcoming Algorithm Aversion: People Will Use Imperfect Algorithms If They Can (Even Slightly) Modify Them (April 5, 2016). Available at SSRN: http://dx.doi.org/10.2139/ssrn.2616787

[3] Lundberg, S.M., Erion, G., Chen, H. et al. From local explanations to global understanding with explainable AI for trees. Nat Mach Intell 2, 56–67 (2020). https://doi.org/10.1038/s42256-019-0138-9.

Please feel free to comment with your feedback and suggestions on the post or connect with me on LinkedIn.
