[A comic-style storyboard with scenes of technical ML content and human-machine interactions]

Ethical Storyboarding for Machine Learning

Picturing the systems we build within the systems we live

Naomi Alterman
Towards Data Science
9 min read · May 10, 2019

--

Machine learning is gradient-descending its way into more and more places, and its arrival brings both increasing demand for skilled ML practitioners and increasingly disruptive challenges to the basic assumptions we hold about our society. The truth is, ML is fast becoming too deeply integrated into our world to keep engineering and ethical concerns separate from each other, and to survive the wave it’s creating we’re going to need a workforce that is acutely aware of how its work ripples through the surrounding world. Where better to start increasing this awareness than the classroom?

This spring I co-taught the pilot for Google’s new Applied Machine Learning Intensive bootcamp at Mills College, a program which brought together recent college grads of all different majors and backgrounds to introduce them to a wide breadth of foundational ML techniques. Google was gracious enough to give me the flexibility to try out some integrative approaches to teaching engineering and ethics, and like the output of a well-tuned neural network, the results were electrifying.

The current state of the art

Back when I was an engineering undergrad at Stanford, we were required to fulfill two “engineering citizenship” requirements to round out our technical knowledge with a little bit of self-reflection.

One of these requirements was the Writing-in-Major seminar, a rushed and unfortunate one-unit afterthought of a course centered on critiquing homework assignments from our digital design lab. Class time mostly focused on the importance of decorum when presenting our opinions, and the primary criterion for our writing was enough self-restraint to keep from cursing out our professor (the average grade was surprisingly low). The other requirement, Technology in Society, had us enroll in any class from Stanford’s Science and Technology in Society department. These classes were highly abstract and intellectually formal in ways that made students used to applied engineering content feel like they were trapped in the depths of an Onion article about academia. We were arrogant, hungry engineering undergrads missing the gravity of the subject in a hurricane of academic posturing. Fast forward a decade: the pedagogy and technology of our STEM classrooms have improved immensely, and yet...

At the outset of our ML bootcamp I was relieved to see Google had blocked out an early unit dedicated to the subject of “ML Fairness and Bias”. Cool! That’s a great start, but something about it was still nagging me: it was just another Writing-in-Major seminar. The truth is, a single unit about bias in an introductory ML class is like thoughts and prayers after a tragedy: it sounds nice but abdicates our responsibility to systemically heal the problem.

Let’s talk integrative methodology

The main principle behind our modifications to the ethical unit of this course was that it should be a recurring, constantly present aspect of the engineering process, the same way we treat debugging and testing. Research has shown the value of integrating “test-first” exercises into foundational CS assignments, and our hope was to have similar results with a “bias-first” approach to an ML course: as students build their mental muscle memory for exploring and wielding a dataset within an engineering context, they should also be developing a reflex to ethically analyze their solutions.

What we arrived at for this semester was an “Ethical Storyboarding” worksheet: a writing exercise asking students to concretely picture both positive and negative consequences of an ML algorithm trained on a given source of data, and then brainstorm different ways to alter their designs given their analysis. The questions were structured to pattern a few specific behaviors that we felt were important to see in principled and competent engineers:

  • Keeping possible negative repercussions of your project in mind from the very start, rather than sweeping them under the rug or paying lip service to them after all the technical work is done
  • Brainstorming fixes to deal with problems across multiple dimensions of the engineering process — and not just technical ones
  • Feeling safe to discuss negative, possibly project-stopping concerns, while also entering into the conversation with avenues to move the team forward.

Whenever students encountered a new dataset, they were required to fill this worksheet out. Early on, while students were all working on the same data, we also followed this written work with an in-class discussion to debrief students on the exercise. One of the primary roles of these early discussions was to emphasize the importance of ongoing reflection and iteration, rather than to “engineer-ize” the ethical questions at hand (something that broad surveys of engineering curricula have found to be a recurring problem). We really don’t want students to learn to apply technical band-aids to “solve” bias. That’s possibly the most dangerous takeaway one could get from this exercise, and extra care should be taken by the instructor to address it.

The Ethical Storyboarding worksheet

Presented here is the worksheet as given to our students, with a toy example filled in to demonstrate the form and level of content expected from students in an introductory Machine Learning course.

  • Describe your dataset

We chose the MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/), a collection of 60,000 training images of handwritten Arabic numerals.
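
(An aside for readers following along at home, not part of the worksheet itself: a minimal sketch of loading and inspecting this dataset, assuming TensorFlow/Keras is installed.)

```python
# Minimal illustrative sketch (not part of the worksheet): load MNIST and
# check how the digit classes are distributed. Assumes TensorFlow/Keras.
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)         # (60000, 28, 28) grayscale images
print(np.bincount(y_train))  # roughly 5,400-6,700 examples per digit
```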

  • Write a one-paragraph story describing a fictional person who was positively affected by a model trained with this data

Riley is a doctor who spends most of their time with patients figuring out how to enter data into whatever current electronic medical record software their practice has contracted to use. After the latest update, they decide to switch to free-form note-taking software that lets them write medical records on their tablet in whatever form they like, and uses OCR with numeral recognition trained on MNIST to make these records searchable and copyable. Now, they can focus more on attending to their clients’ needs and never have to understand a new harebrained UI again!

  • Describe at least two sources of bias the particular model in your story could have

In our research we found that the MNIST dataset was compiled from the handwriting of both Census Bureau employees and high school students (we could not find further demographic information about either group). While there are hundreds of different writers represented in the dataset, it’s entirely possible that there are substantive differences in handwriting between on-the-job medical practitioners and the cross-section of people represented by MNIST. Additionally, because most MNIST-based ML benchmarks use the dataset in isolation, it’s unclear whether the introduction of alphabetical characters or markup could make MNIST-trained networks less accurate.

  • Describe at least one way we could modify the model to mitigate this bias
    E.g., what can we do when designing our model to account for inherent bias in its input data?

We could use a classifier on top of the normal recognition network to determine whether a region is likely to be a number or a word, and use this information to weight the final character class probabilities output by the OCR model. It is unclear to us how to address the problem of poor handwriting recognition by altering our software alone; that would seem to require deeper changes to the training data itself.
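
(Another aside for the technically curious, again not part of the worksheet: a minimal sketch of what that re-weighting might look like. The function, variable names, and numbers are entirely hypothetical.)

```python
import numpy as np

def reweight_char_probs(char_probs, p_digit, is_digit):
    """Hypothetical sketch: scale the OCR model's per-character class
    probabilities by a separate classifier's belief that the region is
    numeric rather than alphabetic, then renormalize."""
    weighted = np.where(is_digit,
                        char_probs * p_digit,
                        char_probs * (1.0 - p_digit))
    return weighted / weighted.sum()

# Toy usage: three candidate characters ('3', '8', 'B'), with the region
# judged 90% likely to be numeric by the hypothetical region classifier.
char_probs = np.array([0.45, 0.15, 0.40])
is_digit = np.array([True, True, False])
print(reweight_char_probs(char_probs, p_digit=0.9, is_digit=is_digit))
# -> roughly [0.70, 0.23, 0.07]; the alphabetic reading 'B' is suppressed
```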

  • Describe at least one way we could modify the dataset to mitigate this bias
    E.g., what could we do differently if we collected this data again?

We could build the OCR dataset from the handwriting of doctors, and make sure that handwriting includes examples of the full range of letters, numerals, and symbols we might see in an average medical record.

  • Describe at least one way we could modify the context surrounding the model to mitigate this bias
    E.g., what human practices or policies could we put in place to protect people within the social system where this model is used?

We could institute a strict record release policy such that whenever data is transferred from the doctor’s hand-written notes to a pharmacy or other specialist, a human being reviews the data to make sure all relevant records are included in the release and that all figures copy-and-pasted out of them are accurate.

  • Write a one-paragraph story describing a fictional person who was negatively affected by a model trained with this data

Taylor is one of Riley’s patients, and begins experiencing focal seizures after their 20th birthday. Riley prescribes them the anti-seizure generic Levetiracetam at the maximum dose of 3000 mg/day, but an OCR failure interprets the 3 as an 8 and goes uncaught by the pharmacist. Taylor develops acute medication poisoning, spends a night in the ER, and now has staggering medical debt to pay back on top of their general insurance premiums and outstanding student loans.

The worksheet, in practice

We introduced this worksheet early in the semester, shortly after the students had played with toy linear regression examples but before they had worked with any real-life data. They split off into groups of three, browsed what was available on Kaggle, and filled out a storyboard for the most intriguing dataset they could find. Their work ranged from subtle and chilling (like an ethnic restaurant disproportionately saddled with financial burden due to racism-motivated food safety fears) to hysterically funny (like a fledgling cannabis user being dissuaded from further drug use due to an improper suggestion for a powerful strain called “Strawberry Cheesecake”).

We found that, understandably, students had incredibly varied comfort levels with this exercise: some responded with panic at the request for a “story”, while others had trouble figuring out exactly how to connect a technically esoteric dataset to a real-life narrative. Almost all students needed clarification for the prompt asking them to modify the social context their technologies were used in. In retrospect, possibly the most important improvement for future iterations of this assignment would be to write an example storyboard in front of the class, styled much like a live-coding session one might do for a new programming technique. Asking students to fill in details as the instructor types, Mad Libs-style, could add a fun interactive dimension to this instructional component.

We also saw that students’ early answers to bias mitigation questions tended to range from highly general hand-waving to “quick-fix” type solutions, although their answers improved throughout the semester as their work became more hands-on. We feel that dedicated one-on-one support and guidance to “workshop” these storyboards, beyond the class-wide discussions, is of great value here.

One beneficial use we found for this worksheet, beyond its ethical dimension, was as a diagnostic tool to identify students having trouble connecting the technical dots into a bigger, more applied picture. Asking them to build a concrete map between a dataset, the technical tools they were learning, and a real-life use case provided ample opportunities for us to identify and fill in the gaps. This exercise in design thinking proved important later in the semester, when students were beginning their self-directed final projects: we found that, because this ethical framework was their first experience brainstorming anything to do with ML, many of them instinctively used it as a means to start ideating their projects’ content.

Our future depends on reflection

In Canada, most students graduating with an engineering degree attend the Ritual of the Calling of an Engineer, a ceremony in which they recite a passage emphasizing the social responsibility of their profession and are presented with an iron ring to carry throughout their lives. These rings, a reminder of their societal obligation, are of course no better than a stand-alone unit on ML bias, or a single semester of engineering ethics, or even a survey course full of ethical worksheets. All the same, there is something about them that I find beautiful: they’re an enduring, ongoing presence, at once a piece of jewelry worn with pride over our achievements and a weighted burden reminding us of how our work ricochets throughout our world. This sentiment is the heart of what we hope to convey through our curriculum.

We don’t pose this worksheet and its associated pedagogical methodology as a prescriptive solution for an ethical engineering education; it might not even resemble a proper solution. Rather, we are sharing it with the intent to spark your interest in adopting more ethical content in your own technical coursework, and to convince you to closely integrate doing with questioning. The authors of that study espousing test-first CS assignments found that while the approach didn’t substantively affect students’ project grades, it did measurably improve their ability to test code. Its value wasn’t in teaching first-semester programming per se, but in shaping well-rounded software engineers. As they put it:

If curricula can get students ‘test-infected’ from the beginning, we believe they are likely to realize that testing is an integral part of programming, benefitting [sic] them throughout their academic and professional careers.

What a wonderful sentiment! Let’s try to get students ‘ethics-infected’ as well. For those of you inspired to do something like this in your course, and for those of you already miles ahead of us doing similar work, my heart sings — let’s start a conversation about this, and let’s make it loud.

Naomi Alterman is a freelance CS educator and software engineer who loves having conversations with people about ways to think about things. If you do too, you should drop her a line, she’ll be jazzed about it. Her contact info can be found at www.nlalterman.com, and her Twitter handle is @uhohnaomi
