Robots Are Wrong Too—Confusion Mapping for the Worst Case

Here be dragons

Photo by Seth Doyle on Unsplash

When was the last time a calculator didn’t do what you wanted it to? When was the last time that a person did? In terms of determinism, algorithms like machine learning sit somewhere between these two.

Today we are building machines that are more complex than we can understand and we need to deal with them differently than we would previous technologies. It is no longer about specifying what we want something to do and then debugging it.

At Philosophie we have been building new tools and exercises like Empathy Mapping for the Machine to bring human purpose to AI and machine learning projects. The latest is what we call Confusion Mapping, and it helps you better prepare for all the possible ways non-deterministic systems could fail.

Crash course on the confusion matrix

Just as we stole Empathy Mapping from Design Thinking, we are stealing the confusion matrix from data science for Confusion Mapping. Confusion matrixes are performance visualizations for classification algorithms (including machine learning). When digging into how a system performs you need to know more than error and accuracy rates. You need to know how it will fail.

Here is an example:

It is always about cats, isn’t it?

Generally, the horizontal axis is what actually happens and the vertical is what was expected to happen in a perfect world. There are four quadrants that are considered:

  • True positive — the system thinks it should be true and it is true
  • True negative — the system thinks it should be false and it is false
  • False positive (aka ‘false alarm’ or a Type I error for those stats inclined) — the system thinks it should be true and it is false
  • False negative (aka ‘miss’ or a Type II error) — the system thinks it should be false and it is true
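
As a minimal sketch of these four quadrants, the Python below tallies them from two lists: what actually happened and what the system thought. The function name `count_quadrants` and the sample data are hypothetical and exist only for illustration.

```python
def count_quadrants(actual, predicted):
    """Tally the four confusion-matrix quadrants for a yes/no system.

    `actual` and `predicted` are equal-length sequences of booleans.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    counts = {
        "true_positive": 0,
        "true_negative": 0,
        "false_positive": 0,
        "false_negative": 0,
    }
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["true_positive"] += 1
        elif not guess and not truth:
            counts["true_negative"] += 1
        elif guess and not truth:
            counts["false_positive"] += 1  # false alarm, Type I error
        else:
            counts["false_negative"] += 1  # miss, Type II error
    return counts


# Hypothetical sample: six cases, made up for this sketch.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
quadrants = count_quadrants(actual, predicted)
print(quadrants)
```

Accuracy alone would report four out of six correct here. The tally also shows that the two errors are of different kinds, one false alarm and one miss.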

The most traditional way the confusion matrix is used is for classification problems. For example, if you wanted to categorize a picture as either a dog or a cat you would be considering two different classes in the confusion matrix. Thinking that a picture of a cat is a dog would be a false result. In this case there are 18 cases where a dog was mistakenly classified as a cat:

Example of confusion matrix plot for a dogs and cats classifier from the course.

For each of the quadrants we are considering the number of times that the correct (or incorrect) classification takes place. It can be very helpful when considering how impactful particular outcomes are in comparison to their frequency.

Usually when it is represented in plots you will see a diagonal line from the top left to the bottom right where you have the true positives and true negatives for each class or prediction.
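
To make the diagonal concrete, here is a rough sketch that builds a confusion matrix for more than two classes in plain Python. Axis conventions vary between tools. This sketch assumes rows are the actual class and columns are the predicted class. The labels and data are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Return a square matrix of counts: rows = actual, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(actual, predicted):
        matrix[index[truth]][index[guess]] += 1
    return matrix


# Hypothetical labels and results, made up for this sketch.
labels = ["cat", "dog", "rabbit"]
actual = ["cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit"]
predicted = ["cat", "dog", "dog", "dog", "cat", "rabbit", "dog"]

matrix = confusion_matrix(actual, predicted, labels)

# Everything on the diagonal is a correct classification.
correct = sum(matrix[i][i] for i in range(len(labels)))
# Everything off the diagonal is one class confused for another.
confused = len(actual) - correct

for label, row in zip(labels, matrix):
    print(f"{label:>6}: {row}")
```

Each off-diagonal cell names a specific confusion, such as a rabbit classified as a dog, which is the level of detail Confusion Mapping builds on.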

There are cases where you can have many classifications. In the case of ImageNet there are many, many possible classifications. With regard to dogs, there are classifications for 120 dog breeds alone.

They can be huge!

Confusion matrixes are about the outcomes…

The first step towards building services that solve people’s problems is being focused on outcomes rather than solutions. Confusion matrixes do this by counting the number of cases in which the classifier does what it should do.

You need to know what the desired outcome is when the machine does what it should and when it correctly doesn’t do anything.

When confusion matrixes are created, they are mostly concerned with the raw classification. We need to take it a step further and think about the outcomes or possible repercussions of our machine working the way we expect it to.

Let’s take the example of a front-facing object detection system for an autonomous vehicle. It will need to identify when there is someone or something in front of the vehicle so the right action can be taken, such as braking hard or swerving to avoid a collision. When nothing is detected it shouldn’t interfere with the other planning and execution of the vehicle.

For a project we worked on for field service operations (FSO) there was an interface that would recommend whether or not to assign a field tech to a job based on a predictive algorithm. We either wanted the system to assign someone so they could go to the work or ask someone to get more information before it is assigned.

A field service operations job detail for a dispatcher with automatically recommended technicians, parts, schedule, etc.

The intervention to be taken is the most important part of this example. Depending on the type of thing identified it may want to take different actions.

…and how the applications will be wrong

When we build these types of intelligent algorithms they will get things wrong. The reason is that we don’t try out every possible combination of circumstances, just the ones that we think will happen. If we did, it would take forever to build and train these types of systems.

Watch out for that bicyclist!

The autonomous vehicle obstacle detection system we have been talking about would have a false positive when it thinks there is an obstacle in front of the car when there isn’t. The false negative is not thinking there is something in front of the car when there actually is. There is a clear difference between these cases because one results in slamming on the brakes inappropriately (false positive) and the other is hitting an obstacle like a person (false negative).

False negatives are not always worse than false positives. Take the example of an algorithm that diagnoses cancer: a false positive could trigger a series of unnecessary surgical interventions that could cause further problems. It is important to understand as many of the impacts as possible from both a false negative and false positive point of view.
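
One rough way to compare impact against frequency is to weight each quadrant’s count by a cost per case. The sketch below assumes you can put a relative cost on each kind of error. The counts, the costs, and the name `weighted_impact` are all hypothetical, not measurements from a real system.

```python
def weighted_impact(counts, cost_per_case):
    """Multiply how often each outcome happens by how bad one case is."""
    return {
        quadrant: count * cost_per_case.get(quadrant, 0)
        for quadrant, count in counts.items()
    }


# Hypothetical counts out of 1,000 cases.
counts = {
    "true_positive": 90,
    "true_negative": 890,
    "false_positive": 15,
    "false_negative": 5,
}

# Hypothetical relative costs. In the first profile a miss is far worse
# than a false alarm; in the second a false alarm triggers costly follow-up.
miss_heavy_costs = {"false_positive": 1, "false_negative": 50}
alarm_heavy_costs = {"false_positive": 20, "false_negative": 5}

miss_heavy = weighted_impact(counts, miss_heavy_costs)
alarm_heavy = weighted_impact(counts, alarm_heavy_costs)

print(max(miss_heavy, key=miss_heavy.get))
print(max(alarm_heavy, key=alarm_heavy.get))
```

The same counts point to a different worst quadrant depending on the costs, which is why the impacts need to come from people who understand the context rather than from the matrix alone.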

A false positive for FSO tech recommendations would be assigning a field tech before a job was ready to be assigned. This could require return trips (which were a ‘feel bad’ situation for the predictive algorithm) if the field tech wasn’t qualified, didn’t have the right parts, or just didn’t know what to do. The false negative in this case was to have jobs sit in the queue and require more clarification before assigning. This could cause utilization of the field techs to go down and reduce the amount of billable time (both ‘feel bad’ situations for the algorithm).

Premortem as inspiration

Confusion matrixes show how a particular classification algorithm fails. Premortems are a technique that helps us understand how our projects and teams will fail. They are especially helpful in complex team structures.

What works so well about premortems is that they allow everyone to put on their black hat and be really creative about the worst case. The wisdom of the crowd is great at determining the most likely self-inflicted problems for a project.

Premortems as a method and confusion matrixes as a concept together are the inspiration for the Confusion Mapping exercise we have created at Philosophie.

The Confusion Mapping method

This exercise is most appropriate after you have focused in on a problem and a possible solution. In a workshop we ran we did Challenge Mapping (with a good ‘how might we…’), Crazy Eights ideation, and Empathy Mapping for the Machine before doing this exercise.

What is key during this exercise is to not worry about the frequency of the cases quite yet. You are focused on collecting as many problems as possible.

You should plan for about an hour with a group of cross-disciplinary people (e.g. data scientists, engineers, designers, product managers, execs) to work through these steps:

  1. Draw the grid with ‘positive’ being the functionality you want to happen and ‘negative’ being the functionality not happening.
  2. For each of the four quadrants privately ideate on post it notes what the impact would be on the people or other systems if it happens. Take 3 minutes while ideating for each quadrant and immediately share after each quadrant’s generation step…
  3. Together you put each individually ideated item up on the board, get clarification if they aren’t understood, and deduplicate any common items. Try to avoid too much discussion at this point. Take about 2 minutes per quadrant for this.
  4. Now dot vote the most important items in the true positive and true negative quadrants. Stack rank them to one side in order of most to fewest votes.
  5. Next dot vote the worst items in the false positive and false negative quadrants. Stack rank them to the side in order from worst to least bad.

Photo from our NYC Media Lab Design Thinking for AI workshop.

After this process you should have a stack ranked list of outcomes that are most important when things are working and when they aren’t. The most important aspect of the exercise is the alignment and shared understanding you have within the team about what could go wrong.
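
If you want to capture the board afterwards, the dot voting and stack ranking can be sketched in a few lines of Python. This is only an illustration: the function `stack_rank`, the note wording, and the votes are hypothetical, loosely based on the FSO example above.

```python
QUADRANTS = ("true_positive", "true_negative", "false_positive", "false_negative")


def stack_rank(notes, dots):
    """Rank the notes in each quadrant by dot votes, most votes first.

    `notes` maps a quadrant to its deduplicated post-it notes.
    `dots` is a list of (quadrant, note) pairs, one pair per dot placed.
    """
    tally = {q: {note: 0 for note in notes.get(q, [])} for q in QUADRANTS}
    for quadrant, note in dots:
        if note not in tally[quadrant]:
            raise KeyError(f"{note!r} is not on the board in {quadrant}")
        tally[quadrant][note] += 1
    # sorted() is stable, so notes with equal votes keep their board order.
    return {
        q: sorted(tally[q].items(), key=lambda item: item[1], reverse=True)
        for q in QUADRANTS
    }


# Hypothetical board and votes.
notes = {
    "true_positive": ["tech assigned and job completed in one visit"],
    "true_negative": ["dispatcher gathers missing details before assigning"],
    "false_positive": [
        "return trip: tech lacked the right parts",
        "return trip: tech was not qualified",
    ],
    "false_negative": ["jobs sit in the queue", "field tech utilization drops"],
}
dots = [
    ("false_positive", "return trip: tech was not qualified"),
    ("false_positive", "return trip: tech was not qualified"),
    ("false_positive", "return trip: tech lacked the right parts"),
    ("false_negative", "field tech utilization drops"),
    ("true_positive", "tech assigned and job completed in one visit"),
]

ranked = stack_rank(notes, dots)
for quadrant in QUADRANTS:
    print(quadrant, ranked[quadrant])
```

Vote counts here stand in for importance in the true quadrants and for severity in the false quadrants, matching steps 4 and 5.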

Be less confused

At Philosophie, we use these types of exercises when going from problem to plan and I think you will find they are beneficial.

Now that you have a shared understanding in the team, there are a few things to consider:

  • Take a systems approach with more deterministic modules to adjust or guide the outputs to avoid those scenarios.
  • Break out the model into multiple models with different, single error metrics, as an ensemble to reduce the likelihood of those errors happening.
  • Start to build with an eye for estimating frequency of the different outcomes.
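
As one illustration of the first suggestion, a deterministic check can sit between the model and the person using it. The sketch below reuses the FSO example and only guards against the false positive of assigning a tech too early. It does nothing about false negatives. The `Job` and `Tech` types, their fields, and `guarded_recommendation` are hypothetical names, not the system we built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    required_skill: str
    required_parts: frozenset


@dataclass(frozen=True)
class Tech:
    skills: frozenset
    parts_on_truck: frozenset


ASSIGN = "assign"
ASK = "ask for more information"


def guarded_recommendation(model_says_assign, job, tech):
    """Let deterministic rules veto a model recommendation to assign."""
    if not model_says_assign:
        return ASK
    # Rule 1: never assign a tech who lacks the required skill.
    if job.required_skill not in tech.skills:
        return ASK
    # Rule 2: never assign a tech who is missing a required part.
    if not job.required_parts <= tech.parts_on_truck:
        return ASK
    return ASSIGN


# Hypothetical job and techs.
job = Job(required_skill="hvac", required_parts=frozenset({"filter"}))
qualified = Tech(skills=frozenset({"hvac"}), parts_on_truck=frozenset({"filter", "fuse"}))
unqualified = Tech(skills=frozenset({"plumbing"}), parts_on_truck=frozenset({"filter"}))

print(guarded_recommendation(True, job, qualified))
print(guarded_recommendation(True, job, unqualified))
```

The rules come straight from the worst items in the false positive quadrant, so the map tells you which guards are worth writing first.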

Understanding how these systems can be wrong is the first step towards building trust with the people that will use them. Untrusted solutions aren’t used.