
Good AI starts with Good Data: Running a Data Design Sprint with your Team.

Data for Change

The tech giants – Amazon, Microsoft, Google, Uber, Facebook – all spend millions of dollars on AI solutions. Yet, many end in failure:

  • IBM Watson was pulled out of medical institutions when it began to frequently give bad advice to cancer patients.
  • Amazon’s hiring AI showed bias towards hiring men because it learned that good hires tended to use verbs that were commonly found on male engineers’ resumes.
  • Uber’s autonomous driving AI struck and killed a woman because it failed to classify her as a pedestrian.

Although each of these systems was designed and built in good faith to accomplish a task, each failed because of unintended predictions.

How could this possibly happen?

Data.

Each of these models made poor decisions not based on the data they had, but based on the data they didn’t have.

When training a machine learning model, the three most important aspects of the data are:

  1. Optimal Inputs: If all the data in the world were available to you, which inputs would you use? How are the model’s outcomes affected if these inputs are not present?

  2. Training Data: What data do you need to train your predictive algorithm? Where can you procure it?

  3. Feedback Data: Once your model is trained, what data do you need to continuously improve it?

As an AI PM, I created the Data Sprint to shortcut the endless debate and compress months of work into a single hour. Instead of launching into your AI product development lifecycle and later realizing that your model predicts poorly, you finish the hour with a tactical roadmap.

The process is similar to running a Google Design Sprint. The requirements are simple:

  1. You’ll need to invite the right people who understand the business goal, technology requirements and user needs.
  2. Every member of the team is asked to work alone, together. During the brainstorming time, each member captures their thoughts on sticky notes.
  3. The team converges to discuss each sticky note in order to prioritize next steps.

Using the Data Design Sprint, within a 1-hour meeting, you’ll walk out with the most important outcome for building a machine learning model: commitment & alignment. After all, who doesn’t love their own idea?

Here’s how I conduct them with my team and the time I spend on each section. Being strict on the timing is critical to this process. Team members may fret about having so little time to think, but a little encouragement is all you’ll need to push them to see results.

  1. Define the prediction problem (20m)
  2. Brainstorm the inputs and theme (10m)
  3. Brainstorm possible bias outcomes of each input feature (10m)
  4. Summarize, Capture & Share (10m)

As you explain this process to your team, you need to stress the fundamental philosophy of product development: experimentation.

At the end of each sprint, you will have one hypothesis.

You will have focused on one prediction goal and you will be armed with the recipe to test your hypothesis.

Let’s get started!

Because of the exceptional circumstances of the year 2020 (COVID-19), I find that the easiest way to conduct these sprints is on Trello. You will notice that the process runs 50 minutes; the remaining 10 minutes are a buffer for chatting and technical difficulties. Trust me, it’ll take an hour.
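To make the time budget concrete, here is a minimal Python sketch that lays the four steps out on a one-hour clock. The step names and minutes come from the agenda above; the `schedule` helper and the variable names are my own illustration, not part of any tool.

```python
# Sprint agenda as listed above: (step, minutes).
AGENDA = [
    ("Define the prediction problem", 20),
    ("Brainstorm the inputs and theme", 10),
    ("Brainstorm possible bias outcomes of each input feature", 10),
    ("Summarize, Capture & Share", 10),
]
MEETING_MINUTES = 60  # the one-hour meeting slot


def schedule(agenda):
    """Return a (start_minute, end_minute, step) tuple for each agenda item."""
    slots = []
    start = 0
    for step, minutes in agenda:
        slots.append((start, start + minutes, step))
        start += minutes
    return slots


slots = schedule(AGENDA)
working_minutes = slots[-1][1]                      # end of the last step
buffer_minutes = MEETING_MINUTES - working_minutes  # time left for chat/tech issues

for start, end, step in slots:
    print(f"{start:02d}-{end:02d} min  {step}")
print(f"Buffer: {buffer_minutes} min")
```

Sharing a timetable like this at the start of the meeting makes it easier to stay strict on timing.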

1. Define the Prediction Problem (20m)

Goal: To create alignment on the problem the team will be tackling for the next hour

Use the first 15 minutes to have the stakeholder(s) explain their business problem and goals. Depending on your organization, the stakeholder may be a Product Manager, a Director, a VP, or the CEO. Keep the discussion hyper-focused on the problem and goals. Allow for questions and discussion, but keep it bound to 15 minutes.

While the stakeholders speak, the team is busy creating stickies on what they think the prediction problem might be. For instance, during the discussion the stakeholders may mention that one of their goals is to increase the average volume of items purchased for the next quarter.

A data scientist on your team may write "Predict an item a user may want", your engineer "Predict average user budget for specified month", and your Product Manager "Predict item availability".

In the last 5 minutes, using Trello stickers, ask your team to vote for the best prediction problem. This is what your team will focus on for the remainder of the sprint.

2. Brainstorm the inputs/features of the model (10m)

Goal: Develop a broad set of inputs for your model

Start by telling your team that the goal of this exercise is to brainstorm possible input features – regardless of whether you can obtain them or not. This concept needs to be stressed. The goal of this exercise is to have all team members contribute regardless of whether they think their ideas are bad.

Once 5 minutes have passed, use stickers to vote for the most important inputs. Depending on how you want to do this, you can modify the process here using prioritization frameworks like impact/effort.

Assign a member of the group to help you theme the inputs. Organize inputs under a theme if possible (e.g., gender, age, and city would all fall under a demographic theme).

Sort the inputs by votes in descending order, so that the top-voted inputs are at the top of the list.
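If you export the cards after the meeting, the theming and sorting steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the `cards` data, the vote counts, the "behavior" theme, and the `build_shopping_list` helper are all hypothetical, and ranking themes by their total votes is my own choice rather than a rule of the sprint.

```python
from collections import defaultdict

# Hypothetical cards exported from the board: (input, theme, votes).
# The demographic inputs mirror the example above; the rest is made up.
cards = [
    ("gender", "demographic", 2),
    ("age", "demographic", 4),
    ("city", "demographic", 1),
    ("past purchases", "behavior", 5),
    ("items in cart", "behavior", 3),
]


def build_shopping_list(cards):
    """Group inputs by theme and put the top-voted inputs first."""
    by_theme = defaultdict(list)
    for name, theme, votes in cards:
        by_theme[theme].append((name, votes))

    # Within each theme, sort by votes, highest first.
    for inputs in by_theme.values():
        inputs.sort(key=lambda item: item[1], reverse=True)

    # Assumption: themes are ranked by their total number of votes.
    return sorted(
        by_theme.items(),
        key=lambda pair: sum(votes for _, votes in pair[1]),
        reverse=True,
    )


shopping_list = build_shopping_list(cards)
for theme, inputs in shopping_list:
    print(theme)
    for name, votes in inputs:
        print(f"  {votes} votes  {name}")
```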

As the AI PM of my team, I now look at this list as a shopping list. It’ll be up to you to use your ingenuity & creativity to determine where you can obtain these inputs. Half the work is done for you since you’ve already prioritized which inputs are necessary / nice to have.

Now you understand your prediction problem and the inputs required to train a model.

3. Brainstorm possible bias outcomes of each input feature (10m)

Goal: To understand the biases that your inputs can introduce to your prediction. Depending on your goal and prediction, this section can be skipped.

For each input theme, start a new column and ask your team to think about the possible consequences of including or excluding these inputs.

Each answer should be a single card, with one thought per card. For example, if demographics were a theme for a hiring algorithm, an example answer might be: possible hiring bias based on gender, ethnicity, etc.
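A bias card like the one above can later be turned into a concrete check. The sketch below is one possible way to do that, not a method prescribed by the sprint: it compares how often a hypothetical hiring model predicts "hire" for each group. The `predictions` data and the `selection_rates` helper are invented for illustration, and a gap in rates is a prompt for investigation rather than proof of bias.

```python
from collections import defaultdict

# Hypothetical model outputs: (group, predicted_hire). Not real data.
predictions = [
    ("men", True), ("men", True), ("men", False), ("men", True),
    ("women", True), ("women", False), ("women", False), ("women", False),
]


def selection_rates(predictions):
    """Return the share of positive ("hire") predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, hired in predictions:
        totals[group] += 1
        positives[group] += int(hired)
    return {group: positives[group] / totals[group] for group in totals}


rates = selection_rates(predictions)
# Difference between the most- and least-selected groups.
gap = max(rates.values()) - min(rates.values())
print(rates, gap)
```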

4. Summarize, Capture & Share (AI PM)

In 1 hour, you’ve just walked your team through a data brainstorm and now you have a tactical roadmap to move forward.

You’ve determined a number of prediction goals, the inputs you need, and most importantly, you’ve spent time thinking about the possible outcomes of those inputs, reducing the chances of unexpected predictions.

Use the remaining time to summarize the results with your team, capture the information and share.

I always like to issue a survey to the team to learn whether they enjoyed the process and learn how it can be improved. If you enjoyed this article and/or have suggestions, connect with me and let’s chat. I’m always looking to learn from other AI PMs.

