
Data Science Project Management

A 5-step Project Management Framework for Data Science

This article is part of a larger series on Full Stack Data Science. In the previous post, I introduced the idea of a full-stack data scientist and the 4 hats it entails. In this article, I will discuss the first of these four hats – the project manager (PM). While there are countless ways to approach data science project management, here I propose one possible framework and the PM’s role in executing it.

Photo by Scott Blake on Unsplash

Data science projects often involve developing machine learning (ML) models to solve business problems. While this may seem commonplace in business today, it still comes with several risks.

Namely, developing ML models is inherently uncertain, technically demanding, expensive, and time-consuming. These risks motivate project management frameworks designed specifically with data science projects in mind.

Here, I will describe one such approach and break down the key contributions of a project manager in this context.

A 5-Step Project Management Framework

The approach I like to use for data science projects is outlined by the 5-step framework illustrated below.

My 5-step data science project management framework. Image by author.

Digging deeper, here are a few key activities for each phase.

  • Phase 0: Problem Definition & Scoping – Formulate the business problem. Design the data science solution. Define project milestones, tasks, and success metrics. Key role: Project Manager
  • Phase 1: Data Acquisition, Exploration, & Preparation – Evaluate available data. Acquire and explore data. Develop data pipelines. Key roles: Data Engineer, Data Scientist
  • Phase 2: Solution Development – Develop ML solution. Evaluate solution validity and value. Iterate with stakeholders and revisit past phases as needed. Key role: Data Scientist
  • Phase 3: Solution Deployment – Integrate solution into real-world business context. Develop solution monitoring pipeline. Key roles: ML Engineer, Data Scientist
  • Phase 4: Evaluation & Documentation – Evaluate project outcomes. Deliver technical documentation and user guides. Reflect on lessons learned and future work. Key role: Project Manager

An important point here is that data science projects often do not progress linearly through each of these phases. Rather, some amount of iteration is required through key feedback loops. Here are a few examples of what this might look like.

  • Phase 1 → Phase 0: When exploring the available data, it becomes clear that key information is not available, and the project plan must be revisited.
  • Phase 2 → Phase 1: After training a handful of models, it is discovered that an exception was not properly handled in data preparation.
  • Phase 2 → Phase 0: Preliminary models do not demonstrate strong predictive performance, which requires reevaluating the project’s value.
  • Phase 4 → Phase 0: Every project has its opportunities for improvement. Upon completion, teams can evaluate these opportunities and kick off another project, starting with Phase 0.
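
One way to make these feedback loops concrete is to write the phases and the allowed backward jumps down as data. The snippet below is a minimal Python sketch and not part of the framework itself. The names Phase, FEEDBACK_LOOPS, and next_phases are illustrative assumptions, and the loops are only the four examples listed above.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The five phases, numbered as in the framework (names shortened)."""
    PROBLEM_DEFINITION = 0
    DATA_PREPARATION = 1
    SOLUTION_DEVELOPMENT = 2
    SOLUTION_DEPLOYMENT = 3
    EVALUATION = 4


# Hypothetical encoding of the example feedback loops:
# (from_phase, to_phase) -> what triggers the jump back.
FEEDBACK_LOOPS = {
    (Phase.DATA_PREPARATION, Phase.PROBLEM_DEFINITION):
        "key information is not available",
    (Phase.SOLUTION_DEVELOPMENT, Phase.DATA_PREPARATION):
        "an exception was not handled in data preparation",
    (Phase.SOLUTION_DEVELOPMENT, Phase.PROBLEM_DEFINITION):
        "preliminary models show weak predictive performance",
    (Phase.EVALUATION, Phase.PROBLEM_DEFINITION):
        "improvement opportunities kick off a new project",
}


def next_phases(current: Phase) -> list:
    """Return the next phase in sequence (if any) plus any feedback targets."""
    reachable = []
    if current < Phase.EVALUATION:
        reachable.append(Phase(current + 1))
    reachable.extend(dst for (src, dst) in FEEDBACK_LOOPS if src == current)
    return reachable


for phase in Phase:
    print(phase.name, "->", [p.name for p in next_phases(phase)])
```

Writing the loops down this way makes it easy to see that Phase 2 is the most common point for going back, which is one reason to budget extra time for it in the project plan.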

Role of the Project Manager

The project manager (PM) is ultimately responsible for a project’s success. If the project is late, it’s on the PM. If costs exceed estimates, it’s on the PM. If the value doesn’t meet expectations, it’s on the PM.

While this responsibility involves a diverse range of tasks from multiple contributors, one key determinant of a project’s success is the PM’s execution of Phase 0 (as described above).

Phase 0 sets the foundation of a data science project. Just as a poorly constructed foundation will result in a difficult construction project, a poorly executed Phase 0 will result in a difficult data science project.

The 3 key elements of Phase 0 are Problem Diagnosis, Solution Design, and Implementation Plan [1].

1) Problem Diagnosis

Of the 3 elements, this is the most critical because if you get this wrong, you can spend a lot of time and money solving the wrong problem (i.e., little value is generated). Despite its importance, many tend to gloss over (if not skip entirely) taking the time to stop and think about the business problem.

Just as a doctor interviews a patient to produce a diagnosis, a PM interviews stakeholders to better understand the business problem and identify the root cause. Although there are many ways to do this, I like to keep things simple and focus on asking two key questions.

  1. What problem are you trying to solve? – this is always the best starting point for these conversations [1]
  2. Why is that important to the business? – this can kick off a series of 5 why-based questions to get to the problem’s root cause (see Toyota’s 5 Why’s approach) [2]

One of the PM’s most important skills is effectively collaborating with stakeholders to understand their problems. I discuss this further in a past article.

5 Questions Every Data Scientist Should Hardcode into Their Brain

2) Solution Design

Once the business problem is clearly understood, the next step is to define how to solve it. Various solutions at various levels of complexity can address any given problem.

For instance, if customer churn is high due to a slow onboarding process, some potential solutions could be removing unnecessary onboarding steps, analyzing where drop-off occurs and reworking that step, customizing onboarding based on customer information, etc. Notice that these solutions may not require Machine Learning (and that’s okay).

Suppose, after extensive back-and-forth, the stakeholder wants to move forward with developing a personalized onboarding experience based on customer profiles. While this narrows things down, this solution can still be implemented in many ways. Therefore, the PM must use their judgment to propose a solution based on stakeholder conversations, similar industry projects, and available resources.

3) Implementation Plan

The final element of Phase 0 is translating the proposed solution into a concrete project implementation plan. This plan consists of two key pieces: a project roadmap and the project requirements.

A project roadmap consists of key project milestones. I like to base these milestones on Phases 1–4, as described above. Each phase consists of tasks assigned to a particular role (e.g., data engineer, data scientist, or ML engineer) and a due date [1].

Project requirements specify all the necessary resources for implementation, including data requirements, key roles, software tools, and compute infrastructure.
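
To illustrate how the two pieces fit together, here is a minimal Python sketch of an implementation plan as plain data. The class names (Task, Milestone, Requirements, ImplementationPlan) are illustrative assumptions, and every date and requirement value in the example is a placeholder rather than a detail from a real project.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Task:
    description: str
    role: str   # e.g. "Data Engineer", "Data Scientist", "ML Engineer"
    due: date


@dataclass
class Milestone:
    phase: int  # Phases 1-4 of the framework
    name: str
    tasks: list = field(default_factory=list)

    def due_date(self) -> Optional[date]:
        """Treat a milestone as due when its last task is due."""
        return max((t.due for t in self.tasks), default=None)


@dataclass
class Requirements:
    data: list
    roles: list
    software: list
    compute: list


@dataclass
class ImplementationPlan:
    roadmap: list            # list of Milestone
    requirements: Requirements

    def tasks_for(self, role: str) -> list:
        """Collect every task assigned to a given role across milestones."""
        return [t for m in self.roadmap for t in m.tasks if t.role == role]


# Placeholder example: all dates and requirement values are made up.
plan = ImplementationPlan(
    roadmap=[
        Milestone(1, "Data Acquisition, Exploration, & Preparation", [
            Task("Evaluate available data", "Data Scientist", date(2024, 1, 12)),
            Task("Develop data pipelines", "Data Engineer", date(2024, 1, 19)),
        ]),
        Milestone(2, "Solution Development", [
            Task("Develop ML solution", "Data Scientist", date(2024, 2, 2)),
        ]),
    ],
    requirements=Requirements(
        data=["<data sources>"],
        roles=["Data Engineer", "Data Scientist"],
        software=["<software tools>"],
        compute=["<compute infrastructure>"],
    ),
)

for task in plan.tasks_for("Data Scientist"):
    print(task.due.isoformat(), task.description)
```

Keeping the roadmap and requirements in one structure makes it simple to answer questions like "what is each role responsible for, and by when?" without digging through a document.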

Case Study: Semantic Search over YouTube Videos

I will walk through Phase 0 for an example case study to solidify these ideas. While this is meant to be instructive, it is a real project I will implement (and document) in future articles of this series.

🔗 Series Reading List | YouTube Playlist


Background

I share content about data science and entrepreneurship on platforms like Medium and YouTube. This serves as a way for me to structure my learning and document my journey as an entrepreneur.

A natural consequence is that my content spans various topics, attracting a diverse audience of learners, entrepreneurs, and business leaders. While this diversity enriches the learning experience, it can also present challenges for audience members navigating different topics across multiple platforms.

Problem

Given the diverse range of topics I cover, audience members may face difficulties in efficiently locating content that aligns with their specific interests or current educational needs. This can lead to lower engagement and potentially inhibit audience growth as users may give up or miss out on relevant content.

Solution

To enhance content discoverability and user engagement, I propose developing a centralized repository where all my content from Medium and YouTube is accessible. This platform will feature a search function that allows users to easily find specific topics, answer data science questions, and explore entrepreneurship insights. The search functionality will be designed to understand natural language queries, making it easier for users to find what they need without navigating through different platforms.

As a first step towards creating this centralized repository, I will develop a proof-of-concept web page that will index all my YouTube videos and incorporate a search function that enables viewers to locate videos by specific topics or questions. This initial phase will help assess the feasibility of the search technology and lay the groundwork for an MVP version of the centralized repository.

Implementation Plan

Project requirements. Image by author.
Project milestones and tasks. Image by author.

What’s Next?

Given the unique challenges of building data science solutions, managing these projects requires special consideration. Here, I described one possible framework for doing this and broke down the project manager’s key contributions.

In the next article of this series, we will move on to Phase 1 and walk through a typical data engineering workflow using the above case study as a guide.

More in this series 👇

Full Stack Data Science


Resources

Connect: My website | Book a call

Socials: YouTube 🎥 | LinkedIn | Twitter

Support: Buy me a coffee ☕️



[1] What Problem Are You Trying To Solve: An Introduction to Structured Problem Solving by Astor et al.

[2] Toyota’s 5 Why’s Approach

