Engineering Meets Data Science: How to Balance the Tension Between Data Science and Agile

A few months ago, I started managing a research group on machine learning and deep learning.

Anat Rapoport
Towards Data Science


Photo by Austin Neill on Unsplash

I have a lot of experience managing dev and engineering teams, but this was my first time being exposed to these topics, algorithms and work methodologies.

On one hand, I had a research team that said it needed time to investigate the issues and read more material, unable to commit to dates and accuracy levels.

On the other hand, I had the management of my startup asking for timelines, results, and commitments.

This was my first draft of the process I thought was the right one, meeting the needs of the research team and the demands of management somewhere in the middle:

  1. Product defines the problem and creates a user story that describes the need.
  2. The research team reads relevant articles, and looks for similar problems which other data science teams have faced.
  3. The data team collects relevant data, and someone (internal or external) needs to label it.
  4. The team tries several algorithms and approaches and reports the results. In our case, for example, we show a confusion matrix that demonstrates where errors occurred, which lets us take targeted action in the confusing areas.
  5. At this point we can run a few iterations with product; sometimes that means slightly changing the definition of the problem, when doing so doesn’t harm the feature and improves the success rate.
  6. Once we have a good enough result, we develop an API and a feature over it. We call this productionisation.
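As a minimal sketch of step 4, assuming scikit-learn and a toy three-class task (the article doesn’t name a library or dataset), the confusion-matrix inspection could look like this:

```python
# Hypothetical sketch: inspect a confusion matrix to find confusing
# class pairs. Labels and predictions are toy data, not from any
# real project described in the article.
from sklearn.metrics import confusion_matrix

labels  = ["cat", "dog", "cat", "bird", "dog", "bird", "cat", "dog"]
preds   = ["cat", "cat", "cat", "bird", "dog", "dog", "cat", "dog"]
classes = ["bird", "cat", "dog"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(labels, preds, labels=classes)
print(cm)

# Off-diagonal cells show where the model confuses classes; each such
# cell is a candidate for targeted action (more data, merged classes,
# or a slightly changed problem definition, as in step 5).
for i, true_cls in enumerate(classes):
    for j, pred_cls in enumerate(classes):
        if i != j and cm[i, j] > 0:
            print(f"{cm[i, j]} x {true_cls} predicted as {pred_cls}")
```

Here the off-diagonal cells reveal, for instance, that a dog was mistaken for a cat, which is exactly the kind of "confusing area" the team would then act on.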

This process defined the steps, but questions remained about committing to timelines and to accuracy levels.

I presented my process, and the open issues I was still facing, to a few Facebook groups, and I also organized a round table to hear from engineering managers who manage research teams and from a few managers of research teams.

Here are some of the points that came up, as they were raised by the participants:

  • Where I work, there’s an additional stage: a POC with clients for whom the issue we are trying to resolve is relevant. We would run the algorithm on their data and then improve the model according to the results.
  • In my case, improving the model required working on the data itself, adding features and improving existing ones. When it came to the clients, we needed their input, and again, we would add features we hadn’t thought of before, or we would modify the model.
  • Regarding timelines, until the end of the POC they were defined on a weekly basis: one week to test models, one week for a prototype, one week to run it on two different clients. In my field (security), sometimes we needed to change the whole problem definition and start over. Once the POC was a success, meaning clients approved it, the model went to another team for development.
  • We add a retrospective phase: the researcher has to commit to goals, and if they don’t meet their own deadlines, they run a retrospective to learn what worked and what didn’t.
  • We learned that the most important thing is to set very accurate and specific goals, and to set checkpoints in the middle.
  • Deploying to production is not the end of the process. We need metrics to track the performance of the algorithm over time. We need alerts. We need additional iterations to improve the results, and more data. Sometimes there is a dead end and we start over.
  • I think one of the hardest things in managing a data science project is the uncertainty about timelines and about whether the task will succeed at all. Trust and open lines of communication are essential to overcoming the possible obstacles. Some might disagree with me, but I am not a great believer in time estimations for research projects. I’ve done it (while adding all the needed disclaimers and flagging all the possible fail points) when I felt it was something my managers needed. In my current workplace, I don’t do it at all. I have my own expectations about timelines, but when something takes longer, I can always explain the reason: we found out things we didn’t know, the baseline was worse than we thought, the task was harder. This goes back to trust. I can divide the work into sprints, but it’s artificial and has no benefit for me. What’s important for me now is to communicate all the project’s dependencies, the steps, and where the uncertainties are. If the manager is someone who is stressed by uncertainty, the data scientist may communicate less and work harder, hoping it will be good enough, and this is exactly where things go wrong.
    Metrics and KPIs need to be set together with the research team, after checking what’s possible and what meets the business needs. It sounds trivial, but I can’t tell you how many times I’ve heard a colleague say: I need to get to 90% accuracy because my manager (not a data scientist) set it as a goal, without asking some important questions. What would happen if you only got to 85%? Are there non-algorithmic ways to compensate for the missing 5%? What is the baseline? What is the algorithm? What is the current accuracy? Is 90% a reasonable goal? Is it worth the effort if it’s significantly harder than getting to 85%?
    Metrics should also be set iteratively. We can have an end goal, but it should stay flexible to the things that come up while researching.
  • Some of our work uses Agile methodology. We made a lot of adaptations and are learning how to do it right for us. We hold an all-team “Standup” twice a week, and have a Trello board for the team. We plan together and run retrospectives. This allows us to be more transparent and improves cooperation and mutual assistance in the team. Working in a team where everyone knows what the other members are working on is valuable for everyone on the team. In data science, the progress is not linear. But still, it’s important to set short term goals, especially for juniors in the field so that we have a sense of direction as to what we’re doing and where we’re going. It’s also important to experiment with clients so we know how a feature meets a business need and whether it solves the right problems for clients.
  • It’s important to manage data scientists in a way that’s similar to managing programmers. Data scientists have to understand the business needs, including timetables. They should be able to explain to their managers all of the necessary technical material concerning their work. There’s nothing there that you, as managers, should not be able to understand. And data scientists should be willing to code (at least a little) to be less dependent on developers, DevOps engineers, etc.
    Although research tasks have lots of unknowns and it’s hard to commit to a due date and to specific results, it’s essential to set clear milestones and to break them into short-term tasks with a clear DoD that can be managed in sprints.
    Sample tasks can include importing data, cleaning a data source, conducting a single experiment, etc. Instead of setting a KPI goal for three months from now and trying to achieve it, I recommend monitoring the KPI at short intervals as part of the sprint, so that you can make managerial decisions with completely transparent information.
    If there are several directions the research can take, I recommend setting a time frame for researching each option, and then reviewing the results achieved during that time. You, as managers, should be able to decide to stop a research direction and move to another. Sometimes you’ll have to make a tough decision and cut a non-productive research path against the will of the researcher, for the good of the overall project.

    And some tips for data scientists:
    The best things you can do to help your manager succeed (and make yourself appreciated) are:
    * Learn to explain your work clearly, so that your managers can make good business decisions.
    * Learn how to work with deadlines and short milestones. It will make you much more efficient and you’ll actually achieve more.
    * If you feel a certain research path is not working, tell this to your manager as soon as possible. It doesn’t mean you failed — it simply means a research path came to a dead end, that’s how research works.
  • Working with milestones and short term tasks, and monitoring KPIs along the way, are all very important.
    The one thing that’s really hard for researchers and data scientists is planning too far ahead and giving very accurate time estimates.
    We still should be able to give timelines, but the manager should be flexible and understand that plans might frequently change throughout the lifetime of a project because, for example, sometimes the whole modeling approach needs to change or the data is just not what was expected.
    Also, it’s important to make sure that data scientists work in short cycles, and don’t disappear into research for months, but actually have some interactions with the customer/product along the way.
    The best way to do this is to deliver the simplest possible POC as soon as possible, test it with the customer, and then make adjustments or build something more complex according to the customer’s feedback.
  • My two cents as a data scientist: Yes, data science teams and data scientists should work with timetables. A structure-less work environment is bad for everyone. However, remember that data scientists’ timetables are “softer” than dev ones. Solutions to classification/prediction problems can be achieved at different levels. Simple solutions can be achieved within a tighter schedule, and with less data. Complex ones require more data, and come with greater ambiguity. Plan the research roadmap accordingly, and as a rule of thumb, start with the low-hanging fruit. As a manager, make sure you understand the weight of the ambiguity in the tasks that are critical to your product, and what you can do to help the research.
  • Data scientists should know how to code, since ignorance may cripple them. However, make sure as a manager that you are using your tools wisely. Sometimes it’s more beneficial to give the technical work to a proficient programmer, and let the data scientist work on what they’re good at.
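On the point above about needing metrics and alerts once a model is in production, here is a hypothetical pure-Python sketch; the windows, the threshold, and the alerting approach are all illustrative assumptions, not details from the discussion:

```python
# Hypothetical post-deployment monitoring sketch: track a metric
# (accuracy here) over time windows and flag windows where it drops
# below a threshold. All numbers are illustrative assumptions.
ALERT_THRESHOLD = 0.80  # assumed minimum acceptable accuracy

def window_accuracy(predictions, labels):
    """Accuracy of one monitoring window (e.g. one day of traffic)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def check_windows(windows):
    """Return (index, accuracy) for each window that needs an alert."""
    alerts = []
    for i, (preds, labels) in enumerate(windows):
        acc = window_accuracy(preds, labels)
        if acc < ALERT_THRESHOLD:
            alerts.append((i, acc))  # in production: page / Slack / email
    return alerts

# Toy windows: the second one has degraded (e.g. data drift).
windows = [
    ([1, 0, 1, 1], [1, 0, 1, 1]),   # accuracy 1.00
    ([1, 0, 1, 0], [1, 0, 1, 1]),   # accuracy 0.75 -> alert
    ([1, 1, 1, 1], [1, 1, 1, 1]),   # accuracy 1.00
]
print(check_windows(windows))  # → [(1, 0.75)]
```

The same loop structure accommodates the other post-deployment activities mentioned: a flagged window is the trigger for another improvement iteration, for adding data, or, at worst, for starting over.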
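On starting with the low-hanging fruit, a sketch of what that might mean in practice, assuming scikit-learn and its bundled iris dataset (neither is from the discussion): compare a trivial baseline against a simple model before investing in anything complex.

```python
# Hypothetical "low-hanging fruit first" comparison: a trivial
# baseline vs. a simple model, evaluated the same way, before any
# complex solution is attempted. Dataset and models are assumptions.
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

baseline = DummyClassifier(strategy="most_frequent")  # predicts one class
simple = LogisticRegression(max_iter=1000)            # simple, fast model

base_acc = cross_val_score(baseline, X, y, cv=5).mean()
simple_acc = cross_val_score(simple, X, y, cv=5).mean()

print(f"baseline: {base_acc:.2f}, simple model: {simple_acc:.2f}")
# If the simple model already meets the business need, stop here;
# otherwise its gap over the baseline shows how much any extra
# complexity would have to earn to be worth the schedule risk.
```

This also answers the "what is the baseline?" question raised earlier: a KPI target only makes sense relative to what the trivial solution already achieves.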

My own takeaway from all of this is that although traditional Agile is not 100% suitable for the data science field, it’s still important to set goals and timelines. If the relationship between the data science team and management is good, and there is trust between the two sides, timeline and accuracy commitments can play a role, but they should be treated as soft commitments, with everything open to retrospection, explanation, and learning.

I learned a lot from the different perspectives shared, from meeting so many different managers and data scientists, and from hearing what they had to say. I hope that after reading this, you did as well.

I want to thank Miri Curiel, Shir Meir, Lena Capon, Ayelet Sachto, Michal Gutman, Dana Averbuch, Tamar Amiti, Rachel Ludmer, Inbal Rosenshtock and Inbal Naor for their great insights and contribution to this discussion.
