
Team Topology for Machine Learning

Organizing teams in modern organizations on their journey to machine learning, to achieve fast flow.

Photo by Pixabay from Pexels

Nowadays, Machine Learning (ML) is all the rage worldwide. A lot of companies are adopting ML (or AI, or Advanced Analytics, or Data-Driven Decision Making) in their current business processes. In these organizations, a lot of effort goes into recruiting ML talent, forming teams, and identifying each team’s scope. Like many tech organizations, they end up producing monolithic applications, e.g., one platform that includes workflow orchestration, model management, feature management, ML application code, etc. When such an organization finds itself with ten different teams running seven different architectures, it realizes that the situation is neither scalable nor reasonable. The platform is more expensive to maintain, which makes it difficult to replicate the design elsewhere. It is harder to update, which in turn slows down delivery. It is very difficult to discontinue, which lowers team morale. It also encourages heavy reliance on tribal knowledge, which makes the teams sensitive to staff mobility.

Clearly, breaking up monoliths is the way to go. However, if it is not done carefully, teams may push those problems out of near sight, but eventually the problems will resurface somewhere in the delivery cycle.

Since a machine learning application (particularly one in production) is a special type of software system, I have looked into the principles modern tech companies adopt for microservice design. I found that many follow the framework advocated by the book Team Topologies. Most discussions I found in the book and other complementary materials are relevant to web services, cloud technologies, etc. In this article, I adapt their proposal to suit organizations managing ML applications.

Team Topologies

Microservices are the dominant software architecture style in today’s tech world. Organizations are moving from monolithic applications towards loosely coupled microservices to make software easier, faster, and safer to change. However, building teams around the software, rather than the software around the teams, carries the risk of producing a "distributed monolith". If splitting a monolith into microservices pushes the release of an application further down the deployment pipeline, the changes are not necessarily faster or safer.

Team Topologies pushes the idea of team-sized software. It stems from Conway’s Law, which states that an organization will produce a system design that mirrors the organization’s communication structure. If a system is designed without the organizational communication structure in mind, teams will produce difficult-to-manage systems. The book advocates a "Reverse Conway Maneuver" – adopting a team structure that matches the expected design of the system.

The tricky part is to figure out how to split a monolith safely without creating a distributed monolith. The book recommends focusing on the cognitive load on the teams. It can be defined as the collective amount of mental load within the total active memory capacity of the team, which varies with team composition. A team of senior members has a higher cognitive load capacity than one of junior members. However, even for a senior team, the load capacity is likely to be smaller than what many monoliths demand. This means that a monolithic system needs to be broken down into subsystems, where the size of each subsystem is based on the cognitive load capacity of the team, and vice versa.
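As a rough illustration of matching subsystem size to team capacity (the "load point" unit and all numbers below are invented for this sketch, not from the book), splitting decisions could be guided by comparing a team's estimated load capacity with the load of each candidate subsystem:

```python
# Hypothetical "load points" per member, in the spirit of story points.
CAPACITY_PER_MEMBER = {"senior": 5, "mid": 3, "junior": 2}

def team_capacity(members):
    """Rough cognitive load capacity of a team from its composition."""
    return sum(CAPACITY_PER_MEMBER[level] for level in members)

def fits(members, subsystem_loads):
    """Can this team own these subsystems without overload?"""
    return sum(subsystem_loads) <= team_capacity(members)

# A senior-heavy team can own more of the monolith's subsystems...
print(fits(["senior", "senior", "mid"], [6, 5]))   # True: 11 <= 13
# ...than a junior-heavy team of the same size.
print(fits(["junior", "junior", "mid"], [6, 5]))   # False: 11 > 7
```

The estimation unit is deliberately crude; the point is only that subsystem boundaries follow from team capacity rather than the other way around.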

Based on this idea, the book defines an alternative approach to identifying team boundaries based on four specific team types, namely stream-aligned, complicated-subsystem, platform, and enabling teams. A stream-aligned team delivers one critical value stream of the organization to its key stakeholder, customer, or end-user by developing, operating, and supporting a product or solution. A complicated-subsystem team manages a specific subsystem that requires several specialists. A platform team develops and supports the common platforms needed by the stream-aligned teams, so that those teams do not need to focus on platform development and management. Finally, an enabling team assists the stream-aligned teams in adopting new technologies, whether or not these are provided by the other teams.
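The four team types can be sketched as a small data structure; the team names and missions below are hypothetical examples, not prescriptions from the book:

```python
from dataclasses import dataclass
from enum import Enum

class TeamType(Enum):
    STREAM_ALIGNED = "stream-aligned"                # delivers one value stream end to end
    COMPLICATED_SUBSYSTEM = "complicated-subsystem"  # owns a specialist subsystem
    PLATFORM = "platform"                            # provides common self-service platforms
    ENABLING = "enabling"                            # coaches teams on new capabilities

@dataclass
class Team:
    name: str
    type: TeamType
    mission: str

# An illustrative slice of an organization using this topology.
org = [
    Team("markdown-pricing", TeamType.STREAM_ALIGNED,
         "deliver discount prices to pricing stakeholders"),
    Team("ml-platform", TeamType.PLATFORM,
         "provide an end-to-end ML development platform"),
    Team("ml-coaching", TeamType.ENABLING,
         "help stream-aligned teams adopt new ML capabilities"),
]
```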

Teams for Machine Learning

As illustrated by the four colored boxes in Figure 2, inspired by Team Topologies, we stick with four types of teams: stream-aligned ML, ML enabling, data/infrastructure subsystem, and ML platform teams. Check the image caption to understand the details of the topology diagram. We describe the team types in more detail as follows:

Stream-aligned ML Teams

These are teams that develop and/or manage ML solutions for end-users, i.e., domain experts or customers in an organization. For example, in a retail company, such a team can be a markdown/discount-pricing team that delivers prices for seasons throughout the year. The scope of the team can vary but should be determined by the team’s cognitive load. For example, if the data sources and regression mechanism of the solution do not vary much between regular and sale seasons, then the cognitive load to support both does not double, and hence a slightly bigger team can develop, operate, and manage both solutions for its stakeholders. On the other hand, in the same industry, the online and store channels can operate very differently. Therefore, the markdown solution for the online channel may be operated by one team, whereas the same type of solution for the store channel may be operated by a different team.

Should such a team develop its own platform or data/infrastructure subsystems? Ideally not, if the team does not have to, since doing so is likely to increase the team’s cognitive load by a big margin and make the team slow at delivering the features that maximize its target value streams.

Data/Infrastructure Subsystem Teams

Even though data and infrastructure are totally different components, from an ML team’s perspective both are specialist concerns and can readily be treated as subsystem teams. On the data side, there can be a data lake/warehouse/mesh team, a data catalog team, a data governance team, etc., containing specialists such as domain data experts and data platform engineers, among other roles. On the infrastructure side, there can be cloud administration policy teams (in the case of a multi-cloud organization), a VPN solution team, an identity management team, etc., containing specialists such as cloud solution architects, network engineers, and security experts. These teams may support some or all of the other team types. Having such teams ensures that data and infrastructure management expertise does not have to be developed or explored within the ML teams, and that the number of specialists the organization needs to recruit can be kept lower.

ML Platform Teams

These teams develop or support end-to-end machine learning platforms for ML solution development, so that stream-aligned ML teams do not have to develop these platforms themselves. The question is what the extent of a platform is. We are inspired by the definition provided by Evan Bottcher on martinfowler.com in the article What I Talk About When I Talk About Platforms, which roughly states that a platform is more than just software and APIs – it is documentation, consulting, support, evangelism, templates, and guidelines.

The platform developed by the teams should be optional. The teams should rather try to make the platform compelling, the same way a global fashion brand makes its products compelling:

  • it is backed by a loyal user community that drives adoption
  • it can be studied, tested, and used primarily in a self-service manner
  • it is designed as composable discrete services that can be used independently or in conjunction
  • it is quick and cheap to start using, with an easy on-ramp
  • it is secure and compliant by default, and easy to update
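The "composable discrete services" point can be sketched with a hypothetical platform SDK (the class and method names below are invented for illustration): each service is a separate component that a stream-aligned team can adopt on its own, in a self-service manner, and compose with others later.

```python
# Hypothetical platform SDK: each service is a discrete, independently
# usable component (in-memory stand-ins for real backing services).
class FeatureStore:
    def __init__(self):
        self._features = {}

    def put(self, name, values):
        self._features[name] = values

    def get(self, name):
        return self._features[name]

class ModelRegistry:
    def __init__(self):
        self._models = {}

    def register(self, name, version, artifact_uri):
        self._models[(name, version)] = artifact_uri

    def latest(self, name):
        versions = [v for (n, v) in self._models if n == name]
        return self._models[(name, max(versions))]

# Used independently: a team may adopt only the registry at first...
registry = ModelRegistry()
registry.register("markdown-model", 1, "s3://bucket/model-v1")
# ...and compose it with the feature store later, with no coupling between the two.
store = FeatureStore()
store.put("avg_weekly_sales", [120, 95, 210])
```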

Most importantly, as defined in the book, the platform should be its thinnest viable version, based on small, curated, complementary components. This means the team delivering the platform must consider the cost/benefit tradeoffs of building versus buying the platform or its components.
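A minimal sketch of that build-versus-buy comparison, with entirely invented figures (real decisions would also weigh cognitive load, vendor lock-in, and feature fit, not just money):

```python
def total_cost(build_cost, yearly_maintenance, yearly_license, years):
    """Rough total cost of ownership over a horizon (illustrative figures only)."""
    build = build_cost + yearly_maintenance * years
    buy = yearly_license * years
    return build, buy

# Hypothetical component: expensive to build and maintain in-house,
# cheaper to license over a three-year horizon.
build, buy = total_cost(build_cost=200_000, yearly_maintenance=50_000,
                        yearly_license=80_000, years=3)
print(f"build: {build}, buy: {buy}")  # build: 350000, buy: 240000
```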

However, platform teams should stop at the tech solutions that are not directly related to ML solution development, unless there is no other option. Even then, a team should only incubate the solution and hand it over when a more appropriate team can take over. A good example is standardizing log management, which is relevant to any software development, not just ML application development. When such a solution is missing in the organization, a platform team can work on it, but should not try to optimize it – only build enough to support the interested stream-aligned teams. Optimizing such a solution is rather the job of a more general platform team supporting all kinds of stream-aligned teams, whether they deliver ML solutions or not.

There are several other important questions:

  1. How many end-to-end platforms should be delivered?
  2. Should one team manage one platform, multiple platforms, or subsystem of a platform?
  3. Out of the tens or hundreds of tech stack combinations, how does a team choose a winning combination?

We will explore these important, but more deep-dive, questions in another article.

ML Enabling Teams

This type of team helps stream-aligned ML teams adopt capabilities they are missing in the ML solution development area. The help can range from choosing a more robust algorithm to onboarding onto a new platform. Collectively, such a team possesses a skill set broad enough to be applicable to a wide range of stream-aligned ML teams. However, its members should not be treated as internal consultants, but rather as internal coaches whose main purpose is to train and enable a team.

An important question is how to form such teams. Such teams should be close to the ML platform teams and, hence, can be formed from expert users in the platform community. However, they can also include engineering, data science, data analysis, and ML product managers, who typically possess broad experience in ML solution delivery and in mentoring others.

How many such teams should there be? We believe there should not be many, and most of them should be formed temporarily by leads in the organization when a specific situation arises.

Remarks

The suggested framework is currently at a theoretical level; I have no data to show that it works. However, I am confident that it is likely to have a positive impact on fast ML solution delivery, since similar conclusions can be drawn from organizations delivering other state-of-the-art software solutions, as presented by the book. I am going to test it in the coming days, assess the outcomes, adjust the framework, and share the results. If you have reached similar designs already and learned a few things that worked or did not work, please share.

