
Rethinking the Roles of Data Scientists, Engineers and Architects

What the wording tells us about the roles – and why some companies should rethink their approach to, and expectations of, data projects.

Photo by ThisisEngineering RAEng on Unsplash

The role of a "data scientist" has now existed for about 10 years, and soon after it emerged it became clear that an additional role, the "data engineer", was needed to support steady progress. Finally, "data architects" were required to choreograph the interactions between multiple teams and systems. But what are these roles all about, and what is the difference between them? There is actually no standardized definition of these roles, and the interpretation almost always depends on the needs, technology and ultimately the culture of each company. But when you start thinking about the terms, the words themselves, a solid interpretation of these roles becomes visible almost immediately.

In this article we will dig deeper into analyzing these roles and also set clear expectations for the deliverables of each role – some of them might surprise some companies (or more specifically their managers). This will help to strengthen the understanding and expectations, both for managers and for candidates.


What’s in a Word

For a moment, let’s forget that we are talking about roles in the data business (e.g. "data scientist") and let’s focus on the second word of each role: scientist, engineer and architect. Those roles have existed for centuries, and it is well worth going back to their basic definitions and applying them to software and data products.

1. The Scientist

A scientist works where almost all technological progress starts. Essentially, a scientist’s job is to use his creativity to spawn and explore new ideas. He spends his workday inside a laboratory (be it a real lab of things, as in chemistry or biology, or a virtual lab of thoughts, as in mathematics or the social sciences).

Scientist’s Tasks

Basically, a scientist has two important missions: first, he systematically gathers information and knowledge from all kinds of sources, and second, he combines the information and extends the knowledge to generate new ideas.

Typically a scientist uses his creativity to explore new concepts and ideas within a given context or to reach a specific goal, sometimes only with a rough direction to follow. This depends on the goal of the research: Should it be used in new products desperately needed for pushing the company forward within a short period of time or is it part of a more strategic program for the far future?

For example, in the domain of mobility, one scientist might work on improving details of the electric motors to be used in next year’s car models. In this case the context was set by management in order to achieve a short-term improvement. A different scientist might think about slowly changing habits (for example due to an increase in working from home) and how these could affect future requirements for mobility in general.

Scientist’s Methodology

The special responsibility of a scientist to gather and extend knowledge and to research new ideas requires an adequate methodology. In its definition of a scientist, the Science Council already gives strong hints on how to tackle this undertaking: "A scientist is someone who systematically gathers and uses research and evidence, to make hypotheses and test them, to gain and share understanding and knowledge."

This definition essentially summarizes the scientific method, which is built around observations, deductions of hypotheses and testing these hypotheses using sound methods like statistical tests. This method is dominated by a "trial-and-error" approach, but in a very positive sense, in contrast to its connotation in software development. Research is always exploration, and exploration (by definition) always means trying out new things without knowing the outcome in advance. As an important consequence of this approach, a scientist must always be prepared for his theory, built on a set of hypotheses, to turn out to be wrong. From a business point of view, this means that "doing science" always includes a non-negligible risk of failure.

In addition to spawning new ideas and theories, a vital activity of every scientist is the exchange of ideas with other researchers. This can be accomplished at conferences, but even reading a paper already counts as this activity. This type of communication (either bi-directional at a conference or uni-directional by reading) also helps to gain and foster the theoretical knowledge required for formulating and testing new hypotheses.

Scientist’s Responsibilities

The main responsibility of a scientist is to explore and to discover. The primary objective of a scientist is to pave the way for the future, for example by analyzing the impact of social trends or by exploring new technologies to meet new requirements.

At this point, it is particularly important to understand that a scientist is not responsible for turning an idea into a product; instead, he conducts his research as a theoretical foundation for the next step. Of course he should be involved in productizing his ideas, but only in the role of an expert and advisor.

Scientist’s Education and Personality

Both the objective and the methodology of a scientist require an appropriate education but also a certain personality. A scientist should have a strong academic background (i.e. he should hold a master’s degree or a PhD) to ensure that he is experienced with systematic research and with applying the scientific method. That implies that he probably (this may depend on the domain) also needs a solid education in statistics as the main tool for hypothesis testing.

A scientist needs to have strong perseverance combined with almost endless curiosity and a good portion of creativity. These traits help him to get unstuck again whenever he experiences one of the many failures that are inherent to research.

Finally, every scientist also needs to be a strong communicator: on the one hand, he needs to convince other people to believe in and support his ideas in order to secure the funding of his ongoing work. On the other hand, a scientist also needs to know how to simplify his new, possibly complicated findings in order to explain them and their impact to a broad non-expert audience.

2. The Engineer

After the scientist has discovered new insights, it’s then the engineer who will turn the idea into an actual product. Generally, an engineer may use results from scientists in the same company, or he may even do some literature research on the given task and pick up ideas from other scientists.

Engineer’s Tasks

The primary responsibility of an engineer is to create a solution design for a specific problem. In the software world, engineers are also commonly responsible for the implementation, while in other domains this could be the responsibility of additional craftsmen. The solution design should offer an appropriate level of abstraction and generalization, such that it can be reused for similar cases and can form a reliable foundation and tooling for future work performed by the scientists.

To create the solution design, the engineer will either build on the results of the scientist or (when there is no scientist) perform some research into solutions to similar problems. These results probably constitute only a small, yet important, part of the solution design and the final product. Many important pieces will be missing from the scientist’s result, because these details are already well understood and did not require any additional research by the scientist. But the engineer will have to include all these parts.

For example, the scientist might have implemented a new algorithm (be it for machine learning or for something else entirely), but it might not be as efficient as possible (in terms of required energy or time) and it might not scale at all to multiple machines in a cluster. These are important aspects, which the engineer will have to take care of. Eventually the engineer will have to build an application or product, which includes a user interface or API, possibly monitoring or logging and so on. All these additional pieces are well understood, with readily available building blocks in the form of libraries, so no scientific exploration is required.

Engineer’s Methodology

There is some undeniable overlap between the tasks of a scientist and an engineer (for example literature research), but their methodologies differ. While a scientist follows the "scientific method" for exploration and discovery, an engineer follows the "engineering design process" for a solution design. The wording already implies that engineers follow a process, as opposed to scientists, who mainly use a single method.

The engineering design process starts with basic research into solutions to similar problems – this probably has to include many more aspects than what the scientist is looking at. In the software world, this could include researching an appropriate programming language, libraries, design patterns and more. A crucial part of finding appropriate design patterns is the act of abstracting and generalizing the given problem in order to detect its fundamental structure. At the same time, an engineer also needs to collect all requirements from product experts: What precisely should the solution include, who will be its users and how should it be used?

When these two steps are finished, the engineer will validate the feasibility to ensure that the given problem can be solved with existing technology. If there is a gap at this point, a scientist might help to close it.

Finally, a solution design is prepared by combining existing building blocks with custom implementations. Depending on the domain, the implementation is also within the responsibility of the engineer, but in most cases skilled craftsmen and artisans will take over (this is less often the case in software development, although there are the separate roles of "software engineer" and "programmer").

Engineer’s Responsibilities

The engineer’s main responsibilities are twofold: First, he needs to build tools and infrastructure to support the work of scientists. These tools need to be generic enough that scientists can combine different tools in novel ways to innovate.

The engineer is also responsible for picking up the results from the scientist and designing a working product that is usable by the targeted audience. The product should not only incorporate and implement the ideas of a scientist, but should also fulfill many non-functional requirements. For example, it should not break immediately when used incorrectly, and it should be a pleasure for end users to use.

For the software world this means that the product should have a user-friendly UI and that it shouldn’t crash, but instead present meaningful error messages when something is not okay.

Engineer’s Education and Personality

To be able to assume his responsibility, an engineer also needs an adequate education and the appropriate knowledge. Specifically, he should be curious to learn new things and become well versed in applying the state-of-the-art technology of his domain. An engineer needs to understand and handle all requirements – both functional ones ("What should the product do?") and non-functional ones ("How should the user interact with the product?", "In which environment will the product be used?", and the like). Therefore an engineer needs to work very thoroughly, and he should be (at least a little bit) obsessed with detail to build a stable and pleasant product.

Comparison to Scientists

Apparently there is an overlap in methodology between science and engineering. But there are some crucial differences: Engineering is formulating a problem that can be solved through design while science is formulating a question that can be solved through investigation. According to Wikipedia, "the key difference between the engineering process and the scientific process is that the engineering process focuses on design, creativity and innovation while the scientific process emphasizes discovery (observation)."

3. The Architect

Finally we have the role of an architect. The classical meaning according to Wikipedia is "a person who plans, designs and oversees the construction of buildings". Of course we are not only interested here in the construction of buildings but more in the general task of constructing things or products. And I would argue that any sufficiently complex task involving many different disciplines can benefit from the role of an architect as described by the article.

Architect’s Tasks

We can simply generalize the tasks in the definition of a classical architect as a generic concept: The architect plans, designs and oversees the construction of a product. Not all products require that role, and in many different domains these responsibilities are taken over by some (project) manager – and that’s completely fine. I would argue that even a classical architect for buildings is some sort of a manager, since he is the one who oversees the construction.

As the definition already implies, an architect has three important tasks: designing, planning and controlling. In the first phase, the architect has to collect all requirements for the final product and create a technical design to meet them. This design is rather high-level: it describes the overall structure as a solution design, including many fundamental decisions on the technology and material to use. The architect has to decide on everything that is hard to change in the middle of a project – that is probably one important aspect which sets him apart from an engineer (more on the differences later). While an engineer should be able to adapt his component design to changed requirements, the architecture design is more about which components are required and how they fit together. Once work has been under way for a while, these architectural decisions are often very hard to change without demolishing parts of what was already built.

In the second phase, the architect has to plan how the design is to be implemented. That doesn’t mean that he will now start to look into all the technical details; instead, he should be concerned with the skill sets required to build the final product and the order in which the components have to be built. The result of this phase is a rough plan of what needs to be done when, and how much manpower and material will be required.

Finally, in the third and most critical phase, the architect has to oversee the construction. While the implementation and construction itself is performed by engineers and additional workers, the architect has to verify that all artifacts are built according to his high-level design and that all components eventually fit together. In many cases it turns out that some details of the original design don’t work out as expected, so minor design adjustments are necessary, but the overall structure should not change much.

Architect’s Methodology

The architect should also follow the "engineering design process" of the engineer, just at a higher level. An architect should not follow a trial-and-error approach like a scientist (remember, a change in the architecture is difficult and thus really expensive). Instead he should create a design that supports all requirements collected beforehand.

Architect’s Responsibilities

The architect carries a huge responsibility, since his fundamental design decisions have the biggest impact on whether the product can be built at all and whether this can be done within the planned scope and time. The structure and technology of the architecture design often also have an immediate impact on the formation of the construction teams, including the engineers. Changing the design structure might require building new teams – a costly step every company will try to avoid.

After the initial design and planning phase is finished, the architect is also directly responsible for ensuring that construction follows his design. He is responsible for ensuring that the build quality matches his expectations and is sufficient to meet the given requirements. The architect might have to set technical rules for the teams on how individual components need to be built.

Architect’s Education and Personality

On the one hand the architect is responsible for creating the overall design, and on the other hand he needs to oversee the construction. These two obligations also demand a very solid education. The architect needs to understand a broad range of technologies to be able to select the appropriate ones. He needs to know the strengths and limitations of each alternative he considers for the design.

Since the number of available technologies grows exponentially with time, no one can expect a single person to know all the details about all of them. This is also true for the architect. So even if he keeps his eyes open for new trends and technologies on the market, he will need to consult experts who really know all the important details, strengths and limitations.

To meet his obligation of overseeing the construction, an architect also needs to provide technical leadership, and he often finds himself in the role of a mentor, explaining not only the "what"s but, even more importantly, the "why"s of his design decisions. Therefore, in addition to an appropriate education, communication and learning are key to becoming a successful architect.

Comparison to the Engineer

While an engineer is concerned with an implementable solution design including all the nasty technical details, an architect’s role is more high-level. The architect also needs to know about the capabilities and properties of many technologies and frameworks, but he doesn’t need to know how to build a product on top of these technologies. The architect’s responsibility is to envision the big picture, with a strong focus on the interaction between the multiple systems and technologies that are required as building blocks for the whole company. He will make the most fundamental decisions about which technology stacks to use and how the data should flow between them. He conceives the grand plan together with engineers as his consultants for specific technologies.


Mapping Roles to the Data Domains

So far we explored the generic meaning of the roles "scientist", "engineer" and "architect" outside any specific domain. Now let’s try to apply the gained insights to the data domain, where we have several different roles made up from these basic personalities:

  • Data Scientist
  • Data Engineer / Machine Learning Engineer
  • Data Architect

Some of these roles have overlapping responsibilities and not all data companies have or even need to have dedicated persons for each of these roles.

Interestingly, during the last couple of years we have seen a stronger distinction between "data science" and "machine learning", which became necessary since the required skill sets overlap but differ. While the data scientist should have very broad knowledge, including probability theory and statistics, a machine learning scientist should have less broad but much deeper knowledge, especially in the area of deep learning, which is an increasingly important and very rapidly growing sub-domain of data science. So maybe there is even a "machine learning scientist" today, although I have never seen that role named explicitly.

Data Scientist

The tasks given to a data scientist really should require the scientific method to make progress. This implies that much of the data scientist’s work is about proposing hypotheses about specific properties of the given data and accepting or rejecting them by experimentation. Moreover, the data scientist may also be involved in exploring new algorithms for distilling relevant information from data. This means that a data scientist may be busy analyzing customer data to gain new insights into their behavior, or he might research the next deep learning architecture for identifying cats and dogs.

Since this work relies heavily on experimentation and exploration, the data scientist will prefer to work in interactive environments like Jupyter notebooks and RStudio, which support a trial-and-error approach much better than a classical programming environment. For the same reason, a scientist will prefer interpreted languages like Python and R over compiled languages in order to minimize turn-around times, which would be much longer if every modification required a compilation step followed by a restart of the program.

The data scientist will heavily rely on statistical packages for testing hypotheses about data distribution, but he will also employ libraries and applications for data visualization. Eventually he may also use machine learning and deep learning libraries for researching powerful models for predictive and inferential tasks.
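To make the idea of testing a hypothesis about data a little more concrete, here is a minimal sketch of a permutation test for a difference in means. It uses only the Python standard library rather than a dedicated statistical package, and the function name `permutation_test` as well as the sample values are made up for illustration.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns an estimated p-value under the null hypothesis that both
    groups were drawn from the same distribution.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the observations to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements (made-up numbers) for two variants.
variant_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]
variant_b = [13.0, 12.9, 13.4, 12.8, 13.1, 13.3, 12.7, 13.2]

p_value = permutation_test(variant_a, variant_b)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests rejecting the hypothesis that both variants behave the same. In day-to-day work a data scientist would more likely reach for an established statistics library than write the test by hand.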

Data Engineer

A data engineer is a software engineer specialized in data topics. The typical data engineer has two important jobs: First, he has to build a reliable and pleasing environment for the data scientist to work in. This task includes providing all required data by building appropriate data pipelines for ingestion, and it may also include building an integrated model suitable for typical analytical tasks (the terms "star schema" and "wide tables" should come to one’s mind). Probably most people think of precisely this aspect when they think of a data engineer – but there is more to this role than building data pipelines and providing an environment for data scientists. The data engineer is also responsible for providing tools and other building blocks to support and simplify the data scientist’s work.
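As a rough illustration of what an integrated, analysis-friendly model can mean, the following sketch denormalizes a tiny fact table against a dimension table into a "wide" table. The tables, field names and the helper `build_wide_table` are hypothetical, and a real pipeline would typically do this in a database or a distributed engine rather than in plain Python.

```python
# Dimension table: one row per customer, keyed by customer_id (made-up data).
customers = {
    "c1": {"name": "Alice", "country": "DE"},
    "c2": {"name": "Bob", "country": "FR"},
}

# Fact table: one row per order, referencing the dimension by its key.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 30.0},
    {"order_id": 2, "customer_id": "c2", "amount": 12.5},
    {"order_id": 3, "customer_id": "c1", "amount": 7.5},
]


def build_wide_table(facts, dimension, key):
    """Copy the matching dimension attributes into every fact row."""
    wide = []
    for row in facts:
        # Facts without a matching dimension entry are kept unchanged.
        attrs = dimension.get(row[key], {})
        wide.append({**row, **attrs})
    return wide


wide_orders = build_wide_table(orders, customers, "customer_id")
for row in wide_orders:
    print(row)
```

Keeping facts and dimensions separate corresponds to the star schema idea, while the joined result is the kind of wide table a data scientist can analyze without writing joins.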

The other important job of a data engineer is (or should be) to transform the results of a data scientist into a working product. While the data scientist uses the scientific method to explore possibilities, the data engineer picks up the scientist’s result and applies the engineering design process to come up with a solid application that essentially implements the idea of the scientist and embeds it into a larger application. Probably not everyone will agree on this responsibility, but since the engineer is an expert in technology and design, it would be careless to assign a scientist to this job.

This difference in methodology between a data scientist and a data engineer is also reflected in the tools. While the scientist spends most of his time in interactive notebooks, the data engineer uses full-blown IDEs and follows best practices of software development like creating a solid software design, using a compiled language for type safety, implementing unit tests and integration tests and so on.
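The following sketch hints at what these practices look like on a very small scale: a pipeline step with explicit types, input validation and two unit tests. It is written in Python with type hints only to keep one language throughout the examples; the record type `Order` and the function `total_per_customer` are hypothetical, and the tests are plain functions in the style that a test runner such as pytest would collect.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Order:
    """Hypothetical record type, used only for this illustration."""
    customer_id: str
    amount_cents: int


def total_per_customer(orders: Iterable[Order]) -> Dict[str, int]:
    """Aggregate order amounts per customer, rejecting invalid input early."""
    totals: Dict[str, int] = {}
    for order in orders:
        if order.amount_cents < 0:
            raise ValueError(
                f"negative amount for customer {order.customer_id!r}: "
                f"{order.amount_cents}"
            )
        totals[order.customer_id] = totals.get(order.customer_id, 0) + order.amount_cents
    return totals


def test_sums_amounts_per_customer() -> None:
    orders = [Order("c1", 1000), Order("c2", 250), Order("c1", 500)]
    assert total_per_customer(orders) == {"c1": 1500, "c2": 250}


def test_rejects_negative_amounts() -> None:
    try:
        total_per_customer([Order("c1", -1)])
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative amount")
```

The point is less the aggregation itself than the habits around it: a declared input and output type, an explicit failure mode, and tests that pin down the expected behavior.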

Since the responsibility of a data engineer is to build a production-ready application, he is also fluent in the corresponding technology: Spark, Scala and Java are his best friends for implementing data pipelines, but he also knows many details about Hadoop, Docker, Kubernetes and possibly cloud services as the typical target platforms for deployment.

Machine Learning Engineer

The role of a machine learning engineer is rather new, and its duties used to be part of a data engineer’s job. But since large-scale machine learning has its own challenges in data management and model deployment, which require special tools and frameworks to address, it made sense to separate this role from that of a data engineer.

Similar to a data engineer, a machine learning engineer is a software engineer specialized in machine learning topics. The main duty is to provide an appropriate research environment to a machine learning scientist, which is a superset of what a simple data scientist requires. Topics like "version control for training data" and "model deployment" are more specific to machine learning than to generic data science. Therefore a machine learning engineer mainly needs to address these topics, while the data engineer provides the input data.

To meet these obligations, the machine learning engineer should be knowledgeable about infrastructure topics for providing the data scientist’s environment, including data version control and simple model deployment. Specifically for this second part, a machine learning engineer should also be a software developer who can build a production-ready application (probably a REST service) from the data scientist’s model, which requires knowledge of the languages and frameworks of the data scientist, like Python, R, Keras and TensorFlow. He probably also needs to know about Kubernetes as the target environment for model deployment.
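To give a feeling for the gap between a model and a service, here is a minimal sketch of the request-handling core that such a REST service might contain. It deliberately leaves out any web framework; `toy_model` is a stand-in with made-up weights, and the names `handle_predict` and `features` are assumptions chosen for this illustration.

```python
import json
from typing import Callable, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model: a fixed weighted sum (made-up weights)."""
    weights = (0.5, 2.0)
    return sum(w * x for w, x in zip(weights, features))


def _is_number(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def handle_predict(
    raw_body: str, model: Model = toy_model, n_features: int = 2
) -> Tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "request body is not valid JSON"})

    features = payload.get("features") if isinstance(payload, dict) else None
    if (
        not isinstance(features, list)
        or len(features) != n_features
        or not all(_is_number(x) for x in features)
    ):
        message = f"'features' must be a list of {n_features} numbers"
        return 400, json.dumps({"error": message})

    return 200, json.dumps({"prediction": model(features)})


print(handle_predict('{"features": [1.0, 2.0]}'))
print(handle_predict("not json"))
```

Most of the code deals with validation and error reporting rather than with the model itself, which mirrors the non-functional requirements an engineer has to cover when productizing a scientist’s result.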

Data Architect

Finally the data architect has to envision the big picture of how data should flow through the company. He needs to decide on cut points to separate different data domains and which teams should be formed to build the applications and services within each domain.

To meet these obligations, the data architect needs to have a very detailed understanding of the business model and its requirements. His broad knowledge about available applications, platforms, services and libraries enables him to map these requirements to suitable technologies. The first output of his work is a rough design identifying the required data domains together with an appropriate technology stack. This fundamental decision will be very hard to change at a later point in time – even if one could replace some individual components, the separation into different data domains with individual teams will be very hard to overcome later on.

Once the data architect has finished his design and the teams start its implementation, he needs to oversee the progress and make small adaptations to the design in case something doesn’t work as expected. Specifically, he needs to support all teams in interfacing with other teams to build a working and reliable data and service mesh. The architect will enforce rules for inter-domain communication, including prescribing the type of data transport (for example file-based via FTP or message-based via Kafka) and how interfaces are to be published and documented in a global registry.


Conclusion

The article’s view on science, engineering and architecture may diverge from the common understanding of a "data scientist" and a "data engineer", but I highly recommend approaching these terms by looking at the more general definition of a scientist and an engineer. I feel that many companies looking for data scientists are actually looking for data engineers (they have problems that can be solved with technical designs instead of exploration). Nevertheless, even with this view it remains true that ETL is within the responsibility of a data engineer, since it is a design problem. Similarly, finding new insights into customer behavior is within the responsibility of data science, since it is a research problem.

Things get really interesting with machine learning, which can be both: Simple and well-understood problems can be solved by a design process, whereas more difficult problems with little literature may require a discovery process. But even in the latter case, an engineer should be responsible for building the whole machine learning pipeline for production, which also needs to cover many non-functional requirements like scaling, monitoring, logging and more.

You should now also understand that all roles are equally important but require a different focus and depth in knowledge and methodology. Which role suits you best depends solely on your skills and personality. Only the combination of science, engineering and architecture will create exciting and pleasant products containing novel ideas.

I hope this article helps you to decide whether you’d like to be a whatever-scientist, a whatever-engineer, a whatever-architect or something in between. Depending on the role, expectations, methodologies and tools will be very different, and you might feel more comfortable in one role than in another. But the good thing is that all roles within a single domain share a large part of the required knowledge, so it should be easy to shift to a different role once you find out what best suits your skills and preferences.

