Introduction

Data science departments are a new addition to the traditional corporate structure. Here is how to create one that benefits your organization rather than one that creates unnecessary internal friction.
Long before the Harvard Business Review declared data scientist the sexiest job of the 21st century (2012), mathematician John W. Tukey made equally sexy predictions about how the advent of electronic computing would affect statistics in his paper "The Future of Data Analysis" (1962). He pressed for the indispensable generalization of the highly theoretical and academic field of mathematical statistics into the more practical field of "data analysis", characterized by adding procedures like "planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery" of statistics to the practice of determining truth from data. That is, after all, the most fundamental precept of science: how can truth be distinguished from falsehood, clarified in its full context and applicability, to understand the universe we both inhabit and are components of? No matter the field, data is obtained to guide the journey. Data can also be abused – as it too often is – to propel false narratives through explicit intent, unintended bias, or simple carelessness.
Tukey wrote at a time when the nascent utility of electronic computing for data analysis was just entering discussion, noting that "it is easier to carry a slide rule than a desktop computer" and that "the computer makes feasible what would have been wholly unfeasible". As the power of computing grew, so did the magnitude of the capacity for data to illuminate truth and to lie. The breadth of applicability grew as well, with data science now penetrating nearly every field of study and industrial practice there is. Thus there has emerged a new field of science and industry in one: "data science", characterized by the intersection of mathematics, statistics, data analysis, computing, and databases, along with no small amount of artistic storytelling, with subject matter expertise and scientific inquiry as the unifying core. Given the complexity, abundance, and importance of the data that organizations now face, data science fills a functional gap that has arisen within the traditional organizational structure of businesses. The cost of ignoring this gap is the cost of acting on narratives that are not in accord with reality.
"Data Science", first denoted as such only in 2008, has emerged as a new role in organizations that amalgamated from many others and steps on as many toes. Once intractable problems can now be solved by this new role provided that its function in an organization is well-defined to complement existing departmental roles, rather than inspire territorial backlash. A unique intersection of skillsets held covered by other traditional departments is nevertheless required to execute that new role adequately. Data science is operationally expressed in code, but software engineers do not generally have the mathematical or statistical background necessary. Further, the objective of the engineer is to build something that works for a given purpose, not to seek truth via empiricism. Classically trained statisticians have a strong foundation in the first principles of data analysis and empiricism, but are not usually trained in software and database design or management. The database administrator and data engineer can efficiently manage large volumes of data, but have not necessarily ever engaged in scientific inquiry. Artificial intelligence is a sub-field in its own right sitting at the very center of data science and requires its own expertise in addition to coding, database, and statistical modeling skills. Given that this field of study is roughly a mere decade old now in 2020 and in its modern incarnation, its leaders almost invariably have formal educations in some tangential field and migrated in as necessity dictated. The personal experiences of the early leaders in tandem with the needs of the organizations they work for inform the role that data science has come to play, with that role potentially varying greatly from one organization to the next as industry in general struggles to understand the utility of this new functional role.
After a career in academia as an experimental high-energy physicist, studying the deepest known principles of how the universe is structured with datasets generated by particle colliders, and several more years working as a full-stack software engineer on the data team of an in-flight wifi provider, I have spent the past few years building the data science program for rMark Bio, where we have worked on data-driven approaches to increasing operational efficiency and gathering actionable insights in the life sciences. As a young company unencumbered by legacy departments, products, or mentalities, we have had the opportunity to define the role of data science in a new organization subject to the technological and economic forces of the second decade of the 21st century. I would like to share what we’ve learned.
The First Principle of a Data Science Department
I routinely stress to data scientists the importance of solving problems by understanding the first principles underlying the problem being studied, so let’s approach the problem of defining and building a data science department the same way: What is the first, most fundamental objective of a data science department? I propose:
"The objective of a data science program is to strategically identify facts about data, understand those facts in their full complexity, identify a resulting and relevant truth narrative, and then apply that knowledge for some practical purpose that is aligned with the mission of the organization the data science program is working within. Data science is expressed empiricism, specializing in the methodologies and technologies used to achieve this objective."
The most fundamental goal is not to build something that works, even though doing so is necessary to work with data reliably; it is not to effectively market a narrative, even though doing so is the responsibility of data science once a truth narrative has been found in the data; and it is not to sell a product, even though a data-driven insight may be integral to the design of that product. The first principle of data science is scientific inquiry through empiricism; everything else is a consequence of that pursuit. That search for truth via empiricism is what distinguishes the data science department from the others with which it must work in tandem. Just as in any scientific field, the creative freedom to challenge conventional wisdom – whether the end result is to reinforce that wisdom, better understand it, or overturn it entirely – is crucial to that role and must be respected for data science to pursue its mission.
Data Science Includes Software Engineering, Data Architecture, and Statistics, but Is Not Them
One role that data science most closely resembles is software engineering, because of the practical necessity to organize, clean, and manage data along with writing the code to manage and study it. The objective and output of software operating on data, however, is not merely to transform, reorganize, or display the data, even if all of those may be part of the process. The objective of identifying truths and insights in data makes the outcomes of data science software less predictable and more ambiguous. The correct output for a particular input to a machine learning model cannot be predicted a priori like a well-defined data transformation, so explicit test scripts cannot always be written in the way a software engineer is accustomed to. Models are never clearly "correct" or "incorrect"; rather, they occupy a nebulous realm of being good enough (or not) in certain applicable contexts and not in others, as measured by accuracy scores and other metrics that can never reach 100% (which would negate the need for modeling if they ever did). Nevertheless, the data scientist is perpetually tasked with improving the models as measured by those metrics. Rather than being characterized by the automated computation and transformation of data, data science is the automated study of data, and the mentality of the data scientist must differ from that of the software engineer accordingly, even while mastering similar technical skills.
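To make the contrast concrete, here is a minimal sketch of what a "test" for a model tends to look like, assuming scikit-learn and a synthetic dataset; the model, the data, and the 0.80 accuracy floor are illustrative choices rather than a prescription. Instead of asserting an exact output for a given input, the test asserts that a quality metric on held-out data clears an agreed-upon threshold.
```python
# Hedged sketch: an ML "unit test" asserts a metric threshold, not an exact output.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def test_model_meets_accuracy_floor():
    # Synthetic stand-in for a real dataset.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    # "Good enough" in context rather than "correct": the floor is a judgment call.
    assert accuracy >= 0.80

test_model_meets_accuracy_floor()
```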
If software engineering is the toolset with which data scientists achieve their objectives, the empiricism of statistics is the creative soul engendering the visions to create. Aside from some novel fringe-case examples, "machine learning" can be understood as the application of statistical modeling executed in code instead of the pen-to-paper (or slide-rule) computations that statisticians were relegated to in the pre-electronic computing era. If you have some data in a 2-dimensional plane denoted by the standard "x" (horizontal) and "y" (vertical) axes and you want to find the best-fitting straight line running through that data, you’re doing a linear regression – one of the simplest forms of statistical modeling. If you perform that linear regression with code running on a computer, you are now doing "machine learning". Most traditional statistical regression modeling is the practice of fitting some mathematical function to existing data in a way that best represents that data. Regardless of which mathematical function is being fit to a dataset, determining that best-fit function follows a standard high-level process:
- Define the problem to be solved and design the study
- Collect, clean, and organize a dataset
- Most of the dataset will be used to train the model. Small, independent, and randomized subsamples of the dataset are reserved for "testing" and "validation" of the model being trained.
- Train the model on the training dataset, using the test dataset to monitor the quality of the model during training and the validation dataset to see how well the model performs on representative data it was not explicitly trained on.
- Revise, remaining mindful of whether the model is being over-fit or under-fit (i.e. the bias-variance tradeoff).
- Apply the model for whatever practical purpose that initiated its construction in the first place.
Whole books can and have been written about how to execute these admittedly very high-level steps effectively, but they outline how regression problems are approached; a minimal sketch of the process follows.
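Here is one way that workflow might look for the simple straight-line case, assuming scikit-learn and a synthetic dataset; the split sizes, metric, and model choice are illustrative assumptions rather than the only way to do it.
```python
# Hedged sketch of the regression workflow above: fit the best straight line
# through noisy data, holding out independent subsamples for testing and validation.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0.0, 1.0, size=500)  # a noisy straight line

# Most of the data trains the model; small randomized subsamples are reserved.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
X_test, X_val, y_test, y_val = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

# Linear regression executed in code: this is already "machine learning".
model = LinearRegression().fit(X_train, y_train)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Monitor quality on the test set and check generalization on the validation set.
print("test R^2:      ", r2_score(y_test, model.predict(X_test)))
print("validation R^2:", r2_score(y_val, model.predict(X_val)))
```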

Suppose we now have a dataset and we want to train a model on a shape in the data that is not representable by any single particular mathematical function. An adjustment can be made to the regression process just described by replacing the single mathematical function being fit to the data with a network of mathematical functions. This is the juncture where classical statistical modeling dovetails into what we now call "Artificial Intelligence" (AI). Using a network of mathematical functions instead of a single one has proven very effective at modeling features in data that are more complicated than what any single function could approximate. While there is now a wide variety of machine learning and AI methodologies, the most standard forms largely follow the same basic process described above (i.e. a "supervised training" process). If classical statistical modeling methodologies are "statistics 1.0", AI is "statistics 2.0". Understanding and executing artificial intelligence models, however, requires mathematical and computing expertise that tends to fall just outside of what would strictly be considered the realm of statistics. Further, in practice, modern AI modeling also requires some degree of expertise in data architecture and cloud computing to develop enterprise AI solutions.
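To continue the earlier sketch, here is what that replacement might look like in code, again assuming scikit-learn; the sine-shaped data, the two hidden layers, and the hyperparameters are illustrative assumptions, not a recommended architecture.
```python
# Hedged sketch: the same supervised workflow, with the single function replaced
# by a small network of functions (a multi-layer perceptron).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(2.0 * X.ravel()) + 0.1 * rng.normal(size=2000)  # no single straight line fits this

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

# Two hidden layers of simple nonlinear units stand in for the single function.
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=1)
model.fit(X_train, y_train)

print("validation R^2:", r2_score(y_val, model.predict(X_val)))
```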
One of the most critical insights one can have about data is how to represent its shape. Before any kind of modeling should even be considered, we need to manage and understand the data itself. There are numerous ways to represent the shape of data, and making informed decisions about how best to do so for a particular dataset has critical consequences for data warehousing architecture. For example, a banking model of customers, checking accounts, savings accounts, mortgages, etc. has traditionally used a "relational database management system", which represents the various forms of data as tables that relate to each other. One table would hold all customer data. Another would hold all checking accounts, with a reference number back to the customer in the customer table who owns each checking account. Another way to shape data is the "graph", characterized by particular data points having a potentially complex web of relationships. The users of a social network, and who each user is connected to (i.e. is "friends" with), are very appropriately represented as graph data. Of course, one could represent banking data in a graph database and social network data in a relational database management system, but whether it is wise, efficient, and cost effective to do so requires good judgment about data architecture. There are many areas of data science that require such judgments to be made and executed before further study of the data can begin. A data science department must understand how to architect data and must often manage at least some of that architecture itself.
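As a toy illustration of the two shapes, here is a hedged sketch using pandas for the relational view and networkx for the graph view; the column names, values, and library choices are assumptions made purely for illustration.
```python
# Hedged sketch: the same idea of "shape" expressed two ways.
import pandas as pd
import networkx as nx

# Relational shape: customers and checking accounts live in separate tables,
# tied together by a reference (customer_id) back to the customer table.
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ada", "Grace"]})
accounts = pd.DataFrame(
    {"account_id": [10, 11], "customer_id": [1, 2], "balance": [500.0, 1200.0]}
)
print(accounts.merge(customers, on="customer_id"))

# Graph shape: a social network, where the relationships themselves are the data.
friends = nx.Graph()
friends.add_edges_from([("Ada", "Grace"), ("Grace", "Linus"), ("Ada", "Linus")])
print(sorted(friends.neighbors("Ada")))
```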
Lastly, a data science department’s findings are useless if they cannot be communicated to other stakeholders. Other leaders in an organization rely on these truth narratives to make informed decisions. Data science departments are responsible for understanding the consequences of how their work is used, which requires some degree of subject matter expertise in the field the data comes from. That expertise could be related to production, marketing, product, design, or any other function that needs to make business decisions based on an accurate understanding of data. Contextualizing the data and the results drawn from it is equally critical. Subject matter expertise and qualitative understanding of the business problems being solved are the task of the "data analyst" and should also be considered within the realm of data science.
Data science departments need to work closely with software applications, data engineering, IT, product/design, business development, and other departments to ensure that the organization as a whole is making informed decisions based on a correct understanding of the data available. While data science leaders often do come from software engineering, applied statistics, IT, or academic fields like physics, mathematics, and the social sciences, expanding skillsets well beyond one’s original training is necessary to execute the vision and functionality of a data science department as something distinct from those traditional organizational roles.
Data Science Department Responsibilities
If the role of the data science department requires the confluence of a wide variety of existing areas of expertise – along with some new, data science-specific ones – for a mission that is distinct from other departments, the next natural question is: What are the responsibilities and expectations of a data science department?
Let’s begin with what not to do.
There’s a particular mistake in the problem-solving process that I’ve encountered far too frequently not to mention it at the outset of this conversation – one that is often committed both by novice data scientists and by seasoned business executives: starting with the conclusion that ‘we need to be using AI’ and then trying to construct a reason to do so. For the new data scientist just entering the field, there is both the excitement and the professional pressure to be able to say that some advanced machine learning or AI methodology was put into practice. Doing a brilliant job of experimental setup, painstakingly tedious data cleaning, database design and management, and selecting a simple, classic modeling methodology doesn’t deliver the ego and resume-building street cred that a large Word2Vec or other AI model does. Similarly, in many legacy companies that have a lot of data but don’t specialize in its management, there is pressure to be seen as ‘investing in an AI future’ because that has been one of the biggest value-driven buzzwords of the past decade. When approached without care, large sums of capital can quickly be wasted on ineffective applications. Effective problem solving with data requires starting with the business problem to be solved, identifying the relevant data available (or the data that can be collected, if it is not), and then working back to the most appropriate solution for that problem without prematurely biasing oneself toward a particular methodology. It sounds obvious in the abstract, but the pressures to use an AI-based solution even out of context are as common and strong as they are short-sighted. A responsibly run data science department advises against this folly.
Suppose those concerns have been allayed.
The data science department is tasked with determining the data-driven, decision-making strategies that drive the mission of the larger organization in which it resides. Identifying the business problem to be solved and acquiring enough subject matter expertise should come first. Data scientists cannot effectively execute their function by blindly collecting data without an intuitive understanding of its context and turning the crank on some modeling or analysis methodology – another too-common pitfall. We need to be willing to sit with the stakeholders who have subject matter expertise related to the problem and understand the pain points of their process. We are not here to replace them; we are here to help them execute their function more effectively and efficiently. That should be tactfully communicated as well. People skills and emotional intelligence are an absolute necessity, even in a hard science like this.
Relevant data then needs to be identified, both that which already exists and that which needs to be obtained. The nature, context, and shape of that data must be understood and documented. Datasets large and small need to be architected wisely. A few of the countless questions that should be addressed are:
- How will the data be collected and in how many steps will it need to be transformed? That is, design the ETL ("Extract, Transform, Load") process.
- How should the data be stored? Does it need to be in CSV files, a relational (SQL) database, JSON (NoSQL), graph database, time series database, etc.? Do various segments of the data need to be stored in a variety of these shapes?
- If the data is tabular in nature, does it need to be in an SQL database or do CSV files suffice? Cloud hosted SQL databases are vastly more expensive to maintain than CSV files in blob storage. In many circumstances, CSV files are vastly more efficient to process for modeling as well. Cost concerns must not be ignored.
- Does the data contain "personally identifiable information" (PII) that would be covered by the European Union’s General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA)? If so, the legal responsibilities of the data science department managing this data must be addressed. Not doing so can result in fines large enough to sink small to medium sized companies.
- Are there data types that lack proper unique identifiers when one is necessary? (This one has been common at rMark Bio.)
- Are all the values cast to their proper types? Unexpected issues can arise if a value that is expected to be an integer actually arrives as a string: "1" + "2" = "12" instead of 3, if the data scientist is not fastidious (illustrated in the sketch below).
- What common data cleaning issues are present: missing data, incomplete data, data arriving in formats that cannot be processed, or a segment of data simply entered in the wrong column? (Also common.)
- The data should be architected with a candidate methodology or modeling solution in mind.
The most sophisticated modeling or analysis methodology in the world will produce only junk results if junk data is entered. For all these reasons and more, the data science department needs to have the freedom to architect its data to its needs. This does not mean that a data science department cannot or should not be able to work in tandem with an external data engineering organization; this does mean that a data science department needs to have the skillset and resources to architect the data it needs to execute its mission.
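As a concrete taste of two of the questions above – type casting and missing values – here is a minimal pandas sketch; the column names and values are made up for illustration.
```python
# Hedged sketch: the "1" + "2" = "12" problem and a basic missing-data check.
import pandas as pd

raw = pd.DataFrame({
    "account_id": ["1001", "1002", "1003"],
    "amount": ["1", "2", None],  # values arrive as strings, with one missing
})

# Without casting, summing the string column concatenates instead of adding.
print(raw["amount"].dropna().sum())  # -> "12"

clean = raw.copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # proper numeric type
print(clean["amount"].sum(skipna=True))  # -> 3.0

# Surface rows that still need attention (missing or unparseable values).
print(clean[clean["amount"].isna()])
```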
With an architected dataset ready for use, the study of the data can begin. The available analysis toolset of the data scientist is vast and quickly growing. The chaos of the creative process spares no one, and finding truth in data is just that. The creative process needs to be nurtured and respected until a viable solution is found. Some solutions may come in the form of one-off, isolated studies to be delivered as a report. Other times, a solution needs to come in the form of a service to be maintained, and the result of the creative endeavor needs to be cleaned, documented, and reorganized into maintainable software. Creative chaos needs discipline; technical mastery is the necessary bedfellow of artistic value creation. Python notebooks that execute a script in a browser window are very popular tools amongst data scientists and are excellent for pedagogy. Proper and maintainable software, they absolutely are not. Aside from the isolated studies, it is my strong opinion that the product of data science efforts should be well-organized, maintainable, deployable software. Data science departments produce software for the topics that fall within our purview. Like all software, therefore, ours needs to be managed with a robust software development lifecycle.
Lastly, no solution is worth the time, effort, and cost of its labor if it cannot be communicated effectively. Data science that cannot be communicated cannot be utilized. If the central task of the data scientist is to find that truth narrative to tell about a dataset, then effective communication skills, both written and verbal, are critical and non-negotiable.
Of course, data science departments don’t only produce software, and no individual member of a data science department has necessarily mastered every skillset discussed above. Data science departments need the functionality of data engineers and architects, machine learning and AI experts who can deliver their work as robust software, statisticians and mathematicians, data analysts, visualization experts, and effective communicators. A data science department should have individuals who are each expert in one or more of these areas, so that collectively the team covers them all.
What Data Science Can Achieve That Other Overlapping Departments and Roles Cannot
Only in the 21st century, at the dawn of the Information Age, has data become so ubiquitous that a new functional role within organizations has become a necessity. Given that ubiquity, the advent of a new role that must elbow its way into existing organizational structures is unavoidable. The volume, variety, and velocity (the "3 Vs of Big Data") of data need to be efficiently managed and understood. Doing that requires not just a unique combination of technical skillsets, but also a scientific mentality that may not have existed in many organizations in the past. Intuition and instincts, manifested from personal experiences and subjective perceptions of them, are flawed; common sense isn’t that common and is commonly wrong. Data science can implement and enforce an objective evaluation of the reality that an organization is working within, to ensure decision-making is properly informed. That contextualized, data-based perception of reality will also constantly change as situations, and the data they produce, change over time. Well-trained scientific minds with the technical expertise to execute this role are becoming ever more indispensable in a business and economic landscape that is swamped with data.