
The Data Science Landscape

An attempt to provide structure and reference points in a complex field

Getting Started

Photo by Shahadat Rahman on Unsplash
1. Introduction

Data is the new oil in the 21st century – the information age.

This expression encapsulates the fact that extracting insight from data has become essential for most businesses. This trend is the underlying driver for the rapid growth of data science.

Yet, there is still a lot of uncertainty regarding the individual disciplines and jargon applied in the field. Dealing with data science-related issues might be daunting, especially for non-technical executives. This brief article attempts to shed some light on the field of data science, its disciplines and provide some structure and reference points.

2. The Data Science Landscape

Data science is part of the computer sciences [1]. It comprises the disciplines of i) analytics, ii) statistics and iii) machine learning.

The Data Science Landscape – Source: Own Illustration

2.1. Analytics

Analytics generates insights from data through simple presentation, manipulation, calculation or visualization. In the context of data science, it is also sometimes referred to as exploratory data analysis. It often serves to familiarize oneself with the subject matter and to obtain initial hints for further analysis. To this end, analytics is often used to formulate appropriate questions for a data science project.

The limitation of analytics is that it does not necessarily provide any conclusive evidence for a cause-and-effect relationship. Also, the analytics process is typically a manual and time-consuming process conducted by a human with limited opportunity for automation. In today’s business world, many corporations do not go beyond descriptive analytics, even though more sophisticated analytical disciplines can offer much greater value, such as those laid out in the analytic value escalator.
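As a minimal illustration, exploratory analytics can be as simple as summarizing and aggregating a dataset. The sketch below uses pandas on a small hypothetical revenue table; the column names and figures are invented for illustration only.

```python
# Minimal exploratory-analytics sketch with pandas (hypothetical data).
import pandas as pd

# Hypothetical dataset: monthly revenue by region
df = pd.DataFrame({
    "region":  ["North", "South", "North", "South", "North", "South"],
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "revenue": [120, 95, 130, 90, 150, 110],
})

# Descriptive summary: count, mean, spread, quartiles
print(df["revenue"].describe())

# Simple manipulation: total revenue per region
per_region = df.groupby("region")["revenue"].sum()
print(per_region)
```

Even such a basic summary often surfaces the first questions ("Why does the North region grow while the South stagnates?") that a subsequent statistical or ML analysis can then answer.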

2.2. Statistics

In many instances, analytics may be sufficient to address a given problem. In other instances, the issue is more complex and requires a more sophisticated approach to provide an answer, especially if there is a high-stakes decision to be made under uncertainty. This is when statistics comes into play. Statistics provides a methodological approach to answer questions raised by the analysts with a certain level of confidence.

Analysts help you arrive at good questions, whereas statisticians bring you good answers. Statisticians bring rigor to the table.

Sometimes simple descriptive statistics are sufficient to provide the necessary insight. Yet, on other occasions, more sophisticated inferential statistics – such as regression analysis – are required to reveal relationships between cause and effect for a certain phenomenon [2]. The limitation of statistics is that it is traditionally conducted with software packages, such as SPSS and SAS, which require a distinct calculation for a specific problem by a statistician or trained professional. The degree of automation is rather limited.
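A small sketch of what such an inferential step can look like in code: an ordinary least-squares regression testing whether one variable predicts another, with a p-value as the measure of confidence. The data below are hypothetical and SciPy is assumed to be available.

```python
# Inferential-statistics sketch: does advertising spend predict sales?
# (Hypothetical data; scipy assumed available.)
from scipy import stats

ad_spend = [10, 20, 30, 40, 50]    # hypothetical input variable
sales    = [25, 44, 68, 85, 106]   # hypothetical outcome variable

result = stats.linregress(ad_spend, sales)
print(f"slope={result.slope:.2f}, p-value={result.pvalue:.4f}")
# A small p-value suggests the observed relationship is
# unlikely to be due to chance alone.
```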

2.3. Machine Learning

Artificial intelligence refers to the broad idea that machines can perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making and translation between languages. In the context of data science, machine learning can be considered a sub-field of artificial intelligence that is concerned with decision-making. In fact, in its most essential form, machine learning is decision-making at scale. Machine learning is the field of study of computer algorithms that allow computer programs to identify and extract patterns from data. A common purpose of machine learning algorithms is therefore to generalize and learn from data in order to perform certain tasks [3].

In traditional programming, input data is applied to a model and a computer in order to achieve a desired output. In machine learning, an algorithm is applied to input and output data in order to identify the most suitable model. Machine Learning can thus be complementary to traditional programming as it can provide a useful model to explain a phenomenon.

Traditional Programming vs. Machine Learning – Source: Own illustration adapted from Prince Barpaga
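The contrast can be sketched in a few lines of code: in the traditional approach a human writes the rule, while in the ML approach the rule is inferred from input/output pairs. The Celsius-to-Fahrenheit example below is an illustrative toy case, assuming scikit-learn is available.

```python
# Traditional programming vs. machine learning, in miniature (toy example).
from sklearn.linear_model import LinearRegression

# Traditional programming: the model (rule) is written by a human.
def fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Machine learning: the model is inferred from input and output data.
X = [[0], [10], [20], [30]]        # inputs (Celsius)
y = [32.0, 50.0, 68.0, 86.0]       # observed outputs (Fahrenheit)
model = LinearRegression().fit(X, y)

print(fahrenheit(25))              # answer from the hand-written rule
print(model.predict([[25]])[0])    # answer from the learned model
```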

2.4. Machine Learning vs. Data Mining

The terms machine learning and data mining are closely related and often used interchangeably. Data mining is a concept that pre-dates the current field of machine learning. The idea of data mining – also referred to as Knowledge Discovery in Databases (KDD) in the academic context – emerged in the late 1980s and early 1990s when the need for analyzing large datasets became apparent [3]. Essentially, data mining refers to a structured way of extracting insight from data which draws on machine learning algorithms. The main difference lies in the fact that data mining is a rather manual process that requires human intervention and decision making, while machine learning – apart from the initial setup and fine-tuning – runs largely independently [4].

2.5. Organizing the World of Machine Learning

The world of machine learning is complex and difficult to grasp at first. The degree of supervision and the type of ML problem are two particularly useful dimensions for providing structure.

2.5.1. Supervised and Unsupervised Learning

The majority of machine learning algorithms can be categorized into supervised and unsupervised learning. The main distinction is that supervised learning is conducted on data which includes both the input and the output data, often referred to as "labeled data", where the label is the target attribute. The algorithm can therefore validate its model by checking against the correct output value. Typical supervised machine learning tasks are regression and classification. Conversely, in unsupervised machine learning, the dataset does not include the target attribute; the data is thus unlabeled. The most common type of unsupervised learning is cluster analysis [3].
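The distinction can be made concrete with a toy example: the same four data points, once with labels (supervised classification) and once without (unsupervised clustering). The data are invented and scikit-learn is assumed to be available.

```python
# Supervised vs. unsupervised learning in miniature (toy data).
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [8, 9]]   # input attributes

# Supervised: labels (the target attribute) accompany the inputs.
y = [0, 0, 1, 1]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[2, 2]]))           # checked against known labels

# Unsupervised: no labels; the algorithm groups similar points itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                      # two discovered clusters
```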

Other than the main streams of supervised and unsupervised ML algorithms, there are additional variations, such as semi-supervised and reinforcement learning algorithms. In semi-supervised learning a small amount of labeled data is used to bolster a larger set of unlabeled data. Reinforcement learning trains an algorithm with a reward system, providing feedback when an artificial intelligence agent performs the best action in a particular situation [5].

2.5.2. Types of ML Problems – Regression, Classification and Clustering

In order to structure the field of machine learning, the vast number of ML algorithms are often grouped by similarity in terms of their function (how they work), e.g. tree-based and neural network-inspired methods. Given the large number of different algorithms, this approach is rather complex. Instead, it is considered more useful to group ML algorithms by the type of problem they are supposed to solve. The most common types of ML problems are regression, classification and clustering. There are numerous specific ML algorithms, most of which come with a lot of different variations to address these problems. Some algorithms are capable of solving more than one problem.

2.5.2.1. Regression

Regression is a supervised ML approach used to predict a continuous value. The outcome of a regression analysis is a formula (or model) that describes the relationship between one or more independent variables and a dependent target value. There are many different types of regression models, such as linear regression, logistic regression, ridge regression, lasso regression and polynomial regression. However, by far the most popular model for making predictions is the linear regression model. The basic formula for a univariate linear regression model is shown below:

Linear Regression Formula – Source: Own illustration adapted from RPubs
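The univariate model is y = b0 + b1·x (plus an error term), where b0 is the intercept and b1 the slope. A minimal sketch of recovering these two coefficients from data with NumPy's least-squares fit, using generated toy values:

```python
# Univariate linear regression sketch: recover b0 and b1 from toy data.
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = 3.0 + 2.0 * x                  # generated with intercept b0=3, slope b1=2

b1, b0 = np.polyfit(x, y, deg=1)   # least-squares fit of a degree-1 polynomial
print(b0, b1)                      # recovers the coefficients used above
```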

Other regression models, although they share some resemblance to linear regression, are better suited for classification, such as logistic regression [1]. Regression problems, i.e. forecasting or predicting a numerical value, can also be solved by artificial neural networks, which are inspired by the structure and/or function of biological neural networks. They form an enormous subfield comprising hundreds of algorithms and variations, commonly used for regression and classification problems. A neural network is favored over regression models when there is a large number of variables. Like artificial neural networks, the k-nearest neighbor algorithm can handle both regression and classification tasks.
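As a brief illustration of k-nearest neighbor used for regression: the prediction for a new point is simply the average target value of its k nearest training points. The data below are toy values and scikit-learn is assumed to be available.

```python
# k-nearest neighbor applied to a regression task (toy data).
from sklearn.neighbors import KNeighborsRegressor

X = [[1], [2], [3], [10], [11], [12]]
y = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
# Prediction = mean target value of the 3 nearest neighbors
print(knn.predict([[2]]))
```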

2.5.2.2. Classification

Classification is the task of predicting a value for a target attribute of an instance based on the values of a set of input attributes, where the target attribute is a nominal or ordinal data type. Therefore, while regression is usually used for numerical data, classification is used for making predictions on non-numerical data. Decision trees are among the most popular classification algorithms. Other algorithms include artificial neural networks, k-nearest neighbor and support vector machines. Neural networks that consist of multiple hidden layers are referred to as deep learning models [3].

Deep Learning Model – Source: Researchgate
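A decision tree classifier can be sketched in a few lines. The features and labels below are hypothetical (hours studied and hours slept predicting pass/fail) and scikit-learn is assumed to be available.

```python
# Decision-tree classification sketch (hypothetical toy data).
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [hours_studied, hours_slept] -> fail (0) / pass (1)
X = [[1, 4], [2, 5], [8, 7], [9, 8]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.predict([[7, 7]]))   # classifies a new, unseen instance
```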

2.5.2.3. Clustering

Cluster analysis, or clustering, is an unsupervised machine learning task. It involves automatically discovering natural patterns in unlabeled data. Unlike supervised learning, clustering algorithms only analyze input data, with the objective of identifying data points that share similar attributes. K-means clustering is the most commonly used clustering algorithm. It is a centroid-based algorithm and the simplest unsupervised learning algorithm. It tries to minimize the variance of data points within a cluster.

Clustering Model – Source: Adapted from Luigi Fiori
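The centroid-based idea can be shown directly: K-means places one centroid per cluster and reports the within-cluster variance it minimizes as "inertia". The points below are toy data and scikit-learn is assumed to be available.

```python
# K-means sketch: centroids and within-cluster variance (toy data).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [8, 8], [8.5, 9], [1, 0.5], [9, 8]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centroid per discovered group
print(km.inertia_)           # the within-cluster variance K-means minimizes
```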
3. The Data Science Toolkit

Data scientists use a wide variety of tools. In the business context, spreadsheets are still very dominant. For exploratory data analysis, visualization tools – such as Tableau and Microsoft Power BI – are useful to get an understanding and visual impression of the data. For statistics, there are a number of established statistical packages, such as SAS and SPSS. Machine learning is usually conducted using programming languages. The most popular languages for machine learning are Python, C/C++, Java, R and JavaScript. Most of the above-mentioned tools can be used for a large variety of data science-related tasks. The R programming language, for example, was built primarily for statistical applications. It is therefore highly suitable for statistical tasks as well as visualization using the popular R package ggplot2.

4. The Data Science Process

The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process model with six phases that naturally describes the data science life cycle. It is a framework to plan, organize and implement a data science project.

It consists of the following steps:

  • Business understanding – What does the business need?
  • Data understanding – What data do we have / need? Is it clean?
  • Data preparation – How do we organize the data for modeling?
  • Modeling – What modeling techniques should we apply?
  • Evaluation – Which model best meets the business objectives?
  • Deployment – How do stakeholders access the results?
The CRISP-DM Process – Source: Own Illustration adapted from Datascience-PM

Conceived in 1996, it became a standard methodology across industries for how to best conduct data science projects. The CRISP-DM process is not linear, but rather iterative. It evaluates all aspects of a data science project and thus significantly improves the chances of successful completion. Most project managers and data scientists therefore adopt this methodology [6].

5. Principles of Success

In closing, there are several considerations which determine whether or not a data science project will be successful. First, at the initial stage, it is paramount that the underlying business problem is clear to all stakeholders involved. Second, sufficient time has to be allocated for the data preparation stage, which typically accounts for the majority of time spent during most projects. Third, the right variables have to be selected by the data scientist. A model should ideally comprise the smallest possible number of variables with relevant explanatory power. The process of feature selection is therefore important in order to maximize performance while reducing noise in the model.

"Irrelevant or partially relevant features can negatively impact model performance".

Fourth, over- and underfitting of the model should be avoided as underfitting leads to generally poor performance and high prediction error while overfitting leads to poor generalization and high model complexity. Lastly, the result of the data science project must be communicated in a way that non-technical people can understand. A suitable way to communicate data is to use visualization techniques. In the business context, a good reference for presenting data is the International Business Communication Standards (IBCS).
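The feature-selection step mentioned above can be sketched briefly: scikit-learn's SelectKBest scores each candidate variable and keeps only the most informative ones. The data below are synthetic, generated so that only two of five features actually drive the target.

```python
# Feature-selection sketch: keep only informative variables (synthetic data).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # five candidate features
# Only features 0 and 3 influence the target; the rest are noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

selector = SelectKBest(f_regression, k=2).fit(X, y)
print(selector.get_support())      # boolean mask of the selected features
```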

6. Summary

Data science is a complex and quickly developing field with a unique jargon. This contribution attempts to shed some light on the terminology, the individual disciplines as well as the data science process. Guidance for further reading is provided by Przemek Chojecki as well as Claire D. Costa.

Literature

[1] O. Theobald, Machine Learning For Absolute Beginners: A Plain English Introduction (2018), Independently Published

[2] D. Spiegelhalter, The Art of Statistics – Learning from Data (2019), Penguin

[3] J. Kelleher and B. Tierney, Data Science (2018), The MIT Press Essential Knowledge series

[4] Juhi Ramzai, Clearly Explained: How Machine learning is different from Data Mining (2020), Towards Data Science

[5] Isha Salian, SuperVize Me: What’s the Difference Between Supervised, Unsupervised, Semi-Supervised and Reinforcement Learning? (2018), Nvidia Blog

[6] Israel Rodriguez, CRISP-DM methodology leader in data mining and big data (2020), Towards Data Science

