What is Data Science? What is Machine Learning? A caveman, Isaac Newton and a Data Scientist discuss the essence of learning from data: from learning to make fire in prehistoric times to 21st century data science

Joseph Blitzstein, the famous Harvard professor, described the difference between a novice and an expert in his book on probability:
"the novice struggles to memorize a large number of seemingly disconnected facts and formulas, whereas the expert sees a unified structure in which a few principles and ideas connect these facts coherently."
In other words, the expert can effortlessly see the big picture, the bird’s-eye view of the field. Professor Blitzstein uses the spirit embedded in the quote above to teach probability and statistics at Harvard and to the wider world. This article will, in the same spirit, give you a bird’s-eye view of learning from data and how that relates to data science and machine learning. I will keep the discussion fairly high-level, so anyone who is starting out in this field, or simply curious, will find it useful. Experienced data scientists should also find it useful as a fresh perspective on the essence of data science and machine learning.
To set the stage, let’s start with a story of a cross-generational bonfire meeting that probably did not take place in this universe. The bonfire meeting had three special guests discussing the field of data science. Those three people were:
A caveman from prehistoric times, Sir Isaac Newton (the famous scientist from the 17th century), and a data scientist (from the 21st century). At some point during the meeting, the data scientist started explaining data science to Newton and the caveman:
Data Scientist: Data Science is a vast field but, in a nutshell, it is all about Learning from data using models and communicating those results.
Sir Isaac Newton: This sounds a lot like what we, as scientists do. We observe the world by collecting data, and then we try to explain the underlying phenomenon responsible for the data with a hypothesis. This hypothesis is our model of explaining the phenomenon that we observed. The model then helps us build things further and provide solutions.
Caveman: I thought I was going to be completely out of place, but I’d like to add something. My grandfather told me that our ancestors used to live in dense forests, where they observed wildfires but didn’t know how to make or control fire. We then observed that friction can create fire, and that is how we invented the bow drill (a prehistoric hand-operated tool used to start a fire with friction). I suppose we were doing data science then. We observed (wildfires and friction), formed our model (friction makes fire), and communicated and refined that knowledge to solve a problem (how to make and control fire).
Everybody seemed to have learnt something and had a takeaway message to share:
Caveman: Humans have made great progress, and while they have come up with cool names (science/data science), what they do is essentially the same stuff we used to do (observe, build, communicate).
Newton: Science is well and truly mainstream now, although it goes by new names.
Data Scientist: I never realized that data science is actually as old as humans themselves!
As the prologue attempted to assert, the idea of us humans learning from data is nothing new. We have been doing this since our existence on this planet. We observe (collect data), summarize those observations (with a model or a hypothesis), and then use that model to solve problems (e.g., prediction). Over the years, various disciplines have developed tools and techniques to do precisely that: learn from data (some examples are computer vision, digital signal processing, statistical signal processing, adaptive signal processing, data mining, system identification, pattern recognition, etc.). What all this means is that the body of tools and techniques has been around (and evolving) for a long time, and the field itself (data science or machine learning) is not new. However, unlike the other terms, machine learning (and now data science) has gained widespread popularity. Essentially, each is an umbrella term referring to a wide variety of tools and techniques that help us learn from data.
A good starting point for making sense of the field is a Venn diagram illustrating its interdisciplinary nature (first proposed by Drew Conway; several variants have been put forward over the years):

Two vital components required to learn from data are computer science (called "hacking skills" in Drew Conway's original diagram) and mathematics & statistics. Together, these two fields are the foundation of machine learning, which is concerned with the tools and techniques for learning from data. When knowledge of machine learning is combined with knowledge of a domain to help solve a problem (and the results are communicated as part of the problem-solving pathway), that's data science. To put it in plain English: data science is having enough computer science skills to write and run code that works on data, enough mathematics and statistics knowledge to understand what models to use on the data, and enough domain expertise to ask the right questions and solve problems.
Now that you have an idea of what data science is, a valid question to ask is: why is data science popular now if it has been around for as long as humans have existed?
Two key reasons behind the popularity of data science are computing power and the amount of data we now generate.
We generate an unprecedented amount of data now. According to Forbes, over 90% of the world's data had been generated in just the two preceding years (and that article is now itself two years old!), and the amount of data we generate is only growing by the day. The numbers are mind-boggling. There were an estimated 4.5 billion people online (as of June 2019), with Google, Facebook and YouTube generating vast amounts of data. It is estimated that, by 2025, there will be 75 billion Internet-of-Things (IoT) devices. I don't think I need to convince you that we are now generating a massive, totally unprecedented amount of data. If nothing else, think of the smartphone you use every day (and the fact that you are reading this article via a digital medium, which is generating data about views and reading time as you read this sentence). Now scale that up to every interaction you have over 24 hours, and then to all the humans around, and you will soon see that the numbers are, understandably, mind-boggling.
In addition to data, there has been a phenomenal increase in computational power (falling cost of storage, GPUs, cloud computing, and a concomitant improvement in the availability and accessibility of tools). This is a great resource to check to get a better feel for this exponential progress (and Moore's law in action) through various figures and charts.
Let us dig a bit deeper into the concept of learning from data. One way (and the most popular way) to divide learning types is by the type of feedback available. This gives three types of learning: supervised learning, unsupervised learning and reinforcement learning.
In supervised learning, we know both the input and the output, and we try to find a model that best summarizes the relationship between them. As an example (see the figure below), suppose we want to build a model that can tell us whether a baby is healthy (shown in green) or sick (shown in red) using the baby's heart rate and respiratory rate (i.e. how fast the baby is breathing). We provide all the existing data we have on babies to the learning model, including whether each baby is healthy or sick. The model then uses this data to find a decision boundary. This decision boundary then helps us determine whether a new baby with a known heart rate and respiratory rate (but unknown health status) is sick or healthy.
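The baby example can be sketched in a few lines of code. This is a minimal illustration, not a clinical model: the numbers are invented, and the "learning" here is a simple nearest-centroid classifier, one of the simplest ways to get a (linear) decision boundary from labelled data.

```python
# Hypothetical labelled data: (heart rate bpm, respiratory rate breaths/min) -> label
training_data = [
    ((120, 30), "healthy"),
    ((125, 35), "healthy"),
    ((130, 32), "healthy"),
    ((170, 60), "sick"),
    ((165, 55), "sick"),
    ((175, 65), "sick"),
]

def centroid(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

# "Learning": summarize each class by the centroid of its examples.
centroids = {}
for label in ("healthy", "sick"):
    centroids[label] = centroid([x for x, y in training_data if y == label])

def predict(x):
    """Assign the class whose centroid is nearest; the implied
    decision boundary is the line midway between the two centroids."""
    return min(centroids, key=lambda lab: (x[0] - centroids[lab][0]) ** 2
                                        + (x[1] - centroids[lab][1]) ** 2)

print(predict((128, 33)))  # a new baby near the healthy cluster
print(predict((172, 62)))  # a new baby near the sick cluster
```

In practice you would reach for a library classifier (logistic regression, decision trees, etc.), but the workflow is the same: labelled examples in, decision boundary out, predictions for new data.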

In unsupervised learning, we only have the input data to use for learning and no explicit feedback is provided during the learning stage. The most common application of unsupervised learning is clustering. As an example, suppose we have height and weight information from a large number of people and we want to divide them into distinct groups. An unsupervised learning approach is what we will use to find natural groupings in data.
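The height/weight grouping can be sketched with k-means, the classic clustering algorithm. The data below is invented for illustration, and this is a bare-bones version of the algorithm (fixed starting centroids, fixed number of iterations) rather than a production implementation:

```python
# Hypothetical (height cm, weight kg) measurements: no labels are given.
data = [(150, 50), (152, 53), (155, 55),
        (180, 85), (183, 88), (185, 90)]

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# k-means with k = 2: start with two data points as centroids, then
# alternate between assigning each point to its nearest centroid and
# recomputing each centroid as the mean of its assigned points.
centroids = [data[0], data[-1]]
for _ in range(10):
    clusters = [[], []]
    for p in data:
        idx = 0 if dist2(p, centroids[0]) < dist2(p, centroids[1]) else 1
        clusters[idx].append(p)
    centroids = [mean(c) for c in clusters]

print(clusters)  # the two natural groupings found in the data
```

Note that no "right answer" was ever supplied: the algorithm discovers the two groups purely from the structure of the inputs, which is exactly what distinguishes unsupervised from supervised learning.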

In reinforcement learning, feedback is provided in the form of rewards or punishments, and an agent learns to find the most appropriate action to take. The agent is the learning system; it exists in an environment with which it interacts by taking actions, being either rewarded or punished for them.

Before I conclude, a note about the learning approach I have outlined: learning a model (or a function) from specific data points is inductive learning (a bottom-up approach). A different approach is deductive learning (going from models to data). Dividing learning into supervised, unsupervised and reinforcement learning is the most common approach in the machine learning community when introducing the field. In practice, though, an iterative approach is typical: you build a model, see how it performs, and then make changes based on the feedback (see my Top Tips for Consultancy in Data Science article for the more practical side of implementing a data science project). Effectively, this combines a bottom-up approach (inductive learning) with a top-down approach (deductive learning). Data science in practice is an iterative process and consequently uses both approaches when solving problems.
To conclude: this article has asserted that data science is not a new field but (almost) as old as humans ourselves, given the two most important reasons why the field has become widely popular now (computing power and the amount of data we generate, both unprecedented in our history), and given a bird’s-eye view of the types of learning from data (supervised, unsupervised and reinforcement learning). I hope you had fun reading this article :).