Is Data Science Really a “Science”?

Why data science might be a rigorous field distinct from its predecessors

Brian Godsey
Towards Data Science
Nov 8, 2018


I continue to see discussions about whether or not data science is really a “science” and also whether or not the field is “new” at all. A while back, I racked my brain to think of good arguments in favor of data science being both a “science” and “new”. I can’t claim to be 100% conclusive on those points, but there are some definitively unique aspects of data science, which I describe below, excerpted from Think Like a Data Scientist.

Data as the object of study

In recent years, there has been a seemingly never-ending discussion about whether the field of data science is merely a reincarnation or an offshoot — in the Big Data Age — of any of a number of older fields that combine software engineering and data analysis: operations research, decision sciences, analytics, data mining, mathematical modeling, or applied statistics, for example. As with any trendy term or topic, the discussion over its definition and concept will cease only when the popularity of the term dies down. I don’t think I can define data science any better than many of those who have done so before me, so let a definition from Wikipedia, paraphrased, suffice:

Data science is the extraction of knowledge from data.

Simple enough, but that description doesn’t distinguish data science from the many other similar terms, except perhaps to claim that data science is an umbrella term for the whole lot. On the other hand, this era of data science has a property that no previous era had, and it is, to me, a fairly compelling reason to apply a new term to what data scientists do that applied statisticians and data-oriented software engineers before them did not.

The users of computers and the internet became data generators

Throughout recent history, computers have made incredible advances in computational power, storage, and general capacity to accomplish previously unheard-of tasks. Every generation since the invention of the modern computer nearly a century ago has seen ever-shrinking machines that are orders of magnitude more powerful than the most powerful supercomputers of the previous generation. The time period including the second half of the twentieth century through the beginning of the twenty-first, and including the present day, is often referred to as the Information Age. The Information Age, characterized by the rise to ubiquity of computers and then the internet, can be divided into several smaller shifts that relate to analysis of data.

First, early computers were used mainly to make calculations that previously took an unreasonable amount of time. Cracking military codes, navigating ships, and performing simulations in applied physics were among the computationally intensive tasks that were performed by early computers.

Second, people began using computers to communicate, and the internet developed in size and capacity. It became possible for data and results to be sent easily across a large distance. This enabled a data analyst to amass larger and more varied data sets in one place for study. Internet access for the average person in a developed country increased dramatically in the 1990s, giving hundreds of millions of people access to published information and data.

Third, whereas early use of the internet by the populace consisted mainly of consuming published content and communicating with other people, soon the owners of many websites and applications realized that the aggregated actions of their users provided valuable insight into the success of their own products and sometimes into human behavior in general. These sites began to collect user data in the form of clicks, typed text, site visits, and any other actions a user might take. Users began to produce more data than they consumed.

Fourth, the advent of mobile devices and smartphones that are connected to the internet made possible an enormous advance in the amount and specificity of user data being collected. At any given moment, your mobile device is capable of recording and transmitting every bit of information that its sensors can collect (location, movement, camera image and video, sound, and so on) as well as every action that you take deliberately while using the device. This can potentially be a huge amount of information, if you enable or allow its collection.

Fifth — though this isn’t necessarily subsequent to the advent of personal mobile devices — is the inclusion of data collection and internet connectivity in almost everything electronic. Often referred to as the Internet of Things (IoT), these connected devices can include everything from your car to your wristwatch to the weather sensor on top of your office building. Certainly, collecting and transmitting information from devices began well before the twenty-first century, but its ubiquity is relatively new, as is the availability of the resultant data on the internet in various forms, processed or raw, for free or for sale.

Through these stages of growth of computing devices and the internet, the online world became not merely a place for consuming information but a data-collection tool in and of itself. A friend of mine in high school in the late 1990s set up a website offering electronic greeting cards as a front for collecting email addresses. He sold the resulting list of millions of email addresses for a few hundred thousand dollars. This is a primitive example of the value of user data for purposes completely unrelated to the website itself and a perfect example of something I’m sorry to have missed out on in my youth. By the early 2000s, similar-sized collections of email addresses were no longer worth nearly this much money, but other sorts of user data became highly desirable and could likewise fetch high prices.

Data for its own sake

As people and businesses realized that user data could be sold for considerable sums of money, they began to collect it indiscriminately.

Very large quantities of data began to pile up in data stores everywhere. Online retailers began to store not only everything you bought but also every item you viewed and every link you clicked. Video games stored every step your avatar ever took and which opponents it vanquished.

Various social networks stored everything you and your friends ever did. The purpose of collecting all of this data wasn’t always to sell it, though that happens frequently. Because virtually every major website and application uses its own data to optimize the experience and effectiveness of users, site and app publishers are typically torn between the value of the data as something that can be sold and the value of the data when held and used internally. Many publishers are afraid to sell their data because that opens up the possibility that someone else will figure out something lucrative to do with it. Many of them keep their data to themselves, hoarding it for the future, when they supposedly will have enough time to wring all value out of it.

Internet juggernauts Facebook and Amazon collect vast amounts of data every minute of every day, but in my estimation, the data they possess is largely unexploited. Facebook is focused on marketing and advertising revenue, even though it holds one of the largest data sets on human behavior from all around the world. Product designers, marketers, social engineers, and sociologists alike could probably make great advances in their fields, both academic and industrial, if they had access to Facebook’s data. Amazon, in turn, has data that could probably upend many beloved economic principles and create several new ones if it were turned over to academic institutions. Or it might be able to change the way retail, manufacturing, and logistics work throughout the entire industry.

These internet behemoths know that their data is valuable, and they’re confident that no one else possesses similar data sets of anywhere near the same size or quality. Innumerable companies would gladly pay top dollar for access to the data, but Facebook and Amazon have — I surmise — aspirations of their own to use their data to its fullest extent and therefore don’t want anyone else to grab the resulting profits. If these companies had unlimited resources, surely they would try to wring every dollar out of every byte of data. But no matter how large and powerful they are, they’re still limited in resources, and they’re forced to focus on the uses of the data that affect their bottom lines most directly, to the exclusion of some otherwise valuable efforts.

On the other hand, some companies have elected to provide access to their data. Twitter is a notable example. For a fee, you can access the full stream of data on the Twitter platform and use it in your own project. An entire industry has developed around brokering the sale of data, for profit. A prominent example of this is the market of data from various major stock exchanges, which has long been available for purchase.

Academic and nonprofit organizations often make data sets available publicly and for free, but there may be limitations on how you can use them. Because of the disparity of data sets even within a single scientific field, there has been a trend toward consolidation of both location and format of data sets. Several major fields have created organizations whose sole purpose is to maintain databases containing as many data sets as possible from that field. It’s often a requirement that authors of scientific articles submit their data to one of these canonical data repositories prior to publication of their work.

In whichever form, data is now ubiquitous, and rather than being merely a tool that analysts might use to draw conclusions, it has become a resource valued in its own right. Companies now seem to collect data as an end, not a means, though many of them claim to be planning to use the data in the future. Independent of other defining characteristics of the Information Age, data has gained its own role, its own organizations, and its own value.

Data scientist as explorer

In the twenty-first century, data is being collected at unprecedented rates, and in many cases it’s not being collected for a specific purpose. Whether private, public, for free, for sale, structured, unstructured, big, normal-sized, social, scientific, passive, active, or any other type, data sets are accumulating everywhere. Whereas for centuries data analysts collected their own data or were given a data set to work on, for the first time in history many people across many industries are collecting data first and then asking, “What can I do with this?” Still others are asking, “Does the data already exist that can solve my problem?”

In this way data — all data everywhere, as a hypothetical aggregation — has become an entity worthy of study and exploration. In years past, data sets were usually collected deliberately, so that they represented some intentional measurement of the real world. But more recently the internet, ubiquitous electronic devices, and a latent fear of missing out on hidden value in data have led us to collect as much data as possible, often on the loose premise that we might use it later.

Figure 3.2 shows an interpretation of four major innovation types in computing history: computing power itself, networking and communication between computers, collection and use of big data, and rigorous statistical analysis of that big data. By big data, I mean merely the recent movement to capture, organize, and use any and all data possible. Each of these computing innovations begins with a problem that begs to be addressed and then goes through four phases of development, in a process that’s similar to the technological surge cycle of Carlota Perez (Technological Revolutions and Financial Capital, Edward Elgar Publishing, 2002) but with a focus on computing innovation and its effect on computer users and the general public.

For each innovation included in the figure, there are five stages:

  1. Problem — There is a problem that computers can address in some way.
  2. Invention — The computing technology that can address that problem is created.
  3. Proof/recognition — Someone uses the computing technology in a meaningful way, and its value is proven or at least recognized by some experts.
  4. Adoption — The newly proven technology is widely put to use in industry.
  5. Refinement — People develop new versions, more capabilities, higher efficiency, integrations with other tools, and so on.

Because we’re currently in the refinement phase of big data collection and the widespread adoption phase of statistical analysis of that data, we’ve created an entire data ecosystem in which the knowledge extracted so far is only a very small portion of the total knowledge contained. Not only has much of the knowledge not been extracted yet, but in many cases no one fully understands a data set’s extent and properties except perhaps the few software engineers who set up the system, and those few are usually too busy or too specialized to make use of it. To me, the aggregation of all of this underutilized or poorly understood data is like an entirely new continent with many undiscovered species of plants and animals, some entirely unfamiliar organisms, and possibly a few legacy structures left by civilizations long departed.

There are exceptions to this characterization. Google, Amazon, Facebook, and Twitter are good examples of companies that are ahead of the curve. They are, in some cases, engaging in behavior that matches a later stage of innovation. For example, by allowing access to its entire data set (often for a fee), Twitter seems to be operating within the refinement stage of big data collection and use. People everywhere are trying to squeeze every last bit of knowledge out of users’ tweets. Likewise, Google seems to be doing a good job of analyzing its data in a rigorous statistical manner. Its work on search-by-image, Google Analytics, and even its basic text search are good examples of solid statistics on a large scale. One can easily argue that Google has a long way to go, however.

If today’s ecosystem of data is like a largely unexplored continent, then the data scientist is its explorer. Much like famous early European explorers of the Americas or Pacific islands, a good explorer is skilled at several things:

  • Accessing interesting areas
  • Recognizing new and interesting things
  • Recognizing the signs that something interesting might be close
  • Handling things that are new, unfamiliar, or sensitive
  • Evaluating new and unfamiliar things
  • Drawing connections between familiar things and unfamiliar things
  • Avoiding pitfalls

An explorer of a jungle in South America may have used a machete to chop through the jungle brush, stumbled across a few loose-cut stones, deduced that a millennium-old temple was nearby, found the temple, and then learned from the ruins about the religious rituals of the ancient tribe.

A data scientist might hack together a script that pulls some social networking data from a public API, realize that a few people compose major hubs of social activity, discover that those people often mention a new photo-sharing app in their posts on the social network, pull more data from the photo-sharing app’s public API, and in combining the two data sets with some statistical analysis learn about the behavior of network influencers in online communities.
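
To make that concrete, here is a minimal sketch of such a script in Python. The API endpoints, response fields, and the cutoff of twenty hubs are all hypothetical stand-ins of my choosing, not real services; the point is the shape of the workflow: pull data, count mentions to find hubs, then cross-reference a second source.

    import requests
    from collections import Counter

    # Hypothetical endpoints -- placeholders, not real APIs.
    SOCIAL_API = "https://api.social.example.com/v1/posts"
    PHOTO_API = "https://api.photos.example.com/v1/activity"

    def fetch_json(url, params=None):
        # Fetch one page of JSON results, failing loudly on HTTP errors.
        resp = requests.get(url, params=params, timeout=10)
        resp.raise_for_status()
        return resp.json()

    # Pull recent posts and count how often each user is mentioned.
    posts = fetch_json(SOCIAL_API, params={"limit": 1000})
    mentions = Counter(
        user for post in posts for user in post.get("mentions", [])
    )

    # The most-mentioned accounts are candidate hubs of social activity.
    hubs = [user for user, _ in mentions.most_common(20)]

    # Cross-reference each hub against the photo-sharing app's public data.
    for user in hubs:
        activity = fetch_json(PHOTO_API, params={"user": user})
        print(user, len(activity))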

Both cases derive previously unknown information about how a society operates.

Like an explorer, a modern data scientist typically must survey the landscape, take careful note of surroundings, wander around a bit, and dive into some unfamiliar territory to see what happens. When they find something interesting, they must examine it, figure out what it can do, learn from it, and be able to apply that knowledge in the future. Although analyzing data isn’t a new field, the existence of data everywhere — often regardless of whether anyone is making use of it — enables us to apply the scientific method to discovery and analysis of a pre-existing world of data. This, to me, is the differentiator between data science and all of its predecessors. There’s so much data that no one can possibly understand it all, so we treat it as a world unto itself, worthy of exploration.

This idea of data as a wilderness is one of the most compelling reasons for using the term data science instead of any of its counterparts. To get real truth and useful answers from data, we must use the scientific method, or in our case, the data scientific method, sketched in code after the steps below:

  1. Ask a question.
  2. State a hypothesis about the answer to the question.
  3. Make a testable prediction that would provide evidence in favor of the hypothesis if correct.
  4. Test the prediction via an experiment involving data.
  5. Draw the appropriate conclusions through analyses of experimental results.
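
As a minimal illustration of those five steps, here is a hedged sketch in Python. The question, the data, and the significance threshold are placeholders of my choosing; in practice the sample arrays would come from a real data store rather than a random number generator.

    import numpy as np
    from scipy import stats

    # 1. Question: do weekend posts receive more likes than weekday posts?
    # 2. Hypothesis: weekend posts have a higher mean like count.
    # 3. Testable prediction: in a sample of posts, the weekend mean exceeds
    #    the weekday mean by more than chance alone would explain.

    # 4. Experiment: placeholder samples stand in for real like counts.
    rng = np.random.default_rng(0)
    weekend = rng.poisson(lam=12, size=500)
    weekday = rng.poisson(lam=10, size=500)
    t_stat, p_value = stats.ttest_ind(weekend, weekday, alternative="greater")

    # 5. Conclusion: favor the hypothesis only if the evidence is strong.
    if p_value < 0.05:
        print(f"Evidence favors the hypothesis (p = {p_value:.4f})")
    else:
        print(f"Insufficient evidence (p = {p_value:.4f})")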

In this way, data scientists are merely doing what scientists have been doing for centuries, albeit in a digital world. Today, some of our greatest explorers spend their time in virtual worlds, and we can gain powerful knowledge without ever leaving our computers.

Brian Godsey, Ph.D., is a mathematician, entrepreneur, investor, and data scientist, whose book Think Like a Data Scientist is available in print and eBook now. — briangodsey.com

For more, download the free first chapter of Think Like a Data Scientist and see this Slideshare presentation for more info and a discount code.

