Image by Gerd Altmann from Pixabay

Data Science — What does it really mean?

Jason Tam
Towards Data Science


The term “Data Science” emerged around 10 years ago and has since evolved into one of the biggest buzzwords in the world today. From academia to government organizations to companies in almost every sector, ever more effort is being invested in making the most of it, which has consequently created many new job titles such as Data Scientist and Data Engineer.

So what does this term represent? And what makes it so special?

While there are general definitions of the responsibilities that such roles cover (such as the ones outlined in 10 Different Data Science Job Titles and What They Mean), it is not unusual to find that the needs of different markets have led to some variance in these definitions, and perhaps even to different answers to these two questions. I will attempt to provide my understanding of it here, which may hopefully help others gain more insight into the big picture.

What is Data Science?

Data Science is, as the term suggests, the science of data. What falls under this definition evidently changes as technology advances, as the last decade or two have demonstrated. Under the current technology framework, it includes everything from data collection, to all methods of storage and analysis, all the way through to the various channels that make use of the results. This type of data-processing chain is now commonly referred to as a Data Pipeline.

Data Pipelines

A Data Pipeline commonly represents an ordered collection of data-manipulating components, all the way from collecting data to presenting the useful information extracted from it, and everything in between. Acting like a production chain, each of these components is responsible for processing the provided input data in a specific way, and the resulting output becomes the input for the next component. While the term usually refers to a long-term implementation of components intended to automate the process as much as possible, ad hoc analyses usually follow a very similar approach. This processing chain can effectively be divided into four main categories, which are described in the sections below.
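Before going through the four categories one by one, here is a minimal toy sketch of the chaining idea in Python. The stage names, the record format and the “model” are invented purely for illustration and are not meant to represent any particular production setup.

```python
# Each stage is a function that takes the previous stage's output as its input.
# Everything here (records, cleaning rule, "model") is purely illustrative.

def collect():
    # Pretend this pulls raw records from an API or a set of sensors.
    return [{"reading": " 21.5 "}, {"reading": "bad"}, {"reading": "19.0"}]

def clean(raw_records):
    # Keep only records whose reading parses as a number.
    cleaned = []
    for record in raw_records:
        try:
            cleaned.append(float(record["reading"]))
        except ValueError:
            continue
    return cleaned

def analyse(values):
    # A trivial "model": report the average reading.
    return sum(values) / len(values) if values else None

def present(result):
    print(f"Average reading: {result:.1f}")

# Running the chain end to end, output of one stage feeding the next.
present(analyse(clean(collect())))
```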

Data Collection

This first (practical) stage of the process prepares the data for further analysis. There is usually a brainstorming stage before this, in which the question or problem guides what kind of data needs to be collected. The collection itself is often automated, whether the data comes from web scraping or from remote sensors over a mobile network. Depending on the method, this may involve coding web crawlers, or scripts that collect from an API data source.
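As a rough illustration of what such a collection script might look like, the sketch below pulls records from a hypothetical JSON API using the requests library. The URL and the response structure are assumptions on my part; a real crawler or sensor feed would need authentication, scheduling and error handling on top of this.

```python
# A sketch of automated collection from a web API.
# The URL and response format are hypothetical.
import requests

def fetch_records(url: str) -> list:
    response = requests.get(url, timeout=10)
    response.raise_for_status()   # fail loudly on HTTP errors
    return response.json()        # assume the API returns a JSON list

if __name__ == "__main__":
    records = fetch_records("https://example.com/api/measurements")
    print(f"Collected {len(records)} records")
```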

Data Storage

The collected data is then channeled into some sort of data storage platform, which can be on the cloud or in local databases. This channeling often involves data format conversions (sometimes included under the Data Engineering definition), so that the result is compatible with storage that is usually optimised for efficient data extraction. The structure of such data storage platforms is often specifically designed for the associated applications, and usually falls under the responsibility of Data Engineers, whose duties also include writing (often SQL) functions that extract data optimally for the further analysis desired.
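To give a flavour of this, here is a small sketch using SQLite as a stand-in for whatever database or cloud store a real project would use. The table layout and the extraction query are illustrative assumptions only.

```python
# Store a few records, then extract an analysis-ready slice with SQL.
import sqlite3

measurements = [("sensor_a", 21.5), ("sensor_b", 19.0)]

conn = sqlite3.connect("pipeline.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS measurements (sensor TEXT, value REAL)"
)
conn.executemany("INSERT INTO measurements VALUES (?, ?)", measurements)
conn.commit()

# A typical Data Engineering task: an SQL query that hands the analyst
# exactly the slice of data they need.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor"
).fetchall()
print(rows)
conn.close()
```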

In reality, most raw datasets contain a lot of noise that needs to be cleaned. Data Engineers are typically responsible for preparing data to be ready for analysis, and this often includes implementing procedures that automate the cleaning, as well as the collection and storage steps described above, as much and as efficiently as possible.
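A routine cleaning step might look something like the sketch below, written with pandas as one possible tool; the choice of library and the column names are my own assumptions, not something prescribed by the pipeline itself.

```python
# Drop records with missing identifiers and readings that fail to parse.
import pandas as pd

raw = pd.DataFrame(
    {"sensor": ["a", "a", None, "b"],
     "value": ["21.5", "oops", "19.0", "20.1"]}
)

cleaned = (
    raw.dropna(subset=["sensor"])   # drop rows with no sensor ID
       .assign(value=lambda df: pd.to_numeric(df["value"], errors="coerce"))
       .dropna(subset=["value"])    # drop unparseable readings
)
print(cleaned)
```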

Data Analysis

The main goal of this stage is to extract the trends and patterns that are intrinsic to the data. There are many different approaches, depending on the data type as well as the end goals of the project. It usually involves building a model on existing data, with techniques ranging from regression and time series analysis to various types of Machine Learning algorithms such as clustering and neural networks. Applications of such models range from predictions on time series data in the commercial and finance sectors, to image recognition for tumour classification and autonomous driving technologies.
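As a small example of the modelling step, the sketch below fits a linear regression to synthetic data with scikit-learn; the library choice and the fabricated “time series” are assumptions made purely for illustration.

```python
# Fit a simple trend model to noisy synthetic data and make a prediction.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = np.arange(100).reshape(-1, 1)                        # e.g. time steps
y = 3.0 * x.ravel() + rng.normal(scale=5.0, size=100)    # noisy linear trend

model = LinearRegression().fit(x, y)
print(f"Estimated slope: {model.coef_[0]:.2f}")          # should be close to 3
print(f"Prediction at t=120: {model.predict([[120]])[0]:.1f}")
```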

Presentation

This is the stage where the end results are put to use. They will likely either appear in some sort of report or presentation, or as a finished data pipeline that is integrated into a system and continues the work in a highly automated fashion.

For ad hoc analysis, the conclusions obtained are usually presented to the stakeholders in either a report or a slide presentation. The stakeholders are likely to be much less familiar with the technical details of the pipeline. The role of the Data Professional here is very similar to that of a salesman, with the main goal of convincing the stakeholders that the work done is well worthwhile and greatly beneficial to them.

Good data visualization is key to delivering findings from the analysis (the famous quote “A picture is worth a thousand words” is often a massive understatement in this context). Data is Beautiful and Data is Ugly are two subreddits that contain many interesting examples, which one might find very helpful as guidelines when creating visualizations.
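A simple, clearly labelled chart often does most of the work. The sketch below uses matplotlib with invented numbers, simply to show the kind of minimal plot that communicates one message at a time.

```python
# A minimal labelled line chart; the data is invented for illustration.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 150, 145, 170]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("One clear message per chart")
fig.tight_layout()
plt.show()
```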


Image by author

Summary

In order to optimize the value that can be created from these components, Data Science professionals often need to start the work with stakeholders as consultants — usually referred to as the “Business Problem Approach”. This is the stage where the Data Science professional gathers information from stakeholders to determine:

  • End goal of the project — What the stakeholders want to achieve with this investment in time and money.
  • What data to gather — This is usually driven by the end goal.
  • Precision requirements — It is not unusual for some projects to sacrifice a bit of precision in exchange for response performance, such as real-time systems.
  • Need for additional hardware — Whether there is sufficient infrastructure to make use of the end product of the project. I have outlined an overview of this in another article — Useful Knowledge of Computers for Data Science

Ensuring each of these aspects is at a satisfactory level is as important as the resulting performance of each component in the Data Pipeline. Good performance and efficiency across all of these factors are crucial to the final success, just like this quote that I would borrow from one of the NBA legends:

“One man can be a crucial ingredient in the team, but one man cannot make the team.” — Kareem Abdul-Jabbar

Photo by NASA on Unsplash

Why has Data Science suddenly become popular?

We have had data for a long time, as well as methods to analyze it. Linear regression, for example, appeared as early as the 1800s in the work of Legendre (1805) and Gauss (1809) [1], so why did the “science” of data not appear until the last decade? Here are several factors that I have identified:

Internet

Since internet services became popular in the late 1990s, large amounts of data have been generated around the world on a daily basis, particularly after social media and smartphones took off in the mid-2000s. Many organizations and companies have taken advantage of this data resource, harnessing it to extract behavioural trends and patterns from populations of various categories. Such results have become extremely valuable for increasing efficiency in many sectors, from e-commerce to infrastructure analysis.

Computing Power

Many machine learning methods, such as the Support Vector Machine developed in the early 1990s [2], have become popular due to the advancement of computing power. Commercial-level machines finally became powerful enough for more people to make use of these methods, often allowing data in more complex formats such as sound, images and videos to be analyzed in ways that were not possible before.

Cloud Computing

The combination of internet and computing power advancements has inevitably led to cloud computing. Before cloud computing started to become popular around 2010, there were already technologies such as GRID computing (for example the LHC Computing GRID) that were widely used within specialized professions such as scientific research. The biggest advantage is obviously the freedom from having to physically own the hardware that is doing the work for us. Being able to control things remotely allows us to set the jobs running, turn off our own computer and the lights, and head off to the bar, while knowing that we are still “productive” in the meantime.

Conclusion

Obviously the factors above are also entangled and have complemented each other along the way. Many concepts under the Data Science umbrella, such as regression and data pipelines, existed long before it became a buzzword. As new technologies emerge, some of these core concepts will still be there, and we may simply use them in smarter ways with even more clever tools. This is also the reason why courses in data science still emphasize these concepts, rather than where to find the button to click on the interface of some commercial software with Machine Learning functionalities.

And lastly, the golden rule of all data-related processes, as always: “Rubbish in, rubbish out”.

[1] Stigler, Stephen M, The History of Statistics: The Measurement of Uncertainty before 1900 (1986), Cambridge: Harvard. ISBN 0–674–40340–1.

[2] Cortes, Corinna and Vapnik, Vladimir N, Support-vector networks (1995), Machine Learning. 20 (3): 273–297. CiteSeerX 10.1.1.15.9362. doi:10.1007/BF00994018. S2CID 206787478.

