The Future of Data Science

Hans Christian Ekne
Towards Data Science
11 min read · Nov 13, 2018



Three key trends influencing the future of data science and what you can do to take advantage of the coming opportunities

It was around 8:00 pm and I was driving home, having picked up some groceries. As I drove into a dimly lit intersection, I barely noticed a flash of light before I felt a violent impact: another driver had lost control of his car and crashed into me. Fortunately, the damage was only to the vehicles. What transpired next is purely the stuff of the future.

Instead of filing the usual paperwork for a car accident insurance claim, I simply took out my insurance app and chose the car crash menu. It asked me a few questions, and I followed the instructions, which included taking a few photos. When I was done, the app informed me that roadside assistance was en route and that the accident had been reported.

The next day, I got a call from my insurance company. They informed me that after reviewing the data from my claim, the video footage from the cameras inside my car and the data from a myriad of other sensors, the machine learning algorithm that handled claims had quickly verified my account of the events.

They were pursuing the case for damages with the other party’s insurance company. Furthermore, I would be able to pick up my car at their partner’s auto shop within the next few days.

Whilst this is simply a story, the technology to make this happen, and much more, is almost here. The application of analytics and the use of machine learning tools to unlock value in data — also known as data science — is growing and we have just begun to scratch the surface.

The three intertwined trends of increasing amounts of data, improved machine learning algorithms and better computing resources are shaping the data science field in exciting ways. This article focuses on the current development and future of these trends, the impact they will have, and how to prepare for it.

The Data Explosion

The exponential growth in data we have witnessed since the beginning of our digital era is not expected to slow down anytime soon. In fact, we have probably just seen the tip of the iceberg. The coming years will bring about an ever-increasing torrent of data. The new data will function as rocket fuel for our data science models, giving rise to better models as well as new and innovative use cases.

Growth of Internet of Things — IoT

One obvious source is IoT. Currently, we have approximately 7 billion connected IoT devices globally, and this number is predicted to reach 21.5 billion within the next 7 years. Not only are more devices coming online, but as the hardware improves, the data they deliver will become richer and more diverse. The largest generators of IoT data are expected to be the airline, mining and automotive industries.

Social Media

Social media is another vast source of data. In May 2012, we uploaded approximately 72 hours of video to YouTube every minute. Five years later that number had skyrocketed to 400 hours every minute, which equates to 65 years of video every single day. Facebook is another huge data generator, with over 2 billion monthly users as of 2017 generating over 4 million likes every minute.

In addition to the above data sources, we also have weblogs, entertainment, payment transactions, surveillance data, telecom data and financial data. The list is by no means exhaustive, and the sources of data will continue to grow as the digitalization process marches on and companies find the need to collect ever greater amounts of information about their customers.

The implications of this trend of increasing data cannot be overstated. Greater access to data will not only increase the precision and accuracy of our machine learning models, but also expand the range of areas they can be applied to, fueling the demand for data science.

All this data is just a way of feeding the machines, quite literally! Many of our machine learning models could not have taken such great strides had it not been for the increasing amounts of available training data. And this takes us to our next key trend: the rapid increase in the effectiveness of machine learning algorithms.

The Rise of The Machines

Machine learning algorithms, especially in the subfield of deep learning, have advanced rapidly in the last few years. In addition, machine learning software is being developed intensely, which improves the quality of the algorithms and makes the tools easier to use, lowering the barriers to entry for aspiring data scientists. Because data science depends so strongly on machine learning tools, advancements in this field directly influence its usefulness and capabilities.

What is machine learning?

We can think of machine learning algorithms as software that produces a certain output based on the data we feed it. Most importantly, this software enables computers to learn without being explicitly programmed.

Typically, this works by feeding the algorithm a lot of examples and it will learn from them. This can be contrasted with expert systems, where all the logic is coded into the algorithm by a programmer. Over the last 5–10 years, one of the most exciting subfields within machine learning has been deep learning.
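
To make this concrete, here is a minimal sketch using scikit-learn. The dataset, the features and the labels are invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up examples: [engine_size_litres, weight_kg] -> fuel type (0 = petrol, 1 = diesel)
X = [[1.0, 900], [1.2, 1000], [2.0, 1500], [3.0, 1800]]
y = [0, 0, 1, 1]

# The decision rules are inferred from the examples,
# not hand-coded by a programmer as in an expert system
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[2.5, 1600]]))  # -> [1]
```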

Deep learning models are, at their core, layers of nodes: an input layer that accepts data, one or more hidden layers that transform it, and a final output layer that produces the result. Because of their loose resemblance to how the brain works, these models are known as neural networks. We will not delve into the specifics of these networks, as that is beyond the scope of this article, but suffice it to say their development is behind some of the major breakthroughs in machine learning.
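
To give a feel for that layered structure, here is a minimal feed-forward network in Keras; the layer sizes and input shape are arbitrary choices made for the sketch:

```python
from tensorflow import keras

# A small feed-forward network: input -> two hidden layers -> output
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),  # accepts 20 input features
    keras.layers.Dense(64, activation="relu"),                     # hidden layer
    keras.layers.Dense(1, activation="sigmoid"),                   # single output node
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()  # prints the layers and their parameter counts
```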

AlphaGo

One notable example of the success of deep learning is AlphaGo, a program built to play Go, a two-player strategy game that is considered even more difficult for computers to master than chess.

In 2014, the company DeepMind started the AlphaGo research project, with a mission to beat human players at Go. A year later, AlphaGo defeated the European Go champion Fan Hui, and in March 2016 AlphaGo played against Lee Sedol, one of the best players in the world, winning 4 out of 5 matches. Updated versions of the program followed, and AlphaGo Zero won 100 out of 100 games against its predecessor AlphaGo, a program that already dominated any human opponent.

Deep learning has also fueled the impressive advances in machine vision. We can now create algorithms that classify some sets of pictures better than humans can.
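
As a sketch of what such image classification looks like in practice, the snippet below runs a pretrained ResNet-50 from torchvision over a single photo; the file name is a placeholder:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load an image classifier pretrained on ImageNet
model = models.resnet50(pretrained=True)
model.eval()

# Standard ImageNet preprocessing
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

batch = preprocess(Image.open("photo.jpg")).unsqueeze(0)  # "photo.jpg" is a placeholder
with torch.no_grad():
    prediction = model(batch).argmax(dim=1)
print(prediction)  # index of the predicted ImageNet class
```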

We are also able to generate summaries of the contents of photographs. It will not be long before algorithms can 'watch' video and learn its contents, something which has the potential to disrupt several industries.

We could search through millions of hours of video using ordinary search queries. Surveillance could be done by directly querying CCTV footage.

In natural language processing, we are also making huge advances. Handwriting recognition is on par with human-level performance, and sentiment analysis has improved greatly over the last few years, as has our ability to summarize written text.
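
A simple sentiment classifier can be sketched in a few lines with scikit-learn; the handful of training texts below are toy examples rather than a real corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples; a real system would train on thousands of texts
texts = ["great service, very happy", "fast and helpful",
         "terrible experience", "slow and rude support"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(texts, labels)
print(classifier.predict(["the support team was wonderful"]))  # -> [1]
```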

Many of these examples highlight cases where the machine learning algorithm is on par with or better than humans. In the next few years we will continue developing superhuman machines.

A central difference between the algorithms and us is their rapid rate of improvement. There is nothing to indicate that this trend will stop; on the contrary, it is expected to accelerate.

Better and easier-to-use algorithms will positively influence data science by improving our current models, and will enable the use of machine learning models for tasks that were previously reserved for humans. The companies that are able to apply these algorithms in their business processes will probably develop a strong competitive advantage over their rivals.

Minds in the Clouds

Our computational abilities are developing in tandem with our machine learning algorithms and the increasing amounts of data. We are creating increasingly powerful computers that can store ever more data and process that data faster than ever before.

For neural nets, we have made the shift from CPUs to GPUs and on to TPUs, and the cloud service providers are constantly improving their offerings, with all the major players now providing Machine Learning as a Service (MLaaS).

The trend of improved computing and easier access to those resources is a strong driver of the data science field.

From CPUs to TPUs

For years, we used CPUs to train our neural networks. This took a lot of time and limited the potential models we could build. In 2007, a breakthrough came when Nvidia, the largest manufacturer of GPUs, released the CUDA programming platform.

This allowed application developers to tap into the general-purpose parallel processing capabilities of GPUs, and a few years later GPUs started to revolutionize how we train our neural nets. Fast forward to 2017, when Google released the TPU, or Tensor Processing Unit. This device is even faster at training many neural networks than the GPU, and just as GPUs have continued to improve over the years, TPUs likely will too.
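
In practice, taking advantage of such an accelerator can be as simple as moving the model and the data onto it. A minimal PyTorch sketch:

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 10).to(device)   # move the model's weights to the device
x = torch.randn(32, 128, device=device)       # create the input batch on the device
output = model(x)                             # the forward pass runs on the accelerator
```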

All the major cloud providers now offer environments to build and deploy machine learning models. Their offerings range from Google's Cloud Machine Learning Engine, which is based on the TensorFlow project, to Microsoft Azure Machine Learning Studio with its Cortana Intelligence Gallery. As these offerings improve, the skills needed to use machine learning tools will probably decrease, contributing to a lower threshold for the use of data science.

Docker for Data Science

On the development side, we are starting to see an increasing number of projects using Docker for data science development. Docker is a tool for virtualizing a compute environment, and using it for data science makes it a lot easier to scale and deploy machine learning algorithms.

Firstly, it is easy to set up: you can build Docker images from Dockerfiles, which can be tailored specifically to the demands of your machine learning project. In addition, Docker containers can easily be deployed to the cloud and run on demand. This makes data science potentially a lot more agile than before.
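
As an illustration, a Dockerfile for a small Python-based machine learning project might look something like this; the file names and versions are assumptions made for the sketch:

```dockerfile
# A minimal sketch, not a recommended production setup
FROM python:3.6-slim

WORKDIR /app

# Pin the dependencies so every container runs the exact same environment
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training script and run it when the container starts
COPY train.py .
CMD ["python", "train.py"]
```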

Lastly, endpoint devices are also getting smarter and smaller. Endpoint devices are those that live on the edges of the network, for example a phone or another type of IoT device. The first iPhone, released in 2007, had a 412 MHz CPU and 128 MB of RAM. The latest iPhone XS Max sports an impressive six-core processor with two 2.5 GHz Vortex cores and four 1.59 GHz Tempest cores, along with 4 GB of RAM.

Of course, many of the advances that the chip manufacturers are making will spread into other devices, increasing their capabilities and reducing their cost. All of this is increasing the amount and type of data we can expect from IoT devices, which in turn will probably be used to fuel thousands of machine learning algorithms running somewhere up in the clouds.

So what’s the relevance?

At this point you might ask yourself: why is this relevant? Well, even if only half of the trends in computing continue at their current pace, the effect on data science will still be massive. It will become easier to build models, deploy them to production environments and integrate them into business processes.

Given many of the recent developments and ongoing trends, we have likely just begun to scratch the surface of what data science has to offer us. But how should we prepare for these key trends, and what can we do to increase the chances of creating successful business models from them? This takes us to our next topic.

How To Prepare for the Future of Data Science

There are many ways companies can and should prepare for the future of data science. These include creating a culture for using machine learning models and their output, standardizing and digitizing processes, experimenting with cloud infrastructure solutions, taking an agile approach to data science projects and creating dedicated data science units. Being able to execute on some of these points will increase the likelihood of succeeding in a highly digitalized world.

A Data Science Unit

In my previous role, I worked as a data scientist for an insurance company. One of the smart moves they made was to create an analytics unit that worked across company verticals.

This made it easier for us to reuse our skills and models on a variety of datasets. It also signaled to the rest of the company that data science was a priority. Once a company reaches a certain size, creating a dedicated data science unit is definitely the right move to make.

Standardization

Standardization of processes is also important. This will make it easier to digitalize and perhaps automate these processes in the future. Automation is a key driver for growth, making it much easier to scale. An added bonus is that the data collected from automated processes is usually a lot less messy and less error prone than data collected from manual processes. Since an important enabler of data science models is access to good data, this will help make the models better.

Adoption of Data Science

There should also be a culture in the company of adopting machine learning algorithms and using their output in business decisions. This is, of course, often easier said than done, since many employees might fear that the algorithms will make them obsolete.

It is therefore critical that there be a strong focus on how employees can use their existing skill sets alongside the algorithms to make more high-level and tactical business decisions, as this combination of human and machine is likely to be the future of work in many occupations. It will probably be more than a few years before machine learning algorithms are able to navigate alone and make superhuman decisions in an open-world setting, which means that mass unemployment due to the rise of the machines is not a likely scenario in the near future.

Always Experiment

With new data being generated from IoT sources, it is important to explore new datasets and see how they can be used to augment your existing models. There is a constant flow of new data waiting to be discovered.

Perhaps including two new variables from an obscure dataset will increase the precision of your lead-generation model by 5 percent, and perhaps not. The point is to always experiment and not be afraid to fail. As in all other scientific inquiries, failed attempts abound, and the winners are those who keep on trying.
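
One cheap way to run such an experiment is to cross-validate the model with and without the candidate variables and compare the scores. A minimal sketch on synthetic data, since the real test would of course use your own:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a leads dataset
X, y = make_classification(n_samples=2000, n_features=12, n_informative=6,
                           random_state=0)

candidates = {
    "baseline": X[:, :10],        # the features the current model uses
    "with new variables": X,      # baseline plus two candidate variables
}

for name, features in candidates.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000),
                             features, y, cv=5, scoring="precision")
    print(f"{name}: mean precision = {scores.mean():.3f}")
```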

Create an environment that promotes experimentation and tries to make incremental improvements to existing business processes. This will make it easier for data scientists to introduce new models and will also keep the focus on smaller improvements, which are a lot less risky than the larger grand visions. Remember, data science is still a lot like software development: the more complex the project becomes, the more likely it is to fail.

Try building an app that your customers or suppliers can use to interact with your services. This will make it easier to gather relevant data. Create incentives to promote usage of the app, which will increase the amount of data being generated. It is also imperative that the UX of the app be appealing and promote use.

We might need to venture outside of our comfort zones to take on the opportunities and challenges that this digital gold brings. As the amount of data continues to grow, machine learning algorithms get smarter and our computational abilities improve, we will need to adapt. Hopefully, by creating a strong environment for using data science your company will be better prepared for what the future will bring.

If you enjoyed this article and would like to see more content from me or would like to engage my services, feel free to connect with me on LinkedIn at https://www.linkedin.com/in/hans-christian-ekne-1760a259/ or visit my webpage at https://ekneconsulting.com/ to see some of the services I provide. For any other questions or comments please send me a mail at hce@ekneconsulting.com.

Thanks for reading!
