What is the cloud?
The cloud sounds like a magical place where data lives entangled in the fabric of reality. It’s beautiful branding, because the reality is a lot less exciting. Most people have heard of data centers, the warehouses filled with computers (fancily called servers) that big tech companies use. What many miss is that the cloud and data centers are one and the same. When you save your photos to the cloud, you are sending them to a data center to be stored on one of the servers there. When you want to look at a photo, your phone sends a request to the data center, which sends the image back. Similarly, if you want to calculate the fastest route home from work, your phone sends the request to a data center. The data center has all the map data stored and far more processing power than your phone, so it can quickly perform the calculation and send the results back. This is why your maps application can’t calculate new routes when you lose cell signal.
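The request/response pattern above can be sketched in a few lines. This is a toy, in-memory stand-in for a data center (the names `upload` and `download` are illustrative, not a real cloud API); real cloud storage does the same thing over a network.

```python
# A hypothetical "data center": a dict stands in for a server's disk.
data_center = {}

def upload(name, photo_bytes):
    """Phone -> data center: send a photo to be stored on a server."""
    data_center[name] = photo_bytes

def download(name):
    """Phone -> data center: request a stored photo be sent back."""
    return data_center.get(name)

upload("vacation.jpg", b"\xff\xd8...")  # save to the cloud
photo = download("vacation.jpg")        # view it later, from any device
```

The key point is that the photo lives on the server, not the phone; the phone only ever sends requests and receives responses.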
Long story short, the cloud is a collection of computers that can store data and perform powerful calculations. This should cause data scientists to perk up their ears.
Why use the cloud?
There are several reasons the cloud is appealing to data scientists.
Compute Power: When you are training deep neural networks, you need a lot of processing power for your models to run in a reasonable time. This can easily overwhelm your local computer, but a data center filled with powerful GPUs (graphics processing units) optimized for these kinds of calculations can make quick work of it.
Centralized Location for Storing Data: One of the major challenges for a data scientist is aggregating the data to use in an analysis. If you store all your data in the cloud, this problem is greatly simplified: any computer can access any of the data from anywhere. This is especially impactful when you are gathering data from many sources (ahem, Internet of Things). Architectures such as data lakes and data warehouses also let clever data scientists pull seemingly unrelated business data into their analyses.
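A minimal sketch of that last point: when data from different sources lands in one shared store, combining it becomes a single query. Here an in-memory SQLite database stands in for a data warehouse, and the `sales` and `sensors` tables are hypothetical examples of two unrelated sources.

```python
import sqlite3

# One central store; in practice this would be a cloud data warehouse.
db = sqlite3.connect(":memory:")

# Two hypothetical sources: finance data and IoT sensor readings.
db.execute("CREATE TABLE sales (store TEXT, revenue REAL)")
db.execute("CREATE TABLE sensors (store TEXT, foot_traffic INTEGER)")
db.execute("INSERT INTO sales VALUES ('downtown', 1200.0)")
db.execute("INSERT INTO sensors VALUES ('downtown', 340)")

# Seemingly unrelated data, joined easily because it lives in one place.
row = db.execute(
    "SELECT s.store, s.revenue, t.foot_traffic "
    "FROM sales s JOIN sensors t ON s.store = t.store"
).fetchone()
```

With the data scattered across laptops and departmental servers, that join would first require hunting down and copying files; centralized storage reduces it to one line of SQL.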
Big Data: We live in the age of big data. Data is the new bacon, as they say. Storing all of it requires a lot of hardware, and if you want backups, twice that hardware. Downloading all of this data onto your local computer just to run a calculation may not even be an option. Working in the cloud avoids the problem entirely.
IT Support / DevOps / Server Management: If you solve some of the above problems with local servers, your data team will need IT support to manage them. Getting into the weeds of networking issues is likely outside the scope of your data team’s expertise. Cloud providers often offer to take care of the nitty-gritty work for you.
Easily Shareable Results and Applications: Storing analyses in the cloud makes it easy for others to contribute or reproduce your results, since everyone works in the same environment. You can publish results as a dashboard or webpage that business leaders can access or share with third parties. You can also deploy applications that run continuously, which is far easier than asking teammates to run them on their own local computers.
How can I use the cloud?
There are several ways for a data team to take advantage of the cloud, ranging from simplest to most complex.
Google Colab: Cloud hosted notebooks, such as Google Colab, allow a data team to easily collaborate on an analysis together and take advantage of the computing capabilities that a cloud provider can offer.
Cloud Hosting Data Science Services: Several open-source data science companies (such as RStudio) make their money by offering services that help data teams host their work in the cloud. This route offers all the benefits of a cloud hosting provider while letting the data team focus on their analyses rather than the messy details of server management.
Deploy An Application to a Cloud Provider: You can always use the hosting services of AWS, GCP, Microsoft Azure, or another company to deploy your own application. This will require "DevOps" work by somebody, but most of these providers have helpful resources and support that can get a data team up to speed.
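To make the last option concrete, here is a minimal sketch of what "an application" can mean at its simplest: a WSGI app, the standard interface Python web servers speak. Many cloud hosting services can serve an app shaped like this; the greeting and route handling here are illustrative, not any provider's required layout.

```python
def application(environ, start_response):
    """A tiny WSGI app: respond to any request with a plain-text greeting."""
    path = environ.get("PATH_INFO", "/")
    body = f"Hello from {path}".encode()
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]

# To try it locally before deploying, the standard library includes a
# reference server:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, application).serve_forever()
```

Deploying means handing an app like this to a cloud provider's hosting service instead of running it on your own machine; the provider's servers then answer requests around the clock.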
What does this mean for me?
If you find yourself complaining about how long your computer takes to complete an analysis, or you would like a better way to collaborate on notebooks and share results, then moving your workflow from your laptop to the cloud is worth investigating. Building a dashboard and deploying it to RStudio Cloud, running a Google Colab notebook, or building an app from scratch hosted on AWS can really transform your data experience. Plus, if you ever want your product deployed at large scale, you will need to be familiar with cloud services, so you may want to start testing the waters now.