
What is Cloud Computing? The Key to Putting Models into Production

A breakdown of popular cloud computing platforms and big data tools such as AWS, Azure, GCP, Hadoop, PySpark and more.

Photo by Tianyi Ma on Unsplash

A key skill for any Data Scientist is the ability to write production-quality code to create models and deploy them into cloud environments. Typically, working with cloud computing and data architectures falls under the Data Engineer job title. However, every data professional is expected to be a generalist who can adapt and scale their projects.

Here is an introduction to popular platforms that I have seen across dozens of job descriptions. This doesn’t mean that we have to become experts overnight, but it helps to understand the services that are out there. An introduction is always the first step in learning a new skill.

Big Data & Cloud Computing

Although "Big Data" has been a buzzword lately and has gained interest from big businesses, it’s typically difficult to define. There’s no clear consensus about the definition, but "Big Data" usually refers to datasets that grow so large that they become unmanageable with using traditional database management systems and analytical approaches. They are datasets whose size is beyond the ability of commonly used software tools and storage systems to capture, store, manage and process the data within a tolerated elapsed time.

When thinking about Big Data, keep the 3 V’s in mind: Volume, Variety and Velocity.

  • Volume: The size of the data
  • Variety: The different formats and types of data, as well as the different uses and ways of analyzing the data
  • Velocity: The rate at which data is changing, or how often it is created

With the evolution of data, the need for faster and more efficient ways of analyzing such data has also grown exponentially. Having big data alone is no longer enough to make efficient decisions. Therefore, there is a demand for new tools and methods specialized for big data analytics, as well as the required architectures for storing and managing such data.

Where Should We Begin?

Each company has a unique technology stack, with software it prefers to use with its proprietary data. A number of different platforms fall into each category of the stack: Visualization & Analytics, Computation, Storage, and Distribution & Data Warehouses. There are too many platforms to count, but I’ll be going over the popular cloud computing services that I have seen across recent job postings.

Rather than owning their own computing infrastructure or data centers, companies can rent access to anything from applications to storage from a cloud service provider. This allows companies to pay for what they use, when they use it, instead of dealing with the cost and complexity of maintaining their own IT infrastructure. Simply put,

Cloud computing is the delivery of on-demand computing services – from applications to storage and processing power – typically over the internet and on a pay-as-you-go basis.

Although I’m not going to dive into details in this blog, it’s important to note that there are three cloud computing models: IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service) and SaaS (Software-as-a-Service). Of these three, SaaS is by far the dominant and most widely used cloud computing model.

Amazon Web Services (AWS)

AWS offers database storage options, computing power, content delivery, and networking, among other functionalities, to help organizations scale up. It allows you to select your desired solutions while paying for exactly the services you consume. Services like AWS let us rent time on these servers by the minute, making distributed training available to anybody. Additionally, AWS servers allow companies to make use of big data frameworks such as Hadoop or Spark across a cluster of servers.
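
As a small, hedged illustration of what working with AWS from Python can look like, here is a minimal sketch using boto3, the AWS SDK for Python. The bucket name and file paths are hypothetical placeholders, not anything specific to this article.

```python
import boto3

# Sketch: upload a serialized model artifact to S3 so it can later be used
# for training or hosting. Bucket and key names are hypothetical.
s3 = boto3.client("s3")

s3.upload_file(
    Filename="model.tar.gz",          # local artifact on disk
    Bucket="my-example-ml-bucket",    # hypothetical bucket name
    Key="models/model.tar.gz",        # destination key in S3
)

# List the objects under the prefix to confirm the upload landed.
response = s3.list_objects_v2(Bucket="my-example-ml-bucket", Prefix="models/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```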

Amazon has also centralized all of the major Data Science services inside Amazon SageMaker. SageMaker provides numerous services for things such as data labeling, cloud-based notebooks, training, model tuning and inference. We can set up our own models or use the preexisting models provided by AWS. Similarly, we can set up our own inference endpoints or make use of preexisting endpoints created by AWS. Creating an endpoint requires us to use a Docker instance. Luckily, much of the work required to create an endpoint for our own model is boilerplate, and we can reuse it across multiple projects.
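
To make that concrete, below is a minimal, hedged sketch of training and deploying a scikit-learn model with the SageMaker Python SDK. The entry-point script, IAM role ARN and S3 path are assumptions for illustration, not values from this article.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # hypothetical IAM role

# Train a custom scikit-learn script (train.py is assumed to exist) on a managed instance.
estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    framework_version="1.2-1",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-example-ml-bucket/train/"})  # hypothetical S3 path

# Deploy the trained model behind a managed real-time inference endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
print(predictor.endpoint_name)
```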

AWS currently supports over 2,000 government agencies and 5,000 educational institutions, and it’s one of the top four public cloud computing companies in the world. Benefits include ease of use, a diverse array of tools, virtually unlimited server capacity, reliable encryption and security, and affordability. However, there are drawbacks: some AWS resources are limited by region, and the platform shares the concerns that many other cloud computing platforms face, such as backup protection, risk of data leakage, privacy issues and limited control.

Microsoft Azure

Microsoft Azure, commonly referred to as Azure, is a cloud computing service created by Microsoft for building, testing, deploying, and managing applications and services through Microsoft-managed data centers. This platform has a strong focus on security, establishing it as a leader in IaaS security. Also, as a cloud computing service, it’s highly scalable while being affordable.

As with anything, there are a couple of potential cons to Microsoft Azure. Unlike SaaS platforms, where the end user is consuming information, IaaS platforms such as Azure move your business’s computing power from your data center or office to the cloud. As with most cloud service providers, Azure needs to be expertly managed and maintained, which includes patching and server monitoring.
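
For a taste of the developer experience, here is a small, hedged sketch of pushing a model artifact to Azure Blob Storage with the azure-storage-blob package; the connection string, container name and file name are hypothetical placeholders that would need to be replaced with real values.

```python
from azure.storage.blob import BlobServiceClient

# Hypothetical connection string for a storage account (normally kept in a secret store).
conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=myaccount;"
    "AccountKey=<key>;EndpointSuffix=core.windows.net"
)

service = BlobServiceClient.from_connection_string(conn_str)
container = service.get_container_client("models")  # hypothetical container

# Upload a serialized model so it can be pulled into a training or serving job later.
with open("model.pkl", "rb") as f:
    container.upload_blob(name="model.pkl", data=f, overwrite=True)

# Confirm the blob landed where we expect.
for blob in container.list_blobs():
    print(blob.name, blob.size)
```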

Google Cloud Platform (GCP)

Just like AWS and Azure, GCP is a suite of public cloud computing services offered by Google. The platform includes a range of hosted services for computing, storage and application development that run on Google hardware.

Some of the areas where GCP strongly competes with AWS include instance and payment configurability, privacy and traffic security, cost-efficiency, and Machine Learning capabilities. GCP also offers several off-the-shelf APIs for computer vision, natural language processing and translation. Machine learning engineers can build models with Google’s Cloud Machine Learning Engine, which is built around Google’s open-source TensorFlow deep learning library.
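
As one small example of those off-the-shelf APIs, here is a hedged sketch using the google-cloud-translate client library (v2 API); it assumes credentials are already configured via the GOOGLE_APPLICATION_CREDENTIALS environment variable, and the sample sentence is purely illustrative.

```python
from google.cloud import translate_v2 as translate

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key file.
client = translate.Client()

# Translate a short French sentence into English using the hosted API.
result = client.translate("Où est la gare ?", target_language="en")
print(result["translatedText"])          # e.g. "Where is the train station?"
print(result["detectedSourceLanguage"])  # e.g. "fr"
```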

Despite these advantages, there are some cons. Core GCP products like BigQuery, Spanner and Datastore are great but opinionated, with limited customization and observability. Additionally, documentation can be sparse, and GCP’s customer service has a reputation for being unresponsive.

Hadoop

Hadoop is a distributed computing environment for processing very large datasets across computer clusters. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. The ecosystem is built around MapReduce.

MapReduce is a programming model, or pattern, within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). A MapReduce job usually splits the input dataset into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically, both the input and the output of the job are stored in a file system.

Instead of using just one system to perform a task, Hadoop and MapReduce allow work to be divided: tasks are split into many subtasks, solved individually, and then recombined into a final result.
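
To make the map, shuffle and reduce phases tangible, here is a toy, in-memory word-count sketch of the pattern in plain Python. It is not Hadoop code, just an illustration of the division of work described above.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, mimicking the framework's sort/group step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into a final count.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```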

Although Hadoop is open source and uses a cost-effective, scalable model, it’s mainly designed for dealing with large datasets; the platform’s efficiency decreases when working with small amounts of data. Additionally, Hadoop jobs are mostly written in Java, which can be troublesome and error-prone without expertise in that language.

PySpark

PySpark is an interface for Apache Spark in the Python programming language. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, MLlib (Machine Learning) and Spark Core.

Some advantages of PySpark are that Spark jobs can be up to 100x faster than traditional MapReduce when run in memory, and that it offers a higher-level query language that implements SQL-like semantics on top of the MapReduce model. Additionally, PySpark’s APIs are similar to Python’s Pandas and Scikit-Learn.
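
Here is a minimal, hedged sketch of that DataFrame API in action, aggregating a hypothetical CSV of transactions; the file name and column names are assumptions made purely for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transactions-example").getOrCreate()

# Hypothetical CSV with columns: customer_id, amount
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# A Pandas-like, SQL-like aggregation that Spark distributes across the cluster.
totals = (
    df.groupBy("customer_id")
      .agg(F.sum("amount").alias("total_spent"))
      .orderBy(F.desc("total_spent"))
)
totals.show(5)

spark.stop()
```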

The original Apache Spark engine is written in Scala, a programming language popular among big data developers because of its scalability on the JVM (Java Virtual Machine). Unfortunately, Python is generally slower and less efficient than Scala. Additionally, PySpark is less mature than the core Scala API and cannot access Spark’s internal functioning for most projects.

Choosing the Right Service

Without cloud computing, it would be impossible for Data Scientists and Machine Learning Engineers to train some of the larger deep learning models that exist today.

There’s a ton of cloud computing services to choose from, but selecting the right one comes down to factors such as security concerns, software dependencies, manageability, customer service support, and more.

Each employer will most likely have a preferred technology stack they use company-wide with their proprietary data. Until you’re working with one, there are plenty of resources online to help build your cloud computing knowledge, and you can visit each provider’s website and read extensively about what their service offers. Ultimately, as with any Data Science skill, there’s no better way to learn than hands-on practice with a personal project.

