The Cloud: Google Cloud Platform made easy

The Cloud is a complicated space. It is not as plug-and-play as most people imagine. Let’s simplify the Cloud: GCP Edition.

Anish Mahapatra
Towards Data Science


Photo by Tianyi Ma on Unsplash

The Cloud is a complicated space. It is not as plug-and-play as most people imagine. We have folks from various backgrounds, such as developers, network engineers, machine learning engineers and data architects, each of whom has mastery over certain components of the cloud.

When working in an enterprise environment, it is critical that experts own the relevant components and that everybody has a place in the data and model pipeline lifecycle. These roles can include:

  • Security: Handling Identity and Access Management (IAM)
  • Data Architecture: Understanding the interaction between the various cloud services, plus an in-depth understanding of on-prem services and requirements
  • Model Operationalization: Hands-on understanding of IaaS, PaaS and SaaS features on the Cloud; pipeline automation, optimization, deployment, monitoring and scaling
  • Infrastructure: Managing the dynamic requirements of various projects and products to minimize cost to the company while keeping applications agile
  • Support: End-to-end knowledge of the Cloud platform being leveraged, provided by professionals who save time on debugging (knowing over learning)

A healthy mix of the above skillsets can make the move from legacy systems to scaling on the cloud a success.

Photo by Taylor Vick on Unsplash

The Data Lifecycle on GCP

Google has been in the internet game for a long time. Along the way, they have built multiple great products. Many of these began as experimental efforts rather than a streamlined strategy, and that history is evident in the breadth of the product portfolio.

The same workflow can be achieved on the cloud in multiple ways. Making the right, optimized choices among them is what makes one a Google Cloud Data Professional.

The data lifecycle is the journey of data from initial collection to final visualization. It consists of the following steps:

  • Data Ingestion: Pull in the raw data from the source; this is generally real-time or batch data
  • Storage: The data needs to be stored in the appropriate format; it has to be reliable and accessible
  • Data Processing: The data has to be processed to draw actionable insights from it
  • Data Exploration and Visualization: Depending on how the data is consumed, it has to be showcased appropriately to stakeholders

Photo by Jesse Orrico on Unsplash

1. Data Storage

Data storage on the Cloud is a value offering that enterprises should leverage. It delivers features that cannot be achieved with an on-premise storage option: fault tolerance, multi-regional support for reduced latency, elasticity under growing workloads, preemptible VM instances, pay-per-usage pricing and reduced maintenance costs, among many other advantages.

Having said this, GCP has multiple offerings in terms of data storage. Choosing the right service is as important as optimizing for the cloud.

The Data Lake of Google Cloud Platform: Google Cloud Storage

Google Cloud Storage (Credit)

Considered the ultimate staging area, Google Cloud Storage accepts data of any format and type, and can be used to store real-time as well as archived data at reduced cost.

Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google’s cloud with advanced security and sharing capabilities.

Consider Google Cloud Storage the web file-system equivalent: it is reliable, accepts anything you throw at it, and offers storage classes priced to match your data needs: Standard, Nearline, Coldline and Archive.
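
To make this concrete, here is a minimal sketch of uploading and reading an object with the official Python client; the project, bucket and file names are placeholders:

```python
from google.cloud import storage

# Project, bucket and object names below are placeholders.
client = storage.Client(project="my-project")
bucket = client.bucket("my-staging-bucket")

# Upload a raw file into the data lake.
blob = bucket.blob("raw/2024/events.json")
blob.upload_from_filename("events.json")

# Read it back.
print(blob.download_as_text()[:200])
```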

Google Cloud SQL

Google Cloud SQL (Credit)

Google Cloud SQL is a fully-managed database service that helps you set up, maintain, manage, and administer your relational databases.

Use-cases: Structured data backing a web framework; for example, warehouse records or articles.

Cloud SQL can be leveraged when opting for a direct lift and shift of traditional SQL Workloads with the maintenance stack managed for you.

Best practice: prefer more, smaller tables to fewer, larger ones.
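
As a quick sketch, this is how you might query a Cloud SQL (MySQL) instance from Python using the Cloud SQL Python Connector; the instance string, credentials and table are hypothetical:

```python
from google.cloud.sql.connector import Connector

connector = Connector()

# Instance connection string, credentials and table are hypothetical.
conn = connector.connect(
    "my-project:us-central1:my-instance",
    "pymysql",
    user="app_user",
    password="change-me",
    db="warehouse",
)
with conn.cursor() as cursor:
    cursor.execute("SELECT id, sku, quantity FROM inventory LIMIT 5")
    for row in cursor.fetchall():
        print(row)
conn.close()
```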

Google Cloud Datastore

Google Cloud Datastore (Credit)

Google Cloud Datastore is a highly scalable, fully managed NoSQL database service.

Use-cases: Semi-structured, key-value data; for example, product SKU catalogs or game-state checkpoints.

Cloud Datastore can be utilized as a NoOps, highly scalable non-relational database. The structure of the data can be defined per the business requirement.
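
A minimal sketch of storing and fetching a product SKU as a Datastore entity; the kind and fields are made up for illustration:

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")  # placeholder project

# Store a product SKU record as a key-value entity.
key = client.key("Product", "SKU-1001")
entity = datastore.Entity(key=key)
entity.update({"name": "Mechanical Keyboard", "price": 89.99, "in_stock": True})
client.put(entity)

# Fetch it back by key.
print(client.get(key))
```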

Google Cloud Bigtable

Google Cloud Bigtable (Credit)

Google Cloud Bigtable is a fully managed, scalable NoSQL database service for large analytical and operational workloads. It is a single-region service and is not compatible with multi-region deployments.

Use-cases: High-throughput analytics with sub-10 ms response times at millions of reads/writes per second. Used in finance, IoT, etc.

Cloud Bigtable is not NoOps: changing the disk type (HDD/SSD) requires a new instance. Each row is recognized through one identifier, the row key.

Good row keys distribute load evenly; poor row keys lead to hot-spotting. Typical indicators of poor row keys: domain names, sequential IDs, timestamps.
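
A small sketch of a Bigtable write using a field-prefixed row key (device ID before timestamp) to avoid hot-spotting; instance, table and column-family names are placeholders:

```python
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("iot-instance")   # placeholder names
table = instance.table("sensor-readings")

# Prefix the key with a device ID so writes spread across nodes;
# a bare timestamp prefix would funnel all writes to one node.
device_id = "sensor-42"
event_time = "20240101T123000"
row = table.direct_row(f"{device_id}#{event_time}".encode())
row.set_cell("metrics", b"temperature", b"21.5")
row.commit()
```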

Google Cloud Spanner

Google Cloud Spanner (Credit)

Google Cloud Spanner is a fully managed relational database with unlimited scale, strong consistency & up to 99.999% availability. It is essentially built for pure scale with minimal downtime and complete reliability.

Use-cases: RDBMS workloads with high-scale transactions. For example, global supply chains or retail point-of-sale (PoS) tracking.

Cloud Spanner is a fully managed, highly scalable and available relational database with strong transactional consistency (ACID compliance).
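
A minimal read sketch with the Spanner Python client, assuming a hypothetical instance, database and Orders table:

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")   # placeholder names
database = client.instance("retail-instance").database("orders-db")

# Strongly consistent read of recent orders.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("SELECT OrderId, Total FROM Orders LIMIT 10")
    for row in rows:
        print(row)
```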

Google BigQuery (Storage)

Google BigQuery (Credit)

Google BigQuery is a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data. It is a serverless Software as a Service that supports querying using ANSI SQL. It also has built-in machine learning capabilities.

Use-cases: Mission-critical apps that require scale and consistency; large-scale data analytics using SQL.

Google BigQuery also supports machine learning directly in SQL through BigQuery ML.
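
To get a feel for it, here is a minimal query sketch with the BigQuery Python client, run against one of Google’s public datasets; the project name is a placeholder:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
for row in client.query(query).result():
    print(row["name"], row["total"])
```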

Great job! You have understood a fair bit about the storage within the Google Cloud Platform. The below image illustrates how it all comes together.

Building a Data Lake on GCP (Credits)

I would like to reiterate: a similar workflow and workload can be performed on GCP in multiple ways, but the most optimal choice depends on the short-term and long-term scaling and usage plans of the project or organization.

Photo by tian Kuan on Unsplash

2. Setting up Data Pipelines

Now that we know where our data is going, the next question to tackle is how we get our data there. This is the job of a data pipeline. We have to build our data pipelines to ensure that the data used at any point meets the expectations of the end user.

The three main types of data are:

  • Historical/archive data: Old data that might be used at some point. We upload this data to GCP via the available methods
  • Streaming (or real-time) data: Data generated in real time that we would like to analyze in near real time, like financial data
  • Batch (or bulk) data: Data that is updated in batches, where low latency is not a priority

Let us now understand how Google Cloud Platform provisions this with their various services.

Google Cloud Pub/Sub:

Google Pub/Sub (Credits)

Google Pub/Sub stands for Publisher and Subscriber. This service lets you stream data from the source in real time. The open-source equivalent is Apache Kafka.

Pub/Sub is an asynchronous messaging service that decouples services that produce events from services that process events.

Use-Cases: Stream Analytics, Asynchronous microservices integration. For example, streaming IoT data from various manufacturing machine units.

Pub/Sub Integrations (Credits)

This service can be used for streaming data when fast action is necessary: quickly collect data, gain insights and act on them, for scenarios such as credit-card fraud detection.
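
A minimal publisher sketch with the Python client; the project and topic names are placeholders:

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "machine-events")  # placeholders

# Messages are raw bytes; JSON is a common convention.
future = publisher.publish(topic_path, data=b'{"unit": 7, "temp_c": 78.4}')
print("Published message ID:", future.result())
```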

Google Cloud Dataflow

Google Cloud Dataflow (Credits)

Google Cloud Dataflow is a unified stream and batch data-processing service that is serverless, fast and cost-effective.

Its open-source equivalent is Apache Beam: Dataflow runs pipelines written with the Beam SDK.

Use-cases: Stream Analytics, Real-Time AI, Anomaly Detection, Predictive Forecasting, Sensor and Log processing

Handling the mishaps of streaming data from Cloud Pub/Sub, such as late or out-of-order messages, is done in Google Cloud Dataflow. You can configure up to what point late data will still be accepted, and how the pipeline is monitored.

Stream Analytics leveraging Pub/Sub and Dataflow (Credits)

Google Cloud Dataflow takes multiple streams of data into the required workflow, handles consistency edge cases, and streamlines the data for pre-processing and near real-time analysis.
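
A minimal Apache Beam sketch of that idea, filtering anomalous sensor readings; it runs locally on the DirectRunner, and pointing it at the DataflowRunner (with project, region and staging options) would send it to Dataflow:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # local test run

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.Create(["21.5", "22.0", "78.4"])  # stand-in for a Pub/Sub source
        | "Parse" >> beam.Map(float)
        | "FlagAnomalies" >> beam.Filter(lambda temp: temp > 50.0)
        | "Output" >> beam.Map(print)
    )
```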

Google Cloud Dataproc

Google Dataproc (Credits)

Google Dataproc makes open-source data and analytics processing fast, easy and more secure in the cloud. The service enables automated cluster management: clusters spin up and down automatically based on triggers. This makes them extremely cost-efficient, yet effective for multiple use-cases.

Use-cases: Running machine learning tasks, PySpark for NLP, etc. For instance, running customer churn analysis on telecom data.

Resizable clusters, autoscaling, versioning, high availability, scheduled cluster deletion, custom images, flexible VMs, Component Gateway and notebook access from an assigned Dataproc cluster are a few of the industry-leading features of Dataproc.
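
As a taste of what such a cluster runs, here is a toy PySpark sketch in the spirit of the churn example; the bucket path and columns are hypothetical, and in practice you would submit this as a Dataproc job rather than run it locally:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("telecom-churn").getOrCreate()

# Hypothetical input path and schema.
calls = spark.read.csv("gs://my-bucket/telecom/calls.csv",
                       header=True, inferSchema=True)

# Average call duration per customer, a typical churn feature.
calls.groupBy("customer_id").avg("duration_sec").show(10)

spark.stop()
```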

Cloud Dataproc in the GCP Workflow (Credits)

Cloud Dataproc is one of the best services for running processing workloads; AI Platform is another great option. The latest benefit of Cloud Dataproc is that, with Google Cloud Composer (Apache Airflow), Dataproc clusters can be spun up and spun down in an agile and scalable manner. This makes for a lean process: costs decrease and efficiency increases.

Fully Serverless implementation on GCP — example (Credits)

Google BigQuery (Pipeline)

Google BigQuery (Credit)

Notice that BigQuery appears a second time, because it is just as effective for setting up analytical pipelines.

Google BigQuery is a fully-managed, auto-scaling, serverless data-warehousing service that supports near real-time analysis of petabyte-scale databases. Data stored within BigQuery should be column-optimized and used for analytics often; archive data should be stored in Google Cloud Storage.

Apache Avro (Credits)
Datatype affecting speed in BigQuery (Credits)

The data format matters: it affects query performance and load speed, and ultimately cost.

Apache Avro is the format with the best performance, while JSON is comparatively slower for analytical tasks on BigQuery. Loading can be done via the Cloud Shell or the web UI.
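
It can also be done from Python. As a sketch, here is loading Avro files from Cloud Storage into BigQuery with the Python client; the URI and table names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder names throughout

job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/events-*.avro",
    "my-project.analytics.events",
    job_config=job_config,
)
load_job.result()  # block until the load completes
print(f"Loaded {load_job.output_rows} rows")
```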

Google Cloud Composer

Google Cloud Composer (Credits)

Google Cloud Composer is a fully managed workflow orchestration service that empowers you to author, schedule, and monitor pipelines that span across clouds and on-premises data centres.

Built on the popular Apache Airflow open source project and operated using the Python programming language, Cloud Composer is free from lock-in and easy to use.

Let me pique your interest. Netflix uses this architecture. It is bleeding edge and extremely scalable.

Why should you care?

If you have worked on Ubuntu with bash scripting, the way we schedule tasks is with cron jobs, which automate tasks otherwise done manually. If you have worked in Excel, macros are the equivalent of what Apache Airflow is capable of doing, but at the cloud level.

This level of automation can be mind-boggling. Essentially, you can set up scripts (in Python and bash) to pull the latest data from Google Cloud Storage, spin up a cluster in Google Dataproc, perform data pre-processing, shut down the cluster, implement a data model on AI Platform and output the result to BigQuery, as sketched below.
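
Here is a minimal Airflow DAG sketch of that spin-up, process, tear-down pattern, assuming the Airflow Google provider package; every project, region and path value is a placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)

# All project, region and path values below are hypothetical placeholders.
PROJECT = "my-project"
REGION = "us-central1"
CLUSTER = "ephemeral-cluster"

with DAG(
    "dataproc_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    create = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER,
        cluster_config={"worker_config": {"num_instances": 2}},
    )
    preprocess = DataprocSubmitJobOperator(
        task_id="preprocess",
        project_id=PROJECT,
        region=REGION,
        job={
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/preprocess.py"},
        },
    )
    delete = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT,
        region=REGION,
        cluster_name=CLUSTER,
        trigger_rule="all_done",  # tear the cluster down even if the job fails
    )
    create >> preprocess >> delete
```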

Photo by Israel Palacio on Unsplash

You are a rockstar! Great job on understanding data storage and the various implementations of data pipelines on Google Cloud Platform. Now that you’re here, I would suggest you scroll back up and refresh your memory to see how much you recall.

Let us now proceed to understand how we can leverage the data we have stored and the pipelines we have made.

Google Cloud Datalab

Google Cloud Datalab (Credits)

Google Cloud Datalab is an easy-to-use interactive tool for data exploration, analysis, visualization, and machine learning.

Datalab enables powerful data exploration; it is integrated and open source, supports scalable data management and visualization, and covers machine learning with lifecycle support.

Why should you care?

As a Data Scientist, this is the equivalent of Jupyter Notebooks, with the data living across various components within GCP. You can connect via SSH from the Cloud Shell and access it on port 8081 (ignore this if it does not make sense yet).

Using Google Cloud Datalab on GCP (Credits)

In short, Cloud Datalab is an online Jupyter Notebook for working with data wherever it lives on GCP.

Google Cloud Dataprep

Google Cloud Dataprep (Credits)

Google Cloud Dataprep is an intelligent cloud data service to visually explore, clean, and prepare data for analysis and machine learning.

It is a serverless service that enables fast exploration, anomaly detection and easy, powerful data preparation.

Where Google Cloud Dataprep fits into the workflow

Think of Google Cloud Dataprep as an Excel-like GUI with intelligent suggestions.

Google AI Platform

Google AI Platform (Credits)

Google AI Platform is a managed service that enables you to easily build machine learning models that work on any type of data, of any size. AI Platform lets the user prepare, build, run, manage and share the end-to-end machine learning cycle of model development.

Machine Learning Development: the end-to-end cycle (Credits)

AI Platform is the next generation of Cloud ML Engine.

ML Workflow (Credits)

AI Platform is a fully managed platform that scales to tens of CPUs/GPUs/TPUs, on which you train your model and automate the process.

Google Data Studio

Google Data Studio (Credits)

It’s simple. Google Data Studio is an online, interactive dashboarding tool with its data sources on GCP. Work done by multiple developers can be synchronized on a single platform.

It is not as advanced as Power BI or Tableau, but it can get the job done. It is part of the G Suite and can connect to BigQuery, Cloud SQL, GCS, Google Cloud Spanner, etc.

Photo by Lukas Blazek on Unsplash

I would recommend you read and re-read the above article a few times to get a hold of The Cloud: Google Cloud Platform. As always, the best way to understand the platform is to deep-dive into the hands-on aspect of Google Cloud Platform. This took a considerable amount of effort; I would appreciate your thoughts and comments if you felt that this helped you in some way. Feel free to reach out to me to discuss the application of Data Science on the Cloud further.

So, a little about me. I’m a Data Scientist at a top Data Science firm, currently pursuing my MS in Data Science. I spend a lot of time studying and working. Show me some love if you enjoyed this! 😄 I also write about the millennial lifestyle, consulting, chatbots and finance! If you have any questions or recommendations on this, please feel free to reach out to me on LinkedIn or follow me here, I’d love to hear your thoughts!

As part of the next steps of this series, I shall be publishing more use-cases on how we can leverage the Cloud in the real world. Feel free to follow me and connect with me for more!
