
Data science is one of the fastest-growing industries in the world, using modern, cutting-edge technology to improve the way we work with data. However, if you’ve worked in data science, you probably know that one day you will inevitably find yourself staring at an Excel sheet. There’s nothing wrong with Excel; it’s just not the kind of tool you would expect to use in one of the most modern industries.
Many organizations have begun adopting modern cloud infrastructure, but not to its full extent. Many data scientists find themselves pulling data from a cloud data warehouse just to train a model on their local machine. There’s nothing wrong with that either, but what if we could bring the entire data science workflow to the cloud? Well, we can!
From data cleaning to model deployment, there’s a cloud-based tool you can use to modernize your workflow. In this article, I’m going to walk through each step of the data science workflow, show how you can move it to the cloud, and provide some examples along the way. Feel free to skip around if you’ve already modernized part of your workflow, but if you want the 100% cloud data science experience, stay tuned!
Data Collection and Storage on the Cloud
Chances are you’re already familiar with the benefits of storing data in the cloud, but in case you haven’t heard: it’s pretty great! Storing your data in the cloud lets you access it from anywhere with an internet connection, integrate it easily with other cloud services, scale your storage capacity as your needs grow, create backups for recovery, and many other very helpful things.
Whether you need a data warehouse, a data lake, or object storage, your data will have to live somewhere if you want other applications and services to use it. There are tons of services that offer cloud data storage; some of the more popular ones include:
- AWS S3
- Azure Blob Storage
- Google Cloud Storage
- Hadoop
- Snowflake
This is not even close to the full list of cloud data storage services, but if you work in data science there’s a very good chance you’ll work with some, if not all, of these eventually. Each service and storage type has its strengths and weaknesses, so pick whichever one you think will work best for your projects!
Regardless of which service you use for cloud data storage, the process of collecting and storing your data has the same general steps. You’ll usually have to make an account with the service, create a storage container or bucket, and then you should be able to upload your data. Depending on which service you use, this can be done through the web interface, command-line tools, SDKs, or APIs.
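For example, if you went with S3, a small upload script using the boto3 SDK might look like the sketch below. The bucket name and file paths are placeholders, and it assumes your AWS credentials are already configured.

# Upload a local file to an S3 bucket (assumes boto3 is installed and AWS credentials are configured)
import boto3

s3 = boto3.client("s3")  # credentials come from your environment or AWS config
bucket_name = "your-project-bucket"   # placeholder bucket name
local_path = "data/raw_reviews.csv"   # placeholder local file
object_key = "raw/reviews.csv"        # where the file will live in the bucket

s3.upload_file(local_path, bucket_name, object_key)
print(f"Uploaded {local_path} to s3://{bucket_name}/{object_key}")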
A best practice when storing data in the cloud is to set up permissions and access control. This isn’t as relevant if you’re working on a solo project, but it is crucial if you work on a team. It’s also important to manage your data, including its structure, metadata, update frequency, and retention. Encryption can keep your data secure and private, and creating backups will protect you from losing any progress and improve your data availability!
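To give a flavor of what that looks like in practice, here’s a small sketch, again using boto3 with a placeholder bucket name, that blocks public access and turns on default server-side encryption for an S3 bucket. Fine-grained permissions would usually be handled with IAM policies on top of this.

# Example guardrails for an S3 bucket: block public access and enable default encryption
import boto3

s3 = boto3.client("s3")
bucket_name = "your-project-bucket"  # placeholder bucket name

# Block all forms of public access to the bucket
s3.put_public_access_block(
    Bucket=bucket_name,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Enable default server-side encryption for new objects
s3.put_bucket_encryption(
    Bucket=bucket_name,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)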
Data Cleaning and Transformation in the Cloud
Now that your data is stored in the cloud, it makes sense to keep it there and perform all the necessary cleaning steps in the cloud too! The benefits are similar to the ones discussed above (access from anywhere, scalability, easy integration, and so on), but you also avoid having to download your cloud data, clean it locally, and re-upload it. If done correctly, the workflow should be pretty seamless!
Here are some examples of tools you can use for cloud data cleaning and transformation. I’ll keep them consistent with the five services listed in the section above, but remember there are many, many other tools at your disposal:
- AWS Glue
- Azure Data Factory
- Google Cloud Dataflow
- Apache Hive
- Snowflake Data Integration
Some services make the cleaning process simple by providing a sample of your data before and after ETL (Extract, Transform, Load). Some tools offer a "code-less" experience where you can just drag and drop commands, while others offer a highly customizable coding experience. You can pick whichever you like based on your preferences! Generally, these tools can work with multiple cloud storage providers, so the whole process is very flexible.
One of my favorite things about online data transformation tools is the visual component: most tools have an interface that shows you the data transformation process step by step.

In my experience, it is substantially easier to explain how data is being transformed when presenting to a manager or an audience when you have a visual like that. Showing and explaining raw Python code can be quite difficult, but it’s easy to walk through each step of a diagram and explain what is happening.
If you were doing this process in Snowflake, it would look something like this: once your account is set up and your data is loaded into Snowflake, explore your dataset – you could look at the raw data or use the Snowsight interface to get a better look at your data’s structure and features. Once you know how your data looks, you can clean it up using built-in tools or SQL. Then, depending on your project’s needs, you can add new columns for further analysis. If you’re doing sentiment analysis on customer reviews, for example, you could write a quick script like this:
-- Sentiment Analysis
-- Assumes your_dataset already contains a numeric sentiment_score column
CREATE OR REPLACE TABLE sentiment_scores AS
SELECT
    product_id,
    customer_id,
    review_text,
    sentiment_score,
    CASE
        WHEN sentiment_score > 0.6 THEN 'Positive'
        WHEN sentiment_score < 0.4 THEN 'Negative'
        ELSE 'Neutral'
    END AS sentiment
FROM your_dataset;

-- Aggregation: average sentiment score per product
CREATE OR REPLACE TABLE aggregated_sentiments AS
SELECT
    product_id,
    AVG(sentiment_score) AS avg_sentiment
FROM sentiment_scores
GROUP BY product_id;
Then once the data is cleaned and/or transformed, you can save it as a new dataset and move on to the next step!
Cloud-Based Data Analysis
Now we’ve got our data uploaded, cleaned, and ready for analysis! We have a lot of options, from notebooks to dashboards, but no matter what your preferences are, there is an option that keeps your workflow in the cloud.
Here are your options if you stay within the ecosystems of the five platforms we’ve been referencing:
- AWS Redshift
- Azure Synapse Analytics
- Google BigQuery
- Apache Spark
- Snowflake Data Warehousing
There are many other tools out there, but these five should get the job done, especially if your cleaned data already resides on their respective platforms. Depending on which tool you use, you’ll have a wide range of capabilities for data analysis, and just like with cleaning, there are many ways of doing it regardless of your proficiency with Python or R. As always, use the tool you like best and the one that fits your project.
Depending on the complexity of your project, data analysis can be pretty simple with any of these tools. For example, in BigQuery you can write a custom SQL query to analyze your data, and on top of that you can quickly generate visuals and explore your data further. If you prefer working in notebooks, you can also pull your data from BigQuery straight into a Google Colab notebook, analyze it, and, if you make any changes, write it back as a separate dataset.
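As a rough sketch of what that notebook workflow could look like, here’s a minimal example with the google-cloud-bigquery Python client. The project ID and table names are placeholders, and it assumes you’ve already authenticated to Google Cloud and have pandas installed.

# Query a BigQuery table into a pandas DataFrame (assumes google-cloud-bigquery, pandas, and pyarrow are installed)
from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project ID

query = """
    SELECT product_id, AVG(sentiment_score) AS avg_sentiment
    FROM `your-project-id.reviews.sentiment_scores`   -- placeholder table
    GROUP BY product_id
"""
df = client.query(query).to_dataframe()  # run the query and pull results into pandas
print(df.head())

# After analysis, you can write a modified DataFrame back as a new table
client.load_table_from_dataframe(df, "your-project-id.reviews.avg_sentiment_by_product").result()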
Now that your data is analyzed, you probably have a good idea of how you want to present it – luckily for you the next step in the process, visualization, can also be done fully on the cloud!
Data Visualization on the Cloud
A theme you might be noticing throughout this article is how easily each step of this workflow integrates with the next. We’ve uploaded our data, cleaned it, and analyzed it, and now we’re ready to visualize it, all without downloading a single file!
There are many tools you can use to create awesome, cloud-based data visualizations. Each of the five cloud platforms we’ve been following has its own set of visualization tools, but here are some other tools that integrate easily with our data management systems:
- Tableau Online
- Power BI
- Looker
- Qlik Sense
- Plotly Dash
You can easily create a clean, informative visual for your analysis or create an interactive dashboard depending on your needs. Tableau Online, for example, also has a great community of creators that share their visualizations for all to see. Taking a look at their Viz of the Day in Tableau Public has been a great source of inspiration for some of my visuals.
The process is pretty straightforward: all you have to do is connect your visualization tool of choice to your data storage tool of choice, and from there you can create amazing visuals entirely online! These tools usually have great libraries of chart types that are informative and visually appealing. You’ll also usually be able to interact with the visuals and get real-time updates as your cloud-hosted data changes. If you want, you can also embed your visuals in other web apps or sites; the whole process is very customizable.
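And if you prefer a code-first option like Plotly Dash, a tiny dashboard can be sketched out in a few lines. The data below is a placeholder inline sample; in a real workflow you’d pull it from your cloud warehouse using a client like the BigQuery example above.

# Minimal Plotly Dash app showing average sentiment per product (assumes dash, plotly, and pandas are installed)
import pandas as pd
import plotly.express as px
from dash import Dash, dcc, html

# Placeholder data -- in practice this would come from your cloud data warehouse
df = pd.DataFrame({
    "product_id": ["A", "B", "C"],
    "avg_sentiment": [0.72, 0.35, 0.55],
})

app = Dash(__name__)
app.layout = html.Div([
    html.H1("Average Sentiment by Product"),
    dcc.Graph(figure=px.bar(df, x="product_id", y="avg_sentiment")),
])

if __name__ == "__main__":
    app.run(debug=True)  # serves the dashboard locally; deploy it to a cloud host to share it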
Cloud-Based Machine Learning and Modeling
This is probably the area of data science where leveraging cloud computing makes the most sense. Training and testing a model can be very demanding for your computer, so why not offload that work to a dedicated server instead? This is just one of the advantages of cloud-based Machine Learning (ML) and modeling.
Cloud platforms will usually provide pre-built models as well, which makes things easy if you just need a quick model, and if you’re not an ML expert there are AutoML services that will make suggestions for you – all without writing a single line of code. Of course, for the ML engineers out there, there are also highly customizable options that offer hyperparameter tuning and MLOps capabilities to ensure your model is built to your exact specifications.
Here are a few examples of cloud tools you can use for machine learning and modeling:
- AWS SageMaker
- Azure Machine Learning
- Google Cloud AI Platform
- Databricks
- Kubeflow
If you like to write the code for your own models, the process with SageMaker looks something like this. First you’ll load your data from S3, then create a SageMaker notebook to write your code. SageMaker has built-in algorithms like XGBoost, but you can also create custom models using classic Scikit-Learn libraries. You can specify your model’s algorithm and tune hyperparameters in your code. When you’re ready to train and test your model, SageMaker will handle all of the computing resources, which will save you a ton of time. One of the coolest parts of this whole process is that once you’re done, you can make the trained model accessible via an API endpoint and use it wherever you want!
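Here’s a rough sketch of that flow using the SageMaker Python SDK with the built-in XGBoost algorithm. The IAM role, bucket paths, hyperparameters, and container version are placeholders you’d adapt to your own account and problem.

# Train and deploy a built-in XGBoost model with the SageMaker Python SDK (assumes the sagemaker package is installed)
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/YourSageMakerRole"  # placeholder IAM role

# Look up the managed XGBoost container image for your region (version is a placeholder)
image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://your-project-bucket/model-output/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100, max_depth=5)

# SageMaker spins up the training instances, runs the job, and tears them down for you
estimator.fit({"train": TrainingInput("s3://your-project-bucket/train.csv", content_type="text/csv")})

# Deploy the trained model behind a real-time endpoint you can call from anywhere
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")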
If you don’t like to write code or want a tool to suggest a model for you, Azure Machine Learning has a tool called Azure AutoML that will work great for you. Similar to the example above, you’ll load your data from your respective data warehouse, but once you get to the modeling portion you can either have Azure suggest a model for you, or pick from their library of algorithms to create your own. The process is highly customizable, but can still be done with a no-code interface.
However you want to create a machine learning model, there’s likely a cloud-based tool out there for you. There’s also a very good chance that whichever tool you use will integrate with the other tools we’ve discussed in earlier steps of the process.
Deploying Data Science Solutions on the Cloud
Now that we’ve trained our models, we can use the cloud to turn our insights and algorithms into real-world solutions. Here you can really see the benefit of using the cloud, because your solutions will be accessible from anywhere and can scale massively to answer all kinds of questions. Using the cloud also means your trained model can continue to learn and improve, and as you get results from your model, you can upload those results, clean them, and visualize them using the methods we’ve discussed throughout this article.
As you can probably tell, I really enjoy this cloud workflow and how nicely everything integrates together.
For deploying your model you’ve got plenty of options, but here are a few to consider:
- Kubernetes (AWS, Azure, GCP)
- AWS Lambda
- Azure App Service
- Google Cloud App Engine
- Heroku
Depending on what your data science solution is, your tool of choice will vary. If, for example, you’re designing a web app that uses a model you’ve built online, you can use Kubernetes to deploy and scale that solution. The process starts by packaging your app and model into a Docker container, which is an executable package that includes everything you need to run your app. You then store that container image in a container registry that Kubernetes can access (GCP, AWS, and Azure all have one!). Finally, you create a cluster in the cloud and write a simple configuration file (YAML) to tell Kubernetes how to run your web app from the Docker container.
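To make the "app and model" part more concrete, here’s a minimal sketch of the kind of script you might copy into that Docker container: a tiny Flask service that loads a saved model and exposes a prediction endpoint. The model path, port, and request format are assumptions for illustration.

# Minimal Flask app serving predictions from a saved model (assumes flask, joblib, and scikit-learn are installed)
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("model.joblib")  # placeholder path baked into the Docker image

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # e.g. {"features": [[0.1, 0.2, 0.3]]}
    prediction = model.predict(payload["features"])   # run the trained model on the request data
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # Kubernetes routes traffic to this port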
Once everything is configured to your liking, you can serve your web app to as many users as you need! You can get real-time feedback and analytics about your model, all of which can be stored in the cloud and visualized. Whichever managed Kubernetes service you use will handle the computing and keep things running smoothly, and you’ll be able to feed new data to your model to continually improve it!
That was a lot; this is unfortunately the most complicated step in the process. If you’d like a visual walkthrough, you can also check out this YouTube video from mildlyoverfitted, who does a great job showing the deployment process with Kubernetes.
Other Considerations
While I have just spent this entire article going over all the benefits of moving your workflow to the cloud, there are a few things you should also keep in mind before you fully commit to 100% cloud-based tools.
The first thing to know is that it can get VERY expensive if you rely on cloud-based tools for your entire workflow. While there are plenty of free options out there, if you scale up your work, it’s only a matter of time until you get hit with a large bill for storage or computing power. There’s also the issue of internet dependency: your workflow relies heavily on the quality of your connection, and some systems will have outages that disrupt your work and productivity. It’s also important to diversify your skills in case a system or service changes or abruptly shuts down, so that you can still do your work.
These downsides don’t apply to all cloud-based tools, however; it’s just important to keep these things in mind so you can make an informed decision about how you want to do your work.
Conclusion
We’ve gone from uploading our data to deploying a machine learning model, using modern cloud-based tools the whole way. I think that’s pretty neat! I hope reading this has inspired you to modernize some or all of your workflow, and if not, I hope I’ve at least shown you that it is possible to do data science entirely on the cloud. A modern workflow for a modern industry – it just feels right!
This article is just an overview of what’s possible. If you’re interested in reading more about this topic, here are a few resources you can check out:
- An overview of all the data science tools on Google Cloud Platform
- Snowflake’s Data Science Guide
- Some practical reasons to move to the cloud
Thank you for reading!
Want More From Me?
- Follow me on Medium
- Support my writing by signing up for Medium using my referral link
- Connect with me on LinkedIn and Twitter
- Check out my Data Science with Python guide on benchamblee.blog