
Data Scientists Work in the Cloud. Here’s How to Practice This as a Student (Part 1: SQL)

Forget local Jupyter Notebooks and bubble-wrapped coding courses – here's where to practice with real-world cloud platforms. Part 1: SQL

Image created by author via GPT-4

During the first lockdown in March 2020, in between Netflix episodes of Tiger King and The Last Dance, my housemate and I decided to learn Python.

He found a free course on some university’s website. It was terrible, and he’d quit within a week.

I found one that cost £10 on Udemy, and it was awesome.

But even a great course doesn’t compare to "real world" coding projects, so I still got bored and quit within a month.

Online courses are great – but only up to a point

When you’re first learning to code, you don’t want to faff around with setting up environments and downloading software.

You just want to open up your browser, click Start Course and get coding.

(That was why my friend quit – his course had too much setup faff, and it was too frustrating for a beginner.)

But after a while you realise that you’ll eventually need to learn how to work with cloud platforms and software.

Data scientists don’t write production code on Udemy’s in-browser code editor

But how can you practice coding in "real world environments" when so many online courses and platforms deliberately abstract those parts away?

My journey into the cloud

In this article—Part 1 of my miniseries on doing data science in the cloud—I’ll show you how you can practice coding in a "real-world" environment: BigQuery (part of Google Cloud Platform), which is great for practicing SQL.

Why BigQuery? First, because SQL is a staple for data science. Second, because it’s got a great free sandbox environment.

But, perhaps most importantly, because when I was just starting out with cloud platforms I found it to be very beginner friendly. Let me explain.

When I was first looking for data science jobs, it terrified me to see job requirements like "AWS," "Google Cloud Platform," "BigQuery," "Azure" or "Databricks."

I had no experience with any of these platforms, and I wasn’t sure how to get it. I wrongly assumed that you needed to be some sort of IT boffin to use them.

Practicing SQL on BigQuery helped me debunk that assumption and gain confidence with that mysterious thing called "the cloud."

Besides, even if you don’t end up using BigQuery in your job, I think it’s a good introduction to cloud-based databases. It’s got a lot of similarities to other database systems like those on AWS, Azure, Databricks, Snowflake, etc., and if you learn BigQuery, you’ll be 80% of the way towards knowing those platforms as well.

Setup

Normally at this point I’d just direct you to the docs. But I personally found the BigQuery docs quite confusing as a beginner, so I figured it would be helpful to explain a couple of the key concepts here:

The BigQuery Sandbox

Google maintains a free environment called the BigQuery Sandbox, which lets you use BigQuery for free (no credit card required) and process a whopping 1TB of data per month. The reason I highlight this is because, somewhat confusingly, the sandbox is different from BigQuery’s Free Tier (which requires card details). When using the sandbox, we won’t get access to some of the fancier BigQuery services. But trust me when I say that the sandbox is going to be more than enough for us.

To sign up, go to https://cloud.google.com/bigquery/docs/sandbox, click Go to BigQuery, and follow the prompts.

Image by author

"Projects"

Once you’re inside the BigQuery Sandbox, you’ll need to create a "Project". When I was starting out with BigQuery, I wasn’t quite sure what that meant, but the basic idea is that a BigQuery Project is kind of like a directory which will hold all your data (e.g., any tables you create).
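This hierarchy also shows up directly in your SQL: tables are referenced by a three-part name, `project.dataset.table`. Here’s a minimal sketch of what that looks like (the project, dataset and table names below are hypothetical placeholders, not ones you’ll have yet):

```sql
-- Fully qualified table reference: `project.dataset.table`
-- (`my-first-project`, `my_dataset` and `my_table` are placeholder names)
SELECT *
FROM `my-first-project.my_dataset.my_table`
LIMIT 10;
```

The backticks around the full name are needed whenever a project ID contains hyphens, which most do.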

Note: If you use BigQuery in a corporate setting in the future, you’ll likely be asked to join a pre-existing project (not create your own), and your company will be billed per project (this makes it easy to organise costs). This is what happened to me when I joined my current company, for example. But remember, this doesn’t apply to us right now as we’re just using the sandbox!

To create a project, click Select a project and then New project in the pop-up box which appears. Give your project a name, and press Create project (note: you don’t need to specify an organisation).

Image by author

Exploring the console

If you’ve not seen the inside of a cloud console before, it might look a little daunting, and you might find the docs a little too technical (as I did).

Here’s my quick "beginner tour":

Image by author
  1. Project name – This shows the project in which we are currently working. In a corporate context, you might have lots of pre-existing projects to choose from. For now, we’ve only got one (the one we just created).
  2. Explorer pane – This is where we’ll see all our projects and the schemas of any tables we add to our database.
  3. Query tabs – Each of these tabs represents a ‘Query’ (i.e., a place where we’ll write SQL code).

Let’s get some data!

Hopefully that gives you the core information you need to get started with BigQuery. Now, let’s practice running some queries with BigQuery’s free Public Datasets:

  • Click the ‘+ Add’ button in the dataset Explorer pane on the left of the console and click Public datasets in the pop-up that appears.
Image by author
  • Search for ‘StackOverflow’ – one of the datasets we’ll use – and then click ‘View dataset’.
Image by author. Note: StackOverflow data is released under Creative Commons CC-BY-SA 3.0 license
  • Now you’ll see some dataset info, and you’ll also notice a new drop-down has appeared in the Explorer pane – bigquery-public-data. If you expand it, you’ll see tons of new datasets we’ve just been given access to! Take a quick look around – there’s loads of really cool stuff in there, including Google Trends data, Google Ads data, etc.
Image by author
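If you’d rather explore from SQL than click through the Explorer pane, BigQuery also exposes dataset metadata through INFORMATION_SCHEMA views. As a quick sketch, this lists every table in the StackOverflow dataset:

```sql
-- List all tables in the stackoverflow public dataset
SELECT table_name
FROM `bigquery-public-data.stackoverflow`.INFORMATION_SCHEMA.TABLES
ORDER BY table_name;
```

The same pattern works for any dataset you can see in the Explorer pane.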

Run a query

Let’s write a simple query to get a feel for using SQL in BigQuery.

We’ll use the StackOverflow dataset, which contains tables on StackOverflow questions, answers, users and more.
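Before aggregating anything, it’s worth eyeballing a few rows first. Here’s a quick peek at the questions table (selecting a handful of columns rather than `*` keeps the bytes scanned – and your sandbox quota – small):

```sql
-- Preview a few rows of the questions table
SELECT title, creation_date, answer_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
LIMIT 5;
```

One gotcha worth knowing: LIMIT caps the rows returned, but in BigQuery it does not reduce the bytes scanned – selecting fewer columns is what does that.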

Here’s a simple query which counts the number of questions asked in 2015, and the percentage of those which received answers:

SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  -- Share of questions with at least one answer
  ROUND(100 * SUM(IF(answer_count > 0, 1, 0)) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
HAVING
  Year = 2015
ORDER BY
  Year

When you click Run, you’ll see the output of the code. BigQuery also provides Execution details which give information about how long the code took to run.

Image by author
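Once that runs, it’s easy to riff on. For instance, dropping the HAVING clause turns the one-year snapshot into a year-by-year trend; this variant also uses COUNTIF, BigQuery’s shorthand for the SUM(IF(...)) pattern above:

```sql
SELECT
  EXTRACT(YEAR FROM creation_date) AS Year,
  COUNT(*) AS Number_of_Questions,
  -- COUNTIF(cond) is equivalent to SUM(IF(cond, 1, 0))
  ROUND(100 * COUNTIF(answer_count > 0) / COUNT(*), 1) AS Percent_Questions_with_Answers
FROM
  `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY
  Year
ORDER BY
  Year;
```

Small variations like this are a good way to build intuition for both the SQL and the dataset.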

More practice

If you want more practice, I’ve created 10 exercises which you can use to practice working with BigQuery. You can view them for free on my SQL tutorial website theSQLgym.

One more thing –

Thanks for reading. I hope you found this helpful, and feel free to connect with me on X or LinkedIn, or get my weekly Data Science/AI writing at AI in Five! 🙂
