Whether you are preparing a Data Analyst interview, learning SQL, or simply interested in solving business challenges through data analysis. It’s common to have a challenge: I understood how a database works but don’t know where to find a project? There are many test banks, but they are not business problems, leaving a gap between knowledge and practical experience.
In this article, I’ll walk you through how to use a public dataset to experience the journey of data analysis. Use COVID-19 Public Dataset from BigQuery as an example.
For more than a year, many countries are in different stages when talking about COVID-19. Some countries have lifted the ban on wearing masks, but others are still in lockdown. This dataset helps us to answer questions such as: What’s the recovery rate of country A? or Which countries have dealt with this pandemic the best?
- Challenge
- Solution
- Steps
- Thoughts
- Limits
- Conclusion
Challenge
These are challenges when we are learning SQL
- Figuring what’s the practice of different syntax
- Lack of scenario
- Lack of practice
- Lack of data
- Lack of end to end experience
- Missing data
- Lack of business sense and understanding of metrics, such as retention rate or convert rate
Solution
There are many options when it comes to SQL, Business Intelligence, and ETL. In this article, I’ll use BigQuery.
What is BigQuery, and why I choose it?
BigQuery is a popular data analytics solution on the Google Cloud Platform. It’s serverless and highly scalable. On top of that, it has many public datasets to use.
About the Dataset
The dataset‘s source is from different organizations, including New York Times, European Centre for Disease Prevention and Control, Google, Global Health Data from the World Bank, OpenStreetMap, etc. It’s free to use for education or research purposes until September 2021.
The data size is 6.9GB and continually growing. Over 120 million rows of data with over 700 columns. It contains data such as the number of confirmed cases, death, gender, age, and recovery.
Steps
STEP 1: Loading data to BigQuery
Prepare a GCP account or start a free trial on Qwiklab, then use this dataset.
When you see the screenshot below, you are ready.
STEP 2: Data Cleansing
A good practice I recommend keeping is always checking the duplication before any analysis. You can achieve it by running the query below.
SELECT COUNT(*) AS num_duplicate_rows,
FROM `the_table_you_want_to_check`
GROUP BY column1,column2,column3
HAVING num_duplicate_rows > 1;
STEP 3: Practice your SQL skill
It’s time to practice your SQL skill.
Below is an example to find which country has the best recovery rate on May 10th, 2020.
WITH cases_by_country AS (
SELECT
country_name AS country,
sum(cumulative_confirmed) AS cases,
sum(cumulative_recovered) AS recovered_cases
FROM bigquery-public-data.covid19_open_data.covid19_open_data WHERE
date = '2020-05-10'
GROUP BY
country_name ), recovered_rate AS
(SELECT country, cases, recovered_cases, (recovered_cases * 100)/cases AS recovery_rate
FROM cases_by_country)SELECT
country,
cases AS confirmed_cases,
recovered_cases,
recovery_rate
FROM
recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate desc
LIMIT 10
The query result shows the answer is France.
STEP 4: Visualization
Finally, don’t forget to visualize it as it helps us better communicate with the audience. Click "Explore Data" and link with Data Studio, the reporting solution by Google, to demonstrate the result in different angles.
Thoughts
Using quality data so you can focus on what matters the most: data analysis
Ingredients are everything in terms of the quality of food. Data analysis can be similar because analysts no longer need to worry about data integrity or consistency. As you may not surrounding by quality data, the open dataset would be a good choice.
Notably, one advantage of Cloud Computing is that the datasets are save in a data lake. When needing it, we open the tap, and the data flows into services such as BigQuery. There’s no need to wait for data in transit.
Develop your business understanding along the way
What are the key features when predicting a sales forecast? What are the key metrics to evaluate the impact of a policy? Even tackled hundreds of Sql challenges, these questions remain unanswered. That’s why practicing with contexts, such as scenarios and use cases, is essential as it helps us link the query with a real-world problem.
Brainstorm another approach demonstrates you master a skill
Using one method to tackle a challenge is great, but how about come up with another one?
I was in a conversation with a friend who’s very talented in math. He was top 1% on the admission test of the university. At first, I thought he just practiced a lot of questions. Surprisingly, he didn’t practice a large number of questions. Instead, he would consider a second or even a third way to solve the same problem. The tip inspired me when learning something new. Perhaps we shouldn’t try to solve as many problems from the past exams. Instead, thinking about a new approach to solve the same problem!
Limits
Activity Tracking sometimes doesn’t work.
If you are using the Quest provided by GCP to practice, it uses activity tracking monitors your work as you complete the parts of a lab. Sometimes, it only counts a specific column or value. For example, I used country_code in a query to calculate the number of confirmed cases but didn’t get the point. Finally, I figured out I should use country_name instead. The idea is the same, but we need to accommodate the mechanism.
Data’s level of detail varies by country
The granularity of the U.S data is the best. For example, you can check the cases by each state. But can’t find the same level of granularity in other countries.
The project isn’t designed for interview preparation
If you are looking for interview preparation materials, Leetcode or Hackerrank will be better options as these platforms can focus on a specific skill to enhance.
I’m interested in analyzing the data to find a solution and have always been looking for quality datasets to explore. I found the public datasets offered by BigQuery helpful. I’ll recommend using the platform to experience the whole process, from data to a conclusion. It’s an excellent place to start no matter you are looking for a project to practice your SQL skill, curious about how cloud computing works, or simply interested in solving problems.
Disclaimer: This is a write-up after finishing Insight from Data With BigQuery. Feel free to refer my query if you need more details.