The world’s leading publication for data science, AI, and ML professionals.

Exploratory Data Analysis with BigQuery SQL? Easy!

A complete Python comparison and step-by-step guide for any dataset, using Kaggle user churn data.

Getting Started

Exploratory Data Analysis in Google Data Studio. Image by author.

Can we perform Exploratory Data Analysis with SQL?

— Yes, we can.

What is this article about?

It is about Exploratory Data Analysis (EDA) and aims to answer the following questions:

  • What is Exploratory Data Analysis (EDA)?
  • How to perform Exploratory Data Analysis (EDA) in Pandas (Python)?
  • How to perform Exploratory Data Analysis (EDA) in BigQuery SQL and how is it different from Pandas?
  • How to use dynamic SQL in BigQuery for Exploratory Data Analysis (EDA)?
  • How to create visualisations to explore your dataset in BigQuery / Pandas?
  • How to use Pandas/ BigQuery SQL to analyse relationships between variables for feature selection?

Who is this article for?

  • Marketers who might want to analyse the data like Data scientists and create dashboards.
  • Analysts who have been asked to perform EDA and cleanse the data.
  • Machine learning and AI practitioners who have always used Python before.

Very often it is easier to perform analysis using SQL on data you have right in the tables and then move forward to ML/AI/Data science and engineering. These days you can even create machine learning models with SQL. Just check what BigQuery ML can do. Everything seems to be moving to data warehouses.

In this tutorial I will be using a user churn dataset from Kaggle to analyse, cleanse and prepare it for machine learning.

There is a Python notebook attached to this article. I will talk you through each query and explain how to do the same thing using SQL (I will be using BigQuery standard SQL).

Let’s create our table

  1. Clone the repo: git clone https://github.com/mshakhomirov/eda_user_churn.git
  2. Use ./Churn.csv to create a table in BigQuery:

You can create the table easily by simply uploading the Churn.csv file from the repository:

Click your dataset -> Create Table -> upload:

Create a table in BigQuery
Creating a table in BigQuery is simple

What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the data analysis process where we would normally use different techniques to better understand the data we have.

This can refer to a number of things including:

  • Identifying outliers, missing values, human error or biased sampling.
  • Understanding importance of the variables and removing useless ones.
  • Analysing the relationship between dataset features (variables).
  • Ultimately, getting as much as possible from your data and maximising the insights.

Ultimately, data analysis aims to achieve two goals:

– Provide an insight into the relationships between variables.

– Describe the dataset using various stats.


We’ll use a user churn dataset obtained from Kaggle. It contains data about customers who are withdrawing their account from a bank. In other words we’ll be analysing churn.

Components of EDA

To me, there are two main components of exploring data:

  1. Understanding your variables:
     • Import data from the database (do we really need this if it’s already there?)
     • Get summary statistics
     • Data with group filtering
     • Data with time series
     • Data with window calculations

  2. Analyzing relationships between variables:
     • Visualisation (usually helps straight away)
     • Correlation analysis (that’s what seasoned data scientists would normally do before moving to machine learning)

These steps help to identify outliers, missing values, input and measurement errors, dataset imbalance and biased data collection, and to analyse the range of categorical/continuous features so we select only valid ones.

Understanding your data

Summary stats

Python:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns

# Understanding my variables
df.shape
df.head()
df.columns
df.dtypes

SQL:

Same in BigQuery Standard SQL (we will be using standard SQL) for df.shape:

SELECT
    COUNT(DISTINCT column_name),
    (SELECT COUNT(*) FROM `your-client.staging.churn`)
FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'churn'
BigQuery shape table

Columns and variable types for df.dtypes and df.columns

You can now use INFORMATION_SCHEMA – a series of views that provide access to metadata about datasets, tables, and views:

SELECT * EXCEPT(is_generated, generation_expression, is_stored, is_updatable)
FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'churn'
BigQuery information schema

Count unique values per column

Python:

df.nunique(axis=0)
Unique values

Just one line of code in Python. Let’s see if we could do the same in SQL.

SQL:

This is where SQL starts falling apart. It is beautiful in Python: the df.nunique function gives us stats for every column in our table with just one (!) call. SQL was simply not designed for this kind of high-level data analysis.

You now can get unique values for each column in SQL with just one operation.

Just run this code in BigQuery (replace your-client with your project name):

Result:

Unique values per column in BigQuery

Describe a dataset

  • How to describe a dataframe or a table with Python?
  • How to do the same with just SQL?

Python:

df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))
describe dataframe in Python

SQL:

Remember this describe function works for numerical features only. Let’s create our own function to use in BigQuery SQL.

Firstly we need to adjust our SET columns variable to use only numerical columns from table schema:

SET columns = (
  WITH all_columns AS (
    SELECT column_name
    FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'churn'
      AND data_type IN ('INT64', 'FLOAT64')
  )
  SELECT ARRAY_AGG(column_name) AS columns FROM all_columns
);

There is an easy way to get nifty Pandas-like describe() statistics for your BigQuery table using standard SQL.

Then I will add mean, max, min, median and the 0.25/0.75 percentiles, so the final SQL would look like this:

Result:

describe() a table with SQL (Pandas style)

Probably worth adding ROUND() for better readability. Now you have your own function which returns a Pandas-like describe() result for your BigQuery table.

Missing values

Let’s check if we have any missing values.

Python:

df.isnull().sum()

SQL:

The same script as we ran before. Just add:

 COUNTIF(your_column IS NULL)

Dynamic SQL in BigQuery simplifies and speeds up Exploratory Data Analysis

Outliers

This is to remove any values outside of the set boundaries.

How to identify these boundaries? Well, that’s a good question. The first approach is to simply remove, for example, the first and last 5 percentiles of your sample. The second thing we could do is calculate the mean and standard deviation of a given sample, then set the cut-off for identifying outliers at 3 standard deviations from the mean. Lastly, you could simply use your intuition and cut off the rows where you think the values are too high/low.

Try these outlier detection methods:

  • Interquartile Range Method
  • Standard Deviation Method
  • Automatic Outlier Detection

Standard Deviation Method

Let’s flag an estimated salary as an outlier if it’s more than 3 standard deviations from the mean.

Python:

Result:

Outlier detection with Python

We can see that one salary [301348.88] is way higher than the others. It might be good to remove this one, as it could skew the results if we decide to perform churn prediction later.

So you might want to clean your Pandas df like so (this will delete the rows containing outliers):

df_cleaned = df_cleaned[df_cleaned['EstimatedSalary']<272738.696]

SQL:

We know the standard deviation for the EstimatedSalary column already, right? So a simple SELECT query will do the job:

SELECT * from `your-client.staging.churn` WHERE EstimatedSalary >  272738.696
--OR EstimatedSalary < -72518.216; -- Salary can't be negative. So our cut_off parameter probably needs tweaking.

Or we could use dynamic SQL:

Removing Rows with Null Values

Python:

df = df.dropna(axis=0)

SQL:

SELECT * FROM `your-client.staging.churn` WHERE column IS NOT NULL;

Oh, hold on a sec, is it just one column? We can use dynamic SQL here to do it in one go for all of them:

I think this is not the best way to do it though. I would rather generate a WHERE clause with IS NOT NULL check for each column and then just run it with one SELECT query.

Remember DELETE in SQL performs a table scan each time and might be expensive.

Analyse Data with group filtering

We expect to be able to see how different features affect customer churn. Let’s see how many customers have churned:

Python:

Result:

Churn vs retained customers

Here we used the following libraries to create that chart:

import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline

That looks like a lot of code, and you really need to spend quite some time learning how to use these libraries, especially if you are not a data scientist and have never used Python before.

Let’s see what BigQuery SQL can offer.

SQL:

Result:

BigQuery churn users vs retained

SQL looks way more intuitive. Let’s explore and visualise the data with Data Studio. It will generate the chart we need even without SQL. Let’s go to our churn table, click it, then click Explore with Data Studio and add the exited column to dimensions.

Explore the data with Data Studio
Exploring the data with Data Studio

If you need to see the percentage of churned and retained users next to the exact counts, that can be done too. Just run this SQL and then click Explore with Data Studio:

Then change dimensions to exited_category and metrics as shown on the image below. Then click Add chart at the top and add Bar and Pie charts:

Exploring data with charts in Data Studio

Very simple. It’s up to you to decide which to use: Python or BigQuery SQL.

BigQuery with Data Studio makes exploring your data with visuals really simple.

So why do we do this grouping analysis?

It is important to know if our dataset suffers from data imbalance, which usually reflects an unequal distribution of classes within a dataset.

Often dataset imbalance is caused by sampling bias or measurement error, for example, when all the samples were collected in one geographical region. This may drastically affect a classification problem.

So these simple groupings help us to understand if our dataset is imbalanced or not.

It’s very important for classification models.

Most of the contemporary works in class imbalance concentrate on imbalance ratios ranging from 1:4 up to 1:100. […] In real-life applications such as fraud detection or cheminformatics we may deal with problems with imbalance ratio ranging from 1:1000 up to 1:5000.

Learning from imbalanced data – Open challenges and future directions, 2016.

A slight imbalance is often not a concern, and the problem can often be treated like a normal classification predictive modeling problem. A severe imbalance of the classes can be challenging to model and may require the use of specialized techniques.

Any dataset with an unequal class distribution is technically imbalanced. Performing this group analysis helps us to understand the nature of the imbalance and whether there were any biases or collection errors.

We can see that our dataset doesn’t suffer from severe imbalance, as all the features are distributed more or less equally across the two categories of exited.

Let’s see how it works and take a quick look at the categorical variables we have, e.g. ‘gender’, ‘country’, ‘owns_credit_card’, ‘is_active_member’.

Python:

Result:

Groupings with Python

Again, if you want to produce these plots in Python you really need to know Python. What if you just want to know whether there is a relationship between features? Or in which category of customers churn is greater (e.g. female/male, a particular country, or those who are more active than others)? It’s a simple question and you don’t need to be a data scientist to answer it.

Solution

SQL:

SELECT * FROM `your-client.staging.churn`;

Explore in Data Studio:

Explore categorical features

You can see that simple drag and drop does the same thing. Data Studio charts are interactive so you can click a category and it will filter the dashboard based on your selection. You can do all types of groupings and drill downs too.

You can create interactive graphs and charts like a Data scientist! I think it’s beautiful, especially when it’s free.

Personally, I find it way quicker to create exploratory graphs with Seaborn or Matplotlib.

Data Studio allows you to create a dashboard and forget about re-running your Python notebook if data in tables have changed.

It will update the graphs itself.

Analyzing relationships between variables

Correlation Matrix

I like the correlation matrix because it’s a quick way to display relationships between variables.

What is correlation?

Correlation is usually defined as a measure of the linear relationship between two quantitative variables (e.g. height and weight). The stronger the correlation between two variables, the greater the effect on the target variable: if one changes, the other tends to follow.

Let’s create a correlation matrix.

Python:

Result:

Correlation matrix with python

We can see that there is a positive correlation between Exited and Age and some negative correlation between Exited and isActiveMember. Also you can see how different features correlate with each other.

Google BigQuery has done a great job on their statistical functions.

Now that we have dynamic SQL in BigQuery, we can build a correlation matrix the same way data scientists would normally do it in Python.

Let’s run a simple SQL query to check how the CORR() function works in BigQuery:

SELECT  corr(CreditScore,Balance) FROM `your-client.staging.churn`

Result: 0.006268381616013866

So we simply need to automate this function run for all the columns we have in the table (except those we want to exclude).

SQL:

Voilà!

Correlation matrix with SQL in BigQuery

Visualizing the relationship between variables

In theory if there is a relationship between features you will be able to see it straight away from the graphs:

Example scatterplots of various datasets with various correlation coefficients. Source: Wikipedia

Scatterplot

A scatterplot is a great way to visualise the relationship between two variables, e.g. Age and Salary. It can be very useful and helps to quickly identify outliers. It will also show whether a relationship exists or not.

Python:

df.plot(kind='scatter', x='CreditScore', y='Balance')
Correlation between CreditScore and Balance

We can see that there is no visible correlation between CreditScore and Balance.

There is a way to create scatterplots between all your variables in one go.

Python:

sns.pairplot(df)
sns.pairplot example

No obvious relationships show up. Now let’s see how to do the same in SQL.

Google Data Studio:

Let’s go back to our Explore with Data Studio example. We will have to add scatterplots for our features one by one. Then just colour the bubbles by exited_category in any colour you like.

Category colours in Data Studio
Correlation and scatterplot in Google Data Studio

Box plot

Box plots are useful when we need to check the distribution of a variable.

A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers" using a method that is a function of the inter-quartile range.

A box plot consists of five values:

  • Minimum
  • First Quartile or 25%
  • Median (Second Quartile) or 50%
  • Third Quartile or 75%
  • Maximum

Python:

Nice and easy: we can create automated box plots for each variable:

Box plots with Python

This helps to visualise and explain the relationship between continuous variables and the target variable. You can see straight away that customers who churn are older than those who are retained. It also helps to compare median levels of each variable. For example, there is no difference in the median credit score or tenure between lost and retained customers.

Most of the customers who churn still have a significant balance in their bank account. Estimated salary and the number of products seem not to have any visible effect on customer churn.

Let’s see how to create box plot data in BigQuery. We need data first.

SQL:

Result:

Box plot data in BigQuery SQL

You can now visualise it with Data Studio using a bar chart.

At the time this article was written, box plots didn’t exist in Data Studio. They can be created with custom visualisations though. Let me know if you are interested and I will create one.

Histogram

We’ll use a histogram when we need to see the distribution of just one variable.

Python:

df['Age'].plot(kind='hist', bins=10, figsize=(12,6))
Histogram in Python

SQL:

If we need equal-width buckets we can simply do this:

SELECT COUNT(*) AS frequency, bucket
FROM (
    SELECT customerId, ROUND(Age / 10) * 10 AS bucket
    FROM `your-client.staging.churn`)
GROUP BY bucket
ORDER BY 2

Click Explore with Data Studio and you can build a nice Histogram.

Conclusion

After BigQuery announced its dynamic SQL (scripting) feature, many things became possible. With that scripting ability we can now automate queries, perform Exploratory Data Analysis and visualise the results in Data Studio.

Python still remains a major tool for data scientists and provides great scripting features too. However, if we are talking about just getting the numbers, BigQuery can do the same thing! When it comes to visualising the results, Python definitely helps to create graphs quicker, while Data Studio provides a greater dashboarding experience. You can simply forget about rerunning the queries (or notebooks) once the dashboard is set up. So we just need to create it once.

I hope you enjoyed the reading!

Recommended read:

BigQuery – Get the total number of columns in a BigQuery table

Bigquery query to find the column names of a table

Getting table metadata using INFORMATION_SCHEMA | BigQuery

Using BigQuery to find outliers with standard deviation results combined with WHERE clause

Statistical aggregate functions in Standard SQL | BigQuery

Computing Percentiles In BigQuery

Expressions, functions, and operators in Standard SQL | BigQuery

Data Studio | Google Developers

An Extensive Step by Step Guide to Exploratory Data Analysis

How to Remove Outliers for Machine Learning – Machine Learning Mastery

Detect and exclude outliers in Pandas data frame

Loading Data into Google BigQuery for Exploratory Data Analysis | Qwiklabs

Exploratory queries with BigQuery | Google Cloud Platform Community

Data Science for Startups: Exploratory Data Analysis

How to use Google BigQuery ML and Data Studio for exploratory data analysis

GoogleCloudPlatform/professional-services

BigQuery : is it possible to iterate over an array?

https://machinelearningmastery.com/what-is-imbalanced-classification

Introducing CORR() to Google BigQuery

Calculating Relationships with Correlation Matrices · Advanced SQL · SILOTA

Box plot visualization with Pandas and Seaborn – GeeksforGeeks

Use google bigquery to build histogram graph

How to Make a Box-and-Whisker Plot in SQL

Basic boxplot in d3.js



