
Can we perform Exploratory Data Analysis with SQL?
— Yes, we can.
What is this article about?
It is about Exploratory Data Analysis (EDA) and aims to answer the following questions:
- What is Exploratory Data Analysis (EDA)?
- How to perform Exploratory Data Analysis (EDA) in Pandas (Python)?
- How to perform Exploratory Data Analysis (EDA) in BigQuery SQL and how is it different from Pandas?
- How to use dynamic SQL in BigQuery for Exploratory Data Analysis (EDA)?
- How to create visualisations to explore your dataset in BigQuery / Pandas?
- How to use Pandas / BigQuery SQL to analyse relationships between variables for feature selection?
Who is this article for?
- Marketers who might want to analyse data like data scientists do and create dashboards.
- Analysts who have been asked to perform EDA and cleanse the data.
- Machine learning and AI practitioners who have always used Python for this before.
Very often it is easier to perform the analysis with SQL, right where the data sits in your tables, and then move on to ML/AI/data science and engineering. These days you can even create machine learning models with SQL. Just check what BigQuery ML can do. Everything seems to be moving to data warehouses.
In this tutorial I will be using a user churn dataset from Kaggle to analyse, cleanse and prepare it for machine learning.
There is a Python notebook attached to this article. I will talk you through each query and explain how to do the same thing using SQL (I will be using BigQuery standard SQL).
Let’s create our table
- Clone the repo:
git clone https://github.com/mshakhomirov/eda_user_churn.git
- Use ./Churn.csv to create a table in BigQuery.
You can create the table easily by simply uploading the Churn.csv file from the repository:
Click your dataset -> Create Table -> upload:
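Alternatively, if you prefer the command line, something along these lines should work with the bq CLI (a sketch, assuming your default project and an existing staging dataset; --autodetect infers the schema from the CSV header):
bq load --autodetect --source_format=CSV staging.churn ./Churn.csv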


What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA), also known as Data Exploration, is a step in the Data Analysis Process where we would normally use different techniques to better understand the data we have.
This can refer to a number of things including:
- Identifying outliers, missing values, human error or biased sampling.
- Understanding the importance of the variables and removing useless ones.
- Analysing the relationship between dataset features (variables).
- Ultimately, getting as much as possible from your data and maximising the insights.
Overall, Data Analysis aims to achieve two goals:
- Provide an insight into the relationships between variables.
- Describe the dataset using various stats.
We’ll use a user churn dataset obtained from Kaggle. It contains data about customers who are closing their accounts with a bank. In other words, we’ll be analysing churn.
Components of EDA
To me, there are two main components of exploring data:
- Understanding your variables:
  - Import data from the database (do we really need this step if the data is already there?)
  - Get summary statistics
  - Data with group filtering
  - Data with time series
  - Data with window calculations
- Analyzing relationships between variables:
  - Visualization (usually helps straight away)
  - Correlation analysis (that's what seasoned data scientists would normally do before moving to machine learning)
These steps help to identify outliers, missing values, input and measurement errors, dataset imbalance and biased data collection, and to analyse the range of categorical/continuous features so that only valid ones are selected.
Understanding your data
Summary stats
Python:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns

# Load the dataset (Churn.csv from the repository)
df = pd.read_csv('./Churn.csv')

# Understanding my variables
df.shape
df.head()
df.columns
df.dtypes

SQL:
Same in BigQuery Standard SQL for df.shape:
SELECT
  COUNT(DISTINCT column_name) AS column_count,
  (SELECT COUNT(*) FROM `your-client.staging.churn`) AS row_count
FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'churn'

Columns and variable types for df.dtypes and df.columns
You can now use INFORMATION_SCHEMA, a series of views that provide access to metadata about datasets, tables, and views:
SELECT * EXCEPT(is_generated, generation_expression, is_stored, is_updatable)
FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'churn'

Count unique values per column
Python:
df.nunique(axis=0)

Just one line of code in Python. Let’s see if we could do the same in SQL.
SQL:
This is where it starts falling apart: it is beautiful in Python, but SQL was not designed for this kind of high-level data analysis. The df.nunique function gives us stats for each column in our table with just one (!) call.
Still, with BigQuery scripting you can now get unique values for each column in SQL in one go.
Just run this code in BigQuery (replace your-client with your project name):
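Something along these lines will do it (a minimal sketch, assuming the churn table created above; it builds one COUNT(DISTINCT ...) query per column and runs their UNION ALL with EXECUTE IMMEDIATE):
DECLARE columns ARRAY<STRING>;
DECLARE query STRING;

-- Collect every column name from the table schema
SET columns = (
  SELECT ARRAY_AGG(column_name)
  FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'churn'
);

-- One COUNT(DISTINCT ...) query per column, stitched together with UNION ALL
SET query = (
  SELECT STRING_AGG(
    FORMAT("SELECT '%s' AS column_name, COUNT(DISTINCT %s) AS unique_values FROM `your-client.staging.churn`", c, c),
    ' UNION ALL ')
  FROM UNNEST(columns) AS c
);

EXECUTE IMMEDIATE query;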
Result:

Describe a dataset
- How to describe a dataframe or a table with Python?
- How to do the same with just SQL?
Python:
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'f')))

SQL:
Remember, this describe function works for numerical features only. Let's create our own version in BigQuery SQL.
Firstly, we need to adjust the SET columns variable from the script above to use only numerical columns from the table schema (the DECLARE statements stay the same):
SET columns = (
  WITH all_columns AS (
    SELECT column_name
    FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'churn'
      AND data_type IN ('INT64', 'FLOAT64')
  )
  SELECT ARRAY_AGG(column_name) AS columns FROM all_columns
);
This gives us an easy way to get nifty Pandas-like describe() statistics for a BigQuery table using Standard SQL. I will add mean, max, min, median and the 0.25/0.75 percentiles, so the final SQL looks like this:
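A minimal sketch of that final query (a reconstruction under the same assumptions, reusing the columns variable and DECLARE statements from above):
SET query = (
  SELECT STRING_AGG(FORMAT("""
    SELECT
      '%s' AS column_name,
      COUNT(%s) AS count,
      AVG(%s) AS mean,
      STDDEV(%s) AS std,
      MIN(%s) AS min,
      APPROX_QUANTILES(%s, 4)[OFFSET(1)] AS percentile_25,
      APPROX_QUANTILES(%s, 4)[OFFSET(2)] AS median,
      APPROX_QUANTILES(%s, 4)[OFFSET(3)] AS percentile_75,
      MAX(%s) AS max
    FROM `your-client.staging.churn`""",
    c, c, c, c, c, c, c, c, c), ' UNION ALL ')
  FROM UNNEST(columns) AS c
);

EXECUTE IMMEDIATE query;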
Result:

Probably worth adding ROUND() for better readability. Now you have your own function which returns a Pandas-like describe() result for your BigQuery table.
Missing values
Let’s check if we have any missing values.
Python:
df.isnull().sum()
SQL:
Same script as we ran before. Just add:
COUNTIF(your_column IS NULL)
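Put together, a minimal sketch (same assumptions as above, this time over all columns, not just the numerical ones):
DECLARE query STRING;

-- One COUNTIF(... IS NULL) check per column
SET query = (
  SELECT STRING_AGG(
    FORMAT("SELECT '%s' AS column_name, COUNTIF(%s IS NULL) AS missing_values FROM `your-client.staging.churn`", column_name, column_name),
    ' UNION ALL ')
  FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'churn'
);

EXECUTE IMMEDIATE query;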
Dynamic SQL in BigQuery simplifies and speeds up Exploratory Data Analysis.
Outliers
This is to remove any values outside of the set boundaries.
How do we identify these boundaries? Well, that's a good question. The first approach is to simply remove, for example, the first and last 5 percentiles of your sample. The second is to calculate the mean and standard deviation of a given sample, then set the cut-off for outliers at more than 3 standard deviations from the mean. Lastly, you could simply use your intuition and cut off the rows where you think the values are too high or too low.
Try these outlier detection methods:
- Interquartile Range Method
- Standard Deviation Method
- Automatic Outlier Detection
Standard Deviation Method
Let’s check outliers for estimated salary if it’s more than 3 standard deviations from the mean.
Python:
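A minimal sketch (assuming the df we loaded earlier; the cut-off values quoted below fall out of this calculation):
# Standard deviation method: cut-offs at 3 standard deviations from the mean
mean = df['EstimatedSalary'].mean()
cut_off = 3 * df['EstimatedSalary'].std()
lower, upper = mean - cut_off, mean + cut_off
print(lower, upper)

# Rows falling outside the boundaries
outliers = df[(df['EstimatedSalary'] < lower) | (df['EstimatedSalary'] > upper)]
print(outliers['EstimatedSalary'].values)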
Result:

We can see that one salary [301348.88] is way higher than the others. It might be good to remove it, as an outlier like this could skew the results later, for example if we decide to build a churn prediction model.
So you might want to clean your Pandas df like so (this will delete the rows containing outliers):
df_cleaned = df_cleaned[df_cleaned['EstimatedSalary']<272738.696]
SQL:
We know the standard deviation for the EstimatedSalary column already, right? So a simple SELECT query will do the job:
SELECT * from `your-client.staging.churn` WHERE EstimatedSalary > 272738.696
--OR EstimatedSalary < -72518.216; -- Salary can't be negative. So our cut_off parameter probably needs tweaking.
Or we could use dynamic SQL:
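For instance, a sketch that computes the 3-sigma cut-offs on the fly for any column (target_column is a hypothetical variable name):
DECLARE target_column STRING DEFAULT 'EstimatedSalary';  -- hypothetical: swap in any numeric column
DECLARE query STRING;

-- Build the 3-sigma filter for the chosen column at runtime
SET query = FORMAT("""
  SELECT *
  FROM `your-client.staging.churn`
  WHERE %s > (SELECT AVG(%s) + 3 * STDDEV(%s) FROM `your-client.staging.churn`)
     OR %s < (SELECT AVG(%s) - 3 * STDDEV(%s) FROM `your-client.staging.churn`)""",
  target_column, target_column, target_column,
  target_column, target_column, target_column);

EXECUTE IMMEDIATE query;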
Removing Rows with Null Values
Python:
df = df.dropna(axis=0)
SQL:
SELECT * FROM `your-client.staging.churn` WHERE column IS NULL;
Oh, hold on a sec, is that just one column? We could use dynamic SQL here to handle all columns in one go, but deleting rows column by column is not the best way to do it: DELETE performs a table scan each time and might be expensive. I would rather generate a WHERE clause with an IS NOT NULL check for each column and then run it with a single SELECT query.
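A minimal sketch of that approach (same assumptions as before):
DECLARE where_clause STRING;

-- Generate "col1 IS NOT NULL AND col2 IS NOT NULL AND ..." for every column
SET where_clause = (
  SELECT STRING_AGG(column_name || ' IS NOT NULL', ' AND ')
  FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'churn'
);

EXECUTE IMMEDIATE 'SELECT * FROM `your-client.staging.churn` WHERE ' || where_clause;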
Analyse Data with group filtering
We expect to be able to see how different features affect customer churn. Let's see how many customers have churned:
Python:
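A chart like this takes only a couple of seaborn calls (a sketch, assuming the Exited column from the Kaggle schema and the df loaded earlier):
import matplotlib.pyplot as plt
import seaborn as sns

# Count of retained (0) vs churned (1) customers
ax = sns.countplot(x='Exited', data=df)
ax.set_title('Customer churn')
plt.show()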
Result:

Here we used the following libraries to create that chart:
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.cm as cm
%matplotlib inline
That looks like a lot of code, and you really need to spend quite some time learning how to use these libraries, especially if you are not a data scientist and have never used Python before.
Let’s see what BigQuery SQL can offer.
SQL:
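A query along these lines does the job (a sketch, assuming the Exited column):
SELECT Exited, COUNT(*) AS customers
FROM `your-client.staging.churn`
GROUP BY Exited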
Result:

SQL looks way more intuitive. Let's explore and visualise the data with Data Studio. It will generate the chart we need even without SQL. Go to the churn table, click it, then click Explore with Data Studio and add the exited column to dimensions.


If you need to see the percentage of churned and retained users next to the exact counts, that can be done too. Just run this SQL and then click Explore with Data Studio:
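A sketch of such a query (exited_category is just a label derived here for the chart):
SELECT
  IF(Exited = 1, 'churned', 'retained') AS exited_category,
  COUNT(*) AS customers,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS percentage
FROM `your-client.staging.churn`
GROUP BY exited_category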
Then change dimensions to exited_category and set the metrics as shown in the image below. Then click Add chart at the top and add Bar and Pie charts:

Very simple. It's up to you to decide which one to use: Python or BigQuery SQL.
BigQuery with Data Studio makes exploring your data with visuals really simple.
So why do we do this grouping analysis?
It is important to know if our dataset suffers from data imbalance, which usually reflects an unequal distribution of classes within a dataset.
Often dataset imbalance is caused by sampling bias or measurement error, for example when all the samples were collected in one geographical region. This can affect a classification problem drastically.
So these simple groupings help us to understand whether our dataset is imbalanced or not, which is very important for classification models.
Most of the contemporary works in class imbalance concentrate on imbalance ratios ranging from 1:4 up to 1:100. […] In real-life applications such as fraud detection or cheminformatics we may deal with problems with imbalance ratio ranging from 1:1000 up to 1:5000.
— Learning from imbalanced data – Open challenges and future directions, 2016.
A slight imbalance is often not a concern, and the problem can often be treated like a normal classification predictive modeling problem. A severe imbalance of the classes can be challenging to model and may require the use of specialized techniques.
Any dataset with an unequal class distribution is technically imbalanced, and performing this group analysis helps us to understand the nature of the imbalance and whether there was any bias or collection error.
We can see that our dataset doesn't suffer from severe imbalance, as all the features are distributed more or less equally across the two main categories of exited.
Let's see how it works and take a quick look at the categorical variables we have, e.g. 'gender', 'country', 'owns_credit_card', 'is_active_member'.
Python:
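A sketch of the kind of code involved (column names assumed from the Kaggle churn schema):
import matplotlib.pyplot as plt
import seaborn as sns

# Churn counts split by each categorical variable
categorical = ['Geography', 'Gender', 'HasCrCard', 'IsActiveMember']
fig, axes = plt.subplots(1, len(categorical), figsize=(20, 4))
for ax, col in zip(axes, categorical):
    sns.countplot(x=col, hue='Exited', data=df, ax=ax)
plt.show()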
Result:

Again, if you want to build these plots in Python, you really need to know Python. What if you just want to know whether there is a relationship between features? Or in which category of customers churn is greater (e.g. female/male, which country, or more active members vs less active ones)? It's a simple question, and you shouldn't need to be a data scientist to answer it.
Solution
SQL:
SELECT * FROM `your-client.staging.churn`;
Explore in Data Studio:

You can see that simple drag and drop does the same thing. Data Studio charts are interactive so you can click a category and it will filter the dashboard based on your selection. You can do all types of groupings and drill downs too.
You can create interactive graphs and charts like a Data scientist! I think it’s beautiful, especially when it’s free.
Personally, I find it quicker to create exploratory graphs with Seaborn or Matplotlib. However, Data Studio allows you to create a dashboard once and forget about re-running your Python notebook when the data in the tables changes: it will update the graphs itself.
Analyzing relationships between variables
Correlation Matrix
I like the correlation matrix because it's a quick way to display relationships between variables.
What is correlation?
Correlation is usually defined as a measure of the linear relationship between two quantitative variables (e.g. height and weight): the stronger the correlation between two variables, the more one changes when the other one does, and the greater its effect on the target variable.
Let’s create a correlation matrix.
Python:
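A minimal sketch (the numeric_only argument assumes a recent pandas version):
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, rendered as a heatmap
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()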
Result:

We can see that there is a positive correlation between Exited and Age, and some negative correlation between Exited and isActiveMember. Also you can see how different features correlate with each other.
Google BigQuery has done a great job on their statistical functions.
Now that we have dynamic SQL in BigQuery, we can build a correlation matrix the same way data scientists would normally do it in Python.
Let's run a simple SQL query to check how the CORR() function works in BigQuery:
SELECT corr(CreditScore,Balance) FROM `your-client.staging.churn`
Result: 0.006268381616013866
So we simply need to automate running this function for all the columns we have in the table (except those we want to exclude).
SQL:
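One way to automate it is to cross join the numeric column list with itself and generate one CORR() query per pair; a sketch that returns a tidy list of pairs rather than a pivoted matrix:
DECLARE columns ARRAY<STRING>;
DECLARE query STRING;

-- Only numeric columns make sense for CORR()
SET columns = (
  SELECT ARRAY_AGG(column_name)
  FROM `your-client.staging.INFORMATION_SCHEMA.COLUMNS`
  WHERE table_name = 'churn'
    AND data_type IN ('INT64', 'FLOAT64')
);

-- One CORR() query per column pair, stitched together with UNION ALL
SET query = (
  SELECT STRING_AGG(
    FORMAT("SELECT '%s' AS x, '%s' AS y, ROUND(CORR(%s, %s), 3) AS corr FROM `your-client.staging.churn`", x, y, x, y),
    ' UNION ALL ')
  FROM UNNEST(columns) AS x
  CROSS JOIN UNNEST(columns) AS y
);

EXECUTE IMMEDIATE query;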
Voilà!

Visualizing the relationship between variables
In theory if there is a relationship between features you will be able to see it straight away from the graphs:

Scatterplot
A scatterplot is a great way to visualise the relationship between two variables, e.g. Age and Salary. It can be very useful and helps to quickly identify outliers. It will also show whether a relationship exists or not.
Python:
df.plot(kind='scatter', x='CreditScore', y='Balance')

We can see that there is no visible correlation between CreditScore and Balance.
There is a way to create scatterplots between all your variables in one go.
Python:
sns.pairplot(df)

No obvious relationships show up. Now let's see how to do the same with Data Studio.
Google Data Studio:
Let's go back to our Explore with Data Studio example. We will have to add scatterplots for our features one by one, then colour the bubbles by exited_category in any colour you like.


Box plot
Box plots are useful when we need to check the distribution of a variable.
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be "outliers" using a method that is a function of the inter-quartile range.
A box plot consists of 5 things:
- Minimum
- First Quartile or 25%
- Median (Second Quartile) or 50%
- Third Quartile or 75%
- Maximum
Python:
Nice and easy: we can create automated box plots for each variable, as sketched below:
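A sketch (column names assumed from the Kaggle schema):
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of each continuous variable, split by churn
continuous = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
for ax, col in zip(axes.flatten(), continuous):
    sns.boxplot(x='Exited', y=col, data=df, ax=ax)
plt.show()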

This helps to visualise and explain the relationship between continuous variables and the target variable. You can see straight away that customers who churn are older than those who are retained. It also helps to compare the median levels of each variable. For example, there is no difference in the median credit score or tenure between lost and retained customers.
Most of the customers who churn still have a significant balance in their bank account. Estimated salary and the number of products seem not to have any visible effect on customer churn.
Let’s see how to create box plot data in BigQuery. We need data first.
SQL:
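For example, the five box plot numbers for Age, split by churn (a sketch using APPROX_QUANTILES):
SELECT
  Exited,
  MIN(Age) AS min,
  APPROX_QUANTILES(Age, 4)[OFFSET(1)] AS percentile_25,
  APPROX_QUANTILES(Age, 4)[OFFSET(2)] AS median,
  APPROX_QUANTILES(Age, 4)[OFFSET(3)] AS percentile_75,
  MAX(Age) AS max
FROM `your-client.staging.churn`
GROUP BY Exited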
Result:

You can now visualise it with Data Studio using a bar chart.
At the moment this article was written, box plots didn't exist in Data Studio. They can be created with custom visualisations though. Let me know if you are interested and I will create one.
Histogram
We'll use a histogram when we need to see the distribution of just one variable.
Python:
df['Age'].plot(kind='hist', bins=10, figsize=(12,6))

SQL:
If we need equal-width buckets we can simply do this:
SELECT COUNT(*) AS frequency, bucket
FROM (
  SELECT customerId, ROUND(Age / 10) * 10 AS bucket
  FROM `your-client.staging.churn`)
GROUP BY bucket
ORDER BY 2
Click Explore with Data Studio and you can build a nice Histogram.

Conclusion
After BigQuery announced its scripting (dynamic SQL) feature, many things became possible. With that scripting ability we can now automate queries, perform Exploratory Data Analysis and visualise the results in Data Studio.
Python still remains a major tool for data scientists and provides great scripting features too. However, if we are talking about just getting the numbers, BigQuery can do the same thing! When it comes to visualising the results, Python definitely helps to create graphs quicker, while Data Studio provides a better dashboarding experience: once the dashboard is set up, you can simply forget about re-running the queries (or notebooks). We just need to create it once.
I hope you enjoyed the read!
Recommended read:
BigQuery – Get the total number of columns in a BigQuery table
Bigquery query to find the column names of a table
Getting table metadata using INFORMATION_SCHEMA | BigQuery
Using BigQuery to find outliers with standard deviation results combined with WHERE clause
Statistical aggregate functions in Standard SQL | BigQuery
Computing Percentiles In BigQuery
Expressions, functions, and operators in Standard SQL | BigQuery
Data Studio | Google Developers
An Extensive Step by Step Guide to Exploratory Data Analysis
How to Remove Outliers for Machine Learning – Machine Learning Mastery
Detect and exclude outliers in Pandas data frame
Loading Data into Google BigQuery for Exploratory Data Analysis | Qwiklabs
Exploratory queries with BigQuery | Google Cloud Platform Community
Data Science for Startups: Exploratory Data Analysis
How to use Google BigQuery ML and Data Studio for exploratory data analysis
https://machinelearningmastery.com/what-is-imbalanced-classification
Introducing CORR() to Google BigQuery
Calculating Relationships with Correlation Matrices · Advanced SQL · SILOTA
Box plot visualization with Pandas and Seaborn – GeeksforGeeks
Use google bigquery to build histogram graph
How to Make a Box-and-Whisker Plot in SQL