
In this article, we’ll discuss how to derive summary statistics of numerical and categorical columns/fields using SQL. We’ll use the Netflix Movies and TV Shows dataset that’s downloaded from Tableau Public sample datasets (https://public.tableau.com/en-us/s/resources and go to Sample Data). This dataset consists of TV shows and movies available on Netflix as of 2019 and is sourced by Tableau from Kaggle and the updated version of the dataset can be found on Kaggle. However, for this article, we’ll use the older version available in Tableau sample datasets and import it into Postgres. This article assumes that the reader has basic knowledge of SQL queries and statistics. Below is a snapshot of the _netflix_titles_ that we’ll be using throughout this article.

Summary statistics of numerical variables
Frequently used summary statistics for numerical variables are mean, median, minimum, maximum, range, standard deviation, variance, Q1, Q3, IQR and skewness.
Mean
In PostgreSQL, the mean of a numerical field/column is computed using the AVG() function. We’ll compute the mean of _duration_minutes_ field as shown below.
SELECT
AVG(duration_minutes) AS mean
FROM netflix_titles;

Median
In PostgreSQL, there is no function to directly compute the median of a numerical field/column. However, since median is the 50th percentile, we can use it as a proxy to median. Percentile of a numerical variable is computed using the PERCENTILE_CONT() function. We’ll compute the median of _duration_minutes_ field as shown below.
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median
FROM netflix_titles;

WITHIN GROUP clause creates an ordered subset of data that can be used to perform aggregations. PERCENTILE_CONT takes the percentile required as an argument, in this case it is 0.5 i.e. the 50th percentile.
Minimum
In PostgreSQL, the minimum value of a numerical field/column is found using the MIN() function. We’ll find the minimum value of _duration_minutes_ field as shown below.
SELECT
MIN(duration_minutes) AS min
FROM netflix_titles;

Maximum
In PostgreSQL, the maximum value of a numerical field/column is found using the MAX() function. We’ll find the maximum value of _duration_minutes_ field as shown below.
SELECT
MAX(duration_minutes) AS max
FROM netflix_titles;

Range
In PostgreSQL, there is no function to directly compute the range of a numerical field/column. However, since range is the difference between maximum and minimum values, we can use it as a proxy to range.
SELECT
MAX(duration_minutes) - MIN(duration_minutes) AS range
FROM netflix_titles;

Standard deviation
In PostgreSQL, the standard deviation of a numerical field/column is computed using the STDDEV() function. We’ll compute the standard deviation of _duration_minutes_ field as shown below.
SELECT
ROUND(STDDEV(duration_minutes), 2) AS standard_deviation
FROM netflix_titles;

OR
We can also compute standard deviation as the square root of variance as shown below.
SELECT
ROUND(SQRT(VARIANCE(duration_minutes)), 2) AS stddev_using_variance
FROM netflix_titles;

Variance
In PostgreSQL, the variance of a numerical field/column is computed using the VARIANCE() function. We’ll compute the variance of _duration_minutes_ field as shown below.
SELECT
ROUND(VARIANCE(duration_minutes), 2) AS variance
FROM netflix_titles;

OR
We can also compute variance as the square of standard deviation as shown below.
SELECT
ROUND(POWER(STDDEV(duration_minutes), 2), 2) AS variance_using_stddev
FROM netflix_titles;

Q1
In PostgreSQL, there is no function to directly compute the first quartile (Q1) of a numerical field/column. However, since Q1 is the 25th percentile, we can use it as a proxy to Q1. We’ll compute the Q1 of _duration_minutes_ field as shown below.
SELECT
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1
FROM netflix_titles;

Q3
In PostgreSQL, there is no function to directly compute the third quartile (Q3) of a numerical field/column. However, since Q3 is the 75th percentile, we can use it as a proxy to Q3. We’ll compute the Q3 of _duration_minutes_ field as shown below.
SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
FROM netflix_titles;

IQR
In PostgreSQL, there is no function to directly compute the interquartile range (IQR) of a numerical field/column. However, since IQR is the difference between Q3 and Q1, we can use it as a proxy to IQR. We’ll compute the IQR of _duration_minutes_ field as shown below.
SELECT
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) -
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS iqr
FROM netflix_titles;

Skewness
In PostgreSQL, there is no function to directly compute the skewness of a numerical field/column. However, since skewness is *3 (mean-median) / standarddeviation**, we can use it as a proxy to skewness. We’ll compute the skewness of duration_minutes_ field as shown below. To keep the code more readable we’ll use CTE to compute the mean, median and standard deviation.
WITH mean_median_sd AS
(
SELECT
AVG(duration_minutes) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
STDDEV(duration_minutes) AS stddev
FROM netflix_titles
)
SELECT
ROUND(3 * (mean - median)::NUMERIC / stddev, 2) AS skewness
FROM mean_median_sd;

Putting it all together
We’ll put all the functions discussed earlier together and return them in a single table. In the below query, we’ll create two CTEs namely _summary_stats (which returns the previously discussed statistics one per each column) and row_summary_stats_ (which returns the statistics one per each row).
WITH RECURSIVE
summary_stats AS
(
SELECT
ROUND(AVG(duration_minutes), 2) AS mean,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
MIN(duration_minutes) AS min,
MAX(duration_minutes) AS max,
MAX(duration_minutes) - MIN(duration_minutes) AS range,
ROUND(STDDEV(duration_minutes), 2) AS standard_deviation,
ROUND(VARIANCE(duration_minutes), 2) AS variance,
PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1,
PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
FROM netflix_titles
),
row_summary_stats AS
(
SELECT
1 AS sno,
'mean' AS statistic,
mean AS value
FROM summary_stats
UNION
SELECT
2,
'median',
median
FROM summary_stats
UNION
SELECT
3,
'minimum',
min
FROM summary_stats
UNION
SELECT
4,
'maximum',
max
FROM summary_stats
UNION
SELECT
5,
'range',
range
FROM summary_stats
UNION
SELECT
6,
'standard deviation',
standard_deviation
FROM summary_stats
UNION
SELECT
7,
'variance',
variance
FROM summary_stats
UNION
SELECT
9,
'Q1',
q1
FROM summary_stats
UNION
SELECT
10,
'Q3',
q3
FROM summary_stats
UNION
SELECT
11,
'IQR',
(q3 - q1)
FROM summary_stats
UNION
SELECT
12,
'skewness',
ROUND(3 * (mean - median)::NUMERIC / standard_deviation, 2) AS skewness
FROM summary_stats
)
SELECT *
FROM row_summary_stats
ORDER BY sno;

Summary statistics of categorical variables
Frequently used summary statistics for categorical variables are mode, cardinality (number of unique categories), frequency and relative frequency.
Mode
In PostgreSQL, mode of a categorical field/column is computed using the MODE() function. We’ll compute the mode of rating field as shown below.
SELECT
MODE() WITHIN GROUP (ORDER BY rating) AS mode
FROM netflix_titles;

Cardinality
In PostgreSQL, there is no function to directly compute the number of unique values of a categorical field/column. However, we can count the distinct values of a variable as shown below.
SELECT
COUNT(DISTINCT rating) AS cardinality
FROM netflix_titles;

Frequency and relative frequency
In Postgres, we can compute the frequency of the categories in a categorical variable using GROUP BY and COUNT. For relative frequency, we’ll use a CTE to count the total number of values in rating field and use it to compute the relative frequency. Since, not all databases support window functions we’ll use CTE. We’ll also discuss how to calculate relative frequency using window functions.
WITH total_count AS
(
SELECT
COUNT(rating) AS total_cnt
FROM netflix_titles
)
SELECT
rating,
COUNT(rating) AS frequency,
ROUND(COUNT(rating)::NUMERIC /
(SELECT total_cnt FROM total_count), 4) AS relative_frequency
FROM netflix_titles
GROUP BY rating
ORDER BY frequency DESC;

In the above example, we’ve created a CTE that captures the count of values in the rating field. We then used it to calculate the percentage/relative frequency of each category in the rating field. Since, Postgres supports window functions, we’ll see a simpler way of calculating relative frequency using window functions. We’ll use OVER() to compute the sum of the counts of rating across each group which in turn is the total number of values in the rating field.
SELECT
rating,
COUNT(rating) AS frequency,
ROUND(COUNT(rating)::NUMERIC / SUM(COUNT(rating)) OVER(), 4) AS relative_frequency
FROM netflix_titles
GROUP BY rating
ORDER BY frequency DESC;

This brings the article to an end. We’ve discussed various functions to compute the frequently used summary Statistics of numerical fields like mean, median, min, max, Q1, Q2, IQR, etc. and of categorical fields like cardinality, frequency, etc. Summary statistics can be used to get an idea of the data distribution and also aid in identifying potential outliers.
Know more about my work at https://ksvmuralidhar.in/