How to derive summary statistics using PostgreSQL

Deriving summary statistics of numerical and categorical fields using PostgreSQL

Photo by Campaign Creators on Unsplash

In this article, we’ll discuss how to derive summary statistics of numerical and categorical columns/fields using SQL. We’ll use the Netflix Movies and TV Shows dataset downloaded from the Tableau Public sample datasets (https://public.tableau.com/en-us/s/resources, under Sample Data). This dataset consists of TV shows and movies available on Netflix as of 2019; it was sourced by Tableau from Kaggle, and an updated version of the dataset can be found on Kaggle. However, for this article, we’ll use the older version available in the Tableau sample datasets and import it into Postgres. This article assumes that the reader has basic knowledge of SQL queries and statistics. Below is a snapshot of the _netflix_titles_ table that we’ll be using throughout this article.

netflix_titles table (Image by author)
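If you’d like to reproduce the setup, below is a minimal sketch of how the table could be created and loaded in Postgres. The schema and file name are assumptions (only the columns used in this article are listed), so adjust them to match the CSV you download.

-- Assumed schema: only the columns referenced in this article
CREATE TABLE netflix_titles (
 show_id TEXT PRIMARY KEY,
 type TEXT,
 title TEXT,
 rating TEXT,
 duration_minutes INT
);

-- psql client-side bulk load; adjust the file path and columns to your CSV
\copy netflix_titles FROM 'netflix_titles.csv' WITH (FORMAT csv, HEADER true);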

Summary statistics of numerical variables

Frequently used summary statistics for numerical variables are mean, median, minimum, maximum, range, standard deviation, variance, Q1, Q3, IQR and skewness.

Mean

In PostgreSQL, the mean of a numerical field/column is computed using the AVG() function. We’ll compute the mean of the _duration_minutes_ field as shown below.

SELECT 
 AVG(duration_minutes) AS mean
  FROM netflix_titles;
Image by author

Median

In PostgreSQL, there is no function to directly compute the median of a numerical field/column. However, since the median is the 50th percentile, we can compute it with the PERCENTILE_CONT() function, which returns a given percentile of a numerical variable. We’ll compute the median of the _duration_minutes_ field as shown below.

SELECT 
 PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median
  FROM netflix_titles;
Image by author

The WITHIN GROUP clause creates an ordered subset of the data over which the aggregate is computed. PERCENTILE_CONT() takes the required percentile as an argument; in this case it is 0.5, i.e. the 50th percentile.
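As an aside, PERCENTILE_CONT() also accepts an array of fractions, which lets us compute several percentiles in a single pass. A quick sketch (the alias name is ours):

SELECT 
 PERCENTILE_CONT(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY duration_minutes) AS q1_median_q3
  FROM netflix_titles;

The related PERCENTILE_DISC() function works the same way but returns actual values from the column instead of interpolating between them.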

Minimum

In PostgreSQL, the minimum value of a numerical field/column is found using the MIN() function. We’ll find the minimum value of the _duration_minutes_ field as shown below.

SELECT 
 MIN(duration_minutes) AS min 
  FROM netflix_titles;
Image by author

Maximum

In PostgreSQL, the maximum value of a numerical field/column is found using the MAX() function. We’ll find the maximum value of the _duration_minutes_ field as shown below.

SELECT 
 MAX(duration_minutes) AS max 
  FROM netflix_titles;
Image by author

Range

In PostgreSQL, there is no function to directly compute the range of a numerical field/column. However, since the range is the difference between the maximum and minimum values, we can compute it by subtracting MIN() from MAX(), as shown below.

SELECT 
 MAX(duration_minutes) - MIN(duration_minutes) AS range
    FROM netflix_titles;
Image by author

Standard deviation

In PostgreSQL, the standard deviation of a numerical field/column is computed using the STDDEV() function. We’ll compute the standard deviation of the _duration_minutes_ field as shown below.

SELECT 
 ROUND(STDDEV(duration_minutes), 2) AS standard_deviation
    FROM netflix_titles;
Image by author

OR

We can also compute standard deviation as the square root of variance as shown below.

SELECT 
 ROUND(SQRT(VARIANCE(duration_minutes)), 2) AS stddev_using_variance
    FROM netflix_titles;
Image by author

Variance

In PostgreSQL, the variance of a numerical field/column is computed using the VARIANCE() function. We’ll compute the variance of the _duration_minutes_ field as shown below.

SELECT 
 ROUND(VARIANCE(duration_minutes), 2) AS variance
    FROM netflix_titles;
Image by author

OR

We can also compute variance as the square of standard deviation as shown below.

SELECT 
 ROUND(POWER(STDDEV(duration_minutes), 2), 2) AS variance_using_stddev
    FROM netflix_titles;
Image by author

Q1

In PostgreSQL, there is no function to directly compute the first quartile (Q1) of a numerical field/column. However, since Q1 is the 25th percentile, we can compute it with PERCENTILE_CONT(). We’ll compute the Q1 of the _duration_minutes_ field as shown below.

SELECT
 PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1
    FROM netflix_titles;
Image by author

Q3

In PostgreSQL, there is no function to directly compute the third quartile (Q3) of a numerical field/column. However, since Q3 is the 75th percentile, we can compute it with PERCENTILE_CONT(). We’ll compute the Q3 of the _duration_minutes_ field as shown below.

SELECT
 PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
    FROM netflix_titles;
Image by author

IQR

In PostgreSQL, there is no function to directly compute the interquartile range (IQR) of a numerical field/column. However, since the IQR is the difference between Q3 and Q1, we can compute it as the difference between the 75th and 25th percentiles. We’ll compute the IQR of the _duration_minutes_ field as shown below.

SELECT
 PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) -
 PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS iqr
     FROM netflix_titles;
Image by author
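A common use of the IQR is flagging potential outliers: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are often treated as suspects. Below is a minimal sketch of that rule applied to _duration_minutes_ (the 1.5 multiplier is the usual convention, and the title column is assumed to exist in the table).

WITH quartiles AS
(
 SELECT 
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
   FROM netflix_titles
)
SELECT 
 title,
 duration_minutes
  FROM netflix_titles, quartiles
   WHERE duration_minutes < q1 - 1.5 * (q3 - q1)
      OR duration_minutes > q3 + 1.5 * (q3 - q1);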

Skewness

In PostgreSQL, there is no function to directly compute the skewness of a numerical field/column. However, Pearson’s second skewness coefficient, 3 * (mean - median) / standard deviation, can be used as a proxy for skewness. We’ll compute the skewness of the _duration_minutes_ field as shown below. To keep the code readable, we’ll use a CTE to compute the mean, median and standard deviation.

WITH mean_median_sd AS
(
 SELECT 
  AVG(duration_minutes) AS mean,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
  STDDEV(duration_minutes) AS stddev
   FROM netflix_titles
)
SELECT 
 ROUND(3 * (mean - median)::NUMERIC / stddev, 2) AS skewness
  FROM mean_median_sd;
Image by author

Putting it all together

We’ll now put all the functions discussed earlier together and return the results in a single table. In the query below, we create two CTEs, namely _summary_stats_ (which returns the previously discussed statistics, one per column) and _row_summary_stats_ (which returns the statistics, one per row).

WITH
summary_stats AS
(
 SELECT 
  ROUND(AVG(duration_minutes), 2) AS mean,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration_minutes) AS median,
  MIN(duration_minutes) AS min,
  MAX(duration_minutes) AS max,
  MAX(duration_minutes) - MIN(duration_minutes) AS range,
  ROUND(STDDEV(duration_minutes), 2) AS standard_deviation,
  ROUND(VARIANCE(duration_minutes), 2) AS variance,
  PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration_minutes) AS q1,
  PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration_minutes) AS q3
   FROM netflix_titles
),
row_summary_stats AS
(
SELECT 
 1 AS sno, 
 'mean' AS statistic, 
 mean AS value 
  FROM summary_stats
UNION
SELECT 
 2, 
 'median', 
 median 
  FROM summary_stats
UNION
SELECT 
 3, 
 'minimum', 
 min 
  FROM summary_stats
UNION
SELECT 
 4, 
 'maximum', 
 max 
  FROM summary_stats
UNION
SELECT 
 5, 
 'range', 
 range 
  FROM summary_stats
UNION
SELECT 
 6, 
 'standard deviation', 
 standard_deviation 
  FROM summary_stats
UNION
SELECT 
 7, 
 'variance', 
 variance 
  FROM summary_stats
UNION
SELECT 
 9, 
 'Q1', 
 q1 
  FROM summary_stats
UNION
SELECT 
 10, 
 'Q3', 
 q3 
  FROM summary_stats
UNION
SELECT 
 11, 
 'IQR', 
 (q3 - q1) 
  FROM summary_stats
UNION
SELECT 
 12, 
 'skewness', 
 ROUND(3 * (mean - median)::NUMERIC / standard_deviation, 2) AS skewness 
  FROM summary_stats
)
SELECT * 
 FROM row_summary_stats
  ORDER BY sno;
Image by author

Summary statistics of categorical variables

Frequently used summary statistics for categorical variables are mode, cardinality (number of unique categories), frequency and relative frequency.

Mode

In PostgreSQL, the mode of a categorical field/column is computed using the MODE() function. We’ll compute the mode of the rating field as shown below.

SELECT 
 MODE() WITHIN GROUP (ORDER BY rating) AS mode
  FROM netflix_titles;
Image by author

Cardinality

In PostgreSQL, there is no function to directly compute the cardinality (number of unique values) of a categorical field/column. However, we can count its distinct values as shown below.

SELECT 
 COUNT(DISTINCT rating) AS cardinality
  FROM netflix_titles;
Image by author

Frequency and relative frequency

In Postgres, we can compute the frequency of the categories of a categorical variable using GROUP BY and COUNT. For the relative frequency, we’ll use a CTE to count the total number of values in the rating field and use that count to compute each category’s share. Since not all databases support window functions, we’ll use a CTE first; we’ll then discuss how to calculate the relative frequency using window functions.

WITH total_count AS
(
SELECT
 COUNT(rating) AS total_cnt
  FROM netflix_titles
)
SELECT 
 rating,
 COUNT(rating) AS frequency,
 ROUND(COUNT(rating)::NUMERIC / 
    (SELECT total_cnt FROM total_count), 4) AS relative_frequency
    FROM netflix_titles
     GROUP BY rating
   ORDER BY frequency DESC;
Image by author

In the above example, we’ve created a CTE that captures the total count of values in the rating field. We then used it to calculate the percentage/relative frequency of each category in the rating field. Since Postgres supports window functions, we’ll now see a simpler way of calculating the relative frequency. We’ll use OVER() to compute the sum of the per-category counts of rating, which is the total number of values in the rating field.

SELECT 
 rating,
 COUNT(rating) AS frequency,
 ROUND(COUNT(rating)::NUMERIC / SUM(COUNT(rating)) OVER(), 4) AS relative_frequency
    FROM netflix_titles
     GROUP BY rating
      ORDER BY frequency DESC;
Image by author

This brings the article to an end. We’ve discussed various functions to compute the frequently used summary statistics of numerical fields like mean, median, min, max, Q1, Q3, IQR, etc., and of categorical fields like cardinality, frequency, etc. Summary statistics can be used to get an idea of the data distribution and also aid in identifying potential outliers.

Learn more about my work at https://ksvmuralidhar.in/

