The world’s leading publication for data science, AI, and ML professionals.

Window Function & Aggregate Function: SQL vs. Python Pandas

Which tool would you like to use in your next data science project?

Photo by Pascal Müller on Unsplash
Photo by Pascal Müller on Unsplash

Background

Data cleaning and manipulation are essential in any data science project. Both SQL and Pandas are popular tools used by Data Analysts and Data Scientists.

In the following article, I am going to compare SQL and Pandas in terms of their implementation of data manipulations. Hope it would be useful to someone who is familiar with SQL and would like to learn about Pandas, and vice versa.

Window Functions and Aggregate Functions

Before we dive into the comparisons, let me do a quick recap on two main functions for data manipulation.

Window Functions are implemented on a set of rows called a window frame. A window frame is all the rows within the same group. In the following example, we compute an average salary by gender and the results are repeated in the same gender group.

(Created by Author)
(Created by Author)

Aggregate Functions are implemented in the same as window functions. But the result will be returned without duplicates, so that the number of observations in the final output would equal the number of distinct groups.

(Created by Author)
(Created by Author)

Window Functions

Create a New Column of Row number

Both Pandas and PostgreSQL have built-in Window Functions. In Pandas, we can use a groupby operation with a window function, while in PostgreSQL, we can use Over (Partition by). when a window function is implemented, a new column would be produced and the output would have the same number of rows as the original data set.

Using Pandas
df['new_column_name'] = df.groupby('gender').cumcount()+1
Using PostgreSQL
SELECT *, 
   ROW_NUMBER() OVER(PARTITION BY gender) AS ROW 
FROM df

Create Count/Max/Min/Average/Sum Within a Group

In Pandas, we often use transform with a window function, such as, count, max, min, avg and sum.

Using Pandas
df.groupby('gender')['salary'].transform('count')
Using PostgreSQL
SELECT *,         
   COUNT(*) OVER(PARTITION BY gender) AS ct 
FROM df;

Create Running Sum Within a Group

In Pandas, we can use cumsum to compute a running sum.

In PostgreSQL, we can compute a running sum using Over (Partion by) with Order by. With an ORDER BY is specified, a window function will be implemented at every row within a window.

Using Pandas
df.groupby('gender')['salary'].transform('cumsum')
Using PostgreSQL
SELECT *,         
   SUM(salary) OVER(PARTITION BY gender ORDER BY age) AS avg 
FROM df;

Create Lag/Lead Within a Group

Using Pandas
df.groupby('gender')['salary'].transform(lambda x: x.shift(1)) df.groupby('gender')['salary'].transform(lambda x: x.shift(-1))
Using PostgreSQL
SELECT *, 
   LAG(salary) OVER(PARTITION BY gender ORDER BY salary),
   LEAD(salary) OVER(PARTITION BY gender ORDER BY salary)   
FROM df

Create Ranking Within a Group

Using Pandas
df.groupby('gender')['salary'].rank('dense', ascending = False)
Using PostgreSQL
SELECT *,         
   RANK() OVER(PARTITION BY gender ORDER BY salary) AS rank 
FROM df

Aggregate Functions

Collapse Rows With Count/Max/Min/Average/Sum Within a Group

In Pandas, there are many ways to implement an aggregate function. I include 3 different ways in the following code snippet.

  • An aggregate function would run as the default using groupby
  • Use apply to run a built-in aggregate function or a user-defined function with groupby
  • Use agg to run a built-in aggregate function or a user-defined function with more flexibility, such as, naming new columns and creating more than one new column

In PostgreSQL, an aggregate function is straightforward to implement with GROUP BY.

Using Pandas
df.groupby('gender')['id'].size().reset_index()
df.groupby('gender').apply(lambda x: x['id'].size).reset_index()
df.groupby('gender').agg(count = pd.NamedAgg('id', 'size')).reset_index()
Using PostgreSQL
SELECT gender, 
   COUNT(*) 
FROM df 
GROUP BY gender

Create Percentile Within a Group

In PostgreSQL, percentile within a group can be computed using WITHIN GROUP(ORDER BY) with GROUP BY.

Using Pandas
 df.groupby('gender')['salary'].quantile(0.9).reset_index()
Using PostgreSQL
 SELECT gender, 
    percentile_disc(0.9) WITHIN GROUP(ORDER BY salary) 
 FROM df 
 GROUP BY gender

Things to consider:

Which tool to use depends on where the data is stored, what kind of data format, and how we want to use it.

  • If the data you are working with is not in panel format yet and you will need to piece together data from various sources, Pandas might work better. For example, when processing text data or scraping data from websites, it is likely that data is in an unstructured format, it would be very difficult to use SQL.
  • If you’re not familiar with data and would like to explore the data, your database admin would appreciate it if you do the work outside of the database with Pandas.
  • If you would like to do data visualization and implement statistical analysis and machine learning models, Pandas would work well with other libraries in Python, such as, Matplotlib, Scikit-Learn, TensorFlow and etc.
  • If you deal with a large amount of data, you can use Pandas with other libraries, such as Pyspark, Dask and Swifter to fully utilize your hardware power.
  • If you’re very familiar with data and know exactly what steps to take to clean to data, such as, filtering, joining, calculation and etc, it should be easier to run SQL to process the data and export the final data for analysis tasks.
  • If you work on a front-end project and would like to access to the back-end database without complex data manipulations, you might be better off using SQL.

Final Notes:

I would recommend to learn both Pandas and SQL since they are common tools in the Data Science field. The combination will give you more flexibility when you work on various data science projects with people of different backgrounds.

Thank you for reading !!!

If you enjoy this article and would like to Buy Me a Coffee, please click here.

You can sign up for a membership to unlock full access to my articles, and have unlimited access to everything on Medium. Please subscribe if you’d like to get an email notification whenever I post a new article.


Related Articles