Aggregation and Grouping

GSoC: Implementing DataFrame in Pharo

Oleksandr Zaitsev
Towards Data Science

--

This is the first implementation of aggregation and grouping in a Pharo DataFrame. It covers only the basic functionality, such as grouping a data frame or a series by the values of another series of the corresponding size, and then applying the aggregation functions to the grouped data structures.

In the next iterations the functionality will be extended according to the targeted user scenarios. The implementation is also very likely to evolve into something more optimized.

What is DataFrame?

Data frames are spreadsheet-like tabular data structures that provide an API for slicing-and-dicing, cleaning, and analyzing the data. Implementing data frames in Pharo is part of my Google Summer of Code project this year. You can find the code in this repository: https://github.com/PolyMathOrg/DataFrame and load it into your image with the following Metacello script:

Metacello new
baseline: 'DataFrame';
repository: 'github://PolyMathOrg/DataFrame';
load.

If you want to learn more about the DataFrame project, check out the documentation and follow the tutorial which you can find in the repository. Documentation is still under development, so if you can’t find an answer to your question, feel free to drop me a letter: olk.zaytsev@gmail.com.

Split-apply-combine

Split-apply-combine is a strategy where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.

Data aggregation and grouping allows us to create summaries for display or analysis, for example, when calculating average values or creating a table of counts or sums. It is a process that follows the split-apply-combine strategy:

  1. Split data into groups based on some criteria
  2. Apply the function to each group independently
  3. Combine the results into a data structure

Implementation

In this section I explain the how aggregation and grouping functionality is currently implemented. If you are not interested in these details, feel free to skip this section and jump to the next one.

Consider the following message sent to firstSeries object:

firstSeries groupBy: secondSeries.

After receiving this message, firstSeries creates an object of DataSeriesGrouped which splits firstSeries into a collection of subseries, based on the values of secondSeries. The collection of subseries is then stored as an object of DataSeries (instance variable called groups), whose keys are equal to the unique values of secondSeries and values store the subseries of firstSeries, corresponding to each one of those unique values.

When a receiver of a groupBy: message is a DataFrame, it creates an instance of DataFrameGrouped which splits the data similarly to the way DataSeriesGroup does it, except the values of groups series are sub data frames, not subseries. In other words, groups is a DataSeries with keys corresponding to the unique values of a series by which the data frame is grouped (secondSeries), and each value of groups stores a smaller data frame with rows specified by numbers at which secondSeries has a value equal to the corresponding key of groups. If data frame is grouped by one of its columns, this column is excluded from data frame prior to grouping. This way we avoid data duplication, since the same values will be stored as keys of groups variable.

In case of DataSeriesGrouped each subseries will be aggregated to a scalar, and all these scalars will be combined into a new DataSeries. As for the DataFrameGrouped, it will apply the block to each column of each sub data frame and return the resulting matrix of scalars as a new DataFrame.

Aggregation is done with the apply: message. It takes a block as an argument and applies it to each value of the groups series and combines the results into a new data structure. For some commonly used aggregation functions, such as max, min, average etc., DataGrouped provides shorter messages. In this iteration all these messages are just handy shortcuts

average
^ self apply: [ :each | each average ].

But in future these messages will contain the optimized implementations of the corresponding aggregations, since it is critical that these functions are time and memory efficient.

Example of Usage

In this section I provide simple examples of grouping a series by the values of another series, grouping a data frame by the values of a series, and finally I show how the aggregation functions can be applied to the grouped data.

Restaurant Tipping Dataset

I will base my examples on dataset of restaurant tipping. It was originally found in Bryant & Smith’s 1995 text on business statistics. You can obtain a CSV file with a tipping dataset in this GitHub repository: https://github.com/wesm/pydata-book/blob/master/ch08/tips.csv.

Let’s read the data into a DataFrame using the fromCSV: method

tips := DataFrame fromCSV: '/path/to/tips.csv'.

The easiest way to get a quick insight into a dataset stored in a data frame is to look at its head — the first 5 rows (tips head)

  |  total_bill  tip    sex     smoker  day  time    size
--+------------------------------------------------------
1 | 16.99 1.01 Female No Sun Dinner 2
2 | 10.34 1.66 Male No Sun Dinner 3
3 | 21.01 3.5 Male No Sun Dinner 3
4 | 23.68 3.31 Male No Sun Dinner 2
5 | 24.59 3.61 Female No Sun Dinner 4

We can see that tipping dataset has 2 quantitative variables: total_bill and tip, and 5 categorical variables: sex, smoker, day, time, and size. In the examples provided below we will be grouping data structures only by categorical variables. Technically, its also possible to group by quantitative variable, but in most cases such grouping is not very useful (not to mention that it doesn’t look well in a tutorial).

Grouping Series

The simplest example of applying a groupBy operator is to group the values of a series by the values of another one of the same size.

bill := tips column: #total_bill.
sex := tips column: #sex.
bill groupBy: sex.

The result of this query will be an object of DataSeriesGrouped, which splits the bill into two series, mapped to the ‘Male’ and ‘Female’ values of sex series.

Female  (16.99 24.59 35.26 14.83 ... 18.78)
Male (10.34 21.01 23.68 25.29 ... 17.82)

Since most of the time we need to group series that are both columns of a single data frame, there is a handy shortcut:

tips group: #total_bill by: #sex.

Grouping Data Frames

Aside from the shortcut for grouping its columns as demonstrated above, DataFrame provides a method for grouping itself by one of its columns.

tips groupBy: #smoker

The result of this query will be an object of DataFrameGrouped, storing two separate data frames — one for all the smokers and one for non-smokers. The smoker column will be excluded from these data frames, since its values will be stored as keys inside a DataFrameGrouped object, and the the separate groups of smokers and non-smokers will allow us to fully reconstruct the smoker column if needed.

Aggregating Grouped Data

Aggregation functions are the ones that take a series as an input and return a scalar value that summarizes the values of that series. These are the statistical functions: min, max, average, stdev etc., functions like sum, count, and many others.

Once the data was grouped, we can apply an aggregation function to the composed groups and receive the combined data structure that summarizes the initial data

grouped := tips group: #total_bill by: #day.
grouped apply: [ :each | each average round: 2].

Since we were grouping a column of DataFrame by another column, the result will be an object of DataSeries

a DataSeries
Sun 21.41
Sat 20.44
Thur 17.68
Fri 17.15

As it was mentioned above, DataGrouped provides shortcuts for some commonly used aggregation functions, such as min, max, average, sum, count etc. For now they are just shortcuts, but later they will implement the optimized aggregations that will be applied much faster (for small datasets like tipping, speed is not a serious issue)

grouped min.
grouped max.
grouped average.

Once the data frame was grouped into an object of DataFrameGrouped, we can also apply an aggregation function to this object. DataFrameGrouped implement the apply: message in such way that the function is applied to each column of each sub data frame, producing the scalar value. These scalars are then combined into a new data frame.

(tips groupBy: #sex) count.

The result of this query will be a data frame containing the number of non-empty cells for each column, corresponding to ‘Male’ and ‘Female’ rows

       | total_bill  tip  smoker  day  time  size
-------+-----------------------------------------
Female | 87 87 87 87 87 87
Male | 157 157 157 157 157 157

Resources

  1. McKinney Wes, Python for data analysis, O’Reilly Media, Inc., 2012
  2. Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1–29. URL http://www.jstatsoft.org/v40/i01/.
  3. https://pandas.pydata.org/pandas-docs/stable/groupby.html
  4. https://www.postgresql.org/docs/9.5/static/tutorial-agg.html
  5. https://github.com/wesm/pydata-book/blob/master/ch08/tips.csv

--

--