Homepage is https://alan-turing-institute.github.io/xpandas/

Developing 1d/2d data containers and transformers for data analysis

Introduction

Every problem in data science starts with data, and that data usually needs to be preprocessed before it can be analysed. Preprocessing can be as simple as filtering or as involved as transformation and feature extraction.

In Python, the most popular library for working with 1d/2d data sets is Pandas. For 1d data, such as a sequence of numbers, the pandas.Series object is appropriate; for 2d data, the corresponding object is pandas.DataFrame. Pandas objects are well suited to storing and transforming in-memory data of both quantitative and categorical types. Generally, each cell of a pandas.DataFrame or pandas.Series is either a real number or a categorical string.
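As a brief illustration of the two Pandas containers, the sketch below builds a pandas.Series and a pandas.DataFrame. The column names and values are invented for illustration only.

```python
import pandas as pd

# 1d container: a sequence of numbers.
ages = pd.Series([34, 51, 27], name="age")

# 2d container: columns holding quantitative and categorical data.
patients = pd.DataFrame(
    {
        "age": [34, 51, 27],
        "gender": ["F", "M", "F"],
        "score": [0.82, 0.47, 0.91],
    }
)

mean_age = ages.mean()           # average of the 1d data
n_rows, n_cols = patients.shape  # number of rows and columns
```

Each cell here is a plain number or a string, which is the situation Pandas handles well.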

Data sets are becoming more and more complex. For example, imagine a data set of hospital patients. The hospital needs to keep simple patient features such as age (integer), gender (categorical) and analysis scores (float), as well as more complex ones such as heartbeat rate (Series) and X-ray image (Image). In this case, one cannot store this kind of data set in a pandas.DataFrame.

Usually, researchers write their own infrastructure for working with such data types, or use standard Python lists and lists of lists (or numpy). Once the data is stored, one would like to perform feature extraction on different columns. For example, one may want to extract the mean, standard deviation and quantiles from a series, or use a CNN to extract features from images, and so on. With a list-of-lists data container such operations are not user friendly and usually lead to spaghetti code.

XPandas (extended Pandas) provides universal 1d/2d data containers
for storing any kind of data, and a Transformers interface for feature extraction and transformation.

The next chapter describes the simple mathematical background of XPandas and its API design.

Mathematical Setting and API Design

Mathematical Definitions

Let Ω be an arbitrary space of any objects. For example, Ω can be a space
of pandas.Series.

X is called a 1d abstract data container if X ∈ Ωⁿ, where n is the number of objects in the data container and is called its size.
In XPandas, a 1d abstract data container is called XSeries.
DF is a 2d abstract data container of size m if DF is a set of
m 1d abstract data containers.

In XPandas, the corresponding object is called XDataFrame.
An XDataFrame is a set of m XSeries, each of size n.
XSeries and XDataFrame are the core components of XPandas.
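The definitions above can be sketched in plain Python. This is only an illustration of the mathematical setting, not the XPandas implementation; the helper names make_container_1d and make_container_2d are hypothetical.

```python
from typing import Any, Dict, Sequence, Tuple


def make_container_1d(objects: Sequence[Any]) -> Tuple[Any, ...]:
    """A 1d abstract container: an element of Omega^n, stored as a tuple."""
    return tuple(objects)


def make_container_2d(columns: Dict[str, Sequence[Any]]) -> Dict[str, Tuple[Any, ...]]:
    """A 2d abstract container: m named 1d containers, each of size n."""
    containers = {name: make_container_1d(col) for name, col in columns.items()}
    sizes = {len(c) for c in containers.values()}
    if len(sizes) > 1:
        raise ValueError(f"all 1d containers must share one size, got {sorted(sizes)}")
    return containers


# Omega is the space of lists of floats here; the container has size n = 3.
x = make_container_1d([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])

# A 2d container built from m = 2 containers, each of size n = 3.
df = make_container_2d({"age": [34, 51, 27], "heartbeat": x})
```

The size check mirrors the requirement that every XSeries in an XDataFrame has the same size n.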
XPandas also provides an interface for Transformers.

A Series transformer is a function T: XSeries → (XSeries or XDataFrame).
A Data frame transformer is a function that encapsulates multiple
Series transformers:
DFT: {SeriesTransformer} → XDataFrame

Transformers can perform classical element-wise map operations
on each element of an XSeries as well as more complex transformations.
A trivial example of a non-element-wise transformation is

T: X → (overall_mean - X.mean())

where X is a pandas.Series and overall_mean is the mean of the whole XSeries (the mean of all numbers in the XSeries).
Thus, the transformation subtracts the mean of each pandas.Series from the global mean.
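A minimal sketch of this transformation using only Pandas is given below. A plain Python list of pandas.Series objects stands in for the XSeries, and the function name mean_offset is hypothetical; it is not part of XPandas.

```python
import pandas as pd

# A list of pandas.Series stands in for an XSeries of pandas.Series objects.
xs = [
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([4.0, 6.0]),
    pd.Series([8.0]),
]


def mean_offset(container):
    """Apply T: X -> overall_mean - X.mean() to every element."""
    # The global mean needs every number in the container, which is why
    # the transformation cannot be written as a purely element-wise map.
    overall_mean = pd.concat(container).mean()
    return [overall_mean - x.mean() for x in container]


offsets = mean_offset(xs)
```

With these values the overall mean is 4.0 and the per-series means are 2.0, 5.0 and 8.0, so the offsets are 2.0, -1.0 and -4.0.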

API

XSeries is based on pandas.Series and can store objects of any type. For example, one may want to store pandas.Series objects inside an XSeries. This can be visualised with a schema.

The most important property of XSeries is homogeneity: an XSeries stores objects of the same type and has a property data_type that gives the type of the objects inside it. XDataFrame is based on pandas.DataFrame and stores a set of XSeries. Let’s now return to the data set of hospital patients.
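Before returning to that example, the homogeneity property can be sketched as a small check. The function infer_data_type is a hypothetical helper written for this illustration; the source does not specify how XPandas computes data_type internally.

```python
from typing import Any, Iterable, Optional


def infer_data_type(objects: Iterable[Any]) -> Optional[type]:
    """Return the common type of all objects, or raise if they are mixed."""
    types = {type(obj) for obj in objects}
    if not types:
        return None  # empty container: no element type to report
    if len(types) > 1:
        names = sorted(t.__name__ for t in types)
        raise TypeError(f"container is not homogeneous: {names}")
    return types.pop()


scores_type = infer_data_type([0.5, 1.5, 2.5])       # all floats
series_type = infer_data_type([[60, 62], [70, 80]])  # all lists
```

A container mixing, say, an integer and a string would be rejected with a TypeError under this sketch.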

We have already defined health features of different types, such as numbers (age, height, weight, etc.), categories (gender, hair colour, etc.), images (patients’ X-ray pictures), time series (heartbeat over time), and others. Using XDataFrame, we can now store all this information in one in-memory 2d container.

With such a complex data set, one usually wants to use ready-made machine learning algorithms such as those provided by scikit-learn. To apply them, the data set must be transformed into a 2d matrix of quantitative features.
Thus, the next step is to extract features from the columns of the XDataFrame. In the patients example, one may want to extract statistical features from each pandas.Series, or extract features from each image using a deep learning model.
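The statistical feature extraction step could look roughly like the following sketch, which uses only Pandas. The heartbeat values, the summarise helper and the choice of statistics are assumptions made for illustration.

```python
import pandas as pd

# One heartbeat recording per patient; lengths may differ between patients.
heartbeats = [
    pd.Series([60.0, 62.0, 64.0]),
    pd.Series([70.0, 80.0]),
]


def summarise(series: pd.Series) -> dict:
    """Reduce one recording to a fixed set of quantitative features."""
    return {
        "mean": series.mean(),
        "std": series.std(),  # sample standard deviation (ddof=1)
        "median": series.quantile(0.5),
    }


# One row per patient, one column per feature.
features = pd.DataFrame([summarise(s) for s in heartbeats])
matrix = features.to_numpy()  # 2d numeric matrix, usable by scikit-learn
```

Because every recording is reduced to the same three numbers, recordings of different lengths end up as rows of equal width.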

That is where the Transformer classes come in. Using the CustomTransformer
class, one can build one's own transformer for XSeries just by subclassing it. Similarly, one can use the DataFrameTransformer class to create a custom transformer for XDataFrame. A DataFrameTransformer is defined as a set of XSeries transformers.

An XSeries transformer can map into XSeries or XDataFrame objects.
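The subclassing pattern and the idea of a data frame transformer as a set of series transformers can be sketched in plain Python as follows. All class names here (SeriesTransformerSketch, MeanOfSeries, MinMaxOfSeries, DataFrameTransformerSketch) are hypothetical stand-ins and do not reproduce the XPandas classes or their signatures.

```python
from statistics import mean
from typing import Any, Dict, List


class SeriesTransformerSketch:
    """Hypothetical base class: subclasses override transform_one."""

    def transform_one(self, obj: Any) -> Any:
        raise NotImplementedError

    def transform(self, column: List[Any]) -> List[Any]:
        # Element-wise map over a 1d container.
        return [self.transform_one(obj) for obj in column]


class MeanOfSeries(SeriesTransformerSketch):
    """Maps each inner sequence to one number (a 1d result)."""

    def transform_one(self, obj):
        return mean(obj)


class MinMaxOfSeries(SeriesTransformerSketch):
    """Maps each inner sequence to several named values (a 2d result)."""

    def transform_one(self, obj):
        return {"min": min(obj), "max": max(obj)}


class DataFrameTransformerSketch:
    """Hypothetical: applies one series transformer per named column."""

    def __init__(self, transformers: Dict[str, SeriesTransformerSketch]):
        self.transformers = transformers

    def transform(self, frame: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
        return {name: t.transform(frame[name]) for name, t in self.transformers.items()}


frame = {"heartbeat": [[60, 62, 64], [70, 80]], "steps": [[1, 5], [2, 9, 4]]}
dft = DataFrameTransformerSketch({"heartbeat": MeanOfSeries(), "steps": MinMaxOfSeries()})
result = dft.transform(frame)
```

MeanOfSeries yields one value per element, the analogue of mapping into an XSeries, while MinMaxOfSeries yields several named values per element, the analogue of mapping into an XDataFrame.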

XPandas also provides PipelineChainTransformer, which chains several Transformers into a sequence. For example, one may want to extract features from an XSeries of pandas.Series objects using tsfresh and then apply scikit-learn's PCA.

An estimator (classifier/regressor) object can also be passed as the last step of a PipelineChainTransformer. In this case, the behaviour is similar to that of a scikit-learn Pipeline.
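The chaining behaviour can be sketched in plain Python as below. ChainSketch and the three toy steps are hypothetical and only mimic the fit/transform/predict convention used by scikit-learn; they are not the PipelineChainTransformer implementation.

```python
from typing import Any, List, Sequence


class RowSum:
    """Transformer: maps each row (a list of numbers) to its sum."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [sum(row) for row in X]


class Centre:
    """Transformer: subtracts the mean learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [value - self.mean_ for value in X]


class SignClassifier:
    """Toy estimator: predicts 1 for positive inputs and 0 otherwise."""

    def fit(self, X, y=None):
        return self  # nothing to learn in this toy example

    def predict(self, X):
        return [1 if value > 0 else 0 for value in X]


class ChainSketch:
    """Hypothetical chain: all steps but the last transform; the last predicts."""

    def __init__(self, steps: Sequence[Any]):
        self.steps = list(steps)

    def fit(self, X, y=None):
        # Fit each transformer on the output of the previous one.
        for step in self.steps[:-1]:
            X = step.fit(X, y).transform(X)
        self.steps[-1].fit(X, y)
        return self

    def predict(self, X) -> List[int]:
        # Reuse the fitted transformers, then delegate to the estimator.
        for step in self.steps[:-1]:
            X = step.transform(X)
        return self.steps[-1].predict(X)


chain = ChainSketch([RowSum(), Centre(), SignClassifier()])
chain.fit([[1, 2], [3, 4], [10, 20]], y=[0, 0, 1])
predictions = chain.predict([[5, 5], [9, 9]])
```

The point of the design is that the chain exposes the same fit and predict calls as a single estimator, so the intermediate transformations stay hidden from the caller.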

Full tutorials of usage are available here.

References

[1] Wes McKinney. pandas: a foundational python library for data analysis and statistics.

[2] Wes McKinney. Data structures for statistical computing in python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.

[3] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.