Developing 1d/2d data containers and transformers for data analysis

### Introduction

Every data science problem starts with data. Before any data can be analysed, it usually needs to be preprocessed. Preprocessing can be simple filtering or something more complex, such as transformation or feature extraction.

For the Python programming language, the most popular library for working with 1d/2d data sets is **Pandas**. For 1d data, such as a sequence of numbers, the *pandas.Series* object is appropriate. For 2d data, the corresponding object is *pandas.DataFrame*. Pandas objects are great for storing and transforming in-memory data of both quantitative and categorical types. Generally, each cell of a *pandas.DataFrame* or *pandas.Series* is either a real number or a categorical string.
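As a small illustration of the two Pandas containers described above, the following sketch builds a 1d *pandas.Series* and a 2d *pandas.DataFrame* with quantitative and categorical columns. The column names and values are made up for the example.

```python
import pandas as pd

# 1d container: a sequence of numbers (hypothetical ages).
ages = pd.Series([34, 58, 41], name="age")

# 2d container: quantitative and categorical columns side by side.
patients = pd.DataFrame(
    {
        "age": [34, 58, 41],
        "gender": pd.Categorical(["F", "M", "F"]),
        "score": [0.71, 0.42, 0.93],
    }
)

mean_age = ages.mean()            # a simple transformation of 1d data
n_rows, n_cols = patients.shape   # number of rows and columns
```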

Nowadays data sets are getting more and more complex. For example, imagine a data set of hospital patients. The hospital needs to keep patient features such as age (integer), gender (categorical) and some analysis scores (float), *and* more complex ones such as heartbeat rate (Series) and x-ray image (Image). A plain *pandas.DataFrame* is not well suited to storing this kind of data set.

Usually, researchers write their own infrastructure for working with such data types, or use standard Python lists and lists of lists (or numpy). After storing this kind of data, one would like to perform feature extraction on different columns. For example, one may want to extract the mean, standard deviation and quantiles from a series, or use a CNN to extract features from images, and so on. With a list-of-lists data container such operations are not user friendly and usually lead to spaghetti code.

**XPandas** (extended Pandas) provides universal 1d/2d data containers for storing any kind of data, and a *Transformers* interface for feature extraction and transformations.

The next section describes the simple mathematical background of **XPandas** and its API design.

### Mathematical Setting and API Design

#### Mathematical Definitions

Let Ω be an arbitrary space of objects. For example, Ω can be the space of *pandas.Series* objects.

*X* is called a *1d abstract data container* if *X* ∈ Ωⁿ, where *n* is the number of objects in the data container, called its size. In **XPandas**, a *1d abstract data container* is called an *XSeries*.

*DF* is a *2d abstract data container* of size *m* if *DF* is a set of *m* *1d abstract data containers*.

Correspondingly, in **XPandas** it is called *XDataFrame*. An *XDataFrame* is a set of *m* *XSeries*, each of size *n*.

*XSeries* and *XDataFrame* are the core components of **XPandas**.
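The two definitions above can be sketched in plain Python. This is only an illustration of the mathematics, not the XPandas implementation: the helper names `is_1d_container` and `is_2d_container` are hypothetical, a Python type stands in for the space Ω, and a dictionary of lists stands in for a set of 1d containers.

```python
from typing import Any, Dict, List


def is_1d_container(xs: List[Any], omega: type) -> bool:
    """True if every element of xs belongs to omega, i.e. xs is in omega^n."""
    return all(isinstance(x, omega) for x in xs)


def is_2d_container(columns: Dict[str, List[Any]], n: int) -> bool:
    """True if every one of the m columns is a 1d container of size n."""
    return all(len(column) == n for column in columns.values())


heartbeats = [[72.0, 75.0], [80.0, 84.0], [66.0, 69.0]]  # omega = list, n = 3
ages = [34, 58, 41]                                      # omega = int, n = 3
table = {"heartbeat": heartbeats, "age": ages}           # m = 2 columns
```

Here each column has its own space Ω (lists for `heartbeat`, integers for `age`), while all columns share the same size *n*.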

**XPandas** also provides an interface for Transformers.

A *Series transformer* is a function *T*: *XSeries* → (*XSeries* or *XDataFrame*).

A *Data frame transformer* is a function that encapsulates multiple *Series transformers*:

*DFT*: {*SeriesTransformer*} → *XDataFrame*
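To make the two kinds of transformer concrete, here is a minimal sketch using plain Python lists and dictionaries as stand-ins for the 1d and 2d containers. All names (`mean_of_each`, `min_max_of_each`, `apply_frame_transformer`) are hypothetical and are not part of the XPandas API. One series transformer returns a 1d result, the other a 2d result, and the data frame transformer applies one series transformer per column.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Union

Column = List[Any]          # stand-in for a 1d container
Table = Dict[str, Column]   # stand-in for a 2d container


def mean_of_each(column: Column) -> Column:
    """Series transformer with a 1d result: one mean per inner sequence."""
    return [mean(x) for x in column]


def min_max_of_each(column: Column) -> Table:
    """Series transformer with a 2d result: two feature columns."""
    return {"min": [min(x) for x in column], "max": [max(x) for x in column]}


def apply_frame_transformer(
    table: Table,
    transformers: Dict[str, Callable[[Column], Union[Column, Table]]],
) -> Table:
    """Apply one series transformer per named column and collect the results."""
    out: Table = {}
    for name, transform in transformers.items():
        result = transform(table[name])
        if isinstance(result, dict):
            # A 2d result contributes several prefixed columns.
            for sub_name, sub_column in result.items():
                out[f"{name}_{sub_name}"] = sub_column
        else:
            out[name] = result
    return out


table = {
    "heartbeat": [[72.0, 75.0], [80.0, 84.0]],
    "pressure": [[120, 118], [135, 140]],
}
features = apply_frame_transformer(
    table, {"heartbeat": mean_of_each, "pressure": min_max_of_each}
)
```

The result is again a 2d container, now holding only quantitative features.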

Transformers can perform classical element-wise map operations on each element of an *XSeries*, as well as more complex transformations. A trivial example of a non-element-wise transformation is

*T*: *X* → (*overall_mean* − *X.mean()*)

where *X* is a *pandas.Series* and *overall_mean* is the mean of the whole *XSeries* (the mean of all numbers in the *XSeries*).

Thus, the transformation subtracts the mean of each *pandas.Series* from the global mean.
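The non-element-wise transformation above can be worked through on a tiny example. The sketch below uses a plain Python list of *pandas.Series* objects as a stand-in for an *XSeries*, and the numbers are made up. The value computed for one element depends on every other element through the global mean, which is why a simple per-element map is not enough.

```python
import pandas as pd

# Hypothetical outer container: a plain list standing in for an XSeries
# whose elements are pandas.Series objects.
outer = [
    pd.Series([1.0, 2.0, 3.0]),
    pd.Series([4.0, 6.0]),
]

# Global statistic: the mean of all numbers across every inner series.
overall_mean = pd.concat(outer, ignore_index=True).mean()

# T: X -> overall_mean - X.mean(), applied to each inner series.
transformed = [overall_mean - x.mean() for x in outer]
```

With these values the global mean is 16 / 5 = 3.2 and the inner means are 2.0 and 5.0, so the transformed values are approximately 1.2 and -1.8.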

#### API

*XSeries* is based on *pandas.Series* and can store objects of any type. For example, one may want to store *pandas.Series* objects inside an *XSeries*.

[Figure: schema of an *XSeries* storing *pandas.Series* objects]

The most important property of *XSeries* is homogeneity: an *XSeries* stores objects of the same type and has a property *data_type* that gives the type of the objects inside it. *XDataFrame* is based on *pandas.DataFrame* and stores a set of *XSeries*. Let’s now return to the data set of hospital patients.

We’ve already defined some health features of different types, such as numbers (age, height, weight, etc.), categories (gender, hair colour, etc.), images (patients’ x-ray pictures), time series (heartbeat over time), and others. Using *XDataFrame* we can now store all this information in one in-memory 2d container.

With such a complex data set, one usually wants to use ready-to-go machine learning algorithms such as those provided by scikit-learn. To apply them, the data set must first be transformed into a 2d matrix of quantitative features.

Thus, the next step is to extract features from the columns of the *XDataFrame*. In the example with the patients’ data, one may want to extract statistical features from each *pandas.Series* or extract features from each image using a deep learning model.

That’s where the Transformer classes come in. Using the *CustomTransformer* class, one can build one’s own transformer for *XSeries* just by subclassing it. Similarly, one can use the *DataFrameTransformer* class to create a custom transformer for *XDataFrame*. A *DataFrameTransformer* is defined as a set of *XSeries* transformers.
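The subclassing pattern can be sketched as follows. The base class here is a hypothetical stand-in written for this example. It does not reproduce the real *CustomTransformer* interface, whose method names and constructor arguments are not shown in this article. The idea is only that a subclass supplies the per-element logic while the base class handles iteration and a homogeneity check.

```python
from statistics import mean, pstdev


class SeriesTransformerSketch:
    """Hypothetical base class, not the XPandas API."""

    accepted_type = object  # subclasses narrow this to the element type

    def transform_one(self, element):
        raise NotImplementedError

    def transform(self, column):
        # Enforce homogeneity before transforming any element.
        for element in column:
            if not isinstance(element, self.accepted_type):
                raise TypeError(
                    f"expected {self.accepted_type.__name__}, "
                    f"got {type(element).__name__}"
                )
        return [self.transform_one(element) for element in column]


class SummaryStats(SeriesTransformerSketch):
    """Extracts (mean, population standard deviation) from each sequence."""

    accepted_type = list

    def transform_one(self, element):
        return (mean(element), pstdev(element))


stats = SummaryStats().transform([[1.0, 3.0], [2.0, 2.0]])
```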

**XPandas** also provides *PipelineChainTransformer*, which chains several Transformers into a sequence. For example, one may want to extract features from an *XSeries* of *pandas.Series* objects using tsfresh and then apply scikit-learn’s PCA.
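The chaining idea can be illustrated without any library: each step consumes the output of the previous one. The class `ChainSketch` and the two step functions below are hypothetical and much simpler than the tsfresh and PCA steps mentioned above. The sketch covers only the transformation steps.

```python
from statistics import mean
from typing import Any, Callable, Sequence


class ChainSketch:
    """Hypothetical chain: applies the steps in order, not the XPandas API."""

    def __init__(self, steps: Sequence[Callable[[Any], Any]]):
        self.steps = list(steps)

    def transform(self, data: Any) -> Any:
        for step in self.steps:
            data = step(data)  # output of one step feeds the next
        return data


def extract_means(column):
    """Step 1: reduce each inner sequence to its mean."""
    return [mean(x) for x in column]


def center(values):
    """Step 2: subtract the mean of the extracted features."""
    m = mean(values)
    return [v - m for v in values]


chain = ChainSketch([extract_means, center])
result = chain.transform([[1.0, 3.0], [5.0, 7.0]])
```

Here the first step yields `[2.0, 6.0]` and the second centres those values around their mean of 4.0.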

An estimator (classifier/regressor) object can also be passed as the last step of a *PipelineChainTransformer*. In this case the behaviour is similar to that of a scikit-learn Pipeline.

Full tutorials of usage are available here.

### References

[1] Wes McKinney. pandas: a foundational python library for data analysis and statistics.

[2] Wes McKinney. Data structures for statistical computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.

[3] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.