Santa2Graph: visualize high dimensional data with Giotto Mapper

A gentle introduction to the Mapper algorithm for topological data visualisation including a tutorial using the implementation in giotto-learn

Francesco Palma
Towards Data Science

--

Written by Francesco Palma, Davide Burba, Lewis Tunstall, and Thomas Boys

Link to the tutorial: https://github.com/giotto-ai/tutorial_mapper

Unless you are unnaturally good with numbers, data visualization remains the most immediate and impactful source of insight for data analysis. The visual interplay between features provides insight that will influence many choices down the road like feature engineering and feature selection.

We all like having lots of features. The upside is that there is potentially more information; the downside is the difficulty involved in isolating this information. As we cannot easily visualise objects in more than three dimensions, standard visualisation methods are not well suited for high-dimensional data.

The Mapper algorithm offers a way to visualize the intrinsic shape of data in a single graph. This graph is built by a clever clustering procedure that reveals the coarse structure of the underlying data. Effectively, there are theoretical results which confirm the topological accuracy of the output.

In this article we provide an introduction to the Mapper algorithm, and a tutorial based on the implementation given by the giotto-learn library. One of its cool features allows you to interact with some of the Mapper parameters without re-running the code.

Interactive Mapper available in giotto-learn.

Applications of the Mapper algorithm

Since 2007 Mapper has been used to simplify the visualization of complex interactions. It offers a qualitative analysis on large feature spaces, lending itself to applications in medicine, material science and genomics. More recently it has also been applied to improve the robustness of neural networks.

We summarise below two famous applications of the Mapper algorithm:

1) Analysis of breast cancer patients

With only 200 patients in a study on breast cancer, a specialised method was needed to deal with the 30k dimensions of the genotype’s feature space. The Mapper algorithm has been successfully applied to reach a finer sub-division of the patients, resulting in major improvements for targeted therapies. The same analysis has been performed on two different data sets and provided consistent outputs, proving the algorithm stability.

2) Classification of NBA players

If you have ever played basketball, you know it has always been played with five positions: point guard, shooting guard, small forward, power forward and center. Using players’ statistics and in-game performance, the Mapper algorithm identifies thirteen new playing styles which reflect modern basketball dynamics. This informs the General Manager of a team on the skills not to miss when building a team.

The Mapper algorithm

From the original paper:

We do not attempt to obtain a fully accurate representation of a data set, but rather a low-dimensional image which is easy to understand, and which can point to areas of interest.

With the objective of producing a coarse graph that represents the structure of the data, the algorithm amalgamates points into nodes, and connects them whenever points are common to distinct nodes.

This process is motivated by the Nerve theorem which guarantees that the Mapper output is topologically equivalent to the shape the data is sampled from.

In practice the algorithm is a three-step process:

  1. Filtering: we map the data points into ℝ by using a filter function f.
  2. Covering: we cover the filter values with overlapping intervals.
  3. Clustering: for each interval, we apply a clustering algorithm to the observations which were mapped in that interval.
Among the points that are close in filter value, we cluster together the one that are similar in the original space. Nodes are connected whenever they share a point. (Source: https://arxiv.org/abs/1904.11044)

1 ) Filtering

The first step of the Mapper consists in mapping each data point x to a low dimensional space ℝᵐ through a filter function f: ℝⁿℝᵐ. Normally we choose m=1 or m=2.

The choice of the filter function has a large impact on the Mapper outcome, as points that are far in filter value have no chance to be clustered together. Thus, the filter function serves as a coarse measurement of closeness.

In the figure above, the authors use the height function, but any function can in principle do the job. However, some common choices are:

  1. Projection on axis
  2. PCA
  3. Eccentricity
  4. Density
  5. Entropy

2 ) Covering

We divide the image space into overlapping intervals (or product of intervals if m>1) in a way that cover all the filter values. We call such a construction a cover.

Usually we set the cover to be made by equally sized m-dimensional intervals. For instance, if the filter function takes values in ℝ, the cover is made by a sequence of overlapping segments with equal length.

In this case, the parameters to be chosen are the number of intervals and their percentage of overlap. In the example above there are 4 intervals with 25% overlap.

3) Clustering

In the last step, we successively perform clustering on each interval of the cover. The clustering is done on the original space by taking each time the pre-image of the intervals through the filter function. The output graph is made by:

  • nodes which represent clusters of data points;
  • edges which represent non-empty intersections between pairs of clusters (clusters sharing some data points). This is possible due to the overlapping intervals.

At this point each cluster represents a node of the graph and the edges correspond to clusters with common observations.

The tutorial: use Mapper to retrieve Santa Claus

The best way to intuitively understand how the Mapper works is to use it and “play” with its parameters.

Giotto is an open source project which contains giotto-learn, an easy-to-use toolkit for Topological Data Analysis. It uses a Scikit-learn-like API and provides a convenient way to fit the Mapper through a pipeline function. Given a dataset in the form of a Numpy array, the code to build a graph goes as follows:

pipeline = make_mapper_pipeline(
filter_func=Projection(column_indices=2),
cover=OneDimensionalCover(n_intervals=10, overlap_frac=0.1),
clusterer=DBSCAN(eps=0.5),
)
graph = pipeline.fit_transform(data)

We apply the Mapper algorithm on a dataset containing 20,000 three-dimensional data points sampled from a Santa Claus shape, the so called “Santa Cloud”.

Left: dataset sampled from here using CloudCompare. Right: Mapper output using different parameters.

As a first try, we use the default parameters for cover and clustering along with the projection on the height as a filter function, i.e. f: [x,y,z]↦z. We color each node by the average color of the points:

With the default parameters Santa might make it through the chimney.

The graph is not representative of the dataset as we cannot distinguish any characteristic features of Santa Claus’ body. Beyond the arbitrary position of the nodes, giving Santa a snake-like appearance, the graph is equivalent to a line of connected nodes. This suggests that the default parameter for the clustering algorithm needs to be changed, as we are always clustering all the filter intervals together.

To solve this issue, we perform a finer clustering:

Mapper output: DBSCAN(eps = 0.06).

Great! This is a non trivial graph structure. In particular we see the appearance of branches that represent the arms and the legs of Santa Claus. However, in this representation Santa has 4 arms.

We tend to this issue by tuning the fraction of overlap between the intervals in the cover (default is 0.1):

Mapper output: DBSCAN(eps = 0.06), overlap_frac = 0.35.

This time we can clearly distinguish the legs, the arms and the head. Although it is very simplistic, in general this is enough to capture the main structure.

We can further refine our study by increasing the number of intervals in the cover (default is 10). In this way, we see the appearance of a node representing the hat, and a branch separating the chest from the beard:

Mapper output: DBSCAN(eps = 0.06), overlap_frac = 0.35, n_intervals = 20.

Conclusion

The giotto-learn library provide a fast and complete Python implementation of the Mapper algorithm in open-source. We are sharing a tutorial on 3-d data, now you can have fun representing high dimensional datasets with your own Mapper graphs.

Special thanks to Umberto Lupo and Lewis Tunstall for the implementation of Mapper in giotto-learn.

Links:

--

--

Co-founder at L2F | Open-source developer @Giotto_ai | Mathematician interested in everything with a topology