How to build a non-geographical map #1

Or how to map similarities for serendipitous data exploration (with Python)

Fanny Kassapian
Towards Data Science

--

Maps are powerful design objects for data visualization. They are commonly used to emphasize the spatial relationship between elements and compare multiple variables. In the context of non-geographical data, dimensionality reduction techniques allow us to map similarities in a two-dimensional space without compromising the richness of the initial information.

🔗 Link to #part 2: How to turn a scatter plot into an interactive map

#Part 1: Dimensionality reduction & visualization

Step 1. Set up features and initial dimensions

In a geographical map, each element’s coordinates are inherently defined by its position. An element’s position is made of three parameters: latitude, longitude and elevation. To represent locations on Earth, the geographic coordinate system uses either a spherical coordinate system (globe) or a cartesian coordinate system (flat map). To transform the Earth’s three-dimensional space into a two-dimensional map, we use map projection techniques, the most common of which is the Mercator cylindrical projection.

To visualize your non-geographical data in a map-like way, you need to think of each element (or point on the map), as defined by a certain set of features (or attributes).

This set of features must be constant across your dataset: each element is described by assigning different values to the same features.

This is what your dataset should look like.
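Since the original illustration may not render here, a minimal pandas sketch of that shape (the element and feature names below are hypothetical, purely for illustration):

```python
import pandas as pd

# Hypothetical toy dataset: every element (row) is described
# by the same set of features (columns), with different values.
df = pd.DataFrame(
    {
        "feature_a": [0.9, 0.1, 0.5],
        "feature_b": [0.2, 0.8, 0.5],
        "feature_c": [0.4, 0.3, 0.9],
    },
    index=["element_1", "element_2", "element_3"],
)
print(df.shape)  # 3 elements, 3 features
```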

To identify what attributes would best describe your data, you can ask yourself:

“What intuitively makes two elements similar, and what differentiates them?”

On the map, the closer together two elements are, the more similar they are; the further apart, the more different.

Example - mapping occupations based on skills and knowledge similarity

In this example, I want to represent job titles depending on their similarity in terms of required skills and knowledge, using the ONET open database.

I’m quite lucky, because this dataset is very well organized. All occupations are described by the same set of skills (same with knowledge), and each of them is broken down by importance (how important this skill is to perform the job) and level (required level of expertise for the job). The “Data Value” column assigns the respective rating.

ONET skills.xlsx

I defined each skill and knowledge item as a single, separate feature:

This is what my dataset should look like after cleaning

Thanks to the quality of the original dataset, this part was fairly easy:

📝>> Check out the full notebooks here

I am now left with a matrix whose row vectors are jobs (n=640) and whose columns are skills and knowledge parameters (p=134).
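As a sketch of how that reshaping can be done with pandas (the column names below are assumptions based on the description above, not necessarily the exact ONET headers), the raw “long” file, with one row per occupation/skill pair, can be pivoted so that each row is a job and each column a feature:

```python
import pandas as pd

# Hypothetical long-format input: one row per occupation/skill pair.
# "Title", "Element Name" and "Data Value" are assumed column names.
long_df = pd.DataFrame({
    "Title": ["Job A", "Job A", "Job B", "Job B"],
    "Element Name": ["Skill 1", "Skill 2", "Skill 1", "Skill 2"],
    "Data Value": [3.1, 2.5, 4.0, 1.2],
})

# Pivot long -> wide: rows are jobs, columns are skill features.
X = long_df.pivot_table(index="Title", columns="Element Name",
                        values="Data Value")
print(X.shape)  # one row per job, one column per feature
```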

In machine learning, “dimensionality” simply refers to the number of features (or attributes, or variables) in your dataset.

Since each element is described by 134 parameters, we need a 134-dimensional space to fully represent it… Hard to picture, isn’t it?

Let’s see how we can factorize this matrix in order to reduce its representation to a 2-dimensional space.

Step 2. Reduce dimensionality

We want to create a lower dimensional representation where we keep some of the structure from the higher dimensional space.

Instinctively, we would like points that were close to each other in the original space to be close to each other in 2D. The same goes for far away points.
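One way to make that goal measurable (this check is my addition, not part of the original workflow) is to compare pairwise distances before and after the reduction. A rank correlation close to 1 means the low-dimensional picture preserves the distance structure well:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X_high = rng.normal(size=(50, 10))  # toy high-dimensional data
X_low = X_high[:, :2]               # toy 2-D "embedding" (first two axes only)

# Rank correlation between all pairwise distances in the original
# space and in the reduced space.
rho, _ = spearmanr(pdist(X_high), pdist(X_low))
print(round(rho, 2))
```

In practice you would replace `X_low` with the output of the dimensionality reduction algorithm you are evaluating.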

Doing this by hand would be an impossible task. So I looked around for dimensionality reduction algorithms.

No free lunch

Before we go over a few of them, note that no one algorithm fits every problem. This simple (but often forgotten) fact is humorously called the “no free lunch” theorem. Each model relies on some assumptions to produce a simplification of reality. But in certain situations, those assumptions fail. Hence, they produce an inaccurate version of reality.

Consequently, you must pick a few appropriate algorithms depending on the constraints of your data and on the nature of your problem. Then, try them out to see which one works best for you.

Dimensionality reduction — Purpose & method (far from exhaustive)

In my project, for instance, I am using dimensionality reduction for the sole purpose of data visualization, not as a preliminary step before applying a pattern recognition algorithm. Also, I want to produce a map. Therefore, I am interested in the accuracy of the distance between points as a proxy for similarity, but I don’t really care about the interpretability of the new set of axes.

This leads us to feature extraction algorithms.

a. Linear transformation

  • Principal Component Analysis (PCA)

A well known and commonly used linear transformation algorithm is Principal Component Analysis (PCA).

In the matrix above, each job is described by 134 dimensions. It is very likely that, to some extent, columns are linearly correlated. When two columns are perfectly correlated, I only need one to describe the other. To reduce the level of inaccuracy of my new representation while dropping dimensions, I had better keep the dimensions that describe most of the variation in my dataset. In other words, get rid of the unnecessary and noisy dimensions, and keep only the most informative ones. But still, dropping 132 dimensions is a lot.

Instead of identifying the dimensions where there is the most variance, PCA identifies the directions where there is the most variance, i.e., where the data is most spread out.

These directions are called principal components (or eigenvectors), and are made of “combinations” of the initial dimensions. Hence, they are much more informative than any single dimension alone, but unlike initial dimensions, they can’t be labeled with a particular feature (hence difficult to interpret).

PCA finds a new coordinate system made of a pair of orthogonal (uncorrelated) principal components with the highest variance, and assigns new values to each point in the dataset to position them accordingly.
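To make that concrete, here is a minimal NumPy sketch of what PCA computes under the hood (scikit-learn’s implementation, used below, does all of this for you):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # toy data: 100 points, 5 features

# 1) Center the data: PCA operates on deviations from the mean.
Xc = X - X.mean(axis=0)

# 2) SVD of the centered data: the rows of Vt are the principal
#    components, i.e. the orthogonal directions of maximal variance,
#    sorted by decreasing variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3) Project onto the first two components to get 2-D coordinates.
pc = Xc @ Vt[:2].T
print(pc.shape)  # (100, 2)
```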

Illustration of what PCA does

By nature, PCA preserves the global structure of the data, because it finds the coordinate system that will produce the highest variance for the whole dataset, in a “one size fits all” manner. It does not take into account the initial positions of points relative to each other. Rather, it focuses on spreading the data as much as possible given the reduced dimensionality.

If you’re having a hard time visualizing what I just explained, take a look at this awesome interactive visualization of PCA.

Ok, let’s try it out:

1/ Create a “new” set of features with PCA

First, you need to import PCA from the scikit-learn library.

from sklearn.decomposition import PCA

Then, set n_components. If you want to build a two-dimensional coordinate system, you set it up so that PCA finds 2 principal components.

pca = PCA(n_components=2)

Use fit_transform to fit the model with your DataFrame (X) and apply the dimensionality reduction.

pc = pca.fit_transform(X)

PCA builds an array of 640 pairs of coordinates along the first and second principal components respectively.
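If you want to know how much information the two components actually retain, scikit-learn exposes `explained_variance_ratio_` on the fitted model. A minimal sketch on stand-in random data (your real 640 × 134 matrix would go in place of `X`):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the 640 x 134 job matrix

pca = PCA(n_components=2)
pc = pca.fit_transform(X)

# Fraction of the original variance captured by each component;
# a low total means the 2-D picture drops a lot of information.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```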

2/ Visualize with Plotly

Once you’ve imported plotly…

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go

…draw a scatter plot using the first and second element of each pair of coordinates as x and y respectively:

data = [go.Scatter(x=pc[:,0], y=pc[:,1], mode='markers')]

My data is not labeled, so the only way to check if its representation is accurate is to hover over points and check if their neighbors “make sense”.

>> See notebook (Github) >> See graph (Plotly)

As you can see, the data is well spread out, but there seem to be a lot of “outliers”. Note that my judgement of whether the algorithm works well with this dataset is subjective: it depends on my own expectations for the end visualization.

For this project, I am highly interested in preserving the local structure of the data. I want clusters that were neighbors in the dataset to remain neighbors after the dimensionality reduction. But as mentioned above, that’s not PCA’s best asset.

b. Non-linear transformation

The non-linear dimensionality reduction (NLDR) approach assumes that in the high-dimensional space, the data has some underlying low-dimensional structure lying on an embedded non-linear manifold.

I’m not a mathematician, and topology is quite a complex field, so I’ll try to explain it metaphorically.

A manifold resembles Euclidean space near each point. A Euclidean space is a space where the very first theorems you learnt in geometry, like Thales’s theorem, apply. Two-dimensional manifolds include surfaces (they resemble the Euclidean plane near each point).

NLDR assumes that, if the embedded manifold is two-dimensional (even though the data is 3 -or more- dimensional), then the data can be represented on a two-dimensional space too.

A good way to picture it is by looking at your duvet before you’ve made your bed. Imagine you had a rough night and your bed is messy. If you were to describe each point on the duvet, you would need three dimensions. But the duvet resembles the Euclidean plane near each point. So, it can be flattened into a plane (two-dimensional).

Illustration of non-linear dimensionality reduction, from 3D to 2D

Non-linear algorithms are good at preserving the data’s local structure because they adapt to the underlying data, performing different transformations on different regions of the manifold.
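As a toy illustration (not part of my project), scikit-learn’s Swiss roll dataset is the classic stand-in for the duvet: 3-D points lying on a rolled-up 2-D sheet. Isomap, one classic non-linear method, “unrolls” it by preserving distances measured along the manifold itself:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 500 points in 3-D, all lying on a rolled-up 2-D sheet.
X3d, _ = make_swiss_roll(n_samples=500, random_state=0)

# Isomap flattens the sheet into 2-D by preserving geodesic
# (along-the-manifold) distances between neighbors.
X2d = Isomap(n_neighbors=10, n_components=2).fit_transform(X3d)
print(X3d.shape, X2d.shape)  # (500, 3) (500, 2)
```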

  • T-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a recent but popular non-linear visualization method. It is based on local distance measurements, so it preserves close neighbors well. However, it does not necessarily preserve global structure, meaning it does not always capture the distances between clusters. In the context of my project, that could be an issue.

If you want to understand better how t-SNE works: this article explains how to read t-SNE’s results correctly.

1/ Create the embedding

Just like you did for PCA, import TSNE from the scikit-learn library, set n_components to 2 and create the embedding with fit_transform:

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
embedding = tsne.fit_transform(X)
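One caveat worth adding: t-SNE’s output is quite sensitive to its `perplexity` parameter, which roughly controls how many neighbors each point “attends to”. It is worth comparing a few values before trusting one layout. A sketch on stand-in data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # stand-in for the real feature matrix

# Different perplexities can produce noticeably different cluster
# layouts, so compare several before settling on one.
embeddings = {}
for perplexity in (5, 30, 50):
    embeddings[perplexity] = TSNE(
        n_components=2, perplexity=perplexity,
        init="pca", random_state=0,
    ).fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```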

2/ Have a look at the result:

>> See notebook (Github) >> See graph (Plotly)

It looks pretty good already. Let’s try one last method.

  • Uniform Manifold Approximation and Projection (UMAP)

UMAP, a newcomer, tries to balance preserving both local and global distances, which is exactly what I need.

UMAP is fairly easy to use, and there are a few parameters that you can adjust (the defaults are quite good):

  • n_neighbors: constrains the size of the local neighborhood UMAP will look at when attempting to learn the manifold structure of the data
  • n_components: indicates the number of dimensions of the target space
  • min_dist: sets the minimum distance between points in the low dimensional representation
  • metric: controls how distance is computed in the ambient space of the input data (feature space)

For more details, you can find the documentation here (link to UMAP).

1/ Create the embedding

Once you’ve imported UMAP, you can instantiate the UMAP class and call fit_transform to create the final array of coordinates (full notebook here):

import umap
embedding = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.3, metric='correlation').fit_transform(X.values)

2/ Visualize

The distribution of points seems pretty logical to me, and clusters are sensibly positioned relative to one another:

>> See notebook (Github) >> See graph (Plotly)

Wrap up

Thank you for reading to the end 😃

So far, we’ve built a scatter plot with a coordinate system that has no obvious meaning in itself, but that uses distance as a proxy for similarity among data features. The visualization is quite basic, but it lays the groundwork for a map.

Next, we’ll see how to give it a format that resembles a map in both its aspect and its use (link to #part 2).

In the meantime, feel free to share your inspiring non-geographical maps and other dimensionality reduction methodologies.

🔗 Link to #part 2: How to turn a scatter plot into an interactive map

👉 Check out how I use this in practice: www.tailoredpath.com

📫 Let me know what you think: tailoredpath@gmail.com
