How to build a non-geographical map #1
Or how to map similarities for serendipitous data exploration (with Python)
Maps are powerful design objects for data visualization. They are commonly used to emphasize the spatial relationship between elements and compare multiple variables. In the context of non-geographical data, dimensionality reduction techniques allow us to map similarities in a two-dimensional space without compromising the richness of the initial information.
🔗 Link to #part 2: How to turn a scatter plot into an interactive map
#Part 1: Dimensionality reduction & visualization
Step 1. Set up features and initial dimensions
To visualize your non-geographical data in a map-like way, you need to think of each element (or point on the map), as defined by a certain set of features (or attributes).
This set of features must be constant across your dataset, meaning that every element can be described by assigning its own values to the same set of features.
To identify what attributes would best describe your data, you can ask yourself:
“What intuitively makes two elements similar, and what differentiates them?”
On the map, the closer together, the more similar two elements are. The further away, the more different.
Example - mapping occupations based on skills and knowledge similarity
In this example, I want to represent job titles based on their similarity in terms of required skills and knowledge, using the O*NET open database.
I’m quite lucky, because this dataset is very well organized. All occupations are described by the same set of skills (and likewise for knowledge), and each of them is broken down by importance (how important this skill is to performing the job) and level (the level of expertise required for the job). The “Data Value” column assigns the respective rating.
I defined each skill and knowledge item as a single, separate feature:
Thanks to the quality of the original dataset, this part was fairly easy:
I am now left with a matrix whose row vectors are jobs (n=640) and whose columns are skills and knowledge parameters (p=134).
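To make the reshaping step concrete, here is a minimal sketch of how such a jobs × features matrix can be assembled with pandas. The toy data and the column names ("Title", "Element Name", "Data Value") are assumptions modeled on the O*NET layout, not the real files:

```python
import pandas as pd

# Toy stand-in for the O*NET-style long format: one row per
# (occupation, feature, rating). Column names are assumptions.
ratings = pd.DataFrame({
    "Title": ["Data Scientist", "Data Scientist", "Chef", "Chef"],
    "Element Name": ["Mathematics", "Cooking", "Mathematics", "Cooking"],
    "Data Value": [4.5, 1.0, 1.5, 5.0],
})

# Pivot to one row per occupation, one column per feature.
X = ratings.pivot_table(index="Title", columns="Element Name",
                        values="Data Value")
print(X.shape)  # (2, 2): 2 jobs x 2 features
```

On the full dataset, the same pivot yields the 640 × 134 matrix described above.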
In machine learning, “dimensionality” simply refers to the number of features (or attributes, or variables) in your dataset.
Since each element is described by 134 parameters, we need a 134-dimensional space to fully represent it… Hard to picture, isn’t it?
Let’s see how we can factorize this matrix in order to reduce its representation to a 2-dimensional space.
Step 2. Reduce dimensionality
We want to create a lower dimensional representation where we keep some of the structure from the higher dimensional space.
Instinctively, we would like points that were close to each other in the original space to be close to each other in 2D. The same goes for far away points.
Doing this by hand would be an impossible task. So I looked around for dimensionality reduction algorithms.
No free lunch
Before we go over a few of them, note that no one algorithm fits every problem. This simple (but often forgotten) fact is humorously called the “no free lunch” theorem. Each model relies on some assumptions to produce a simplification of reality. But in certain situations, those assumptions fail. Hence, they produce an inaccurate version of reality.
Consequently, you must pick a few appropriate algorithms depending on the constraints of your data and on the nature of your problem. Then, try them out to see which one works best for you.
In my project, for instance, I am using dimensionality reduction for the sole purpose of data visualization, not as a preliminary step before applying a pattern recognition algorithm. Also, I want to produce a map. Therefore, I am interested in the accuracy of the distance between points as a proxy for similarity, but I don’t really care about the interpretability of the new set of axes.
This leads us to feature extraction algorithms.
a. Linear transformation
- Principal Component Analysis (PCA)
A well known and commonly used linear transformation algorithm is Principal Component Analysis (PCA).
In the matrix above, each job is described by 134 dimensions. It is very likely that, to some extent, the columns are linearly correlated. When two columns are perfectly correlated, I only need one to describe the other. To limit the inaccuracy of my new representation while dropping dimensions, I had better keep the dimensions that describe most of the variation in my dataset. In other words, get rid of the unnecessary and noisy dimensions, and keep only the most informative ones. But still, dropping 132 dimensions is a lot.
Instead of identifying the dimensions where there is the most variance, PCA identifies the directions where there is the most variance, i.e., where the data is most spread out.
These directions are called principal components (or eigenvectors), and are made of “combinations” of the initial dimensions. Hence, they are much more informative than any single dimension alone, but unlike initial dimensions, they can’t be labeled with a particular feature (hence difficult to interpret).
PCA finds a new coordinate system made of a pair of orthogonal (uncorrelated) principal components with the highest variance, and assigns new values to each point in the dataset to position them accordingly.
By nature, PCA preserves the global structure of the data, because it finds the coordinate system that will produce the highest variance for the whole dataset, in a “one size fits all” manner. It does not take into account the initial position of points relatively to each other. Rather, it focuses on spreading the data as much as possible in view of the dimensionality reduction.
If you’re having a hard time visualizing what I just explained, take a look at this awesome interactive visualization of PCA.
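For readers who prefer code to pictures, the mechanics can also be reproduced by hand: center the data, diagonalize its covariance matrix, and project onto the two eigenvectors with the largest eigenvalues. This is a sketch on synthetic data (not the O*NET matrix), checked against scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features

# Center the data, then diagonalize the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]   # sort by variance, descending
top2 = eigvecs[:, order[:2]]        # the two principal components
manual = Xc @ top2                  # project every point onto them

# scikit-learn agrees, up to the sign of each component.
sklearn_pc = PCA(n_components=2).fit_transform(X)
assert np.allclose(np.abs(manual), np.abs(sklearn_pc), atol=1e-6)
```

The sign ambiguity is expected: an eigenvector and its negation span the same direction.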
Ok, let’s try it out:
1/ Create a “new” set of features with PCA
First, you need to import PCA from the scikit-learn library.
from sklearn.decomposition import PCA
Then, set n_components. If you want to build a two-dimensional coordinate system, set it so that PCA finds 2 principal components.
pca = PCA(n_components=2)
Use fit_transform to fit the model with your DataFrame (X) and apply the dimensionality reduction.
pc = pca.fit_transform(X)
PCA builds an array of 640 pairs of coordinates, along the first and second principal components respectively.
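It is also worth checking how much of the original variance the two components actually retain, via the fitted model's explained_variance_ratio_ attribute. A sketch on synthetic data (not the O*NET matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] *= 10  # inflate the variance along one feature

pca = PCA(n_components=2)
pc = pca.fit_transform(X)

# Fraction of the total variance captured by each principal component,
# and the overall share kept by the 2D projection.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

A low sum is a warning that the 2D picture discards a lot of structure.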
2/ Visualize with Plotly
Once you’ve imported plotly…
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
…draw a scatter plot using the first and second element of each pair of coordinates as x and y respectively:
data = [go.Scatter(x=pc[:,0], y=pc[:,1], mode='markers')]
iplot(data)
My data is not labeled, so the only way to check if its representation is accurate is to hover over points and check if their neighbors “make sense”.
As you can see, the data is well spread out, but there seem to be a lot of “outliers”. Note that my judgement of whether the algorithm works well with this dataset is highly subjective, shaped by my own expectations for the end visualization.
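That hover check can also be scripted: query a point's nearest neighbors in the 2D projection and see whether they “make sense”. A sketch with scikit-learn, where random coordinates and generic labels stand in for the real pc array and job titles:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
pc = rng.normal(size=(640, 2))             # stand-in for the PCA coordinates
labels = [f"job_{i}" for i in range(640)]  # stand-in for job titles

# Index the 2D coordinates and fetch the 6 nearest neighbors of point 0
# (the first hit is always the point itself, at distance 0).
nn = NearestNeighbors(n_neighbors=6).fit(pc)
dist, idx = nn.kneighbors(pc[:1])
print([labels[i] for i in idx[0][1:]])  # the 5 closest "jobs"
```

With the real data, the printed titles should read like plausible career neighbors.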
For this project, I am highly interested in preserving the local structure of the data. I want clusters that were neighbors in the dataset to remain neighbors after the dimensionality reduction. But as mentioned above, that’s not PCA’s best asset.
b. Non-linear transformation
The non-linear dimensionality reduction (NLDR) approach assumes that in the high-dimensional space, the data has some underlying low-dimensional structure lying on an embedded non-linear manifold.
I’m not a mathematician, and topology is quite a complex field, so I’ll try to explain it metaphorically.
A manifold resembles Euclidean space near each point. A Euclidean space is a space where the very first theorems you learned in geometry, like Thales’s theorem, apply. Two-dimensional manifolds include surfaces (they resemble the Euclidean plane near each point).
NLDR assumes that, if the embedded manifold is two-dimensional (even though the data is 3 -or more- dimensional), then the data can be represented on a two-dimensional space too.
A good way to picture it is by looking at your duvet before you’ve made your bed. Imagine you had a rough night and your bed is messy. If you were to describe each point on the duvet, you would need three dimensions. But the duvet resembles the Euclidean plane near each point. So, it can be flattened into a plane (two-dimensional).
Non-linear algorithms are good at preserving the data’s local structure because they adapt to the underlying data, performing different transformations on different regions of the manifold.
- T-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a recent but popular non-linear visualization method based on local distance measurements, and it therefore preserves close neighbors well. However, it does not necessarily preserve global structure: it does not always capture the distances between clusters. In the context of my project, that could be an issue.
If you want to understand better how t-SNE works: this article explains how to read t-SNE’s results correctly.
1/ Create the embedding
Just like you did for PCA, import TSNE from the scikit-learn library, set n_components to 2, and create the embedding with fit_transform:
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
embedding = tsne.fit_transform(X)
2/ Have a look at the result:
It looks pretty good already. Let’s try one last method.
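One caveat before moving on: t-SNE's picture depends heavily on its perplexity parameter (roughly, the effective neighborhood size), so it is worth sweeping a few values rather than trusting a single run. A sketch on synthetic data:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))

# Perplexity must stay below the number of samples; the layout can
# change substantially from one value to the next.
for perplexity in (5, 15, 30):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    print(perplexity, embedding.shape)
```

If the clusters you care about only appear at one perplexity, treat them with suspicion.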
- Uniform Manifold Approximation and Projection (UMAP)
UMAP, a newcomer, tries to balance preserving local and global distances, which is exactly what I need.
UMAP is fairly easy to use, and there are a few parameters that you can adjust (the defaults are quite good):
- n_neighbors: constrains the size of the local neighborhood the algorithm will look at when attempting to learn the manifold structure of the data
- n_components: indicates the number of dimensions of the target space
- min_dist: sets the minimum distance between points in the low-dimensional representation
- metric: controls how distance is computed in the ambient space of the input data (feature space)
For more details, you can find the documentation here (link to UMAP).
1/ Create the embedding
Once you’ve imported UMAP, you can create the embedding with the UMAP and fit_transform functions to produce the final array of coordinates (full notebook here):
import umap
embedding = umap.UMAP(n_neighbors=15, n_components=2, min_dist=0.3, metric='correlation').fit_transform(X.values)
2/ Visualize
The distribution of points seems pretty logical to me. Clusters are positioned logically relative to one another:
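If you want to make that judgement less subjective, scikit-learn ships a trustworthiness score, which quantifies how well local neighborhoods survive the projection (1.0 means perfectly preserved). A sketch on synthetic data, with PCA standing in for whichever 2D embedding you are evaluating:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))

embedding = PCA(n_components=2).fit_transform(X)  # any 2D embedding works here

# Penalizes points that are close in 2D but were far apart in the
# original 20-dimensional space; result lies between 0 and 1.
score = trustworthiness(X, embedding, n_neighbors=5)
print(round(score, 3))
```

Comparing this score across PCA, t-SNE, and UMAP gives a numeric complement to eyeballing the plots.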
Wrap up
Thank you for reading to the end 😃
So far, we’ve built a scatter plot with a coordinate system that doesn’t really have any obvious meaning in itself but that uses distance as a proxy for similarity among data features. The visualization is quite basic but it lays out the premises of a map.
Next, we’ll see how to give it a format that resembles a map in both its aspect and its use (link to #part 2).
In the meantime, feel free to share your inspiring non-geographical maps and other dimensionality reduction methodologies.