Visualizations of Embeddings

There is more than one way to visualize high-dimensional data. Here, we go back in the history of AI to explore the evolution of these visualizations.

Douglas Blank, PhD
Towards Data Science

--

I submitted my first paper on AI in 1990 to a small, local conference — the “Midwest Artificial Intelligence and Cognitive Science Society.” In those days, the AI field was entirely defined by research into “symbols.” This approach was known as “Good, Old-Fashioned AI,” or GOFAI (pronounced “go fi” as in “wifi”). Those of us working in what is now known as “Deep Learning” really had to argue that what we were researching should even be considered AI.

Being excluded from AI was a double-edged sword. On the one hand, I didn’t agree with most of the basic tenets of what was defined as AI at the time. The basic assumption was that “symbols” and “symbol processing” must be the foundation of all AI. So, I was happy to be working in an area that wasn’t even considered to be AI. On the other hand, it was difficult to find people willing to listen to your ideas if you didn’t package them as at least related to AI.

This little conference accepted papers on “AI” and “Cognitive Science” — which I saw as an invitation for ideas beyond just “symbolic processing.” So I submitted my first paper, and it was accepted! The paper featured a neural network approach to handling natural language. Many of us in this area called this type of neural network research “connectionism,” but nowadays, as mentioned, it would be labeled “Deep Learning” (DL) — although my initial research wasn’t very deep… only three layers! Modern DL systems can be composed of hundreds of layers.

My paper was accepted at the conference, and I presented it in Carbondale, Illinois in 1990. Later, the organizer of the conference, John Dinsmore, invited me to submit a version of the paper for a book that he was putting together. I didn’t think I could get a paper together by myself, so I asked two of my graduate school friends (Lisa Meeden and Jim Marshall) to join me. They did, and we ended up with a chapter in the book. The book was titled “The Symbolic and Connectionist Paradigms: Closing the Gap.” Our paper fit in nicely with the theme of the book. We titled our paper “Exploring the symbolic/subsymbolic continuum: A case study of RAAM.” To my delight, the book focused on this split between these two approaches to AI. I think the field is still wrestling with this divide to this day.

I’ll say more about that initial research of mine later. For now I want to talk about how the field dealt with visualizing “embeddings.” First, we didn’t call these vectors “embeddings” at the time. Most research used a phrase such as “hidden-layer representations.” That included any internal representation that a connectionist system had learned in order to solve a problem. As we defined them back then, there were three types of layers: “input” (where you plugged in the dataset), “output” (where you put the desired outputs, or “targets”), and everything else — the “hidden” layers. The hidden layers are where the activations of the network flow between the input and the output. The hidden-layer activations are often high-dimensional, and are the representations of the “concepts” learned by the network.
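A minimal sketch of such a three-layer network in Python may make the terminology concrete. The random weights here are just stand-ins for the weights a trained network would have learned; the point is where the “hidden-layer representation” comes from:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny three-layer network: 10 inputs -> 4 hidden units -> 3 outputs.
# Random weights stand in for weights a real network would have learned.
W_hidden = rng.normal(size=(10, 4))
W_output = rng.normal(size=(4, 3))

x = rng.normal(size=10)          # one input pattern from the dataset
hidden = sigmoid(x @ W_hidden)   # the "hidden-layer representation"
output = sigmoid(hidden @ W_output)

print(hidden.shape)  # (4,) -- a 4-dimensional vector for this input
```

Collect the `hidden` vector for every input pattern and you have exactly the kind of high-dimensional data the visualizations below are trying to make sense of.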

Then, as today, visualizing these high-dimensional vectors was seen as a way to gain insight into how these systems work, and oftentimes fail. In our chapter in the book, we used three types of visualizations:

  1. So-called “Hinton Diagrams”
  2. Cluster Diagrams, or Dendrograms
  3. Projection into 2D space

The first method was a newly created idea used by Hinton and Shallice in 1991. (That is the same Geoffrey Hinton that we know today. More on him in a future article.) The diagram is a simple idea with limited utility. The basic idea is that activations, weights, or any type of numeric data can be represented by boxes: white boxes (typically representing positive numbers) and black boxes (typically representing negative numbers). In addition, the size of the box represents a value’s magnitude relative to the maximum and minimum values among the simulated neurons.
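A Hinton diagram is easy to sketch with matplotlib (this follows the idea behind matplotlib’s own Hinton-diagram demo; the activation matrix here is random, purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

def hinton(matrix, ax=None):
    """Hinton diagram: box color gives the sign, box area gives the magnitude."""
    ax = ax or plt.gca()
    ax.set_facecolor("gray")
    max_mag = np.abs(matrix).max()
    for (row, col), value in np.ndenumerate(matrix):
        color = "white" if value > 0 else "black"
        size = np.sqrt(abs(value) / max_mag)  # area proportional to |value|
        ax.add_patch(plt.Rectangle((col - size / 2, row - size / 2),
                                   size, size, facecolor=color))
    ax.set_xlim(-1, matrix.shape[1])
    ax.set_ylim(-1, matrix.shape[0])
    ax.invert_yaxis()
    ax.set_aspect("equal")
    return ax

# e.g., 5 word "embeddings", each with 8 hidden-unit activations
activations = np.random.uniform(-1, 1, size=(5, 8))
ax = hinton(activations)
plt.show()
```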

Here is the representation from our paper showing the average “embeddings” at the hidden layer of the network as words were presented to it:

Figure 10 from our paper showing activation values of each embedding.
Figure 10 from our paper.

Hinton diagrams do help to visualize patterns in the data. But they don’t really help in understanding the relationships between the representations, nor do they help when the number of dimensions gets much larger. Modern embeddings can have many thousands of dimensions.

To help with those issues, we turn to the second method: cluster diagrams, or dendrograms. These are diagrams that show the distance (however defined) between any two patterns as a hierarchical tree. Here is an example from our paper using Euclidean distance:

Cluster diagram or dendrogram
Figure 9 from our paper.

This is the same kind of information shown in the Hinton Diagram, but in a much more useful format. Here we can see the internal relationships between individual patterns, and overall patterns. Note that the vertical ordering is irrelevant: the horizontal position of the branch points is the meaningful aspect of the diagram.

In the above dendrogram, we constructed the overall image by hand, given the cluster tree computed by a program. Today, there are methods for constructing such a tree and its image automatically. However, the diagram becomes hard to read when the number of patterns grows much beyond a few dozen. Here is an example made with matplotlib today. You can read more about the API here: matplotlib dendrogram.

Cluster diagram (or dendrogram) of a large number of patterns
Modern dendrogram with a large number of patterns. Image made by the author.
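A figure like this can be generated automatically with SciPy’s hierarchical-clustering tools, plotted via matplotlib. A minimal sketch, using random stand-in embeddings:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(12, 16))             # 12 patterns, 16 hidden units
labels = [f"pattern {i}" for i in range(12)]

# Agglomerative clustering on Euclidean distances, then plot the tree
Z = linkage(embeddings, method="average", metric="euclidean")
dendrogram(Z, labels=labels, orientation="right")  # horizontal, like the figure
plt.xlabel("Euclidean distance")
plt.tight_layout()
plt.show()
```

The `linkage` matrix `Z` encodes the full merge history, so the same tree can be re-plotted, truncated, or cut into flat clusters without recomputing distances.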

Finally, we come to the last method, and the one that is used predominantly today: the Projection method. This method uses an algorithm to reduce the number of dimensions of the embedding to a number that can more easily be understood by humans (e.g., 2 or 3 dimensions), and plots the result as a scatter plot.

At the time in 1990, the main method for projecting high-dimensional data into a smaller set of dimensions was Principal Component Analysis (or PCA for short). Dimensional reduction is an active research area, with new methods still being developed.

Perhaps the most-used algorithms of dimension reduction today are:

  1. PCA
  2. t-SNE
  3. UMAP

Which is the best? It really depends on the details of the data, and on your goals for creating the reduction in dimensions.

PCA is probably the best method overall, as it is deterministic and allows you to create a mapping from the high-dimensional space to the reduced space. That is useful for training on one dataset, and then examining where a test dataset is projected into the learned space. However, PCA is sensitive to unscaled data, and can produce a “ball of points” that gives little insight into structural patterns.
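Both properties are visible in scikit-learn’s API: the mapping fitted on one dataset can be reused on another, and standardizing first avoids the unscaled-data problem. A sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_train, X_test = train_test_split(X, random_state=0)

# Scale first: PCA is dominated by features with large raw variances
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# The fitted mapping can be reused to project unseen data
train_2d = pca.transform(scaler.transform(X_train))
test_2d = pca.transform(scaler.transform(X_test))
print(train_2d.shape, test_2d.shape)
```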

t-SNE, which stands for t-distributed Stochastic Neighbor Embedding, builds on Stochastic Neighbor Embedding, created by Hinton (yes, that Hinton) and Roweis in 2002; the t-distributed variant was introduced by van der Maaten and Hinton in 2008. This is a learned projection, and it can handle unscaled data. However, one downside to t-SNE is that it does not create a reusable mapping; it only learns a layout for the data it is given. That is, unlike other algorithms that have Projection.fit() and Projection.transform() methods, t-SNE can only perform a fit. (There are some implementations, such as openTSNE, that provide a transform mapping. However, openTSNE behaves very differently from other implementations, is slow, and is less well supported.)
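The fit-only limitation shows up directly in scikit-learn’s `TSNE`, which exposes `fit_transform` but no `transform` for out-of-sample points (the random input data here is just a stand-in):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # 100 high-dimensional "embeddings"

# t-SNE optimizes a 2D layout for exactly these points; there is no
# reusable mapping, so only fit_transform() is available
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_2d = tsne.fit_transform(X)

print(X_2d.shape)                   # (100, 2)
print(hasattr(tsne, "transform"))   # False -- no out-of-sample projection
```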

Finally, there is UMAP, Uniform Manifold Approximation and Projection. This method was created in 2018 by McInnes and Healy. It may be the best compromise for many high-dimensional spaces, as it is fairly computationally inexpensive and yet is capable of preserving important representational structures in the reduced dimensions.

Here is an example of the dimension reduction algorithms applied to the unscaled Breast Cancer data available in sklearn:

Comparison between three projection methods
Example dimensional reductions between three projection methods, PCA, t-SNE, and UMAP. Image made by the author.

You can test out the dimension reduction algorithms yourself to find the best one for your use case, and create images like the above, using Kangas DataGrid.

As mentioned, dimensional reduction is still an active research area. I fully expect to see continued improvements, including in visualizing the flow of information as it moves through a Deep Learning network. Here is a final example from our book chapter showing how activations flow in the representational space of our model:

Hidden layer activations over single steps in the decoding section of the neural network.
Figure 7 from our paper. Hidden layer activations over single steps in the decoding section of the neural network.

Interested in where ideas in Artificial Intelligence, Machine Learning, and Data Science come from? Consider a clap and a subscribe. Let me know what you are interested in!


Professor Emeritus of Computer Science at Bryn Mawr College, Head of Research at Comet.com. Researcher in AI, ML, and Robotics.