Towards LLM explainability: why did my model produce this output?

The recent release of larger, better LLMs that showcase new capabilities has been paired with growing concerns over AI safety

The release of larger, better Large Language Models over these last few months, showcasing new capabilities, has been paired with growing concerns over AI safety. LLM explainability research tries to expand our understanding of how these models work.

Large Language Models (LLMs) saw a lot of development this past year, such as the recent releases of GPT-4 and Claude 2. These models display new abilities compared with their previous versions, but most of these abilities are discovered through post-hoc analysis rather than being part of a purposeful training plan. They are a consequence of scaling the models in terms of number of parameters, training data, and compute resources.

On a conceptual level, I like the analogy between LLMs and compression algorithms. Terabytes of internet data go in and, many FLOPs later, we get a file of a few hundred GB containing the parameters of an LLM. The model is unable to precisely retrieve the initial knowledge, but it still produces a pertinent output most of the time.

Image by the author and DALL-E 3 (inspired by Karpathy's llmintro)

The mystery of LLMs does not reside in the technical architecture or the complexity of their computations. If the architecture of a model is fully documented, we can easily follow the mathematical operations being performed. But we still cannot entirely explain how a precise set of parameters collaborates to produce an output that makes sense. How is the knowledge from the initial training data actually retrieved? Where and how is it stored inside the network?

LLM explainability is an active area of research, and many interesting results have been published in the last year. I don't pretend to be exhaustive in what follows; my purpose is to draw attention to some of the current research directions and some promising results.

To simplify things, I would distinguish between 4 main directions:

  1. Explain the produced output based on the input (feature attributions)
  2. Explain the produced output based on the training data
  3. Explain the role of individual neurons in embedding features
  4. Extract explainable features from poly-semantic neurons

I will provide some examples from each category and links to the full papers for each example.

1. Explain the produced output based on the input

The methods in this category rely on the computation of a measure of feature importance (or attribution) for each token in the input. Several families of measures exist, mostly derived from existing interpretability methods in machine learning: gradient-based, attention-based, perturbation-based (occlusion, LIME), etc.

You can test some of these importance measures yourself using the Inseq Python package. They provide support for the models in the Transformers library and you can display your first results with just a few lines of code:

!pip install inseq
import inseq

# list available attribution methods
inseq.list_feature_attribution_methods()

# load a model from HuggingFace model hub and define the feature attribution 
# method you want to use
mdl_gpt2 = inseq.load_model("gpt2", "integrated_gradients")

# compute the attributions for a given prompt
attr = mdl_gpt2.attribute(
    "Hello ladies and",
    generation_args={"max_new_tokens": 9},
    n_steps=500,
    internal_batch_size=50,
)

# display the generated attributions
attr.show()

Feature attributions rely on the computation of a matrix Aij representing the importance of every token i in the input for every generated token j in the output. Previously generated tokens influence the following predictions, so they must be dynamically incorporated into the computation. From a computational point of view, these methods are still very accessible and can run in a notebook.

The output obtained for the example given in the code snippet is shown below. From the values in the first column, we can see that the presence of the token "ladies" in the input was the most influential in the generation of the token "gentlemen" in the output.

Image generated by the author
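
To build some intuition for what a gradient-based method like integrated gradients actually computes, here is a minimal, self-contained sketch on a toy differentiable score. The function f and the "embeddings" are made up for the illustration; inseq performs an analogous computation on the real model, once per generated token.

# A minimal sketch of integrated gradients on a toy differentiable score.
# The function f and the input values are invented for the example.
import torch

def f(x):
    # stand-in for "logit of the generated token given input embeddings x"
    return (x ** 2).sum()

x = torch.tensor([0.5, -1.0, 2.0])   # "embeddings" of the input tokens
baseline = torch.zeros_like(x)       # all-zero baseline, a common choice
n_steps = 500

grads = []
for k in range(1, n_steps + 1):
    # interpolate between the baseline and the actual input
    point = baseline + (k / n_steps) * (x - baseline)
    point.requires_grad_(True)
    f(point).backward()
    grads.append(point.grad)

avg_grad = torch.stack(grads).mean(dim=0)
attributions = (x - baseline) * avg_grad   # one importance score per input token
print(attributions)                        # approximately x**2 for this toy f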

Another recent approach to obtaining feature attributions is to ask the model itself to provide this information using prompt engineering. Researchers at UC Santa Cruz asked ChatGPT to classify movie reviews as either positive or negative and to also provide a feature importance measure for each token in the review. They used the following prompt to get a nicely structured output:

'''
You are a creative and intelligent movie review analyst, whose purpose is 
to aid in sentiment analysis of movie reviews. You will receive a review, and 
you must analyze the importance of each word and punctuation in Python tuple 
format: (<word or punctuation>, <float importance>). Each word or punctuation 
is separated by a space. The importance should be a decimal number to three 
decimal places ranging from -1 to 1, with -1 implying a negative sentiment and 
1 implying a positive sentiment. Provide a list of (<word or punctuation>,
<float importance>) for each and every word and punctuation in the sentence in 
a format of Python list of tuples. Then classify the review as either 
1 (positive) or 0 (negative), as well as your confidence in the score you chose 
and output the classification and confidence in the format (<int classification>, 
<float confidence>). The confidence should be a decimal number between 0 and 1, 
with 0 being the lowest confidence and 1 being the highest confidence.

It does not matter whether or not the sentence makes sense. Do your best given 
the sentence. The movie review will be encapsulated within <review> tags. 
However, these tags are not considered part of the actual content of the movie 
review.

Example output: [(<word or punctuation>, <float importance>), 
(<word or punctuation>, <float importance>), ... ]
(<int classification>, <float confidence>)
'''

ChatGPT replied using the requested format. When comparing the importance values provided directly by the model with those produced by more traditional explanation methods (occlusion, LIME saliency maps), their analysis showed that the self-explanations perform on par with the traditional ones. This seems promising since these explanations are computationally much cheaper to produce, but they still need more research before they can be fully trusted.
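
If you want to try this kind of self-explanation yourself, a hedged sketch of the setup could look like the following. The model name, temperature, and parsing step are my own assumptions, not necessarily what the authors used; the instruction is the prompt quoted above.

# A hedged sketch of the "self-explanation" setup: send the instruction above
# as a system prompt, the review as a user message, and parse the reply.
# Model name, temperature and the parsing are assumptions for illustration.
import ast
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

system_prompt = "..."  # the full instruction block quoted above
review = "<review> The plot was predictable , but the acting saved it . </review>"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": review},
    ],
)

reply = response.choices[0].message.content
print(reply)

# If the model followed the format, the last line is (classification, confidence)
# and everything before it is the list of (token, importance) tuples.
lines = [l for l in reply.splitlines() if l.strip()]
word_importances = ast.literal_eval(" ".join(lines[:-1]))
classification, confidence = ast.literal_eval(lines[-1])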

2. Explain the produced output based on the training data

A recent research paper from Anthropic describes a computationally efficient way to use influence functions to study LLM generalization. For a given prompt, they are able to identify which sequences in the training data contribute the most to generating the output.

Their analysis shows that the larger the model, the more capable it is of concept generalization, and thus the less likely it is to simply repeat sequences of tokens from the training data (they observe this repetition behavior in the smaller models they use for comparison).

Image source: 2308.03296 (arxiv.org)

In this example, they show that for large enough models the most influential sequences are conceptually related to the given prompt, but the contribution of each individual sequence is small and many training sequences contribute simultaneously to producing the output. The lists of influential sequences can show considerable diversity depending on the prompt.
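
To make the underlying idea concrete, here is a toy sketch of classical influence functions (in the Koh & Liang sense) on a tiny ridge-regression model, where the Hessian can be computed exactly. Anthropic's contribution is precisely about approximating this computation at LLM scale with EK-FAC; nothing below reflects their actual implementation.

# Classical influence functions on a tiny linear model, for intuition only.
import torch

torch.manual_seed(0)
n, lam = 20, 1e-3
X = torch.randn(n, 3)                    # 20 tiny "training sequences", 3 features
y = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(n)

# fit a ridge-regularized linear model in closed form
w = torch.linalg.solve(X.T @ X / n + lam * torch.eye(3), X.T @ y / n)
w.requires_grad_(True)

def sq_loss(xb, yb, w_):
    return ((xb @ w_ - yb) ** 2).mean()

def objective(w_):
    return sq_loss(X, y, w_) + lam * (w_ ** 2).sum()

# Hessian of the training objective at the optimum (tiny, so we compute it exactly)
H = torch.autograd.functional.hessian(objective, w.detach())

# influence of training example m on a test point:
#   I(z_m, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z_m)
x_test, y_test = torch.randn(3), torch.tensor(1.0)
g_test = torch.autograd.grad(sq_loss(x_test, y_test, w), w)[0]

influences = []
for m in range(n):
    g_m = torch.autograd.grad(sq_loss(X[m], y[m], w), w)[0]
    influences.append(-(g_test @ torch.linalg.solve(H, g_m)).item())

top = sorted(range(n), key=lambda m: -abs(influences[m]))[:3]
print("most influential training examples:", top)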

3. Explain the role of individual neurons in embedding features

OpenAI tried using an LLM to explain the activation patterns seen in the neurons of a smaller LLM. For each neuron in the GPT-2 XL model, they used the following steps:

  1. Collect the output produced by the neuron’s activation function in response to a given set of text sequences
  2. Show the text sequences along with the neuron’s responses to GPT-4 and ask it to generate an explanation for the observed behavior
  3. Ask GPT-4 to simulate the activations of a neuron corresponding to the generated explanation
  4. Compare the simulated activations with the ones produced by the original GPT-2 XL neuron

They compute a score based on the comparison between the simulated activations and the actual neuron behavior. They find confident explanations (corresponding to a score of at least 0.8) for approximately 1,000 of the 307,200 neurons in GPT-2 XL. However, the average score computed across all the neurons only falls somewhere around 0.1. You can explore some of their findings using the Neuron Viewer, and you can contribute by proposing better explanations if you feel inspired.
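
The scoring in step 4 is essentially a measure of agreement between the simulated and the real activations. A minimal sketch of that comparison could look like this; the activation values below are made up, and OpenAI's actual scoring procedure is more involved than a plain correlation.

# Compare simulated activations with the real ones for a single neuron.
# The numbers are invented for illustration.
import numpy as np

real = np.array([0.0, 0.1, 2.3, 0.0, 1.8, 0.0, 0.0, 2.1])        # GPT-2 XL neuron
simulated = np.array([0.1, 0.0, 2.0, 0.2, 1.5, 0.0, 0.1, 2.4])   # GPT-4 simulation

score = np.corrcoef(real, simulated)[0, 1]
print(f"explanation score ~ {score:.2f}")   # close to 1.0 means a good explanation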

The low overall score can be attributed to the fact that most neurons exhibit complex behavior that is hard to describe with the short natural-language explanations GPT-4 was instructed to produce in the experiment. Most neurons seem to be highly poly-semantic, or could even represent concepts that humans don't have words for. The approach they propose is interesting but very computationally intensive: it relies on having readily available an LLM much larger than the one you are trying to explain, and it still does not bring us any closer to understanding the underlying mechanism that produces the observed behavior.

4. Extract explainable features from poly-semantic neurons

As seen in the example above and in previous research on vision models, while mono-semantic neurons can sometimes be identified, most neurons in an LLM tend to be poly-semantic, meaning that they represent several different concepts or features at the same time. This phenomenon is called superposition, and it has been studied and reproduced in toy models by the researchers at Anthropic.

They trained small neural networks on synthetic data composed of 5 features of varying importance to investigate how and what gets represented when models have more features than dimensions. With dense features, the model learns to represent an orthogonal basis of the two most important features (similar to Principal Component Analysis), and the other three features are not represented. But as the sparsity of the features increases, more and more features get represented, at the cost of some interference:

Image reproduced by the author using https://colab.research.google.com/github/anthropics/toy-models-of-superposition/blob/main/toy_models.ipynb
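
For readers who prefer code to prose, here is a compact re-implementation of the setup described above: 5 features of decreasing importance squeezed into 2 hidden dimensions and reconstructed through a ReLU readout. The hyperparameters are my own choices, not those of the Anthropic notebook.

# Toy model of superposition: 5 sparse, importance-weighted features in 2 dims.
import torch

n_features, n_hidden = 5, 2
importance = torch.tensor([0.9 ** i for i in range(n_features)])  # feature 0 matters most
sparsity = 0.9                                                    # probability a feature is zero

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # synthetic batch: uniform feature values, most of them zeroed out
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()
    h = x @ W.T                                  # project down to 2 dims
    x_hat = torch.relu(h @ W + b)                # reconstruct the 5 features
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# feature columns of W with non-negligible norm are the ones the model represents
print(W.detach().norm(dim=0))

Varying the sparsity value between 0.0 and 0.9 and re-running lets you check which feature columns of W end up represented, mirroring the dense versus sparse regimes in the figure above.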

This mechanism could also be at play in LLMs, since more concepts are present in the training data than there are neurons. Moreover, in the natural world many features seem to be sparse (they only rarely occur), and they are not all equally useful to a given task. Under this hypothesis, our current LLMs can be interpreted as the projection onto a smaller space of a much larger LLM in which each neuron is entirely mono-semantic.

Based on this insight, the researchers at Anthropic devised a method to extract mono-semantic features from poly-semantic neurons using sparse autoencoders. They demonstrate their approach on a one-layer transformer with a 512-neuron MLP (Multi-Layer Perceptron) layer. Using their method, the 512 MLP activations are decomposed into 4096 relatively interpretable features.
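
A hedged sketch of the sparse autoencoder idea follows: the 512 MLP activations are encoded into 4096 non-negative features under an L1 sparsity penalty, then decoded back. The layer sizes match the paper; everything else (loss weights, optimizer, the random stand-in for the collected activations) is a generic recipe rather than Anthropic's exact training setup.

# Generic sparse autoencoder over MLP activations; sizes follow the paper,
# the rest is a simplified recipe for illustration.
import torch
import torch.nn as nn

d_mlp, d_features, l1_coeff = 512, 4096, 1e-3

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, d_features)
        self.decoder = nn.Linear(d_features, d_mlp)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))   # sparse, non-negative
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# mlp_activations would normally be collected by running the transformer over a
# large corpus; a random batch stands in for it here.
mlp_activations = torch.randn(4096, d_mlp)

for step in range(100):
    recon, feats = sae(mlp_activations)
    loss = ((recon - mlp_activations) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()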

A web interface allows you to browse through the extracted features and judge for yourself. The feature descriptions are generated post-analysis by Claude. For instance, feature #9 encodes for the Romanian language, with the top neurons activated by this feature corresponding to:

  • #257: fires on content words (nouns, verbs, adjectives) from Romance languages (French, Spanish, Italian)
  • #269: fires on punctuation, particularly question marks, periods, hyphens, slashes, and parentheses
  • #86: fires on words related to chemistry/medical contexts involving liquids

The extracted features are generally more interpretable than the neurons themselves. This is a very promising result, even if it is not yet clear whether the approach can scale to larger models.


Conclusion

I hope this article provided examples of some recent research directions in LLM explainability. Understanding exactly how LLMs work would allow us to fully trust their output and integrate them into more applications than we do today. Being able to easily check for the absence of bias would allow LLMs back into domains such as recruiting. A better understanding of their abilities and their limits would allow us to scale them more efficiently, instead of just making them bigger and hoping that it is enough. If you know of promising methods that I have overlooked, please feel free to share them in the comments; I'd be happy to continue the exploration.

To keep reading about LLMs, also check out this post about LLM jail-breaking and security:

LLM Safety Training and Jail-Breaking

