
Open the Artificial Brain: Sparse Autoencoders for LLM Inspection

A deep dive into LLM visualization and interpretation using sparse autoencoders

|LLM|INTERPRETABILITY|SPARSE AUTOENCODERS|XAI|

Image created by the author using DALL-E

All things are subject to interpretation; whichever interpretation prevails at a given time is a function of power and not truth. – Friedrich Nietzsche

As AI systems grow in scale, understanding their mechanisms becomes both more difficult and more pressing. Today, there are discussions about the reasoning capabilities of models, potential biases, hallucinations, and other risks and limitations of Large Language Models (LLMs).


Most evaluations are conducted by analyzing model performance on various benchmarks. The major limitation of these approaches is that they treat the LLM as a black box. Answering most of our questions requires opening this box and observing how its components work with each other. The first problem lies in the difficulty of analyzing a model composed of hundreds of layers and billions of parameters. A second problem is the lack of a definition of what the fundamental unit of such a complex model is. Defining this fundamental unit, and understanding how to intervene on such units, could allow us to correct unintended behaviors.

So in this article, we will address these questions:

  • What are the fundamental components of an LLM?
  • How can we analyze these internal features? What tools?
  • How can we evaluate these tools?
  • What do these tools learn? Can we visualize the internal space?

Feature representations in neural networks

Defining features in neural networks is a challenging task. Traditionally, in machine learning, features are described as attributes derived directly from the dataset. This definition fits well for perceptual systems, where features closely map to the input data. In LLMs, or other complex systems capable of abstraction, features may instead emerge inside the model [1]. The description of these features is still not entirely clear, but for some authors it can be summarized as: "Features are the fundamental units of neural network representations that cannot be further decomposed into simpler independent factors" [2]. The problem with this definition is: what are these fundamental units?

In this context, a fundamental unit (or feature) can be thought of as something that encodes a concept (which may be concrete, such as "sun," or abstract, such as "beauty"). These concepts would then be the building blocks of the internal representation learned by the model.

What is the nature of these features?

According to this article by Anthropic [3], neural networks represent meaningful concepts, and they do so through directions in activation space. In simple words, the output of a layer of a neural network can be seen as a set of points in an activation space. This is clearly difficult to visualize, because we are talking about hundreds if not thousands of directions. In word embeddings, it had already been observed that directions carry meaning and that vectors can be used for arithmetic operations (the classic king − man + woman ≈ queen example) [4].

image source: [4]

So, in theory, each direction corresponds to a concept (and the further a point extends along that direction, the more strongly that concept should be present in the input). The open question is the relationship between these concepts and the neurons of a layer:

  • Privileged versus non-privileged basis. If a neuron is meaningful (i.e., it represents a meaningful concept), its basis vector should be functionally different from the other directions in the representation.
  • Monosemantic and polysemantic neurons. A neuron that corresponds to only one semantic concept is called monosemantic: only one concept in the input activates that neuron, and by activating or ablating it we affect only one feature. A polysemantic neuron is associated with multiple concepts (e.g., a neuron might be activated by images of cats but also of houses) [6].
image source: [2]

In transformers and LLMs, neurons are polysemantic, making it difficult to understand how the network processes information and how to intervene on representation features [7]. However, polysemanticity has the advantage that fewer neurons can represent more concepts. According to the superposition hypothesis, the neural network leverages its high-dimensional space to represent more features than it has neurons. In this way, features are no longer orthogonal and thus interfere with each other, but this interference appears to be mitigated by nonlinear functions [3, 5]. The superposition hypothesis suggests that a polysemantic model can be seen as a compressed version of a hypothetically larger neural network in which each neuron represents a single concept [2].

A polysemantic model can be viewed as a compressed simulation of a larger, sparser network. image source: [2]

Features in superposition are difficult to interpret: each is spread across several neurons, and altering one feature also affects the others. So we need a method to disentangle these features.
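To make the superposition intuition a bit more concrete, here is a small toy sketch (my own illustration, not the actual toy model from [3]): in a space with far fewer neurons than features, random feature directions are nearly orthogonal, so a sparse set of active features can be superposed and read back with only modest interference.

```python
# Toy illustration of superposition: pack many more nearly-orthogonal "feature"
# directions than neurons into one activation space and check how much they
# interfere when only a few features are active at once.
import numpy as np

rng = np.random.default_rng(0)
d_neurons, n_features = 256, 1024
W = rng.normal(size=(n_features, d_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)          # one unit direction per feature

cos = W @ W.T                                           # pairwise cosine similarities
np.fill_diagonal(cos, 0.0)
print("max |cos| between distinct features:", round(float(np.abs(cos).max()), 3))

# Superpose a few active features into a single activation vector, then try to
# read each feature back with a dot product against its direction.
active = rng.choice(n_features, size=3, replace=False)
x = W[active].sum(axis=0)                               # the superposed activation
readout = W @ x
print("readout of the active features:", np.round(readout[active], 2))
print("largest readout among inactive ones:", round(float(np.abs(np.delete(readout, active)).max()), 2))
```

In this toy setting the active features read back close to 1 while inactive ones stay noticeably smaller, which is the sense in which a small layer can "simulate" a larger, sparser one.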


Sparse Autoencoders for LLM Interpretability

Sparse autoencoders (SAEs) have been increasingly used in recent years as a way to decompose a neural network into comprehensible components. SAEs are similar to classical autoencoders (AEs), with the difference that classical AEs are designed to compress and then reconstruct the data. For example, if our data has 100 dimensions, a classical AE will have an encoder layer of 25 neurons that learns a compressed representation: a vector of size 25 for each example (a 4-fold reduction). This compressed version obviously loses information, but it is useful for reducing the dimensionality of our input.

An SAE, on the other hand, has a hidden layer that is larger than the input. In addition, we use a penalty during training to incentivize sparsity (the internal vector will be sparse, i.e., most of its values will be zero). So if the input has a dimensionality of 100, we learn a vector of at least 200 dimensions, a good portion of which will be zero. The goal is to apply SAEs to the intermediate activations of a neural network: in an LLM, for each token at each layer we have a set of activations, and we train an SAE on this representation [8]. So if one layer has 100 activations and the SAE's hidden layer has 200 neurons, we have an expansion factor of 2. This process has to be repeated for each layer of the neural network we want to study. How do we train such an SAE?
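Before getting to training, here is a minimal sketch of the architecture just described, written in PyTorch (illustrative only; real implementations such as SAELens [9] add details like bias handling and decoder-weight normalization):

```python
# Minimal sparse autoencoder sketch: the hidden layer is LARGER than the input.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, expansion: int = 2):
        super().__init__()
        d_hidden = expansion * d_model                  # e.g. 100 activations -> 200 features
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        # activations: (batch, d_model) activations of one LLM layer
        features = torch.relu(self.encoder(activations))    # sparse, non-negative codes
        reconstruction = self.decoder(features)
        return reconstruction, features

sae = SparseAutoencoder(d_model=100, expansion=2)
x = torch.randn(8, 100)                                 # a batch of fake activations
x_hat, f = sae(x)
print(x_hat.shape, f.shape)                             # (8, 100) and (8, 200)
```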

image source: [2]

The training data comes from a broad range of text fed to the model we want to study: for each batch, we extract the activations and use them to train our SAE. The loss function is the standard autoencoder one, based on input reconstruction [9]. The purpose of this approach is to decompose neural network activations into disentangled component features. By enforcing sparsity in our SAE (through an L1 penalty), we are trying to learn a dictionary of monosemantic neurons, each corresponding to a feature. In simple words, the idea is to have a single neuron encode a single feature and to represent each LLM activation as a linear combination of a few dictionary vectors.
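Continuing the sketch above, one training step might look like this (reconstruction loss plus an L1 penalty on the feature activations; the penalty coefficient is an arbitrary placeholder):

```python
# One training step: rebuild the input well while keeping most features at zero.
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3                                         # arbitrary placeholder value

def training_step(batch_activations: torch.Tensor) -> float:
    x_hat, f = sae(batch_activations)
    recon_loss = torch.mean((x_hat - batch_activations) ** 2)   # reconstruction term
    sparsity_loss = f.abs().mean()                               # L1 penalty -> sparsity
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# In practice, batch_activations would be cached activations obtained by running
# the LLM on a large text corpus, with one SAE per layer of interest.
print(training_step(torch.randn(8, 100)))
```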

image source: [8]

One clarification: the SAE is not explicitly optimized for interpretability during training. Instead, interpretable features emerge as a side effect of the sparsity and reconstruction objectives.

How do we know what a feature represents in an SAE?

Well, we look at the inputs that maximally activate the feature and manually try to figure out what they have in common. In this work, Anthropic trained an SAE on Claude Sonnet and found a feature that activated on images and text related to the Golden Gate Bridge [10, 11]. Other features may be activated by rhetorical figures, grammatical concepts (relative clauses, prepositional phrases, and so on), or concepts that are more abstract still.

an example of an activated feature from GPT-2. screenshot from: [12], license: here
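A rough sketch of this procedure using GPT-2 via the Hugging Face transformers library (the SAE re-instantiates the SparseAutoencoder sketch above with GPT-2's hidden size and is untrained here, and the feature index is hypothetical; with a trained SAE the mechanics are the same):

```python
# Sketch: collect the tokens that maximally activate one SAE feature.
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

layer = 6                                               # which hidden state to inspect
feature_id = 1234                                       # hypothetical feature index
sae = SparseAutoencoder(d_model=768, expansion=8)       # untrained here, for illustration only

texts = ["The Golden Gate Bridge spans the San Francisco Bay.",
         "She walked across the old stone bridge at dawn."]
records = []

with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer][0]  # (seq, 768)
        _, feats = sae(hidden)                                                       # (seq, n_features)
        acts = feats[:, feature_id]
        top = int(acts.argmax())
        token = tokenizer.decode([int(inputs.input_ids[0][top])])
        records.append((acts[top].item(), token, text))

# Reading the highest-activating contexts is how we guess what the feature "means".
for act, token, text in sorted(records, reverse=True):
    print(f"{act:.3f}  token={token!r}  in: {text}")
```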

These features have a causal impact on the model: activating or blocking them changes the behavior of the LLM. For example, Anthropic shows that clamping the Golden Gate Bridge feature to 10x its maximum activation value induces a change in behavior [10, 11]. Asked "What is your physical form?", the model's response changes from, before clamping, "I don’t actually have a physical form. I am an Artificial Intelligence. I exist as software without a physical body or avatar" to, after clamping, "I am the Golden Gate Bridge, a famous suspension bridge spanning the San Francisco Bay. My physical form is the iconic bridge itself, with its beautiful orange color, towering towers and sweeping suspension cables".

Thus SAEs not only allow features to be identified but also let us map them back onto activations, enabling causal interventions. In this paper [17], Anthropic exploits this idea to modify certain features implicated in social bias and observes how the model changes its behavior. Within a certain range, feature steering can steer an LLM without hurting model performance (beyond that point, though, other capabilities degrade).
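Here is a sketch of what such an intervention could look like with the hypothetical GPT-2 setup from the previous sketch: a forward hook replaces one block's output with the SAE reconstruction after clamping the chosen feature to a large value (the actual interventions on Claude are, of course, more involved).

```python
# Sketch of feature steering: clamp one SAE feature during the forward pass and
# write the reconstruction back into the residual stream via a forward hook.
clamp_value = 10.0 * 4.2             # hypothetical: 10x the feature's observed maximum

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, d_model)
    feats = torch.relu(sae.encoder(hidden))           # encode into SAE features
    feats[..., feature_id] = clamp_value              # force the chosen feature on
    steered = sae.decoder(feats)                      # map back to activation space
    if isinstance(output, tuple):
        return (steered,) + output[1:]                # replace the block's output
    return steered

# hidden_states[layer] is produced by transformer block `layer - 1` in GPT-2.
handle = model.h[layer - 1].register_forward_hook(steering_hook)
with torch.no_grad():
    _ = model(**tokenizer("What is your physical form?", return_tensors="pt"))
handle.remove()                                       # always remove the hook afterwards
```

The steered activations only affect the layers downstream of the hooked block, which is why clamping a single feature can still change the model's overall behavior.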

One note: SAEs are not only used for LLMs; they can also be applied to other models, such as convolutional networks [14].

image source: [14]


How to evaluate SAEs

The main problem with SAEs remains their evaluation. Indeed, we have no ground truth in natural language to evaluate the quality of learned features. The evaluation of these features is subjective, and it is up to the researcher to interpret the meaning of each feature.

Explaining the latents of SAEs trained on models like Llama 3.1 7b or Gemma 2 9b requires the generation of millions of explanations. As an example, the most extensive open-source set of SAEs available, Gemmascope, includes SAEs for all layers of Gemma 2 9b and Gemma 2 2b and would require explaining tens of millions of latents. – source: [13]

Measuring the quality of features and SAEs is difficult precisely because of the lack of a gold-standard dictionary. Most work has demonstrated the quality of SAEs on toy datasets. But if we want to use SAEs as diagnostic tools or to intervene on model features, we need to know the quality of the learned representation and find better ways to identify what the features mean.

It has been suggested to create datasets specifically designed to test features, and from them ground-truth benchmarks. One interesting approach uses board games: in a synthetic setting, all ground-truth features are known, and language models can be trained on board game transcripts. This way, one can test how much of that knowledge the SAEs capture [15].

image source: [15]

Another promising approach is to use LLMs to interpret features:

One of the first approaches to automated interpretability focused on explaining neurons of GPT-2 using GPT-4. GPT-4 was shown examples of contexts where a given neuron was active and was tasked to provide a short explanation that could capture the activation patterns. To evaluate if a given explanation captured the behavior of the neuron, GPT-4 was tasked to predict the activations of the neuron in a given context having access to that explanation. [13]

image source: [13]
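A schematic sketch of that scoring loop follows; the two LLM calls are left as hypothetical placeholders, and the real pipeline in [13] is considerably more refined:

```python
# Sketch of automated-interpretability scoring: explanation quality is measured
# by how well a simulator, given only the explanation, predicts the feature's
# real activations. Both LLM calls are hypothetical placeholders here.
import numpy as np

def explain_feature(top_examples: list) -> str:
    """Placeholder for an explainer-LLM call on max-activating examples."""
    return "fires on mentions of bridges"                      # hypothetical output

def simulate_activations(explanation: str, tokens: list) -> np.ndarray:
    """Placeholder for a simulator-LLM call that guesses activations from the explanation."""
    return np.array([1.0 if t.lower() == "bridge" else 0.0 for t in tokens])

tokens = ["The", "Golden", "Gate", "Bridge", "spans", "the", "bay"]
true_acts = np.array([0.0, 0.7, 0.8, 1.0, 0.1, 0.0, 0.3])      # made-up measured activations

explanation = explain_feature(["...Golden Gate Bridge...", "...old stone bridge..."])
simulated = simulate_activations(explanation, tokens)

score = np.corrcoef(true_acts, simulated)[0, 1]                # correlation = explanation score
print(f"explanation: {explanation!r}  score: {score:.2f}")
```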

The SAE geometry

The usefulness of these models also depends on understanding the structure of what they have learned. With some of these SAEs now publicly available [19], several studies have focused on the geometric structure of the concepts extracted from LLMs. One of the first interesting results is an "atomic" structure similar to the one observed in word embeddings:

By this we mean geometric structure reflecting semantic relations between concepts, generalizing the classic example of (a, b, c, d)= (man, woman, king, queen) forming an approximate parallelogram where b − a ≈ d − c. [18]

These structures seem to be found in Layers 0 and 1 of the LLM, where SAE features represent single words. Using dimensionality reduction techniques, one can obtain clusters of features with similar semantic functions.

image source: [18]
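Each SAE feature corresponds to a direction in activation space (a column of the decoder), so one rough way to probe for this parallelogram structure is to compare difference vectors between decoder directions. A sketch, reusing the SparseAutoencoder from the earlier sketches and entirely hypothetical feature indices:

```python
# Sketch of the parallelogram test from [18] on SAE decoder directions:
# if features encode (man, woman, king, queen), then woman - man should point
# roughly the same way as queen - king. Indices are hypothetical placeholders.
import torch.nn.functional as F

directions = sae.decoder.weight.T                     # (n_features, d_model), one direction per feature
man, woman, king, queen = 101, 202, 303, 404          # hypothetical feature indices

diff_gender = directions[woman] - directions[man]
diff_royal = directions[queen] - directions[king]
similarity = F.cosine_similarity(diff_gender, diff_royal, dim=0)
print(f"cos(woman - man, queen - king) = {similarity.item():.3f}")   # near 1 suggests a parallelogram
```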

In the same study [18], the authors also analyze whether functionally similar groups of SAE features (features that tend to fire together) are also geometrically close (and thus form the equivalent of "lobes"). In the human brain, functionally similar neurons are in fact located in specialized areas: neurons involved in speech production sit in Broca’s area, neurons involved in vision in the visual cortex, and so on. Starting from the co-occurrence of SAE features that fire for the same document, the authors test whether such functional "lobes" can be identified. These lobes do appear to exist and show spatial modularity.

image source: [18]

Another interesting finding is that the middle layers seem to act as a bottleneck, compressing information (according to the authors, for a more efficient representation of high-level abstractions). The middle layers are thus a transitional stage between the atomic features of the early layers (representing concepts tied to single words) and the more abstract, complex concepts of the later layers.

image source: [18]


Parting thoughts

In this article, we discussed the complexity of defining features within a neural network model. Motivated by this search for interpretability, a new paradigm of mechanistic interpretability has developed in recent years, in which the features that emerge within models can be defined and studied. In this line of research, we presented SAEs, which can be seen (still with limitations) both as diagnostic tools and as a means of conducting interventions within LLMs (and other models). We have also seen how SAEs can be evaluated, and we discussed their internal representation.

This is not the endpoint. SAEs have revolutionized our view of the inner workings of LLMs, but there is still much exciting research ahead. In conclusion, this article offers a perspective on, and an introduction to, an intriguing and evolving field.

Research on SAEs is moving forward both to reduce their limitations and to broaden their applications. For example, SAEs are today being applied to any type of Transformer, and an intriguing application is applying them to protein language models (models, such as AlphaFold, that learn the structure of a protein) [22].

Recently, Anthropic presented a new variant of the SAE, the sparse crosscoder, which extends its capabilities [20, 21]. Sparse crosscoders can be applied across multiple layers and can thus learn features that are spread over layers, simplify circuit analysis, and monitor what happens when fine-tuning a model.

What do you think about it? Have you used, or are you planning to use, SAEs? To which applications would you like to apply them? Let me know in the comments.


If you have found this interesting:

You can look for my other articles, and you can also connect with or reach me on LinkedIn. Check this repository containing weekly updated ML & AI news. I am open to collaborations and projects, and you can also subscribe for free to get notified when I publish a new story.


Here is the link to my GitHub repository, where I am collecting code and many resources related to Machine Learning, artificial intelligence, and more.

GitHub – SalvatoreRa/tutorial: Tutorials on machine learning, artificial intelligence, data science…

or you may be interested in one of my recent articles:

A Requiem for the Transformer?

What Is The Best Therapy For a Hallucinating AI Patient?

LLMs and the Student Dilemma: Learning to Solve or Learning to Remember?

You Know Nothing, John LLM: Why Do You Answer Anyway?


Reference

Here is the list of the principal references I consulted to write this article; only the first author of each article is cited.

  1. Olah, 2022, Mechanistic Interpretability, Variables, and the Importance of Interpretable Bases, link
  2. Bereska, 2024, Mechanistic Interpretability for AI Safety A Review, link
  3. Anthropic, 2022, Toy Models of Superposition, link
  4. Mikolov, 2013, Linguistic Regularities in Continuous Space Word Representations, link
  5. Scherlis, 2022, Polysemanticity and Capacity in Neural Networks, link
  6. Yan, 2024, Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective, link
  7. LessWrong, 2022, Engineering Monosemanticity in Toy Models, link
  8. Cunningham, 2023, Sparse Autoencoders Find Highly Interpretable Features in Language Models, link
  9. SAELens, GitHub repository, link
  10. Anthropic, 2024, Golden Gate Claude, link
  11. Templeton, 2024, Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, link
  12. OpenAI, 2024, sparse_autoencoder, link
  13. Paulo, 2024, Automatically Interpreting Millions of Features in Large Language Models, link
  14. Gorton, 2024, The Missing Curve Detectors of InceptionV1: Applying Sparse Autoencoders to InceptionV1 Early Vision, link
  15. Karvonen, 2024, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models, link
  16. Anthropic, 2024, Evaluating feature steering: A case study in mitigating social biases, link
  17. Anthropic, 2024, Evaluating feature steering: A case study in mitigating social biases, link
  18. Li, 2024, The Geometry of Concepts: Sparse Autoencoder Feature Structure, link
  19. Lieberum, 2024, Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2, link
  20. Anthropic, 2024, Sparse Crosscoders for Cross-Layer Features and Model Diffing, link
  21. LessWrong, 2024, Open Source Replication of Anthropic’s Crosscoder paper for model-diffing, link
  22. Interprot, 2024, link
