LLM: A New Method for Discovering Features

Key Takeaways

1. A recent project explores the use of LLMs to discover behavioral characteristics in language models.

2. The method involves segmenting transcripts into distinct parts and analyzing each part with an LLM autorater to identify notable features.

3. Unlike other methods, this approach does not require access to the model's internals, which simplifies the process.

💡 Why it matters: This method could change how researchers analyze and optimize language models, making complex behaviors easier to understand.

Introduction to LLM-Driven Feature Discovery

Understanding how language models behave across different distributions is an open problem in artificial intelligence. Whether the setting is deployment, reinforcement learning, or evaluations, it is useful to discover new behaviors, identify the causes of target behaviors, and detect unexpected correlations. A recent exploratory project addresses this problem with LLM-Driven Feature Discovery. The method proceeds as follows:

Selection of a dataset consisting of model transcripts.
Division of the transcripts into three segments: user exchanges, model thoughts, and assistant responses.
Use of an LLM autorater to generate between 10 and 20 "features" for each transcript segment. These features capture notable or interesting aspects of that segment, and the prompt is written specifically for this purpose. The autorater processes only one segment at a time.
Obtaining a semantic embedding for each generated feature.
Clustering the semantic embeddings separately for user, thought, and response features.
Asking a language model to name each cluster from 100 randomly sampled features in that cluster, producing a concise label that captures their common theme.
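The non-LLM parts of the steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the project's code: the turn format (`{"kind": ..., "text": ...}`), the function names, and `fake_autorater` are all hypothetical, and the embedding and clustering steps are left out because the source does not say which models or algorithms were used.

```python
import random

# Segment kinds named in the text: user exchanges, model thoughts, assistant responses.
SEGMENT_KINDS = ("user", "thought", "response")


def split_transcript(turns):
    """Group a transcript's turns into one text per segment kind.

    `turns` is assumed to be a list of {"kind": ..., "text": ...} dicts.
    """
    segments = {kind: [] for kind in SEGMENT_KINDS}
    for turn in turns:
        segments[turn["kind"]].append(turn["text"])
    return {kind: "\n".join(texts) for kind, texts in segments.items()}


def featurize_transcript(turns, autorater):
    """Call the autorater once per non-empty segment, so it sees one part at a time."""
    segments = split_transcript(turns)
    return {kind: autorater(kind, text) for kind, text in segments.items() if text}


def sample_for_naming(cluster_to_features, per_cluster=100, seed=0):
    """Pick up to `per_cluster` random features from each cluster for the naming prompt."""
    rng = random.Random(seed)
    samples = {}
    for cluster_id, features in cluster_to_features.items():
        k = min(per_cluster, len(features))
        samples[cluster_id] = rng.sample(features, k)
    return samples


def fake_autorater(kind, text):
    # Stand-in for the LLM call (hypothetical): returns one made-up feature.
    return [f"Placeholder feature for {kind}"]
```

In a real run, `fake_autorater` would be replaced by a call to the autorater LLM with the prompt described later in this article.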

This project has sometimes been described as a kind of "black box SAE" (sparse autoencoder), because it addresses a problem similar to that of SAEs used to featurize text models, but without requiring access to the model's internals.

Comparison with Existing Methods

After completing this project, it became clear that the approach shares similarities with the method from "Explaining Datasets in Words: Statistical Models with Natural Language Parameters" (EDW). EDW optimizes directions in an embedding space and associates them with natural language features, called "predicates." The output of EDW is therefore comparable to that of our method. However, our approach is simpler: it requires only one LLM call per prompt and no iterative optimization steps. It is also unsupervised, so it needs no target against which to optimize the embedding directions. EDW might be preferable if the goal is to minimize the error of a specific statistical model with natural language features.

Given the preliminary nature of this work, we have not conducted comparisons with EDW or other methods in the literature. We do not plan to pursue this idea for now, but we would be interested if other community members decide to explore it further.

Analysis of Main Results

Our analysis focused on a dataset of 100,000 chat transcripts, from which we generated 20,000 features for users, thoughts, and responses.

We observed that:

Many clusters describe interesting behaviors of Gemini.
It is generally difficult to predict when a thought or response feature occurs using logistic regression on user features.

Prompt Used by the Autorater

For each given conversation section, the goal is to identify the key "features." Here are some examples of possible features:

The model expresses depression
Discussion about apples
Use of markdown
Revision of its reasoning
Self-correction in reasoning
Prompt with a few examples
Lack of access to the required tool
Hallucination of a tool call
Request for creative writing
The model adopts a personality
The model adopts the personality of a coding expert
Disjointed and hard-to-follow thoughts
Use of emojis
Use of bullet points
Marked realism
Fictional
Flattering response
Awareness of evaluations
Typos
Role-playing
About [topic]
Use of placeholders
In Mandarin

Features should be prioritized according to the following criteria:

Interesting: Features should represent new or surprising behaviors.
Appropriate abstraction: Features should be specific enough to be useful, without being too narrow or too broad.
Uniqueness: Features should be as distinct as possible, with minimal duplication.

Features should be written using only letters a-z, without parentheses, colons, numbers, etc. Only the first word and proper nouns should be capitalized. It may be helpful to brainstorm many features and then select the best ones according to these criteria.
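The formatting and uniqueness rules above lend themselves to a simple post-processing check on the autorater's output. The sketch below is an assumption about how such a filter could look; the source does not say that any filter was applied, and the function names are hypothetical. It checks only what can be checked mechanically: letters and single spaces, and a capitalized first word.

```python
import re

# Letters and single spaces only, mirroring the "only letters a-z" rule in the prompt.
_FEATURE_PATTERN = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)*")


def is_well_formed_feature(feature):
    """Return True if `feature` uses only letters and spaces and starts with a capital."""
    if not _FEATURE_PATTERN.fullmatch(feature):
        return False
    return feature[0].isupper()


def dedupe_features(features):
    """Drop case-insensitive duplicates while keeping the first occurrence."""
    seen = set()
    unique = []
    for feature in features:
        key = feature.lower()
        if key not in seen:
            seen.add(key)
            unique.append(feature)
    return unique
```

Note that some of the prompt's own examples contain hyphens ("Self-correction in reasoning", "Role-playing"), which this strict pattern would reject. Whether to allow hyphens is a design choice the source leaves open.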

Comparison with SAEs

LLM-Driven Feature Discovery is distinguished by several aspects:

Training procedure: It involves asking an LLM to featurize conversations, then embedding and grouping the features, before naming the clusters.
Inference procedure: It involves asking an LLM to featurize a conversation, then searching for the corresponding clusters.
Specificity of features: Features are specific to each block of conversation.
Relationship of features to model computation: There is no direct relationship.
Access to target model required: No, only the model's outputs.
Why a feature applies in a certain context: The LLM determines its applicability.

In contrast, SAEs:

Training procedure: Involve reconstructing activations with a sparsity penalty, then asking an LLM to interpret the hidden latents.
Inference procedure: Involve passing the conversation through the target LLM to obtain activations, then passing the activations through the SAE.
Specificity of features: By token.
Relationship of features to model computation: Directions in the activation space.
Access to target model required: Yes.
Why a feature applies in a certain context: The latent direction is useful for reconstructing the activation.

Overall, LLM-Driven Feature Discovery has some advantages over SAEs: clearer explanations of why a feature applies in a given context, higher-level features, and no need to access the model's internals. It also has drawbacks: the features have no connection to model activations, which limits their use for steering, and the method has a higher computational cost.
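The inference procedure for LLM-Driven Feature Discovery (featurize a conversation, then look up the matching clusters) can be illustrated with a nearest-centroid lookup. This is a sketch under assumptions: the source only says the clusters are "searched for", so cosine similarity, the use of centroids, and the names `cosine_similarity` and `nearest_cluster` are illustrative choices, and the two-dimensional vectors in the usage note stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_cluster(feature_embedding, centroids):
    """Return the label of the centroid most similar to the feature embedding.

    `centroids` is assumed to map a cluster label to its centroid vector.
    """
    return max(
        centroids,
        key=lambda label: cosine_similarity(feature_embedding, centroids[label]),
    )
```

For example, with `centroids = {"Use of markdown": [1.0, 0.0], "Role playing": [0.0, 1.0]}`, a feature embedded at `[0.9, 0.1]` is assigned to "Use of markdown".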

Cluster Results

To get a general qualitative sense of these clusters, we asked an LLM to rate clusters on their potential interest to a safety researcher, on a scale of 1 to 100. The rating LLM received 10 clusters at a time, which helps calibrate its scores, along with a few examples from each cluster. We also asked the LLM to provide a one-sentence description of each cluster and included five examples of the original features grouped in each cluster.

We found many interesting high-level features, particularly in the model's thoughts. Examples include the model being aware of the number of tokens it can generate, considering whether the scenario is real or role-play, and getting stuck in infinite loops. Qualitatively, clusters rated as medium or low interest also seem to be "good" features, as they describe consistent model behavior.
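The batching used for the ratings in this section (10 clusters per call, five example features per cluster) is straightforward to sketch. The cluster format, the function names, and the prompt wording below are hypothetical. Taking the first five features is a simplification, since the source does not say how the examples were chosen.

```python
def batch_clusters(clusters, batch_size=10, examples_per_cluster=5):
    """Split clusters into rating batches, keeping a few example features for each.

    Each cluster is assumed to be a {"label": str, "features": list[str]} dict.
    """
    batches = []
    for start in range(0, len(clusters), batch_size):
        batch = []
        for cluster in clusters[start:start + batch_size]:
            batch.append({
                "label": cluster["label"],
                "examples": cluster["features"][:examples_per_cluster],
            })
        batches.append(batch)
    return batches


def format_rating_prompt(batch):
    """Render one batch as plain text for a rater LLM (illustrative wording only)."""
    lines = ["Rate each cluster's interest on a scale of 1 to 100."]
    for index, cluster in enumerate(batch, start=1):
        lines.append(f"{index}. {cluster['label']}")
        for example in cluster["examples"]:
            lines.append(f"   - {example}")
    return "\n".join(lines)
```

Showing several clusters in one prompt gives the rater a reference set to compare against, which is the calibration effect the text describes.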

Cluster Prediction

We were also interested in predicting the model's behavior. Another experiment tested whether we could predict the assistant's thought and response features from user features. We trained logistic regression probes on the 1,000 most common thought and response clusters. The input is a sparse binary vector with a one for each user feature that is present. We report the test F1 score of our trained probes, which is the harmonic mean of precision and recall. This is a challenging metric: to achieve high precision, the probe must have a very low false positive rate, because the thought or response feature does not occur in most transcripts and the probe must correctly predict its absence there. Overall, this does not work very well.

The clusters that can be predicted are mostly obvious ones; for example, HTTP status codes in the response are predicted from API references and mentions. We include the five thought and response features with the highest F1 scores:

Thought: Age estimation and refinement strategies - F1: 0.7761
Thought: Gender identification and prediction (+4.8788)
Thought: Demographic data and analysis (+4.7696)
Thought: Online image data and metadata (-1.4548)
Thought: Detailed background information (-1.4476)
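The input encoding and the metric used in this section can be made concrete with a short sketch. The function names and toy labels are illustrative assumptions, and the probe itself (logistic regression) is omitted. What the sketch shows is the binary input vector and why F1 is demanding when the target feature is rare.

```python
def binary_feature_vector(present_clusters, vocabulary):
    """One entry per user cluster in `vocabulary`: 1 if present in the transcript, else 0."""
    present = set(present_clusters)
    return [1 if cluster in present else 0 for cluster in vocabulary]


def f1_score(y_true, y_pred):
    """F1 for binary labels: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With a feature that occurs in one of four transcripts (`y_true = [1, 0, 0, 0]`), a probe that always predicts "present" has perfect recall but precision 0.25, for an F1 of 0.4. A probe that always predicts "absent" scores 0.0, even though its accuracy is 75%.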

Final Thoughts

A proxy task that seems interesting is to build a natural language report (potentially very long) such that, by reading it, one could understand how Gemini would act in many situations. Operationalized, this could look like "asking an LLM to predict the distribution of the target model's responses on an arbitrary sample."