LLM-Driven Feature Discovery¶
Ch01.904 LLM-Driven Feature Discovery¶
📊 Level ⭐⭐⭐⭐ | 3.5KB |
entities/llm-driven-feature-discovery.md
LLM-Driven Feature Discovery¶
概述¶
Published Time: 2026-06-22T22:26:51.599Z
Markdown Content: We would often like to get a qualitative sense of a target model’s behaviors in important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.
In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:
- Choose a dataset of model transcripts
- Split transcripts into three pieces: user turns, thoughts, and assistant responses.
- Ask a black box LLM autorater to generate a set of 10-20 “features” of each transcript piece. By feature we mean notable/interesting/important aspects of the transcript piece; we include the prompt we use below. Note that the autorater only sees one piece at a time.
- Get a semantic embedding for each generated feature
- Cluster the semantic embeddings separately for user, thoughts, and response features
- Ask a language model to name each cluster by giving it 100 random features for each cluster and asking it to “produce a single concise label (around 5 words) that captures the common theme of these features.”.

During the project, we sometimes thought of this work as a sort of "black box SAE", since it was solving a similar problem as SAEs of featurizing model text, but without using model internals.
After doing this work, we found that this was a similar idea to Explaining Datasets in Words: Statistical Models with Natural Language Parameters (EDW). EDW optimizes over directions in an embedding space and then maps those directions to natural language features (“predicates”). Thus, EDW’s output is similar to ours. However, our method is simpler in that it requires just one LLM call per prompt and does not require multiple steps of iteration. Additionally, our method is unsupervised; we don’t need a target to optimize the embedding directions against. EDW seems preferable if one aims to minimize the error of a specific statistical model with natural language features.
Since this is preliminary work, we do not compare against EDW or other methods in the literature. We are not currently planning on pursuing this idea further, but would be interested if other members in the community expanded on it.
A short summary of our main results:¶
We focus our analysis on a dataset of 100k chat transcripts, for which we generate 20k user, thought, and response features.
We find that:
- Many clusters describe interesting Gemini behaviors
- We mostly are not able to predict when a thought or response occurs using logistic regression on use
原文存档¶
→ 原文存档