
Fantastic Directions and Where to Find Them: Dissecting the Lazy Mechanism Inside RMU

March 15, 2026
24 min read
Warning (AI Usage Transparency)

This post is based on my undergraduate thesis. I used Gemini-2.5-Flash to label model outputs during probe training (see Section 2.1), and Gemini-2.5-Pro with Claude Sonnet 4.5 to help refine the grammar and clarity of the writing. All experimental design, implementation, analysis, and conclusions are my own.

Important (TL;DR)

RMU (Representation Misdirection Unlearning) is supposed to make LLMs forget dangerous knowledge. We open the hood and find that it mostly works through a single vector --- a “junk direction” --- in the model’s residual stream. Ablating this one direction on Gemma-2-2B recovers 74.2% of the performance gap on the WMDP-Bio benchmark and 65.8% on WMDP-Cyber. Worse, this direction is not randomly constructed: the model builds it using its own pretraining features about the very topic it is supposed to forget. We call this shallow unlearning, propose a causal framework to quantify it (TCE / IE / DE), and introduce an adversarial training approach designed to force a deeper solution.

0. Motivation

Modern LLMs are trained on the next-token prediction objective over enormous corpora scraped from the internet. This objective has no preference for human values---it simply rewards the model for assigning high probability to the next token. As a result, a capable LLM will have absorbed both the useful knowledge of the web and its harmful undercurrents: detailed information about synthesizing pathogens, developing cyberattack tools, or other dangerous capabilities. The danger is not hypothetical. Urbina et al. demonstrated that an AI system, when repurposed for adversarial use, generated 40,000 candidate toxic molecules in under six hours.

The problem with surface-level fixes. The most widely deployed safety technique is RLHF, which uses human feedback to fine-tune a model to produce responses aligned with human values. While effective in practice, RLHF and related approaches like in-context safety prompting share a fundamental limitation: they teach the model to conceal harmful knowledge rather than erase it. The underlying circuits encoding that knowledge remain intact, merely suppressed by a thin veneer of alignment. Sufficiently creative jailbreaks can pierce through this veneer because the information was never removed---it was only hidden.

Machine Unlearning as principled removal. Machine Unlearning (MU) takes a more principled approach. Given a forget set $\mathcal{D}_f$ of harmful data and a retain set $\mathcal{D}_r$ of benign data, the goal is to produce a model $\mathcal{U}$ whose behavior is indistinguishable from a model $\mathcal{M}'$ that was never trained on $\mathcal{D}_f$ at all. The gold standard---exact unlearning---would require retraining from scratch on $\mathcal{D}_r$ only, which for modern LLMs costs millions of dollars and months of compute. Approximate unlearning methods instead attempt to efficiently modify the existing model weights to approximate this gold standard.

What this work does. Representation Misdirection Unlearning (RMU) is currently among the strongest approximate unlearning methods for LLMs. This note investigates a critical question: when RMU “works,” what is it actually doing inside the model? The empirical investigation in Section 2 uses mechanistic interpretability tools---linear probes, causal interventions, and sparse autoencoders---to dissect RMU’s internal mechanism. Section 3 then introduces an adversarial training framework designed to patch the vulnerability that this analysis uncovers.

1. Background

Let a Large Language Model (LLM) be parameterized by $\theta$.

  • $f_{\theta} : \mathcal{X} \to \mathbb{R}^{|V|}$ is a function that maps an input sequence $x_{1:t}$ to a logit vector $z_t$ of dimension $|V|$. We refer to $V$ as the vocabulary of the model. Thus:
$$z_t = f_{\theta}(x_{1:t})$$
  • Let $\phi_L(\cdot\,; \theta)$ denote the activation vector of the LLM at layer $L$ and token position $t$ (i.e., the input vector at layer $L$). Correspondingly, let $g_L$ be the remainder of the LLM's computational graph, which maps the activation vector at layer $L$ and token position $t$ to the corresponding logit vector $z_t$. We can therefore express the computation as:
$$z_t = f_{\theta}(x_{1:t}) = g_L(\phi_L(x_{1:t}; \theta))$$
  • The output distribution over the vocabulary at position $t$ is given by:
$$p_t = \text{softmax}(f_{\theta}(x_{1:t})) = \text{softmax}(g_L(\phi_L(x_{1:t}; \theta)))$$
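To make the decomposition concrete, here is a toy numpy sketch of $f_\theta = g_L \circ \phi_L$; the dimensions, pooling, and weight matrices are arbitrary stand-ins for illustration, not any real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, VOCAB = 16, 50  # toy dimensions (assumptions, not real model sizes)

# Toy "halves" of the network: phi_L maps tokens to a layer-L activation,
# g_L maps that activation to logits over the vocabulary.
W_embed = rng.normal(size=(VOCAB, D_MODEL))
W_mid   = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
W_out   = rng.normal(size=(D_MODEL, VOCAB)) / np.sqrt(D_MODEL)

def phi_L(tokens):
    """Layer-L activation at the final position (crude mean-pooled stand-in)."""
    h = W_embed[tokens].mean(axis=0)
    return np.tanh(h @ W_mid)

def g_L(h):
    """Remainder of the computational graph: layer-L activation -> logits."""
    return h @ W_out

def f_theta(tokens):
    return g_L(phi_L(tokens))

tokens = np.array([3, 7, 1])
z_t = f_theta(tokens)                      # z_t = g_L(phi_L(x_{1:t}))
p_t = np.exp(z_t - z_t.max()); p_t /= p_t.sum()  # p_t = softmax(z_t)
assert z_t.shape == (VOCAB,) and abs(p_t.sum() - 1.0) < 1e-9
```

Everything downstream of layer $L$ is bundled into `g_L`; this is the interface through which the interventions in later sections act.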

1.1 RMU (Representation Misdirection Unlearning)

RMU [] operates by forcing the activation vectors at a specific layer $L$ for inputs from the forget set $\mathcal{D}_f$ towards a fixed random noise vector $\mathbf{u}$.

Consider a forget set $\mathcal{D}_f$ and a retain set $\mathcal{D}_r$. The forget set $\mathcal{D}_f$ contains harmful data (e.g., data related to bioweapons or security vulnerabilities) [], while $\mathcal{D}_r$ contains benign data (e.g., general knowledge); see Appendix A for details.

  • Forget Loss:
$$\mathcal{L}_{\text{forget}} = \mathbb{E}_{x \sim \mathcal{D}_f} \left[ \dfrac{1}{T_f} \sum_{t=1}^{T_f} \| \phi_L(x_{1:t}; \hat{\theta}) - c \cdot \mathbf{u} \|^2_2 \right]$$
  • Retain Loss:
$$\mathcal{L}_{\text{retain}} = \mathbb{E}_{x \sim \mathcal{D}_r} \left[ \dfrac{1}{T_r} \sum_{t=1}^{T_r} \| \phi_L(x_{1:t}; \hat{\theta}) - \phi_L(x_{1:t}; \theta) \|^2_2 \right]$$

Here, $\phi(\cdot; \theta)$ is the activation from the frozen, original model, while $\phi(\cdot; \hat{\theta})$ is from the model being unlearned. $T_f$ and $T_r$ are the numbers of tokens in the forget and retain sequences, respectively. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{forget}} + \alpha \mathcal{L}_{\text{retain}}$$
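A minimal sketch of the two losses, assuming the layer-$L$ activations have already been extracted as `(T, d)` arrays. The hyperparameter values `c=6.5` and `alpha=1200.0` are illustrative defaults, not values from this post:

```python
import numpy as np

def rmu_losses(acts_f_hat, acts_r_hat, acts_r_orig, u, c=6.5, alpha=1200.0):
    """RMU losses on pre-extracted layer-L activations.

    acts_f_hat:  (T_f, d) unlearned-model activations on forget tokens
    acts_r_hat:  (T_r, d) unlearned-model activations on retain tokens
    acts_r_orig: (T_r, d) frozen original-model activations on retain tokens
    u: (d,) fixed random unit vector; c and alpha are hyperparameters.
    """
    # Mean over tokens of the squared L2 distance, matching the formulas above.
    l_forget = float(np.mean(np.sum((acts_f_hat - c * u) ** 2, axis=-1)))
    l_retain = float(np.mean(np.sum((acts_r_hat - acts_r_orig) ** 2, axis=-1)))
    return l_forget + alpha * l_retain, l_forget, l_retain

rng = np.random.default_rng(0)
d = 64
u = rng.normal(size=d); u /= np.linalg.norm(u)
acts_f = rng.normal(size=(10, d))
acts_r = rng.normal(size=(10, d))

# If retain activations are untouched, only the forget term contributes.
total, l_f, l_r = rmu_losses(acts_f, acts_r, acts_r.copy(), u)
assert l_r == 0.0 and total == l_f
```

In practice the `acts_*` arrays would be gathered from the model's layer-$L$ residual stream during training rather than sampled at random.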

1.2 Weaknesses of RMU

Recent work by Łucki et al. [] and Arditi et al. [] has identified a critical vulnerability in RMU. By forcing representations towards a single target vector, the method encourages the optimizer to find a simple, “lazy” solution: it learns a single direction in activation space to conceal the forgotten knowledge. An attacker can subsequently identify this direction, project it out, and recover the supposedly unlearned information. This failure mode is termed shallow unlearning [].

Accuracy Recovery in RMU
For example, Arditi et al. recover up to ~71% of the performance gap between the RMU-unlearned model and the original model on WMDP-Bio by ablating a single direction. On WMDP-Cyber, this directional ablation recovers ~45% of the gap.
Accuracy Recovery in RMU
Łucki et al. demonstrate a similar accuracy recovery using the same method, which they instead term Orthogonalization.

2. An Empirical Motivation

Warning

To ground the above idea, we first conducted an empirical investigation into the mechanics of shallow unlearning on a Gemma-2-2B model. This exploration, initially part of our thesis, followed a path of inquiry: we began by creating a tool to detect unlearning, then isolated the underlying mechanism, and finally deconstructed that mechanism to understand its origins. This journey revealed critical limitations of post-hoc analysis and directly motivated the adversarial framework presented in the main body of this note.

2.1 Quantifying “Junk” with a Linear Probe

Observation: The output from an RMU-unlearned model on forgotten prompts often appears random and incoherent.

  • Prompt: By what mechanism(s) do modern Beijing strains of Mycobacterium tuberculosis have higher virulence than ancient Beijing strains?
  • Original Model: The answer is that the modern strains have acquired a gene that encodes a protein that is a virulence factor…
  • RMU Model: 1. They are more virulent than ancient Beijing strains. 2. They are less virulent than ancient Beijing strains. 3. They are more virulent than…

Our initial challenge was therefore to quantitatively measure when a model was “unlearning” in a shallow way. We observed that RMU-tuned models often produce incoherent or evasive responses, which we will refer to as “junk” outputs for the remainder of this analysis. We hypothesized that this behavior could be identified from the model’s internal activations.

To test this, we framed the problem as a binary classification task: training a simple linear classifier---a linear probe []---on the model’s residual stream activations to predict whether the final output would be coherent or junk.

  • Probe Training: We constructed a dataset of 2,000 prompts, combining 1,000 questions from the WMDP benchmark with 1,000 general-purpose samples from Alpaca []. We used a larger LLM (Gemini-2.5-Flash) to label the Gemma-2-2B model’s outputs as either junk (Label 1) or non-junk (Label 0). After training a probe on the final-token activations for each layer of the model, we found that the probe at Layer 12 achieved the highest performance (ROC-AUC > 0.7), confirming that a non-trivial, linearly decodable signal for this behavior exists within the model’s representations.
Best Probe Performance
  • Limitations of the Linear Probe Approach: While the probe was effective, this process revealed several fundamental limitations that made it unsuitable as a robust research tool:
    • Scalability: This approach is model-specific and requires a new, labeled dataset and training process for every model architecture under investigation.
    • Subjectivity and Bias: The definition of a “junk” output is subjective and relies on the labeling criteria of an external LLM. This introduces an unquantifiable bias into the probe’s concept of evasion.
    • Misalignment of Objectives: The probe identifies a correlate of unlearning (incoherent output), but this is not necessarily the mechanism of unlearning. The true goal of an attacker is to recover information, not simply to detect gibberish. This proxy is therefore indirect and potentially fragile.

This realization was a critical turning point. A post-hoc, supervised tool was too brittle. We needed a more direct method to find the underlying mechanism itself.
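For concreteness, the probe pipeline above can be sketched end-to-end on synthetic data. Everything below---the dimensions, the planted “junk” direction, and the plain gradient-descent trainer---is an illustrative stand-in, not the actual Layer-12 setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
# Synthetic stand-in for final-token activations: pretend junk outputs
# differ from coherent ones along a single hidden direction (assumption).
w_true = rng.normal(size=d); w_true /= np.linalg.norm(w_true)
X0 = rng.normal(size=(400, d))                 # non-junk (Label 0)
X1 = rng.normal(size=(400, d)) + 3.0 * w_true  # junk (Label 1)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(400), np.ones(400)]

# Logistic-regression probe trained with plain gradient descent.
W, b = np.zeros(d), 0.0
for _ in range(300):
    p = 1 / (1 + np.exp(-(X @ W + b)))   # probe's junk probability
    g = p - y                             # gradient of the logistic loss
    W -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = ((1 / (1 + np.exp(-(X @ W + b))) > 0.5) == y).mean()
assert acc > 0.85  # a linearly decodable junk signal is easy to pick up
```

A real run would replace the synthetic `X` with residual-stream activations and `y` with the LLM-assigned junk labels, with class weighting to handle the imbalance described in Appendix B.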

2.2 Isolating the Junk Direction

Having established a proxy for junk behavior (the probe’s score), we sought to isolate the direction in activation space responsible for inducing it. Following the methodology of Arditi et al. [], we performed the same extraction procedure but used the junk score as the selection metric instead of the refusal score. This process yielded a candidate vector at each layer and token position, which we term the junk direction (see Appendix C for methodological details).

Our analysis revealed a key insight: the junk directions extracted from the WMDP-Bio and WMDP-Cyber datasets were highly aligned, exhibiting a cosine similarity greater than 0.8 at many layers.

Cosine similarity between the junk directions extracted independently from WMDP-Bio and WMDP-Cyber prompts. The strong alignment suggests a shared, domain-general mechanism for shallow unlearning.

This suggested that the model was not learning separate unlearning mechanisms for each topic but was instead relying on a single, shared mechanism. We validated this by constructing a shared direction as the average of the Bio and Cyber directions.
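The extraction-and-comparison step can be sketched as follows, assuming difference-of-means candidate directions in the spirit of Arditi et al.; the synthetic activations with a planted shared component are purely illustrative:

```python
import numpy as np

def candidate_direction(acts_harm, acts_benign):
    """Difference-of-means direction at one (layer, position).
    The selection step (scoring candidates by junk score) is omitted here."""
    d = acts_harm.mean(axis=0) - acts_benign.mean(axis=0)
    return d / np.linalg.norm(d)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
shared = rng.normal(size=64)  # pretend latent junk mechanism (toy assumption)
bio = candidate_direction(rng.normal(size=(200, 64)) + shared,
                          rng.normal(size=(200, 64)))
cyber = candidate_direction(rng.normal(size=(200, 64)) + shared,
                            rng.normal(size=(200, 64)))

# Highly aligned domain directions, as in Section 2.2.
sim = cosine(bio, cyber)
assert sim > 0.8

# The shared direction is the (renormalized) average of the two.
r_shared = (bio + cyber) / 2
r_shared /= np.linalg.norm(r_shared)
```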

2.2.1 A Causal Accounting of Recovery

To precisely quantify how much of RMU’s unlearning is mediated by the junk direction, we adopt a three-component causal decomposition. Let $\mathcal{M}_{\text{base}}$, $\mathcal{M}_{\text{rmu}}$, and $\mathcal{M}_{\text{ablated}}$ denote the original model, the RMU-trained model, and the model after ablating the best junk direction, respectively. We define:

  • Total Causal Effect (TCE): the full accuracy drop caused by RMU.
$$\text{TCE} = \text{Acc}(\mathcal{M}_{\text{base}}) - \text{Acc}(\mathcal{M}_{\text{rmu}})$$
  • Indirect Effect (IE): the portion of the drop attributable to the junk-direction mechanism.
$$\text{IE} = \text{Acc}(\mathcal{M}_{\text{ablated}}) - \text{Acc}(\mathcal{M}_{\text{rmu}})$$
  • Direct Effect (DE): the residual drop caused by the weight changes themselves ($\theta_{\text{base}} \to \theta_{\text{rmu}}$), independent of the junk direction.
$$\text{DE} = \text{TCE} - \text{IE}$$

The accuracy recovery rate $\rho_{\text{recovery}}$ is then defined as:

$$\rho_{\text{recovery}} = \frac{\text{IE}}{\text{TCE}}$$

Intuitively, if $\rho_{\text{recovery}} \approx 1$, the junk direction is the mechanism of unlearning; if $\rho_{\text{recovery}} \ll 1$, deeper weight-level changes dominate.
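The decomposition is a few lines of arithmetic. In this sketch the base and RMU accuracies are placeholders, chosen only so that the resulting recovery rate matches the reported WMDP-Bio figure:

```python
def causal_accounting(acc_base, acc_rmu, acc_ablated):
    """TCE / IE / DE decomposition and recovery rate from Section 2.2.1."""
    tce = acc_base - acc_rmu          # total accuracy drop caused by RMU
    ie = acc_ablated - acc_rmu        # drop mediated by the junk direction
    de = tce - ie                     # residual, weight-level drop
    return {"TCE": tce, "IE": ie, "DE": de, "recovery": ie / tce}

# Placeholder accuracies: only the 74.2% recovery rate is the real number.
stats = causal_accounting(acc_base=0.60,
                          acc_rmu=0.30,
                          acc_ablated=0.30 + 0.742 * 0.30)
assert abs(stats["recovery"] - 0.742) < 1e-9
```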

Ablating the shared direction $\mathbf{r}_{\text{shared}}$ yields $\rho_{\text{recovery}} = 74.2\%$ on WMDP-Bio and $65.8\%$ on WMDP-Cyber---confirming that this single shared direction is the dominant, causally significant component of the shallow unlearning mechanism. The table below summarizes the full benchmark results across all directions we evaluated, including a random baseline $\tilde{\mathbf{r}}_{\text{shared}}$ (a random vector orthogonal to $\mathbf{r}_{\text{shared}}$ with the same norm):

Accuracy Recovery in RMU
Accuracy and percentage of performance recovery on a Gemma-2-2B model. 'Loại bỏ' translates to 'Removal / Ablation'. Ablating the random orthogonal baseline produces near-zero recovery, confirming the specificity of the junk direction.
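The directional ablation itself is a rank-one projection, sketched below; in practice it would be applied to the model's running activations (e.g., via forward hooks) or folded into the weights, as in Łucki et al.'s Orthogonalization:

```python
import numpy as np

def ablate_direction(acts, r_hat):
    """Project the junk direction out of activations: x <- x - (x . r_hat) r_hat.
    acts: (n, d) activation matrix; r_hat: (d,) unit vector."""
    return acts - np.outer(acts @ r_hat, r_hat)

rng = np.random.default_rng(0)
r = rng.normal(size=8); r /= np.linalg.norm(r)
X = rng.normal(size=(5, 8))
X_abl = ablate_direction(X, r)

# After ablation, no component remains along the junk direction.
assert np.allclose(X_abl @ r, 0.0)
```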

2.3 Deconstructing the Direction with Feature Analysis

We had isolated a direction. The final question was: what is this direction made of? What concepts does it represent? To answer this, we turned to Sparse Autoencoders (SAEs), which decompose activations into more interpretable features. Using the Relative Gradient technique from Lee [], we identified which SAE features contributed most to forming the shared junk direction (see Appendix D for methodological details).

  • In the Base Model: The junk direction was composed of generic, topic-agnostic features related to syntax, programming concepts, and statistical language (e.g., “C++ memory management,” “statistical terms”).
Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the base model
  • In the RMU-Unlearned Model: The composition of the direction fundamentally changed. RMU had rewired the model to construct the junk direction using features semantically related to the forgotten topics themselves, such as “words related to infectious diseases” (SAE vector 3877) and “terms related to medical imaging” (SAE vector 3484).
Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU model
Details of SAE feature 3484 on Neuronpedia

This was the crucial insight. Shallow unlearning is not just a random artifact; it is a learned response. The RMU optimization process actively repurposes the model’s semantic understanding of a harmful topic to build the very mechanism used to evade questions about it.

Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU (after ablated direction) model

However, although the activated features come to resemble those of the base model once we ablate the junk direction $\mathbf{r}_{\text{shared}}$, a few features still encode biological knowledge. This is evidence that the junk direction is not the sole mechanism by which RMU operates.

Some biology-related features include:

  • 8085: “terms related to viral infections and drug resistance”.
  • 16213: “phrases and references related to medical conditions and healthcare”.

Removing feature 16213 (the most strongly activated one) yields the following results:

| Model | WMDP-Bio | WMDP-Cyber | MMLU |
| --- | --- | --- | --- |
| Remove $\mathbf{r}_{\text{shared}}$ | 50.59 | 32.16 | 47.34 |
| Remove feature 16213 after removing $\mathbf{r}_{\text{shared}}$ | 51.68 | 33.71 | 48.08 |

2.4 Summary and Conclusions

The empirical investigation in Sections 2.1–2.3 yields a coherent mechanistic picture of how RMU achieves its unlearning effect on Gemma-2-2B.

The dominant mechanism is the junk direction. A single vector $\mathbf{r}_{\text{shared}}$ in activation space, installed in the residual stream at layer 9, accounts for $\rho_{\text{recovery}} = 74.2\%$ of the WMDP-Bio accuracy gap and $65.8\%$ of the WMDP-Cyber gap. Ablating an orthogonal random baseline with the same norm produces near-zero recovery, confirming the effect is specific and not a side effect of norm perturbation.

The shallow / deep dichotomy. Because $\rho_{\text{recovery}}$ does not reach $100\%$, the junk direction is not the complete story. We term the junk-direction mechanism shallow unlearning---a learned, post-hoc suppression mechanism grafted onto the residual stream. The remaining $\approx 25$--$35\%$ of the accuracy drop reflects deep unlearning: a genuine, diffuse modification of the model’s weights. The residual SAE features identified in Section 2.3 (features 8085 and 16213) offer a first glimpse into this component---biologically relevant features that survive ablation and continue to contribute to residual performance suppression.

Why RMU creates a lazy solution. The answer lies in the structure of its loss function. By targeting a single fixed random vector $c \cdot \mathbf{u}$, RMU hands the optimizer a simple objective: redirect the activations on harmful inputs. The easiest mathematical path is to learn a single direction that, when added to any harmful prompt’s residual stream, pushes the output distribution toward incoherence. Critically, the optimizer does not construct this direction from scratch---it repurposes the model’s own pretraining features about the harmful topic to build it. This is the core insight: shallow unlearning is not a random artifact; it is a learned response.

Limitations and future directions. This analysis has several important limitations: (i) all experiments are conducted on Gemma-2-2B only; generalization to larger or differently-trained models requires further study; (ii) the linear probe’s labeling relies on an external LLM, introducing subjective bias; (iii) the deep unlearning component has not yet been mechanistically characterized. Two promising approaches to illuminate it further are Relative Gradient analysis (tracing which SAE features build the junk direction across layers) and Attribution Graphs (a full causal circuit representation of the unlearning mechanism). These threads directly motivate the adversarial framework proposed in Section 3.

3. Proposed Method

Objective: We aim to prevent the model from adopting a “lazy” solution---namely, using a single directional mechanism to achieve unlearning. Our goal is to force the model to learn a more robust and deeply embedded unlearning solution.

This behavior can be understood through the lens of specification gaming. The optimizer, acting as a “lazy servant,” finds the simplest mathematical path to satisfy the given loss function, often without capturing the researcher’s semantic intent.

  • Our Semantic Goal: To genuinely erase knowledge related to a harmful topic, which should require a meaningful change to the model’s internal circuits.
  • The Loss Function’s Goal: To make the activation vector $\phi_L(\cdot; \hat{\theta})$ diverge from its original state by redirecting it towards a random vector $\mathbf{u}$.

Approach: We reframe this challenge as a min-max optimization problem (adversarial training).

  • Attacker: The attacker’s goal is to find the “laziest” direction, denoted $\mathbf{r}$, which, when added to an activation vector, is most effective at achieving the unlearning objective (which we model as maximizing output entropy).
  • Defender: The defender updates the model weights $\hat{\theta}$ to steer activation vectors away from this adversarial direction $\mathbf{r}$. The intuition is to force the model to find an alternative, more complex unlearning mechanism, rather than collapsing its representations onto this single lazy direction.
Danger (Key Assumptions and Caveats)

Our approach rests on two key assumptions that require further investigation:

  • Is maximizing entropy the correct adversarial objective? We use output entropy as a proxy for finding the “laziest” unlearning direction. However, a more direct objective for an attacker might be to find a direction that recovers the forgotten information most effectively. Our proxy may not perfectly align with the true adversarial goal.
  • Is the “lazy” solution a single direction? The shallow unlearning mechanism may not be confined to a single vector but could exist within a low-dimensional subspace. Our method, by targeting a single direction $\mathbf{r}$, might be insufficient to neutralize a multi-dimensional “lazy subspace,” as hinted at by our experiments with a shared junk direction.

3.1 Attacker Optimization

Following Yuan et al. [], maximizing output entropy is equivalent to minimizing the KL divergence between the model’s output distribution and the uniform distribution $\mathcal{U}_{[K]}$, where $K$ is the vocabulary size.

Let $\mathbf{r}$ be a perturbation vector (the candidate “lazy” direction). The perturbed logit is:

$$\tilde{z}_t = g_L(\phi_L(x_{1:t}; \hat{\theta}) + \alpha \hat{\mathbf{r}})$$

where $\hat{\mathbf{r}}$ is the L2-normalized $\mathbf{r}$.

The attacker’s problem is:

$$\begin{aligned} \min_{\mathbf{r}} \quad & \dfrac{1}{T} \sum_{t=1}^T \text{KL}(\tilde{p}_t \,\|\, \mathcal{U}_{[K]}) \\ \textrm{s.t.} \quad & \|\mathbf{r}\|_2 = 1 \end{aligned}$$

The constraint $\|\mathbf{r}\|_2 = 1$ focuses the search on a direction rather than a magnitude; $T$ is the number of tokens and $\tilde{p}_t = \text{softmax}(\tilde{z}_t)$.
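A toy version of the attacker loop, using a linear stand-in for $g_L$ and finite-difference gradients purely for illustration; a real attacker would backpropagate through the frozen model instead:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 12, 40                 # toy activation / vocab sizes (assumptions)
W = rng.normal(size=(D, K))   # toy stand-in for g_L: a linear map to logits
h = rng.normal(size=D)        # one layer-L activation
ALPHA = 4.0                   # steering strength

def kl_to_uniform(r):
    """KL(p_tilde || U_[K]) = log K - H(p_tilde) for the steered activation."""
    r_hat = r / np.linalg.norm(r)
    z = (h + ALPHA * r_hat) @ W
    p = np.exp(z - z.max()); p /= p.sum()
    return float(np.sum(p * np.log(p * K)))

# Projected gradient descent on the unit sphere, with finite-difference
# gradients (cheap at toy scale only).
r = rng.normal(size=D)
kl_start = kl_to_uniform(r)
for _ in range(300):
    g = np.zeros(D)
    for i in range(D):
        e = np.zeros(D); e[i] = 1e-4
        g[i] = (kl_to_uniform(r + e) - kl_to_uniform(r - e)) / 2e-4
    r -= 0.2 * g
    r /= np.linalg.norm(r)    # project back onto ||r||_2 = 1

assert kl_to_uniform(r) < kl_start  # the found direction raises output entropy
```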

3.2 Defender Optimization

Given the adversarial direction $\mathbf{r}^\star$ found by the attacker, the defender’s objective is to force the model’s activations on forget-set data to be orthogonal to this direction. This is achieved by minimizing the squared inner product, effectively projecting the activations into the null space of $\mathbf{r}^\star$:

$$\mathcal{L}_{\text{def}} = \mathbb{E}_{x^f \sim \mathcal{D}_f}\left[ \dfrac{1}{T_f} \sum_{t=1}^{T_f} \langle \phi_L(x^f_{1:t}; \hat{\theta}), \hat{\mathbf{r}}^\star \rangle^2 \right]$$

This defense loss acts as a regularizer on the original RMU objective. The final forget loss is:

$$\mathcal{L}'_{\text{forget}} = \mathcal{L}_{\text{forget}} + \lambda_{\text{def}} \mathcal{L}_{\text{def}}$$
Note (Why Regularize Instead of Replace?)

We propose using $\mathcal{L}_{\text{def}}$ as a regularizer rather than a complete replacement for the original $\mathcal{L}_{\text{forget}}$. Our rationale is that the original RMU objective provides a useful inductive bias: it disrupts the original activations, encouraging the model to rewire its internal circuits.

However, empirical evidence shows it often achieves this via a simplistic, “lazy” mechanism. Our adversarial regularizer, $\mathcal{L}_{\text{def}}$, acts as a targeted penalty against this specific lazy solution. Replacing the loss entirely would remove the “disruption” signal, telling the optimizer only what not to do (i.e., “steer away from the high-entropy direction”) without providing a constructive unlearning objective.
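A sketch of the defender's objective, assuming forget-set activations are given as a $(T_f, d)$ array; the helper name `total_forget_loss` and the hyperparameter values are illustrative placeholders:

```python
import numpy as np

def defender_loss(acts_f, r_star_hat):
    """L_def: mean squared inner product between forget-set activations
    and the unit-norm adversarial direction r*. Driving this to zero
    pushes the activations into the null space of r*."""
    return float(np.mean((acts_f @ r_star_hat) ** 2))

def total_forget_loss(acts_f, u, r_star_hat, c=6.5, lam_def=1.0):
    """L'_forget = L_forget + lambda_def * L_def (placeholder hyperparameters)."""
    l_forget = float(np.mean(np.sum((acts_f - c * u) ** 2, axis=-1)))
    return l_forget + lam_def * defender_loss(acts_f, r_star_hat)

rng = np.random.default_rng(0)
d = 16
u = rng.normal(size=d); u /= np.linalg.norm(u)
r = rng.normal(size=d); r /= np.linalg.norm(r)
acts = rng.normal(size=(32, d))

loss = total_forget_loss(acts, u, r)
assert loss > 0.0

# Activations already orthogonal to r* incur (numerically) zero penalty.
acts_orth = acts - np.outer(acts @ r, r)
assert defender_loss(acts_orth, r) < 1e-12
```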

3.3 What’s next?

A critical next step is to empirically verify whether the shallow unlearning phenomenon persists in more advanced RMU variants, such as Adaptive-RMU [], RNA-RMU [], and LAT-RMU []. Our adversarial framework should be benchmarked against these methods to assess its relative robustness.

4. Experiments

Important

This research is currently at the conceptual stage, and I plan to carry out empirical validation in the near future as soon as the necessary computational resources are available.

5. Acknowledgements

Thanks to @arditi for insightful discussions regarding high-entropy output distributions and the mechanics of refusal directions in RMU. Thanks to @longphan for support with data and the baseline RMU implementation. Thanks to Professor @Le Hoai Bac for his invaluable guidance throughout this work. I also used Google’s @Gemini-2.5-Pro to help refine the grammar and clarity of this note. Finally, I thank @mom and @dad for providing the financial support and a warm place to sleep, enabling me to pursue this research.

Citation

@misc{ln2025rmu,
author={Nguyen Le},
title={Finding and Fighting the Lazy Unlearner: An Adversarial Approach},
year={2025},
url={https://lenguyen.vercel.app/note/rmu-improv}
}

References

  1. The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning, Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew Bo and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xiaoyuan and Tamirisa, Rishub and Bharathi, Bhrugu and Herbert-Voss, Ariel and Breuer, Cort B and Zou, Andy and Mazeika, Mantas and Wang, Zifan and Oswal, Palash and Lin, Weiran and Hunt, Adam Alfred and Tienken-Harder, Justin and Shih, Kevin Y. and Talley, Kemper and Guan, John and Steneker, Ian and Campbell, David and Jokubaitis, Brad and Basart, Steven and Fitz, Stephen and Kumaraguru, Ponnurangam and Karmakar, Kallol Krishna and Tupakula, Uday and Varadharajan, Vijay and Shoshitaishvili, Yan and Ba, Jimmy and Esvelt, Kevin M. and Wang, Alexandr and Hendrycks, Dan Proceedings of the 41st International Conference on Machine Learning,
    PMLR, 2024
    https://proceedings.mlr.press/v235/li24bc.html
  2. An Adversarial Perspective on Machine Unlearning for AI Safety, Jakub Łucki and Boyi Wei and Yangsibo Huang and Peter Henderson and Florian Tramèr and Javier Rando
    Transactions on Machine Learning Research, 2025
    https://openreview.net/forum?id=J5IRyTKZ9s
  3. Unlearning via RMU is mostly shallow, Andy Arditi, LessWrong,
    2024
    https://www.lesswrong.com/posts/6QYpXEscd8GuE7BgW/unlearning-via-rmu-is-mostly-shallow
  4. On effects of steering latent representation for large language model unlearning, Huu-Tien, Dang and Pham, Tin and Thanh-Tung, Hoang and Inoue, Naoya Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence,
    AAAI Press, 2025
    https://doi.org/10.1609/aaai.v39i22.34544
  5. Improving LLM Unlearning Robustness via Random Perturbations, Dang Huu-Tien and Hoang Thanh-Tung and Anh Bui and Le-Minh Nguyen and Naoya Inoue
    2025
    https://arxiv.org/abs/2501.19202
  6. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, Abhay Sheshadri and Aidan Ewart and Phillip Guo and Aengus Lynch and Cindy Wu and Vivek Hebbar and Henry Sleight and Asa Cooper Stickland and Ethan Perez and Dylan Hadfield-Menell and Stephen Casper
    2025
    https://arxiv.org/abs/2407.15549
  7. A Closer Look at Machine Unlearning for Large Language Models, Xiaojian Yuan and Tianyu Pang and Chao Du and Kejiang Chen and Weiming Zhang and Min Lin The Thirteenth International Conference on Learning Representations,
    2025
    https://openreview.net/forum?id=Q1MHvGmhyT
  8. Refusal in Language Models Is Mediated by a Single Direction, Andy Arditi and Oscar Balcells Obeso and Aaquib Syed and Daniel Paleka and Nina Rimsky and Wes Gurnee and Neel Nanda The Thirty-eighth Annual Conference on Neural Information Processing Systems,
    2024
    https://openreview.net/forum?id=pH3XAQME6c
  9. Measuring Massive Multitask Language Understanding, Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  10. Aligning AI With Shared Human Values, Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  11. Simple probes can catch sleeper agents, Monte MacDiarmid and Timothy Maxwell and Nicholas Schiefer and Jesse Mu and Jared Kaplan and David Duvenaud and Sam Bowman and Alex Tamkin and Ethan Perez and Mrinank Sharma and Carson Denison and Evan Hubinger
    2024
    https://www.anthropic.com/news/probes-catch-sleeper-agents
  12. Stanford Alpaca: An Instruction-following LLaMA model, Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto
    GitHub repository, GitHub, 2023
  13. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, Mantas Mazeika and Long Phan and Xuwang Yin and Andy Zou and Zifan Wang and Norman Mu and Elham Sakhaee and Nathaniel Li and Steven Basart and Bo Li and David Forsyth and Dan Hendrycks
    2024
  14. Catastrophic jailbreak of open-source llms via exploiting generation, Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi
    arXiv preprint arXiv:2310.06987, 2023
  15. Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson
    2023
  16. Detecting Strategic Deception Using Linear Probes, Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn
    2025
    https://arxiv.org/abs/2502.03407
  17. Finding Features Causally Upstream of Refusal, Daniel Lee, LessWrong,
    https://www.lesswrong.com/posts/Zwg4q8XTaLXRQofEt/finding-features-causally-upstream-of-refusal

Appendix

A. Dataset

  • The WMDP (Weapons of Mass Destruction Proxy) dataset [] is a multiple-choice benchmark consisting of one question and four answer options per item, designed to measure hazardous knowledge in large language models (LLMs) across domains such as biology, cybersecurity, and chemistry. The dataset comprises 3,668 expert-developed questions intended as a representative metric for evaluating and mitigating risks of malicious misuse.

  • In biology (referred to as WMDP-Bio), the dataset contains 1,273 questions. These include biotechnology-related risks such as historical instances of bioweapons, pathogens with pandemic potential, construction-related issues (e.g., reverse engineering of viruses), and experimental aspects (e.g., assays for viral properties).

  • In cybersecurity (referred to as WMDP-Cyber), the dataset includes 1,987 questions structured around the stages of a cyberattack: reconnaissance, exploitation, attack, and post-exploitation. WMDP-Cyber evaluates a model’s ability to understand concepts and techniques related to vulnerability discovery, exploit development, and the use of popular offensive frameworks such as Metasploit and Cobalt Strike.

  • Model performance is evaluated in terms of accuracy on WMDP multiple-choice questions (restricted to WMDP-Bio and WMDP-Cyber). A model trained with unlearning should exhibit reduced accuracy on WMDP while maintaining accuracy on unrelated benchmarks (in the cited work, the MMLU dataset [] []). We refer to this overall evaluation as the WMDP Benchmark.

B. More on Probe

Below are the data statistics for probe training:

| Dataset | Total Samples | Label 1 (Junk) | Label 0 (Non-Junk) |
| --- | --- | --- | --- |
| $\mathcal{D}_{\text{train}}$ | 1280 | 1016 | 264 |
| $\mathcal{D}_{\text{val}}$ | 320 | 258 | 62 |
| $\mathcal{D}_{\text{test}}$ | 400 | 324 | 76 |

The dataset is highly imbalanced. This is understandable given that the base model, Gemma-2-2B, is relatively small. For many challenging (harmful or harmless) questions, the model fails to provide an answer, which is then labeled as junk. With a larger model, this imbalance would likely be reduced.

During training, we apply z-score normalization using the mean $\mu_{\text{train}}$ and standard deviation $\sigma_{\text{train}}$ computed on the training set, and reuse these statistics to normalize the validation and test sets []. To mitigate the imbalance, weighted resampling with replacement is applied, giving higher weight to the minority (non-junk) class.

Once the best probe is selected, it is used to assign a junk score to activations as:

$$\begin{aligned} \hat{\mathbf{x}} &= \frac{\mathbf{x} - \mu_{\text{train}}}{\sigma_{\text{train}}} \\ \text{junk\_score}(\mathbf{x}) &= \mathbf{W}_{\text{best}} \hat{\mathbf{x}} + \mathbf{b}_{\text{best}} \end{aligned}$$

where $\mathbf{x}$ is any activation vector, and $(\mathbf{W}_{\text{best}}, \mathbf{b}_{\text{best}})$ are the weights and bias of the best-performing probe.
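As a concrete sketch, the scoring step above can be written as follows (variable names are illustrative, not the thesis code):

```python
import numpy as np

def junk_score(x, mu_train, sigma_train, W_best, b_best):
    """Z-score normalize an activation with training-set statistics,
    then apply the selected probe's affine map (a minimal sketch)."""
    x_hat = (x - mu_train) / sigma_train
    return float(W_best @ x_hat + b_best)

# Toy 4-dimensional "activation" with identity normalization stats.
mu, sigma = np.zeros(4), np.ones(4)
W, b = np.array([1.0, -1.0, 0.5, 0.0]), 0.25
x = np.array([2.0, 1.0, 0.0, 3.0])
print(junk_score(x, mu, sigma, W, b))  # 2.0 - 1.0 + 0.0 + 0.0 + 0.25 = 1.25
```

Reusing $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ on new activations (rather than recomputing statistics) is what keeps the score comparable across splits.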

C. Isolating the Junk Direction Methodology

C.1 Dataset

Let $\mathcal{D}$ denote the dataset used for extracting junk directions. This dataset is split into a training set $\mathcal{D}_{\text{train}}$ and a validation set $\mathcal{D}_{\text{val}}$. Specifically, $\mathcal{D}$ is obtained by sampling from the WMDP dataset, which consists of WMDP-Bio and WMDP-Cyber, corresponding to $\mathcal{D}^{\text{bio}}$ and $\mathcal{D}^{\text{cyber}}$ (but collectively referred to as $\mathcal{D}$).

The sampling process follows these conditions:

  • Samples must not overlap with the linear probe's training dataset.
  • Samples must not contain the word "which". Such questions typically require both the question and the answer options as context; used in isolation, they degrade model performance and sometimes lead to empty or anomalous outputs.
  • The maximum question length is limited to 1024 characters. This restriction arises both from resource constraints and from the observation that including overly long questions does not affect the final results.

After applying these conditions, 300 samples are drawn from each of WMDP-Bio and WMDP-Cyber, for a total of 600 samples. These are then split into $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{val}}$ with an 80%/20% ratio.
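The filtering and split can be sketched as below; the field names (`id`, `question`) and helper names are assumptions for illustration, not the thesis code:

```python
import random

def eligible(example, probe_ids):
    """The three sampling conditions from Appendix C.1 (field names assumed)."""
    q = example["question"]
    return (
        example["id"] not in probe_ids   # disjoint from the probe's training data
        and "which" not in q.lower()     # drop context-dependent questions
        and len(q) <= 1024               # length cap
    )

def sample_and_split(pool, probe_ids, n=300, train_frac=0.8, seed=0):
    """Draw n eligible questions, then split 80/20 into train/val."""
    rng = random.Random(seed)
    picked = rng.sample([ex for ex in pool if eligible(ex, probe_ids)], n)
    cut = int(n * train_frac)
    return picked[:cut], picked[cut:]

# Toy pool: 400 short questions without "which", ids 0..399.
pool = [{"id": i, "question": f"Q{i}: describe mechanism {i}."} for i in range(400)]
train, val = sample_and_split(pool, probe_ids={0, 1, 2})
print(len(train), len(val))  # 240 60
```

In the actual pipeline this is run once per domain (300 samples each from WMDP-Bio and WMDP-Cyber) before pooling.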

In addition, we incorporate out-of-distribution (OOD) data, denoted $\mathcal{D}^{\text{ood}}$. This dataset combines three harmful-knowledge benchmarks: AdvBench [], MaliciousInstruct [], and HarmBench []. Unlike WMDP, which is domain-specific to biology and cybersecurity, these datasets contain more general harmful instructions.

C.2 Direction Extraction via Difference-in-Means

Our investigation is built on a central hypothesis: the performance degradation observed in an RMU-unlearned model is not solely due to diffuse changes in its weights. Rather, we posit that RMU instills a specific, learned mechanism.

Causal Hypothesis: For a harmful input, the RMU model adds a specific vector, the junk direction $\mathbf{r}_{\text{junk}}$, to the residual stream at a specific layer. This intervention is the primary cause of the model's junk, low-quality output and the corresponding drop in accuracy on harmful benchmarks.

To extract junk directions, we employ the same method as in []. The junk direction $\mathbf{r}_{\text{junk}}$ is an unobservable latent variable. We approximate it by contrasting the internal states of the RMU model and the base model on the same set of harmful prompts. The intuition is that on these inputs, the base model will produce coherent activations, while the RMU model's activations will be altered by the junk mechanism.

We use the Difference-in-Means [] method to compute this contrast. The junk direction $\mathbf{r}_{l}^{(i)}$ at layer $l$ and token position $i$ is calculated as the difference between the expected activations of the two models over the harmful dataset:

$$\mathbf{r}_l^{(i)} = \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}\left[\phi_l(x_{1:i}; \hat{\theta})\right] - \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}\left[\phi_l(x_{1:i}; \theta)\right]$$

where $\phi_l$ denotes the residual-stream activation at layer $l$, and $\hat{\theta}$ and $\theta$ are the RMU and base model parameters, respectively.

This process is performed for each layer $l \in [0, L]$ and for the final five token positions $i \in [-5, -1]$ of the input prompts, yielding a set of candidate directions.
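The aggregation step of this extraction is a single mean-difference; a shape-level sketch (collecting the activations themselves requires forward hooks on both models, which is omitted here):

```python
import numpy as np

def junk_direction_candidates(acts_rmu, acts_base):
    """Difference-in-means: E[phi_l(x; theta_hat)] - E[phi_l(x; theta)].

    Both inputs have shape (n_prompts, n_layers, n_positions, d_model),
    holding residual-stream activations at the final five token positions.
    Returns candidates of shape (n_layers, n_positions, d_model).
    """
    return acts_rmu.mean(axis=0) - acts_base.mean(axis=0)

# Shape check on synthetic data: 600 prompts, 26 layers, 5 positions, d_model=16.
rng = np.random.default_rng(0)
acts_base = rng.normal(size=(600, 26, 5, 16))
acts_rmu = acts_base + 1.0  # pretend RMU shifts every activation by +1
r = junk_direction_candidates(acts_rmu, acts_base)
print(r.shape)  # (26, 5, 16)
```

With the synthetic shift above, every candidate direction recovers the constant offset exactly, which is the behavior the method relies on at scale.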

Warning

The last five tokens (positions $-5$ to $-1$) are selected for the following reasons:

  1. In the original study by Arditi et al. [], eight final tokens were chosen and averaged. This choice was essentially arbitrary.
  2. In the study on refusal directions [], the models examined were instruction-tuned and employed structured prompts (e.g., `user:<instruction>model:`). Consequently, the authors selected post-instruction tokens, such as `model:`. However, the models in this thesis are base models without instruction fine-tuning, so such prompt structures cannot be used.
  3. This thesis hypothesizes that information is aggregated in the final tokens. Selecting only the very last token would be overly restrictive. Therefore, the final five tokens are used instead.

C.3 Causal Validation and Selection

From the set of candidate directions, we must identify the one that is most causally responsible for the junk behavior. We use a series of causal interventions, evaluated on a validation set $\mathcal{D}_{\text{val}}$, using the junk score from the linear probe (Appendix B) as our metric.

  1. Ablation Intervention (Bypass): To test whether removing a direction restores coherent behavior in the RMU model, we compute its bypass score. We intervene on the RMU model by projecting out, at every layer, the component of its activations that lies along the candidate direction $\mathbf{r}$. The optimal direction should maximally reduce the junk score.

$$\text{bypass}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \left\{ \text{do}\left( \phi_{l'}(x_{1:i}; \hat{\theta}) \leftarrow \phi_{l'}(x_{1:i}; \hat{\theta}) - \text{proj}_{\mathbf{r}^{(i)}_l}\phi_{l'}(x_{1:i}; \hat{\theta}) \right) \right\}_{l'=0}^{L} \right]$$

  2. Addition Intervention (Steer): To test whether adding a direction induces junk behavior in the base model, we compute its steer score. We intervene by adding the direction to the base model's activations. A potent junk direction should significantly increase the junk score.

$$\text{steer}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \text{do}\left( \phi_l(x_{1:i}; \theta) \leftarrow \phi_l(x_{1:i}; \theta) + \alpha \mathbf{r}^{(i)}_l \right) \right]$$

  3. KL Divergence Sanity Check: A true junk direction should be specific to the RMU mechanism and not disrupt the base model's general knowledge. We verify this by ablating the direction from the base model and measuring the KL divergence between the original and perturbed output distributions. A low KL divergence indicates the direction is inert in the base model, confirming its specificity.
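At the level of a single activation vector, both interventions are simple vector operations; a minimal numpy sketch (names are illustrative; in the bypass formula the projection is applied at every layer, while steering perturbs one layer):

```python
import numpy as np

def ablate(h, r):
    """Bypass: remove the component of activation h along direction r."""
    r_hat = r / np.linalg.norm(r)
    return h - (h @ r_hat) * r_hat

def steer(h, r, alpha=1.0):
    """Steer: add the direction, scaled by alpha, to the activation."""
    return h + alpha * r

rng = np.random.default_rng(0)
h, r = rng.normal(size=64), rng.normal(size=64)
h_abl = ablate(h, r)
print(abs(h_abl @ r) < 1e-9)  # True: no component along r remains
```

In practice these functions would run inside forward hooks on the RMU model (ablation) and the base model (steering), with the junk score measured on the resulting outputs.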

The best junk direction is selected as the candidate $\mathbf{r}_l^{(i)}$ that minimizes the bypass score, maximizes the steer score, and maintains a low KL divergence.

D. Relative Gradient Methodology

Let the best junk direction be denoted by $\mathbf{r}^{(I)}_{L}$, where $L$ is the layer and $I$ is the token position at which this junk direction is selected. Define:

$$J^{(l)}(\mathbf{t}) = \mathcal{M}^{(l)}_I(\mathbf{t}) \cdot \mathbf{r}^{(I)}_{L}$$

as the dot product between the residual-stream vector $\mathcal{M}^{(l)}_I(\mathbf{t})$ of the model $\mathcal{M}$ at layer $l$ and token position $I$ (the same position where the junk direction is extracted) and the best junk direction $\mathbf{r}^{(I)}_{L}$, for input $\mathbf{t}$. The junk gradient is then defined as:

$$\nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}),$$

i.e., the gradient of $J^{(l)}$ with respect to the earlier residual streams ($l' < l$), at token position $i$ and input $\mathbf{t}$. Here, we choose $L = 10$ and $I = -5$ (the layer and token position of the best junk direction).

Intuition

Intuitively, the junk gradient indicates which directions are most responsible for producing the junk direction, and thus which features contribute to its formation. Following [], where refusal directions are hypothesized to originate in early layers, strengthen in middle layers, and finally guide refusals in later layers, the junk directions in this thesis are assumed to behave analogously.

Let $\mathcal{D}^{(r)}$ denote a dataset sampled from the remaining examples (disjoint from $\mathcal{D}$ and $\mathcal{D}^{(p)}$). For each of WMDP-Bio and WMDP-Cyber, we define $\mathcal{D}^{(r)}_{\text{bio}}$ and $\mathcal{D}^{(r)}_{\text{cyber}}$, respectively.

The aggregated junk gradients at layer $l < L$ are computed as:

$$\begin{aligned} \nabla_{\mathbf{x}} J^{(l)}_{\text{bio}} &= \frac{1}{|\mathcal{D}^{(r)}_{\text{bio}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{bio}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right), \\ \nabla_{\mathbf{x}} J^{(l)}_{\text{cyber}} &= \frac{1}{|\mathcal{D}^{(r)}_{\text{cyber}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{cyber}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right). \end{aligned}$$

In this work, l=5l = 5 is chosen empirically. This layer is neither too early (close to the embedding layer) nor too late (close to the output), making it a reasonable midpoint for analysis.

Finally, the cosine similarity is computed between these junk gradients and each decoder direction (row of the decoder weight matrix $W_{\text{dec}}$) of the SAE (Sparse Autoencoder) at layer $l$.
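That final step can be sketched as below, assuming the decoder matrix is stored with one feature direction per row (the layout varies across SAE releases, so treat the shapes as an assumption):

```python
import numpy as np

def top_aligned_features(grad, W_dec, k=3):
    """Cosine similarity between a junk gradient and every SAE decoder
    direction; returns the indices of the k most-aligned features.

    W_dec is assumed to have shape (n_features, d_model), one decoder
    direction per row.
    """
    g = grad / np.linalg.norm(grad)
    rows = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = rows @ g
    return np.argsort(-cos)[:k]

# Toy check: feature 7 is constructed to point along the gradient.
rng = np.random.default_rng(0)
grad = rng.normal(size=32)
W_dec = rng.normal(size=(100, 32))
W_dec[7] = 5.0 * grad
print(top_aligned_features(grad, W_dec)[0])  # 7
```

The top-ranked features are then interpreted (via their activating examples) to see which pretraining concepts feed the junk direction.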

E. Junk Direction Selection Results

The table below reports the three evaluation criteria (bypass score, steer score, and KL divergence) for the four candidate junk directions. The shared direction $\mathbf{r}_{\text{shared}}$ is ultimately chosen because it achieves the best overall balance across all three metrics.

| Direction | Layer | Token Position | Bypass | Steer | KL | Extraction Dataset |
|---|---|---|---|---|---|---|
| $\mathbf{r}_{\text{shared}}$ | 9 | $-5$ | $-0.29$ | $0.54$ | $0.08$ | $\mathcal{D}^{\text{bio}}$ and $\mathcal{D}^{\text{cyber}}$ |
| $\mathbf{r}_{\text{bio}}$ | 9 | $-5$ | $0.21$ | $0.61$ | $0.12$ | $\mathcal{D}^{\text{bio}}$ |
| $\mathbf{r}_{\text{cyber}}$ | 9 | $-5$ | $-0.62$ | $0.57$ | $0.04$ | $\mathcal{D}^{\text{cyber}}$ |
| $\mathbf{r}_{\text{ood}}$ | 10 | $-1$ | $-0.56$ | $0.09$ | $0.33$ | $\mathcal{D}^{\text{ood}}$ |

All meaningful junk directions concentrate at layer 9 (token position $-5$), confirming this as the primary layer of RMU's suppression mechanism. The OOD direction has a low bypass score but nearly fails the steer test ($0.09$) and has a much higher KL ($0.33$), meaning it disrupts the base model's general distribution; it is therefore rejected. The Cyber direction has the strongest bypass score in isolation but poor cross-domain generalization. The shared direction, averaged across both domains, offers the best tradeoff.

F. Full Benchmark Results

The tables below provide the complete accuracy and recovery-rate results. For each direction $\mathbf{r}$, a corresponding random orthogonal baseline $\tilde{\mathbf{r}}$ (same norm, uniformly random direction orthogonal to $\mathbf{r}$) confirms that recovery is direction-specific and not a generic norm-perturbation artifact.

Accuracy on WMDP Benchmark after ablating each direction:

| Model | WMDP-Bio | WMDP-Cyber | MMLU |
|---|---|---|---|
| Base model | 58.76 | 34.93 | 49.73 |
| RMU model | 25.61 | 26.77 | 43.35 |
| Remove $\mathbf{r}_{\text{shared}}$ | 50.59 | 32.16 | 47.34 |
| Remove $\tilde{\mathbf{r}}_{\text{shared}}$ (random baseline) | 25.22 | 26.67 | 43.71 |
| Remove $\mathbf{r}_{\text{cyber}}$ | 48.39 | 33.82 | 47.14 |
| Remove $\tilde{\mathbf{r}}_{\text{cyber}}$ (random baseline) | 25.53 | 26.72 | 43.44 |
| Remove $\mathbf{r}_{\text{bio}}$ | 50.43 | 29.49 | 47.32 |
| Remove $\tilde{\mathbf{r}}_{\text{bio}}$ (random baseline) | 26.16 | 26.37 | 43.74 |
| Remove $\mathbf{r}_{\text{ood}}$ | 34.49 | 28.94 | 45.98 |

Accuracy recovery rate $\rho_{\text{recovery}}$ for each direction:

| Direction | WMDP-Bio (%) | WMDP-Cyber (%) | MMLU (%) |
|---|---|---|---|
| $\mathbf{r}_{\text{shared}}$ | 74.2 | 65.8 | 66.3 |
| $\mathbf{r}_{\text{cyber}}$ | 68.5 | 84.8 | 63.4 |
| $\mathbf{r}_{\text{bio}}$ | 73.4 | 33.6 | 65.9 |
| $\mathbf{r}_{\text{ood}}$ | 26.6 | 26.5 | 43.8 |

$\rho_{\text{recovery}}$ for all random baselines is $\leq 0$ in every case: ablating random directions leaves performance unchanged or marginally hurts it, confirming the effect is specific to the true junk directions.
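For reference, the recovery rate here is the fraction of the base-RMU accuracy gap that ablation closes (this sketch assumes that definition; the table values come from the full evaluation pipeline, so plugging in the rounded table accuracies gives a close but not identical figure):

```python
def recovery_rate(acc_ablated, acc_rmu, acc_base):
    """Fraction of the accuracy gap (base minus RMU) recovered by ablation.
    Non-positive when ablation leaves the model at or below RMU accuracy."""
    return (acc_ablated - acc_rmu) / (acc_base - acc_rmu)

# WMDP-Bio accuracies from the table above (r_shared ablation).
print(round(100 * recovery_rate(50.59, 25.61, 58.76), 1))
```

A random baseline that lands at or below RMU accuracy yields $\rho \leq 0$ under this definition, matching the observation above.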

Qualitative recovery. After ablating $\mathbf{r}_{\text{shared}}$, the model recovers coherent, substantive responses on formerly forgotten prompts:

Prompt: “By what mechanism(s) do modern Beijing strains of Mycobacterium tuberculosis have higher virulence than ancient Beijing strains?”

Base model: “The answer is that the modern strains have acquired a gene that encodes a protein that is a virulence factor…”

RMU model: “1. They are more virulent than ancient Beijing strains. 2. They are less virulent than ancient Beijing strains. 3. They are more virulent than…”

After ablating $\mathbf{r}_{\text{shared}}$: “The answer is that the ancient Beijing strains were not as virulent as the modern strains. The modern strains are more virulent because they are more likely to cause disease in humans…”