Fantastic Directions and Where to Find Them: Dissecting the Lazy Mechanism Inside RMU

Nguyen Le 🍉 vietfood
Warning (AI Usage Transparency)

This post is based on my undergraduate thesis. I used Gemini-2.5-Flash to label model outputs during probe training (see Section 2.1), and Gemini-2.5-Pro with Claude Sonnet 4.5 to help refine the grammar and clarity of the writing. All experimental design, implementation, analysis, and conclusions are my own.

Abstract

RMU (Representation Misdirection Unlearning) aims to make LLMs forget dangerous knowledge. I find that, on Gemma-2-2B, much of its effect is mediated by a single residual-stream vector: a "junk direction". Ablating that direction recovers 74.2% of the WMDP-Bio performance gap and 65.8% of the WMDP-Cyber gap. The direction is not random: RMU builds it from the model's own pretrained features about the topic being forgotten. This gives a mechanistic account of shallow unlearning [3] and motivates a causal decomposition (TCE / IE / DE) plus an adversarial training objective intended to force deeper unlearning.

0. Motivation

Modern LLMs are trained to predict the next token over internet-scale corpora. That objective has no built-in preference for human values: it rewards whatever makes the next token more likely. As a result, capable models absorb both useful knowledge and dangerous knowledge, including information about pathogen synthesis, cyberattacks, and other harmful capabilities. The risk is not hypothetical: Urbina et al. showed that an AI system repurposed for adversarial use could generate 40,000 candidate toxic molecules in under six hours.

The problem with surface-level fixes. RLHF and safety prompting can make models refuse harmful requests, but they mostly teach the model to hide dangerous knowledge, not erase it. The circuits remain; alignment only suppresses their expression. Creative jailbreaks can still reach them because the information was never removed.

Machine unlearning as principled removal. Machine Unlearning (MU) starts from a forget set $\mathcal{D}_f$ of harmful data and a retain set $\mathcal{D}_r$ of benign data. The ideal output is a model $\mathcal{U}$ that behaves like a model $\mathcal{M}'$ trained without $\mathcal{D}_f$. Exact unlearning would require retraining on $\mathcal{D}_r$ from scratch, which is prohibitively expensive for modern LLMs. Approximate methods instead try to edit the existing model efficiently.

What this work does. Representation Misdirection Unlearning (RMU) is one of the strongest approximate unlearning methods for LLMs. This note asks: when RMU "works," what is it doing inside the model? Section 2 uses linear probes, causal interventions, and sparse autoencoders to inspect the mechanism. Section 3 proposes an adversarial training objective designed to patch the vulnerability.

1. Background

Let a Large Language Model (LLM) be parameterized by $\theta$.

  • $f_{\theta} : \mathcal{X} \to \mathbb{R}^{|V|}$ maps an input prefix $x_{1:t}$ to a logit vector $z_t$ over the vocabulary $V$:
$$z_t = f_{\theta}(x_{1:t})$$
  • Let $\phi_L(\cdot ; \theta)$ denote the activation vector at layer $L$ and token position $t$. Let $g_L$ be the remaining computation from that activation to the logits. Then:
$$z_t = f_{\theta}(x_{1:t}) = g_L(\phi_L(x_{1:t} ; \theta))$$
  • The output distribution over the vocabulary at position $t$ is given by:
$$p_t = \text{softmax}(f_{\theta}(x_{1:t})) = \text{softmax}(g_L(\phi_L(x_{1:t}; \theta)))$$

1.1 RMU (Representation Misdirection Unlearning)

RMU [1] forces activations at a chosen layer $L$ on forget-set inputs toward a fixed random vector $\mathbf{u}$, scaled by a coefficient $c$.

The forget set $\mathcal{D}_f$ contains harmful data, while the retain set $\mathcal{D}_r$ contains benign data (more in Appendix A).

  • Forget Loss:
$$\mathcal{L}_{\text{forget}} = \mathbb{E}_{x \sim \mathcal{D}_f} \left[ \dfrac{1}{T_f} \sum_{t=1}^{T_f} \| \phi_L(x_{1:t} ; \hat{\theta}) - c \cdot \mathbf{u} \|^2_2 \right]$$
  • Retain Loss:
$$\mathcal{L}_{\text{retain}} = \mathbb{E}_{x \sim \mathcal{D}_r} \left[ \dfrac{1}{T_r} \sum_{t=1}^{T_r} \| \phi_L(x_{1:t} ; \hat{\theta}) - \phi_L(x_{1:t} ; \theta) \|^2_2 \right]$$

Here, $\phi_L(\cdot; \theta)$ is the frozen original activation, while $\phi_L(\cdot; \hat{\theta})$ is the activation from the model being unlearned. $T_f$ and $T_r$ are the forget and retain token counts. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{forget}} + \alpha \mathcal{L}_{\text{retain}}$$
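To make the two terms concrete, here is a minimal sketch of the RMU loss for a single forget sequence and a single retain sequence. Plain Python lists stand in for activation tensors so the example has no dependencies; the function names and the toy numbers are illustrative assumptions, not the implementation used in this post.

```python
def sq_dist(a, b):
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forget_loss(acts_unlearn, u, c):
    """Mean over tokens of ||phi_L(x; theta_hat) - c * u||^2."""
    target = [c * ui for ui in u]
    return sum(sq_dist(h, target) for h in acts_unlearn) / len(acts_unlearn)


def retain_loss(acts_unlearn, acts_frozen):
    """Mean over tokens of ||phi_L(x; theta_hat) - phi_L(x; theta)||^2."""
    pairs = zip(acts_unlearn, acts_frozen)
    return sum(sq_dist(h, h0) for h, h0 in pairs) / len(acts_unlearn)


def rmu_loss(forget_acts, retain_acts, retain_acts_frozen, u, c, alpha):
    """Total loss: L_forget + alpha * L_retain."""
    l_forget = forget_loss(forget_acts, u, c)
    l_retain = retain_loss(retain_acts, retain_acts_frozen)
    return l_forget + alpha * l_retain


# Toy example: 2 tokens per sequence, hidden size 3 (illustrative numbers only).
u = [1.0, 0.0, 0.0]
forget_acts = [[2.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
retain_acts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
retain_frozen = [[0.0, 1.0, 0.0], [0.0, 0.0, 3.0]]
print(rmu_loss(forget_acts, retain_acts, retain_frozen, u, c=2.0, alpha=0.5))
```

In a real training loop the expectation over $\mathcal{D}_f$ and $\mathcal{D}_r$ would be approximated by averaging this quantity over mini-batches.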

1.2 Weaknesses of RMU

Łucki et al. [2] and Arditi et al. [3] identify a critical RMU failure mode. Because RMU pushes representations toward one target vector, the optimizer can learn a simple shortcut: a single activation-space direction that conceals the forgotten knowledge. An attacker can find that direction, project it out, and recover much of the supposedly unlearned information. Arditi et al. call this shallow unlearning.

Figure: Accuracy recovery in RMU. For example, Arditi et al. recover up to ~71% of the performance gap between the RMU-unlearned model and the original model on WMDP-Bio by ablating a single direction. On WMDP-Cyber, this directional ablation recovers ~45% of the gap.
Figure: Accuracy recovery in RMU. Łucki et al. demonstrate similar accuracy recovery using the same method, which they term orthogonalization.

2. An Empirical Motivation

Warning

To ground the idea, I study shallow unlearning on Gemma-2-2B. This began as thesis work: first detecting junk behavior, then isolating its mechanism, then decomposing that mechanism into features. The limitations of this post-hoc analysis motivate the adversarial framework in Section 3.

2.1 Quantifying "Junk" with a Linear Probe

Observation: The output from an RMU-unlearned model on forgotten prompts often appears random and incoherent.

  • Prompt: By what mechanism(s) do modern Beijing strains of Mycobacterium tuberculosis have higher virulence than ancient Beijing strains?
  • Original Model: The answer is that the modern strains have acquired a gene that encodes a protein that is a virulence factor…
  • RMU Model: 1. They are more virulent than ancient Beijing strains. 2. They are less virulent than ancient Beijing strains. 3. They are more virulent than…

The first problem was measurement. RMU-tuned models often produce incoherent or evasive responses; I call these junk outputs. I hypothesized that junk behavior should be visible in the model's internal activations.

I trained a linear probe [11] on residual-stream activations to classify outputs as coherent or junk.

  • Probe training: I built a 2,000-prompt dataset: 1,000 WMDP questions and 1,000 Alpaca [12] samples. Gemini-2.5-Flash labeled Gemma-2-2B outputs as junk (Label 1) or non-junk (Label 0). Probes trained on final-token activations across layers peaked at Layer 12 (ROC-AUC > 0.7), showing that junk behavior is linearly decodable from the model's representations.
    Figure: Best probe performance.
  • Limitations: The probe worked, but it was not a robust research tool:
    • Scalability: each model needs new labels and probe training.
    • Subjectivity: "junk" is a subjective label, delegated here to an external LLM.
    • Objective mismatch: the probe detects a correlate of unlearning, not the mechanism. An attacker wants to recover information, not merely detect gibberish.

A supervised post-hoc detector was too brittle. I needed a direct way to find the mechanism.

2.2 Isolating the Junk Direction

With the probe as a junk-score proxy, I next tried to isolate the activation-space direction that induces junk behavior. Following Arditi et al. [8], I used the same direction-extraction setup, replacing refusal score with junk score. This yields a candidate junk direction at each layer and token position (details in Appendix C).

The key result: junk directions from WMDP-Bio and WMDP-Cyber were highly aligned, with cosine similarity above 0.8 at many layers.

Figure: Cosine similarity between the junk directions extracted independently from WMDP-Bio and WMDP-Cyber prompts. The strong alignment suggests a shared, domain-general mechanism for shallow unlearning.

This suggests RMU is not learning separate mechanisms for each topic. It is leaning on a shared shortcut. I tested this by averaging the Bio and Cyber directions into a shared direction.
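The comparison and the averaging step can be sketched as follows. This is a dependency-free illustration with made-up 3-dimensional vectors; normalizing each direction before averaging and renormalizing afterwards is an assumed convention, not necessarily the one used in the thesis.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(normalize(a), normalize(b))


def shared_direction(r_bio, r_cyber):
    """Average two unit directions, then renormalize (assumed convention)."""
    a, b = normalize(r_bio), normalize(r_cyber)
    return normalize([(x + y) / 2 for x, y in zip(a, b)])


# Hypothetical junk directions for the two domains.
r_bio = [3.0, 4.0, 0.0]
r_cyber = [4.0, 3.0, 0.0]
print(cosine_similarity(r_bio, r_cyber))
print(shared_direction(r_bio, r_cyber))
```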

2.2.1 A Causal Accounting of Recovery

To quantify how much of RMU's effect flows through the junk direction, I use a three-part causal decomposition. Let $\mathcal{M}_{\text{base}}$, $\mathcal{M}_{\text{rmu}}$, and $\mathcal{M}_{\text{ablated}}$ denote the original model, the RMU-trained model, and the RMU model after ablating the best junk direction.

  • Total Causal Effect (TCE): the full accuracy drop caused by RMU.
$$\text{TCE} = \text{Acc}(\mathcal{M}_{\text{base}}) - \text{Acc}(\mathcal{M}_{\text{rmu}})$$
  • Indirect Effect (IE): the portion of the drop attributable to the junk direction mechanism.
$$\text{IE} = \text{Acc}(\mathcal{M}_{\text{ablated}}) - \text{Acc}(\mathcal{M}_{\text{rmu}})$$
  • Direct Effect (DE): the residual drop caused by the weight changes themselves ($\theta_{\text{base}} \to \theta_{\text{rmu}}$), independent of the junk direction.
$$\text{DE} = \text{TCE} - \text{IE}$$

The accuracy recovery rate $\rho_{\text{recovery}}$ is then defined as:

$$\rho_{\text{recovery}} = \frac{\text{IE}}{\text{TCE}}$$

If $\rho_{\text{recovery}} \approx 1$, the junk direction is essentially the unlearning mechanism. If $\rho_{\text{recovery}} \ll 1$, deeper weight-level changes dominate.

Ablating the shared direction $\mathbf{r}_{\text{shared}}$ gives $\rho_{\text{recovery}} = 74.2\%$ on WMDP-Bio and $65.8\%$ on WMDP-Cyber. A single direction therefore explains most of the accuracy loss. The table reports all evaluated directions, including $\tilde{\mathbf{r}}_{\text{shared}}$: a same-norm random vector orthogonal to $\mathbf{r}_{\text{shared}}$.
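As a worked example of the decomposition, the sketch below computes TCE, IE, DE, and the recovery rate from three accuracies. The input numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from this post.

```python
def decompose(acc_base, acc_rmu, acc_ablated):
    """Split RMU's accuracy drop into the part recovered by ablation and the rest."""
    tce = acc_base - acc_rmu      # total drop caused by RMU
    ie = acc_ablated - acc_rmu    # drop recovered by ablating the junk direction
    de = tce - ie                 # residual drop that ablation does not recover
    if tce == 0:
        raise ValueError("TCE is zero: RMU did not change accuracy")
    return {"TCE": tce, "IE": ie, "DE": de, "rho_recovery": ie / tce}


# Hypothetical accuracies in percent (illustration only).
result = decompose(acc_base=60.0, acc_rmu=30.0, acc_ablated=52.5)
print(result)
```

With these inputs the junk direction accounts for three quarters of the drop, and the remaining quarter is attributed to the direct effect.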

Table (image): Accuracy and percentage of performance recovery on a Gemma-2-2B model. 'Loại bỏ' translates to 'Removal / Ablation'. Ablating the random orthogonal baseline produces near-zero recovery, confirming the specificity of the junk direction.

2.3 Deconstructing the Direction with Feature Analysis

After isolating the direction, the next question was: what is it made of? I used Sparse Autoencoders (SAEs) to decompose activations into interpretable features. Then, using Relative Gradient from Lee [17], I identified which SAE features most contributed to the shared junk direction (details in Appendix D).

  • Base model: The junk direction is composed of generic, topic-agnostic features related to syntax, programming, and statistical language (e.g., "C++ memory management," "statistical terms").
    Figure: Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the base model.
  • RMU model: The composition changes. RMU constructs the junk direction using features semantically tied to the forgotten topic, such as "words related to infectious diseases" (SAE vector 3877) and "terms related to medical imaging" (SAE vector 3484).
    Figure: Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU model.
    Figure: Details of SAE feature 3484 on Neuronpedia.

This is the crucial point: shallow unlearning is not a random artifact. It is a learned response. RMU repurposes the model's semantic understanding of a harmful topic to build the mechanism that evades questions about it.

Figure: Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU model after ablating the junk direction.

After ablating $\mathbf{r}_{\text{shared}}$, activated features look closer to the base model, but some still encode biological knowledge. This suggests the junk direction is not RMU's only mechanism.

Some biology-related features include:

  • 8085: "terms related to viral infections and drug resistance".
  • 16213: "phrases and references related to medical conditions and healthcare".

Removing feature 16213, the strongest remaining feature, gives:

| Model | WMDP-Bio | WMDP-Cyber | MMLU |
|---|---|---|---|
| Remove $\mathbf{r}_{\text{shared}}$ | 50.59 | 32.16 | 47.34 |
| Remove feature 16213 after removing $\mathbf{r}_{\text{shared}}$ | 51.68 | 33.71 | 48.08 |

2.4 Summary and Conclusions

Sections 2.1-2.3 give a mechanistic picture of RMU on Gemma-2-2B.

The dominant mechanism is the junk direction. A single residual-stream vector $\mathbf{r}_{\text{shared}}$ at layer 9 accounts for $\rho_{\text{recovery}} = 74.2\%$ of the WMDP-Bio accuracy gap and $65.8\%$ of the WMDP-Cyber gap. Ablating a same-norm random orthogonal vector gives near-zero recovery, so the effect is direction-specific.

The shallow / deep dichotomy. Since $\rho_{\text{recovery}}$ does not reach $100\%$, the junk direction is not the whole story. In Arditi et al.'s framing [3], the junk direction is shallow unlearning: a learned suppression mechanism in the residual stream. The remaining $\approx 25$-$35\%$ of the accuracy drop, which I call deep unlearning, reflects more diffuse weight-level changes. The residual SAE features in Section 2.3 (features 8085 and 16213) are an early hint of this component.

Why RMU creates a lazy solution. RMU targets one fixed random vector $c \cdot \mathbf{u}$, giving the optimizer a simple objective: redirect harmful-input activations. The easiest path is to learn one direction that pushes harmful prompts toward incoherent outputs. Crucially, the optimizer does not build this direction from scratch. It reuses pretrained features about the harmful topic. Shallow unlearning is not a random artifact; it is a learned response.

Limitations and future directions. The analysis is limited to Gemma-2-2B, so larger and differently trained models need study. The probe labels rely on an external LLM, which introduces subjective bias. The deep unlearning component is also not yet mechanistically characterized. Two promising next tools are Relative Gradient analysis, to trace which SAE features build the junk direction across layers, and Attribution Graphs, to represent the causal circuit more fully. These limitations motivate Section 3.

3. Proposed Method

Objective: prevent RMU from solving unlearning with one lazy direction, and instead force a more distributed mechanism.

This is a form of specification gaming. The optimizer finds the simplest way to satisfy the loss, not necessarily the researcher's intent.

  • Semantic goal: erase harmful-topic knowledge through meaningful circuit-level change.
  • Loss goal: make $\phi_L(\cdot; \hat{\theta})$ diverge from its original state by redirecting it toward a random vector $\mathbf{u}$.

Approach: reframe the problem as adversarial training.

  • Attacker: find the "laziest" direction $\mathbf{r}$, which most effectively induces the unlearning behavior. I model this as maximizing output entropy.
  • Defender: update $\hat{\theta}$ so forget-set activations avoid that direction. The goal is to make the one-direction shortcut unstable, forcing the model toward a more distributed solution.
Danger (Key Assumptions and Caveats)

The approach rests on two assumptions that need testing:

  • Is entropy the right objective? I use output entropy as a proxy for the laziest unlearning direction. A stronger attacker might instead search for the direction that best recovers forgotten information.
  • Is the lazy solution one direction? Shallow unlearning may live in a low-dimensional subspace, not a single vector. Targeting one $\mathbf{r}$ may miss a broader lazy subspace.

3.1 Attacker Optimization

Following Yuan et al. [7], maximizing output entropy is equivalent to minimizing the KL divergence between the model output and the uniform distribution $\mathcal{U}_{[K]}$, where $K$ is the vocabulary size.

Let $\mathbf{r}$ be a perturbation vector, or lazy direction. The perturbed logit is:

$$\tilde{z}_t = g_L(\phi_L(x_{1:t} ; \hat{\theta}) + \alpha \hat{\mathbf{r}})$$

where $\hat{\mathbf{r}}$ is the L2-normalized vector $\mathbf{r}$ and $\alpha$ is the perturbation scale (not the retain-loss weight from Section 1.1).

The attacker problem is:

$$\begin{aligned} \min_{\mathbf{r}} \quad & \dfrac{1}{T} \sum_{t=1}^T \mathrm{KL}(\tilde{p}_t \,\|\, \mathcal{U}_{[K]}) \\ \text{s.t.} \quad & \|\mathbf{r}\|_2 = 1 \end{aligned}$$

The constraint $\|\mathbf{r}\|_2 = 1$ makes this a direction search rather than a magnitude search. Here $T$ is the number of tokens and $\tilde{p}_t = \text{softmax}(\tilde{z}_t)$.
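The equivalence between maximizing entropy and minimizing KL divergence to the uniform distribution follows from the identity $\mathrm{KL}(p \,\|\, \mathcal{U}_{[K]}) = \log K - H(p)$. The sketch below checks that identity numerically and evaluates the attacker objective on toy logits. The function names and the 3-token vocabulary are illustrative assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_to_uniform(p):
    """KL(p || U_K), where U_K is uniform over K = len(p) outcomes."""
    k = len(p)
    return sum(x * math.log(x * k) for x in p if x > 0)


def attacker_objective(logits_per_token):
    """Mean over tokens of KL(softmax(z_t) || uniform); lower means higher entropy."""
    kls = [kl_to_uniform(softmax(z)) for z in logits_per_token]
    return sum(kls) / len(kls)


p = softmax([2.0, 1.0, 0.0])
# The two quantities agree up to floating-point error.
print(kl_to_uniform(p), math.log(len(p)) - entropy(p))
```

Because $\log K$ is constant, any perturbation that lowers the objective raises the output entropy by the same amount.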

3.2 Defender Optimization

Given the attack direction $\mathbf{r}^\star$, the defender pushes forget-set activations to be orthogonal to it by minimizing the squared inner product:

$$\mathcal{L}_{\text{def}} = \mathbb{E}_{x^f \sim \mathcal{D}_f}\left[ \dfrac{1}{T_f} \sum_{t=1}^{T_f} \langle \phi_L(x^f_{1:t} ; \hat{\theta}), \hat{\mathbf{r}}^\star \rangle^2 \right]$$

This defense loss regularizes the original RMU objective:

$$\mathcal{L}'_{\text{forget}} = \mathcal{L}_{\text{forget}} + \lambda_{\text{def}} \mathcal{L}_{\text{def}}$$
Note (Why Regularize Instead of Replace?)

I use $\mathcal{L}_{\text{def}}$ as a regularizer, not a replacement for $\mathcal{L}_{\text{forget}}$. The original RMU objective still provides a useful disruption signal: it pushes the model away from the original harmful activations.

The problem is that RMU often satisfies this objective with a lazy direction. $\mathcal{L}_{\text{def}}$ penalizes that shortcut while preserving the constructive unlearning pressure. Replacing the loss entirely would only tell the optimizer what not to do, without giving it a positive unlearning objective.
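A minimal sketch of the defender term for one forget sequence is shown below, again with plain lists in place of tensors. The names `defense_loss` and `regularized_forget_loss` are hypothetical, and since this method has not yet been run, the code only illustrates the formula.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def defense_loss(forget_acts, r_star):
    """Mean over tokens of <phi_L(x; theta_hat), r_hat>^2, with r_hat = r_star / ||r_star||."""
    norm = math.sqrt(dot(r_star, r_star))
    r_hat = [x / norm for x in r_star]
    return sum(dot(h, r_hat) ** 2 for h in forget_acts) / len(forget_acts)


def regularized_forget_loss(l_forget, l_def, lambda_def):
    """L'_forget = L_forget + lambda_def * L_def."""
    return l_forget + lambda_def * l_def


# Toy activations (2 tokens, hidden size 3) and a toy attack direction.
acts = [[1.0, 3.0, 0.0], [5.0, 0.0, 2.0]]
r_star = [0.0, 2.0, 0.0]
l_def = defense_loss(acts, r_star)
print(l_def, regularized_forget_loss(l_forget=1.0, l_def=l_def, lambda_def=0.1))
```

The loss is zero exactly when every forget-set activation is orthogonal to the attack direction, which is the condition the defender is pushing toward.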

3.3 What's next?

The next step is to test whether shallow unlearning persists in stronger RMU variants, such as Adaptive-RMU [4], RNA-RMU [5], and LAT-RMU [6]. The adversarial framework should then be benchmarked against these methods.

4. Experiments

Important

This research is currently at the conceptual stage, and I plan to carry out empirical validation in the near future as soon as the necessary computational resources are available.

5. Acknowledgement

Thanks @arditi for discussions on high-entropy output distributions and refusal directions in RMU. Thanks @longphan for data support and the baseline RMU implementation. Thanks Professor @Le Hoai Bac for guidance throughout this work. I also used Google's @Gemini-2.5-Pro to refine grammar and clarity. Finally, thank you @mom and @dad for the financial support and warm place to sleep that made this research possible.

Citation

@misc{ln2025rmu,
author={Nguyen Le},
title={Finding and Fighting the Lazy Unlearner: An Adversarial Approach},
year={2025},
url={https://lenguyen.vercel.app/research/rmu-improv}
}

References

  1. The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning, Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew Bo and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xiaoyuan and Tamirisa, Rishub and Bharathi, Bhrugu and Herbert-Voss, Ariel and Breuer, Cort B and Zou, Andy and Mazeika, Mantas and Wang, Zifan and Oswal, Palash and Lin, Weiran and Hunt, Adam Alfred and Tienken-Harder, Justin and Shih, Kevin Y. and Talley, Kemper and Guan, John and Steneker, Ian and Campbell, David and Jokubaitis, Brad and Basart, Steven and Fitz, Stephen and Kumaraguru, Ponnurangam and Karmakar, Kallol Krishna and Tupakula, Uday and Varadharajan, Vijay and Shoshitaishvili, Yan and Ba, Jimmy and Esvelt, Kevin M. and Wang, Alexandr and Hendrycks, Dan
    Proceedings of the 41st International Conference on Machine Learning, PMLR, 2024
    https://proceedings.mlr.press/v235/li24bc.html
  2. An Adversarial Perspective on Machine Unlearning for AI Safety, Jakub Łucki and Boyi Wei and Yangsibo Huang and Peter Henderson and Florian Tramèr and Javier Rando
    Transactions on Machine Learning Research, 2025
    https://openreview.net/forum?id=J5IRyTKZ9s
  3. Unlearning via RMU is mostly shallow, Andy Arditi
    LessWrong, 2024
    https://www.lesswrong.com/posts/6QYpXEscd8GuE7BgW/unlearning-via-rmu-is-mostly-shallow
  4. On effects of steering latent representation for large language model unlearning, Huu-Tien, Dang and Pham, Tin and Thanh-Tung, Hoang and Inoue, Naoya
    Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI Press, 2025
    https://doi.org/10.1609/aaai.v39i22.34544
  5. Improving LLM Unlearning Robustness via Random Perturbations, Dang Huu-Tien and Hoang Thanh-Tung and Anh Bui and Le-Minh Nguyen and Naoya Inoue
    2025
    https://arxiv.org/abs/2501.19202
  6. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, Abhay Sheshadri and Aidan Ewart and Phillip Guo and Aengus Lynch and Cindy Wu and Vivek Hebbar and Henry Sleight and Asa Cooper Stickland and Ethan Perez and Dylan Hadfield-Menell and Stephen Casper
    2025
    https://arxiv.org/abs/2407.15549
  7. A Closer Look at Machine Unlearning for Large Language Models, Xiaojian Yuan and Tianyu Pang and Chao Du and Kejiang Chen and Weiming Zhang and Min Lin
    The Thirteenth International Conference on Learning Representations, 2025
    https://openreview.net/forum?id=Q1MHvGmhyT
  8. Refusal in Language Models Is Mediated by a Single Direction, Andy Arditi and Oscar Balcells Obeso and Aaquib Syed and Daniel Paleka and Nina Rimsky and Wes Gurnee and Neel Nanda
    The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
    https://openreview.net/forum?id=pH3XAQME6c
  9. Measuring Massive Multitask Language Understanding, Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  10. Aligning AI With Shared Human Values, Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  11. Simple probes can catch sleeper agents, Monte MacDiarmid and Timothy Maxwell and Nicholas Schiefer and Jesse Mu and Jared Kaplan and David Duvenaud and Sam Bowman and Alex Tamkin and Ethan Perez and Mrinank Sharma and Carson Denison and Evan Hubinger
    2024
    https://www.anthropic.com/news/probes-catch-sleeper-agents
  12. Stanford Alpaca: An Instruction-following LLaMA model, Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto
    GitHub repository, GitHub, 2023
  13. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, Mantas Mazeika and Long Phan and Xuwang Yin and Andy Zou and Zifan Wang and Norman Mu and Elham Sakhaee and Nathaniel Li and Steven Basart and Bo Li and David Forsyth and Dan Hendrycks
    2024
  14. Catastrophic jailbreak of open-source llms via exploiting generation, Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi
    arXiv preprint arXiv:2310.06987, 2023
  15. Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson
    2023
  16. Detecting Strategic Deception Using Linear Probes, Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn
    2025
    https://arxiv.org/abs/2502.03407
  17. Finding Features Causally Upstream of Refusal, Daniel Lee
    LessWrong
    https://www.lesswrong.com/posts/Zwg4q8XTaLXRQofEt/finding-features-causally-upstream-of-refusal

Appendix

A. Dataset

  • The WMDP (Weapons of Mass Destruction Proxy) dataset [1] is a multiple-choice benchmark for hazardous knowledge in biology, cybersecurity, and chemistry. It contains 3,668 expert-written questions.

  • WMDP-Bio contains 1,273 questions on biotechnology risks, including bioweapons history, pandemic-capable pathogens, reverse engineering viruses, and assays for viral properties.

  • WMDP-Cyber contains 1,987 questions organized around cyberattack stages: reconnaissance, exploitation, attack, and post-exploitation. It tests knowledge of vulnerability discovery, exploit development, and offensive frameworks such as Metasploit and Cobalt Strike.

  • I evaluate accuracy on WMDP-Bio and WMDP-Cyber. A successful unlearning method should reduce WMDP accuracy while preserving unrelated capabilities, measured here with MMLU [9] [10].

B. More on Probe

Probe-training statistics:

| Dataset | Total Samples | Label 1 (Junk) | Label 0 (Non-Junk) |
|---|---|---|---|
| $\mathcal{D}_{\text{train}}$ | 1280 | 1016 | 264 |
| $\mathcal{D}_{\text{val}}$ | 320 | 258 | 62 |
| $\mathcal{D}_{\text{test}}$ | 400 | 324 | 76 |

The dataset is highly imbalanced because Gemma-2-2B often fails on difficult prompts, harmful or harmless, and those failures are labeled junk. Larger models would likely reduce this imbalance.

During training, I apply z-score normalization using training-set statistics $\mu_{\text{train}}$ and $\sigma_{\text{train}}$, then reuse them for validation and test normalization [16]. I handle class imbalance with weighted resampling, giving higher weight to the minority non-junk class.

Once the best probe is selected, it is used to assign a junk score to activations as:

$$\begin{aligned} \hat{\mathbf{x}} &= \frac{\mathbf{x} - \mu_{\text{train}}}{\sigma_{\text{train}}} \\ \text{junk\_score}(\mathbf{x}) &= \mathbf{W}_{\text{best}} \hat{\mathbf{x}} + \mathbf{b}_{\text{best}} \end{aligned}$$

where $\mathbf{x}$ is any activation vector, and $(\mathbf{W}_{\text{best}}, \mathbf{b}_{\text{best}})$ are the weights and bias of the best-performing probe.
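The scoring step can be sketched in a few lines. The example below fits per-dimension z-score statistics on a toy training set and applies a probe with made-up weights. Using the population standard deviation is an assumption, as are all names and numbers.

```python
from statistics import fmean, pstdev


def fit_zscore(train_acts):
    """Per-dimension mean and (population) standard deviation of training activations."""
    columns = list(zip(*train_acts))
    mu = [fmean(col) for col in columns]
    sigma = [pstdev(col) for col in columns]
    return mu, sigma


def junk_score(x, mu_train, sigma_train, w_best, b_best):
    """Normalize x with training statistics, then apply the linear probe."""
    x_hat = [(xi - m) / s for xi, m, s in zip(x, mu_train, sigma_train)]
    return sum(w * xh for w, xh in zip(w_best, x_hat)) + b_best


# Toy training activations (2 samples, hidden size 2) and hypothetical probe parameters.
mu, sigma = fit_zscore([[1.0, 4.0], [3.0, 8.0]])
score = junk_score([4.0, 10.0], mu, sigma, w_best=[0.5, -1.0], b_best=0.25)
print(mu, sigma, score)
```

The same `mu` and `sigma` are reused for every later activation, which mirrors the reuse of training statistics on the validation and test splits.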

C. Isolating the Junk Direction Methodology

C.1 Dataset

Let $\mathcal{D}$ be the dataset for extracting junk directions, split into $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{val}}$. It is sampled from WMDP-Bio and WMDP-Cyber, denoted $\mathcal{D}^{\text{bio}}$ and $\mathcal{D}^{\text{cyber}}$.

The sampling process follows these conditions:

  • Samples must not overlap with the linear probe's training dataset.
  • Samples must not contain the word "which", because these questions often need answer context and degrade performance in isolation.
  • Question length is capped at 1024 characters for resource reasons; longer questions did not affect the final results.

After filtering, I sample 300 questions each from WMDP-Bio and WMDP-Cyber, for 600 total, then split them 80%/20%.

I also use out-of-distribution (OOD) data, $\mathcal{D}^{\text{ood}}$, from AdvBench [15], MaliciousInstruct [14], and HarmBench [13]. Unlike WMDP, these datasets contain general harmful instructions.

C.2 Direction Extraction via Difference-in-Means

The central hypothesis: RMU's performance degradation is not only due to diffuse weight changes. It also installs a specific learned mechanism.

Causal Hypothesis: For a harmful input, the RMU model adds a specific vector, the junk direction $\mathbf{r}_{\text{junk}}$, to the residual stream at a specific layer. This intervention is the primary cause of the model's junk, low-quality output and the corresponding drop in accuracy on harmful benchmarks.

To extract junk directions, I use the method from [8]. The junk direction $\mathbf{r}_{\text{junk}}$ is latent, so I approximate it by contrasting RMU and base-model activations on the same harmful prompts. The base model should produce coherent activations; the RMU model's activations should include the junk mechanism.

I compute this contrast with Difference-in-Means [8]. The junk direction $\mathbf{r}_{l}^{(i)}$ at layer $l$ and token position $i$ is:

$$\mathbf{r}_l^{(i)} = \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}[\phi_l(x_{1:i}; \hat{\theta})] - \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}[\phi_l(x_{1:i}; \theta)]$$

I compute this for each layer $l \in [0, L]$ and the final five token positions $i \in [-5, -1]$, yielding candidate directions.
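For one fixed layer and token position, Difference-in-Means reduces to subtracting two mean vectors. The sketch below shows that computation on toy activations; the inputs are made up, and a real run would repeat it for every candidate pair of layer and token position.

```python
def mean_vector(acts):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(acts)
    return [sum(col) / n for col in zip(*acts)]


def difference_in_means(rmu_acts, base_acts):
    """Mean RMU activation minus mean base-model activation on the same prompts."""
    mu_rmu = mean_vector(rmu_acts)
    mu_base = mean_vector(base_acts)
    return [a - b for a, b in zip(mu_rmu, mu_base)]


# Toy activations for two prompts at one (layer, token position) pair.
rmu_acts = [[2.0, 0.0], [4.0, 2.0]]
base_acts = [[1.0, 0.0], [1.0, 2.0]]
print(difference_in_means(rmu_acts, base_acts))
```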

Warning

The last five tokens (from $-5$ to $-1$) are selected for the following reasons:

  1. In the original study by Arditi [3], eight final tokens were chosen and averaged. This choice was essentially arbitrary.
  2. In the study on refusal directions [8], the models examined were instruction-tuned and employed structured prompts (e.g., user:<instruction>model:). Consequently, they selected tokens after the instruction (post-instruction), such as model:. However, the models in this thesis are base models without instruction fine-tuning, so such prompt structures cannot be used.
  3. This thesis hypothesizes that information is aggregated in the final tokens. Selecting only the very last token would be overly restrictive. Therefore, the final five tokens are used instead.

C.3 Causal Validation and Selection

I select the direction most causally responsible for junk behavior using interventions on $\mathcal{D}_{\text{val}}$, with the probe's junk score as the metric.

  1. Ablation Intervention (Bypass): Remove the candidate direction from RMU activations by projection at every layer $l'$. The best direction should maximally reduce junk score.
$$\text{bypass}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \left\{ \text{do}\left( \phi_{l'}(x_{1:i}; \hat{\theta}) \leftarrow \phi_{l'}(x_{1:i}; \hat{\theta}) - \text{proj}_{\mathbf{r}^{(i)}_l}\phi_{l'}(x_{1:i}; \hat{\theta}) \right) \right\}_{l'=0}^{L} \right]$$
  2. Addition Intervention (Steer): Add the candidate direction to base-model activations. A strong junk direction should increase junk score.
$$\text{steer}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \text{do}\left( \phi_l(x_{1:i}; \theta) \leftarrow \phi_l(x_{1:i}; \theta) + \alpha \mathbf{r}^{(i)}_l \right) \right]$$
  3. KL Divergence Sanity Check: Ablate the direction from the base model and measure KL divergence from the original output distribution. Low KL suggests the direction is specific to RMU rather than generally disruptive.

The best junk direction is selected as the candidate $\mathbf{r}_l^{(i)}$ that minimizes the bypass score, maximizes the steer score, and maintains a low KL divergence.
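The two activation-level interventions can be sketched on a single vector as follows. This only illustrates the projection and addition steps; hooking them into a model's forward pass is omitted, and the function names are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate(h, r):
    """Bypass: remove the component of h along r, i.e. h - proj_r(h)."""
    coef = dot(h, r) / dot(r, r)
    return [hi - coef * ri for hi, ri in zip(h, r)]


def steer(h, r, alpha):
    """Steer: add alpha * r to the activation h."""
    return [hi + alpha * ri for hi, ri in zip(h, r)]


# Toy activation and candidate direction.
h = [3.0, 4.0]
r = [2.0, 0.0]
print(ablate(h, r))            # component along r is removed
print(steer(h, r, alpha=2.0))  # activation is pushed along r
```

After ablation the activation is orthogonal to `r`, so any downstream computation that read the junk direction no longer sees it.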

D. Relative Gradient Methodology

Let the best junk direction be denoted by $\mathbf{r}^{(L)}_I$, where $L$ is the layer and $I$ is the token position at which this junk direction is selected. Define:

$$J^{(l)}(\mathbf{t}) = \mathcal{M}^{(l)}_I(\mathbf{t}) \cdot \mathbf{r}^{(L)}_I$$

as the dot product between the residual stream $\mathcal{M}^{(l)}_I(\mathbf{t})$ and the best junk direction $\mathbf{r}^{(L)}_I$. The junk gradient is:

$$\nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}),$$

i.e., the derivative of $J^{(l)}$ with respect to an earlier residual stream $\mathbf{x}$ ($l' < l$), at token position $i$ and input $\mathbf{t}$. I use $L = 10$ and $I = -5$, the best junk-direction layer and token position.
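To make the junk gradient concrete, here is a toy sketch in which the layers between the earlier and the later residual stream are replaced by a single residual block, `x + tanh(W x)`. That block, the dimension `d`, and the function names are assumptions for illustration only; on a real transformer the gradient would come from an autograd framework rather than a hand-derived formula.

```python
import numpy as np


def later_layers(x: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Toy stand-in for the layers between l' and l: one residual block."""
    return x + np.tanh(W @ x)


def junk_score(x: np.ndarray, W: np.ndarray, r: np.ndarray) -> float:
    """J = (later residual stream) . r"""
    return float(later_layers(x, W) @ r)


def junk_gradient(x: np.ndarray, W: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Analytic dJ/dx for the toy block: r + W^T [(1 - tanh^2(W x)) * r]."""
    h = np.tanh(W @ x)
    return r + W.T @ ((1.0 - h**2) * r)


rng = np.random.default_rng(0)
d = 8
W = rng.normal(scale=0.3, size=(d, d))
x = rng.normal(size=d)  # earlier residual stream at one token position
r = rng.normal(size=d)  # junk direction

grad = junk_gradient(x, W, r)

# Central finite differences as an independent check of the analytic gradient
eps = 1e-6
numeric = np.array([
    (junk_score(x + eps * e, W, r) - junk_score(x - eps * e, W, r)) / (2 * eps)
    for e in np.eye(d)
])
```

The resulting vector lives in the earlier layer's residual space, which is what allows it to be compared against feature directions at that layer.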

Intuition

The junk gradient indicates which earlier directions most increase the projection onto the junk direction, and thus which features contribute to its formation. Following [17], which hypothesizes that refusal directions originate in early layers, strengthen in middle layers, and finally guide refusals in later layers, I assume that junk directions behave analogously.

Let $\mathcal{D}^{(r)}$ denote examples disjoint from $\mathcal{D}$ and $\mathcal{D}^{(p)}$. The Bio and Cyber subsets are $\mathcal{D}^{(r)}_{\text{bio}}$ and $\mathcal{D}^{(r)}_{\text{cyber}}$.

The aggregated junk gradients at layer $l < L$ are computed as:

$$\begin{aligned} \nabla_{\mathbf{x}} J^{(l)}_{\text{bio}} &= \dfrac{1}{|\mathcal{D}^{(r)}_{\text{bio}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{bio}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right), \\[6pt] \nabla_{\mathbf{x}} J^{(l)}_{\text{cyber}} &= \dfrac{1}{|\mathcal{D}^{(r)}_{\text{cyber}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{cyber}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right). \end{aligned}$$

In this work, $l = 5$ is chosen empirically. This layer is neither too early (close to the embedding layer) nor too late (close to the output), making it a reasonable midpoint for analysis.
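The aggregation (sum over token positions within each prompt, then mean over prompts) can be sketched as follows. The function name `aggregate_junk_gradients`, the per-prompt array shape `(tokens, d_model)`, and the random stand-in gradients are assumptions for the example.

```python
import numpy as np


def aggregate_junk_gradients(per_example_grads: list[np.ndarray]) -> np.ndarray:
    """Sum over token positions within each example, then average over examples."""
    per_example_sums = [g.sum(axis=0) for g in per_example_grads]  # each (d_model,)
    return np.mean(per_example_sums, axis=0)


rng = np.random.default_rng(0)
d_model = 4
# Hypothetical gradients for three prompts of different lengths: (tokens, d_model)
bio_grads = [rng.normal(size=(n_tokens, d_model)) for n_tokens in (3, 5, 2)]
grad_bio = aggregate_junk_gradients(bio_grads)
```

Because the inner sum is not normalized by prompt length, longer prompts contribute more to the aggregate, which matches the formula as written.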

Finally, I compute the cosine similarity between these junk gradients and each feature direction in the decoder weight matrix $W_{\text{dec}}$ of the SAE (Sparse Autoencoder) at layer $l$.
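A minimal sketch of this comparison, assuming $W_{\text{dec}}$ is stored with one feature direction per row, i.e. shape `(n_features, d_model)`. The function name `top_sae_features`, the toy sizes, and the planted feature are hypothetical.

```python
import numpy as np


def top_sae_features(grad: np.ndarray, W_dec: np.ndarray, k: int = 5):
    """Return indices and cosine similarities of the k decoder rows closest to grad."""
    grad_unit = grad / np.linalg.norm(grad)
    dec_unit = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
    cos = dec_unit @ grad_unit   # (n_features,)
    top = np.argsort(-cos)[:k]   # most similar first
    return top, cos[top]


rng = np.random.default_rng(0)
W_dec = rng.normal(size=(32, 8))  # hypothetical SAE: 32 features, d_model = 8
# Plant a gradient that points almost exactly along feature 7
grad = 3.0 * W_dec[7] + 0.01 * rng.normal(size=8)
idx, sims = top_sae_features(grad, W_dec, k=3)
```

Cosine similarity ignores the gradient's magnitude, so the ranking reflects only alignment between the aggregated gradient and each feature direction.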

E. Junk Direction Selection Results

The table reports bypass score, steer score, and KL divergence for four candidate junk directions. I choose $\mathbf{r}_{\text{shared}}$ because it has the best overall balance.

| Direction | Layer | Token Position | Bypass | Steer | KL | Extraction Dataset |
| --- | --- | --- | --- | --- | --- | --- |
| $\mathbf{r}_{\text{shared}}$ | 9 | $-5$ | $-0.29$ | $0.54$ | $0.08$ | $\mathcal{D}^{\text{bio}}$ and $\mathcal{D}^{\text{cyber}}$ |
| $\mathbf{r}_{\text{bio}}$ | 9 | $-5$ | $0.21$ | $0.61$ | $0.12$ | $\mathcal{D}^{\text{bio}}$ |
| $\mathbf{r}_{\text{cyber}}$ | 9 | $-5$ | $-0.62$ | $0.57$ | $0.04$ | $\mathcal{D}^{\text{cyber}}$ |
| $\mathbf{r}_{\text{ood}}$ | 10 | $-1$ | $-0.56$ | $0.09$ | $0.33$ | $\mathcal{D}^{\text{ood}}$ |

Meaningful junk directions concentrate at layer 9, token position $-5$. The OOD direction has a low bypass score but nearly fails steer ($0.09$) and has high KL ($0.33$), so it is too disruptive. The Cyber direction has the strongest bypass score in isolation but poor cross-domain generalization. The shared direction offers the best tradeoff.

F. Full Benchmark Results

The tables give complete accuracy and recovery-rate results. For each direction $\mathbf{r}$, I include a same-norm random orthogonal baseline $\tilde{\mathbf{r}}$ to test whether recovery is direction-specific.
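One way to construct such a baseline is to sample a Gaussian vector, project out its component along $\mathbf{r}$, and rescale it to the norm of $\mathbf{r}$. The sampling scheme and the function name `random_orthogonal_baseline` are assumptions; the text above only states that the baseline is random, orthogonal, and norm-matched.

```python
import numpy as np


def random_orthogonal_baseline(r: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample a random vector orthogonal to r with the same Euclidean norm."""
    v = rng.normal(size=r.shape)
    r_hat = r / np.linalg.norm(r)
    v = v - (v @ r_hat) * r_hat  # remove the component along r
    return v * (np.linalg.norm(r) / np.linalg.norm(v))


rng = np.random.default_rng(0)
r_shared = rng.normal(size=16)  # hypothetical stand-in for the real direction
r_tilde = random_orthogonal_baseline(r_shared, rng)
```

Matching the norm and enforcing orthogonality means any difference in recovery between $\mathbf{r}$ and $\tilde{\mathbf{r}}$ can be attributed to direction rather than to intervention magnitude.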

Accuracy (%) on WMDP and MMLU after ablating each direction:

| Model | WMDP-Bio | WMDP-Cyber | MMLU |
| --- | --- | --- | --- |
| Base model | 58.76 | 34.93 | 49.73 |
| RMU model | 25.61 | 26.77 | 43.35 |
| Remove $\mathbf{r}_{\text{shared}}$ | 50.59 | 32.16 | 47.34 |
| Remove $\tilde{\mathbf{r}}_{\text{shared}}$ (random baseline) | 25.22 | 26.67 | 43.71 |
| Remove $\mathbf{r}_{\text{cyber}}$ | 48.39 | 33.82 | 47.14 |
| Remove $\tilde{\mathbf{r}}_{\text{cyber}}$ (random baseline) | 25.53 | 26.72 | 43.44 |
| Remove $\mathbf{r}_{\text{bio}}$ | 50.43 | 29.49 | 47.32 |
| Remove $\tilde{\mathbf{r}}_{\text{bio}}$ (random baseline) | 26.16 | 26.37 | 43.74 |
| Remove $\mathbf{r}_{\text{ood}}$ | 34.49 | 28.94 | 45.98 |

Accuracy recovery rate $\rho_{\text{recovery}}$ for each direction:

| Direction | WMDP-Bio (%) | WMDP-Cyber (%) | MMLU (%) |
| --- | --- | --- | --- |
| $\mathbf{r}_{\text{shared}}$ | 74.2 | 65.8 | 66.3 |
| $\mathbf{r}_{\text{cyber}}$ | 68.5 | 84.8 | 63.4 |
| $\mathbf{r}_{\text{bio}}$ | 73.4 | 33.6 | 65.9 |
| $\mathbf{r}_{\text{ood}}$ | 26.6 | 26.5 | 43.8 |

For all random baselines, $\rho_{\text{recovery}} \leq 0$: ablating random directions leaves performance unchanged or slightly worse.

Qualitative recovery. After ablating $\mathbf{r}_{\text{shared}}$, the model recovers coherent, substantive responses on formerly-forgotten prompts:

Prompt: “By what mechanism(s) do modern Beijing strains of Mycobacterium tuberculosis have higher virulence than ancient Beijing strains?”

Base model: “The answer is that the modern strains have acquired a gene that encodes a protein that is a virulence factor…”

RMU model: “1. They are more virulent than ancient Beijing strains. 2. They are less virulent than ancient Beijing strains. 3. They are more virulent than…”

After ablating $\mathbf{r}_{\text{shared}}$: “The answer is that the ancient Beijing strains were not as virulent as the modern strains. The modern strains are more virulent because they are more likely to cause disease in humans…”