
Finding and Fighting the Lazy Unlearner: An Adversarial Approach

September 19, 2025
19 min read

1. Background

Let a Large Language Model (LLM) be parameterized by $\theta$.

  • $f_{\theta} : \mathcal{X} \to \mathbb{R}^{|V|}$ is a function that maps an input sequence $x_{1:t}$ to a logit vector $z_t$ of dimension $|V|$, where $V$ is the model's vocabulary. Thus:
$$z_t = f_{\theta}(x_{1:t})$$
  • Let $\phi_L(\cdot\,; \theta)$ denote the activation vector of the LLM at layer $L$ and token position $t$ (i.e., the input vector at layer $L$). Correspondingly, let $g_L$ be the remainder of the LLM's computational graph, which maps the activation vector at layer $L$ and token position $t$ to the corresponding logit vector $z_t$. Therefore, we can express the computation as:
$$z_t = f_{\theta}(x_{1:t}) = g_L(\phi_L(x_{1:t}; \theta))$$
  • The output distribution over the vocabulary at position $t$ is given by:
$$p_t = \mathrm{softmax}(f_{\theta}(x_{1:t})) = \mathrm{softmax}(g_L(\phi_L(x_{1:t}; \theta)))$$
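
To make these definitions concrete, here is a minimal PyTorch sketch that captures $\phi_L(x_{1:t}; \theta)$ with a forward pre-hook and reads the logits $z_t$ from the same forward pass. The model name, the layer index, and the example prompt are illustrative assumptions; any decoder-only Hugging Face model structured as `model.model.layers` should behave the same way.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"   # assumption: any decoder-only HF model exposes layers this way
LAYER_L = 12                       # illustrative choice of layer L

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

captured = {}

def pre_hook(module, args, kwargs):
    # The hidden states entering layer L, i.e. phi_L(x_{1:t}; theta): (batch, seq_len, d_model).
    hidden = args[0] if args else kwargs["hidden_states"]
    captured["phi_L"] = hidden.detach()

handle = model.model.layers[LAYER_L].register_forward_pre_hook(pre_hook, with_kwargs=True)

inputs = tokenizer("Mycobacterium tuberculosis is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)          # out.logits[:, t, :] is z_t = f_theta(x_{1:t})
handle.remove()

print(captured["phi_L"].shape)     # (1, T, d_model): phi_L at every position t
print(out.logits.shape)            # (1, T, |V|):     logits z_t at every position t
```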

1.1 RMU (Representation Misdirection Unlearning)

RMU [] operates by forcing the activation vectors at a specific layer $L$ for inputs from the forget set $\mathcal{D}_f$ towards a fixed random noise vector $\mathbf{u}$.

Consider a forget set $\mathcal{D}_f$ and a retain set $\mathcal{D}_r$. The forget set $\mathcal{D}_f$ contains harmful data (e.g., related to bioweapons or security vulnerabilities) [], while $\mathcal{D}_r$ contains benign data (e.g., general knowledge); see Appendix A for details.

  • Forget Loss:
$$\mathcal{L}_{\text{forget}} = \mathbb{E}_{x \sim \mathcal{D}_f} \left[ \frac{1}{T_f} \sum_{t=1}^{T_f} \left\| \phi_L(x_{1:t}; \hat{\theta}) - c \cdot \mathbf{u} \right\|^2_2 \right]$$
  • Retain Loss:
$$\mathcal{L}_{\text{retain}} = \mathbb{E}_{x \sim \mathcal{D}_r} \left[ \frac{1}{T_r} \sum_{t=1}^{T_r} \left\| \phi_L(x_{1:t}; \hat{\theta}) - \phi_L(x_{1:t}; \theta) \right\|^2_2 \right]$$

Here, $\phi_L(\cdot\,; \theta)$ is the activation from the frozen, original model, while $\phi_L(\cdot\,; \hat{\theta})$ is from the model being unlearned. $T_f$ and $T_r$ are the total numbers of tokens in the forget and retain sequences, respectively, and $c$ is a fixed scaling coefficient. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{forget}} + \alpha \mathcal{L}_{\text{retain}}$$
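
The sketch below expresses the two RMU losses under the notation above. The tensor names, the value of d_model, and the coefficients c and alpha are illustrative assumptions, not the reference implementation.

```python
import torch

def rmu_losses(acts_forget_unlearn, acts_retain_unlearn, acts_retain_frozen,
               u, c=6.5, alpha=1200.0):
    """Activation tensors: (batch, seq_len, d_model) layer-L activations.
    u: fixed random control vector (d_model,); c, alpha: illustrative hyperparameters (assumption)."""
    # Forget loss: mean over tokens of || phi_L(.; theta_hat) - c*u ||_2^2 on the forget set.
    l_forget = ((acts_forget_unlearn - c * u) ** 2).sum(dim=-1).mean()
    # Retain loss: keep activations on the retain set close to the frozen model's activations.
    l_retain = ((acts_retain_unlearn - acts_retain_frozen) ** 2).sum(dim=-1).mean()
    return l_forget + alpha * l_retain

# The control vector u is sampled once and kept fixed throughout unlearning.
d_model = 2304                     # hidden size of Gemma-2-2B (assumption for illustration)
u = torch.rand(d_model)
u = u / u.norm()
```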

1.2 Weaknesses of RMU

Recent work by Łucki et al. [] and Arditi et al. [] has identified a critical vulnerability in RMU. By forcing representations towards a single target vector, the method encourages the optimizer to find a simple, “lazy” solution: it learns a single direction in activation space to conceal the forgotten knowledge. An attacker can subsequently identify this direction, project it out, and recover the supposedly unlearned information. This failure mode is termed shallow unlearning [].

Accuracy Recovery in RMU
For example, Arditi et al. recover up to ~71% of the performance gap between the RMU-unlearned model and the original model on WMDP-Bio by ablating a single direction. On WMDP-Cyber, this directional ablation recovers ~45% of the gap. Łucki et al. demonstrate similar accuracy recovery using the same method, which they instead term Orthogonalization.

2. An Empirical Motivation

Warning

To ground the above idea, we first conducted an empirical investigation into the mechanics of shallow unlearning on a Gemma-2-2B model. This exploration, initially part of our thesis, began with creating a tool to detect unlearning, then isolated the underlying mechanism, and finally deconstructed that mechanism to understand its origins. This journey revealed critical limitations in post-hoc analysis and directly motivated the adversarial framework presented in the main body of this note.

2.1 Quantifying “Junk” with a Linear Probe

Observation: The output from an RMU-unlearned model on forgotten prompts often appears random and incoherent.

  • Prompt: By what mechanism(s) do modern Beijing strains of Mycobacterium tuberculosis have higher virulence than ancient Beijing strains?
  • Original Model: The answer is that the modern strains have acquired a gene that encodes a protein that is a virulence factor…
  • RMU Model: 1. They are more virulent than ancient Beijing strains. 2. They are less virulent than ancient Beijing strains. 3. They are more virulent than…

So our initial challenge was to quantitatively measure when a model was “unlearning” in a shallow way. We observed that RMU-tuned models often produce incoherent or evasive responses, which we will refer to as “junk” outputs for the remainder of this analysis. We hypothesized that this behavior could be identified from the model’s internal activations.

To test this, we framed the problem as a binary classification task: training a simple linear classifier—a linear probe []—on the model’s residual stream activations to predict whether the final output would be coherent or junk.

  • Probe Training: We constructed a dataset of 2,000 prompts, combining 1,000 questions from the WMDP benchmark with 1,000 general-purpose samples from Alpaca []. We used a larger LLM (Gemini-2.5-Flash) to label the Gemma-2-2B model's outputs as either junk (Label 1) or non-junk (Label 0). After training a probe on the final-token activations for each layer of the model, we found that the probe at Layer 12 achieved the highest performance (ROC-AUC > 0.7), confirming that a non-trivial, linearly decodable signal for this behavior exists within the model's representations. (A minimal training sketch is shown after this list.)
Best Probe Performance
  • Limitations of the Linear Probe Approach: While the probe was effective, this process revealed several fundamental limitations that made it unsuitable as a robust research tool:
    • Scalability: This approach is model-specific and requires a new, labeled dataset and training process for every model architecture under investigation.
    • Subjectivity and Bias: The definition of a "junk" output is subjective and relies on the labeling criteria of an external LLM. This introduces an unquantifiable bias into the probe's concept of evasion.
    • Misalignment of Objectives: The probe identifies a correlate of unlearning (incoherent output), but this is not necessarily the mechanism of unlearning. The true goal of an attacker is to recover information, not simply to detect gibberish. This proxy is therefore indirect and potentially fragile.
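
For reference, here is a minimal probe-training sketch. The array names X and y, the scikit-learn logistic-regression probe, and the use of class_weight="balanced" (a simpler stand-in for the weighted resampling described in Appendix B) are assumptions for illustration, not the exact code from the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def train_probe(X: np.ndarray, y: np.ndarray, seed: int = 0):
    """X: (n_samples, d_model) final-token activations at one layer; y: 1 = junk, 0 = non-junk."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed, stratify=y)
    scaler = StandardScaler().fit(X_tr)                # z-score with training statistics only
    probe = LogisticRegression(max_iter=1000, class_weight="balanced")  # counter the label imbalance
    probe.fit(scaler.transform(X_tr), y_tr)
    auc = roc_auc_score(y_te, probe.predict_proba(scaler.transform(X_te))[:, 1])
    return probe, scaler, auc

# Repeat per layer and keep the layer with the highest ROC-AUC (Layer 12 in our runs).
```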

This realization was a critical turning point. A post-hoc, supervised tool was too brittle. We needed a more direct method to find the underlying mechanism itself.

2.2 Isolating the Junk Direction

Having established a quantitative proxy for junk behavior (the probe's score), we sought to isolate the direction in activation space responsible for inducing this behavior. Following the methodology of Arditi et al. [], we applied the same procedure but used the junk score as the metric instead of the refusal score. This process yielded a candidate vector at each layer and token position, which we term the junk direction (see Appendix C for methodological details).

Our analysis revealed a key insight: the junk directions extracted from the WMDP-Bio and WMDP-Cyber datasets were highly aligned, exhibiting a cosine similarity greater than 0.8 at many layers.

High cosine similarity between junk directions derived from Biology and Cybersecurity prompts.
Cosine similarity between the junk directions extracted independently from WMDP-Bio and WMDP-Cyber prompts. The strong alignment suggests a shared, domain-general mechanism for shallow unlearning.

This suggested that the model was not learning separate unlearning mechanisms for each topic but was instead relying on a single, shared mechanism. We validated this by creating a shared direction, averaging the Bio and Cyber directions.

Ablating this single shared direction from the model’s activations during inference led to a significant recovery of the supposedly forgotten knowledge, restoring over 50% of the performance gap on WMDP-Bio and 32% on WMDP-Cyber. This confirmed that this single, shared direction was a causally significant component of the shallow unlearning mechanism.

Accuracy Recovery in RMU
Accuracy and percentage of performance recovery on a Gemma-2-2B model from our thesis work. 'Loại bỏ' translates to 'Removal' or 'Ablation'.
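
A minimal sketch of this kind of directional ablation is shown below. The hook-based implementation, the coverage of every decoder layer, and the tensor names are assumptions for illustration rather than the exact code used in the thesis experiments.

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Remove the component along `direction` from a decoder layer's output hidden states."""
    direction = direction / direction.norm()
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output      # (batch, seq, d_model)
        coeff = hidden @ direction                                       # projection coefficients
        hidden = hidden - coeff.unsqueeze(-1) * direction                # project out r_shared
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def ablate_everywhere(model, direction):
    """Register the ablation hook on every decoder layer; returns handles for later removal."""
    return [layer.register_forward_hook(make_ablation_hook(direction))
            for layer in model.model.layers]

# handles = ablate_everywhere(rmu_model, r_shared)   # run the WMDP evaluation, then:
# for h in handles: h.remove()
```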

2.3 Deconstructing the Direction with Feature Analysis

We had isolated a direction. The final question was: what is this direction made of? What concepts does it represent? To answer this, we turned to Sparse Autoencoders (SAEs), which decompose activations into more interpretable features. Using the Relative Gradient technique from Daniel et al. [], we identified which SAE features most contributed to forming the shared junk direction (see Appendix D for methodological details).

  • In the Base Model: The junk direction was composed of generic, topic-agnostic features related to syntax, programming concepts, and statistical language (e.g., “C++ memory management,” “statistical terms”).
Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the base model
  • In the RMU-Unlearned Model: The composition of the direction fundamentally changed. RMU had rewired the model to construct the junk direction using features semantically related to the forgotten topics themselves, such as “words related to infectious diseases” (SAE vector 3877) and “terms related to medical imaging” (SAE vector 3484).
Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU model
Details of SAE feature 3484 on Neuronpedia

This was the crucial insight. Shallow unlearning is not just a random artifact; it is a learned response. The RMU optimization process actively repurposes the model’s semantic understanding of a harmful topic to build the very mechanism used to evade questions about it.

Cosine similarity between SAE vectors and the junk gradient on WMDP-Bio for the RMU model after ablating the junk direction

Next, although the activated features after ablating the shared junk direction $\mathbf{r}_{\text{shared}}$ now largely resemble those of the base model, a few features still encode biological knowledge. This provides evidence that the junk direction is not the sole mechanism by which RMU operates.

Some biology-related features include:

  • 8085: “terms related to viral infections and drug resistance”.
  • 16213: “phrases and references related to medical conditions and healthcare”.

Removing feature 16213 (the most strongly activated one) yields the following results:

| Model | WMDP-Bio | WMDP-Cyber | MMLU |
| --- | --- | --- | --- |
| Remove $\mathbf{r}_{\text{shared}}$ | 50.59 | 32.16 | 47.34 |
| Remove feature 16213 after removing $\mathbf{r}_{\text{shared}}$ | 51.68 | 33.71 | 48.08 |

3. Proposed Method

Objective: We aim to prevent the model from adopting a “lazy” solution—namely, using a single directional mechanism to achieve unlearning. Our goal is to force the model to learn a more robust and deeply embedded unlearning solution.

This behavior can be understood through the lens of specification gaming. The optimizer, acting as a “lazy servant,” finds the simplest mathematical path to satisfy the given loss function, often without capturing the researcher’s semantic intent.

  • Our Semantic Goal: To genuinely erase knowledge related to a harmful topic, which should require a meaningful change to the model’s internal circuits.
  • The Loss Function's Goal: To make the activation vector $\phi_L(\cdot\,; \hat{\theta})$ diverge from its original state by redirecting it towards a random vector $\mathbf{u}$.

Approach: We reframe this challenge as a min-max optimization problem (adversarial training).

  • Attacker: The attacker's goal is to find the "laziest" direction, denoted $\mathbf{r}$, which, when added to an activation vector, is most effective at achieving the unlearning objective (which we model as maximizing output entropy).
  • Defender: The defender updates the model weights $\hat{\theta}$ to steer activation vectors away from this adversarial direction $\mathbf{r}$. The intuition is to force the model to find an alternative, more complex unlearning mechanism, rather than collapsing its representations onto this single lazy direction.
Danger (Key Assumptions and Caveats)

Our approach rests on two key assumptions that require further investigation:

  • Is maximizing entropy the correct adversarial objective? We use output entropy as a proxy for finding the “laziest” unlearning direction. However, a more direct objective for an attacker might be to find a direction that recovers the forgotten information most effectively. Our proxy may not perfectly align with the true adversarial goal.
  • Is the "lazy" solution a single direction? The shallow unlearning mechanism may not be confined to a single vector but could exist within a low-dimensional subspace. Our method, by targeting a single direction $\mathbf{r}$, might be insufficient to neutralize a multi-dimensional "lazy subspace," as hinted at by our experiments with the shared junk direction.

3.1 Attacker Optimization

Following Yuan et al. [], maximizing output entropy is equivalent to minimizing the KL divergence between the model's output distribution and the uniform distribution $\mathcal{U}_{[K]}$, where $K$ is the vocabulary size.

Let $\mathbf{r}$ be a perturbation vector (the candidate "lazy" direction). The perturbed logits are:

$$\tilde{z}_t = g_L\!\left(\phi_L(x_{1:t}; \hat{\theta}) + \alpha \hat{\mathbf{r}}\right)$$

where $\hat{\mathbf{r}}$ is the L2-normalized version of $\mathbf{r}$.

The attacker's problem is:

$$\begin{aligned} \min_{\mathbf{r}} \quad & \frac{1}{T} \sum_{t=1}^T \mathrm{KL}\left(\tilde{p}_t \,\|\, \mathcal{U}_{[K]}\right) \\ \textrm{s.t.} \quad & \|\mathbf{r}\|_2 = 1 \end{aligned}$$

The constraint $\|\mathbf{r}\|_2 = 1$ restricts the search to a direction rather than a magnitude. Here, $T$ is the number of tokens and $\tilde{p}_t = \mathrm{softmax}(\tilde{z}_t)$.
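
A minimal sketch of this attacker loop follows. Here `g_L` is assumed to be a callable that runs the remainder of the model from layer-L activations to logits, `acts` is assumed to hold detached forget-set activations, and the optimizer, step count, and alpha are illustrative choices.

```python
import math
import torch
import torch.nn.functional as F

def attack_direction(g_L, acts, alpha=8.0, steps=200, lr=1e-2):
    """g_L: callable mapping layer-L activations to logits; acts: phi_L on forget prompts,
    shape (batch, seq, d_model), detached. alpha, steps, lr are illustrative choices."""
    d_model = acts.shape[-1]
    r = torch.randn(d_model, requires_grad=True)
    opt = torch.optim.Adam([r], lr=lr)
    for _ in range(steps):
        r_hat = r / r.norm()                      # optimize a direction, not a magnitude
        logits = g_L(acts + alpha * r_hat)        # perturbed logits z_tilde
        log_p = F.log_softmax(logits, dim=-1)
        K = logits.shape[-1]
        # KL(p_tilde || U_[K]) = log K - H(p_tilde); minimizing it maximizes output entropy.
        kl = (log_p.exp() * log_p).sum(dim=-1).mean() + math.log(K)
        opt.zero_grad()
        kl.backward()
        opt.step()
    return (r / r.norm()).detach()
```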

3.2 Defender Optimization

Given the adversarial direction $\mathbf{r}^\star$ found by the attacker, the defender's objective is to force the model's activations on forget-set data to be orthogonal to this direction. This is achieved by minimizing the squared inner product, effectively projecting the activations into the null space of $\mathbf{r}^\star$:

$$\mathcal{L}_{\text{def}} = \mathbb{E}_{x^f \sim \mathcal{D}_f}\left[ \frac{1}{T_f} \sum_{t=1}^{T_f} \left\langle \phi_L(x^f_{1:t}; \hat{\theta}), \hat{\mathbf{r}}^\star \right\rangle^2 \right]$$

This defense loss acts as a regularizer on the original RMU objective. The final forget loss is:

$$\mathcal{L}'_{\text{forget}} = \mathcal{L}_{\text{forget}} + \lambda_{\text{def}} \mathcal{L}_{\text{def}}$$
Note (Why Regularize Instead of Replace?)

We propose using $\mathcal{L}_{\text{def}}$ as a regularizer rather than a complete replacement for the original $\mathcal{L}_{\text{forget}}$. Our rationale is that the original RMU objective provides a useful inductive bias: it disrupts the original activations, encouraging the model to rewire its internal circuits.

However, empirical evidence shows it often achieves this via a simplistic, "lazy" mechanism. Our adversarial regularizer, $\mathcal{L}_{\text{def}}$, acts as a targeted penalty against this specific lazy solution. Replacing the loss entirely would remove the "disruption" signal, only telling the optimizer what not to do (i.e., "steer away from the high-entropy direction") without providing a constructive unlearning objective.
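
For concreteness, a minimal sketch of the defender regularizer $\mathcal{L}_{\text{def}}$ follows; the tensor names and the batch/token averaging are assumptions consistent with the expectation in Section 3.2.

```python
import torch

def defense_loss(acts_forget: torch.Tensor, r_star: torch.Tensor) -> torch.Tensor:
    """acts_forget: (batch, seq, d_model) activations phi_L on D_f; r_star: (d_model,) attacker direction."""
    r_hat = (r_star / r_star.norm()).detach()     # the attacker's direction is fixed for this step
    inner = acts_forget @ r_hat                   # (batch, seq) inner products
    return (inner ** 2).mean()                    # average over tokens (and batch)

# Combined objective (lambda_def is an illustrative weight):
# loss_forget_prime = loss_forget + lambda_def * defense_loss(acts_forget, r_star)
```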

3.3 What's next?

A critical next step is to empirically verify whether the shallow unlearning phenomenon persists in more advanced RMU variants, such as Adaptive-RMU [], RNA-RMU [], and LAT-RMU []. Our adversarial framework should be benchmarked against these methods to assess its relative robustness.

4. Experiments

Important

This research is currently at the conceptual stage, and I plan to carry out empirical validation in the near future as soon as the necessary computational resources are available.

5. Acknowledgement

Thanks to @arditi for insightful discussions regarding high-entropy output distributions and the mechanics of refusal directions in RMU, and to @longphan for support with data and the baseline RMU implementation. Thanks to Professor @Le Hoai Bac for his invaluable guidance throughout this work. I also used Google's @Gemini-2.5-Pro to help refine the grammar and clarity of this note. Finally, I thank @mom and @dad for providing financial support and a warm place to sleep, enabling me to pursue this research.

Citation

@misc{ln2025rmu,
author={Nguyen Le},
title={Finding and Fighting the Lazy Unlearner: An Adversarial Approach},
year={2025},
url={https://lenguyen.vercel.app/note/rmu-improv}
}

References

  1. The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning, Li, Nathaniel and Pan, Alexander and Gopal, Anjali and Yue, Summer and Berrios, Daniel and Gatti, Alice and Li, Justin D. and Dombrowski, Ann-Kathrin and Goel, Shashwat and Mukobi, Gabriel and Helm-Burger, Nathan and Lababidi, Rassin and Justen, Lennart and Liu, Andrew Bo and Chen, Michael and Barrass, Isabelle and Zhang, Oliver and Zhu, Xiaoyuan and Tamirisa, Rishub and Bharathi, Bhrugu and Herbert-Voss, Ariel and Breuer, Cort B and Zou, Andy and Mazeika, Mantas and Wang, Zifan and Oswal, Palash and Lin, Weiran and Hunt, Adam Alfred and Tienken-Harder, Justin and Shih, Kevin Y. and Talley, Kemper and Guan, John and Steneker, Ian and Campbell, David and Jokubaitis, Brad and Basart, Steven and Fitz, Stephen and Kumaraguru, Ponnurangam and Karmakar, Kallol Krishna and Tupakula, Uday and Varadharajan, Vijay and Shoshitaishvili, Yan and Ba, Jimmy and Esvelt, Kevin M. and Wang, Alexandr and Hendrycks, Dan Proceedings of the 41st International Conference on Machine Learning,
    PMLR, 2024
    https://proceedings.mlr.press/v235/li24bc.html
  2. An Adversarial Perspective on Machine Unlearning for AI Safety, Jakub Łucki and Boyi Wei and Yangsibo Huang and Peter Henderson and Florian Tramèr and Javier Rando
    Transactions on Machine Learning Research, 2025
    https://openreview.net/forum?id=J5IRyTKZ9s
  3. Unlearning via RMU is mostly shallow, Andy Arditi, LessWrong,
    2024
    https://www.lesswrong.com/posts/6QYpXEscd8GuE7BgW/unlearning-via-rmu-is-mostly-shallow
  4. On effects of steering latent representation for large language model unlearning, Huu-Tien, Dang and Pham, Tin and Thanh-Tung, Hoang and Inoue, Naoya Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence,
    AAAI Press, 2025
    https://doi.org/10.1609/aaai.v39i22.34544
  5. Improving LLM Unlearning Robustness via Random Perturbations, Dang Huu-Tien and Hoang Thanh-Tung and Anh Bui and Le-Minh Nguyen and Naoya Inoue
    2025
    https://arxiv.org/abs/2501.19202
  6. Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, Abhay Sheshadri and Aidan Ewart and Phillip Guo and Aengus Lynch and Cindy Wu and Vivek Hebbar and Henry Sleight and Asa Cooper Stickland and Ethan Perez and Dylan Hadfield-Menell and Stephen Casper
    2025
    https://arxiv.org/abs/2407.15549
  7. A Closer Look at Machine Unlearning for Large Language Models, Xiaojian Yuan and Tianyu Pang and Chao Du and Kejiang Chen and Weiming Zhang and Min Lin The Thirteenth International Conference on Learning Representations,
    2025
    https://openreview.net/forum?id=Q1MHvGmhyT
  8. Refusal in Language Models Is Mediated by a Single Direction, Andy Arditi and Oscar Balcells Obeso and Aaquib Syed and Daniel Paleka and Nina Rimsky and Wes Gurnee and Neel Nanda The Thirty-eighth Annual Conference on Neural Information Processing Systems,
    2024
    https://openreview.net/forum?id=pH3XAQME6c
  9. Measuring Massive Multitask Language Understanding, Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  10. Aligning AI With Shared Human Values, Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt
    Proceedings of the International Conference on Learning Representations (ICLR), 2021
  11. Simple probes can catch sleeper agents, Monte MacDiarmid and Timothy Maxwell and Nicholas Schiefer and Jesse Mu and Jared Kaplan and David Duvenaud and Sam Bowman and Alex Tamkin and Ethan Perez and Mrinank Sharma and Carson Denison and Evan Hubinger
    2024
    https://www.anthropic.com/news/probes-catch-sleeper-agents
  12. Stanford Alpaca: An Instruction-following LLaMA model, Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto
    GitHub repository, GitHub, 2023
  13. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal, Mantas Mazeika and Long Phan and Xuwang Yin and Andy Zou and Zifan Wang and Norman Mu and Elham Sakhaee and Nathaniel Li and Steven Basart and Bo Li and David Forsyth and Dan Hendrycks
    2024
  14. Catastrophic jailbreak of open-source llms via exploiting generation, Huang, Yangsibo and Gupta, Samyak and Xia, Mengzhou and Li, Kai and Chen, Danqi
    arXiv preprint arXiv:2310.06987, 2023
  15. Universal and Transferable Adversarial Attacks on Aligned Language Models, Andy Zou and Zifan Wang and J. Zico Kolter and Matt Fredrikson
    2023
  16. Detecting Strategic Deception Using Linear Probes, Nicholas Goldowsky-Dill and Bilal Chughtai and Stefan Heimersheim and Marius Hobbhahn
    2025
    https://arxiv.org/abs/2502.03407
  17. Finding Features Causally Upstream of Refusal, Daniel Lee, LessWrong,
    https://www.lesswrong.com/posts/Zwg4q8XTaLXRQofEt/finding-features-causally-upstream-of-refusal

Appendix

A. Dataset

  • The WMDP (Weapons of Mass Destruction Proxy) dataset [] is a multiple-choice benchmark consisting of one question and four answer options per item, designed to measure hazardous knowledge in large language models (LLMs) across domains such as biology, cybersecurity, and chemistry. The dataset comprises 3,668 expert-developed questions intended as a representative metric for evaluating and mitigating risks of malicious misuse.

  • In biology (referred to as WMDP-Bio), the dataset contains 1,273 questions. These include biotechnology-related risks such as historical instances of bioweapons, pathogens with pandemic potential, construction-related issues (e.g., reverse engineering of viruses), and experimental aspects (e.g., assays for viral properties).

  • In cybersecurity (referred to as WMDP-Cyber), the dataset includes 1,987 questions structured around the stages of a cyberattack: reconnaissance, exploitation, attack, and post-exploitation. WMDP-Cyber evaluates a model’s ability to understand concepts and techniques related to vulnerability discovery, exploit development, and the use of popular offensive frameworks such as Metasploit and Cobalt Strike.

  • Model performance will be evaluated in terms of accuracy on WMDP multiple-choice questions (restricted to WMDP-Bio and WMDP-Cyber). A model trained with unlearning should exhibit reduced accuracy on WMDP while maintaining accuracy on unrelated benchmarks (in the cited work, this is the MMLU dataset [] []). We refer to this overall evaluation as the WMDP Benchmark.

B. More on Probe

Below are the data statistics for probe training:

| Dataset | Total Samples | Label 1 (Junk) | Label 0 (Non-Junk) |
| --- | --- | --- | --- |
| $\mathcal{D}_{\text{train}}$ | 1280 | 1016 | 264 |
| $\mathcal{D}_{\text{val}}$ | 320 | 258 | 62 |
| $\mathcal{D}_{\text{test}}$ | 400 | 324 | 76 |

The dataset is highly imbalanced. This is understandable given that the base model, Gemma-2-2B, is relatively small. For many challenging (harmful or harmless) questions, the model fails to provide an answer, which is then labeled as junk. With a larger model, this imbalance would likely be reduced.

During training, we apply z-score normalization using the mean $\mu_{\text{train}}$ and standard deviation $\sigma_{\text{train}}$ computed on the training set, and reuse these statistics to normalize the validation and test sets []. To mitigate the imbalance, weighted resampling with replacement is applied, giving higher weight to the minority (non-junk) class.

Once the best probe is selected, it is used to assign a junk score to activations as:

$$\begin{aligned} \hat{\mathbf{x}} &= \frac{\mathbf{x} - \mu_{\text{train}}}{\sigma_{\text{train}}} \\ \text{junk\_score}(\mathbf{x}) &= \mathbf{W}_{\text{best}} \hat{\mathbf{x}} + \mathbf{b}_{\text{best}} \end{aligned}$$

where $\mathbf{x}$ is any activation vector, and $(\mathbf{W}_{\text{best}}, \mathbf{b}_{\text{best}})$ are the weights and bias of the best-performing probe.
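
A small sketch of this scoring function follows, with W_best, b_best, mu_train, and sigma_train assumed to come from the selected probe and the training-set statistics above.

```python
import torch

def junk_score(x, W_best, b_best, mu_train, sigma_train):
    """x: (d_model,) or (batch, d_model) activations; returns the raw linear probe score."""
    x_hat = (x - mu_train) / sigma_train          # z-score with training-set statistics
    return x_hat @ W_best + b_best                # W_best . x_hat + b_best, batched
```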

C. Isolating the Junk Direction Methodology

C.1 Dataset

Let $\mathcal{D}$ denote the dataset used for extracting junk directions. This dataset is split into a training set $\mathcal{D}_{\text{train}}$ and a validation set $\mathcal{D}_{\text{val}}$. Specifically, $\mathcal{D}$ is obtained by sampling from the WMDP dataset, which consists of WMDP-Bio and WMDP-Cyber, corresponding to $\mathcal{D}^{\text{bio}}$ and $\mathcal{D}^{\text{cyber}}$ (but collectively referred to as $\mathcal{D}$).

The sampling process follows these conditions:

  • Samples must not overlap with the training dataset of the linear probe (the junk detector).
  • Samples must not contain questions with the word "which". Such questions typically require the answer options as context; used in isolation, they degrade model performance and sometimes lead to empty or anomalous outputs.
  • The maximum question length is limited to 1024 characters. This restriction arises both from resource constraints and from the observation that including overly long questions does not affect the final results.

After applying these conditions, 300 samples are drawn from each of WMDP-Bio and WMDP-Cyber, for a total of 600 samples. These are then split into $\mathcal{D}_{\text{train}}$ and $\mathcal{D}_{\text{val}}$ with an 80%/20% ratio.

In addition, we incorporate out-of-distribution (OOD) data, denoted $\mathcal{D}^{\text{ood}}$. This dataset combines three harmful-knowledge benchmarks: AdvBench [], MaliciousInstruct [], and HarmBench []. Unlike WMDP, which is domain-specific to biology and cybersecurity, these datasets contain more general harmful instructions.

C.2 Direction Extraction via Difference-in-Means

Our investigation is built on a central hypothesis: the performance degradation observed in an RMU-unlearned model is not solely due to diffuse changes in its weights. Rather, we posit that RMU instills a specific, learned mechanism.

Causal Hypothesis: For a harmful input, the RMU model adds a specific vector, the junk direction $\mathbf{r}_{\text{junk}}$, to the residual stream at a specific layer. This intervention is the primary cause of the model's junk, low-quality output and the corresponding drop in accuracy on harmful benchmarks.

To extract junk directions, we employ the same method as in []. The junk direction $\mathbf{r}_{\text{junk}}$ is an unobservable latent variable. We approximate it by contrasting the internal states of the RMU model and the base model on the same set of harmful prompts. The intuition is that on these inputs, the base model will produce coherent activations, while the RMU model's activations will be altered by the junk mechanism.

We use the Difference-in-Means [] method to compute this contrast. The junk direction $\mathbf{r}_{l}^{(i)}$ at layer $l$ and token position $i$ is calculated as the difference between the expected activations of the two models over the harmful dataset:

$$\mathbf{r}_l^{(i)} = \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}\left[\phi_l(x_{1:i}; \hat{\theta})\right] - \mathbb{E}_{x \in \mathcal{D}_{\text{train}}}\left[\phi_l(x_{1:i}; \theta)\right]$$

This process is performed for each layer $l \in [0, L]$ and for the final five token positions $i \in [-5, -1]$ of the input prompts, yielding a set of candidate directions.
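
A minimal sketch of this difference-in-means computation is given below; the dictionary layout of the cached activations is an assumption made for illustration.

```python
import torch

def junk_directions(acts_rmu, acts_base, layers, positions=(-5, -4, -3, -2, -1)):
    """acts_rmu[l][i], acts_base[l][i]: (n_prompts, d_model) activations at layer l, position i.
    Returns candidate directions r_l^(i) = mean_rmu - mean_base for every (layer, position)."""
    directions = {}
    for l in layers:
        for i in positions:
            mean_rmu = acts_rmu[l][i].mean(dim=0)    # E_{x in D_train}[phi_l(x_{1:i}; theta_hat)]
            mean_base = acts_base[l][i].mean(dim=0)  # E_{x in D_train}[phi_l(x_{1:i}; theta)]
            directions[(l, i)] = mean_rmu - mean_base
    return directions
```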

Warning

The last five tokens (from $-5$ to $-1$) are selected for the following reasons:

  1. In the original study by Arditi [], eight final tokens were chosen and averaged. This choice was essentially arbitrary.
  2. In the study on refusal directions [], the models examined were instruction-tuned and employed structured prompts (e.g., user:<instruction>model:). Consequently, they selected tokens after the instruction (post-instruction), such as model:. However, the models in this thesis are base models without instruction fine-tuning, so such prompt structures cannot be used.
  3. This thesis hypothesizes that information is aggregated in the final tokens. Selecting only the very last token would be overly restrictive. Therefore, the final five tokens are used instead.

C.3 Causal Validation and Selection

From the set of candidate directions, we must identify the one that is most causally responsible for the junk behavior. We use a series of causal interventions, evaluated on a validation set Dval\mathcal{D}_{\text{val}}, using the junk score from the linear probe (Appendix B) as our metric.

  1. Ablation Intervention (Bypass): To test if removing a direction restores coherent behavior in the RMU model, we compute its bypass score. We intervene on the RMU model by projecting out the component of its activations that lies along the candidate direction $\mathbf{r}$. The optimal direction should maximally reduce the junk score.
$$\text{bypass}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \left\{ \text{do}\left( \phi_l(x_{1:i}; \hat{\theta}) \leftarrow \phi_l(x_{1:i}; \hat{\theta}) - \text{proj}_{\mathbf{r}^{(i)}_l}\,\phi_l(x_{1:i}; \hat{\theta}) \right) \right\}_{l=0}^{L} \right]$$
  2. Addition Intervention (Steer): To test if adding a direction induces junk behavior in the base model, we compute its steer score. We intervene by adding the direction to the base model's activations. A potent junk direction should significantly increase the junk score.
$$\text{steer}\left(\mathbf{r}^{(i)}_l\right) = \mathbb{E}_{x \in \mathcal{D}_{\text{val}}} \left[ y_{\text{junk}} \,\middle|\, \text{do}\left( \phi_l(x_{1:i}; \theta) \leftarrow \phi_l(x_{1:i}; \theta) + \alpha \mathbf{r}^{(i)}_l \right) \right]$$
  3. KL Divergence Sanity Check: A true junk direction should be specific to the RMU mechanism and not disrupt the base model's general knowledge. We verify this by ablating the direction from the base model and measuring the KL divergence between the original and perturbed output distributions. A low KL divergence indicates the direction is inert in the base model, confirming its specificity.

The best junk direction is selected as the candidate $\mathbf{r}_l^{(i)}$ that minimizes the bypass score, maximizes the steer score, and maintains a low KL divergence.
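
Below is a minimal sketch of the steer intervention and the KL sanity check. The hook structure mirrors the ablation sketch in Section 2.2, and the helper names are hypothetical; logits are assumed to be (batch, vocab) distributions produced by the base model with and without the candidate direction ablated.

```python
import torch
import torch.nn.functional as F

def make_steer_hook(direction: torch.Tensor, alpha: float = 8.0):
    """Add alpha * r to one layer's output hidden states (the 'steer' intervention on the base model)."""
    direction = direction / direction.norm()
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

def kl_sanity_check(logits_orig: torch.Tensor, logits_ablated: torch.Tensor) -> torch.Tensor:
    """KL(p_orig || p_ablated) on the base model; low values mean the direction is inert there."""
    log_p = F.log_softmax(logits_orig, dim=-1)     # target distribution (original model)
    log_q = F.log_softmax(logits_ablated, dim=-1)  # distribution after ablating the candidate direction
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```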

D. Relative Gradient Methodology

Let the best junk direction be denoted by $\mathbf{r}^{(L)}_I$, where $L$ is the layer and $I$ is the token position at which this junk direction is selected. Define:

$$J^{(l)}(\mathbf{t}) = \mathcal{M}^{(l)}_I(\mathbf{t}) \cdot \mathbf{r}^{(L)}_I$$

as the dot product between the residual stream vector of the model $\mathcal{M}$ at layer $l$ and token position $I$ (the same position where the junk direction is extracted), namely $\mathcal{M}^{(l)}_I(\mathbf{t})$, and the best junk direction $\mathbf{r}^{(L)}_I$ for input $\mathbf{t}$. Then, the junk gradient is defined as:

$$\nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}),$$

i.e., the derivative of $J^{(l)}$ with respect to the earlier residual streams ($l' < l$), at token position $i$ and input $\mathbf{t}$. Here, we choose $L = 10$ and $I = -5$ (the layer and token position of the best junk direction).

Intuition

Intuitively, the junk gradient indicates which directions are most responsible for producing the junk direction, and thus which features contribute to its formation. Following [], where refusal directions are hypothesized to originate in early layers, strengthen in middle layers, and finally guide refusals in later layers, the junk directions in this thesis are assumed to behave analogously.

Let $\mathcal{D}^{(r)}$ denote a dataset sampled from the remaining examples (disjoint from $\mathcal{D}$ and $\mathcal{D}^{(p)}$). For each of WMDP-Bio and WMDP-Cyber, we define $\mathcal{D}^{(r)}_{\text{bio}}$ and $\mathcal{D}^{(r)}_{\text{cyber}}$, respectively.

The aggregated junk gradients at layer $l < L$ are computed as:

$$\begin{aligned} \nabla_{\mathbf{x}} J^{(l)}_{\text{bio}} &= \frac{1}{|\mathcal{D}^{(r)}_{\text{bio}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{bio}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right), \\ \nabla_{\mathbf{x}} J^{(l)}_{\text{cyber}} &= \frac{1}{|\mathcal{D}^{(r)}_{\text{cyber}}|} \sum_{\mathbf{t} \in \mathcal{D}^{(r)}_{\text{cyber}}} \left( \sum_{i} \nabla_{\mathbf{x}} J^{(l)}_i(\mathbf{t}) \right). \end{aligned}$$

In this work, $l = 5$ is chosen empirically. This layer is neither too early (close to the embedding layer) nor too late (close to the output), making it a reasonable midpoint for analysis.

Finally, the cosine similarity is computed between these junk gradients and each decoder direction (row of the decoder weight matrix $W_{\text{dec}}$) of the SAE (Sparse Autoencoder) at layer $l$.
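
A minimal sketch of this relative-gradient computation follows. It assumes a Hugging Face-style model whose decoder layers can be hooked, takes the gradient at a single token position for simplicity (the full procedure sums over positions), and treats W_dec as an (n_features, d_model) tensor; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def junk_gradient_cosims(model, inputs, r_best, W_dec, l_up=5, l_dir=10, pos=-5):
    """r_best: junk direction at layer l_dir, position pos; W_dec: (n_features, d_model) SAE decoder."""
    acts = {}
    def grab(l):
        def hook(module, args, output):
            acts[l] = output[0] if isinstance(output, tuple) else output
        return hook
    handles = [model.model.layers[l].register_forward_hook(grab(l)) for l in (l_up, l_dir)]
    model(**inputs)                                  # forward pass with gradients enabled
    for h in handles:
        h.remove()
    acts[l_up].retain_grad()                         # keep the gradient on the earlier residual stream
    J = acts[l_dir][:, pos, :] @ r_best              # J^{(l)}(t): dot product with the junk direction
    J.sum().backward()
    grad = acts[l_up].grad[:, pos, :].mean(dim=0)    # junk gradient, averaged over the batch
    return F.cosine_similarity(W_dec, grad.unsqueeze(0), dim=-1)   # one similarity per SAE feature
```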