
Residual Stream is Key to Transformer Interpretability

June 21, 2025
18 min read
Warning

This article was originally written in Vietnamese. The following is an English translation created with the assistance of Gemini-2.5-Pro to make the content accessible to a broader audience.

Also, this post represents my personal notes and best effort to understand and explain the deep concepts from the foundational paper, “A Mathematical Framework for Transformer Circuits” [1]. Some of these ideas are complex and non-intuitive, and this is my attempt to make sense of them (though not always successfully).

I. A High-Level Overview

High-Level Architecture of a Transformer
A high-level view of the Transformer architecture, emphasizing the residual stream.

A Transformer model processes information in a sequence. It begins with token embedding, where an input token $t$ (represented as a one-hot vector) is mapped to an embedding vector $x_0$ via an embedding matrix $W_E$. This vector then passes through a series of residual blocks. Finally, the output of the last block, $x_L$, undergoes token unembedding, where it is mapped to a vector of logits $L(t)$ via an unembedding matrix $W_U$.

Each residual block (or Transformer block) consists of an attention layer followed by an MLP layer. Both layers read their input from the residual stream—the central pathway carrying the vectors $x_i, x_{i+1}, \dots$—and subsequently write their results back to it. This write operation is performed via a residual connection: $x_{i+1} = x_i + \text{layer}(x_i)$.
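To make this picture concrete, here is a minimal Python sketch of that forward pass. The function name, the shapes, and the stand-in `attn`/`mlp` callables are all hypothetical; the sketch only illustrates how every block reads from and adds back to the same stream.

```python
import numpy as np

def transformer_forward(token_id, W_E, W_U, blocks):
    """Hypothetical sketch: embed -> residual blocks -> unembed."""
    x = W_E[token_id]                 # token embedding: x_0
    for attn, mlp in blocks:          # one residual block = attention + MLP
        x = x + attn(x)               # the attention layer reads x and writes back
        x = x + mlp(x)                # the MLP layer does the same
    return x @ W_U                    # token unembedding: logits L(t)

# toy usage: vocabulary of 10 tokens, d_model = 4, two placeholder blocks
rng = np.random.default_rng(0)
W_E, W_U = rng.normal(size=(10, 4)), rng.normal(size=(4, 10))
blocks = [(lambda x: 0.1 * x, lambda x: 0.1 * x)] * 2
logits = transformer_forward(3, W_E, W_U, blocks)
print(logits.shape)                   # (10,)
```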

II. The Residual Stream as a Communication Channel

If we conceptualize a Transformer as a complex computational device, the residual stream is one of its most critical components (analogous to the cell state in LSTMs or the skip connections in ResNets). At its core, the residual stream is simply the cumulative sum of the outputs from all preceding layers, added to the initial token embedding.

Intuition (Communication Channel Analogy)

We can view the residual stream as a communication channel because it does not perform complex, non-linear computations itself (unlike an MLP’s matrix multiplications and activations). Instead, it serves as a shared medium through which all components (attention heads, MLP layers) communicate. They read information from the stream, process it, and write new information back for subsequent layers to access.

A defining feature of the Transformer’s residual stream is its linear and additive structure. This is a key difference from architectures like ResNet, where a non-linear activation (e.g., ReLU) is typically applied to the sum after each block, so the skip path is not purely linear from end to end. Each layer in a Transformer block reads its input from the stream via linear transformations. Similarly, it writes its output back to the stream, typically after another linear transformation.

Remark (Linear Transformations in Practice)
  • Attention Layer: To process an input vector $x_i$ from the residual stream, the layer projects it into query, key, and value vectors using the weight matrices $W_Q, W_K, W_V$. These are the “read” transformations. After computing the attention head’s output, this result is projected back into the residual stream’s dimension via the output matrix $W_O$. This is the “write” transformation.
  • MLP Layer: A standard Transformer MLP consists of two linear transformations with a non-linearity between them. The first linear layer, $W_{in}$, reads from the residual stream. The second, $W_{out}$, projects the result of the activated hidden layer back into the stream. The full operation is $W_{out}(\text{GELU}(xW_{in}))$; a minimal numerical sketch of this read/write pattern follows below.
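As a small illustration of the MLP’s read/write pattern, the sketch below applies a toy MLP update to a single residual-stream vector. The shapes and the tanh approximation of GELU are assumptions for the example, not the weights of any particular model.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

rng = np.random.default_rng(0)
d_model, d_mlp = 8, 32                        # hidden layer is typically 4 x d_model

x     = rng.normal(size=(d_model,))           # one vector from the residual stream
W_in  = rng.normal(size=(d_model, d_mlp))     # linear "read" from the stream
W_out = rng.normal(size=(d_mlp, d_model))     # linear "write" back to the stream

mlp_output = gelu(x @ W_in) @ W_out           # W_out(GELU(x W_in))
x_next = x + mlp_output                       # written back via the residual connection
```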

One of the most profound consequences of the stream’s linear, additive nature is that it lacks a privileged basis 1. This implies that we can rotate the vector space of the residual stream—and correspondingly rotate the matrices that interact with it—without changing the model’s computational behavior.

1. Understanding the Privileged Basis

Definition (Basis)

For an $N$-dimensional vector space $V$, a basis is a set of $N$ linearly independent vectors $\{b_1, \dots, b_N\}$ such that any vector in $V$ can be uniquely expressed as a linear combination of these basis vectors. In the context of neural networks, the hidden state space is a vector space of dimension $d_{model}$, and the most common basis is the standard basis $\{e_1, \dots, e_N\}$, where each $e_i$ is a vector of zeros with a 1 in the $i$-th position. Each standard basis vector corresponds to the activation of a single neuron.

The concept of a “privileged basis” is defined in 1 as follows:

“A privileged basis occurs when some aspect of a model’s architecture encourages neural network features to align with basis dimensions, for example because of a sparse activation function such as ReLU.”

Let’s dissect this with an example.

Definition (Feature)

Following the work of Olah et al. [2], a feature is a meaningful property that a neural network learns to detect in its input. Rather than thinking of a feature as a single neuron, it’s more accurate to consider it a concept or direction in activation space. Examples could include “is a vertical edge,” “is a proper noun,” or “carries a positive sentiment.”

The critical distinction is how these features are represented:

  • In a privileged basis, features tend to align with the basis vectors themselves. This means a single feature might be represented by the activation of a single neuron (or a very small, sparse set of neurons).
  • In a non-privileged basis, a feature is typically represented by a dense linear combination of many neurons. The feature exists as a direction in activation space that is not aligned with any of the standard basis vectors.

Consider a simple MLP trained to classify shapes. Let’s focus on a hidden layer $L$ with $N=4$ neurons, whose activation space is $\mathbb{R}^4$. The standard basis vectors are $e_1=[1,0,0,0], \dots, e_4=[0,0,0,1]$.

  • Case 1: No Sparse Activation (e.g., Linear Layer)

    • The feature for “square” might be represented by the dense vector [1.2, -0.9, 0.8, 1.1].
    • The feature for “triangle” might be represented by [-0.8, -1.1, 1.3, -0.9].
    • Here, each feature is a complex combination of all four neurons. No single neuron is the “square detector.” The features are not aligned with the basis vectors. This is a non-privileged basis.
  • Case 2: With a Sparse Activation (e.g., ReLU)

    • The pre-activation vector for “square,” [1.2, -0.9, 0.8, 1.1], becomes [1.2, 0, 0.8, 1.1] after passing through ReLU. If the network further learns to isolate features, this might evolve into a sparser representation like [1.5, 0, 0, 0]. Now, this vector is perfectly aligned with the first basis vector, $e_1$. We can confidently say that Neuron 1 has learned to detect squares.
    • Similarly, “triangle” might activate Neuron 3, becoming [0, 0, 1.3, 0].
    • Because of the ReLU activation (or, more generally, because “some aspect of a model’s architecture encourages neural network features to align with basis dimensions”), features tend to align with individual neurons. Consequently, one can confidently ascribe responsibility to specific neurons, for example designating a given neuron as a “square detector.” This alignment makes the representation more interpretable, and such a coordinate system is referred to as a privileged basis.
    • In addition, ReLU contributes both non-linearity and sparsity to the activation vectors (since negative activations are zeroed out), which further reinforces this interpretable structure; a tiny numerical sketch of this effect follows below.
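Below is a tiny numerical sketch of this effect, using the toy activations from the example above; the numbers are purely illustrative.

```python
import numpy as np

# hypothetical pre-activations of the 4-neuron hidden layer for two inputs
square_pre   = np.array([ 1.2, -0.9,  0.8,  1.1])
triangle_pre = np.array([-0.8, -1.1,  1.3, -0.9])

relu = lambda z: np.maximum(z, 0.0)

print(relu(square_pre))    # [1.2 0.  0.8 1.1]  negatives are zeroed out
print(relu(triangle_pre))  # [0.  0.  1.3 0. ]  only Neuron 3 fires for "triangle"
```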

The residual stream, being purely linear, does not have this architectural pressure towards sparsity. Features can exist in any arbitrary direction.

2. Rotational Invariance

Intuition (Why Rotate the Basis?)
  • A model with a non-privileged basis is like an alien that speaks an unintelligible language. It computes the correct answers, but its internal representations—the features it uses—are encoded along arbitrary, dense directions in its high-dimensional state space.
  • The standard basis (neuron activations) is the language we humans can directly read. But inspecting individual neuron activations is meaningless if features aren’t aligned with them.

The goal of “rotating the basis” is to find a new coordinate system whose axes align with the true features the model has learned. This search for an interpretable basis is mathematically equivalent to applying a rotation. Once we find this basis, we can point to a new “neuron” (a direction in the rotated space) and say, “This direction detects circles.” We rotate the basis to make the model understandable to us.

Because the residual stream is basis-free, we can apply such rotations without changing the model’s output. Let’s see why.

Let $R$ be an arbitrary orthogonal rotation matrix, meaning $R^{-1} = R^T$ and $R^T R = I$. Suppose we rotate a vector $x_i$ on the residual stream to get $x'_i = R x_i$. For the model’s behavior to remain unchanged, every component that interacts with the stream must adapt.

Consider an attention component:

  • Read Operation: The component reads from the stream using the matrices $W_Q, W_K, W_V$. To preserve the computation, we need new matrices $W'_Q, W'_K, W'_V$ such that:

$$W_Q x_i = W'_Q x'_i$$

Substituting $x'_i = R x_i$, we get $W_Q x_i = W'_Q R x_i$. For this to hold for all $x_i$, we must have $W_Q = W'_Q R$, which implies $W'_Q = W_Q R^{-1} = W_Q R^T$. Thus, the new weight matrices simply “un-rotate” the input before applying the original transformation: $W'_Q x'_i = (W_Q R^T)(R x_i) = W_Q x_i$. The underlying logic is unchanged.

  • Write Operation: The layer writes its output back to the rotated stream:

$$x'_{i+1} = x'_i + W'_O \cdot \text{head\_output}$$

Since all vectors on the stream must be consistently rotated, $x'_{i+1} = R x_{i+1}$ and $x'_i = R x_i$. Substituting these into the original update rule $x_{i+1} = x_i + W_O \cdot \text{head\_output}$ gives:

$$R(x_i + W_O \cdot \text{head\_output}) = R x_i + W'_O \cdot \text{head\_output} \implies R W_O \cdot \text{head\_output} = W'_O \cdot \text{head\_output}$$

This requires $W'_O = R W_O$. The output projection is simply rotated along with the rest of the space.

Since the internal calculations of each component remain invariant after applying these compensatory rotations to the weight matrices, we say the residual stream is rotationally invariant or basis-free.
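Here is a quick numerical check of this argument, a sketch with arbitrary shapes that uses the same column-vector convention as the equations above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2

x    = rng.normal(size=(d_model,))            # a residual-stream vector
W_Q  = rng.normal(size=(d_head, d_model))     # a "read" matrix
W_O  = rng.normal(size=(d_model, d_head))     # a "write" matrix
out  = rng.normal(size=(d_head,))             # some head output to be written back

R, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))   # random orthogonal matrix

x_rot   = R @ x              # rotate the stream
W_Q_rot = W_Q @ R.T          # compensated read:  W'_Q = W_Q R^T
W_O_rot = R @ W_O            # compensated write: W'_O = R W_O

# the read is unchanged ...
assert np.allclose(W_Q @ x, W_Q_rot @ x_rot)
# ... and the updated stream is exactly the rotation of the original update
assert np.allclose(R @ (x + W_O @ out), x_rot + W_O_rot @ out)
print("rotation is compensated exactly")
```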

III. Virtual Weights

Virtual Weights across layers
The linearity of the residual stream allows us to compose weight matrices, forming 'virtual weights' that connect non-adjacent layers.

The linearity of the residual stream has another powerful implication: we can analyze the interaction between any two layers by composing their weight matrices into a single “virtual weight.”

Note (Virtual Weights Induced by the Residual Stream)

Owing to the linearity of the residual stream, one can view it as implicitly defining a set of virtual weights that connect any arbitrary pair of layers, regardless of how far apart they are in depth. Concretely, such a virtual weight matrix is given by the product of the output projection matrix of one layer and the input projection matrix of the other layer.

Let $C_j$ be the computation of component $j$ (e.g., an attention head or MLP), with input weights $W_I^j$ and output weights $W_O^j$. The update rule at step $j$ is:

$$x_{j+1} = x_j + W_O^j \cdot C_j(W_I^j x_j)$$

Now, consider how the next component, $j+1$, reads from the stream:

$$W_I^{j+1} x_{j+1} = W_I^{j+1}\big(x_j + W_O^j \cdot C_j(\dots)\big) = W_I^{j+1} x_j + (W_I^{j+1} W_O^j) \cdot C_j(\dots)$$

The term $W_I^{j+1} W_O^j$ is a virtual weight matrix. It directly maps the output of component $j$ to the input of component $j+1$. This shows that information written by layer $j$ is read by layer $j+1$ through this composite matrix.

We can extend this across multiple layers. The input to component $i$ is influenced by the output of component $j$ (where $j<i$) via the virtual weight $W_I^i W_O^j$. This allows us to think of the Transformer as a network where every layer directly communicates with every subsequent layer, mediated by these virtual weights.
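The sketch below checks this algebra numerically for two hypothetical components; all shapes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_out, d_in = 8, 4, 4

W_O_j  = rng.normal(size=(d_model, d_out))   # output weights of component j
W_I_j1 = rng.normal(size=(d_in, d_model))    # input weights of component j+1

virtual = W_I_j1 @ W_O_j                     # virtual weight W_I^{j+1} W_O^j

x_j = rng.normal(size=(d_model,))            # stream before component j writes
c_j = rng.normal(size=(d_out,))              # output C_j(...) of component j
x_j1 = x_j + W_O_j @ c_j                     # stream after the write

# what component j+1 reads = what it read before + the virtual-weight term
assert np.allclose(W_I_j1 @ x_j1, W_I_j1 @ x_j + virtual @ c_j)
print("virtual weight reproduces the cross-layer interaction")
```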

IV. Subspaces and Bandwidth of the Residual Stream

The residual stream is a high-dimensional vector space (e.g., $d_{model} = 768$ for BERT-base, $d_{model} = 2304$ for Gemma-2B). This high dimensionality allows different layers and attention heads to operate on distinct, often non-overlapping, subspaces.

Definition (Disjoint Subspaces)

We say that a collection of subspaces $U_1, U_2, \dots, U_N$ of a vector space $V$ is disjoint (equivalently, that $V$ is their direct sum) if the following conditions hold:

  • Each subspace intersects the sum of the others only at the zero vector: $U_i \cap \sum_{k \neq i} U_k = \{\mathbf{0}\}$ for every $i$.
  • $V = U_1 + U_2 + \dots + U_N$, i.e., $V = \{ u_1 + u_2 + \dots + u_N \mid u_1 \in U_1, \dots, u_N \in U_N \}$.

Together, these conditions say that $V = U_1 \oplus \dots \oplus U_N$: the linear-algebra analogue of splitting a set into disjoint pieces, expressed as a direct-sum decomposition.

In a multi-head attention layer, each head has a relatively small output dimension ($d_{head} = d_{model} / n_{heads}$, often 64). When these outputs are projected back into the residual stream, they are likely to occupy different subspaces. It’s possible for these subspaces to be nearly orthogonal (disjoint), allowing heads to write information without interfering with each other 2.
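As a toy illustration of this idea, the write matrices below are deliberately constructed to use exactly orthogonal subspaces, which real heads only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 2

# head 1 writes into dimensions 0-1 of the stream, head 2 into dimensions 2-3
W_O_head1 = np.zeros((d_model, d_head)); W_O_head1[0:2, :] = np.eye(d_head)
W_O_head2 = np.zeros((d_model, d_head)); W_O_head2[2:4, :] = np.eye(d_head)

out1 = rng.normal(size=(d_head,))
out2 = rng.normal(size=(d_head,))

stream = W_O_head1 @ out1 + W_O_head2 @ out2   # both writes are simply added

# a later layer can recover each head's message with no interference
assert np.allclose(W_O_head1.T @ stream, out1)
assert np.allclose(W_O_head2.T @ stream, out2)
```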

Once information is added to the residual stream, it persists until it is explicitly modified or overwritten by a subsequent layer. From this perspective, the dimensionality of the residual stream, $d_{model}$, acts as the model’s communication bandwidth or working memory. Increasing $d_{model}$ theoretically increases the capacity for components to store and share information.

Furthermore, studies suggest that the token embedding and unembedding matrices ($W_E$, $W_U$) often interact with only a small fraction of the available dimensions 3. This leaves a large number of “free” dimensions in the residual stream for intermediate layers to use for computation.

Residual Stream Bandwidth
The residual stream's dimensionality serves as a communication bandwidth, which can become a bottleneck.
Definition (Computational Dimension)

Here, the term refers to the dimensionality of the components that perform active computation, such as the MLP or the attention heads (in contrast, the residual stream primarily serves as an information carrier rather than a site of computation). For example, the output dimensionality of an attention layer matches $d_{model}$ (after concatenating the multiple attention heads), while the hidden layer of the MLP typically has a dimensionality 4 times larger than $d_{model}$.

However, this bandwidth is in very high demand. It is the sole channel for communication between all components, and the computational dimensions of those components often far exceed the residual stream’s dimension. For instance, the MLP hidden layer dimension is typically $4 \times d_{model}$. This mismatch creates computational bottlenecks.

Definition (Bottleneck Activations)

An activation vector is considered a bottleneck if its dimension is smaller than the layers preceding and succeeding it. This forces information to be compressed, potentially losing fidelity.

  • For example, the residual stream can be regarded as a form of bottleneck activation. MLP layers at different depths (whose hidden activations typically have higher dimensionality than the residual stream) must communicate with one another through it, so the residual stream acts as a narrow intermediary between two MLP layers whose activations may be much larger. Moreover, the residual stream is the only pathway through which any given MLP layer can communicate with subsequent layers, and it must simultaneously carry forward information written by all other layers, which makes it an especially tight bottleneck.

  • Similarly, a value vector (in the $Q, K, V$ decomposition of an attention head) also constitutes a bottleneck activation.

    • By construction, each value vector has dimensionality $d_{model}/h$, where $h$ denotes the number of attention heads. Thus, its dimensionality is much smaller than that of the residual stream.
    • Let $x_s$ denote the residual stream at token position $s$. The corresponding value vector is $v_s = x_s W_V$. This value vector $v_s$ is then used to update the residual stream at another position $t$: $x_t = x_t + (\text{attention\_score} \times v_s) W_O$.
    • In this way, the information in the residual stream $x_s$ is compressed into $v_s$ and subsequently transferred to the residual stream at position $t$. Thus, between two residual streams, the value vector functions as a bottleneck activation. Importantly, the value vector is the only mechanism by which information can be transmitted from one token position to another.

Because of the high bandwidth demand imposed on the residual stream, certain MLP neurons or attention heads can be interpreted as performing memory management. For instance, they may clear specific residual dimensions allocated by earlier layers by reading out the information stored there and writing back its negation. This resembles a memory-cleaning process: writing the negation cancels out the previous signal, thereby freeing up representational capacity in the residual stream.
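Here is a minimal sketch of this “memory management” behavior, assuming (a strong simplification) that the signal lives along a single known direction of the stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

direction = np.zeros(d_model)
direction[3] = 1.0                      # a residual dimension used by an earlier layer

stream = rng.normal(size=(d_model,))
stream = stream + 2.5 * direction       # an early layer writes a signal there

# a later component reads the content of that dimension and writes back its negation
read_out = (stream @ direction) * direction
stream = stream + (-read_out)           # writing the negation cancels the signal

assert np.isclose(stream @ direction, 0.0)   # the dimension is free again
```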

V. Attention Heads Operate as an Ensemble of Independent Operations

A key design principle of multi-head attention is that the heads, $h \in H$, operate in parallel and independently. The output of the attention layer is the sum of the outputs of the individual heads.

Recall the attention mechanism:

$$\begin{aligned} &x_{i}: [1 \times d] \\ &v_{i}: [1 \times d_{v}] \\ &q_{i}: [1 \times d_{k}] \\ &k_{i}: [1 \times d_{k}] \\ &\alpha_{ij}: \text{softmax}\left( \frac{q_{i}k_{j}^{T}}{\sqrt{ d_{k} }} \right) ~ [1 \times 1] \\ &h^{k}_{i}: \sum_{j \leq i} \alpha_{ij} v_{j} ~ [1 \times d_{v}]\\ &h_{i}: (h^1_{i} \oplus \dots \oplus h_{i}^n) ~ [1 \times (n \cdot d_{v})] \\ &a_{i}: h_{i} W_{O} ~ [1 \times d] \\ &x_{i+1}: x_{i} + a_{i} ~ [1 \times d] \end{aligned}$$

Let $r^{h_k}$ be the result vector from head $k$ (with dimension $d_{head}$). In the original Transformer paper, these results are concatenated and then projected by a single output matrix $W_O$ (with dimension $[n_{heads} \cdot d_{head} \times d_{model}]$). We can decompose this operation. Let $W_O$ be composed of sub-matrices $W_O^k$ (each of size $[d_{head} \times d_{model}]$), one for each head. The concatenation and projection is equivalent to:

$$\begin{aligned} \left[r^{h_{1}} \oplus \dots \oplus r^{h_{n}}\right] W_{O} &= \begin{bmatrix} r^{h_{1}} & \dots & r^{h_{n}} \end{bmatrix} \begin{bmatrix} W_{O}^1 \\ \vdots \\ W_{O}^n \end{bmatrix} \\ &= r^{h_{1}}W_{O}^1 + \dots + r^{h_{n}}W_{O}^n \\ &= \sum_{i=1}^n r^{h_{i}}W_{O}^i \end{aligned}$$

This decomposition shows that the total output is simply the sum of each head’s output projected independently into the residual stream. Each head can be thought of as contributing its own update vector, and these are all added together.
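This equivalence is easy to verify numerically; the shapes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d_head, d_model = 4, 2, 8

r   = [rng.normal(size=(d_head,)) for _ in range(n_heads)]   # per-head result vectors
W_O = rng.normal(size=(n_heads * d_head, d_model))           # shared output projection

# concatenate-then-project, as in the original Transformer
concat_then_project = np.concatenate(r) @ W_O

# split W_O into per-head blocks and sum the independent projections
blocks = np.split(W_O, n_heads, axis=0)                      # each block is [d_head x d_model]
sum_of_heads = sum(r_k @ W_O_k for r_k, W_O_k in zip(r, blocks))

assert np.allclose(concat_then_project, sum_of_heads)
```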

VI. Attention Heads as Information Movers

Attention Head Movement

The fundamental operation of an attention head is to move information between token positions. It reads information from the residual stream at one set of positions and writes that information to the residual stream at another position.

To formalize this, let’s analyze the computation for a single head.

  • Let $x$ be the matrix of input vectors from the residual stream (shape $[N \times d_{model}]$, where $N$ is the sequence length).
  • The head computes value vectors $V = x W_V$. This is a per-token operation.
  • It computes an attention matrix $A$ (shape $[N \times N]$), where $A_{ij}$ is the softmax score from query $i$ to key $j$.
  • The result vectors are computed by mixing values: $R = A V$. This is an across-token operation, where the result for token $i$ is $r_i = \sum_j A_{ij} v_j$.
  • Finally, the output written to the stream is $H = R W_O = (AV)W_O$.

This sequence of operations—per-token projection, across-token mixing, per-token projection—can be elegantly expressed using the Kronecker product ($\otimes$).

Definition (Bilinear Map)

A bilinear map $f$ is a function that combines elements from two vector spaces into an element of a third vector space and is linear in each of its arguments when the other is fixed. Formally, a bilinear map $f : X \times Y \to W$ satisfies:

$$\begin{aligned} f(\lambda x, y) &= \lambda f(x, y), \quad \forall \lambda \in F, \, x \in X, \, y \in Y, \\ f(x, \lambda y) &= \lambda f(x, y), \quad \forall \lambda \in F, \, x \in X, \, y \in Y, \\ f(x_{1} + x_{2}, y) &= f(x_{1}, y) + f(x_{2}, y), \\ f(x, y_{1} + y_{2}) &= f(x, y_{1}) + f(x, y_{2}). \end{aligned}$$
Definition (Tensor Product)

A tensor product $V \otimes W$ is a vector space together with a canonical bilinear map $f : V \times W \to V \otimes W$ that is universal with respect to bilinear maps (i.e., for any bilinear map $g : V \times W \to U$, there exists a unique linear map $\tilde{g} : V \otimes W \to U$ such that $g = \tilde{g} \circ f$).

If $A$ is an $m \times n$ matrix and $B$ is a $p \times q$ matrix, then their Kronecker product (denoted $A \otimes B$) is the block matrix of size $(mp) \times (nq)$ given by

$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}.$$
  • The Kronecker product is a concrete realization of the tensor product when $A$ and $B$ are regarded as linear maps between vector spaces.
  • Define $\operatorname{vec}(X)$ as the operation that vectorizes a matrix $X$ into a one-dimensional column vector by stacking its columns. For example, if $X$ has shape $[m, n]$, then $\operatorname{vec}(X)$ has shape $[mn, 1]$. With this definition, for arbitrary matrices $A$ and $B$ we have the useful identity (a quick numerical check is sketched below):
$$(A \otimes B)\,\operatorname{vec}(X) = \operatorname{vec}(BXA^T).$$
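The snippet below checks this identity for arbitrary shapes; note that the column-major (Fortran-order) vectorization matters.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))     # m x n
B = rng.normal(size=(2, 5))     # p x q
X = rng.normal(size=(5, 4))     # q x n, so that B X A^T is well defined

lhs = np.kron(A, B) @ X.flatten(order="F")   # (A ⊗ B) vec(X), column-major vec
rhs = (B @ X @ A.T).flatten(order="F")       # vec(B X A^T)

assert np.allclose(lhs, rhs)
```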
Remark (Attention Head as a Tensor Product)

If we represent the input $x$ as a single vectorized column vector $\text{vec}(x)$ of size $[N \cdot d_{model} \times 1]$ (stacking the token vectors; here, following the convention of [1], each token’s vector is treated as a column, so $W_V$ has shape $[d_{head} \times d_{model}]$ and $W_O$ has shape $[d_{model} \times d_{head}]$), the entire head’s operation $h(x)$ can be written as a single linear transformation:

$$\text{vec}(h(x)) = \underbrace{(I_N \otimes W_O)}_{\substack{\text{Write: Project results} \\ \text{out for each token}}} \cdot \underbrace{(A \otimes I_{d_{head}})}_{\substack{\text{Mix: Combine value vectors} \\ \text{across tokens}}} \cdot \underbrace{(I_N \otimes W_V)}_{\substack{\text{Read: Compute value} \\ \text{vector for each token}}} \cdot \text{vec}(x)$$

Combining these with the mixed-product property of the Kronecker product, the end-to-end transformation from input $x$ to output $h(x)$ is:

$$\text{vec}(h(x)) = (A \otimes (W_O W_V)) \cdot \text{vec}(x)$$

This compact form reveals the head’s fundamental structure: the attention matrix $A$ dictates how information is moved between token positions, while the virtual weight matrix $W_O W_V$ determines what information is read and written at each position. The two operations are separable.
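The sketch below verifies this factorization against the direct computation from the start of this section. The shapes and the causal attention pattern are made up for the example, and the row-convention matrices $W_V$, $W_O$ from above are transposed to match the column convention used in the Kronecker form.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model, d_head = 5, 8, 2

x   = rng.normal(size=(N, d_model))          # tokens as rows, as in Section VI
W_V = rng.normal(size=(d_model, d_head))     # read  (row convention: V = x W_V)
W_O = rng.normal(size=(d_head, d_model))     # write (row convention: H = R W_O)

# a causal attention pattern A (rows = query/destination, columns = key/source)
mask   = np.tril(np.ones((N, N), dtype=bool))
scores = np.where(mask, rng.normal(size=(N, N)), -np.inf)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# direct computation: H = A x W_V W_O
H_direct = A @ x @ W_V @ W_O

# Kronecker form: stack the token vectors into one long column vector
vec_x  = x.reshape(-1)                         # [x_1; ...; x_N]
big    = np.kron(A, W_O.T @ W_V.T)             # A ⊗ (W_O W_V), column convention
H_kron = (big @ vec_x).reshape(N, d_model)

assert np.allclose(H_direct, H_kron)
print("attention = (move with A) composed with (read/write with W_O W_V)")
```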

VII. One-Layer Attention-Only Transformers

Note

Under construction 🚧

Citation

@misc{ln2025residual,
author={Nguyen Le},
title={Residual Stream is Key to Transformer Interpretability},
year={2025},
url={https://lenguyen.vercel.app/note/math-transformers}
}

References

  1. A Mathematical Framework for Transformer Circuits, Elhage, Nelson and Nanda, Neel and Olsson, Catherine and Henighan, Tom and Joseph, Nicholas and Mann, Ben and Askell, Amanda and Bai, Yuntao and Chen, Anna and Conerly, Tom and DasSarma, Nova and Drain, Dawn and Ganguli, Deep and Hatfield-Dodds, Zac and Hernandez, Danny and Jones, Andy and Kernion, Jackson and Lovitt, Liane and Ndousse, Kamal and Amodei, Dario and Brown, Tom and Clark, Jack and Kaplan, Jared and McCandlish, Sam and Olah, Chris
    Transformer Circuits Thread, 2021
    https://transformer-circuits.pub/2021/framework
  2. Zoom In: An Introduction to Circuits, Olah, Chris and Cammarata, Nick and Schubert, Ludwig and Goh, Gabriel and Petrov, Michael and Carter, Shan
    Distill, 2020
    https://distill.pub/2020/circuits/zoom-in
  3. Mechanistic Interpretability for AI Safety: A Review, Bereska, Leonard and Gavves, Efstratios
    2024
    https://arxiv.org/abs/2404.14082

Footnotes

  1. https://transformer-circuits.pub/2021/framework/index.html#def-privileged-basis

  2. Put simply, each attention head can be thought of as locating its own “free space” within the residual stream — analogous to finding unused memory — in order to write its information. Since each head typically requires only about 64 dimensions, while the residual stream may have several hundred or even a few thousand dimensions (e.g., 768 for BERT-base), many heads can write into nearly disjoint subspaces without interfering with one another.

  3. https://transformer-circuits.pub/2021/framework/index.html#d-footnote-6