Warning
This article was originally written in Vietnamese. The following is an English translation created with the assistance of Gemini-2.5-Pro to make the content accessible to a broader audience.
Also, this post represents my personal notes and my best effort to understand and explain the deep concepts from the foundational paper, "A Mathematical Framework for Transformer Circuits."
I. A High-Level Overview
A Transformer model processes information in a sequence. It begins with token embedding, where an input token $t$ (represented as a one-hot vector) is mapped to an embedding vector $x_0 = W_E t$ via an embedding matrix $W_E$. This vector then passes through a series of residual blocks. Finally, the output of the last block, $x_{-1}$, undergoes token unembedding, where it is mapped to a vector of logits via an unembedding matrix $W_U$.
Each residual block (or Transformer block) consists of an attention layer followed by an MLP layer. Both layers read their input from the residual stream—the central pathway carrying the vectors $x_i$—and subsequently write their results back to it. This write operation is performed via a residual connection: $x_{i+1} = x_i + f(x_i)$, where $f$ denotes the attention or MLP layer.
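To make the flow concrete, here is a minimal numpy sketch of this pipeline. The dimensions, random weights, and the `attn_stub`/`mlp_stub` placeholders are illustrative assumptions (layer norm and biases are omitted); it only shows how embedding, additive residual updates, and unembedding fit together.

```python
# Minimal sketch: embed -> residual blocks that write additively to the stream -> unembed.
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model = 50, 16

W_E = rng.normal(size=(d_model, vocab))   # embedding matrix
W_U = rng.normal(size=(vocab, d_model))   # unembedding matrix

def attn_stub(x):   # stands in for an attention layer's output (same shape as x)
    return rng.normal(size=x.shape) * 0.01

def mlp_stub(x):    # stands in for an MLP layer's output (same shape as x)
    return rng.normal(size=x.shape) * 0.01

token = np.zeros(vocab); token[7] = 1.0   # one-hot input token
x = W_E @ token                            # token embedding -> residual stream x_0

for _ in range(2):                         # two residual blocks
    x = x + attn_stub(x)                   # attention reads x, writes back additively
    x = x + mlp_stub(x)                    # MLP reads x, writes back additively

logits = W_U @ x                           # token unembedding
print(logits.shape)                        # (vocab,)
```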
II. The Residual Stream as a Communication Channel
If we conceptualize a Transformer as a complex computational device, the residual stream is one of its most critical components (analogous to the cell state in LSTMs or the skip connections in ResNets). At its core, the residual stream is simply the cumulative sum of the outputs from all preceding layers, added to the initial token embedding.
Intuition (Communication Channel Analogy)
We can view the residual stream as a communication channel because it does not perform complex, non-linear computations itself (unlike an MLP’s matrix multiplications and activations). Instead, it serves as a shared medium through which all components (attention heads, MLP layers) communicate. They read information from the stream, process it, and write new information back for subsequent layers to access.
A defining feature of the Transformer’s residual stream is its linear and additive structure. This is a key difference from architectures like ResNet, where a non-linear activation (e.g., ReLU) is typically applied after each residual addition, so the skip pathway is not purely linear from end to end. Each layer in a Transformer block reads its input from the stream via linear transformations. Similarly, it writes its output back to the stream, typically after another linear transformation.
Remark (Linear Transformations in Practice)
- Attention Layer: To process an input vector from the residual stream, the layer projects it into query, key, and value vectors using the weight matrices $W_Q$, $W_K$, and $W_V$. These are the "read" transformations. After computing the attention head's output, this result is projected back into the residual stream's dimension via the output matrix $W_O$. This is the "write" transformation.
- MLP Layer: A standard Transformer MLP consists of two linear transformations with a non-linearity between them. The first linear layer, $W_{in}$, reads from the residual stream. The second, $W_{out}$, projects the result of the activated hidden layer back into the stream. The full operation is $W_{out}\,\sigma(W_{in} x)$, where $\sigma$ is the activation function. A minimal numerical sketch of these reads and writes follows below.
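The sketch below illustrates these read and write maps explicitly. The toy dimensions and random weights are assumptions, the across-token attention mixing is omitted, and ReLU stands in for the MLP non-linearity.

```python
# Read/write linear maps for one attention head and one MLP (biases and layer norm omitted).
import numpy as np

rng = np.random.default_rng(1)
d_model, d_head, d_mlp = 16, 4, 64
x = rng.normal(size=d_model)               # one residual-stream vector

# Attention head: read with W_Q, W_K, W_V; write with W_O.
W_Q, W_K, W_V = (rng.normal(size=(d_head, d_model)) for _ in range(3))
W_O = rng.normal(size=(d_model, d_head))
q, k, v = W_Q @ x, W_K @ x, W_V @ x        # "read" projections (q and k shown for completeness)
head_result = v                             # across-token mixing omitted in this sketch
x = x + W_O @ head_result                   # "write" projection back into the stream

# MLP: W_out @ relu(W_in @ x)
W_in  = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_model, d_mlp))
x = x + W_out @ np.maximum(W_in @ x, 0.0)   # read, non-linearity, write
print(x.shape)                              # (16,)
```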
One of the most profound consequences of the stream’s linear, additive nature is that it lacks a privileged basis 1. This implies that we can rotate the vector space of the residual stream—and correspondingly rotate the matrices that interact with it—without changing the model’s computational behavior.
1. Understanding the Privileged Basis
Definition (Basis)
For an $n$-dimensional vector space $V$, a basis is a set of $n$ linearly independent vectors, $\{b_1, \dots, b_n\}$, such that any vector in $V$ can be uniquely expressed as a linear combination of these basis vectors. In the context of neural networks, the hidden state space is a vector space of dimension $d$, and the most common basis is the standard basis $\{e_1, \dots, e_d\}$, where each $e_i$ is a vector of zeros with a 1 in the $i$-th position. Each standard basis vector corresponds to the activation of a single neuron.
The concept of a “privileged basis” is defined in 1 as follows:
“A privileged basis occurs when some aspect of a model’s architecture encourages neural network features to align with basis dimensions, for example because of a sparse activation function such as ReLU.”
Let’s dissect this with an example.
Definition (Feature)
Following the work of Olah et al. in "Zoom In: An Introduction to Circuits", a feature can be understood as a meaningful, human-interpretable property of the input (e.g., "contains a square") that the network learns to represent internally, typically as a direction in its activation space.
The critical distinction is how these features are represented:
- In a privileged basis, features tend to align with the basis vectors themselves. This means a single feature might be represented by the activation of a single neuron (or a very small, sparse set of neurons).
- In a non-privileged basis, a feature is typically represented by a dense linear combination of many neurons. The feature exists as a direction in activation space that is not aligned with any of the standard basis vectors.
Consider a simple MLP trained to classify shapes. Let's focus on a hidden layer with 4 neurons, whose activation space is $\mathbb{R}^4$. The standard basis vectors are $e_1, e_2, e_3, e_4$.
- Case 1: No Sparse Activation (e.g., Linear Layer)
  - The feature for "square" might be represented by the dense vector `[1.2, -0.9, 0.8, 1.1]`.
  - The feature for "triangle" might be represented by `[-0.8, -1.1, 1.3, -0.9]`.
  - Here, each feature is a complex combination of all four neurons. No single neuron is the "square detector." The features are not aligned with the basis vectors. This is a non-privileged basis.
- Case 2: With a Sparse Activation (e.g., ReLU)
  - The pre-activation vector for "square," `[1.2, -0.9, 0.8, 1.1]`, becomes `[1.2, 0, 0.8, 1.1]` after passing through ReLU. If the network further learns to isolate features, this might evolve into a sparser representation like `[1.5, 0, 0, 0]`. Now, this vector is perfectly aligned with the first basis vector, $e_1$. We can confidently say that Neuron 1 has learned to detect squares.
  - Similarly, "triangle" might activate Neuron 3, becoming `[0, 0, 1.3, 0]`.
  - It can be observed that, due to the presence of the ReLU activation (or more generally, as originally defined, when "some aspect of a model's architecture encourages neural network features to align with basis dimensions"), features tend to align with individual neurons. Consequently, one can "confidently" ascribe responsibility to specific neurons — for example, designating a given neuron as detecting the presence of a square. This alignment makes the representation more interpretable, and such a coordinate system is referred to as a privileged basis.
  - In addition, ReLU contributes both non-linearity and sparsity to the activation vectors (since negative activations are zeroed out), which further reinforces this interpretable structure. A small numerical sketch follows this list.
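Here is the sketch referenced above: a toy numpy illustration using the example's assumed activation values, showing how ReLU zeroes negative coordinates and pushes the representation toward alignment with individual neurons.

```python
# ReLU makes representations sparser, nudging features toward individual basis directions.
import numpy as np

pre_square   = np.array([ 1.2, -0.9,  0.8,  1.1])   # dense, non-privileged representation
pre_triangle = np.array([-0.8, -1.1,  1.3, -0.9])

post_square   = np.maximum(pre_square, 0.0)    # -> [1.2, 0. , 0.8, 1.1]
post_triangle = np.maximum(pre_triangle, 0.0)  # -> [0. , 0. , 1.3, 0. ]  (aligned with neuron 3)

print(post_square, post_triangle)
```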
The residual stream, being purely linear, does not have this architectural pressure towards sparsity. Features can exist in any arbitrary direction.
2. Rotational Invariance
Intuition (Why Rotate the Basis?)
- A model with a non-privileged basis is like an alien that speaks an unintelligible language. It computes the correct answers, but its internal representations—the features it uses—are encoded along arbitrary, dense directions in its high-dimensional state space.
- The standard basis (neuron activations) is the language we humans can directly read. But inspecting individual neuron activations is meaningless if features aren’t aligned with them.
The goal of “rotating the basis” is to find a new coordinate system whose axes align with the true features the model has learned. This search for an interpretable basis is mathematically equivalent to applying a rotation. Once we find this basis, we can point to a new “neuron” (a direction in the rotated space) and say, “This direction detects circles.” We rotate the basis to make the model understandable to us.
Because the residual stream is basis-free, we can apply such rotations without changing the model’s output. Let’s see why.
Let $Q$ be an arbitrary orthogonal rotation matrix, meaning $Q^{\top} Q = Q Q^{\top} = I$ and $Q^{-1} = Q^{\top}$. Suppose we rotate a vector $x$ on the residual stream to get $x' = Qx$. For the model's behavior to remain unchanged, every component that interacts with the stream must adapt.
Consider an attention component:
- Read Operation: The component reads from the stream using the matrices $W_Q$, $W_K$, and $W_V$. To preserve the computation, we need new matrices $W'_Q$, $W'_K$, $W'_V$ such that:
  $$W'_{*}\, x' = W_{*}\, x, \qquad * \in \{Q, K, V\}.$$
  Substituting $x' = Qx$, we get $W'_{*} Q x = W_{*} x$. For this to hold for all $x$, we must have $W'_{*} Q = W_{*}$, which implies $W'_{*} = W_{*} Q^{-1} = W_{*} Q^{\top}$. Thus, the new weight matrices simply "un-rotate" the input before applying the original transformation: $W'_{*} = W_{*} Q^{\top}$. The underlying logic is unchanged.
- Write Operation: The layer writes its output back to the rotated stream:
  $$x'_{\text{new}} = x' + W'_O\, h,$$
  where $h$ is the head's result vector. Since all vectors on the stream must be consistently rotated, $x' = Qx$ and $x'_{\text{new}} = Q x_{\text{new}}$. Substituting these into the original update rule $x_{\text{new}} = x + W_O h$ gives:
  $$Q x_{\text{new}} = Q x + Q W_O\, h = x' + (Q W_O)\, h.$$
  This requires $W'_O = Q W_O$. The output projection is simply rotated along with the rest of the space.
Since the internal calculations of each component remain invariant after applying these compensatory rotations to the weight matrices, we say the residual stream is rotationally invariant or basis-free.
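The argument can be checked numerically. The following sketch uses toy dimensions and random weights (assumptions, not tied to any real model) and verifies both compensation rules, $W'_{*} = W_{*} Q^{\top}$ for reads and $W'_O = Q W_O$ for writes.

```python
# Numerical sanity check of rotational invariance of the residual stream.
import numpy as np

rng = np.random.default_rng(2)
d_model, d_head = 16, 4
x   = rng.normal(size=d_model)
W_Q = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
h   = rng.normal(size=d_head)              # some head result vector

Q, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))   # random orthogonal matrix
x_rot = Q @ x

# Read: W_Q' = W_Q Q^T reproduces the original query.
W_Q_rot = W_Q @ Q.T
assert np.allclose(W_Q_rot @ x_rot, W_Q @ x)

# Write: W_O' = Q W_O keeps the updated stream equal to the rotated original update.
W_O_rot = Q @ W_O
assert np.allclose(x_rot + W_O_rot @ h, Q @ (x + W_O @ h))
print("rotation checks passed")
```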
III. Virtual Weights
The linearity of the residual stream has another powerful implication: we can analyze the interaction between any two layers by composing their weight matrices into a single “virtual weight.”
Note (Virtual Weights Induced by the Residual Stream)
Owing to the linearity of the residual stream, one can view it as implicitly defining a set of virtual weights that connect any arbitrary pair of layers, regardless of how far apart they are in depth. Concretely, such a virtual weight matrix is given by the product of the output projection matrix of one layer and the input projection matrix of the other layer.
Let $f_i$ be the computation of component $i$ (e.g., an attention head or MLP), with input weights $W_I^{(i)}$ and output weights $W_O^{(i)}$. The update rule at step $i$ is:
$$x_{i+1} = x_i + W_O^{(i)}\, f_i\!\big(W_I^{(i)} x_i\big).$$
Now, consider how the next component, $j$, reads from the stream:
$$W_I^{(j)} x_{i+1} = W_I^{(j)} x_i + \big(W_I^{(j)} W_O^{(i)}\big)\, f_i\!\big(W_I^{(i)} x_i\big).$$
The term $W_I^{(j)} W_O^{(i)}$ is a virtual weight matrix. It directly maps the output of component $i$ to the input of component $j$. This shows that information written by layer $i$ is read by layer $j$ through this composite matrix.
We can extend this across multiple layers. The input to component $j$ is influenced by the output of any earlier component $i$ (where $i < j$) via the virtual weight $W_I^{(j)} W_O^{(i)}$. This allows us to think of the Transformer as a network where every layer directly communicates with every subsequent layer, mediated by these virtual weights.
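A short numpy sketch of a virtual weight between two hypothetical components $i$ and $j$ (toy dimensions and random weights are assumptions): whatever $i$ writes through $W_O^{(i)}$ is read by $j$ through $W_I^{(j)}$, so their interaction is captured by the single matrix $W_I^{(j)} W_O^{(i)}$.

```python
# Virtual weight between a writing component i and a later reading component j.
import numpy as np

rng = np.random.default_rng(3)
d_model, d_i_out, d_j_in = 16, 4, 8
W_O_i = rng.normal(size=(d_model, d_i_out))   # output weights of component i
W_I_j = rng.normal(size=(d_j_in, d_model))    # input weights of a later component j

f_i = rng.normal(size=d_i_out)                # output of component i's computation
x   = rng.normal(size=d_model)                # residual stream before component i writes

virtual_W = W_I_j @ W_O_i                     # (d_j_in, d_i_out) virtual weight matrix

# Reading the updated stream splits into "old stream" + "virtual-weight" contribution.
read_direct  = W_I_j @ (x + W_O_i @ f_i)
read_virtual = W_I_j @ x + virtual_W @ f_i
assert np.allclose(read_direct, read_virtual)
print(virtual_W.shape)                        # (8, 4)
```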
IV. Subspaces and Bandwidth of the Residual Stream
The residual stream is a high-dimensional vector space (e.g., $d_{\text{model}} = 768$ for BERT-base, $d_{\text{model}} = 2048$ for Gemma-2B). This high dimensionality allows different layers and attention heads to operate on distinct, often non-overlapping, subspaces.
Definition (Disjoint Subspaces)
We say that a collection of subspaces $V_1, \dots, V_k$ of a vector space $V$ is disjoint if the following condition holds:
- $V_i \cap \sum_{j \neq i} V_j = \{0\}$ for every $i$, i.e., their sum forms a direct sum $V_1 \oplus \cdots \oplus V_k$, so every vector in the sum decomposes uniquely into one component from each subspace.
This condition is analogous to a "union" in set theory, except expressed in terms of linear subspaces via the direct-sum decomposition.
In a multi-head attention layer, each head has a relatively small output dimension ($d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$, often 64). When these outputs are projected back into the residual stream, they are likely to occupy different subspaces. It's possible for these subspaces to be nearly orthogonal (disjoint), allowing heads to write information without interfering with each other 2.
Once information is added to the residual stream, it persists until it is explicitly modified or overwritten by a subsequent layer. From this perspective, the dimensionality of the residual stream, $d_{\text{model}}$, acts as the model's communication bandwidth or working memory. Increasing $d_{\text{model}}$ theoretically increases the capacity for components to store and share information.
Furthermore, studies suggest that the token embedding and unembedding matrices ($W_E$, $W_U$) often interact with only a small fraction of the available dimensions 3. This leaves a large number of "free" dimensions in the residual stream for intermediate layers to use for computation.
Definition (Computational Dimension)
Here, the term refers to the dimensionality of components that perform active computation, such as the MLP or the Attention Heads (in contrast, the residual stream primarily serves as an information carrier rather than a site of computation). For example, the output dimensionality of an Attention layer can match $d_{\text{model}}$ (after concatenating the multiple attention heads). In contrast, the hidden layer of the MLP typically has a dimensionality that is 4 times larger than $d_{\text{model}}$.
However, this bandwidth is in very high demand. It is the sole channel for communication between all components. The computational dimensions of the components often far exceed the residual stream's dimension. For instance, the MLP hidden layer dimension is typically $4\, d_{\text{model}}$. This mismatch creates computational bottlenecks.
Definition (Bottleneck Activations)
An activation vector is considered a bottleneck if its dimension is smaller than the layers preceding and succeeding it. This forces information to be compressed, potentially losing fidelity.
- For example, the residual stream itself can be regarded as a form of bottleneck activation. MLP layers at different depths (which typically have hidden activations of higher dimensionality than the residual stream) must communicate with one another through the residual stream. Consequently, the residual stream acts as an intermediary between two MLP layers whose hidden activations may have much larger dimensionality. Moreover, the residual stream is the only pathway through which any given MLP layer can communicate with subsequent layers, and it must simultaneously carry forward information originating from every other layer, which places extreme demand on this bottleneck.
- Similarly, a value vector (in the decomposition of an attention head) also constitutes a bottleneck activation.
  - By construction, each value vector has dimensionality $d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$, where $n_{\text{heads}}$ denotes the number of attention heads. Thus, its dimensionality is much smaller than that of the residual stream.
  - Let $x_j$ denote the residual stream at token position $j$. The corresponding value vector is $v_j = W_V x_j$. This value vector is then used to update the residual stream at another position $i$:
  $$x_i \leftarrow x_i + A_{ij}\, W_O\, v_j.$$
  - In this way, the information in the residual stream at position $j$ is compressed into $v_j$ and subsequently transferred to the residual stream at position $i$. Thus, between two residual streams, the value vector functions as a bottleneck activation. Importantly, the value vector is the only mechanism by which information can be transmitted from one token position to another (see the sketch below).
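Here is the sketch referenced above: a toy numpy illustration of the value-vector bottleneck, with assumed sizes ($d_{\text{model}} = 512$, $n_{\text{heads}} = 8$), random weights, and a single made-up attention score.

```python
# A d_model-dimensional residual vector at position j is squeezed into a d_head-dimensional
# value before being written into position i's residual stream.
import numpy as np

rng = np.random.default_rng(4)
d_model, n_heads = 512, 8
d_head = d_model // n_heads                   # 64

W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

x_j = rng.normal(size=d_model)                # residual stream at source position j
x_i = rng.normal(size=d_model)                # residual stream at destination position i
A_ij = 0.3                                    # attention weight from query i to key j

v_j = W_V @ x_j                               # compressed to 64 dimensions
x_i = x_i + A_ij * (W_O @ v_j)                # moved into position i's stream
print(v_j.shape, x_i.shape)                   # (64,) (512,)
```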
Because of the high bandwidth demand imposed on the residual stream, certain MLP neurons or attention heads can be interpreted as performing memory management operations. For instance, they may clear specific residual dimensions allocated by earlier layers by reading out information and then writing back the negative of that information. This resembles the behavior of a memory-cleaning process: writing the negation cancels out the previous signal, thereby freeing up representational capacity in the residual stream.
V. Attention Heads Operate as an Ensemble of Independent Operations
A key design principle of multi-head attention is that the heads, $h_1, \dots, h_H$, operate in parallel and independently. The output of the attention layer is the sum of the outputs of the individual heads.
Recall the attention mechanism:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad \mathrm{MultiHead}(x) = W_O\, \mathrm{Concat}\!\left(r^{1}, \dots, r^{H}\right).$$
Let $r^h$ be the result vector from head $h$ (with dimension $d_{\text{head}}$). In the original Transformer paper, these results are concatenated and then projected by a single output matrix $W_O$ (with dimension $d_{\text{model}} \times H\, d_{\text{head}}$). We can decompose this operation. Let $W_O$ be composed of sub-matrices $W_O^{1}, \dots, W_O^{H}$ (each of size $d_{\text{model}} \times d_{\text{head}}$), one for each head. The concatenation and projection is equivalent to:
$$W_O\, \mathrm{Concat}\!\left(r^{1}, \dots, r^{H}\right) = \sum_{h=1}^{H} W_O^{h}\, r^{h}.$$
This decomposition shows that the total output is simply the sum of each head’s output projected independently into the residual stream. Each head can be thought of as contributing its own update vector, and these are all added together.
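A quick numerical check of this decomposition, with assumed toy sizes and random weights: concatenating the head results and multiplying by one big $W_O$ gives the same vector as summing the per-head projections $W_O^h r^h$.

```python
# Concat-then-project equals the sum of independent per-head writes.
import numpy as np

rng = np.random.default_rng(5)
d_model, d_head, n_heads = 16, 4, 3
r = [rng.normal(size=d_head) for _ in range(n_heads)]          # per-head result vectors
W_O = rng.normal(size=(d_model, n_heads * d_head))             # full output matrix
W_O_heads = np.split(W_O, n_heads, axis=1)                     # per-head sub-matrices

out_concat = W_O @ np.concatenate(r)                           # concat then project
out_sum    = sum(W_O_h @ r_h for W_O_h, r_h in zip(W_O_heads, r))  # sum of per-head writes
assert np.allclose(out_concat, out_sum)
print("decomposition holds")
```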
VI. Attention Heads as Information Movers
The fundamental operation of an attention head is to move information between token positions. It reads information from the residual stream at one set of positions and writes that information to the residual stream at another position.
To formalize this, let’s analyze the computation for a single head.
- Let $X$ be the matrix of input vectors from the residual stream (shape $d_{\text{model}} \times n$, where $n$ is the sequence length and each column is one token's residual vector).
- The head computes value vectors $V = W_V X$. This is a per-token operation.
- It computes an attention matrix $A$ (shape $n \times n$), where $A_{ij}$ is the softmax score from query $i$ to key $j$.
- The result vectors are computed by mixing values: $R = V A^{\top}$. This is an across-token operation, where the result for token $i$, $r_i$, is $\sum_j A_{ij} v_j$.
- Finally, the output written to the stream is $W_O R$.
This sequence of operations—per-token projection, across-token mixing, per-token projection—can be elegantly expressed using the Kronecker product ($\otimes$).
Definition (Bilinear Map)
A bilinear map is a function that combines elements from two vector spaces into an element of a third vector space. Moreover, a bilinear map is linear in each of its arguments when the other is fixed. Formally, a bilinear map $f: V \times W \to Z$ satisfies:
$$f(\alpha v_1 + \beta v_2,\, w) = \alpha f(v_1, w) + \beta f(v_2, w), \qquad f(v,\, \alpha w_1 + \beta w_2) = \alpha f(v, w_1) + \beta f(v, w_2).$$
Definition (Tensor Product)
A tensor product of two vector spaces $V$ and $W$ is a vector space $V \otimes W$ together with a canonical bilinear map $\varphi: V \times W \to V \otimes W$ that is universal with respect to bilinear maps (i.e., for any bilinear map $f: V \times W \to Z$, there exists a unique linear map $\tilde{f}: V \otimes W \to Z$ such that $f = \tilde{f} \circ \varphi$).
If $A$ is an $m \times n$ matrix and $B$ is a $p \times q$ matrix, then their Kronecker product (denoted $A \otimes B$) is the block matrix of size $mp \times nq$ given by
$$A \otimes B = \begin{bmatrix} a_{11} B & \cdots & a_{1n} B \\ \vdots & \ddots & \vdots \\ a_{m1} B & \cdots & a_{mn} B \end{bmatrix}.$$
- The Kronecker product is a concrete realization of the tensor product when $A$ and $B$ are regarded as linear maps between vector spaces.
- Define $\mathrm{vec}(\cdot)$ as the operation that vectorizes a matrix into a one-dimensional column vector by stacking its columns. For example, if $X$ has shape $m \times n$, then $\mathrm{vec}(X)$ has shape $mn \times 1$. With this definition, for arbitrary matrices $A$, $X$, and $B$ of compatible shapes we have the useful identity (verified numerically below):
$$\mathrm{vec}(A X B) = \big(B^{\top} \otimes A\big)\,\mathrm{vec}(X).$$
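As referenced above, here is a quick numerical verification of the vec identity using numpy's `np.kron` and a column-stacking flatten; the matrix shapes are arbitrary assumptions.

```python
# Check vec(AXB) == (B^T kron A) vec(X) with column-stacking vec.
import numpy as np

rng = np.random.default_rng(6)
A, X, B = rng.normal(size=(3, 4)), rng.normal(size=(4, 5)), rng.normal(size=(5, 2))

vec = lambda M: M.flatten(order="F")          # stack columns into one column vector
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
print("vec(AXB) matches (B^T kron A) vec(X)")
```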
Remark (Attention Head as a Tensor Product)
If we represent the input $X$ as a single vectorized column vector $\mathrm{vec}(X)$ of size $n\, d_{\text{model}}$, the entire head's operation can be written as a single linear transformation:
$$\mathrm{vec}\big(h(X)\big) = \big(I_n \otimes W_O\big)\,\big(A \otimes I_{d_{\text{head}}}\big)\,\big(I_n \otimes W_V\big)\,\mathrm{vec}(X).$$
Combining these factors (using the mixed-product property $(A \otimes B)(C \otimes D) = AC \otimes BD$), the end-to-end transformation from input to output is:
$$\mathrm{vec}\big(h(X)\big) = \big(A \otimes W_O W_V\big)\,\mathrm{vec}(X), \qquad \text{equivalently} \quad h(X) = W_O W_V\, X\, A^{\top}.$$
This compact form reveals the head's fundamental structure: the attention matrix $A$ dictates how information is moved between token positions, while the virtual weight matrix $W_O W_V$ determines what information is read from and written to the stream at each position. The two operations are separable.
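Finally, a numpy sketch (toy sizes, random weights, and a random row-stochastic matrix as a stand-in for the attention pattern) verifying that the step-by-step head computation matches the compact Kronecker form above.

```python
# Step-by-step head computation vs. the single Kronecker-factored linear map.
import numpy as np

rng = np.random.default_rng(7)
d_model, d_head, n = 8, 2, 5
X   = rng.normal(size=(d_model, n))           # residual stream, one column per token
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))
A   = rng.dirichlet(np.ones(n), size=n)       # row-stochastic attention pattern (n x n)

# Step-by-step: per-token values, across-token mixing, per-token output projection.
head_out = W_O @ (W_V @ X) @ A.T              # shape (d_model, n)

# Compact form: one big linear map on the vectorized stream.
vec = lambda M: M.flatten(order="F")
assert np.allclose(vec(head_out), np.kron(A, W_O @ W_V) @ vec(X))
print("vec(h(X)) matches (A kron W_O W_V) vec(X)")
```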
VII. One-Layer Attention-Only Transformers
Note
Under construction 🚧
Citation
@misc{ln2025residual, author={Nguyen Le}, title={Residual Stream is Key to Transformer Interpretability}, year={2025}, url={https://lenguyen.vercel.app/note/math-transformers}}
References
- Elhage, Nelson; Nanda, Neel; Olsson, Catherine; Henighan, Tom; Joseph, Nicholas; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom; DasSarma, Nova; Drain, Dawn; Ganguli, Deep; Hatfield-Dodds, Zac; Hernandez, Danny; Jones, Andy; Kernion, Jackson; Lovitt, Liane; Ndousse, Kamal; Amodei, Dario; Brown, Tom; Clark, Jack; Kaplan, Jared; McCandlish, Sam; Olah, Chris. "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, 2021. https://transformer-circuits.pub/2021/framework
- Olah, Chris; Cammarata, Nick; Schubert, Ludwig; Goh, Gabriel; Petrov, Michael; Carter, Shan. "Zoom In: An Introduction to Circuits." Distill, 2020. https://distill.pub/2020/circuits/zoom-in
- Bereska, Leonard; Gavves, Efstratios. "Mechanistic Interpretability for AI Safety -- A Review." 2024. https://arxiv.org/abs/2404.14082
Footnotes
1. https://transformer-circuits.pub/2021/framework/index.html#def-privileged-basis
2. Put simply, each attention head can be thought of as locating its own "free space" within the residual stream (analogous to finding unused memory) in order to write its information. Since each head typically requires only about 64 dimensions, while the residual stream has several hundred or more, such free space is usually available.
3. https://transformer-circuits.pub/2021/framework/index.html#d-footnote-6