I’ve always been fascinated by how libraries like Hugging Face `transformers` work under the hood. So, I decided to build one myself from scratch, and I picked JAX/Flax mostly because it seemed cool. This project, ‘Banhxeo’, is the result: a deep dive into the guts of an NLP pipeline, especially the tokenizer.
1. The Tokenizer Pipeline: A (Hugging) Face-lift
The first goal was to replicate the Hugging Face `tokenizers` library. It’s a 4-stage process (normalize, pre-tokenize, model, post-process) that takes a raw string and turns it into model-ready inputs.
My main `Tokenizer` class implements this exact flow in its `__call__` method:
```python
def __call__(self, texts: Union[str, List[str]], ...):
    pre_tokenized_strs = []
    for text in texts:
        # Step 1: Normalize the string (e.g., lowercase, NFC)
        normalized_string = NormalizedString.from_str(text)
        normalized_string = self.normalizer.normalize(normalized_string)

        # Step 2: Pre-tokenize (e.g., split on whitespace, bytes)
        pre_tokenized_str = self.pre_tokenizer.pre_tokenize(
            PreTokenizedString(splits=[Split(normalized=normalized_string)])
        )

        # Step 3: Model (turn pre-tokenized splits into token IDs)
        self.model.tokenize(pre_tokenized_str)
        pre_tokenized_strs.append(pre_tokenized_str)

    # Step 4: Post-process (add special tokens [CLS], [SEP], padding, etc.)
    post_process_config = ProcessConfig(...)
    post_process_result = self.post_processor.process_batch(
        pre_tokenized_strs, config=post_process_config
    )

    # Convert to JAX arrays
    return {
        key: jnp.array(value, dtype=jnp.int32)
        for key, value in post_process_result.items()
    }
```

I implemented several options for each stage, like `ByteLevelPreTokenizer` (for GPT-2 style) and `BertPostProcessor`/`GPTPostProcessor` to handle the different special tokens.
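To make the flow concrete, here is roughly what using the pipeline looks like end to end. Treat it as a sketch: the constructor arguments, the `LowercaseNormalizer` name, and the exact output keys are my assumptions about the API, not guaranteed by the snippets above.

```python
# Sketch only -- the wiring and output keys here are assumptions, not Banhxeo's exact API.
tokenizer = Tokenizer(
    normalizer=LowercaseNormalizer(),       # Step 1: e.g. lowercase + NFC
    pre_tokenizer=ByteLevelPreTokenizer(),  # Step 2: GPT-2 style byte-level splits
    model=bpe_model,                        # Step 3: a trained BPEModel
    post_processor=GPTPostProcessor(),      # Step 4: special tokens, padding
)

batch = tokenizer(["I love bánh xèo.", "JAX is fun!"])
# Expected: a dict of jnp.int32 arrays, e.g. batch["input_ids"] with shape (2, seq_len).
```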
2. From-Scratch BPE: Heaps of Fun
The most complex part was building the BPEModel (Byte-Pair Encoding) trainer from scratch. BPE works by counting all character pairs and iteratively merging the most frequent one.
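Before getting to the efficient version, here is a tiny, throwaway illustration of that core loop on a toy corpus (my own demo, not Banhxeo code): count adjacent pairs, merge the winner, repeat.

```python
from collections import Counter
import itertools

# Toy corpus: each word is a list of symbols, starting from single characters
words = [["h", "u", "g"], ["h", "u", "g"], ["p", "u", "g"]]

for step in range(2):
    # Count adjacent pairs across all words
    pairs = Counter(p for w in words for p in itertools.pairwise(w))
    best = max(pairs, key=pairs.get)  # most frequent pair, e.g. ('u', 'g')
    merged = "".join(best)

    # Merge every occurrence of the winning pair
    new_words = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                out.append(merged)
                i += 2
            else:
                out.append(w[i])
                i += 1
        new_words.append(out)
    words = new_words
    print(f"merge {step}: {best} -> {merged}, words = {words}")
```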
To do this efficiently, you need to:

- Count all pairs in the corpus.
- Put them in a priority queue (a min-heap in Python) so you can always get the most frequent pair in O(log n) time.
- When you merge a pair (e.g., `('h', 'u')` -> `'hu'`), update the counts for all pairs affected by the merge: if `'p h u g'` becomes `'p hu g'`, the counts for `('p', 'h')` and `('u', 'g')` are removed, and counts for `('p', 'hu')` and `('hu', 'g')` are added.
This logic is all in `bpe.py`:
```python
def get_pair_stats(word_freqs: BPEWord) -> Tuple[PairStats, PairHeap]:
    pair_stats = defaultdict(int)
    for _, (frequency, split) in word_freqs.items():
        # Add each pair in this word to the pair stats
        for pair in set(itertools.pairwise(split)):
            pair_stats[pair] += frequency

    # Then create a heap based on pair_stats.
    # We use negative frequency because heapq is a min-heap.
    pair_heap = [(-freq, pair) for pair, freq in pair_stats.items()]
    heapq.heapify(pair_heap)
    return pair_stats, pair_heap
```
```python
# Inside BPEModel.train()
def train(
    self,
    corpus: Iterable[PreTokenizedString],
    **kwargs,
):
    ...
    # 1. Get initial stats and heap
    pair_stats, pair_heap = get_pair_stats(word_freqs)

    # 2. Loop until vocab size is reached
    for _ in progress_bar(
        range(0, (vocab_size - initial_vocab_size)),
        desc="Training BPE",
    ):
        ...
        # 3. Pop the most frequent pair from the heap
        most_freq_pair = None
        while pair_heap:
            nfreq, most_freq_pair = heapq.heappop(pair_heap)
            # This check is crucial to handle stale entries in the heap
            if (freq := pair_stats.get(most_freq_pair)) is None or freq != -nfreq:
                continue
            else:
                break  # Found a valid, most-frequent pair
        ...

        # 4. Add to merge rules
        self.merges[most_freq_pair] = rank

        # 5. Merge this pair everywhere and update stats
        merge_pair(
            most_freq_pair, word_freqs, inverted_word_freqs, pair_stats, pair_heap
        )
        ...
```
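To see why that stale-entry check matters, here is a tiny standalone demo of the same lazy-deletion pattern (toy data, not Banhxeo's code): the dict holds the live counts, the heap holds possibly outdated snapshots, and popping simply skips entries whose count no longer matches.

```python
import heapq
import itertools
from collections import defaultdict

# Toy stats: word -> (frequency, current split)
word_freqs = {
    "hug": (10, ["h", "u", "g"]),
    "pug": (5, ["p", "u", "g"]),
    "hugs": (4, ["h", "u", "g", "s"]),
}

pair_stats = defaultdict(int)
for freq, split in word_freqs.values():
    for pair in itertools.pairwise(split):
        pair_stats[pair] += freq

# Min-heap of (-count, pair); negating the count turns it into a max-heap.
pair_heap = [(-count, pair) for pair, count in pair_stats.items()]
heapq.heapify(pair_heap)

# Pretend a merge changed the stats, leaving a stale heap entry behind.
pair_stats[("u", "g")] -= 5

# Pop until the heap entry agrees with the live stats (lazy deletion).
while pair_heap:
    ncount, pair = heapq.heappop(pair_heap)
    if pair_stats.get(pair) == -ncount:
        print("most frequent pair:", pair, pair_stats[pair])
        break
```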
3. The Full Stack: DataLoaders and Flax Models

A tokenizer isn’t much good without data and models. I expanded the library to include a data pipeline and some simple Flax models.
a. A PyTorch-powered JAX DataLoader
The JAX ecosystem doesn’t have a multi-process DataLoader as mature as PyTorch’s. Instead of reinventing the wheel, I made a wrapper. My `DataLoader` class:

- Uses `torch.utils.data.DataLoader` under the hood if `USE_TORCH=True` and `num_workers > 0`.
- Feeds the Torch loader a “dummy” dataset and uses a custom `collate_fn` that calls my actual JAX-based dataset’s `__getitems__` method.
- If PyTorch isn’t available or `num_workers=0`, it falls back to a simple, single-process `NaiveDataLoader`.
```python
class DataLoader:
    def __init__(
        self,
        dataset,
        batch_size: int,
        ...
        num_workers: int = 0,
        **kwargs,
    ):
        self.dataset = dataset
        if USE_TORCH and num_workers > 0:
            # 1. Create a dummy dataset with just the length
            adapter = TorchDummyDataset(len(dataset))

            # 2. Create a collator that calls our *real* dataset
            collate_fn = TorchCollator(dataset)
            ...

            # 3. Use the powerful Torch DataLoader for multiprocessing
            self._loader = TorchDataLoader(
                dataset=adapter,
                batch_size=batch_size,
                shuffle=shuffle,
                num_workers=num_workers,
                collate_fn=collate_fn,
                **kwargs,
            )
        else:
            # 4. Fallback to a simple single-process loader
            self._loader = NaiveDataLoader(
                dataset=dataset,
                batch_size=batch_size,
                ...
            )
```
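The `TorchCollator` is the piece that bridges the two worlds: PyTorch’s worker processes hand it the indices produced by the dummy dataset, and it looks the real examples up on the JAX side. Its implementation isn’t shown above, but under the assumption that it simply forwards those indices to the dataset’s `__getitems__`, it could look something like this:

```python
class TorchCollator:
    """Sketch of a collator that maps dummy indices back to the real dataset."""

    def __init__(self, dataset):
        self.dataset = dataset  # the real, JAX-friendly dataset

    def __call__(self, indices):
        # `indices` is the list of integers the Torch workers pulled from
        # TorchDummyDataset; the actual batch construction happens here.
        return self.dataset.__getitems__(indices)
```

The nice side effect of this design is that PyTorch’s machinery only ever shuffles and batches integer indices; the actual data loading stays in my own dataset code.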
b. Flax Models

Finally, I implemented some standard models in Flax, like a simple MLP for text classification. It handles embedding aggregation (mean, sum, max) and builds a simple feed-forward network.
```python
from typing import List

import einops
import flax.linen as nn
import jax
import jax.nn as jnn
import jax.numpy as jnp
from jaxtyping import Integer


class MLP(nn.Module):
    vocab_size: int
    output_size: int
    embedding_dim: int
    hidden_sizes: List[int]
    aggregate_strategy: str = "mean"
    activation_fn: str = "relu"  # referenced below; default assumed
    dropout_rate: float = 0.0    # referenced below; default assumed

    @nn.compact
    def __call__(
        self,
        input_ids: Integer[jax.Array, "batch seq"],
        attention_mask: Integer[jax.Array, "batch seq"],
        dropout: bool = True,
    ):
        embeddings = nn.Embed(
            num_embeddings=self.vocab_size,
            features=self.embedding_dim,
        )(input_ids)

        # attention_mask: (batch, seq) -> (batch, seq, 1)
        mask_expanded = attention_mask[:, :, None].astype(jnp.float32)

        # Aggregate embeddings based on strategy
        match self.aggregate_strategy:
            case "mean":
                summed = einops.reduce(
                    embeddings * mask_expanded, "batch seq dim -> batch dim", "sum"
                )
                count = einops.reduce(
                    mask_expanded, "batch seq 1 -> batch 1", "sum"
                ).clip(min=1e-9)
                x = summed / count
            case "sum":
                ...
            case "max":
                ...

        # Pass through hidden layers
        for hidden_dim in self.hidden_sizes:
            x = nn.Dense(hidden_dim)(x)
            x = getattr(jnn, self.activation_fn.lower())(x)
            if self.dropout_rate > 0:
                x = nn.Dropout(rate=self.dropout_rate)(x, deterministic=not dropout)

        logits = nn.Dense(self.output_size)(x)
        return logits
```
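And a quick smoke test, roughly how I’d initialize and apply the module (the hyperparameters here are arbitrary):

```python
import jax
import jax.numpy as jnp

model = MLP(
    vocab_size=1000,
    output_size=2,
    embedding_dim=64,
    hidden_sizes=[128, 64],
    aggregate_strategy="mean",
)

rng = jax.random.PRNGKey(0)
input_ids = jnp.ones((4, 16), dtype=jnp.int32)       # (batch, seq)
attention_mask = jnp.ones((4, 16), dtype=jnp.int32)  # 1 = real token, 0 = padding

params = model.init(rng, input_ids, attention_mask, dropout=False)
logits = model.apply(params, input_ids, attention_mask, dropout=False)
print(logits.shape)  # (4, 2)
```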
This project was a fantastic learning experience in building a modular, testable NLP library. Digging into the tokenizer logic was surprisingly complex but incredibly rewarding. Now, to actually finish that `GPT2` model… 😅