Direct Logit Attribution (DLA): a technique for understanding which parts of a neural network (transformer) are responsible for predicting specific tokens in the output.
In a transformer, each layer typically contains:
- an attention block, made up of multiple attention heads
- an MLP (feed-forward) block

Each block reads from the residual stream and writes its output back into it by addition.
So when decomposing with Direct Logit Attribution, you're tracking contributions from:
- the token embedding (and positional embedding)
- every attention head in every layer
- every MLP block in every layer
If you have a 12-layer transformer with 12 attention heads per layer, you'd be decomposing contributions from:
- 144 attention heads (12 layers × 12 heads)
- 12 MLP blocks (one per layer)
- 1 embedding component

for 157 components in total.
Each of these components adds its own vector to the residual stream, and DLA lets you measure how much each one contributed to the final logit for any given token.
For a token t, the logit is:
logit(t) = (x_final)ᵀ · W_U[t]
where:
- x_final is the residual stream vector at the final token position, after the last layer (in practice the final layer norm is applied first; DLA typically folds it into W_U or treats it as approximately linear)
- W_U[t] is the column of the unembedding matrix W_U corresponding to token t

Because x_final is just the sum of every component's output, the logit decomposes linearly:

logit(t) = Σᵢ (cᵢ)ᵀ · W_U[t]

where cᵢ is the vector that component i wrote to the residual stream. Each term is that component's direct contribution to the logit.
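The decomposition above can be checked numerically with a minimal sketch. This uses random stand-in vectors rather than a real model, and the sizes (d_model, vocab, number of components) are illustrative assumptions, not any particular architecture's values:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 16, 50   # toy dimensions, chosen for illustration
n_components = 157        # e.g. 144 heads + 12 MLPs + 1 embedding

# Stand-in vectors: what each component wrote to the residual stream.
components = rng.normal(size=(n_components, d_model))

# The final residual stream is the sum of all component outputs.
x_final = components.sum(axis=0)

W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix
t = 7                                    # arbitrary token of interest

# Per-component direct contributions to logit(t): each row of
# `components` dotted with the unembedding column for token t.
contributions = components @ W_U[:, t]

# Linearity check: the contributions sum to the full logit.
assert np.isclose(contributions.sum(), x_final @ W_U[:, t])
```

In a real model the `components` array would be collected from cached activations (e.g. per-head attention outputs and MLP outputs), but the attribution step itself is exactly this matrix–vector product.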