Technique to understand which parts of a neural network (transformer) are responsible for predicting specific tokens in the output

In a transformer, each layer typically contains:

- An attention block, made up of multiple attention heads
- An MLP (feed-forward) block

So when decomposing with Direct Logit Attribution, you're tracking contributions from:

- The token (and positional) embeddings
- The output of each individual attention head
- The output of each MLP block

The Structure

If you have a 12-layer transformer with 12 attention heads per layer, you'd be decomposing contributions from:

- 144 attention heads (12 layers × 12 heads per layer)
- 12 MLP blocks (one per layer)
- 1 direct path from the embeddings

That's 157 components in total.

Each of these components adds its own vector to the residual stream, and DLA lets you measure how much each one contributed to the final logit for any given token.
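This additivity can be sketched numerically. The example below is a minimal illustration with random vectors standing in for real component outputs (the shapes, the token index, and the omission of the final layer norm are simplifying assumptions, not a real model): because the logit is linear in the residual stream, the per-component contributions sum exactly to the final logit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000

# Hypothetical per-component outputs: 1 embedding path + 144 attention
# heads + 12 MLP blocks, each writing a d_model-dimensional vector
# into the residual stream (random stand-ins, not a real model).
n_components = 1 + 144 + 12
component_outputs = rng.normal(size=(n_components, d_model))

# The final residual stream is the sum of all component outputs
# (ignoring the final layer norm for simplicity).
x_final = component_outputs.sum(axis=0)

W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix
t = 42                                   # arbitrary token of interest

# Direct Logit Attribution: each component's dot product with the
# unembedding column for token t is its contribution to that logit.
dla = component_outputs @ W_U[:, t]

# By linearity, the contributions sum exactly to the final logit.
assert np.allclose(dla.sum(), x_final @ W_U[:, t])
```

Sorting `dla` then tells you which heads or MLPs pushed the logit for token t up or down the most.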

Basic Formula

For a token t, the logit is:

logit(t) = (x_final)ᵀ · W_U[t]

where:

- x_final is the residual stream vector at the final position, after the last layer (in practice the final layer norm is folded into the unembedding or approximated as linear)
- W_U[t] is the column of the unembedding matrix W_U corresponding to token t
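As a quick numerical check of the formula (random stand-in values, not a real model), the logit for a single token is just the dot product of the final residual vector with that token's unembedding column, which matches indexing into the full logit vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 1000

x_final = rng.normal(size=d_model)       # final residual stream vector
W_U = rng.normal(size=(d_model, vocab))  # unembedding matrix
t = 7                                    # arbitrary token index

# logit(t) = (x_final)ᵀ · W_U[t]
logit_t = x_final @ W_U[:, t]

# Same value as computing all logits at once and indexing.
all_logits = x_final @ W_U
assert np.isclose(logit_t, all_logits[t])
```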