Direct Logit Attribution (DLA): a technique for understanding which parts of a neural network (transformer) are responsible for predicting specific tokens in the output.
In a transformer, each layer typically contains:
- an attention block, made up of multiple attention heads
- an MLP (feed-forward) block

Each block reads from the residual stream and writes its output back into it by addition.
So when decomposing with Direct Logit Attribution, you're tracking contributions from:
- the token embedding (and positional embedding)
- every attention head in every layer
- every MLP block in every layer
If you have a 12-layer transformer with 12 attention heads per layer, you'd be decomposing contributions from:
- 144 attention heads (12 layers × 12 heads)
- 12 MLP blocks (one per layer)
- 1 embedding component

for 157 components in total.
Each of these components adds its own vector to the residual stream, and DLA lets you measure how much each one contributed to the final logit for any given token.
For a token t, the logit is:
logit(t) = (x_final)ᵀ · W_U[t]
where:
- x_final is the residual stream vector at the final token position, after the last layer (in practice the final layer norm is applied first; DLA typically folds it into W_U or treats it as approximately linear)
- W_U[t] is the column of the unembedding matrix W_U corresponding to token t

Because x_final is just the sum of every component's output, the logit decomposes linearly:

logit(t) = Σᵢ (cᵢ)ᵀ · W_U[t]

where cᵢ is the vector that component i wrote to the residual stream. Each term is that component's direct contribution to the logit.
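The decomposition above can be checked numerically with a minimal sketch. This uses random stand-in vectors rather than a real model, and the sizes (d_model, vocab, number of components) are illustrative assumptions, not any particular architecture's values:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 16, 50   # toy dimensions, chosen for illustration
n_components = 157        # e.g. 144 heads + 12 MLPs + 1 embedding

# Stand-in vectors: what each component wrote to the residual stream.
components = rng.normal(size=(n_components, d_model))

# The final residual stream is the sum of all component outputs.
x_final = components.sum(axis=0)

W_U = rng.normal(size=(d_model, vocab))  # toy unembedding matrix
t = 7                                    # arbitrary token of interest

# Per-component direct contributions to logit(t): each row of
# `components` dotted with the unembedding column for token t.
contributions = components @ W_U[:, t]

# Linearity check: the contributions sum to the full logit.
assert np.isclose(contributions.sum(), x_final @ W_U[:, t])
```

In a real model the `components` array would be collected from cached activations (e.g. per-head attention outputs and MLP outputs), but the attribution step itself is exactly this matrix–vector product.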