attention sinks may be a bias in causal transformers.
as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for which attention sinks may be an