Here are all the architecture tricks used by gpt-oss:
- Attention sinks - each attention head gets a learned scalar logit (the "sink") that is appended to the attention scores before the softmax, so softmax(qk) becomes a softmax over [a_1, a_2, ..., a_T, sink]. The sink soaks up probability mass but contributes no value, so a token can effectively attend to nothing when all of its real attention scores are low!
-
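A minimal NumPy sketch of single-query attention with a sink, to make the mechanism concrete. This is illustrative only (not the gpt-oss implementation); the function name and shapes are made up for the example:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sink_attention(q, K, V, sink_logit):
    """Single-query attention with a learned sink logit (illustrative).

    q: (d,) query, K: (T, d) keys, V: (T, d) values,
    sink_logit: learned per-head scalar.
    """
    # scores for the T real tokens: a_1 ... a_T
    logits = K @ q / np.sqrt(q.shape[-1])
    # softmax over [a_1, ..., a_T, sink]
    probs = softmax(np.concatenate([logits, [sink_logit]]))
    # the sink's mass is simply discarded: only the first T
    # probabilities weight the values, so the output shrinks
    # toward zero as the sink absorbs more mass
    return probs[:-1] @ V, probs[-1]

rng = np.random.default_rng(0)
d, T = 4, 3
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# large sink logit: almost all mass goes to the sink,
# i.e. the token attends to (almost) nothing
out_sunk, sink_mass = sink_attention(q, K, V, sink_logit=10.0)

# very negative sink logit: behaves like ordinary softmax attention
out_plain, sink_mass_plain = sink_attention(q, K, V, sink_logit=-1e9)
```

With a large sink logit the output collapses toward zero; with a very negative one the sink is inert and you recover standard attention, which is why this can be folded into existing attention kernels cheaply.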