Adam Zweiger (@adamzweiger)'s Twitter Profile
Adam Zweiger

@adamzweiger

ID: 1571391036416724992

18-09-2022 06:50:11

0 Tweet

14 Followers

197 Following

Here are all the architecture tricks used by gpt-oss:
- Attention sinks - for each attention head, have a learned scalar such that softmax(qk) becomes softmax over [a_1, a_2, ..., a_T, sink]. Tokens don't have to attend to anything if all the attention scores are low!
-
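
A rough sketch of the attention-sink idea described in the tweet, for a single query in a single head. The function name `softmax_with_sink` and the example numbers are made up for illustration, and the sink is passed in as a fixed value rather than learned. This is not gpt-oss's actual code, only a NumPy reading of "softmax over [a_1, a_2, ..., a_T, sink]".

```python
import numpy as np

def softmax_with_sink(scores, sink):
    """Softmax over [a_1, ..., a_T, sink], returning only the T token weights.

    scores: shape (T,), attention logits for one query in one head.
    sink:   scalar logit (learned per head in the tweet; fixed here as an assumption).
    """
    logits = np.append(scores, sink)          # [a_1, ..., a_T, sink]
    logits = logits - logits.max()            # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[:-1]                         # drop the sink's share of the mass

# All scores low relative to the sink: the sink absorbs almost all the mass.
low = softmax_with_sink(np.array([-5.0, -6.0, -5.5]), sink=0.0)

# One score well above the sink: that token gets most of the mass.
high = softmax_with_sink(np.array([4.0, -6.0, -5.5]), sink=0.0)

print(low.sum(), high.sum())
```

Because the sink's probability is discarded, the token weights can sum to less than 1, which is what lets a head attend to "nothing" when every score is low.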