Simo Ryu (@cloneofsimo)'s Twitter Profile
Simo Ryu

@cloneofsimo

I like cats, math and codes
[email protected]

ID: 1529156127006392320

https://github.com/cloneofsimo | Joined: 24-05-2022 17:43:33

3.3K Tweets

12.1K Followers

680 Following



Very interesting: standard attention causes vanishing gradients because most attention probabilities become very small after some training. LASER tackles this by pushing the attention operation into exponential space, i.e.,

exp_output = sm(QK^T) exp(V)

They don't seem to exaggerate the performance.
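The formula in the tweet can be sketched in NumPy. This is a hedged illustration, not the paper's reference implementation: it assumes the final output is recovered by taking a log of the exponential-space result, and it subtracts a per-dimension max of V before exponentiating (a log-sum-exp-style trick) to keep exp(V) from overflowing; the function names and the stability trick are my own, not from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Plain scaled dot-product attention: sm(QK^T / sqrt(d)) V.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def laser_attention_sketch(Q, K, V):
    # Sketch of the tweet's formula: exp_output = sm(QK^T) exp(V),
    # then log to map the result back to value space.
    # Subtracting max(V) per dimension before exp, and adding it back
    # after the log, avoids overflow (assumed stabilization, not from the tweet).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    P = softmax(scores)
    m = V.max(axis=0, keepdims=True)      # per-dimension max over keys
    exp_out = P @ np.exp(V - m)           # attention-weighted sum in exp space
    return np.log(exp_out) + m            # back to value space
```

Sanity check on the sketch: when every value vector is identical, the attention weights sum to 1, so both the standard and the exponential-space versions return that same vector exactly.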