Simo Ryu (@cloneofsimo)'s Twitter Profile
Simo Ryu

@cloneofsimo

I like cats, math and codes
[email protected]

ID: 1529156127006392320

https://github.com/cloneofsimo | Joined: 24-05-2022 17:43:33

3.3K Tweets

12.1K Followers

680 Following



Very interesting: standard attention causes vanishing gradients because most attention probabilities become very small after some training. LASER tackles this by pushing the attention operation into exponential space, i.e.,

exp_output = sm(QK^T) exp(V)

They don't seem to exaggerate the performance.
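The formula in the tweet can be sketched in NumPy. This is a hedged illustration, not the paper's reference implementation: it assumes the final output is recovered by taking a log of the exponential-space result, and it subtracts a per-dimension max of V before exponentiating (a log-sum-exp-style trick) to keep exp(V) from overflowing; the function names and the stability trick are my own, not from the source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def standard_attention(Q, K, V):
    # Plain scaled dot-product attention: sm(QK^T / sqrt(d)) V.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def laser_attention_sketch(Q, K, V):
    # Sketch of the tweet's formula: exp_output = sm(QK^T) exp(V),
    # then log to map the result back to value space.
    # Subtracting max(V) per dimension before exp, and adding it back
    # after the log, avoids overflow (assumed stabilization, not from the tweet).
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    P = softmax(scores)
    m = V.max(axis=0, keepdims=True)      # per-dimension max over keys
    exp_out = P @ np.exp(V - m)           # attention-weighted sum in exp space
    return np.log(exp_out) + m            # back to value space
```

Sanity check on the sketch: when every value vector is identical, the attention weights sum to 1, so both the standard and the exponential-space versions return that same vector exactly.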