
Nafise Sadat Moosavi
@nafisesadat
Lecturer (~Assistant Prof.) in NLP @SheffieldNLP @shefcompsci, Muslim Iranian woman
We remain faithful to the pledge (إنا على العهد)
ID: 1492212827192532993
https://ns-moosavi.github.io/ 11-02-2022 19:04:16
313 Tweets
454 Followers
350 Following

Activation functions reduce the topological complexity of data. The best activation function may differ across models and even across layers, yet most Transformer models use GELU everywhere. What if the model instead learned optimized activation functions during training? Work led by Haishuo with Ji Ung Lee and Iryna Gurevych.
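
A minimal sketch of one way activations can be made learnable, not necessarily the method in this work: parameterize the activation as a trainable rational function (a ratio of polynomials, as in Padé-style activation units) and optimize its coefficients jointly with the model weights. All names, polynomial degrees, and the initialization below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LearnableRationalActivation(nn.Module):
    """Trainable activation f(x) = P(x) / Q(x), a ratio of polynomials.

    The coefficients are ordinary parameters, so each layer can learn
    its own activation shape during training. Illustrative sketch only;
    degrees and initialization are assumptions, not the paper's setup.
    """

    def __init__(self, numerator_degree: int = 5, denominator_degree: int = 4):
        super().__init__()
        # Small random init; a real implementation might instead fit the
        # coefficients to GELU so training starts from a familiar shape.
        self.p = nn.Parameter(torch.randn(numerator_degree + 1) * 0.1)
        self.q = nn.Parameter(torch.randn(denominator_degree) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = p0 + p1*x + ... + pn*x^n, evaluated Horner-style.
        numerator = torch.zeros_like(x)
        for coeff in self.p.flip(0):
            numerator = numerator * x + coeff
        # Q(x) = 1 + |q1*x + ... + qm*x^m|: the absolute value keeps the
        # denominator >= 1, avoiding poles (a common "safe" formulation).
        denominator = torch.zeros_like(x)
        for coeff in self.q.flip(0):
            denominator = (denominator + coeff) * x
        return numerator / (1.0 + denominator.abs())


# Usage: drop it into a Transformer feed-forward block in place of GELU.
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    LearnableRationalActivation(),
    nn.Linear(3072, 768),
)
x = torch.randn(2, 10, 768)
print(ffn(x).shape)  # torch.Size([2, 10, 768])
```

A rational parameterization is one natural choice here because low-degree ratios of polynomials can closely approximate GELU, ReLU, and other standard activations while adding only a handful of parameters per layer.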