
Nafise Sadat Moosavi
@nafisesadat
Lecturer (~Assistant Prof.) in NLP @SheffieldNLP @shefcompsci, Muslim Iranian woman
We remain faithful to the pledge (إنا على العهد)
ID: 1492212827192532993
https://ns-moosavi.github.io/ 11-02-2022 19:04:16
313 Tweets
454 Followers
350 Following

Activation functions reduce the topological complexity of data. The best activation function may differ across models and even across layers, yet most Transformer models use GELU everywhere. What if the model instead learned optimized activation functions during training? Work led by Haishuo with Ji Ung Lee and Iryna Gurevych.
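
A minimal sketch of one way activations can be made learnable, not necessarily the method in this work: parameterize the activation as a trainable rational function (a ratio of polynomials, as in Padé-style activation units) and optimize its coefficients jointly with the model weights. All names, polynomial degrees, and the initialization below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class LearnableRationalActivation(nn.Module):
    """Trainable activation f(x) = P(x) / Q(x), a ratio of polynomials.

    The coefficients are ordinary parameters, so each layer can learn
    its own activation shape during training. Illustrative sketch only;
    degrees and initialization are assumptions, not the paper's setup.
    """

    def __init__(self, numerator_degree: int = 5, denominator_degree: int = 4):
        super().__init__()
        # Small random init; a real implementation might instead fit the
        # coefficients to GELU so training starts from a familiar shape.
        self.p = nn.Parameter(torch.randn(numerator_degree + 1) * 0.1)
        self.q = nn.Parameter(torch.randn(denominator_degree) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # P(x) = p0 + p1*x + ... + pn*x^n, evaluated Horner-style.
        numerator = torch.zeros_like(x)
        for coeff in self.p.flip(0):
            numerator = numerator * x + coeff
        # Q(x) = 1 + |q1*x + ... + qm*x^m|: the absolute value keeps the
        # denominator >= 1, avoiding poles (a common "safe" formulation).
        denominator = torch.zeros_like(x)
        for coeff in self.q.flip(0):
            denominator = (denominator + coeff) * x
        return numerator / (1.0 + denominator.abs())


# Usage: drop it into a Transformer feed-forward block in place of GELU.
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    LearnableRationalActivation(),
    nn.Linear(3072, 768),
)
x = torch.randn(2, 10, 768)
print(ffn(x).shape)  # torch.Size([2, 10, 768])
```

A rational parameterization is one natural choice here because low-degree ratios of polynomials can closely approximate GELU, ReLU, and other standard activations while adding only a handful of parameters per layer.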