Peter Humphreys (@p_humphreys)'s Twitter Profile
Peter Humphreys

@p_humphreys

AI, quantum and neuroscience. Scientist @ DeepMind

ID: 1244596843

Joined: 05-03-2013 20:45:57

10 Tweets

63 Followers

33 Following

Adam Santoro (@santoroai):

Transformers can be made sparse across their depth. When trained isoFLOP, we can match or exceed the performance of vanilla models, while saving inference FLOPs arxiv.org/abs/2404.02258
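The routing idea behind the linked Mixture-of-Depths paper is that each block only processes a fixed fraction of tokens, chosen by a learned router, while the rest skip the block through the residual stream. Below is a minimal, hypothetical PyTorch sketch of that idea; the names (MoDBlock, capacity_fraction), the sigmoid gate, and the omission of causal masking and positional handling are simplifications for illustration, not the paper's reference implementation.

```python
# Hypothetical sketch of a Mixture-of-Depths (MoD) style block, loosely following
# arxiv.org/abs/2404.02258: a per-block router selects the top-k tokens to process;
# the rest pass through unchanged via the residual stream.
import torch
import torch.nn as nn


class MoDBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, capacity_fraction: float = 0.125):
        super().__init__()
        self.capacity_fraction = capacity_fraction
        self.router = nn.Linear(d_model, 1)   # scalar routing score per token
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model). Causal masking and position handling
        # are omitted here for brevity.
        b, t, d = x.shape
        k = max(1, int(self.capacity_fraction * t))   # e.g. 12.5% of the tokens

        scores = self.router(x).squeeze(-1)           # (b, t)
        topk = scores.topk(k, dim=-1).indices         # indices of routed tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, d)    # (b, k, d)

        routed = x.gather(1, idx)                     # only the selected tokens
        h = self.norm1(routed)
        h = routed + self.attn(h, h, h, need_weights=False)[0]
        h = h + self.mlp(self.norm2(h))

        # Gate the block's update by the router score (sigmoid here is an
        # illustrative choice) and scatter it back; unselected tokens keep
        # their residual value untouched.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, routed + gate * (h - routed))
```

Because attention runs only over the k selected tokens, both the attention and MLP cost per block shrink with the capacity fraction, which is where the inference FLOP savings come from.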

Alex Hägele (@haeggee):

... and with one experiment, I was able to roughly reproduce their results for a ~220M GPT-2. It saves ~20 min of training time (80 min dense vs 60 min MoD on 4 A100s) while keeping the perplexity close. This roughly matches Fig. 3 or 4 in the paper arxiv.org/pdf/2404.02258…

George Grigorev (@iamgrigorev):

I have implemented Mixture-of-Depths and it shows a significant memory reduction during training and a 10% speed increase. I will verify whether it achieves the same quality with 12.5% active tokens. github.com/thepowerfuldee… Thanks to Alex Hägele for the initial code.
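For the 12.5% active-token setting mentioned here, a hedged usage example of the sketch above: with capacity_fraction=0.125 and a 1024-token sequence, only 128 positions go through attention and the MLP in each MoD block. The shapes and model width are illustrative and not taken from the linked repository.

```python
# Illustrative usage of the MoDBlock sketch above at 12.5% capacity.
import torch

block = MoDBlock(d_model=768, n_heads=12, capacity_fraction=0.125)
x = torch.randn(2, 1024, 768)   # (batch, seq_len, d_model)
y = block(x)                    # 128 of the 1024 positions are processed per block
print(y.shape)                  # torch.Size([2, 1024, 768])
```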
