 
                                Mike Lewis
@ml_perception
Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, kNN-LM, top-k sampling & Deal Or No Deal.
ID: 1170214705056452609
07-09-2019 05:58:31
272 Tweets
7.7K Followers
233 Following
 
How can we reduce pretraining costs for multi-modal models without sacrificing quality? We study this Q in our new work: arxiv.org/abs/2411.04996 At AI at Meta, we introduce Mixture-of-Transformers (MoT), a sparse architecture with modality-aware sparsity for every non-embedding…
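Reading between the lines of the tweet, the core idea seems to be: keep self-attention global over the mixed-modality token sequence, but give each modality its own copy of the other non-embedding weights (feed-forward layers, norms, etc.) and route tokens deterministically by modality. Below is a minimal PyTorch sketch of that idea under a two-modality assumption; the module names, dimensions, and the exact set of duplicated parameters are illustrative guesses, not the paper's implementation.

```python
# Minimal sketch of modality-aware sparsity: per-modality FFN and layer norms,
# shared global self-attention. Everything here is an illustrative assumption.
import torch
import torch.nn as nn


class ModalityAwareBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, n_modalities=2):
        super().__init__()
        # Attention is shared: every token attends over the full mixed-modality sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Non-embedding parameters duplicated per modality (assumed: FFN + norms).
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_modalities)
        )

    def _route(self, modules, x, modality_ids):
        # Apply each token's own modality-specific module (deterministic routing,
        # unlike the learned gating of a mixture-of-experts).
        out = torch.zeros_like(x)
        for m, module in enumerate(modules):
            mask = modality_ids == m            # (batch, seq)
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) with values in {0, 1, ...}
        h = self._route(self.norm1, x, modality_ids)
        attn_out, _ = self.attn(h, h, h)        # global attention across modalities
        x = x + attn_out
        h = self._route(self.norm2, x, modality_ids)
        return x + self._route(self.ffn, h, modality_ids)


if __name__ == "__main__":
    block = ModalityAwareBlock()
    tokens = torch.randn(2, 16, 256)
    modality_ids = torch.randint(0, 2, (2, 16))  # e.g. 0 = text, 1 = image
    print(block(tokens, modality_ids).shape)     # torch.Size([2, 16, 256])
```

Because routing is by token modality rather than a learned gate, this sketch needs no load-balancing loss; the sparsity comes entirely from each token touching only its own modality's FFN and norm parameters.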
 
Don’t miss this - I’ve worked with Mike (Mike Lewis) very closely at Meta and his talks are super informative and fun.
 
Nicholas Roberts (@nick11roberts):
📉📉NEW SCALING LAW PHENOMENON 📉📉
We find that knowledge and reasoning exhibit different scaling behaviors!
Super excited to finally tell you all about our paper on the compute optimal scaling of skills:
arxiv.org/pdf/2503.10061
[1/n]
(photo: https://pbs.twimg.com/media/GmhN1K0aUAA_yut.jpg)
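As an aside on what "different scaling behaviors" would look like concretely: if validation loss on a skill is assumed to follow a power law in training compute, loss(C) ≈ a · C^(-b), then the claim amounts to knowledge-type and reasoning-type skills having different fitted exponents. A tiny illustrative fit is sketched below; the numbers are synthetic placeholders, not data or results from the paper.

```python
# Fit a separate power law loss(C) ≈ a * C^(-b) per skill category and compare exponents.
# All values below are made up purely to illustrate the fitting procedure.
import numpy as np


def fit_power_law(compute, loss):
    # Fit log(loss) = log(a) - b * log(C) by least squares; return (a, b).
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope


compute = np.array([1e18, 1e19, 1e20, 1e21])      # training FLOPs (synthetic)
knowledge_loss = 5.0 * compute ** -0.05           # shallower decay (synthetic)
reasoning_loss = 20.0 * compute ** -0.08          # steeper decay (synthetic)

for name, loss in [("knowledge", knowledge_loss), ("reasoning", reasoning_loss)]:
    a, b = fit_power_law(compute, loss)
    print(f"{name}: loss ≈ {a:.2f} * C^-{b:.3f}")
```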