 
                                Mike Lewis
@ml_perception
Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, kNN-LM, top-k sampling & Deal Or No Deal.
ID: 1170214705056452609
07-09-2019 05:58:31
272 Tweets
7.7K Followers
233 Following
 
How can we reduce pretraining costs for multi-modal models without sacrificing quality? We study this Q in our new work: arxiv.org/abs/2411.04996 At AI at Meta, we introduce Mixture-of-Transformers (MoT), a sparse architecture with modality-aware sparsity for every non-embedding…
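Reading between the lines of the tweet, the core idea seems to be: keep self-attention global over the mixed-modality token sequence, but give each modality its own copy of the other non-embedding weights (feed-forward layers, norms, etc.) and route tokens deterministically by modality. Below is a minimal PyTorch sketch of that idea under a two-modality assumption; the module names, dimensions, and the exact set of duplicated parameters are illustrative guesses, not the paper's implementation.

```python
# Minimal sketch of modality-aware sparsity: per-modality FFN and layer norms,
# shared global self-attention. Everything here is an illustrative assumption.
import torch
import torch.nn as nn


class ModalityAwareBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, n_modalities=2):
        super().__init__()
        # Attention is shared: every token attends over the full mixed-modality sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Non-embedding parameters duplicated per modality (assumed: FFN + norms).
        self.norm1 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_modalities))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_modalities)
        )

    def _route(self, modules, x, modality_ids):
        # Apply each token's own modality-specific module (deterministic routing,
        # unlike the learned gating of a mixture-of-experts).
        out = torch.zeros_like(x)
        for m, module in enumerate(modules):
            mask = modality_ids == m            # (batch, seq)
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality_ids):
        # x: (batch, seq, d_model); modality_ids: (batch, seq) with values in {0, 1, ...}
        h = self._route(self.norm1, x, modality_ids)
        attn_out, _ = self.attn(h, h, h)        # global attention across modalities
        x = x + attn_out
        h = self._route(self.norm2, x, modality_ids)
        return x + self._route(self.ffn, h, modality_ids)


if __name__ == "__main__":
    block = ModalityAwareBlock()
    tokens = torch.randn(2, 16, 256)
    modality_ids = torch.randint(0, 2, (2, 16))  # e.g. 0 = text, 1 = image
    print(block(tokens, modality_ids).shape)     # torch.Size([2, 16, 256])
```

Because routing is by token modality rather than a learned gate, this sketch needs no load-balancing loss; the sparsity comes entirely from each token touching only its own modality's FFN and norm parameters.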
 
Don’t miss this - I’ve worked with Mike (Mike Lewis) very closely at Meta and his talks are super informative and fun.
 
Nicholas Roberts (@nick11roberts):
📉📉NEW SCALING LAW PHENOMENON 📉📉
We find that knowledge and reasoning exhibit different scaling behaviors!
Super excited to finally tell you all about our paper on the compute optimal scaling of skills:
arxiv.org/pdf/2503.10061
[1/n]
(photo: https://pbs.twimg.com/media/GmhN1K0aUAA_yut.jpg)
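As an aside on what "different scaling behaviors" would look like concretely: if validation loss on a skill is assumed to follow a power law in training compute, loss(C) ≈ a · C^(-b), then the claim amounts to knowledge-type and reasoning-type skills having different fitted exponents. A tiny illustrative fit is sketched below; the numbers are synthetic placeholders, not data or results from the paper.

```python
# Fit a separate power law loss(C) ≈ a * C^(-b) per skill category and compare exponents.
# All values below are made up purely to illustrate the fitting procedure.
import numpy as np


def fit_power_law(compute, loss):
    # Fit log(loss) = log(a) - b * log(C) by least squares; return (a, b).
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
    return np.exp(intercept), -slope


compute = np.array([1e18, 1e19, 1e20, 1e21])      # training FLOPs (synthetic)
knowledge_loss = 5.0 * compute ** -0.05           # shallower decay (synthetic)
reasoning_loss = 20.0 * compute ** -0.08          # steeper decay (synthetic)

for name, loss in [("knowledge", knowledge_loss), ("reasoning", reasoning_loss)]:
    a, b = fit_power_law(compute, loss)
    print(f"{name}: loss ≈ {a:.2f} * C^-{b:.3f}")
```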