Mike Lewis (@ml_perception)'s Twitter Profile
Mike Lewis

@ml_perception

Llama3 pre-training lead. Partially to blame for things like the Cicero Diplomacy bot, BART, RoBERTa, kNN-LM, top-k sampling & Deal Or No Deal.

ID: 1170214705056452609

Joined: 07-09-2019 05:58:31

272 Tweets

7.7K Followers

233 Following

Vedanuj Goswami (@vedanujg)'s Twitter Profile Photo

Happy to be part of this incredible journey of Llama 3 and to share the best open-weight 8B and 70B models! Our largest 400B+ model is still cooking, but we are providing a sneak peek into how it is trending! Check out more details here: ai.meta.com/blog/meta-llam…

Sharan Narang (@sharan0909)'s Twitter Profile Photo

Excited to share the Llama 3 models with everyone. This has been an INCREDIBLE team effort. The 8B and 70B models are available now. These are the best open-source models.

Mike Lewis (@ml_perception)'s Twitter Profile Photo

Yes, both the 8B and 70B are trained way beyond what is Chinchilla-optimal - but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was still improving even at 15T tokens.
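
For a rough sense of scale, here is a back-of-the-envelope sketch assuming the popular ~20 tokens-per-parameter Chinchilla heuristic (an approximation, not the exact scaling-law fit):

```python
# Back-of-the-envelope only: how far past a ~20 tokens/parameter "Chinchilla
# optimal" budget the Llama 3 8B and 70B runs go when trained on 15T tokens.
CHINCHILLA_TOKENS_PER_PARAM = 20  # common rule-of-thumb approximation

for params_b in (8, 70):
    optimal_tokens_t = params_b * 1e9 * CHINCHILLA_TOKENS_PER_PARAM / 1e12
    ratio = 15 / optimal_tokens_t
    print(f"{params_b}B: ~{optimal_tokens_t:.2f}T 'optimal' tokens vs 15T trained (~{ratio:.0f}x over)")
```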

Mike Lewis (@ml_perception)'s Twitter Profile Photo

I'm seeing a lot of questions about the limit of how good you can make a small LLM. tldr; benchmarks saturate, models don't. LLMs will improve logarithmically forever with enough good data.
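
As a purely illustrative sketch of what "improving logarithmically with data" means numerically (made-up constants, not Llama training curves):

```python
import numpy as np

# Toy curve: loss falls by a roughly constant amount for every 10x more data,
# so the model keeps improving at every scale even after any fixed benchmark
# built on top of it has saturated. Constants here are invented for illustration.
tokens = np.array([1e9, 1e10, 1e11, 1e12, 1e13, 1e14])
loss = 4.0 - 0.25 * np.log10(tokens)

for t, l in zip(tokens, loss):
    print(f"{t:.0e} tokens -> loss {l:.2f}")
```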

lmarena.ai (formerly lmsys.org) (@lmarena_ai)'s Twitter Profile Photo

Moreover, we observe even stronger performance in the English category, where Llama 3's ranking jumps to ~1st place alongside GPT-4-Turbo! It consistently performs strongly against top models by human preference (see the win-rate matrix). It's been optimized for dialogue scenarios with large…
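
For readers unfamiliar with how Arena-style ratings map to head-to-head results, here is a minimal sketch of the standard Elo win-probability formula (the Arena leaderboard itself fits a Bradley-Terry model over human votes, which this only approximates):

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

# Hypothetical ratings, purely for illustration.
print(f"{elo_win_prob(1250, 1250):.2f}")  # evenly matched -> 0.50
print(f"{elo_win_prob(1250, 1150):.2f}")  # +100 Elo -> ~0.64 expected win rate
```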

Mike Lewis (@ml_perception)'s Twitter Profile Photo

Heading to ICLR! I'm writing fewer papers now to train more Llamas, but proud of our work here: Instruction Backtranslation (arxiv.org/abs/2308.06259), Attention Sinks (arxiv.org/abs/2309.17453), In-Context Pretraining (arxiv.org/abs/2310.10638), and RA-DIT (arxiv.org/abs/2310.01352).

Zexuan Zhong (@zexuanzhong)'s Twitter Profile Photo

Introducing Lory, a fully-differentiable MoE arch for decoder LM pre-training! Lory merges expert FFNs by computing a weighted average in parameter space, and computes the output through the merged FFN. But training naively is infeasible - how do we make it work? Details in 🧵
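
A minimal sketch of that weight-space merging idea (averaging expert FFN parameters rather than mixing expert outputs), ignoring Lory's segment-level routing and the training tricks the thread goes on to describe:

```python
import torch

def soft_merged_ffn(x, routing_weights, experts_w1, experts_w2):
    """Merge expert FFN *weights* by a routing-weighted average, then run one
    forward pass through the merged FFN. Shapes:
      x:               (d_model,)
      routing_weights: (n_experts,) softmax probabilities
      experts_w1:      (n_experts, d_ff, d_model)
      experts_w2:      (n_experts, d_model, d_ff)
    """
    w1 = torch.einsum("e,efd->fd", routing_weights, experts_w1)  # merged up-projection
    w2 = torch.einsum("e,edf->df", routing_weights, experts_w2)  # merged down-projection
    return w2 @ torch.relu(w1 @ x)

# Toy usage with made-up sizes.
d_model, d_ff, n_experts = 16, 64, 4
x = torch.randn(d_model)
routing = torch.softmax(torch.randn(n_experts), dim=-1)
out = soft_merged_ffn(x, routing,
                      torch.randn(n_experts, d_ff, d_model),
                      torch.randn(n_experts, d_model, d_ff))
print(out.shape)  # torch.Size([16])
```

Because the merged weights are a smooth function of the routing weights, gradients flow to every expert and to the router, which is what "fully differentiable" refers to here.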

Ruoxi Jia (@ruoxijia)'s Twitter Profile Photo

Thrilled to be in Vienna for our ICLR workshop, Navigating and Addressing Data Problems for Foundation Models. Starting Saturday at 8:50 AM, our program features keynote talks, best paper presentations, a poster session, and a panel discussion. Explore the full schedule here!

Mike Lewis (@ml_perception)'s Twitter Profile Photo

So excited for the open release of Llama 3.1 405B - with MMLU > 87, it's a really strong model and I can't wait to see what you all build with it! llama.meta.com Also check out the paper here, with lots of details on how this was made: tinyurl.com/2z2cpj8m
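
If you want to try the open weights, a minimal sketch with Hugging Face transformers looks roughly like this (the repo id and generation settings are assumptions; access is license-gated, and the 405B needs a multi-GPU setup, so a smaller variant is shown):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed repo id; gated access
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs `accelerate`

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```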

Mike Lewis (@ml_perception)'s Twitter Profile Photo

tldr; you can go a long way in pre-training by (1) curating amazing data, (2) using a lot of FLOPs, and (3) otherwise not screwing up. All three are harder than they sound, so read the paper... That said, I'm amazed by our progress since Llama 3 - expect big things from Llama 4!

Victoria X Lin (@victorialinml)'s Twitter Profile Photo

1/n Introducing MoMa 🖼, our new sparse early-fusion architecture for mixed-modal language modeling that significantly boosts pre-training efficiency 🚀 (arxiv.org/pdf/2407.21770). MoMa employs a mixture-of-experts (MoE) framework with modality-specific expert groups. Given any…
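
A simplified sketch of the modality-specific expert-group idea (top-1 routing, two modalities, none of the paper's load-balancing or efficiency machinery):

```python
import torch
import torch.nn as nn

class ModalityGroupedMoE(nn.Module):
    """Each token is routed only among the experts of its own modality group."""

    def __init__(self, d_model=32, d_ff=64, n_experts=4, modalities=("text", "image")):
        super().__init__()
        def group():
            return nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            ])
        self.experts = nn.ModuleDict({m: group() for m in modalities})
        self.routers = nn.ModuleDict({m: nn.Linear(d_model, n_experts) for m in modalities})

    def forward(self, x, modality):  # x: (n_tokens, d_model); modality: list of names
        out = torch.zeros_like(x)
        for name, experts in self.experts.items():
            idx = torch.tensor([i for i, m in enumerate(modality) if m == name])
            if idx.numel() == 0:
                continue
            tokens = x[idx]
            choice = self.routers[name](tokens).argmax(dim=-1)  # top-1 within the group
            for e, expert in enumerate(experts):
                sel = (choice == e).nonzero(as_tuple=True)[0]
                if sel.numel():
                    out[idx[sel]] = expert(tokens[sel])
        return out

# Toy usage: three text tokens and two image tokens share one layer.
layer = ModalityGroupedMoE()
x = torch.randn(5, 32)
print(layer(x, ["text", "text", "image", "text", "image"]).shape)  # torch.Size([5, 32])
```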

Weixin Liang (@liang_weixin)'s Twitter Profile Photo

How can we reduce pretraining costs for multi-modal models without sacrificing quality? We study this question in our new work: arxiv.org/abs/2411.04996. At AI at Meta, we introduce Mixture-of-Transformers (MoT), a sparse architecture with modality-aware sparsity for every non-embedding…
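
A simplified sketch of that modality-aware sparsity: every non-embedding weight (here just the attention projections and the FFN) has a copy per modality, each token uses its own modality's copy, and self-attention is still computed globally over the mixed-modal sequence. This illustrates the concept only and is not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTBlockSketch(nn.Module):
    def __init__(self, d=32, modalities=("text", "image")):
        super().__init__()
        self.d = d
        self.qkv = nn.ModuleDict({m: nn.Linear(d, 3 * d) for m in modalities})
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for m in modalities
        })

    def forward(self, x, modality):  # x: (seq_len, d); modality: list of names per token
        q, k, v = torch.empty_like(x), torch.empty_like(x), torch.empty_like(x)
        groups = {m: torch.tensor([i for i, t in enumerate(modality) if t == m])
                  for m in set(modality)}
        for m, idx in groups.items():                          # modality-specific projections
            q[idx], k[idx], v[idx] = self.qkv[m](x[idx]).chunk(3, dim=-1)
        # Shared global attention over the whole sequence (no causal mask, for brevity).
        attn = F.softmax(q @ k.T / self.d ** 0.5, dim=-1) @ v
        out = torch.empty_like(x)
        for m, idx in groups.items():                          # modality-specific FFNs
            h = x[idx] + attn[idx]
            out[idx] = h + self.ffn[m](h)
        return out

# Toy usage on a mixed text/image sequence.
block = MoTBlockSketch()
print(block(torch.randn(6, 32), ["text", "image", "image", "text", "text", "image"]).shape)
```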

Artidoro Pagnoni (@artidoropagnoni)'s Twitter Profile Photo

🚀 Introducing the Byte Latent Transformer (BLT) – An LLM architecture that scales better than Llama 3 using byte-patches instead of tokens 🤯
Paper 📄 dl.fbaipublicfiles.com/blt/BLT__Patch…
Code 🛠️ github.com/facebookresear…
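
A toy sketch of the byte-patching idea (BLT actually sizes patches dynamically with a small byte-level entropy model; fixed-size patches are used here purely for illustration):

```python
def byte_patches(text: str, patch_size: int = 4):
    """Group raw UTF-8 bytes into fixed-size patches; the large transformer in
    BLT operates on patch representations instead of BPE tokens."""
    data = text.encode("utf-8")
    return [data[i:i + patch_size] for i in range(0, len(data), patch_size)]

print(byte_patches("Byte Latent Transformer"))
# [b'Byte', b' Lat', b'ent ', b'Tran', b'sfor', b'mer']
```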

Qizhen (Irene) Zhang (@irenezhang30)'s Twitter Profile Photo

✨New Preprint✨ We introduce 𝐁𝐫𝐚𝐧𝐜𝐡-𝐓𝐫𝐚𝐢𝐧-𝐒𝐭𝐢𝐭𝐜𝐡 (𝐁𝐓𝐒), an efficient & flexible method for stitching together independently pretrained LLM experts (e.g. code, math) into a single, capable generalist model. Key takeaways: ✅ BTS achieves the best average…

Nicholas Roberts (@nick11roberts)'s Twitter Profile Photo

📉📉NEW SCALING LAW PHENOMENON 📉📉 We find that knowledge and reasoning exhibit different scaling behaviors! Super excited to finally tell you all about our paper on the compute optimal scaling of skills: arxiv.org/pdf/2503.10061 [1/n]
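
The flavour of analysis, in a toy form (numbers invented; the paper fits proper scaling laws): different skill categories can improve at different rates per decade of compute, which changes what "compute-optimal" means per skill.

```python
import numpy as np

# Hypothetical benchmark-error points, purely for illustration.
compute = np.array([1e19, 1e20, 1e21, 1e22])            # training FLOPs
knowledge_err = np.array([0.62, 0.48, 0.39, 0.33])
reasoning_err = np.array([0.70, 0.65, 0.61, 0.58])

for name, err in [("knowledge", knowledge_err), ("reasoning", reasoning_err)]:
    slope, _ = np.polyfit(np.log10(compute), err, deg=1)
    print(f"{name}: ~{-slope:.3f} error reduction per 10x compute")
```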

Guangxuan Xiao (@guangxuan_xiao)'s Twitter Profile Photo

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models. For those interested in the details: hanlab.mit.edu/blog/streaming…
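
The core cache policy behind attention sinks / StreamingLLM is simple to sketch: always keep the key/value entries for the first few "sink" tokens plus a sliding window of the most recent tokens, and evict everything in between (cache bookkeeping only; not the official implementation):

```python
def evict_kv(cache_positions, n_sinks=4, window=1020):
    """Keep the first `n_sinks` positions and the last `window` positions."""
    if len(cache_positions) <= n_sinks + window:
        return cache_positions
    return cache_positions[:n_sinks] + cache_positions[-window:]

positions = list(range(5000))            # pretend we've generated 5000 tokens
kept = evict_kv(positions)
print(len(kept), kept[:6], kept[-2:])    # 1024 [0, 1, 2, 3, 3980, 3981] [4998, 4999]
```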

Mike Lewis (@ml_perception)'s Twitter Profile Photo

Love seeing these incredibly creative new evaluations! Optimizing benchmarks is easy; the real challenge is generalizing to the tasks that don't exist yet.