Davis Blalock (@davisblalock) 's Twitter Profile
Davis Blalock

@davisblalock

Research scientist + first hire @MosaicML, now @Databricks. @MIT PhD. I post about AI technical progress + sometimes the business side.

ID: 805547773944889344

Link: http://bit.ly/3OXJbDs · Joined: 04-12-2016 23:02:10

1.1K Tweets

12.1K Followers

170 Following

Sean Welleck (@wellecks) 's Twitter Profile Photo

What do nucleus sampling, tree-of-thought, and PagedAttention have in common?

They're all part of our new survey: "From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models"

arxiv.org/abs/2406.16838
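For context on one of the decoding methods the survey covers: nucleus (top-p) sampling keeps the smallest set of next-token candidates whose probability mass reaches p, renormalizes, and samples from that set. A minimal sketch (mine, not code from the survey):

import torch

def nucleus_sample(logits, p=0.9):
    # Top-p (nucleus) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p, renormalize, and sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1   # include the token that crosses p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_idx[choice].item())

# Toy example: next-token logits over a 5-token vocabulary.
print(nucleus_sample(torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0]), p=0.9))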
Mihir Patel (@mvpatel2000) 's Twitter Profile Photo

Gemma 2 is out with 9b and 27b! A few things I really liked in tech report:

On pretraining:
- GQA (finally lol)
- interleaving global attn w local (but 4k vs. 8k? it feels like it should support longer and was kneecapped...)
- 16x expansion ratio (very wide!)

(1/n)
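For reference, grouped-query attention (GQA) lets several query heads share one key/value head, which shrinks the KV cache. A rough sketch with made-up shapes (not Gemma 2's actual head counts, and without the interleaved local/global masking the report describes):

import torch

def grouped_query_attention(q, k, v):
    # q: (seq, n_q_heads, head_dim); k, v: (seq, n_kv_heads, head_dim).
    # Each group of n_q_heads // n_kv_heads query heads shares one KV head.
    # (No causal / sliding-window mask in this sketch.)
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)             # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", torch.softmax(scores, dim=-1), v)

# Toy shapes: 128 tokens, 16 query heads sharing 8 KV heads, head_dim 64.
q = torch.randn(128, 16, 64)
k = torch.randn(128, 8, 64)
v = torch.randn(128, 8, 64)
print(grouped_query_attention(q, k, v).shape)         # torch.Size([128, 16, 64])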
Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Thread on our newest paper: 1/n
The initial motivation of our project was the "lost in the middle" phenomenon observed by Nelson Liu et al. (arxiv.org/pdf/2307.03172): models like GPT and Claude were bad at retrieving information from the middle/end of the input context.

Sara Hooker (@sarahookr) 's Twitter Profile Photo

Does more compute equate with greater risk? 

What is our track record at predicting what risks emerge with scale?

I don't have much time anymore to write, but this felt important enough to write down some thoughts on. 

arxiv.org/pdf/2407.05694…
Naomi Saphra hiring a lab 🧈🪰 (@nsaphra) 's Twitter Profile Photo

Chatbots have biases in what they say—but what about biases in what they WON'T say? Our new paper (w/Victoria Li & Yida Chen) shows that personal info like a user's race, age, or love for the Los Angeles Chargers decides if ChatGPT refuses a request. arxiv.org/abs/2407.06866

Max Zimmer @ ICLR25 (@maxzimmerberlin) 's Twitter Profile Photo

A good time to share our #ICLR2023 paper: How I Learned to Stop Worrying and Love Retraining

We explore sparsity-adaptive LR schedules and show that with proper LR care, simple pruning can outperform complex methods that 'learn' the sparsity.

📜 arxiv.org/abs/2111.00843

🧵1/n
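The basic recipe studied here is simple magnitude pruning followed by retraining under a carefully chosen LR schedule. A toy sketch under my own assumptions (one-shot global pruning of a tiny MLP, then retraining with a cosine schedule, not the paper's exact sparsity-adaptive schedule):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))   # toy model

def magnitude_prune(model, sparsity):
    # Zero the smallest-magnitude weights globally; return per-parameter masks.
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_w, sparsity)
    masks = [(p.detach().abs() > threshold).float() for p in model.parameters()]
    for p, m in zip(model.parameters(), masks):
        p.data.mul_(m)
    return masks

masks = magnitude_prune(model, sparsity=0.9)

# Retrain the surviving weights with a fresh LR schedule; the paper's point is
# that getting this schedule right is what makes simple pruning competitive.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))                # dummy data
for _ in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward()
    for p, m in zip(model.parameters(), masks):
        p.grad.mul_(m)                                # keep pruned weights at zero
    opt.step(); sched.step()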

Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

The new Llama-3.1 8B and 70B base models are likely derived from/very similar to the Llama-3 base models rather than created whole-cloth from scratch. Not shocking, but great to put hard data behind suspicions.

Azalia Mirhoseini (@azaliamirh) 's Twitter Profile Photo

Is inference compute a new dimension for scaling LLMs?

In our latest paper, we explore scaling inference compute by increasing the number of samples per input. Across several models and tasks, we observe that coverage – the fraction of problems solved by at least one attempt –
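Coverage here is the repeated-sampling analogue of pass@k. One standard way to estimate it from n generated samples per problem, of which c are correct, is the unbiased estimator 1 - C(n-c, k)/C(n, k); the numbers below are made up:

from math import comb

def coverage_at_k(n, c, k):
    # Chance that at least one of k samples drawn (without replacement) from n
    # generated attempts is correct, given c correct attempts: 1 - C(n-c, k)/C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: three problems, 100 samples each, with 3, 0, and 40 correct attempts.
per_problem = [coverage_at_k(100, c, k=10) for c in (3, 0, 40)]
print(sum(per_problem) / len(per_problem))            # average coverage at k=10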
Thomas Wolf (@thom_wolf) 's Twitter Profile Photo

It’s Sunday morning, we have some time with the coffee, so let me tell you about our recent, surprising journey in synthetic data and small language models.

This post is prompted by the coming release of an instant, in-browser model called SmolLM360 (link at the end)

The
Zack Ankner (@zackankner) 's Twitter Profile Photo

Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
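A rough sketch of the dataflow, under my own simplifications (a small stand-in base model, an untrained reward head, and a made-up prompt format rather than the paper's setup): generate a critique of the response first, then map the hidden state of prompt + response + critique to a scalar reward.

import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                          # small stand-in base model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)
reward_head = nn.Linear(lm.config.hidden_size, 1)      # trained in the paper; untrained here

def cloud_reward(prompt, response):
    text = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    ids = tok(text, return_tensors="pt").input_ids
    # Step 1: critique out loud (explicit reasoning about the response).
    critiqued = lm.generate(ids, max_new_tokens=64, do_sample=False,
                            pad_token_id=tok.eos_token_id)
    # Step 2: predict a scalar reward conditioned on prompt + response + critique.
    hidden = lm(critiqued, output_hidden_states=True).hidden_states[-1]
    return reward_head(hidden[:, -1]).item()

print(cloud_reward("What is 2 + 2?", "4"))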

xr-5 🐀 (@xariusrke) 's Twitter Profile Photo

1/n FP8 training is hard - loss divergence and instability often lead to the conclusion that it’s not possible. But we’ve found a recipe to train a 1B LLaMA model to match the convergence of bfloat16 while performing both the forward pass and backward pass in FP8 and using an FP8
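Not the thread's recipe, but a tiny illustration of why FP8 is delicate: e4m3 has a narrow dynamic range (max around 448), so tensors need per-tensor (or finer-grained) scaling before casting. A simulated round trip, assuming PyTorch >= 2.1 for the float8 dtype:

import torch

def fp8_round_trip(x, dtype=torch.float8_e4m3fn):
    # Simulated FP8 quantization with per-tensor scaling.
    # Scaling keeps values inside e4m3's small dynamic range before the cast.
    scale = 448.0 / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(dtype).to(torch.float32) / scale

x = torch.randn(4, 4) * 10
print((x - fp8_round_trip(x)).abs().max())             # quantization error of the round trip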
Davis Blalock (@davisblalock) 's Twitter Profile Photo

An interesting history-of-science datapoint + yet another win in Noam Shazeer's track record. I can attest that this happens a lot at industry labs—we invent way more stuff than we have time to publish.

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

New paper📢 LLM folks have been supervised finetuning their models with data from large and expensive models (e.g., Gemini Pro).

However, we achieve better perf. by finetuning on the samples from the smaller and weaker LLMs (e.g., Flash)!

w/ Mehran Kazemi, Arian Hosseini, Rishabh Agarwal, vinh q. tran

Harris Chan (@sirrahchan) 's Twitter Profile Photo

Here's my attempt at visualizing the training pipeline for DeepSeek-R1(-Zero) and the distillation to smaller models. 

Note they retrain DeepSeek-V3-Base with the new 800k curated data instead of continuing to finetune the checkpoint from the first round of cold-start SFT + RL
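A stub rendering of that ordering, with made-up helper names purely to make the stages concrete (it mirrors the tweet's description, not DeepSeek's actual code; the distillation targets are examples):

def sft(model, data):
    return f"SFT({model}; {data})"

def rl(model, rewards):
    return f"RL({model}; {rewards})"

base = "DeepSeek-V3-Base"
r1_zero = rl(base, "rule-based rewards")                # R1-Zero: RL straight from the base model
stage1 = rl(sft(base, "cold-start CoT data"), "rule-based rewards")
curated = "800k curated samples"                        # assembled using stage1's outputs
# The key detail above: go back to the *base* model with the curated data,
# rather than continuing from the stage-1 checkpoint.
r1 = rl(sft(base, curated), "reasoning + preference rewards")
distilled = [sft(m, curated) for m in ("a Qwen base model", "a Llama base model")]
print(r1)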
Ziming Liu (@zimingliu11) 's Twitter Profile Photo

New paper🚨: Physics of Skill Learning

Training dynamics is complicated, but are there simple "physical laws" behind it? 

We take physicists' approach of simplification and abstraction: Simple models like "spherical cows" are surprisingly effective!

arxiv.org/pdf/2501.12391

🧵
Marianne Arriola @ ICLR’25 (@mariannearr) 's Twitter Profile Photo

🚨Announcing our #ICLR2025 Oral! 🔥Diffusion LMs are on the rise for parallel text generation! But unlike autoregressive LMs, they struggle with quality, fixed-length constraints & lack of KV caching. 🚀Introducing Block Diffusion—combining autoregressive and diffusion models

Jacob Springer (@jacspringer) 's Twitter Profile Photo

Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Davis Blalock (@davisblalock) 's Twitter Profile Photo

So uh...we're getting huge accuracy lifts by generating synthetic data from in-distribution prompts. You just generate lots of responses, upweight the good ones, and do RL on those. No labels. Works shockingly well.
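One simple way to instantiate that loop (my sketch, not necessarily the exact setup referenced here): sample several responses per in-distribution prompt, keep the best-scoring one under some scorer or verifier, and finetune on it. The scorer below is a placeholder heuristic, and the RL step is simplified to plain finetuning on the kept samples.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                           # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(prompt, response):
    # Placeholder scorer; in practice a reward model, verifier, or filter.
    return float(len(response.split()))

prompts = ["Explain dropout in one sentence.", "What does a KV cache store?"]
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt").input_ids
    # 1) Generate many candidate responses for an in-distribution prompt.
    outs = model.generate(ids, do_sample=True, num_return_sequences=8,
                          max_new_tokens=32, pad_token_id=tok.eos_token_id)
    texts = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]
    # 2) Upweight the good ones (here: keep only the best-scoring candidate).
    best = max(texts, key=lambda t: score(prompt, t))
    # 3) The RL step, simplified here to finetuning on the kept sample.
    batch = tok(prompt + best, return_tensors="pt")
    loss = model(**batch, labels=batch.input_ids).loss
    opt.zero_grad(); loss.backward(); opt.step()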

Davis Blalock (@davisblalock) 's Twitter Profile Photo

Phonic, another Mosaic Mafia company, just came out of stealth! Listened to the demos and was like "oh my gosh, why are customer help lines not like this?"