Davis Blalock (@davisblalock) 's Twitter Profile
Davis Blalock

@davisblalock

Research scientist + first hire @MosaicML, now @Databricks. @MIT PhD. I post about AI technical progress + sometimes the business side.

ID: 805547773944889344

Link: http://bit.ly/3OXJbDs · Joined: 04-12-2016 23:02:10

1.1K Tweets

12.1K Followers

170 Following

Sean Welleck (@wellecks) 's Twitter Profile Photo

What do nucleus sampling, tree-of-thought, and PagedAttention have in common?

They're all part of our new survey: "From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models"

arxiv.org/abs/2406.16838
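For context on one of the decoding methods the survey covers: nucleus (top-p) sampling keeps the smallest set of next-token candidates whose probability mass reaches p, renormalizes, and samples from that set. A minimal sketch (mine, not code from the survey):

import torch

def nucleus_sample(logits, p=0.9):
    # Top-p (nucleus) sampling: keep the smallest set of tokens whose
    # cumulative probability reaches p, renormalize, and sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = int((cumulative < p).sum().item()) + 1   # include the token that crosses p
    kept = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(sorted_idx[choice].item())

# Toy example: next-token logits over a 5-token vocabulary.
print(nucleus_sample(torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0]), p=0.9))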
Mihir Patel (@mvpatel2000) 's Twitter Profile Photo

Gemma 2 is out with 9b and 27b! A few things I really liked in tech report:

On pretraining:
- GQA (finally lol)
- interleaving global attn w local (but 4k vs. 8k? it feels like it should support longer and was kneecapped...)
- 16x expansion ratio (very wide!)

(1/n)
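For reference, grouped-query attention (GQA) lets several query heads share one key/value head, which shrinks the KV cache. A rough sketch with made-up shapes (not Gemma 2's actual head counts, and without the interleaved local/global masking the report describes):

import torch

def grouped_query_attention(q, k, v):
    # q: (seq, n_q_heads, head_dim); k, v: (seq, n_kv_heads, head_dim).
    # Each group of n_q_heads // n_kv_heads query heads shares one KV head.
    # (No causal / sliding-window mask in this sketch.)
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)             # expand KV heads to match query heads
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", torch.softmax(scores, dim=-1), v)

# Toy shapes: 128 tokens, 16 query heads sharing 8 KV heads, head_dim 64.
q = torch.randn(128, 16, 64)
k = torch.randn(128, 8, 64)
v = torch.randn(128, 8, 64)
print(grouped_query_attention(q, k, v).shape)         # torch.Size([128, 16, 64])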
Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Thread on our newest paper: 1/n
The initial motivation of our project was the "lost in the middle" phenomenon observed by Nelson Liu et al. (arxiv.org/pdf/2307.03172): models like GPT and Claude were bad at retrieving information from the middle/end of the input context.

Sara Hooker (@sarahookr) 's Twitter Profile Photo

Does more compute equate with greater risk? 

What is our track record at predicting what risks emerge with scale?

I don't have much time anymore to write, but this felt important enough to write down some thoughts on. 

arxiv.org/pdf/2407.05694…
Naomi Saphra hiring a lab 🧈🪰 (@nsaphra) 's Twitter Profile Photo

Chatbots have biases in what they say—but what about biases in what they WON'T say? Our new paper (w/Victoria Li & Yida Chen) shows that personal info like a user's race, age, or love for the Los Angeles Chargers decides if ChatGPT refuses a request. arxiv.org/abs/2407.06866

Max Zimmer @ ICLR25 (@maxzimmerberlin) 's Twitter Profile Photo

A good time to share our #ICLR2023 paper: How I Learned to Stop Worrying and Love Retraining

We explore sparsity-adaptive LR schedules and show that with proper LR care, simple pruning can outperform complex methods that 'learn' the sparsity.

📜 arxiv.org/abs/2111.00843

🧵1/n
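The basic recipe studied here is simple magnitude pruning followed by retraining under a carefully chosen LR schedule. A toy sketch under my own assumptions (one-shot global pruning of a tiny MLP, then retraining with a cosine schedule, not the paper's exact sparsity-adaptive schedule):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))   # toy model

def magnitude_prune(model, sparsity):
    # Zero the smallest-magnitude weights globally; return per-parameter masks.
    all_w = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_w, sparsity)
    masks = [(p.detach().abs() > threshold).float() for p in model.parameters()]
    for p, m in zip(model.parameters(), masks):
        p.data.mul_(m)
    return masks

masks = magnitude_prune(model, sparsity=0.9)

# Retrain the surviving weights with a fresh LR schedule; the paper's point is
# that getting this schedule right is what makes simple pruning competitive.
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=100)
x, y = torch.randn(256, 32), torch.randint(0, 10, (256,))                # dummy data
for _ in range(100):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad(); loss.backward()
    for p, m in zip(model.parameters(), masks):
        p.grad.mul_(m)                                # keep pruned weights at zero
    opt.step(); sched.step()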

Jonathan Frankle (@jefrankle) 's Twitter Profile Photo

The new Llama-3.1 8B and 70B base models are likely derived from/very similar to the Llama-3 base models rather than created whole-cloth from scratch. Not shocking, but great to put hard data behind suspicions.

Azalia Mirhoseini (@azaliamirh) 's Twitter Profile Photo

Is inference compute a new dimension for scaling LLMs?

In our latest paper, we explore scaling inference compute by increasing the number of samples per input. Across several models and tasks, we observe that coverage – the fraction of problems solved by at least one attempt –
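Coverage here is the repeated-sampling analogue of pass@k. One standard way to estimate it from n generated samples per problem, of which c are correct, is the unbiased estimator 1 - C(n-c, k)/C(n, k); the numbers below are made up:

from math import comb

def coverage_at_k(n, c, k):
    # Chance that at least one of k samples drawn (without replacement) from n
    # generated attempts is correct, given c correct attempts: 1 - C(n-c, k)/C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: three problems, 100 samples each, with 3, 0, and 40 correct attempts.
per_problem = [coverage_at_k(100, c, k=10) for c in (3, 0, 40)]
print(sum(per_problem) / len(per_problem))            # average coverage at k=10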
Thomas Wolf (@thom_wolf) 's Twitter Profile Photo

It’s Sunday morning, we have some time with the coffee, so let me tell you about our recent, surprising journey in synthetic data and small language models.

This post is prompted by the coming release of an instant, in-browser model called SmolLM360 (link at the end)

The
Zack Ankner (@zackankner) 's Twitter Profile Photo

Excited to announce our new work: Critique-out-Loud (CLoud) reward models. CLoud reward models first produce a chain of thought critique of the input before predicting a scalar reward, allowing reward models to reason explicitly instead of implicitly! arxiv.org/abs/2408.11791
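A rough sketch of the dataflow, under my own simplifications (a small stand-in base model, an untrained reward head, and a made-up prompt format rather than the paper's setup): generate a critique of the response first, then map the hidden state of prompt + response + critique to a scalar reward.

import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                          # small stand-in base model
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)
reward_head = nn.Linear(lm.config.hidden_size, 1)      # trained in the paper; untrained here

def cloud_reward(prompt, response):
    text = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
    ids = tok(text, return_tensors="pt").input_ids
    # Step 1: critique out loud (explicit reasoning about the response).
    critiqued = lm.generate(ids, max_new_tokens=64, do_sample=False,
                            pad_token_id=tok.eos_token_id)
    # Step 2: predict a scalar reward conditioned on prompt + response + critique.
    hidden = lm(critiqued, output_hidden_states=True).hidden_states[-1]
    return reward_head(hidden[:, -1]).item()

print(cloud_reward("What is 2 + 2?", "4"))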

xr-5 🐀 (@xariusrke) 's Twitter Profile Photo

1/n FP8 training is hard - loss divergence and instability often lead to the conclusion that it’s not possible. But we’ve found a recipe to train a 1B LLaMA model to match the convergence of bfloat16 while performing both the forward pass and backward pass in FP8 and using an FP8
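Not the thread's recipe, but a tiny illustration of why FP8 is delicate: e4m3 has a narrow dynamic range (max around 448), so tensors need per-tensor (or finer-grained) scaling before casting. A simulated round trip, assuming PyTorch >= 2.1 for the float8 dtype:

import torch

def fp8_round_trip(x, dtype=torch.float8_e4m3fn):
    # Simulated FP8 quantization with per-tensor scaling.
    # Scaling keeps values inside e4m3's small dynamic range before the cast.
    scale = 448.0 / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(dtype).to(torch.float32) / scale

x = torch.randn(4, 4) * 10
print((x - fp8_round_trip(x)).abs().max())             # quantization error of the round trip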
Davis Blalock (@davisblalock) 's Twitter Profile Photo

An interesting history-of-science datapoint + yet another win in Noam Shazeer's track record. I can attest that this happens a lot at industry labs—we invent way more stuff than we have time to publish.

Hritik Bansal (@hbxnov) 's Twitter Profile Photo

New paper📢 LLM folks have been supervised finetuning their models with data from large and expensive models (e.g., Gemini Pro).

However, we achieve better perf. by finetuning on the samples from the smaller and weaker LLMs (e.g., Flash)!

w/ Mehran Kazemi, Arian Hosseini, Rishabh Agarwal, vinh q. tran

Harris Chan (@sirrahchan) 's Twitter Profile Photo

Here's my attempt at visualizing the training pipeline for DeepSeek-R1(-Zero) and the distillation to smaller models. 

Note they retrain DeepSeek-V3-Base with the new 800k curated data instead of continuing to finetune the checkpoint from the first round of cold-start SFT + RL
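A stub rendering of that ordering, with made-up helper names purely to make the stages concrete (it mirrors the tweet's description, not DeepSeek's actual code; the distillation targets are examples):

def sft(model, data):
    return f"SFT({model}; {data})"

def rl(model, rewards):
    return f"RL({model}; {rewards})"

base = "DeepSeek-V3-Base"
r1_zero = rl(base, "rule-based rewards")                # R1-Zero: RL straight from the base model
stage1 = rl(sft(base, "cold-start CoT data"), "rule-based rewards")
curated = "800k curated samples"                        # assembled using stage1's outputs
# The key detail above: go back to the *base* model with the curated data,
# rather than continuing from the stage-1 checkpoint.
r1 = rl(sft(base, curated), "reasoning + preference rewards")
distilled = [sft(m, curated) for m in ("a Qwen base model", "a Llama base model")]
print(r1)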
Ziming Liu (@zimingliu11) 's Twitter Profile Photo

New paper🚨: Physics of Skill Learning

Training dynamics is complicated, but are there simple "physical laws" behind it? 

We take physicists' approach of simplification and abstraction: Simple models like "spherical cows" are surprisingly effective!

arxiv.org/pdf/2501.12391

🧵
Marianne Arriola @ ICLR’25 (@mariannearr) 's Twitter Profile Photo

🚨Announcing our #ICLR2025 Oral! 🔥Diffusion LMs are on the rise for parallel text generation! But unlike autoregressive LMs, they struggle with quality, fixed-length constraints & lack of KV caching. 🚀Introducing Block Diffusion—combining autoregressive and diffusion models

Jacob Springer (@jacspringer) 's Twitter Profile Photo

Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Davis Blalock (@davisblalock) 's Twitter Profile Photo

So uh...we're getting huge accuracy lifts by generating synthetic data from in-distribution prompts. You just generate lots of responses, upweight the good ones, and do RL on those. No labels. Works shockingly well.
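One simple way to instantiate that loop (my sketch, not necessarily the exact setup referenced here): sample several responses per in-distribution prompt, keep the best-scoring one under some scorer or verifier, and finetune on it. The scorer below is a placeholder heuristic, and the RL step is simplified to plain finetuning on the kept samples.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                                           # small stand-in model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(prompt, response):
    # Placeholder scorer; in practice a reward model, verifier, or filter.
    return float(len(response.split()))

prompts = ["Explain dropout in one sentence.", "What does a KV cache store?"]
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt").input_ids
    # 1) Generate many candidate responses for an in-distribution prompt.
    outs = model.generate(ids, do_sample=True, num_return_sequences=8,
                          max_new_tokens=32, pad_token_id=tok.eos_token_id)
    texts = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]
    # 2) Upweight the good ones (here: keep only the best-scoring candidate).
    best = max(texts, key=lambda t: score(prompt, t))
    # 3) The RL step, simplified here to finetuning on the kept sample.
    batch = tok(prompt + best, return_tensors="pt")
    loss = model(**batch, labels=batch.input_ids).loss
    opt.zero_grad(); loss.backward(); opt.step()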

Davis Blalock (@davisblalock) 's Twitter Profile Photo

Phonic, another Mosaic Mafia company, just came out of stealth! Listened to the demos and was like "oh my gosh, why are customer help lines not like this?"