Alex Wettig (@_awettig)'s Twitter Profile
Alex Wettig

@_awettig

PhD@princeton trying to make sense of language models and their training data

ID: 1549104683955785728

Link: https://www.cs.princeton.edu/~awettig/ · Joined: 18-07-2022 18:52:37

167 Tweets

796 Followers

492 Following

David Pfau (@pfau)'s Twitter Profile Photo

FWIW, my take was never "the scaling laws will break down" but "the scaling laws holding means you'd hit a point of diminishing returns pretty quickly" and I stand by that.

Thomas Wolf (@thom_wolf)'s Twitter Profile Photo

I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century". The "compressed 21st century" comes from Dario's "Machine of Loving Grace" and if you haven’t read it, you probably…

Jeremy Bernstein (@jxbz)'s Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning

(1/11)
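For readers who want something concrete to pin the theory to, here is a minimal, unofficial sketch of a Muon-style step for a 2D weight matrix: accumulate momentum, approximately orthogonalize the momentum matrix with a few Newton-Schulz iterations, and apply that as the update. The coefficients and the omitted shape-dependent scaling follow commonly shared open-source code, not necessarily the blog post's exact derivation.

```python
# Unofficial sketch of a Muon-style update for a 2D weight matrix.
# Idea: accumulate momentum, then replace the momentum matrix with an
# approximately orthogonalized version before applying it.
# Coefficients follow widely shared open-source code; treat them as an
# assumption rather than the blog post's exact derivation.
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate U @ V^T from the SVD of M via Newton-Schulz iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: momentum -> orthogonalize -> apply."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    W.add_(update, alpha=-lr)
    return W, momentum
```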
Zhiyuan Zeng (@zhiyuanzeng_)'s Twitter Profile Photo

Is a single accuracy number all we can get from model evals? 🤔
🚨 Does NOT tell where the model fails
🚨 Does NOT tell how to improve it
Introducing EvalTree 🌳
🔍 identifying LM weaknesses in natural language
🚀 weaknesses serve as actionable guidance
(paper & demo 🔗 in 🧵) [1/n]
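To make the "single number hides the failures" point concrete, here is a toy breakdown (the categories and results are invented for illustration; EvalTree itself builds a full hierarchical capability tree rather than a flat grouping):

```python
# Toy illustration of why one aggregate accuracy is not enough.
# Categories and outcomes are invented for this example.
results = [
    {"category": "arithmetic", "correct": True},
    {"category": "arithmetic", "correct": True},
    {"category": "arithmetic", "correct": True},
    {"category": "date reasoning", "correct": False},
    {"category": "date reasoning", "correct": False},
    {"category": "unit conversion", "correct": True},
]

overall = sum(r["correct"] for r in results) / len(results)
print(f"overall accuracy: {overall:.2f}")            # 0.67 -- looks fine

by_cat = {}
for r in results:
    by_cat.setdefault(r["category"], []).append(r["correct"])
for cat, outcomes in sorted(by_cat.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(f"{cat}: {sum(outcomes)}/{len(outcomes)}")  # weakest sub-skill first
```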

Alisa Liu (@alisawuffles)'s Twitter Profile Photo

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
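As a toy illustration of the "superword" idea (not SuperBPE's actual training procedure, and with an invented vocabulary), the key difference is that the vocabulary may contain entries that cross whitespace, so a greedy longest-match encoding covers common multi-word phrases with one token:

```python
# Toy greedy encoder contrasting a word-bounded vocabulary with a "superword"
# vocabulary containing multi-word entries. Both vocabularies are invented;
# SuperBPE's real construction is a staged BPE training described in the paper.
def greedy_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # fall back to a single character
            i += 1
    return tokens

text = "by the way of the world"
bpe_like = {"by", " the", " way", " of", " world"}
superword = bpe_like | {"by the way", " of the"}

print(greedy_encode(text, bpe_like))    # 6 tokens, none crossing a space
print(greedy_encode(text, superword))   # 3 tokens, e.g. 'by the way' is one token
```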
Logan Engstrom (@logan_engstrom)'s Twitter Profile Photo

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent!

w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable.

Key idea: Metagradients (grads through model training)
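A minimal sketch of the metagradient idea, assuming a tiny linear model and per-example data weights as the continuous variable (the authors' method involves far more machinery to make this tractable at real training scale): unroll a few differentiable SGD steps, then backpropagate the final loss to the data weights.

```python
# Toy sketch of "metagradients": differentiate the loss at the end of an
# unrolled training run with respect to per-example data weights.
# This is an illustration of the idea, not the authors' implementation.
import torch

torch.manual_seed(0)
X_train, y_train = torch.randn(32, 4), torch.randn(32, 1)
X_val, y_val = torch.randn(16, 4), torch.randn(16, 1)

# Continuous variable we want to optimize: a weight per training example.
data_w = torch.zeros(32, requires_grad=True)

def train_then_eval(data_w, steps=20, lr=0.1):
    # Tiny linear model; kept in the autograd graph so the updates are differentiable.
    W = torch.zeros(4, 1, requires_grad=True)
    for _ in range(steps):
        per_ex = ((X_train @ W - y_train) ** 2).mean(dim=1)
        loss = (torch.softmax(data_w, 0) * per_ex).sum()
        (grad,) = torch.autograd.grad(loss, W, create_graph=True)
        W = W - lr * grad                      # differentiable SGD step
    return ((X_val @ W - y_val) ** 2).mean()   # final loss after training

# Gradient of the final loss w.r.t. the data weights = a metagradient.
final_loss = train_then_eval(data_w)
metagrad = torch.autograd.grad(final_loss, data_w)[0]
print(metagrad.shape)  # torch.Size([32])
```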
Jacob Springer (@jacspringer)'s Twitter Profile Photo

Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Xindi Wu (@cindy_x_wu)'s Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦

arxiv.org/abs/2504.21850

1/10
John Yang (@jyangballin)'s Twitter Profile Photo

@ weekend warriors - DM me a GitHub repo that you like / maintain, and I'll train you a 7B coding agent that's an expert for that repo. Main constraints - it's predominantly Python, and has a testing suite w/ good coverage. (example of good repo = sympy, pandas, sqlfluff)

Alex Wettig (@_awettig)'s Twitter Profile Photo

Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀

Ofir Press (@ofirpress)'s Twitter Profile Photo

Great results from the Claude team - the 80% result is pass@1!! They ran the model in parallel multiple times and had an LM judge pick the best patch to submit.
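A sketch of that recipe as described: sample several candidate patches independently, then let a judge model pick the single patch to submit, so the submitted result is still scored as pass@1. `generate_patch` and `judge_pick_best` are hypothetical placeholders, not any real API.

```python
# Best-of-N with an LM judge, as described in the tweet: all N candidates come
# from the same model, and a judge selects the single patch that gets submitted,
# so the final score is still pass@1. generate_patch / judge_pick_best are
# hypothetical placeholders for whatever model-calling code you use.
from concurrent.futures import ThreadPoolExecutor

def best_of_n_patch(issue, generate_patch, judge_pick_best, n=8):
    # Sample n candidate patches in parallel (independent attempts).
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate_patch(issue), range(n)))
    # The judge sees the issue plus all candidates and returns one index.
    chosen = judge_pick_best(issue, candidates)
    return candidates[chosen]   # only this patch is submitted and evaluated
```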
Aman Sanger (@amanrsanger)'s Twitter Profile Photo

Claude Sonnet 4 is much better at codebase understanding.

Paired with recent improvements in Cursor, it's SOTA on large codebases
Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
Alex Zhang (@a1zhang)'s Twitter Profile Photo

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Anthropic (@anthropicai)'s Twitter Profile Photo

Anthropic staff realized they could ask Claude to buy things that weren’t just food & drink.

After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) “specialty metal items” that it ended up selling at a loss.
Albert Gu (@_albertgu)'s Twitter Profile Photo

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.

Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
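A toy sketch of the dynamic-chunking intuition (not the actual architecture, which learns where to chunk): place a boundary wherever consecutive low-level vectors differ a lot, then pool each chunk into one higher-level vector. The cosine-dissimilarity rule and threshold below are illustrative assumptions.

```python
# Toy "dynamic chunking": group a sequence of low-level vectors (e.g. byte
# embeddings) into variable-length chunks and pool each chunk into one vector.
# The cosine-dissimilarity boundary rule is an illustrative stand-in for the
# learned routing used in the actual architecture.
import torch
import torch.nn.functional as F

def dynamic_chunk(x, threshold=0.5):
    """x: (seq_len, dim) low-level vectors -> list of pooled chunk vectors."""
    # Boundary at position t when x[t] differs a lot from x[t-1]; position 0 always starts a chunk.
    sim = F.cosine_similarity(x[:-1], x[1:], dim=-1)
    boundary = torch.cat([torch.tensor([True]), (1 - sim) > threshold])

    chunks, current = [], [x[0]]
    for t in range(1, x.shape[0]):
        if boundary[t]:
            chunks.append(torch.stack(current).mean(dim=0))  # pool finished chunk
            current = [x[t]]
        else:
            current.append(x[t])
    chunks.append(torch.stack(current).mean(dim=0))
    return chunks  # fewer, higher-level vectors than the input sequence

# Structured toy input: 4 segments of 4 near-identical positions each.
x = torch.randn(4, 8).repeat_interleave(4, dim=0) + 0.05 * torch.randn(16, 8)
print(len(dynamic_chunk(x)), "chunks from", x.shape[0], "positions")  # typically ~4 chunks
```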
Alex Wettig (@_awettig)'s Twitter Profile Photo

Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)

Tomorrow: WebOrganizer w/ Luca Soldaini & Kyle Lo
Thursday: MeCo by Tianyu Gao
Stuart Sul (@stuart_sul)'s Twitter Profile Photo

MoE layers can be really slow. When training our coding models at Cursor, they ate up 27–53% of training time.

So we completely rebuilt it at the kernel level and transitioned to MXFP8. The result: 3.5x faster MoE layer and 1.5x end-to-end training speedup.

We believe our…
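A quick Amdahl's-law sanity check of those numbers: if MoE layers account for 27-53% of step time and become 3.5x faster, the end-to-end speedup comes out between roughly 1.2x and 1.6x, consistent with the reported 1.5x.

```python
# Amdahl's-law sanity check: speeding up only the MoE fraction of a training
# step by 3.5x bounds the achievable end-to-end gain.
def end_to_end_speedup(moe_fraction, moe_speedup=3.5):
    new_time = (1 - moe_fraction) + moe_fraction / moe_speedup
    return 1 / new_time

for frac in (0.27, 0.40, 0.53):
    print(f"MoE = {frac:.0%} of step time -> {end_to_end_speedup(frac):.2f}x end-to-end")
# MoE = 27% -> ~1.24x, 40% -> ~1.40x, 53% -> ~1.61x
```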