Alex Wettig (@_awettig)'s Twitter Profile
Alex Wettig

@_awettig

PhD@princeton trying to make sense of language models and their training data

ID: 1549104683955785728

Link: https://www.cs.princeton.edu/~awettig/ · Joined: 18-07-2022 18:52:37

167 Tweets

796 Followers

492 Following

David Pfau (@pfau)'s Twitter Profile Photo

FWIW, my take was never "the scaling laws will break down" but "the scaling laws holding means you'd hit a point of diminishing returns pretty quickly" and I stand by that.

Thomas Wolf (@thom_wolf)'s Twitter Profile Photo

I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century". The "compressed 21st century" comes from Dario's "Machine of Loving Grace" and if you haven’t read it, you probably…

Jeremy Bernstein (@jxbz)'s Twitter Profile Photo

I just wrote my first blog post in four years! It is called "Deriving Muon". It covers the theory that led to Muon and how, for me, Muon is a meaningful example of theory leading practice in deep learning

(1/11)
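For readers who want something concrete to pin the theory to, here is a minimal, unofficial sketch of a Muon-style step for a 2D weight matrix: accumulate momentum, approximately orthogonalize the momentum matrix with a few Newton-Schulz iterations, and apply that as the update. The coefficients and the omitted shape-dependent scaling follow commonly shared open-source code, not necessarily the blog post's exact derivation.

```python
# Unofficial sketch of a Muon-style update for a 2D weight matrix.
# Idea: accumulate momentum, then replace the momentum matrix with an
# approximately orthogonalized version before applying it.
# Coefficients follow widely shared open-source code; treat them as an
# assumption rather than the blog post's exact derivation.
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    """Approximate U @ V^T from the SVD of M via Newton-Schulz iterations."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                        # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step: momentum -> orthogonalize -> apply."""
    momentum.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum)
    W.add_(update, alpha=-lr)
    return W, momentum
```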
Zhiyuan Zeng (@zhiyuanzeng_)'s Twitter Profile Photo

Is a single accuracy number all we can get from model evals? 🤔
🚨 Does NOT tell where the model fails
🚨 Does NOT tell how to improve it
Introducing EvalTree 🌳
🔍 identifying LM weaknesses in natural language
🚀 weaknesses serve as actionable guidance
(paper & demo 🔗 in 🧵) [1/n]
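To make the "single number hides the failures" point concrete, here is a toy breakdown (the categories and results are invented for illustration; EvalTree itself builds a full hierarchical capability tree rather than a flat grouping):

```python
# Toy illustration of why one aggregate accuracy is not enough.
# Categories and outcomes are invented for this example.
results = [
    {"category": "arithmetic", "correct": True},
    {"category": "arithmetic", "correct": True},
    {"category": "arithmetic", "correct": True},
    {"category": "date reasoning", "correct": False},
    {"category": "date reasoning", "correct": False},
    {"category": "unit conversion", "correct": True},
]

overall = sum(r["correct"] for r in results) / len(results)
print(f"overall accuracy: {overall:.2f}")            # 0.67 -- looks fine

by_cat = {}
for r in results:
    by_cat.setdefault(r["category"], []).append(r["correct"])
for cat, outcomes in sorted(by_cat.items(), key=lambda kv: sum(kv[1]) / len(kv[1])):
    print(f"{cat}: {sum(outcomes)}/{len(outcomes)}")  # weakest sub-skill first
```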

Alisa Liu (@alisawuffles)'s Twitter Profile Photo

We created SuperBPE🚀, a *superword* tokenizer that includes tokens spanning multiple words.

When pretraining at 8B scale, SuperBPE models consistently outperform the BPE baseline on 30 downstream tasks (+8% MMLU), while also being 27% more efficient at inference time.🧵
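As a toy illustration of the "superword" idea (not SuperBPE's actual training procedure, and with an invented vocabulary), the key difference is that the vocabulary may contain entries that cross whitespace, so a greedy longest-match encoding covers common multi-word phrases with one token:

```python
# Toy greedy encoder contrasting a word-bounded vocabulary with a "superword"
# vocabulary containing multi-word entries. Both vocabularies are invented;
# SuperBPE's real construction is a staged BPE training described in the paper.
def greedy_encode(text, vocab):
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])             # fall back to a single character
            i += 1
    return tokens

text = "by the way of the world"
bpe_like = {"by", " the", " way", " of", " world"}
superword = bpe_like | {"by the way", " of the"}

print(greedy_encode(text, bpe_like))    # 6 tokens, none crossing a space
print(greedy_encode(text, superword))   # 3 tokens, e.g. 'by the way' is one token
```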
Logan Engstrom (@logan_engstrom)'s Twitter Profile Photo

Want state-of-the-art data curation, data poisoning & more? Just do gradient descent!

w/ Andrew Ilyas, Ben Chen, Axel Feldmann, Billy Moses, Aleksander Madry: we show how to optimize final model loss wrt any continuous variable.

Key idea: Metagradients (grads through model training)
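A minimal sketch of the metagradient idea, assuming a tiny linear model and per-example data weights as the continuous variable (the authors' method involves far more machinery to make this tractable at real training scale): unroll a few differentiable SGD steps, then backpropagate the final loss to the data weights.

```python
# Toy sketch of "metagradients": differentiate the loss at the end of an
# unrolled training run with respect to per-example data weights.
# This is an illustration of the idea, not the authors' implementation.
import torch

torch.manual_seed(0)
X_train, y_train = torch.randn(32, 4), torch.randn(32, 1)
X_val, y_val = torch.randn(16, 4), torch.randn(16, 1)

# Continuous variable we want to optimize: a weight per training example.
data_w = torch.zeros(32, requires_grad=True)

def train_then_eval(data_w, steps=20, lr=0.1):
    # Tiny linear model; kept in the autograd graph so the updates are differentiable.
    W = torch.zeros(4, 1, requires_grad=True)
    for _ in range(steps):
        per_ex = ((X_train @ W - y_train) ** 2).mean(dim=1)
        loss = (torch.softmax(data_w, 0) * per_ex).sum()
        (grad,) = torch.autograd.grad(loss, W, create_graph=True)
        W = W - lr * grad                      # differentiable SGD step
    return ((X_val @ W - y_val) ** 2).mean()   # final loss after training

# Gradient of the final loss w.r.t. the data weights = a metagradient.
final_loss = train_then_eval(data_w)
metagrad = torch.autograd.grad(final_loss, data_w)[0]
print(metagrad.shape)  # torch.Size([32])
```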
Jacob Springer (@jacspringer)'s Twitter Profile Photo

Training with more data = better LLMs, right? 🚨

False! Scaling language models by adding more pre-training data can decrease your performance after post-training!

Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇

1/9
Xindi Wu (@cindy_x_wu)'s Twitter Profile Photo

Introducing COMPACT: COMPositional Atomic-to-complex Visual Capability Tuning, a data-efficient approach to improve multimodal models on complex visual tasks without scaling data volume. 📦

arxiv.org/abs/2504.21850

1/10
John Yang (@jyangballin)'s Twitter Profile Photo

@ weekend warriors - DM me a GitHub repo that you like / maintain, and I'll train you a 7B coding agent that's an expert for that repo. Main constraints - it's predominantly Python, and has a testing suite w/ good coverage. (example of good repo = sympy, pandas, sqlfluff)

Alex Wettig (@_awettig)'s Twitter Profile Photo

Big arrow time! We can make huge progress on open-source SWE agents by scaling up the creation of virtual coding environments 🚀

Ofir Press (@ofirpress)'s Twitter Profile Photo

Great results from the Claude team - the 80% result is pass@1!! They ran the model in parallel multiple times and had an LM judge pick the best patch to submit.
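A sketch of that recipe as described: sample several candidate patches independently, then let a judge model pick the single patch to submit, so the submitted result is still scored as pass@1. `generate_patch` and `judge_pick_best` are hypothetical placeholders, not any real API.

```python
# Best-of-N with an LM judge, as described in the tweet: all N candidates come
# from the same model, and a judge selects the single patch that gets submitted,
# so the final score is still pass@1. generate_patch / judge_pick_best are
# hypothetical placeholders for whatever model-calling code you use.
from concurrent.futures import ThreadPoolExecutor

def best_of_n_patch(issue, generate_patch, judge_pick_best, n=8):
    # Sample n candidate patches in parallel (independent attempts).
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate_patch(issue), range(n)))
    # The judge sees the issue plus all candidates and returns one index.
    chosen = judge_pick_best(issue, candidates)
    return candidates[chosen]   # only this patch is submitted and evaluated
```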
Aman Sanger (@amanrsanger)'s Twitter Profile Photo

Claude Sonnet 4 is much better at codebase understanding.

Paired with recent improvements in Cursor, it's SOTA on large codebases
Kilian Lieret @ICLR (@klieret)'s Twitter Profile Photo

Massive gains with Sonnet 4 on SWE-agent: Single-attempt pass@1 rises to 69% on SWE-bench Verified! Sonnet 4 iterates longer (making it slightly more expensive) but almost never gets stuck. Localization ability appears unchanged, but quality of edits improves.
Alex Zhang (@a1zhang)'s Twitter Profile Photo

Can GPT, Claude, and Gemini play video games like Zelda, Civ, and Doom II? 𝗩𝗶𝗱𝗲𝗼𝗚𝗮𝗺𝗲𝗕𝗲𝗻𝗰𝗵 evaluates VLMs on Game Boy & MS-DOS games given only raw screen input, just like how a human would play. The best model (Gemini) completes just 0.48% of the benchmark! 🧵👇

Anthropic (@anthropicai)'s Twitter Profile Photo

Anthropic staff realized they could ask Claude to buy things that weren’t just food & drink.

After someone randomly decided to ask it to order a tungsten cube, Claude ended up with an inventory full of (as it put it) “specialty metal items” that it ended up selling at a loss.
Albert Gu (@_albertgu)'s Twitter Profile Photo

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.

Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
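A toy sketch of the dynamic-chunking intuition (not the actual architecture, which learns where to chunk): place a boundary wherever consecutive low-level vectors differ a lot, then pool each chunk into one higher-level vector. The cosine-dissimilarity rule and threshold below are illustrative assumptions.

```python
# Toy "dynamic chunking": group a sequence of low-level vectors (e.g. byte
# embeddings) into variable-length chunks and pool each chunk into one vector.
# The cosine-dissimilarity boundary rule is an illustrative stand-in for the
# learned routing used in the actual architecture.
import torch
import torch.nn.functional as F

def dynamic_chunk(x, threshold=0.5):
    """x: (seq_len, dim) low-level vectors -> list of pooled chunk vectors."""
    # Boundary at position t when x[t] differs a lot from x[t-1]; position 0 always starts a chunk.
    sim = F.cosine_similarity(x[:-1], x[1:], dim=-1)
    boundary = torch.cat([torch.tensor([True]), (1 - sim) > threshold])

    chunks, current = [], [x[0]]
    for t in range(1, x.shape[0]):
        if boundary[t]:
            chunks.append(torch.stack(current).mean(dim=0))  # pool finished chunk
            current = [x[t]]
        else:
            current.append(x[t])
    chunks.append(torch.stack(current).mean(dim=0))
    return chunks  # fewer, higher-level vectors than the input sequence

# Structured toy input: 4 segments of 4 near-identical positions each.
x = torch.randn(4, 8).repeat_interleave(4, dim=0) + 0.05 * torch.randn(16, 8)
print(len(dynamic_chunk(x)), "chunks from", x.shape[0], "positions")  # typically ~4 chunks
```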
Alex Wettig (@_awettig)'s Twitter Profile Photo

Presenting two posters at ICML over the next two days:
- Both at 11am - 1:30pm
- Both about how to improve pre-training with domains
- Both at stall # E-2600 in East Exhibition Hall A-B (!)

Tomorrow: WebOrganizer w/ Luca Soldaini & Kyle Lo
Thursday: MeCo by Tianyu Gao
Stuart Sul (@stuart_sul)'s Twitter Profile Photo

MoE layers can be really slow. When training our coding models at Cursor, they ate up 27–53% of training time.

So we completely rebuilt it at the kernel level and transitioned to MXFP8. The result: 3.5x faster MoE layer and 1.5x end-to-end training speedup.

We believe our…
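A quick Amdahl's-law sanity check of those numbers: if MoE layers account for 27-53% of step time and become 3.5x faster, the end-to-end speedup comes out between roughly 1.2x and 1.6x, consistent with the reported 1.5x.

```python
# Amdahl's-law sanity check: speeding up only the MoE fraction of a training
# step by 3.5x bounds the achievable end-to-end gain.
def end_to_end_speedup(moe_fraction, moe_speedup=3.5):
    new_time = (1 - moe_fraction) + moe_fraction / moe_speedup
    return 1 / new_time

for frac in (0.27, 0.40, 0.53):
    print(f"MoE = {frac:.0%} of step time -> {end_to_end_speedup(frac):.2f}x end-to-end")
# MoE = 27% -> ~1.24x, 40% -> ~1.40x, 53% -> ~1.61x
```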