Jonny Cook (@jonnycoook)'s Twitter Profile
Jonny Cook

@jonnycoook

DPhil Student in AI @FLAIR_Ox
Prev. RS Intern @cohere, @DeepMind Scholar

ID: 1452746225074200582

Joined: 25-10-2021 21:17:53

48 Tweets

305 Followers

514 Following

Anastasios Gerontopoulos (@nasosger)

1/n Multi-token prediction boosts LLMs (DeepSeek-V3), tackling key limitations of the next-token setup:
• Short-term focus
• Struggles with long-range decisions
• Weaker supervision

Prior methods add complexity (extra layers)
🔑 Our fix? Register tokens—elegant and powerful
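
As a reader's sketch (not the paper's implementation): one way register tokens could provide multi-token supervision is to append a few learned embeddings to the sequence and train each one to predict a token further ahead, reusing the existing LM head rather than stacking extra prediction layers. The class name, shapes, and target scheme below are assumptions.

```python
import torch
import torch.nn as nn

# Rough sketch of the register-token idea: a few learned embeddings ride along
# with the sequence, and each is supervised to predict a token further ahead.
# Names, shapes, and the exact target scheme are illustrative assumptions.

class RegisterTokens(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, horizons=(2, 3, 4)):
        super().__init__()
        self.horizons = horizons
        self.registers = nn.Parameter(torch.randn(len(horizons), d_model) * 0.02)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

    def append(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model); the backbone transformer then
        # contextualises the register slots like ordinary tokens.
        regs = self.registers.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([token_embeds, regs], dim=1)

    def auxiliary_loss(self, hidden: torch.Tensor, future_targets: torch.Tensor) -> torch.Tensor:
        # hidden: backbone output, (batch, seq_len + n_registers, d_model)
        # future_targets: (batch, n_registers) token ids, one per prediction horizon
        n_reg = len(self.horizons)
        logits = self.lm_head(hidden[:, -n_reg:, :])  # (batch, n_registers, vocab)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), future_targets.reshape(-1)
        )
```
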
Clarisse Wibault (@clarissewibault)

How can we bypass the need for online hyper-parameter tuning in offline RL? Foerster Lab for AI Research is introducing two fully offline algorithms: SOReL, for accurate offline regret approximation, and TOReL, for offline hyper-parameter tuning! arxiv.org/html/2505.2244…
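
From the headline description alone, fully offline hyper-parameter tuning in the TOReL spirit could be pictured as ranking candidate configurations by an offline regret estimate rather than rolling each one out in the environment. The sketch below is a placeholder for that idea, with `train_offline` and `estimate_offline_regret` standing in for the actual algorithms.

```python
# Hypothetical sketch of offline hyper-parameter tuning: select the config whose
# trained policy has the lowest estimated regret, using only a fixed offline
# dataset and no environment interaction. All function names are placeholders.

def tune_offline(dataset, candidate_configs, train_offline, estimate_offline_regret):
    results = []
    for config in candidate_configs:
        policy = train_offline(dataset, config)            # any offline RL trainer
        regret = estimate_offline_regret(policy, dataset)  # SOReL-style estimate
        results.append((regret, config, policy))
    results.sort(key=lambda r: r[0])
    best_regret, best_config, best_policy = results[0]
    return best_config, best_policy, best_regret
```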

Amrith Setlur (@setlur_amrith)

Since R1 there has been a lot of chatter 💬 on post-training LLMs with RL. Is RL only sharpening the distribution over correct responses sampled by the pretrained LLM OR is it exploring and discovering new strategies 🤔? Find answers in our latest post ⬇️ tinyurl.com/rlshadis
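
One common way to probe the sharpening-vs-discovery question (not necessarily the linked post's method) is to compare pass@k between the base and RL-tuned models: if RL only sharpens, pass@1 improves while pass@large-k should not exceed the base model's. A minimal sketch:

```python
# Illustrative pass@k comparison (not taken from the linked post): genuinely new
# strategies would show up as problems the base model never solves at any k.

def pass_at_k(sample_fn, is_correct, problems, k: int) -> float:
    solved = 0
    for problem in problems:
        attempts = [sample_fn(problem) for _ in range(k)]
        if any(is_correct(problem, a) for a in attempts):
            solved += 1
    return solved / len(problems)

# Usage sketch with hypothetical model wrappers:
# base_rate = pass_at_k(base_model.sample, check_answer, eval_problems, k=256)
# rl_rate   = pass_at_k(rl_model.sample,   check_answer, eval_problems, k=256)
```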

Silvia Sapora (@silviasapora)

🧵 Check out our latest preprint: "Programming by Backprop". What if LLMs could internalize algorithms just by reading code, with no input-output examples? This could reshape how we train models to reason algorithmically. Let's dive into our findings 👇

Laura Ruis (@lauraruis)

LLMs can be programmed by backprop 🔎

In our new preprint, we show they can act as fuzzy program interpreters and databases. After being ‘programmed’ with next-token prediction, they can retrieve, evaluate, and even *compose* programs at test time, without seeing I/O examples.
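
Reading this as an outsider, the protocol seems to be: fine-tune on program source code alone (no input-output pairs), then ask the model to evaluate a program on a concrete input at test time. A toy sketch of that setup, with hypothetical training and generation helpers:

```python
# Hypothetical sketch of the evaluation setup suggested above: the model is
# fine-tuned only on program *source code* (no I/O examples), then queried on a
# concrete input at test time. Prompts and helper APIs are illustrative assumptions.

train_document = '''
def collatz_steps(n):
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
'''

test_prompt = "collatz_steps(6) ="   # no worked examples were ever shown in training

# With a hypothetical fine-tuning / generation API:
# model = finetune_on_text(base_model, [train_document])  # next-token prediction only
# answer = model.generate(test_prompt, max_new_tokens=4)
# print(answer)  # a "fuzzy interpreter" should tend toward "8"
```
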
Ulyana Piterbarg (@ulyanapiterbarg)

The way programs are represented in training data can have strong effects on generalization & reasoning -- really great (and truly scientific) work

Andrei Lupu (@_andreilupu)

Theory of Mind (ToM) is crucial for next gen LLM Agents, yet current benchmarks suffer from multiple shortcomings. Enter 💽 Decrypto, an interactive benchmark for multi-agent reasoning and ToM in LLMs! Work done with Timon Willi & Jakob Foerster at AI at Meta & Foerster Lab for AI Research 🧵👇

Ola Kalisz (@olakalisz8)

Antiviral therapy design is myopic 🦠🙈, optimised only for the current strain. That's why you need a different flu vaccine every year! Our #ICML2025 paper ADIOS proposes "shaper therapies" that steer viral evolution in our favour & remain effective. Work done at Foerster Lab for AI Research 🧵👇

Alexi Gladstone (@alexiglad)

How can we unlock generalized reasoning?

⚡️Introducing Energy-Based Transformers (EBTs), an approach that out-scales (feed-forward) transformers and unlocks generalized reasoning/thinking on any modality/problem without rewards.
TLDR:
- EBTs are the first model to outscale the
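
For intuition on the inference side: energy-based models typically "think" by iteratively refining a candidate prediction via gradient descent on a learned energy function. The loop below is a generic EBM refinement sketch, not EBTs' exact training or inference recipe.

```python
import torch

# Generic energy-based refinement loop: start from a candidate prediction and
# descend the learned energy E(context, prediction). Lower energy = more compatible.
# This is an illustration of EBM-style "thinking", not the EBT paper's procedure.

def refine_prediction(energy_fn, context, init_prediction, steps: int = 16, lr: float = 0.1):
    pred = init_prediction.clone().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(context, pred)             # scalar energy of the candidate
        (grad,) = torch.autograd.grad(energy, pred)
        with torch.no_grad():
            pred -= lr * grad                          # one "thinking" step
    return pred.detach()
```
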
Micah Goldblum (@micahgoldblum)

🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n
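
For concreteness, the two configurations being contrasted would look roughly like this in PyTorch; the learning rates, betas, and weight decay below are placeholders, not the thread's actual values.

```python
import torch

# The two setups being compared, roughly: plain small-batch SGD with no momentum
# vs. the usual AdamW recipe. Hyper-parameter values here are placeholders.

def make_optimizer(model: torch.nn.Module, which: str) -> torch.optim.Optimizer:
    if which == "vanilla_sgd":
        # "The first optimizer you learn about": no momentum, no weight decay,
        # intended to be run at a small batch size.
        return torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.0)
    elif which == "adamw":
        return torch.optim.AdamW(model.parameters(), lr=3e-4,
                                 betas=(0.9, 0.95), weight_decay=0.1)
    raise ValueError(which)
```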