Timon Willi (@timonwilli) 's Twitter Profile
Timon Willi

@timonwilli

RS @AIatMeta, DPhil w/ @j_foerst, @UniofOxford; Formerly: Research Intern @GoogleDeepMind / PhD @VectorInst / RS at @nnaisense / MSc w/ @SchmidhuberAI

ID: 1525795330754850816

Link: http://timonwilli.com · Joined: 15-05-2022 11:08:58

151 Tweets

286 Followers

67 Following

Robert Lange (@roberttlange) 's Twitter Profile Photo

🎉 Stoked to share The AI-Scientist 🧑‍🔬 - our end-to-end approach for conducting research with LLMs including ideation, coding, experiment execution, paper write-up & reviewing. Blog 📰: sakana.ai/ai-scientist/ Paper 📜: arxiv.org/abs/2408.06292 Code 💻: github.com/SakanaAI/AI-Sc…

Jonny Cook (@jonnycoook) 's Twitter Profile Photo

What if improving LLM evaluation and generation was as simple as using a checklist?

Introducing TICK ✅ (Targeted Instruct-evaluation with ChecKlists) and STICK 🏒 (Self-TICK)

Work done at Cohere with supervision from Tim Rocktäschel, Jakob Foerster, Dennis Aumiller at #ACL2025 & Alex Wang.

1/n
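The checklist idea above can be sketched in a few lines. This is a toy illustration, not the TICK implementation: the checklist generator and per-item judges below are simple keyword/structure stubs standing in for the LLM calls the paper would use, and all names here are invented for the sketch.

```python
# Hedged sketch of checklist-style instruction evaluation: decompose an
# instruction into binary checklist questions, score a response against
# each item, and aggregate into a pass fraction.

def make_checklist(instruction):
    """Toy stand-in for an LLM that turns an instruction into yes/no checks."""
    checks = []
    if "haiku" in instruction.lower():
        checks.append(("Does the response have exactly 3 lines?",
                       lambda r: len(r.strip().splitlines()) == 3))
    if "ocean" in instruction.lower():
        checks.append(("Does the response mention the ocean?",
                       lambda r: "ocean" in r.lower()))
    return checks

def tick_score(instruction, response):
    """Fraction of checklist items the response passes."""
    checks = make_checklist(instruction)
    if not checks:
        return 1.0
    passed = sum(judge(response) for _, judge in checks)
    return passed / len(checks)

instruction = "Write a haiku about the ocean."
good = "Waves fold into foam\nthe ocean breathes against stone\nsalt on morning air"
bad = "The ocean is big and blue and I like it a lot."
print(tick_score(instruction, good))  # 1.0 (passes both checks)
print(tick_score(instruction, bad))   # 0.5 (mentions the ocean, not 3 lines)
```

The appeal of the checklist framing is that each item is a targeted binary judgment, which is typically easier to evaluate reliably than a single holistic score.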
Michael Matthews @ ICLR 2025 (@mitrma) 's Twitter Profile Photo

We are very excited to announce Kinetix: an open-ended universe of physics-based tasks for RL! We use Kinetix to train a general agent on millions of randomly generated physics problems and show that this agent generalises to unseen handmade environments. 1/🧵

Davide Paglieri (@paglieridavide) 's Twitter Profile Photo

Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵
Foerster Lab for AI Research (@flair_ox) 's Twitter Profile Photo

🔬 FLAIR has a bunch of great papers being presented today at NeurIPS! Come along to learn more about the work! 👀 Keep your eyes peeled for more work being presented over the week!

Branton DeMoss (@brantondemoss) 's Twitter Profile Photo

I’m pleased to announce our work which studies complexity phase transitions in neural networks! We track the Kolmogorov complexity of networks as they “grok”, and find a characteristic rise and fall of complexity, corresponding to memorization followed by generalization.

🧵
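Kolmogorov complexity itself is uncomputable, so work in this vein tracks a compression-based proxy. The sketch below is an assumption-laden illustration of that general idea (quantize the weights, measure gzip size), not necessarily the estimator used in the paper:

```python
import zlib
import numpy as np

def compression_complexity(weights, n_bits=8):
    """Crude Kolmogorov-complexity proxy: gzip size of coarsely quantized
    weights. Smaller => simpler (more compressible) network."""
    w = np.concatenate([p.ravel() for p in weights])
    lo, hi = w.min(), w.max()
    # Quantize to 2**n_bits levels so lossless compression has structure to find.
    q = np.round((w - lo) / (hi - lo + 1e-12) * (2**n_bits - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes(), level=9))

rng = np.random.default_rng(0)
random_net = [rng.normal(size=(64, 64))]       # memorization-like: noisy weights
structured_net = [np.tile(np.eye(8), (8, 8))]  # generalization-like: regular weights
print(compression_complexity(random_net) > compression_complexity(structured_net))
# True: structured weights compress far better
```

Tracked over training, a rise then fall in such a proxy would match the memorization-then-generalization story: the network first stores noisy particulars, then settles into a more compressible solution.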
akbir. (@akbirkhan) 's Twitter Profile Photo

In the spirit of making more real world evals, here is the Factorio Learning Environment (FLE). Spurred by wanting to eval if models are good paperclip maximisers, we check how well agents build factories for other things 🏗️🏭🛠️

Ola Kalisz (@olakalisz8) 's Twitter Profile Photo

Antiviral therapy design is myopic 🦠🙈 optimised only for the current strain. That's why you need a different Flu vaccine every year! Our #ICML2025 paper ADIOS proposes "shaper therapies" that steer viral evolution in our favour & remain effective. Work done at the Foerster Lab for AI Research 🧵👇

Jürgen Schmidhuber (@schmidhuberai) 's Twitter Profile Photo

Since 1990, we have worked on artificial curiosity & measuring "interestingness." Our new ICML paper uses "Prediction of Hidden Units" loss to quantify in-context computational complexity in sequence models. It can tell boring from interesting tasks and predict correct reasoning.

Johan S. Obando 👍🏽 (@johanobandoc) 's Twitter Profile Photo

🚨 Excited to share our #ICML2025 paper: "The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep RL"  

We train RL agents to know when to quit, cutting wasted effort and improving efficiency with our method LEAST.

📄Paper: arxiv.org/pdf/2506.13672
🧵Check the thread below👇🏾
Uljad Berdica (@uljadb99) 's Twitter Profile Photo

Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to: - Sample conceptually diverse outputs💡 - Improve performance on math and code reasoning tasks🤔 - Get more engaging conversational agents 🤖
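The name suggests factoring generation into two stages: first sample a high-level intent (hot, for conceptual diversity), then realize text conditioned on it (cold, for quality). The sketch below is a toy stand-in under that assumption; the intents, logits, and stub generator are all invented for illustration and are not the paper's implementation:

```python
import math
import random

# Illustrative intent set with made-up base logits.
INTENTS = {"explain with an analogy": 2.0,
           "give a worked example": 1.0,
           "list trade-offs": 0.5}

def sample_intent(rng, temperature=1.5):
    # Softmax over intent logits; a higher temperature flattens the
    # distribution so rarer intents are sampled more often.
    logits = {k: v / temperature for k, v in INTENTS.items()}
    z = sum(math.exp(v) for v in logits.values())
    probs = [math.exp(v) / z for v in logits.values()]
    return rng.choices(list(logits), weights=probs, k=1)[0]

def realize(intent, topic):
    # Stub for the second, low-temperature stage: conditioned on the
    # sampled intent, generate the response (deterministic here).
    return f"[{intent}] about {topic}"

rng = random.Random(0)
outputs = {realize(sample_intent(rng), "sorting") for _ in range(50)}
print(sorted(outputs))  # repeated draws cover conceptually distinct intents
```

Splitting the sampling this way separates "what to say" from "how to say it", so diversity can be injected at the intent level without degrading the fluency of each realization.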

Jakob Foerster (@j_foerst) 's Twitter Profile Photo

I recently had a lunchtime conversation with a very senior AI researcher about how multi-agent problems differ from single-agent ones (their starting point was that they do not). One point that made them think: As computers scale, the rest of the world (i.e. the non-agentic parts) is not