Timon Willi (@timonwilli) 's Twitter Profile
Timon Willi

@timonwilli

RS @AIatMeta, DPhil w/ @j_foerst, @UniofOxford; Formerly: Research Intern @GoogleDeepMind / PhD @VectorInst / RS at @nnaisense / MSc w/ @SchmidhuberAI

ID: 1525795330754850816

Link: http://timonwilli.com · Joined: 15-05-2022 11:08:58

151 Tweets

286 Followers

67 Following

Robert Lange (@roberttlange) 's Twitter Profile Photo

🎉 Stoked to share The AI-Scientist 🧑‍🔬 - our end-to-end approach for conducting research with LLMs including ideation, coding, experiment execution, paper write-up & reviewing. Blog 📰: sakana.ai/ai-scientist/ Paper 📜: arxiv.org/abs/2408.06292 Code 💻: github.com/SakanaAI/AI-Sc…

Jonny Cook (@jonnycoook) 's Twitter Profile Photo

What if improving LLM evaluation and generation was as simple as using a checklist?

Introducing TICK ✅ (Targeted Instruct-evaluation with ChecKlists) and STICK 🏒 (Self-TICK)

Work done at Cohere with supervision from Tim Rocktäschel, Jakob Foerster, Dennis Aumiller at #ACL2025 & Alex Wang.

1/n
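The checklist idea above can be sketched in a few lines. This is a toy illustration, not the TICK implementation: the checklist generator and per-item judges below are simple keyword/structure stubs standing in for the LLM calls the paper would use, and all names here are invented for the sketch.

```python
# Hedged sketch of checklist-style instruction evaluation: decompose an
# instruction into binary checklist questions, score a response against
# each item, and aggregate into a pass fraction.

def make_checklist(instruction):
    """Toy stand-in for an LLM that turns an instruction into yes/no checks."""
    checks = []
    if "haiku" in instruction.lower():
        checks.append(("Does the response have exactly 3 lines?",
                       lambda r: len(r.strip().splitlines()) == 3))
    if "ocean" in instruction.lower():
        checks.append(("Does the response mention the ocean?",
                       lambda r: "ocean" in r.lower()))
    return checks

def tick_score(instruction, response):
    """Fraction of checklist items the response passes."""
    checks = make_checklist(instruction)
    if not checks:
        return 1.0
    passed = sum(judge(response) for _, judge in checks)
    return passed / len(checks)

instruction = "Write a haiku about the ocean."
good = "Waves fold into foam\nthe ocean breathes against stone\nsalt on morning air"
bad = "The ocean is big and blue and I like it a lot."
print(tick_score(instruction, good))  # 1.0 (passes both checks)
print(tick_score(instruction, bad))   # 0.5 (mentions the ocean, not 3 lines)
```

The appeal of the checklist framing is that each item is a targeted binary judgment, which is typically easier to evaluate reliably than a single holistic score.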
Michael Matthews @ ICLR 2025 (@mitrma) 's Twitter Profile Photo

We are very excited to announce Kinetix: an open-ended universe of physics-based tasks for RL! We use Kinetix to train a general agent on millions of randomly generated physics problems and show that this agent generalises to unseen handmade environments. 1/🧵

Davide Paglieri (@paglieridavide) 's Twitter Profile Photo

Tired of saturated benchmarks? Want scope for a significant leap in capabilities?

🔥 Introducing BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games!

BALROG is a challenging benchmark for LLM agentic capabilities, designed to stay relevant for years to come.

1/🧵
Foerster Lab for AI Research (@flair_ox) 's Twitter Profile Photo

🔬 FLAIR has a bunch of great papers being presented today at NeurIPS! Come along to learn more about the work! 👀 Keep your eyes peeled for more work being presented over the week!

Branton DeMoss (@brantondemoss) 's Twitter Profile Photo

I’m pleased to announce our work which studies complexity phase transitions in neural networks! We track the Kolmogorov complexity of networks as they “grok”, and find a characteristic rise and fall of complexity, corresponding to memorization followed by generalization.

🧵
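Kolmogorov complexity itself is uncomputable, so work in this vein tracks a compression-based proxy. The sketch below is an assumption-laden illustration of that general idea (quantize the weights, measure gzip size), not necessarily the estimator used in the paper:

```python
import zlib
import numpy as np

def compression_complexity(weights, n_bits=8):
    """Crude Kolmogorov-complexity proxy: gzip size of coarsely quantized
    weights. Smaller => simpler (more compressible) network."""
    w = np.concatenate([p.ravel() for p in weights])
    lo, hi = w.min(), w.max()
    # Quantize to 2**n_bits levels so lossless compression has structure to find.
    q = np.round((w - lo) / (hi - lo + 1e-12) * (2**n_bits - 1)).astype(np.uint8)
    return len(zlib.compress(q.tobytes(), level=9))

rng = np.random.default_rng(0)
random_net = [rng.normal(size=(64, 64))]       # memorization-like: noisy weights
structured_net = [np.tile(np.eye(8), (8, 8))]  # generalization-like: regular weights
print(compression_complexity(random_net) > compression_complexity(structured_net))
# True: structured weights compress far better
```

Tracked over training, a rise then fall in such a proxy would match the memorization-then-generalization story: the network first stores noisy particulars, then settles into a more compressible solution.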
akbir. (@akbirkhan) 's Twitter Profile Photo

In the spirit of making more real world evals, here is the Factorio Learning Environment (FLE). Spurred by wanting to eval if models are good paperclip maximisers, we check how well agents build factories for other things 🏗️🏭🛠️

Ola Kalisz (@olakalisz8) 's Twitter Profile Photo

Antiviral therapy design is myopic 🦠🙈 optimised only for the current strain. That's why you need a different Flu vaccine every year! Our #ICML2025 paper ADIOS proposes "shaper therapies" that steer viral evolution in our favour & remain effective. Work done at the Foerster Lab for AI Research 🧵👇

Jürgen Schmidhuber (@schmidhuberai) 's Twitter Profile Photo

Since 1990, we have worked on artificial curiosity & measuring "interestingness." Our new ICML paper uses "Prediction of Hidden Units" loss to quantify in-context computational complexity in sequence models. It can tell boring from interesting tasks and predict correct reasoning.

Johan S. Obando 👍🏽 (@johanobandoc) 's Twitter Profile Photo

🚨 Excited to share our #ICML2025 paper: "The Courage to Stop: Overcoming Sunk Cost Fallacy in Deep RL"  

We train RL agents to know when to quit, cutting wasted effort and improving efficiency with our method LEAST.

📄Paper: arxiv.org/pdf/2506.13672
🧵Check the thread below👇🏾
Uljad Berdica (@uljadb99) 's Twitter Profile Photo

Unlock real diversity in your LLM! 🚀 LLM outputs can be boring and repetitive. Today, we release Intent Factored Generation (IFG) to: - Sample conceptually diverse outputs💡 - Improve performance on math and code reasoning tasks🤔 - Get more engaging conversational agents 🤖
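The name suggests factoring generation into two stages: first sample a high-level intent (hot, for conceptual diversity), then realize text conditioned on it (cold, for quality). The sketch below is a toy stand-in under that assumption; the intents, logits, and stub generator are all invented for illustration and are not the paper's implementation:

```python
import math
import random

# Illustrative intent set with made-up base logits.
INTENTS = {"explain with an analogy": 2.0,
           "give a worked example": 1.0,
           "list trade-offs": 0.5}

def sample_intent(rng, temperature=1.5):
    # Softmax over intent logits; a higher temperature flattens the
    # distribution so rarer intents are sampled more often.
    logits = {k: v / temperature for k, v in INTENTS.items()}
    z = sum(math.exp(v) for v in logits.values())
    probs = [math.exp(v) / z for v in logits.values()]
    return rng.choices(list(logits), weights=probs, k=1)[0]

def realize(intent, topic):
    # Stub for the second, low-temperature stage: conditioned on the
    # sampled intent, generate the response (deterministic here).
    return f"[{intent}] about {topic}"

rng = random.Random(0)
outputs = {realize(sample_intent(rng), "sorting") for _ in range(50)}
print(sorted(outputs))  # repeated draws cover conceptually distinct intents
```

Splitting the sampling this way separates "what to say" from "how to say it", so diversity can be injected at the intent level without degrading the fluency of each realization.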

Jakob Foerster (@j_foerst) 's Twitter Profile Photo

I recently had a lunchtime conversation with a very senior AI researcher about how multi-agent problems differ from single-agent ones (their starting point was that they do not). One point that made them think: As computers scale, the rest of the world (i.e. the non-agentic parts) is not