Davide Paglieri (@paglieridavide)'s Twitter Profile
Davide Paglieri

@paglieridavide

PhD Student @UCL_DARK
Previously Research Engineer at @bendingspoons

ID: 921103420517437440

Joined: 19-10-2017 19:59:03

210 Tweets

374 Followers

293 Following

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Learning When to Plan

LLM agents trained with dynamic planning learn when to spend test-time compute, balancing cost & performance.

This is the first work to explore training LLM agents for dynamic test-time compute allocation in sequential decision-making tasks.
Bartłomiej Cupiał (@cupiabart)'s Twitter Profile Photo

Almost all agentic pipelines prompt LLMs to explicitly plan before every action (ReAct), but turns out this isn't optimal for Multi-Step RL 🤔 Why?
In our new work we highlight a crucial issue with ReAct and show that we should make and follow plans instead🧵
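
(A rough sketch of the contrast the thread draws, under stated assumptions: `llm`, `env`, and the prompt strings below are hypothetical stand-ins, and the fixed `replan_every` schedule is only a placeholder for the learned, dynamic re-planning the paper actually trains.)

```python
# Hypothetical sketch of the two agent loops the thread contrasts.
# `llm` and `env` are stand-ins, not the paper's actual API.

def react_loop(llm, env, max_steps=50):
    """ReAct-style: reason/plan explicitly before *every* action."""
    obs = env.reset()
    reward = 0.0
    for _ in range(max_steps):
        thought = llm(f"Observation: {obs}\nThink step by step, then pick an action.")
        action = llm(f"{thought}\nAction:")
        obs, reward, done, _ = env.step(action)   # gym-style step
        if done:
            break
    return reward


def plan_and_follow_loop(llm, env, replan_every=10, max_steps=50):
    """Make a plan once and follow it, re-planning only occasionally
    (the paper's version decides *when* to re-plan, rather than using a fixed schedule)."""
    obs = env.reset()
    reward = 0.0
    plan = llm(f"Observation: {obs}\nWrite a short multi-step plan.")
    for t in range(max_steps):
        if t > 0 and t % replan_every == 0:
            plan = llm(f"Observation: {obs}\nRevise the plan if needed:\n{plan}")
        action = llm(f"Plan:\n{plan}\nObservation: {obs}\nNext action:")
        obs, reward, done, _ = env.step(action)
        if done:
            break
    return reward
```
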
François Chollet (@fchollet)'s Twitter Profile Photo

I like the analogy of the "bicycle for the mind", because riding a bike requires effort from you, and the bike multiplies the effect of that effort. I don't think the end goal of technology should be to let you sit around and twiddle your thumbs.

Machine Learning Street Talk (@mlstreettalk)'s Twitter Profile Photo

This paper is a banger, got all the top MLST guest talent on the roster too!! 👌 -- Edward Grefenstette, Jakob Foerster, Jack Parker-Holder, Tim Rocktäschel

Ethan Mollick (@emollick)'s Twitter Profile Photo

I have been wondering if there is an underlying capability factor that all the many benchmarks for AI are measuring.

It seems like the answer is yes. Overall correlation is good (median r ≈ 0.51) and there are distinct clusters (eg reasoning, code) with VERY high correlation.
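
(For illustration only: a minimal sketch, on synthetic data, of the kind of analysis the tweet describes, pairwise benchmark correlations plus the first principal component as a rough "general capability" factor. The numbers it prints are made up and will not match the tweet's r ≈ 0.51.)

```python
# Correlate benchmark scores across models and look for one dominant shared factor.
# Synthetic data; everything here is illustrative, not the tweet's actual analysis.
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks = 40, 12
g = rng.normal(size=(n_models, 1))                                   # latent "general ability"
scores = 0.7 * g + 0.7 * rng.normal(size=(n_models, n_benchmarks))   # noisy benchmark scores

corr = np.corrcoef(scores, rowvar=False)                 # benchmark-by-benchmark correlations
off_diag = corr[np.triu_indices(n_benchmarks, k=1)]
print("median pairwise r:", np.median(off_diag))

# Share of variance captured by the top principal component ~ strength of the shared factor
z = (scores - scores.mean(0)) / scores.std(0)
eigvals = np.linalg.eigvalsh(np.cov(z, rowvar=False))
print("variance explained by top factor:", eigvals[-1] / eigvals.sum())
```
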
Tim Rocktäschel (@_rockt)'s Twitter Profile Photo

Proud to announce that Dr Laura Ruis defended her PhD thesis titled "Understanding and Evaluating Reasoning in Large Language Models" last week 🥳. Massive thanks to Noah Goodman and Emine Yilmaz for examining! As is customary, Laura received a personal mortarboard from
Andrej Karpathy (@karpathy)'s Twitter Profile Photo

Finally had a chance to listen through this pod with Sutton, which was interesting and amusing. As background, Sutton's "The Bitter Lesson" has become a bit of biblical text in frontier LLM circles. Researchers routinely talk about and ask whether this or that approach or idea

Jubayer Ibn Hamid (@jubayer_hamid)'s Twitter Profile Photo

Exploration is fundamental to RL. Yet policy gradient methods often collapse: during training they fail to explore broadly, and converge into narrow, easily exploitable behaviors. The result is poor generalization, limited gains from test-time scaling, and brittleness on tasks
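
(Not the thread's proposal, which is cut off above; just a minimal sketch of the vanilla policy-gradient objective and the textbook entropy-bonus mitigation, to make the "collapse" concrete: without the entropy term the gradient keeps sharpening the policy toward already-exploited actions, so exploration dies off.)

```python
# Illustrative only; this is the standard entropy-regularized policy gradient,
# not the fix the thread goes on to propose.
import torch

def pg_loss(logits, actions, returns, entropy_coef=0.01):
    """Vanilla policy-gradient loss with an entropy bonus."""
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)            # log pi(a_t | s_t)
    pg = -(log_probs * returns).mean()            # REINFORCE objective
    entropy = dist.entropy().mean()               # high entropy = broad, exploratory policy
    return pg - entropy_coef * entropy            # bonus pushes back against collapse
```
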
A. H. Guzel (@ahguzeluk)'s Twitter Profile Photo

🎮 How can agents learn to generalize from limited offline data? We introduce iMac (Imagined Autocurricula) - training agents entirely in world models with emergent curricula!

Denis Tarasov (@ml_is_overhyped)'s Twitter Profile Photo

I’m asking for help. I was meant to start my PhD with Tim Rocktäschel and Roberta Raileanu at UCL, but my UK background check was refused. My appeal seems unlikely to succeed, so I’m urgently searching for any PhD or research positions in academia or industry. Any help is appreciated.

Jay A (@jay_azhang)'s Twitter Profile Photo

Our new benchmark has the top 6 AI models trading real capital

Grok4 is winning so far. It was short and then flipped to long, timing the bottom perfectly

It's up >500% in 1 day
Jay A (@jay_azhang)'s Twitter Profile Photo

Grok up 600% now

Watching some of these initial runs, anything can happen. Qwen went from 1st to last in a few hours

Official launch next week with 50x more capital. It's about to get very real

Jay A (@jay_azhang)'s Twitter Profile Photo

Alpha Arena is LIVE

6 AI models trading $10K each, fully autonomously

Real money. Real markets. Real benchmark.

Who's your money on? Link below
Jay A (@jay_azhang)'s Twitter Profile Photo

DeepSeek and Grok seem to have better contextual awareness of market microstructure

Grok in particular has made money in 100% of the past 5 rounds. More coming in technical writeup
Andrej Karpathy (@karpathy)'s Twitter Profile Photo

Nice, short post illustrating how simple text (discrete) diffusion can be. Diffusion (i.e. parallel, iterated denoising, top) is the pervasive generative paradigm in image/video, but autoregression (i.e. go left to right, bottom) is the dominant paradigm in text. For audio I've
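
(A minimal sketch of the contrast, assuming a hypothetical `model` that returns per-position token logits; this is not code from the post being linked.)

```python
# Left-to-right autoregression vs. parallel, iterated denoising over masked tokens.
# `model` is a hypothetical stand-in, not an API from the linked post.
import torch

def autoregressive_sample(model, prompt_ids, length):
    """Left to right: one token at a time, each conditioned on all previous ones."""
    ids = prompt_ids.clone()
    for _ in range(length):
        logits = model(ids)                              # (seq_len, vocab)
        next_id = torch.multinomial(logits[-1].softmax(-1), 1)
        ids = torch.cat([ids, next_id])
    return ids

def discrete_diffusion_sample(model, length, mask_id, steps=8):
    """Parallel, iterated denoising: start fully masked, predict every position
    at once each round, and keep the most confident predictions."""
    ids = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(steps):
        probs = model(ids).softmax(-1)                   # predict all positions in parallel
        conf, pred = probs.max(-1)
        still_masked = ids == mask_id
        n_masked = int(still_masked.sum())
        if n_masked == 0:
            break
        k = max(1, n_masked // (steps - step))           # unmask a fraction per round
        scores = torch.where(still_masked, conf, torch.full_like(conf, -1.0))
        keep = scores.topk(k).indices
        ids[keep] = pred[keep]
    return ids
```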

Samuel Schmidgall (@srschmidgall)'s Twitter Profile Photo

Want to join Google DeepMind as a Student Researcher for 6 months starting in January (PhD students)?

🧬 The project will be focused on AI for Science and AI for Cancer!

Send me a DM and also apply below👇
Weco AI (@wecoai)'s Twitter Profile Photo

Hard work scales linearly. Automation scales exponentially.

Over 17 days, our autonomous ML agent trained 120 models and beat 90% of teams in a live $100k ML competition, with zero human intervention.

Weco, now in public beta: