Daniel Fried (@dan_fried) Twitter Tweets • TwiCopy

Yu Su @#ICLR2025

8 months ago

🔥2025 is the year of agents, but are we there yet?🤔 🤯 "An Illusion of Progress? Assessing the Current State of Web Agents" –– our new study shows that frontier web agents may be far less competent (up to 59%) than previously reported! Why were benchmark numbers inflated? -

Likes: 230 · Replies: 10 · Reposts: 66

Jacob Springer

@jacspringer

8 months ago

Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9

Likes: 790 · Replies: 16 · Reposts: 173

Graham Neubig

@gneubig

8 months ago

Today's a big day! Months of work went into both of these releases, so we hope people enjoy them. OpenHands is now a great coding agent that you can run entirely locally (w/ OpenHands LM), and a great coding agent that you can run anywhere (w/ OpenHands Cloud).

Likes: 162 · Replies: 2 · Reposts: 24

Bowen Wang

@bowenwangnlp

8 months ago

🎮 Computer Use Agent Arena is LIVE! 🚀 🔥 Easiest way to test computer-use agents in the wild without any setup 🌟 Compare top VLMs: OpenAI Operator, Claude 3.7, Gemini 2.5 Pro, Qwen 2.5 vl and more 🕹️ Test agents on 100+ real apps & webs with one-click config 🔒 Safe & free

Likes: 333 · Replies: 14 · Reposts: 104

ML@CMU

@mlcmublog

8 months ago

blog.ml.cmu.edu/2025/04/09/cop… How do real-world developer preferences compare to existing evaluations? A CMU and UC Berkeley team led by Wayne Chi and Valerie Chen created Copilot Arena to collect user preferences on in-the-wild workflows. This blogpost overviews the design and

Likes: 18 · Replies: 0 · Reposts: 7

Sean Welleck

@wellecks

8 months ago

Had a fun time giving the tutorial at Simons Institute for the Theory of Computing! Here are the materials:

Transformers for Mathematics Tutorial

- Slides: wellecks.com/transformers4m…
- Code/exercises: github.com/wellecks/trans…

Likes: 382 · Replies: 3 · Reposts: 55

Graham Neubig

@gneubig

8 months ago

A big two days of agents starting tomorrow at CMU (and then two days of agent hackathon after that!) Registration is still open so if you're in or around Pittsburgh come one come all: cmu-agent-workshop.github.io We also plan to livestream for participants who can't make it in person

Likes: 188 · Replies: 5 · Reposts: 40

Xing Han Lu

@xhluca

7 months ago

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories. We find that rule-based evals underreport success rates, and

Likes: 230 · Replies: 4 · Reposts: 100

Christina Baek

@_christinabaek

7 months ago

Are current reasoning models optimal for test-time scaling? 🌠 No! Models make the same incorrect guess over and over again. We show that you can fix this problem w/o any crazy tricks 💫 – just do weight ensembling (WiSE-FT) for big gains on math! 1/N

Likes: 478 · Replies: 6 · Reposts: 103

Daniel Fried

@dan_fried

7 months ago

Zora's latest work shows that program induction / tool learning benefits web agents: large improvements in success & efficiency, when agents create their own tools to make tasks easier. I'm excited about programs for more controllable & verifiable agents in settings like these!

Likes: 39 · Replies: 1 · Reposts: 65

Daniel Fried

@dan_fried

7 months ago

Nick is awesome and I've learned a lot from him!

Likes: 1 · Replies: 0 · Reposts: 0

Deep Learning For Code @ ICLR'25

@dl4code

7 months ago

🚀 ICLR week is upon us! Join us at the #DL4C Workshop to hear Xingyao Wang discuss LLMs evolving into SE agents, covering the CodeAct framework (code exec as action), the OpenHands platform (dev-like generalist agents), & SWE-Gym (real-world task training).

Likes: 13 · Replies: 0 · Reposts: 2

Zora Wang

@zhiruow

7 months ago

Couldn't agree more on agent "continually adapt" from "streamed experiences"! This is exactly what we've envisioned in building online adaptive agents with self-induced evolving memory & skills in AWM (arxiv.org/abs/2409.07429) and ASI (arxiv.org/abs/2504.06821)! Yet still some

Likes: 68 · Replies: 1 · Reposts: 61

Deep Learning For Code @ ICLR'25

@dl4code

7 months ago

Just 6 days until #DL4C! 🗓️ Daniel Fried (CMU / Meta AI) will be sharing insights on how inducing functions from code makes LLM agents smarter and more efficient. Don't miss it! See you Sunday! #ICLR2025 #iclr

Likes: 11 · Replies: 0 · Reposts: 4

Prithviraj (Raj) Ammanabrolu

@rajammanabrolu

7 months ago

The future of embodied AI revolves around *collaborative* multi agent scenarios that need natural language communication, task delegation, resource sharing, and more ⛏️ Here are MINDcraft and MineCollab, a simulator and benchmark purpose built to enable research in this area!

Likes: 207 · Replies: 5 · Reposts: 40

Elias Stengel-Eskin (on the faculty job market)

@eliaseskin

7 months ago

Extremely excited to announce that I will be joining UT Austin Computer Science in August 2025 as an Assistant Professor! 🎉

I’m looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD

Likes: 441 · Replies: 91 · Reposts: 67

Philippe Laban

@philippelaban

7 months ago

🆕paper: LLMs Get Lost in Multi-Turn Conversation In real life, people don’t speak in perfect prompts. So we simulate multi-turn conversations — less lab-like, more like real use. We find that LLMs get lost in conversation. 👀What does that mean? 🧵1/N 📄arxiv.org/abs/2505.06120

Likes: 126 · Replies: 5 · Reposts: 30

Wenting Zhao

@wzhao_nlp

6 months ago

Some personal news: I'll join UMass Amherst CS as an assistant professor in fall 2026. Until then, I'll postdoc at Meta nyc. Reasoning will continue to be my main interest, with a focus on data-centric approaches🤩 If you're also interested, apply to me (phds & a postdoc)!

Likes: 833 · Replies: 95 · Reposts: 31

Jaemin Cho (on faculty job market)

@jmin__cho

6 months ago

Sharing some personal updates 🥳:
- I've completed my PhD at UNC Computer Science! 🎓
- Starting Fall 2026, I'll be joining the Computer Science dept. at Johns Hopkins University (JHU Computer Science) as an Assistant Professor 💙
- Currently exploring options + finalizing the plan for my gap year (Aug

Likes: 395 · Replies: 65 · Reposts: 45

Kayo Yin

@kayo_yin

6 months ago

Happy to announce the first workshop on Pragmatic Reasoning in Language Models — PragLM @ COLM 2025! 🧠🎉 How do LLMs engage in pragmatic reasoning, and what core pragmatic capacities remain beyond their reach? 🌐 sites.google.com/berkeley.edu/p… 📅 Submit by June 23rd

Likes: 78 · Replies: 4 · Reposts: 17