Nicholas Meade (@ncmeade) Twitter Tweets • TwiCopy

Arkil Patel

7 months ago

Super timely work led by Xing Han Lu with extensive human evaluation of agent trajectories across multiple benchmarks and LLMs!

12 likes, 0 replies, 2 retweets

A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.

96 likes, 2 replies, 22 retweets

Mila - Institut québécois d'IA

@mila_quebec

7 months ago

Congratulations to Mila members Ada Tur, Gaurav Kamath and Siva Reddy for their SAC award at #NAACL2025! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670

24 likes, 2 replies, 10 retweets

Ziling Cheng

@ziling_cheng

6 months ago

Do LLMs hallucinate randomly? Not quite. Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably. 📎 Paper: arxiv.org/abs/2505.22630 1/n

34 likes, 1 reply, 20 retweets

Benno Krojer

@benno_krojer

5 months ago

Excited to share the results of my internship research with AI at Meta, as part of a larger world modeling release! What subtle shortcuts are VideoLLMs taking on spatio-temporal questions? And how can we instead curate shortcut-robust examples at large scale? Details 👇🔬

59 likes, 3 replies, 22 retweets

Xing Han Lu

@xhluca

5 months ago

"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).

184 likes, 7 replies, 52 retweets

Maksym Andriushchenko @ ICLR

@maksym_andr

5 months ago

🚨Excited to release OS-Harm! 🚨 The safety of computer use agents has been largely overlooked. We created a new safety benchmark based on OSWorld for measuring 3 broad categories of harm: 1. deliberate user misuse, 2. prompt injections, 3. model misbehavior.

94 likes, 3 replies, 26 retweets

Cesare Spinoso-Di Piano

@cesare_spinoso

5 months ago

A blizzard is raging in Montreal when your friend says “Wow, the weather is amazing!” Humans easily interpret irony, while LLMs struggle with it. We propose a 𝘳𝘩𝘦𝘵𝘰𝘳𝘪𝘤𝘢𝘭-𝘴𝘵𝘳𝘢𝘵𝘦𝘨𝘺-𝘢𝘸𝘢𝘳𝘦 probabilistic framework as a solution. arxiv.org/abs/2506.09301 @ #acl2025

11 likes, 1 reply, 11 retweets

Verna Dankers

@vernadankers

5 months ago

I miss Edinburgh and its wonderful people already!! Thanks to Tal Linzen and Edoardo Ponti for inspiring discussions during the viva! I'm now exchanging Arthur's Seat for Mont Royal to join Siva Reddy's wonderful lab at Mila - Institut québécois d'IA 🤩

88 likes, 10 replies, 8 retweets

Shruti Joshi

@_shruti_joshi_

4 months ago

I will be at the Actionable Interpretability Workshop at ICML 2025 (#ICML) presenting *SSAEs* in the East Ballroom A from 1-2pm. Drop by (or send a DM) to chat about (actionable) interpretability, (actionable) identifiability, and everything in between!

24 likes, 1 reply, 6 retweets

Nicholas Meade

@ncmeade

4 months ago

Come by our #ACL2025 poster tomorrow to discuss the safety risks surrounding increasingly capable instruction-following retrievers (or anything safety related)! 16:00-17:30 on Tuesday in Hall 4/5

16 likes, 0 replies, 4 retweets

Siva Reddy

@sivareddyg

4 months ago

What's the path to scalable and safe web agents? Are web agents the new semantic parsing? I will be giving a talk at the ACL REALM workshop today at 9:30 am. Come check it out if you are interested in the history and contemporary work in this area. Lots of other exciting speakers.

38 likes, 1 reply, 9 retweets

Maksym Andriushchenko @ ICLR

@maksym_andr

4 months ago

🚨 Incredibly excited to share that I'm starting my research group focusing on AI safety and alignment at the ELLIS Institute Tübingen and Max Planck Institute for Intelligent Systems in September 2025! 🚨 Hiring. I'm looking for multiple PhD students: both those able to start

778 likes, 70 replies, 86 retweets

Nicholas Meade

@ncmeade

3 months ago

If you're interested in working on agent safety (and are a student in Canada) you should apply to this! Spandana Gella is extremely smart and one of the kindest people I've gotten to work with

6 likes, 0 replies, 0 retweets

Alexander Panfilov

@kotekjedi_ml

2 months ago

🚨 New paper! LLMs, when asked harmful questions, sometimes produce outputs that look helpful (and harmful) but are actually 𝗱𝗲𝗹𝗶𝗯𝗲𝗿𝗮𝘁𝗲𝗹𝘆 𝘄𝗿𝗼𝗻𝗴. What’s bad: current LLM-based jailbreak scorers can’t tell the difference (me neither). More in 🧵👇

97 likes, 4 replies, 16 retweets

Xing Han Lu

@xhluca

2 months ago

i will be presenting AgentRewardBench at #COLM2025 next week! session: #3, date: wednesday 11am to 1pm, poster: #545. come learn more about the paper, my recent works or just chat about anything (montreal, mila, etc.) here's a teaser of my poster :)

34 likes, 1 reply, 6 retweets

Milad Aghajohari

@maghajohari

a month ago

Introducing linear scaling of reasoning: 𝐓𝐡𝐞 𝐌𝐚𝐫𝐤𝐨𝐯𝐢𝐚𝐧 𝐓𝐡𝐢𝐧𝐤𝐞𝐫 Reformulate RL so thinking scales 𝐎(𝐧) 𝐜𝐨𝐦𝐩𝐮𝐭𝐞, not O(n^2), with O(1) 𝐦𝐞𝐦𝐨𝐫𝐲, architecture-agnostic. Train R1-1.5B into a markovian thinker with 96K thought budget, ~2X accuracy 🧵

919 likes, 14 replies, 200 retweets

Amirhossein Kazemnejad

@a_kazemnejad

a month ago

It’s clear next-gen reasoning LLMs will run for millions of tokens. RL at 1M needs ~100× more compute than at 128K. Our Markovian Thinking keeps compute scaling linear instead. Check out Milad’s thread; some of my perspectives below:

893 likes, 18 replies, 94 retweets