Parishad BehnamGhader (@parishadbehnam)'s Twitter Profile
Parishad BehnamGhader

@parishadbehnam

NLP PhD student at @Mila_Quebec and @mcgillu

ID: 828506588902256640

Link: http://parishadbehnam.github.io | Joined: 06-02-2017 07:32:18

56 Tweets

145 Followers

99 Following

Xing Han Lu (@xhluca)

Agents like OpenAI Operator can solve complex computer tasks, but what happens when users use them to cause harm, e.g. automate hate speech and spread misinformation?

To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web
Sara Vera Marjanović (@saraveramarjano)

Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
Siva Reddy (@sivareddyg)

Talking about "DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning". Going live at 11am PDT (i.e., in 20 mins). Last-minute change of plans. You might be able to watch live here: youtube.com/watch?v=aO_cTI…

Amirhossein Kazemnejad (@a_kazemnejad)

Introducing nanoAhaMoment: a Karpathy-style, single-file RL library for LLMs (<700 lines)

- super hackable
- no TRL / Verl, no abstraction💆‍♂️
- Single GPU, full param tuning, 3B LLM
- Efficient (R1-zero countdown < 10h)

Comes with a from-scratch, fully spelled-out YT video. [1/n]
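
The thread doesn't include code, but to make the idea concrete, here is a rough, heavily simplified sketch of what a single-file, R1-zero-style RL step for an LLM can look like: sample a group of completions, score them with a rule-based reward, and take a REINFORCE update with a group-relative baseline. The model name, prompt, reward, and masking below are illustrative assumptions, not nanoAhaMoment's actual code or API.

```python
# Hypothetical sketch of one "single-file RL for LLMs" training step.
# Not nanoAhaMoment; just the general rollout -> reward -> policy-gradient idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # assumption: any small causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)  # full-parameter tuning

prompt = "What is 7 * 8? Answer with just the number."
target = "56"
group_size = 4

# 1) Rollout: sample a group of completions for the same prompt.
enc = tok(prompt, return_tensors="pt")
prompt_len = enc.input_ids.shape[1]
with torch.no_grad():
    out = model.generate(
        **enc, do_sample=True, temperature=1.0, max_new_tokens=16,
        num_return_sequences=group_size, pad_token_id=tok.eos_token_id,
    )

# 2) Reward: a rule-based verifier; here just string match on the answer.
texts = tok.batch_decode(out[:, prompt_len:], skip_special_tokens=True)
rewards = torch.tensor([1.0 if target in t else 0.0 for t in texts])

# 3) Group-relative advantage: each completion's reward minus the group mean.
adv = rewards - rewards.mean()

# 4) REINFORCE loss over completion tokens only.
logits = model(out).logits[:, :-1, :]                   # position t predicts token t+1
logprobs = torch.log_softmax(logits, dim=-1)
token_lp = logprobs.gather(-1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
mask = torch.zeros_like(token_lp)
mask[:, prompt_len - 1:] = 1.0                          # keep only completion positions
mask = mask * (out[:, 1:] != tok.eos_token_id).float()  # crude pad/EOS mask (assumption)
seq_lp = (token_lp * mask).sum(dim=1)
loss = -(adv * seq_lp).mean()

loss.backward()
opt.step()
opt.zero_grad()
print(list(zip(texts, rewards.tolist())))
```

A real run would add batching over many prompts, a KL penalty against a reference model, and logging, but the core rollout → reward → group-relative update loop stays this compact, which is what makes a single-file, fully spelled-out implementation feasible.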
Xing Han Lu (@xhluca)

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories  

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

We find that rule-based evals underreport success rates, and
Afra Amini (@afra_amini)

Current KL estimation practices in RLHF can generate high variance and even negative values! We propose a provably better estimator that only takes a few lines of code to implement.🧵👇
w/ Tim Vieira and Ryan Cotterell
paper: arxiv.org/pdf/2504.10637
code: github.com/rycolab/kl-rb
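
For context, the high variance and occasional negative values typically come from the standard per-sample estimator log π(x) − log π_ref(x) on rollouts x ~ π. Below is a toy sketch contrasting that naive estimator with the commonly used k3 estimator and with a Rao-Blackwellized-style estimate that uses the full next-token distributions. The setup, numbers, and names are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch (not the paper's code): Monte-Carlo estimators of KL(p || q) for a
# single next-token distribution. p plays the role of the policy, q the
# reference model.
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.15, 0.05])    # "policy" next-token distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # "reference" next-token distribution
true_kl = float(np.sum(p * np.log(p / q)))

# Rollouts are sampled from the policy, as in RLHF.
x = rng.choice(len(p), size=2000, p=p)
log_ratio = np.log(p[x]) - np.log(q[x])   # log p(x)/q(x), one value per sample

# k1: the standard per-sample estimator. Unbiased, but individual terms (and
# small-batch averages) can be negative, and the variance is high.
k1 = log_ratio

# k3: (r - 1) - log r with r = q(x)/p(x). Still unbiased under sampling from p,
# nonnegative by construction, and often (not always) lower variance.
r = np.exp(-log_ratio)
k3 = (r - 1.0) - np.log(r)

# Rao-Blackwellized flavour: when the full next-token distributions are
# available at each sampled prefix, the sampled log-ratio can be replaced by
# the exact conditional KL. In this one-token toy that is simply the true value.
rb = np.full_like(k1, true_kl)

for name, est in [("k1 (naive)", k1), ("k3", k3), ("Rao-Blackwellized", rb)]:
    print(f"{name:18s} mean={est.mean():.4f}  std={est.std():.4f}")
print(f"{'true KL':18s} {true_kl:.4f}")
```

The exact construction and its variance guarantees are in the paper; the sketch is only meant to show why the naive estimate can dip below zero while per-token-exact alternatives cannot.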
Xing Han Lu (@xhluca)

"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).

"Build the web for agents, not agents for the web"

This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
Benno Krojer (@benno_krojer)

The video is online now!

3min speed science talk on "From a soup of raw pixels to abstract meaning"

youtu.be/AHsoMYG2Vqk?si…
Akari Asai (@akariasai)

We’re hosting a NeurIPS competition on real-world Retrieval-Augmented Generation! In addition to automatic and LLM-as-a-judge eval, we’ll feature live user feedback via our interactive RAG Arena. Stay tuned for more details, and don’t forget to sign up: agi-lti.github.io/MMU-RAGent/

Saba (@saba_a96)

We built a new 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 + 𝗥𝗟 image editing model using a strong verifier — and it beats SOTA diffusion baselines using 5× less data.
🔥 𝗘𝗔𝗥𝗟: a simple, scalable RL pipeline for high-quality, controllable edits.
🧵1/