Parishad BehnamGhader (@parishadbehnam)'s Twitter Profile
Parishad BehnamGhader

@parishadbehnam

NLP PhD student at @Mila_Quebec and @mcgillu

ID: 828506588902256640

Link: http://parishadbehnam.github.io | Joined: 06-02-2017 07:32:18

56 Tweets

145 Followers

99 Following

Xing Han Lu (@xhluca)

Agents like OpenAI Operator can solve complex computer tasks, but what happens when users use them to cause harm, e.g. automate hate speech and spread misinformation?

To find out, we introduce SafeArena (safearena.github.io), a benchmark to assess the capabilities of web
Sara Vera Marjanović (@saraveramarjano)

Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour.
🔗: mcgill-nlp.github.io/thoughtology/
Siva Reddy (@sivareddyg)

Talking about "DeepSeek-R1 Thoughtology: Let’s <think> about LLM reasoning". Going live at 11am PDT (i.e., in 20 mins). Last-minute change of plans. You might be able to watch live here: youtube.com/watch?v=aO_cTI…

Amirhossein Kazemnejad (@a_kazemnejad)

Introducing nanoAhaMoment: a Karpathy-style, single-file RL library for LLMs (<700 lines)

- super hackable
- no TRL / Verl, no abstraction💆‍♂️
- Single GPU, full param tuning, 3B LLM
- Efficient (R1-zero countdown < 10h)

Comes with a from-scratch, fully spelled-out YT video. [1/n]
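
The thread doesn't include code, but to make the idea concrete, here is a rough, heavily simplified sketch of what a single-file, R1-zero-style RL step for an LLM can look like: sample a group of completions, score them with a rule-based reward, and take a REINFORCE update with a group-relative baseline. The model name, prompt, reward, and masking below are illustrative assumptions, not nanoAhaMoment's actual code or API.

```python
# Hypothetical sketch of one "single-file RL for LLMs" training step.
# Not nanoAhaMoment; just the general rollout -> reward -> policy-gradient idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # assumption: any small causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.AdamW(model.parameters(), lr=1e-6)  # full-parameter tuning

prompt = "What is 7 * 8? Answer with just the number."
target = "56"
group_size = 4

# 1) Rollout: sample a group of completions for the same prompt.
enc = tok(prompt, return_tensors="pt")
prompt_len = enc.input_ids.shape[1]
with torch.no_grad():
    out = model.generate(
        **enc, do_sample=True, temperature=1.0, max_new_tokens=16,
        num_return_sequences=group_size, pad_token_id=tok.eos_token_id,
    )

# 2) Reward: a rule-based verifier; here just string match on the answer.
texts = tok.batch_decode(out[:, prompt_len:], skip_special_tokens=True)
rewards = torch.tensor([1.0 if target in t else 0.0 for t in texts])

# 3) Group-relative advantage: each completion's reward minus the group mean.
adv = rewards - rewards.mean()

# 4) REINFORCE loss over completion tokens only.
logits = model(out).logits[:, :-1, :]                   # position t predicts token t+1
logprobs = torch.log_softmax(logits, dim=-1)
token_lp = logprobs.gather(-1, out[:, 1:].unsqueeze(-1)).squeeze(-1)
mask = torch.zeros_like(token_lp)
mask[:, prompt_len - 1:] = 1.0                          # keep only completion positions
mask = mask * (out[:, 1:] != tok.eos_token_id).float()  # crude pad/EOS mask (assumption)
seq_lp = (token_lp * mask).sum(dim=1)
loss = -(adv * seq_lp).mean()

loss.backward()
opt.step()
opt.zero_grad()
print(list(zip(texts, rewards.tolist())))
```

A real run would add batching over many prompts, a KL penalty against a reference model, and logging, but the core rollout → reward → group-relative update loop stays this compact, which is what makes a single-file, fully spelled-out implementation feasible.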
Xing Han Lu (@xhluca)

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories  

We are releasing the first benchmark to evaluate how well automatic evaluators, such as LLM judges, can evaluate web agent trajectories.

We find that rule-based evals underreport success rates, and
Afra Amini (@afra_amini)

Current KL estimation practices in RLHF can generate high variance and even negative values! We propose a provably better estimator that only takes a few lines of code to implement.🧵👇
w/ Tim Vieira and Ryan Cotterell
paper: arxiv.org/pdf/2504.10637
code: github.com/rycolab/kl-rb
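
For context, the high variance and occasional negative values typically come from the standard per-sample estimator log π(x) − log π_ref(x) on rollouts x ~ π. Below is a toy sketch contrasting that naive estimator with the commonly used k3 estimator and with a Rao-Blackwellized-style estimate that uses the full next-token distributions. The setup, numbers, and names are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch (not the paper's code): Monte-Carlo estimators of KL(p || q) for a
# single next-token distribution. p plays the role of the policy, q the
# reference model.
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.15, 0.05])    # "policy" next-token distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # "reference" next-token distribution
true_kl = float(np.sum(p * np.log(p / q)))

# Rollouts are sampled from the policy, as in RLHF.
x = rng.choice(len(p), size=2000, p=p)
log_ratio = np.log(p[x]) - np.log(q[x])   # log p(x)/q(x), one value per sample

# k1: the standard per-sample estimator. Unbiased, but individual terms (and
# small-batch averages) can be negative, and the variance is high.
k1 = log_ratio

# k3: (r - 1) - log r with r = q(x)/p(x). Still unbiased under sampling from p,
# nonnegative by construction, and often (not always) lower variance.
r = np.exp(-log_ratio)
k3 = (r - 1.0) - np.log(r)

# Rao-Blackwellized flavour: when the full next-token distributions are
# available at each sampled prefix, the sampled log-ratio can be replaced by
# the exact conditional KL. In this one-token toy that is simply the true value.
rb = np.full_like(k1, true_kl)

for name, est in [("k1 (naive)", k1), ("k3", k3), ("Rao-Blackwellized", rb)]:
    print(f"{name:18s} mean={est.mean():.4f}  std={est.std():.4f}")
print(f"{'true KL':18s} {true_kl:.4f}")
```

The exact construction and its variance guarantees are in the paper; the sketch is only meant to show why the naive estimate can dip below zero while per-token-exact alternatives cannot.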
Xing Han Lu (@xhluca)

"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).

"Build the web for agents, not agents for the web"

This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
Benno Krojer (@benno_krojer)

The video is online now!

3min speed science talk on "From a soup of raw pixels to abstract meaning"

youtu.be/AHsoMYG2Vqk?si…
Akari Asai (@akariasai)

We’re hosting a NeurIPS competition on real-world Retrieval-Augmented Generation! In addition to automatic and LLM-as-a-judge eval, we’ll feature live user feedback via our interactive RAG Arena. Stay tuned for more details, and don’t forget to sign up: agi-lti.github.io/MMU-RAGent/

Saba (@saba_a96)

We built a new 𝗮𝘂𝘁𝗼𝗿𝗲𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 + 𝗥𝗟 image editing model using a strong verifier — and it beats SOTA diffusion baselines using 5× less data.
🔥 𝗘𝗔𝗥𝗟: a simple, scalable RL pipeline for high-quality, controllable edits.
🧵1/