Zaid Khan (@codezakh)'s Twitter Profile
Zaid Khan

@codezakh

@uncnlp with @mohitban47 working on grounded reasoning + multimodal agents // currently @allen_ai formerly @neclabsamerica // bs+ms CompE @northeastern

ID: 1669925833891356673

Link: http://zaidkhan.me · Joined: 17-06-2023 04:32:28

373 Tweets

508 Followers

755 Following

Mohit Bansal (@mohitban47)'s Twitter Profile Photo

Shiny new building and good views from the new VirginiaTech campus in DC 😉 -- it was a pleasure to meet everyone and engage in exciting discussions about trustworthy agents, collaborative reasoning/privacy, and controllable multimodal generation -- thanks again Naren Ramakrishnan, …

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo

🚨 Announcing Generalized Correctness Models (GCMs) 🚨 Finding that LLMs have little self-knowledge about their own correctness, we train an 8B GCM to predict the correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger …

Hanqi Xiao (@hanqi_xiao)'s Twitter Profile Photo

🚨 Excited to announce General Correctness Models (GCM): 🔎 We find no special advantage in using an LLM to predict its own correctness; instead, LLMs benefit from learning to predict the correctness of many other models – becoming a GCM. Huge thanks to Vaidehi Patil, …

Justin Chih-Yao Chen (@cyjustinchen)'s Twitter Profile Photo


🚨 NuRL: Nudging the Boundaries of LLM Reasoning

GRPO improves LLM reasoning, but often within the model's "comfort zone": hard samples (w/ 0% pass rate) remain unsolvable and contribute zero learning signals. In NuRL, we show that "nudging" the LLM with self-generated hints …
Mohit Bansal (@mohitban47)'s Twitter Profile Photo

🚨 NuRL: Nudging the Boundaries of LLM Reasoning -- GRPO improves LLM reasoning, but stays within the model's "comfort zone", i.e., hard samples (0% pass rate) remain unsolvable and contribute no meaningful gradients. -- In NuRL, we show that "nudging" the LLM with …
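The "zero learning signal" point in the NuRL tweets above follows directly from how GRPO-style group-relative advantages are computed; a minimal sketch (the function name and reward values are illustrative, not from NuRL's codebase):

```python
def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: each rollout's reward is
    centered on the group mean and scaled by the group std deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        # Every rollout got the same reward (e.g. a hard sample where
        # all attempts fail): no relative signal, hence no gradient.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# A hard sample with a 0% pass rate: every rollout is rewarded 0.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # all advantages are 0.0

# If a "nudge" (hint) lets even one rollout succeed, a signal appears.
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

This is why samples the model never solves contribute nothing under vanilla GRPO, and why making one rollout succeed restores a nonzero advantage.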

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo


🚨 Introducing DINCO, a zero-resource calibration method for verbalized LLM confidence. We normalize over self-generated distractors to enforce coherence ➡️ better-calibrated and less saturated (more usable) confidence!

⚠️ Problem: Standard verbalized confidence is overconfident
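The DINCO tweet's core idea — normalizing verbalized confidence over self-generated distractors so the numbers form a coherent distribution — can be illustrated with a toy sketch; the function name, answers, and confidence values here are all made up for illustration, not DINCO's actual implementation:

```python
def normalize_over_distractors(verbalized_conf):
    """Toy sketch: rescale a model's verbalized confidences over an
    answer and its self-generated distractors so they sum to 1,
    enforcing coherence across the mutually exclusive options."""
    total = sum(verbalized_conf.values())
    return {ans: c / total for ans, c in verbalized_conf.items()}

# Hypothetical raw confidences: the model claims to be "90% sure" of its
# answer while also "60%" and "40%" sure of two self-generated
# distractors -- an incoherent, saturated set of claims.
raw = {"Paris": 0.9, "Lyon": 0.6, "Marseille": 0.4}
calibrated = normalize_over_distractors(raw)
# The answer's confidence deflates from 0.9 to 0.9/1.9 ≈ 0.47.
```

Normalization deflates saturated confidences exactly when the model also assigns high confidence to its own distractors, which matches the "less saturated (more usable)" claim above.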
Zaid Khan (@codezakh)'s Twitter Profile Photo

Can attest that this is true 🙂 and now there are RS emulators (github.com/LostCityRS/Ser…) which could be turned into a pretty cool environment to eval agents on — there are 100+ quests, lots of bosses to defeat, a legible "tech tree" with smithing, crafting etc

Joykirat (@joykiratsingh)'s Twitter Profile Photo


🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under- & overthinking in reasoning models to improve both accuracy and efficiency.

Underthinking ❌: Models terminate reasoning too early on harder problems, leading …
Justin Chih-Yao Chen (@cyjustinchen)'s Twitter Profile Photo

Large reasoning models suffer from under-adaptiveness: they underthink on hard problems and overthink on easy ones. TRAAC addresses this by introducing ✨difficulty calibration and attention-based compression✨ → +8.4% accuracy & +36.8% efficiency! 1️⃣ TRAAC adaptively mitigates …

Mohit Bansal (@mohitban47)'s Twitter Profile Photo

🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt, i.e., underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online-RL, difficulty-adaptive, attention-based compression method that prunes …

Mohit Bansal (@mohitban47)'s Twitter Profile Photo


🚨 Check out our awesome students/postdocs' papers at #COLM2025 and say hi to them (several are on the job market or hiring) -->

-- Archiki and David are on the post-PhD job market!
-- Elias finished his postdoc & is now faculty at UT-Austin CS and looking to admit PhD students!
Huaxiu Yao✈️ICLR 2025🇸🇬 (@huaxiuyaoml)'s Twitter Profile Photo

❗️Self-evolution is quietly pushing LLM agents off the rails. ⚠️ Even agents perfectly aligned at deployment can gradually forget human alignment and shift toward self-serving strategies. Over time, LLM agents stop following values, imitate bad strategies, and even spread misaligned …

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo

✈️ Arrived at #COLM2025, where I'll be helping to present the following 4 papers. I'm also recruiting multiple PhD students for my new lab at UT Austin -- happy to chat about research, PhD applications, or postdoc openings in my former postdoc lab at UNC! -- Learning to Generate …

David Wan (@meetdavidwan)'s Twitter Profile Photo

Thanks for the shoutout! 🇨🇦 I’ll be at #COLM2025 presenting two papers:

GenerationPrograms (Attribution): Poster Session 4, Oct 8th, 4:30 PM
QAPyramid (Summarization Eval): Poster Session 5, Oct 9th, 11:00 AM

I’m also on the industry job market for research scientist roles.

Archiki Prasad (@archikiprasad)'s Twitter Profile Photo

I am attending #COLM2025 🇨🇦 this week to present our work on:

Unit Test Generation: 📅 Oct 8th (Wed), 4:30 PM, #79
RAG with conflicting evidence: 📅 Oct 9th (Thu), 11 AM, #71

PS: I'm on the industry job market for RS roles, so you can reach me via DM or in-person to chat! 😄

Zun Wang (@zunwang919)'s Twitter Profile Photo


🚨 Thrilled to introduce Self-Improving Demonstrations (SID) for Goal-Oriented Vision-and-Language Navigation — a scalable paradigm where navigation agents learn to explore by teaching themselves.

➡️ Agents iteratively generate and learn from their own successful trajectories
CODS (@ikddcods)'s Twitter Profile Photo


We welcome Prof. Mohit Bansal (UNC Chapel Hill) as a keynote speaker at #CODS2025!

Director of UNC’s MURGe-Lab, he works on multimodal generative models, reasoning agents & faithful language generation. He is an AAAI Fellow, a PECASE recipient, and a multiple-time best paper awardee.
Shoubin Yu✈️ICLR 2025🇸🇬 (@shoubin621)'s Twitter Profile Photo


🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning!

🔬SciVideoBench:

1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos.

2. Features 1,000 challenging MCQs across three reasoning types: …