Zaid Khan (@codezakh)'s Twitter Profile
Zaid Khan

@codezakh

@uncnlp with @mohitban47 working on grounded reasoning + multimodal agents // currently @allen_ai formerly @neclabsamerica // bs+ms CompE @northeastern

ID: 1669925833891356673

Link: http://zaidkhan.me · Joined: 17-06-2023 04:32:28

373 Tweets

508 Followers

755 Following

Mohit Bansal (@mohitban47)'s Twitter Profile Photo

Shiny new building and good views from the new VirginiaTech campus in DC 😉 -- it was a pleasure to meet everyone and engage in exciting discussions about trustworthy agents, collaborative reasoning/privacy, and controllable multimodal generation -- thanks again Naren Ramakrishnan, …

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo

🚨 Announcing Generalized Correctness Models (GCMs) 🚨 Finding that LLMs have little self-knowledge about their own correctness, we train an 8B GCM to predict the correctness of many models, which is more accurate than training model-specific CMs, and outperforms a larger …

Hanqi Xiao (@hanqi_xiao)'s Twitter Profile Photo

🚨 Excited to announce General Correctness Models (GCM): 🔎 We find no special advantage in using an LLM to predict its own correctness; instead, LLMs benefit from learning to predict the correctness of many other models – becoming a GCM. Huge thanks to Vaidehi Patil, …

Justin Chih-Yao Chen (@cyjustinchen)'s Twitter Profile Photo


🚨 NuRL: Nudging the Boundaries of LLM Reasoning

GRPO improves LLM reasoning, but often within the model's "comfort zone": hard samples (w/ 0% pass rate) remain unsolvable and contribute zero learning signals. In NuRL, we show that "nudging" the LLM with self-generated hints …
Mohit Bansal (@mohitban47)'s Twitter Profile Photo

🚨 NuRL: Nudging the Boundaries of LLM Reasoning -- GRPO improves LLM reasoning, but stays within the model's "comfort zone", i.e., hard samples (0% pass rate) remain unsolvable and contribute no meaningful gradients. -- In NuRL, we show that "nudging" the LLM with …
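The "zero learning signal" point in the NuRL tweets above follows directly from how GRPO-style group-relative advantages are computed; a minimal sketch (the function name and reward values are illustrative, not from NuRL's codebase):

```python
def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: each rollout's reward is
    centered on the group mean and scaled by the group std deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        # Every rollout got the same reward (e.g. a hard sample where
        # all attempts fail): no relative signal, hence no gradient.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# A hard sample with a 0% pass rate: every rollout is rewarded 0.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # all advantages are 0.0

# If a "nudge" (hint) lets even one rollout succeed, a signal appears.
print(grpo_advantages([1.0, 0.0, 0.0, 0.0]))
```

This is why samples the model never solves contribute nothing under vanilla GRPO, and why making one rollout succeed restores a nonzero advantage.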

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo


🚨 Introducing DINCO, a zero-resource calibration method for verbalized LLM confidence. We normalize over self-generated distractors to enforce coherence ➡️ better-calibrated and less saturated (more usable) confidence!

⚠️ Problem: Standard verbalized confidence is overconfident
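The DINCO tweet's core idea — normalizing verbalized confidence over self-generated distractors so the numbers form a coherent distribution — can be illustrated with a toy sketch; the function name, answers, and confidence values here are all made up for illustration, not DINCO's actual implementation:

```python
def normalize_over_distractors(verbalized_conf):
    """Toy sketch: rescale a model's verbalized confidences over an
    answer and its self-generated distractors so they sum to 1,
    enforcing coherence across the mutually exclusive options."""
    total = sum(verbalized_conf.values())
    return {ans: c / total for ans, c in verbalized_conf.items()}

# Hypothetical raw confidences: the model claims to be "90% sure" of its
# answer while also "60%" and "40%" sure of two self-generated
# distractors -- an incoherent, saturated set of claims.
raw = {"Paris": 0.9, "Lyon": 0.6, "Marseille": 0.4}
calibrated = normalize_over_distractors(raw)
# The answer's confidence deflates from 0.9 to 0.9/1.9 ≈ 0.47.
```

Normalization deflates saturated confidences exactly when the model also assigns high confidence to its own distractors, which matches the "less saturated (more usable)" claim above.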
Zaid Khan (@codezakh)'s Twitter Profile Photo

Can attest that this is true 🙂 and now there are RS emulators (github.com/LostCityRS/Ser…) which could be turned into a pretty cool environment to eval agents on — there are 100+ quests, lots of bosses to defeat, a legible "tech tree" with smithing, crafting etc

Joykirat (@joykiratsingh)'s Twitter Profile Photo


🚨 Excited to announce TRAAC, an online difficulty-adaptive, attention-based method that handles the tradeoff of under- & overthinking in reasoning models to improve both accuracy and efficiency.

Underthinking ❌: Models terminate reasoning too early on harder problems, leading …
Justin Chih-Yao Chen (@cyjustinchen)'s Twitter Profile Photo

Large reasoning models suffer from under-adaptiveness: they underthink on hard problems and overthink on easy ones. TRAAC addresses this by introducing ✨difficulty calibration and attention-based compression✨ → +8.4% accuracy & +36.8% efficiency! 1️⃣ TRAAC adaptively mitigates …

Mohit Bansal (@mohitban47)'s Twitter Profile Photo

🚨 "Think the right amount" for improving both reasoning accuracy and efficiency! --> Large reasoning models under-adapt, i.e., underthink on hard problems and overthink on easy ones --> ✨TRAAC✨ is an online-RL, difficulty-adaptive, attention-based compression method that prunes …

Mohit Bansal (@mohitban47)'s Twitter Profile Photo


🚨 Check out our awesome students/postdocs' papers at #COLM2025 and say hi to them (several are on the job market or hiring) -->

-- Archiki and David are on the post-PhD job market!
-- Elias finished his postdoc & is now faculty at UT-Austin CS and looking to admit PhD students!
Huaxiu Yao✈️ICLR 2025🇸🇬 (@huaxiuyaoml)'s Twitter Profile Photo

❗️Self-evolution is quietly pushing LLM agents off the rails. ⚠️ Even agents perfectly aligned at deployment can gradually forget human alignment and shift toward self-serving strategies. Over time, LLM agents stop following values, imitate bad strategies, and even spread misaligned …

Elias Stengel-Eskin (on the faculty job market) (@eliaseskin)'s Twitter Profile Photo

✈️ Arrived at #COLM2025, where I'll be helping to present the following 4 papers. I'm also recruiting multiple PhD students for my new lab at UT Austin -- happy to chat about research, PhD applications, or postdoc openings in my former postdoc lab at UNC! -- Learning to Generate …

David Wan (@meetdavidwan)'s Twitter Profile Photo

Thanks for the shoutout! 🇨🇦 I’ll be at #COLM2025 presenting two papers:

GenerationPrograms (Attribution): Poster Session 4, Oct 8th, 4:30 PM
QAPyramid (Summarization Eval): Poster Session 5, Oct 9th, 11:00 AM

I’m also on the industry job market for research scientist roles.

Archiki Prasad (@archikiprasad)'s Twitter Profile Photo

I am attending #COLM2025 🇨🇦 this week to present our work on:

Unit Test Generation: 📅 Oct 8th (Wed), 4:30 PM, #79
RAG with conflicting evidence: 📅 Oct 9th (Thu), 11 AM, #71

PS: I'm on the industry job market for RS roles, so you can reach me via DM or in-person to chat! 😄

Zun Wang (@zunwang919)'s Twitter Profile Photo


🚨 Thrilled to introduce Self-Improving Demonstrations (SID) for Goal-Oriented Vision-and-Language Navigation — a scalable paradigm where navigation agents learn to explore by teaching themselves.

➡️ Agents iteratively generate and learn from their own successful trajectories
CODS (@ikddcods)'s Twitter Profile Photo


We welcome Prof. Mohit Bansal (UNC Chapel Hill) as a keynote speaker at #CODS2025!

Director of UNC’s MURGe-Lab, he works on multimodal generative models, reasoning agents & faithful language generation. He is an AAAI Fellow, a PECASE recipient, and a multiple-time best paper awardee.
Shoubin Yu✈️ICLR 2025🇸🇬 (@shoubin621)'s Twitter Profile Photo


🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning!

🔬SciVideoBench:

1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos.

2. Features 1,000 challenging MCQs across three reasoning types: …