Jay Huang✈️ICLR2025🇸🇬 (@jentsehuang)'s Twitter Profile
Jay Huang✈️ICLR2025🇸🇬

@jentsehuang

#NLProc. Postdoc @JohnsHopkins. PhD @CUHKofficial. BS @PKU1898. Previous: @USC @TencentGlobal. LLM + Social Science, Multi-Agent, AI Fairness.

ID: 1664096530415124481

Link: https://penguinnnnn.github.io/
Joined: 01-06-2023 02:29:18

237 Tweets

586 Followers

830 Following

Irene Li (@irenelizihui)'s Twitter Profile Photo

📢 Today, we release #MMLUProX, which upgrades MMLU-Pro to 29 languages across 14 disciplines—11,829 reasoning-heavy Qs per language (≈342k total). The toughest multilingual stress test for today’s LLMs! 🌐🧠 Heartfelt thanks to everyone who contributed.🤝

Jiahao Xu (@jiahaox82739261) 's Twitter Profile Photo

🚨 Announcing DeepTheorem: Revolutionizing LLM Mathematical Reasoning! 🚀

𝕋𝕃𝔻ℝ:
- 🌟 Learning by exploration is the key lesson from recent RL-zero training: self-exploration significantly boosts how well LLMs use their pre-training knowledge;

-
Murat Kocaoglu (@murat_kocaoglu_)'s Twitter Profile Photo

I am pleased to announce that I will be joining the Johns Hopkins University Computer Science Department (JHU Computer Science) as an Assistant Professor in Fall 2025. I am grateful to my mentors for their unwavering support and to my exceptional PhD students for advancing our lab's vision.

Zhaopeng Tu (@tuzhaopeng)'s Twitter Profile Photo

When eyes and memory clash, who wins? 👁️🧠

Introducing a comprehensive study on vision-knowledge conflicts in MLLMs, where visual input contradicts the model's internal commonsense knowledge—and the results might surprise you. #ACL2025NLP 

📈 We developed an automated framework
Zhaopeng Tu (@tuzhaopeng)'s Twitter Profile Photo

Can MLLMs truly "see" safety risks in image-text combinations? 🌲🖼️

Introducing MMSafetyAwareness, the first comprehensive benchmark for multimodal safety awareness in MLLMs, featuring 1,500 image-prompt pairs across 29 safety scenarios to evaluate whether models correctly
Kai Chen (@kaichen23)'s Twitter Profile Photo

🤔How well do LLMs adapt to different norms?
🧵We introduce STEER-BENCH, a benchmark for assessing steerability in LLMs.
📉 Human: 81% | Top LLM: ~65%
🚨 Norm alignment ≠ solved.
📄 Paper: arxiv.org/abs/2505.20645
Zihao He (@ZihaoHe95), Taiwei Shi (@taiwei_shi), 🇺🇦 Kristina Lerman 🇺🇦 (@KristinaLerman)
Omar Shaikh (@oshaikh13)'s Twitter Profile Photo

What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵

Pan Lu (@lupantech)'s Twitter Profile Photo

Do LLMs truly understand math proofs, or just guess? 🤔Our new study on #IneqMath dives deep into Olympiad-level inequality proofs & reveals a critical gap: LLMs are often good at finding answers, but struggle with rigorous, sound proofs.

➡️ ineqmath.github.io

To tackle
Mark Dredze (@mdredze)'s Twitter Profile Photo

Our new paper explores knowledge conflict in LLMs. It also issues a word of warning to those using LLMs as a Judge: the model can't help but inject its own knowledge into its decisions.

Yueqi Song (@yueqi_song)'s Twitter Profile Photo

We have a long way to go on visual reasoning.
Our VisualPuzzles benchmark🧩shows similar findings, where the best models still can’t beat the bottom 5% of humans.
👉Check out our threads:
x.com/yueqi_song/sta…