Ayoung Lee (@o_cube01)'s Twitter Profile
Ayoung Lee

@o_cube01

CSE Ph.D. at UMich | Interested in Language Reasoning

ID: 1757296794810437632

Joined: 13-02-2024 06:53:13

17 Tweets

46 Followers

78 Following

Xinliang (Frederick) Zhang (@frederickxzhang)

Heard of the Alaska-Hawaii merger?🤔Wonder if LLMs know it’s pending government approval before it can happen? They stumble, but we’ve got a fix⚒️!
Dive into my #EMNLP2024 work 𝐍𝐚𝐫𝐫𝐚𝐭𝐢𝐯𝐞-𝐨𝐟-𝐓𝐡𝐨𝐮𝐠𝐡𝐭—a special prompting technique to unlock LLMs’ temporal reasoning
Yunxiang Zhang (@yunxiangzhang4)

🚨 New Benchmark Drop!
Can LLMs actually do ML research? Not toy problems, not Kaggle tweaks—but real, unsolved ML conference research competitions?
We built MLRC-BENCH to find out.
Paper: arxiv.org/abs/2504.09702
Leaderboard: huggingface.co/spaces/launch/…
Code: github.com/yunx-z/MLRC-Be…
Muhammad Khalifa (@mkhalifaaaa)

🚨Announcing SCALR @ COLM 2025 — Call for Papers!🚨

The 1st Workshop on Test-Time Scaling and Reasoning Models (SCALR) is coming to the Conference on Language Modeling (COLM) in Montreal this October!

This is the first workshop dedicated to this growing research area.

🌐 scalr-workshop.github.io
Yeda Song (@__runamu__)

🔥 GUI agents struggle with real-world mobile tasks.
We present MONDAY—a diverse, large-scale dataset built via an automatic pipeline that transforms internet videos into GUI agent data.
✅ VLMs trained on MONDAY show strong generalization
✅ Open data (313K steps) (1/7) 🧵
#CVPR
Jie Ruan (@jieruan75)

🔍LLMs now give medical diagnoses, legal advice, and even tackle scientific problems.
❓Your LLM sounds smart. But what if it’s just good at faking expertise?
🚀We built ExpertLongBench to find out.
📉And the results? They revealed several concerns.👇
🔗 huggingface.co/spaces/launch/…
Kai Zou (@zkjzou)

🔥 Excited to introduce ManyICLBench (ACL 2025)
🧐 Do many-shot ICL tasks evaluate LCLMs' ability to retrieve the most similar examples or learn from many examples? We carefully analyzed numerous tasks and categorized them.
📄 Paper: arxiv.org/abs/2411.07130
#ACL2025

Xinliang (Frederick) Zhang (@frederickxzhang)

How do LLMs really navigate the thinking space? Straight off to a final answer OR follow a wiggly path? Definitely commit OR get stuck to “infinite” self-doubting?
In our latest study, we unravel (over-)thinking through the lens of sub-thoughts: rb.gy/viud7z
more in 🧵
Ayoung Lee (@o_cube01)

I will be at NeurIPS from Dec 2nd to Dec 5th. I am interested in reasoning and alignment, and also looking for 2026 summer internships 👀 Feel free to DM me if you would like to chat or grab coffee ☕️! Excited to reconnect with old friends and make new ones 😆