Changho Shin @ ICLR 2025 (@changho_shin_) 's Twitter Profile

@changho_shin_

Ph.D. student at @WisconsinCS @UWMadison

ID: 1499060204570394625

Website: http://ch-shin.github.io | Joined: 02-03-2022 16:33:17

227 Tweets

454 Followers

890 Following

Yoonho Lee (@yoonholeee) 's Twitter Profile Photo

The standard way to improve reasoning in LLMs is to train on long chains of thought.

But these traces are often brute-force and shallow.

Introducing RLAD, where models instead learn _reasoning abstractions_: concise textual strategies that guide structured exploration. 
1/N🧵
Fred Sala (@fredsala) 's Twitter Profile Photo

Super excited to present our new work on hybrid architecture models—getting the best of Transformers and SSMs like Mamba—at #COLM2025! Come chat with Nicholas Roberts at poster session 2 on Tuesday. Thread below! (1)
Albert Ge (@albert_ge_95) 's Twitter Profile Photo

🔭 Towards Extending Open dLLMs to 131k Tokens
dLLMs behave differently from AutoRegressive models—they lack attention sinks, making long-context extension tricky.
A few simple tweaks go a long way!!
✍️blog albertge.notion.site/longdllm
💻code github.com/lbertge/longdl…
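"Attention sinks" here refers to the well-documented tendency of autoregressive Transformers to park a large share of attention mass on the first token. A minimal sketch of how one might quantify that, with a function and toy matrices of my own (not from the linked blog or code):

```python
import numpy as np

def attention_sink_mass(attn: np.ndarray) -> float:
    """Fraction of attention mass that queries place on the first (BOS)
    token, a common proxy for an 'attention sink'. `attn` is a
    (num_queries, num_keys) row-stochastic attention matrix."""
    return float(attn[:, 0].mean())

rng = np.random.default_rng(0)
sinky = rng.random((8, 8))
sinky[:, 0] += 10.0                        # exaggerate mass on token 0
sinky /= sinky.sum(axis=1, keepdims=True)  # renormalize rows to sum to 1

uniform = np.full((8, 8), 1.0 / 8)         # no sink: flat attention

print(attention_sink_mass(sinky), attention_sink_mass(uniform))
```

On this toy example the "sinky" matrix concentrates most of its mass in column 0, while the uniform matrix scores exactly 1/8; the tweet's claim is that diffusion LLMs look more like the latter, which changes which long-context tricks apply.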
Fred Sala (@fredsala) 's Twitter Profile Photo

The coolest trend for AI is shifting from conversation to action—less talking and more doing. This is also a great opportunity for evals: we need benchmarks that measure utility, including in an economic sense. terminalbench is my favorite effort of this type!

Sungmin Cha (@_sungmin_cha) 's Twitter Profile Photo

How can we be sure a generative model (LLMs, Diffusion) has truly unlearned something? What if existing evaluation metrics are misleading us?

In our new paper, we introduce FADE, a new metric that assesses genuine unlearning by measuring distributional alignment, moving beyond
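The tweet is cut off before FADE's definition, so the following is only a generic illustration of "measuring distributional alignment", not FADE itself: compare a supposedly unlearned model's output distribution against a retrained-from-scratch reference using KL divergence (all distributions here are toy):

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

reference = [0.70, 0.20, 0.10]   # model retrained without the forget set
unlearned = [0.65, 0.22, 0.13]   # candidate "unlearned" model
memorized = [0.05, 0.05, 0.90]   # model still spiking on forgotten data

print(kl_divergence(reference, unlearned) < kl_divergence(reference, memorized))  # True
```

The point of a distributional view is visible even in this toy: the "memorized" model may pass a simple accuracy-style check yet sit far from the retrained reference in distribution.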
Jiayu (Mila) Wang (@jiayuwang111) 's Twitter Profile Photo

Excited to share our work on deep research! In this work, we argue that four task design principles are essential for fair comparison in deep research: (1) user-centric, (2) dynamic, (3) unambiguous, and (4) multi-faceted & search-intensive, and LiveResearchBench is guided

Albert Ge (@albert_ge_95) 's Twitter Profile Photo

new state of the art UW School of Computer, Data & Information Sciences building fosters state of the art discussions 😃

excited to kickstart our new ml reading seminar! today we had Nicholas E. Corrado give a talk on his latest work on data mixing for llm alignment!

our reading seminar sites.google.com/view/madml
Aniket Rege (@wregss) 's Twitter Profile Photo

Lots of disagreement on the TL about the definition of AGI 🤔

Meanwhile on the way to #ICCV2025, they’re selling AGI at O’Hare!
Aniket Rege (@wregss) 's Twitter Profile Photo

Also presenting today:
1. MMFM4 in Ex Hall 2, #190, 9:30–10:30 AM
2. CEGIS in Ex Hall 2, somewhere between #132 and #145
Drop by to chat about T2I model bias and how to approximate human judgments of cultural faithfulness!

Lester Mackey (@lestermackey) 's Twitter Profile Photo

If you're a PhD student interested in interning with me or one of my amazing colleagues at Microsoft Research New England this summer, please apply here jobs.careers.microsoft.com/global/en/job/… (If you'd like to work with me, please include my name in your cover letter!)

Brenden Lake (@lakebrenden) 's Twitter Profile Photo

There are still open desks in our new Human & Machine Intelligence lab at Princeton. Express your interest in joining us: lake-lab.github.io/apply/

fly51fly (@fly51fly) 's Twitter Profile Photo

[LG] Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
R Wu, A Samanta, A Jain, S Fujimoto... [Meta AI] (2025)
arxiv.org/abs/2510.19178
Snorkel AI (@snorkelai) 's Twitter Profile Photo

New benchmark drop 🚀
SnorkelSpatial tests how well LLMs can think in space, following text-based moves and rotations in a 2D world.
Harit Vishwakarma (@harit_v) 's Twitter Profile Photo

Introducing SnorkelSpatial: A New Benchmark for Evaluating Spatial Reasoning in LLMs
Spatial reasoning is everywhere, from navigating city maps to understanding molecular interactions. But how well do LLMs handle tasks that require tracking objects moving through space?

Harit Vishwakarma (@harit_v) 's Twitter Profile Photo

We built SnorkelSpatial to answer this question. It's a procedurally generated benchmark that tests LLMs on spatial reasoning through a 2D grid world where particles and boards move and rotate through sequences of actions.
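A procedurally generated grid-world task of the kind described can be pictured with a minimal sketch; the action names, rotation convention, and grid bounds below are my own illustrative guesses, not the benchmark's actual spec:

```python
import random

# Illustrative sketch: track a particle on a 2D grid through a sequence
# of moves and 90-degree rotations about the origin, and record the
# ground-truth final position as the task's answer.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action):
    if action == "rotate90":  # counter-clockwise rotation about the origin
        x, y = pos
        return (-y, x)
    dx, dy = MOVES[action]
    return (pos[0] + dx, pos[1] + dy)

def final_position(start, actions):
    pos = start
    for a in actions:
        pos = step(pos, a)
    return pos

def make_task(rng, n_actions=5):
    """Procedurally generate (start, action sequence, ground-truth answer)."""
    actions = [rng.choice(list(MOVES) + ["rotate90"]) for _ in range(n_actions)]
    start = (rng.randrange(-3, 4), rng.randrange(-3, 4))
    return start, actions, final_position(start, actions)

start, actions, answer = make_task(random.Random(0))
print(start, actions, answer)
```

Because the simulator computes the ground truth, tasks can be generated at arbitrary scale and difficulty (longer action sequences, more objects) with no human labeling, which is the appeal of the procedural design.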
Snorkel AI (@snorkelai) 's Twitter Profile Photo

Evaluating how models reason about space and motion is key to building grounded, trustworthy AI. SnorkelSpatial offers a data-centric benchmark for measuring just that. Explore the research👇

Alex Ratner (@ajratner) 's Twitter Profile Photo

Static benchmarks as the gold standard of measurement will increasingly be a thing of the past. The future is dynamic benchmarks - regularly updated in response to evolving failure modes, error analyses, and objectives. Excited to see Snorkel AI Research leading the way here!

Snorkel AI (@snorkelai) 's Twitter Profile Photo

Excited for this release -- can't wait to see how agents handle the Snorkel-contributed tasks! We'll be at the event too -- see you there!

Jaden Park (@_jadenpark) 's Twitter Profile Photo

Me: memorize past exams 📚💯
Also me: fail on a slight tweak 🤦‍♂️🤦‍♂️

Turns out, we can use the same method to 𝗱𝗲𝘁𝗲𝗰𝘁 𝗰𝗼𝗻𝘁𝗮𝗺𝗶𝗻𝗮𝘁𝗲𝗱 𝗩𝗟𝗠𝘀! 🧵(1/10)

- Project Page: mm-semantic-perturbation.github.io
Jon Saad-Falcon (@jonsaadfalcon) 's Twitter Profile Photo

Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands?

The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW):
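The thread is truncated right at the definition, so the following is only a plausible reading of "intelligence per watt": some capability score divided by average power draw. The function name and all numbers are made up for illustration:

```python
# Hypothetical sketch of an "intelligence per watt" style metric:
# capability on some eval divided by average power draw during inference.
def intelligence_per_watt(capability_score: float, avg_power_watts: float) -> float:
    if avg_power_watts <= 0:
        raise ValueError("power must be positive")
    return capability_score / avg_power_watts

# A small local model vs a data-center model on the same eval (toy numbers):
local = intelligence_per_watt(capability_score=62.0, avg_power_watts=35.0)
cloud = intelligence_per_watt(capability_score=78.0, avg_power_watts=700.0)
print(local > cloud)  # True
```

Under this reading, a local model can win on efficiency even while losing on raw score, which is exactly the trade-off the tweet is gesturing at.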