Liliang Ren (@liliang_ren) 's Twitter Profile
Liliang Ren

@liliang_ren

Senior Researcher at Microsoft GenAI | UIUC CS PhD graduate | Efficient LLM | NLP | Former Intern @MSFTResearch @Azure @AmazonScience

ID: 1106294591718715392

Website: https://renll.github.io | Joined: 14-03-2019 20:42:39

101 Tweets

2.2K Followers

455 Following

Jize Jiang (@jizejiang) 's Twitter Profile Photo

Excited to introduce VTool-R1! 

We’ve trained VLMs to “think visually” using RL, blending Python-based 🖼️visual edits with💡textual Chain-of-Thought reasoning. 
Our trained qwen2.5-VL-32B surpasses GPT-4o on ChartQA & TableVQA, and even the compact qwen2.5-VL-7B significantly
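
A hypothetical sketch of the loop described above, where textual chain-of-thought is interleaved with Python-based visual edits whose outputs are fed back as new visual context. The `vlm` object, its `generate_step` method, and the `edited` convention are placeholders, not the released VTool-R1 API.

```python
# Hypothetical sketch of interleaving textual reasoning with Python-based visual edits.
from PIL import Image

def run_visual_edit(code: str, image: Image.Image) -> Image.Image:
    """Execute model-emitted Python that edits the current image (crop, zoom, highlight, ...).
    A real system would sandbox this; here we only sketch the contract: the
    snippet is expected to assign its result to a variable named `edited`."""
    scope = {"image": image, "Image": Image}
    exec(code, scope)
    return scope.get("edited", image)

def vtool_rollout(vlm, question: str, image: Image.Image, max_turns: int = 4):
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        step = vlm.generate_step(context)      # placeholder: returns a text step or a tool call
        if step.kind == "python":              # the model chose to edit the image
            image = run_visual_edit(step.code, image)
            context.append(("image", image))   # the edited view becomes new evidence
        else:                                  # ordinary textual reasoning step
            context.append(("text", step.text))
            if step.is_final:
                return step.text
    return None                                # no final answer within the turn budget
```
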
AI21 Labs (@ai21labs) 's Twitter Profile Photo

Attention was never enough.

The hybrid LLM era is here—and it’s moving fast.

From Mamba to Jamba to Bamba, we mapped every major model that’s challenged the Transformer default in the past 18 months.

🧵 A timeline of what’s changed and why it matters ↓

🔗
Feng Yao (@fengyao1909) 's Twitter Profile Photo

Failing on large-scale RL with VeRL?

⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy — even if they share the same weights!

📉 Blog:
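
A minimal sketch of why the backend mismatch matters and of one common mitigation (generic tensors and a generic policy-gradient setup, not VeRL's actual API): tokens are sampled under the inference engine's probabilities, but the gradient uses log-probs recomputed by the training backend, and the two can disagree numerically; a truncated importance-sampling ratio corrects for the gap.

```python
import torch

def corrected_pg_loss(train_logp: torch.Tensor,    # log pi_train(a_t | s_t), recomputed with grad
                      sample_logp: torch.Tensor,   # log pi_infer(a_t | s_t), saved from the rollout
                      advantages: torch.Tensor,
                      clip: float = 10.0) -> torch.Tensor:
    # Importance ratio between the policy being optimized and the one that
    # actually generated the tokens; sample_logp is data, so it carries no gradient.
    ratio = torch.exp(train_logp - sample_logp.detach())
    ratio = torch.clamp(ratio, max=clip)           # truncate to keep the variance bounded
    # Surrogate whose gradient is the importance-corrected policy gradient.
    return -(ratio * advantages).mean()
```

If both backends produced identical log-probs the ratio would be exactly 1, and this would reduce to the usual on-policy objective.
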
Mingyuan Wu (@mingyuanwu4) 's Twitter Profile Photo

Can VLMs learn to reason better by drawing on the brilliant thoughts of others?

🔥Our recent work on vision language model reasoning, through carefully designed multimodal memory and retrieval, has been accepted to Main Conference of #EMNLP2025.

💡Inspired by case-based
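
A loose illustration of what case-based retrieval over a multimodal memory could look like (not the paper's method; the `embedding` field here would come from a hypothetical joint image-text encoder): retrieve the most similar previously solved cases and prepend their reasoning traces to the prompt.

```python
import numpy as np

def retrieve_cases(query_emb: np.ndarray, memory: list[dict], k: int = 3) -> list[dict]:
    """memory: list of {"embedding": np.ndarray, "reasoning": str, "answer": str} entries."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # Rank stored cases by similarity to the new (image, question) embedding.
    ranked = sorted(memory, key=lambda case: cosine(query_emb, case["embedding"]), reverse=True)
    return ranked[:k]   # the retrieved reasoning traces are added to the VLM prompt as exemplars
```
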
Kaiyue Wen (@wen_kaiyue) 's Twitter Profile Photo

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
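
For context on how claims like "~40% speedup" are typically quantified: an optimizer's speedup over AdamW is usually reported as the ratio of training tokens each needs to reach the same target validation loss. A toy calculation (the loss curves below are made up):

```python
def tokens_to_target(loss_curve, target_loss):
    """loss_curve: list of (tokens_seen, val_loss) pairs in training order."""
    for tokens, loss in loss_curve:
        if loss <= target_loss:
            return tokens
    return None  # never reached the target

# Fabricated curves purely to illustrate the speedup arithmetic.
adamw_curve = [(1e9, 3.2), (2e9, 3.0), (4e9, 2.8), (8e9, 2.6)]
muon_curve  = [(1e9, 3.1), (2e9, 2.9), (3e9, 2.8), (6e9, 2.6)]

target = 2.8
speedup = tokens_to_target(adamw_curve, target) / tokens_to_target(muon_curve, target)
print(f"speedup over AdamW at loss {target}: {speedup:.2f}x")  # ~1.33x with these fake curves
```
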
Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Excited to see Gated DeltaNet being adopted in the Qwen series! It has also previously demonstrated strong effectiveness in NVIDIA's Jet-Nemotron

Eric Jang (@ericjang11) 's Twitter Profile Photo

A well-written pedagogical blog post. Some questions out of my curiosity: 1. for stably training large models, why is normalizing weights better than normalizing activations? 2. how much does regularizing the weight matrix to the Stiefel manifold limit its expressivity +

Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Introducing Tinker: a flexible API for fine-tuning language models.

Write training loops in Python on your laptop; we'll run them on distributed GPUs.

Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
Liliang Ren (@liliang_ren) 's Twitter Profile Photo

It is really amazing to see that a 5-year-old project finally got wrapped up and still seems very relevant to today's agentic research topics such as multi-agent collaboration, environment simulators, and instruction following!

Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
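
A minimal sketch of the general recipe (assumed Hugging Face-style model interfaces, prompt masking omitted for brevity; not the blog's exact setup): sample completions from the student so the data stays on-policy, then supervise every generated token with a reverse KL against the teacher, which provides the dense, SFT-like signal.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, batch, optimizer, max_new_tokens=128):
    # 1) On-policy data: the student samples its own completions for the prompts.
    student.eval()
    with torch.no_grad():
        seqs = student.generate(**batch, do_sample=True, max_new_tokens=max_new_tokens)

    # 2) Dense supervision: per-token reverse KL(student || teacher) on those tokens.
    student.train()
    s_logits = student(seqs).logits[:, :-1]
    with torch.no_grad():
        t_logits = teacher(seqs).logits[:, :-1]
    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # [batch, seq_len - 1]

    loss = reverse_kl.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
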
Larry Dial (@classiclarryd) 's Twitter Profile Photo

NorMuon from Zichong Li et al. takes the crown as the leading NanoGPT speedrun optimizer! github.com/KellerJordan/m…
NorMuon enhances Muon with a neuron normalization step after orthogonalization using second-order statistics. arxiv.org/abs/2510.05491
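
Reading the one-line description above literally (a sketch of the idea, not the paper's exact update rule): a Muon-style step orthogonalizes the momentum with a Newton-Schulz iteration, and the extra NorMuon step then rescales each output neuron (row) by a running second-moment estimate of its orthogonalized update.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the momentum matrix G, as in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def normuon_like_update(momentum: torch.Tensor, second_moment: torch.Tensor,
                        lr: float, beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    O = newton_schulz5(momentum)
    # Per-neuron (row-wise) running estimate of the squared orthogonalized update.
    second_moment.mul_(beta2).add_((1 - beta2) * O.pow(2).mean(dim=1))
    # Normalize each row by its second-moment estimate, Adam-style but per neuron.
    update = O / (second_moment.sqrt().unsqueeze(1) + eps)
    return -lr * update
```
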
Kimi.ai (@kimi_moonshot) 's Twitter Profile Photo

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
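
As a generic illustration of why linear-attention hybrids can serve as drop-in replacements for full attention (this is a vanilla gated linear-attention recurrence, not the KDA kernel itself): the quadratic attention matrix is replaced by a fixed-size recurrent state, so memory and per-token cost stop growing with context length.

```python
import torch

def gated_linear_attention(q, k, v, g):
    """q, k: [T, d_k]; v: [T, d_v]; g: [T] per-step decay gates in (0, 1).
    The state S stays d_k x d_v regardless of the sequence length T."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])   # decayed recurrent state update
        outputs.append(q[t] @ S)                 # read out the state with the query
    return torch.stack(outputs)                  # [T, d_v]
```
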

Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually

Seunghyun Seo (@seunghyunseo7) 's Twitter Profile Photo

just noticed modded-nanogpt adopted 'NorMuon' as the default (?).
It looks like `AdaMuon`. I personally didn't buy this idea because I thought Muon was enough and didn't want to introduce an optimizer state for the 2nd moment again like Adam... hmm
arxiv.org/abs/2510.05491 
arxiv.org/abs/2507.11005…