Liliang Ren (@liliang_ren) 's Twitter Profile
Liliang Ren

@liliang_ren

Senior Researcher at Microsoft GenAI | UIUC CS PhD graduate | Efficient LLM | NLP | Former Intern @MSFTResearch @Azure @AmazonScience

ID: 1106294591718715392

Website: https://renll.github.io | Joined: 14-03-2019 20:42:39

101 Tweets

2.2K Followers

455 Following

Jize Jiang (@jizejiang) 's Twitter Profile Photo

Excited to introduce VTool-R1! 

We’ve trained VLMs to “think visually” using RL, blending Python-based 🖼️visual edits with💡textual Chain-of-Thought reasoning. 
Our trained qwen2.5-VL-32B surpasses GPT-4o on ChartQA & TableVQA, and even the compact qwen2.5-VL-7B significantly
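
A hypothetical sketch of the loop described above, where textual chain-of-thought is interleaved with Python-based visual edits whose outputs are fed back as new visual context. The `vlm` object, its `generate_step` method, and the `edited` convention are placeholders, not the released VTool-R1 API.

```python
# Hypothetical sketch of interleaving textual reasoning with Python-based visual edits.
from PIL import Image

def run_visual_edit(code: str, image: Image.Image) -> Image.Image:
    """Execute model-emitted Python that edits the current image (crop, zoom, highlight, ...).
    A real system would sandbox this; here we only sketch the contract: the
    snippet is expected to assign its result to a variable named `edited`."""
    scope = {"image": image, "Image": Image}
    exec(code, scope)
    return scope.get("edited", image)

def vtool_rollout(vlm, question: str, image: Image.Image, max_turns: int = 4):
    context = [("image", image), ("text", question)]
    for _ in range(max_turns):
        step = vlm.generate_step(context)      # placeholder: returns a text step or a tool call
        if step.kind == "python":              # the model chose to edit the image
            image = run_visual_edit(step.code, image)
            context.append(("image", image))   # the edited view becomes new evidence
        else:                                  # ordinary textual reasoning step
            context.append(("text", step.text))
            if step.is_final:
                return step.text
    return None                                # no final answer within the turn budget
```
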
AI21 Labs (@ai21labs) 's Twitter Profile Photo

Attention was never enough.

The hybrid LLM era is here—and it’s moving fast.

From Mamba to Jamba to Bamba, we mapped every major model that’s challenged the Transformer default in the past 18 months.

🧵 A timeline of what’s changed and why it matters ↓

🔗
Feng Yao (@fengyao1909) 's Twitter Profile Photo

Failing on large-scale RL with VeRL?

⚠️ Mixing inference backends (vLLM/SGLang) with training backends (FSDP/Megatron) secretly turns your RL into off-policy — even if they share the same weights!

📉 Blog:
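
A minimal sketch of why the backend mismatch matters and of one common mitigation (generic tensors and a generic policy-gradient setup, not VeRL's actual API): tokens are sampled under the inference engine's probabilities, but the gradient uses log-probs recomputed by the training backend, and the two can disagree numerically; a truncated importance-sampling ratio corrects for the gap.

```python
import torch

def corrected_pg_loss(train_logp: torch.Tensor,    # log pi_train(a_t | s_t), recomputed with grad
                      sample_logp: torch.Tensor,   # log pi_infer(a_t | s_t), saved from the rollout
                      advantages: torch.Tensor,
                      clip: float = 10.0) -> torch.Tensor:
    # Importance ratio between the policy being optimized and the one that
    # actually generated the tokens; sample_logp is data, so it carries no gradient.
    ratio = torch.exp(train_logp - sample_logp.detach())
    ratio = torch.clamp(ratio, max=clip)           # truncate to keep the variance bounded
    # Surrogate whose gradient is the importance-corrected policy gradient.
    return -(ratio * advantages).mean()
```

If both backends produced identical log-probs the ratio would be exactly 1, and this would reduce to the usual on-policy objective.
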
Mingyuan Wu (@mingyuanwu4) 's Twitter Profile Photo

Can VLMs learn to reason better by drawing on the brilliant thoughts of others?

🔥Our recent work on vision language model reasoning, through carefully designed multimodal memory and retrieval, has been accepted to Main Conference of #EMNLP2025.

💡Inspired by case-based
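
A loose illustration of what case-based retrieval over a multimodal memory could look like (not the paper's method; the `embedding` field here would come from a hypothetical joint image-text encoder): retrieve the most similar previously solved cases and prepend their reasoning traces to the prompt.

```python
import numpy as np

def retrieve_cases(query_emb: np.ndarray, memory: list[dict], k: int = 3) -> list[dict]:
    """memory: list of {"embedding": np.ndarray, "reasoning": str, "answer": str} entries."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    # Rank stored cases by similarity to the new (image, question) embedding.
    ranked = sorted(memory, key=lambda case: cosine(query_emb, case["embedding"]), reverse=True)
    return ranked[:k]   # the retrieved reasoning traces are added to the VLM prompt as exemplars
```
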
Kaiyue Wen (@wen_kaiyue) 's Twitter Profile Photo

(1/n) Check out our new paper: "Fantastic Pretraining Optimizers and Where to Find Them"! >4000 models to find the fastest optimizer! 2× speedups over AdamW? Unlikely. Beware under-tuned baseline or limited scale! E.g. Muon: ~40% speedups <0.5B & only 10% at 1.2B (8× Chinchilla)!
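
For context on how claims like "~40% speedup" are typically quantified: an optimizer's speedup over AdamW is usually reported as the ratio of training tokens each needs to reach the same target validation loss. A toy calculation (the loss curves below are made up):

```python
def tokens_to_target(loss_curve, target_loss):
    """loss_curve: list of (tokens_seen, val_loss) pairs in training order."""
    for tokens, loss in loss_curve:
        if loss <= target_loss:
            return tokens
    return None  # never reached the target

# Fabricated curves purely to illustrate the speedup arithmetic.
adamw_curve = [(1e9, 3.2), (2e9, 3.0), (4e9, 2.8), (8e9, 2.6)]
muon_curve  = [(1e9, 3.1), (2e9, 2.9), (3e9, 2.8), (6e9, 2.6)]

target = 2.8
speedup = tokens_to_target(adamw_curve, target) / tokens_to_target(muon_curve, target)
print(f"speedup over AdamW at loss {target}: {speedup:.2f}x")  # ~1.33x with these fake curves
```
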
Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Today Thinking Machines Lab is launching our research blog, Connectionism. Our first blog post is “Defeating Nondeterminism in LLM Inference”

We believe that science is better when shared. Connectionism will cover topics as varied as our research is: from kernel numerics to
Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Excited to see Gated DeltaNet being adopted in the Qwen series! It has also previously demonstrated strong effectiveness in NVIDIA's Jet-Nemotron

Eric Jang (@ericjang11) 's Twitter Profile Photo

A well-written pedagogical blog post. Some questions out of my curiosity: 1. for stably training large models, why is normalizing weights better than normalizing activations? 2. how much does regularizing the weight matrix to the Stiefel manifold limit its expressivity +

Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Introducing Tinker: a flexible API for fine-tuning language models.

Write training loops in Python on your laptop; we'll run them on distributed GPUs.

Private beta starts today. We can't wait to see what researchers and developers build with cutting-edge open models!
Liliang Ren (@liliang_ren) 's Twitter Profile Photo

It is really amazing to see that a 5-year-old project finally got wrapped up and still seems very relevant to today's agentic research topics such as multi-agent collaboration, environment simulators, and instruction following!

Thinking Machines (@thinkymachines) 's Twitter Profile Photo

Our latest post explores on-policy distillation, a training approach that unites the error-correcting relevance of RL with the reward density of SFT. When training it for math reasoning and as an internal chat assistant, we find that on-policy distillation can outperform other
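
A minimal sketch of the general recipe (assumed Hugging Face-style model interfaces, prompt masking omitted for brevity; not the blog's exact setup): sample completions from the student so the data stays on-policy, then supervise every generated token with a reverse KL against the teacher, which provides the dense, SFT-like signal.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, batch, optimizer, max_new_tokens=128):
    # 1) On-policy data: the student samples its own completions for the prompts.
    student.eval()
    with torch.no_grad():
        seqs = student.generate(**batch, do_sample=True, max_new_tokens=max_new_tokens)

    # 2) Dense supervision: per-token reverse KL(student || teacher) on those tokens.
    student.train()
    s_logits = student(seqs).logits[:, :-1]
    with torch.no_grad():
        t_logits = teacher(seqs).logits[:, :-1]
    s_logp = F.log_softmax(s_logits, dim=-1)
    t_logp = F.log_softmax(t_logits, dim=-1)
    reverse_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)   # [batch, seq_len - 1]

    loss = reverse_kl.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
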
Larry Dial (@classiclarryd) 's Twitter Profile Photo

NorMuon from Zichong Li et al. takes the crown as the leading NanoGPT speedrun optimizer! github.com/KellerJordan/m…
NorMuon enhances Muon with a neuron normalization step after orthogonalization using second-order statistics. arxiv.org/abs/2510.05491
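
Reading the one-line description above literally (a sketch of the idea, not the paper's exact update rule): a Muon-style step orthogonalizes the momentum with a Newton-Schulz iteration, and the extra NorMuon step then rescales each output neuron (row) by a running second-moment estimate of its orthogonalized update.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize the momentum matrix G, as in Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def normuon_like_update(momentum: torch.Tensor, second_moment: torch.Tensor,
                        lr: float, beta2: float = 0.95, eps: float = 1e-8) -> torch.Tensor:
    O = newton_schulz5(momentum)
    # Per-neuron (row-wise) running estimate of the squared orthogonalized update.
    second_moment.mul_(beta2).add_((1 - beta2) * O.pow(2).mean(dim=1))
    # Normalize each row by its second-moment estimate, Adam-style but per neuron.
    update = O / (second_moment.sqrt().unsqueeze(1) + eps)
    return -lr * update
```
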
Kimi.ai (@kimi_moonshot) 's Twitter Profile Photo

Kimi Linear Tech Report is dropped! 🚀 huggingface.co/moonshotai/Kim… Kimi Linear: A novel architecture that outperforms full attention with faster speeds and better performance—ready to serve as a drop-in replacement for full attention, featuring our open-sourced KDA kernels! Kimi
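
As a generic illustration of why linear-attention hybrids can serve as drop-in replacements for full attention (this is a vanilla gated linear-attention recurrence, not the KDA kernel itself): the quadratic attention matrix is replaced by a fixed-size recurrent state, so memory and per-token cost stop growing with context length.

```python
import torch

def gated_linear_attention(q, k, v, g):
    """q, k: [T, d_k]; v: [T, d_v]; g: [T] per-step decay gates in (0, 1).
    The state S stays d_k x d_v regardless of the sequence length T."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = torch.zeros(d_k, d_v)
    outputs = []
    for t in range(T):
        S = g[t] * S + torch.outer(k[t], v[t])   # decayed recurrent state update
        outputs.append(q[t] @ S)                 # read out the state with the query
    return torch.stack(outputs)                  # [T, d_v]
```
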

Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Many people are confused by Minimax’s recent return to full attention - especially since it was the first large-scale pivot toward hybrid linear attention - and by Kimi’s later adoption of hybrid linear variants (as well as earlier attempts by Qwen3-Next, or Qwen3.5). I actually

Seunghyun Seo (@seunghyunseo7) 's Twitter Profile Photo

just noticed modded-nanogpt adopted 'NorMuon' as the default (?).
It looks like `AdaMuon`. I personally didn't buy this idea because I thought Muon was enough and didn't want to introduce an optimizer state for the 2nd moment again like Adam... hmm
arxiv.org/abs/2510.05491 
arxiv.org/abs/2507.11005…