Yikang Shen (@yikang_shen)'s Twitter Profile
Yikang Shen

@yikang_shen

MTS @xAI. ex Staff RS @IBM. PhD @Mila. Granite LMs, Ordered Neurons, Mixture of Attention Heads, JetMoE, stick-breaking attention and Power LR.

ID: 804168614

Joined: 05-09-2012 08:43:26

220 Tweets

2.2K Followers

370 Following

Songlin Yang (@songlinyang4):

I've created slides for those curious about the recent rapid progress in linear attention: from linear attention to Lightning-Attention, Mamba2, DeltaNet, and TTT/Titans. Check it out here: sustcsonglin.github.io/assets/pdf/tal…

Yikang Shen (@yikang_shen):

It's good to see Deepseek v3 draw everyone's attention to reducing the training cost of LLM. Over the last two years, we found that you can drastically reduce the cost of LLM in every step of its training, including 1) hyper-parameter search/scaling law experiments, 2) model

Songlin Yang (@songlinyang4):

Is RoPE necessary? Can sigmoid attention outperform softmax? Can we design PE for seamless length extrapolation? Join ASAP seminar 03—Shawn Tan & Yikang Shen present Stick-Breaking Attention (openreview.net/forum?id=r8J3D…), a new RoPE-free sigmoid attention with strong extrapolation!

Guodong Zhang (@guodzh):

Exciting times ahead! On the pretraining team, we are now hiring for data and eval as well. Check it out here: job-boards.greenhouse.io/xai/jobs/43783…

Songlin Yang (@songlinyang4):

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Yikang Shen (@yikang_shen):

Watching my friend Alex build NEXA AI has been incredible 👏 Today, they launched OmniNeural, a strong multimodal AI that runs on your phone's NPU. Real-time, private, blazing fast. Huge congrats to them! 🎉 #AI #Innovation #Startup