Zheng Zhan (@zhengzhan13)'s Twitter Profile
Zheng Zhan

@zhengzhan13

Researcher @MSFTResearch

ID: 1498189775005749248

https://zhanzheng8585.github.io/ · Joined 28-02-2022 06:54:28

28 Tweets

30 Followers

62 Following

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

Is RoPE necessary? Can sigmoid attention outperform softmax? Can we design PE for seamless length extrapolation? Join ASAP seminar 03—Shawn Tan & Yikang Shen present Stick-Breaking Attention (openreview.net/forum?id=r8J3D…), a new RoPE-free sigmoid attention with strong extrapolation!
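
The stick-breaking construction behind this talk can be sketched in a few lines. Hedged: this is a naive O(T^2) reference of the idea as described in the abstract, not the paper's efficient kernel, and details such as whether a token attends to itself are my simplification. Each query assigns a sigmoid "bite" to every earlier key, and the attention weight is that bite times the unbroken remainder of the stick left by the keys closer to the query, which builds in a recency bias without RoPE.

```python
import torch
import torch.nn.functional as F

def stick_breaking_attention(q, k, v):
    """Naive sketch of stick-breaking (sigmoid) attention.

    q, k, v: [batch, heads, T, d]. No positional encoding is used; recency
    bias comes from the stick-breaking product itself. O(T^2) illustration.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-1, -2)) / d**0.5            # [B, H, T, T]
    log_beta = F.logsigmoid(logits)                        # log sigma(z_ij)
    log_rest = F.logsigmoid(-logits)                       # log (1 - sigma(z_ij))

    T = logits.shape[-1]
    causal = torch.arange(T).view(1, -1) < torch.arange(T).view(-1, 1)  # keys j < query i

    # For query i and key j: sum log(1 - beta_{i,k}) over keys j < k < i,
    # i.e. the stick left over after the more recent keys took their bites.
    rest = log_rest.masked_fill(~causal, 0.0)
    after_j = rest.flip(-1).cumsum(-1).flip(-1) - rest     # sum over k > j

    attn = torch.exp(log_beta + after_j).masked_fill(~causal, 0.0)
    return attn @ v                                        # weights sum to <= 1 per query
```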

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

There is a fundamental tradeoff in parallelism vs expressiveness. Linear RNNs (Mamba/GLA) lack "true" recurrence, limiting them to TC0 complexity. ASAP 04 (tomorrow 1:30pm EST) will explore *scalable* linear RNNs that enhance state tracking while maintaining parallelism!
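
A quick way to see the tradeoff the tweet describes (a hedged sketch, not the seminar's content): a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so it can be evaluated with a parallel scan, whereas a nonlinear recurrence such as h_t = tanh(W h_{t-1} + x_t) has no such composition and must run step by step.

```python
import torch

def linear_rnn_reference(a, b):
    """Sequential reference for h_t = a_t * h_{t-1} + b_t, with a, b: [T, d]."""
    h, out = torch.zeros_like(b[0]), []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return torch.stack(out)

def compose(f, g):
    """Associative composition of two updates (a, b): apply g first, then f."""
    return (f[0] * g[0], f[0] * g[1] + f[1])

# Fusing steps 0 and 1 into one update matches running them in order,
# which is exactly the property a parallel (Blelloch-style) scan exploits.
T, d = 8, 4
a, b = torch.rand(T, d), torch.randn(T, d)
A, B = compose((a[1], b[1]), (a[0], b[0]))
assert torch.allclose(A * torch.zeros(d) + B, linear_rnn_reference(a, b)[1])
```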

John Langford (@johnclangford)'s Twitter Profile Photo

The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and mgostIH for further discussion.
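
The one-line summary in the tweet (a belief state is a compact summary of the past sufficient for all future predictions) can be sketched as a training objective: a forward encoder reads the prefix, a backward encoder reads the suffix, and a joint head predicts both the next token after the prefix and the previous token before the suffix. The module names below are placeholders and this is only a rough reading of the setup; the paper and linked talk give the actual objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefStateSketch(nn.Module):
    """Rough BST-style objective sketch (placeholder modules, not the official code)."""

    def __init__(self, vocab, dim, fwd_encoder, bwd_encoder):
        super().__init__()
        self.fwd, self.bwd = fwd_encoder, bwd_encoder   # e.g. two causal transformers -> [B, T, dim]
        self.head = nn.Linear(2 * dim, 2 * vocab)       # joint (next, previous) token logits
        self.vocab = vocab

    def loss(self, prefix, suffix, next_tok, prev_tok):
        f = self.fwd(prefix)[:, -1]          # belief state: summary of the past
        b = self.bwd(suffix.flip(1))[:, -1]  # summary of the future, read backwards
        nxt, prv = self.head(torch.cat([f, b], dim=-1)).split(self.vocab, dim=-1)
        return F.cross_entropy(nxt, next_tok) + F.cross_entropy(prv, prev_tok)
```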

Andrej Karpathy (@karpathy)'s Twitter Profile Photo

There's a new paper circulating looking in detail at the LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

Dimitris Papailiopoulos (@dimitrispapail)'s Twitter Profile Photo

We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Infini-AI-Lab (@infiniailab)'s Twitter Profile Photo

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n

Microsoft Azure (@azure)'s Twitter Profile Photo

Meet Phi-4-mini-flash-reasoning: a fast, low-latency SLM built for scale with its novel SambaY architecture. Available on Azure AI Foundry and Hugging Face. Experience advanced reasoning capabilities here: msft.it/6018SAmHn

Liliang Ren (@liliang_ren)'s Twitter Profile Photo

Reasoning can be made much, much faster—with fundamental changes in neural architecture. 😮 Introducing Phi4-mini-Flash-Reasoning: a 3.8B model that surpasses Phi4-mini-Reasoning on major reasoning tasks (AIME24/25, MATH500, GPQA-D), while delivering up to 10× higher throughput

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo


Microsoft just dropped Phi-4-mini-flash-reasoning.

- built on a new hybrid architecture
- 10X higher throughput and a 2 to 3X reduction in latency
- significantly faster inference without sacrificing reasoning performance.

Microsoft swaps most of that heavy work for a lean
Liliang Ren (@liliang_ren)'s Twitter Profile Photo

We’re open-sourcing the pre-training code for Phi4-mini-Flash, our SoTA hybrid model that delivers 10× faster reasoning than Transformers — along with μP++, a suite of simple yet powerful scaling laws for stable large-scale training. 🔗 github.com/microsoft/Arch… (1/4)
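
μP++ itself is defined in the linked repo; as a hedged reminder of the flavor of rule it builds on (this is generic, textbook μP-style width scaling, not the μP++ recipe), the learning rate for hidden weight matrices is shrunk as the model gets wider, so hyperparameters tuned on a small proxy transfer to the large model:

```python
def mup_style_param_groups(model, base_lr, base_width, width):
    """Generic muP-flavored Adam parameter groups (illustration only; not muP++).

    Hidden weight matrices get lr scaled by base_width / width; embeddings,
    norms, and biases keep the base lr. The name-based split below is a crude
    stand-in for a proper parameter classification.
    """
    scale = base_width / width
    hidden, other = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr * scale},  # width-scaled lr for hidden matrices
        {"params": other, "lr": base_lr},           # base lr for everything else
    ]
```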

Dinghuai Zhang 张鼎怀 (@zdhnarsil)'s Twitter Profile Photo

Your verl & vllm setup is secretly giving you off-policy rollouts (when using quantized rollout), and you should treat it as an off-policy problem! How? As a probabilistic guy, we should say (truncated) importance sampling 😀 Check Feng Yao's tweet here!
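
The correction the tweet points at can be written down in a few lines. As a generic sketch (this is not verl's API or Feng Yao's exact proposal; the tensor names and clip value are illustrative): weight each sampled token by the ratio of the training policy to the quantized rollout policy, truncated to bound the variance.

```python
import torch

def truncated_is_pg_loss(logp_train, logp_rollout, advantages, ratio_clip=2.0):
    """Off-policy REINFORCE with truncated importance sampling.

    logp_train:   log-probs of the sampled tokens under the policy being updated.
    logp_rollout: log-probs under the policy that generated the rollout
                  (e.g. the quantized inference engine), the source of the
                  off-policy mismatch.
    """
    w = torch.exp(logp_train - logp_rollout).detach()  # importance ratio pi_train / pi_rollout
    w = torch.clamp(w, max=ratio_clip)                 # truncation: bounded variance, some bias
    return -(w * advantages * logp_train).mean()
```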

ARC Prize (@arcprize)'s Twitter Profile Photo

Analyzing the Hierarchical Reasoning Model by Guan Wang

We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source.

ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%

Our 4 findings:
John Langford (@johnclangford)'s Twitter Profile Photo

A new Dion draft arxiv.org/pdf/2504.05295 with a more comprehensive study of use and variations. (Code github.com/microsoft/dion/ ) A new Belief State Transformer draft arxiv.org/pdf/2410.23506 with variations for tractability at somewhat larger scale. (Code github.com/microsoft/BST)