Zheng Zhan (@zhengzhan13)'s Twitter Profile
Zheng Zhan

@zhengzhan13

Researcher @MSFTResearch

ID: 1498189775005749248

https://zhanzheng8585.github.io/ · Joined 28-02-2022 06:54:28

28 Tweets

30 Followers

62 Following

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

Is RoPE necessary? Can sigmoid attention outperform softmax? Can we design PE for seamless length extrapolation? Join ASAP seminar 03—Shawn Tan & Yikang Shen present Stick-Breaking Attention (openreview.net/forum?id=r8J3D…), a new RoPE-free sigmoid attention with strong extrapolation!
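
The stick-breaking construction behind this talk can be sketched in a few lines. Hedged: this is a naive O(T^2) reference of the idea as described in the abstract, not the paper's efficient kernel, and details such as whether a token attends to itself are my simplification. Each query assigns a sigmoid "bite" to every earlier key, and the attention weight is that bite times the unbroken remainder of the stick left by the keys closer to the query, which builds in a recency bias without RoPE.

```python
import torch
import torch.nn.functional as F

def stick_breaking_attention(q, k, v):
    """Naive sketch of stick-breaking (sigmoid) attention.

    q, k, v: [batch, heads, T, d]. No positional encoding is used; recency
    bias comes from the stick-breaking product itself. O(T^2) illustration.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-1, -2)) / d**0.5            # [B, H, T, T]
    log_beta = F.logsigmoid(logits)                        # log sigma(z_ij)
    log_rest = F.logsigmoid(-logits)                       # log (1 - sigma(z_ij))

    T = logits.shape[-1]
    causal = torch.arange(T).view(1, -1) < torch.arange(T).view(-1, 1)  # keys j < query i

    # For query i and key j: sum log(1 - beta_{i,k}) over keys j < k < i,
    # i.e. the stick left over after the more recent keys took their bites.
    rest = log_rest.masked_fill(~causal, 0.0)
    after_j = rest.flip(-1).cumsum(-1).flip(-1) - rest     # sum over k > j

    attn = torch.exp(log_beta + after_j).masked_fill(~causal, 0.0)
    return attn @ v                                        # weights sum to <= 1 per query
```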

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

There is a fundamental tradeoff in parallelism vs expressiveness. Linear RNNs (Mamba/GLA) lack "true" recurrence, limiting them to TC0 complexity. ASAP 04 (tomorrow 1:30pm EST) will explore *scalable* linear RNNs that enhance state tracking while maintaining parallelism!
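
A quick way to see the tradeoff the tweet describes (a hedged sketch, not the seminar's content): a diagonal linear recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so it can be evaluated with a parallel scan, whereas a nonlinear recurrence such as h_t = tanh(W h_{t-1} + x_t) has no such composition and must run step by step.

```python
import torch

def linear_rnn_reference(a, b):
    """Sequential reference for h_t = a_t * h_{t-1} + b_t, with a, b: [T, d]."""
    h, out = torch.zeros_like(b[0]), []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return torch.stack(out)

def compose(f, g):
    """Associative composition of two updates (a, b): apply g first, then f."""
    return (f[0] * g[0], f[0] * g[1] + f[1])

# Fusing steps 0 and 1 into one update matches running them in order,
# which is exactly the property a parallel (Blelloch-style) scan exploits.
T, d = 8, 4
a, b = torch.rand(T, d), torch.randn(T, d)
A, B = compose((a[1], b[1]), (a[0], b[0]))
assert torch.allclose(A * torch.zeros(d) + B, linear_rnn_reference(a, b)[1])
```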

John Langford (@johnclangford)'s Twitter Profile Photo

The Belief State Transformer edwardshu.com/bst-website/ is at ICLR this week. The BST objective efficiently creates compact belief states: summaries of the past sufficient for all future predictions. See the short talk: microsoft.com/en-us/research… and mgostIH for further discussion.
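
The one-line summary in the tweet (a belief state is a compact summary of the past sufficient for all future predictions) can be sketched as a training objective: a forward encoder reads the prefix, a backward encoder reads the suffix, and a joint head predicts both the next token after the prefix and the previous token before the suffix. The module names below are placeholders and this is only a rough reading of the setup; the paper and linked talk give the actual objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefStateSketch(nn.Module):
    """Rough BST-style objective sketch (placeholder modules, not the official code)."""

    def __init__(self, vocab, dim, fwd_encoder, bwd_encoder):
        super().__init__()
        self.fwd, self.bwd = fwd_encoder, bwd_encoder   # e.g. two causal transformers -> [B, T, dim]
        self.head = nn.Linear(2 * dim, 2 * vocab)       # joint (next, previous) token logits
        self.vocab = vocab

    def loss(self, prefix, suffix, next_tok, prev_tok):
        f = self.fwd(prefix)[:, -1]          # belief state: summary of the past
        b = self.bwd(suffix.flip(1))[:, -1]  # summary of the future, read backwards
        nxt, prv = self.head(torch.cat([f, b], dim=-1)).split(self.vocab, dim=-1)
        return F.cross_entropy(nxt, next_tok) + F.cross_entropy(prv, prev_tok)
```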

Andrej Karpathy (@karpathy)'s Twitter Profile Photo

There's a new paper circulating looking in detail at the LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

Dimitris Papailiopoulos (@dimitrispapail)'s Twitter Profile Photo

We’ve been cooking... a new open weights 14B Phi-4 reasoning model, SFT’d on ~1.4M carefully curated reasoning demonstrations from o3-mini and RL’d for a tiny bit. This model is a little beast.

Songlin Yang (@songlinyang4)'s Twitter Profile Photo

📢 (1/16) Introducing PaTH 🛣️ — a RoPE-free contextualized position encoding scheme, built for stronger state tracking, better extrapolation, and hardware-efficient training. PaTH outperforms RoPE across short and long language modeling benchmarks arxiv.org/abs/2505.16381

Infini-AI-Lab (@infiniailab)'s Twitter Profile Photo

🔥 We introduce Multiverse, a new generative modeling framework for adaptive and lossless parallel generation. 🚀 Multiverse is the first open-source non-AR model to achieve AIME24 and AIME25 scores of 54% and 46% 🌐 Website: multiverse4fm.github.io 🧵 1/n

Microsoft Azure (@azure)'s Twitter Profile Photo

Meet Phi-4-mini-flash-reasoning: a fast, low-latency SLM built for scale with its novel SambaY architecture. Available on Azure AI Foundry and Hugging Face. Experience advanced reasoning capabilities here: msft.it/6018SAmHn

Liliang Ren (@liliang_ren)'s Twitter Profile Photo

Reasoning can be made much, much faster—with fundamental changes in neural architecture. 😮 Introducing Phi4-mini-Flash-Reasoning: a 3.8B model that surpasses Phi4-mini-Reasoning on major reasoning tasks (AIME24/25, MATH500, GPQA-D), while delivering up to 10× higher throughput

Rohan Paul (@rohanpaul_ai)'s Twitter Profile Photo


Microsoft just dropped Phi-4-mini-flash-reasoning.

- built on a new hybrid architecture
- 10X higher throughput and a 2 to 3X reduction in latency
- significantly faster inference without sacrificing reasoning performance.

Microsoft swaps most of that heavy work for a lean
Liliang Ren (@liliang_ren)'s Twitter Profile Photo

We’re open-sourcing the pre-training code for Phi4-mini-Flash, our SoTA hybrid model that delivers 10× faster reasoning than Transformers — along with μP++, a suite of simple yet powerful scaling laws for stable large-scale training. 🔗 github.com/microsoft/Arch… (1/4)
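
μP++ itself is defined in the linked repo; as a hedged reminder of the flavor of rule it builds on (this is generic, textbook μP-style width scaling, not the μP++ recipe), the learning rate for hidden weight matrices is shrunk as the model gets wider, so hyperparameters tuned on a small proxy transfer to the large model:

```python
def mup_style_param_groups(model, base_lr, base_width, width):
    """Generic muP-flavored Adam parameter groups (illustration only; not muP++).

    Hidden weight matrices get lr scaled by base_width / width; embeddings,
    norms, and biases keep the base lr. The name-based split below is a crude
    stand-in for a proper parameter classification.
    """
    scale = base_width / width
    hidden, other = [], []
    for name, p in model.named_parameters():
        if p.ndim >= 2 and "embed" not in name:
            hidden.append(p)
        else:
            other.append(p)
    return [
        {"params": hidden, "lr": base_lr * scale},  # width-scaled lr for hidden matrices
        {"params": other, "lr": base_lr},           # base lr for everything else
    ]
```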

Dinghuai Zhang 张鼎怀 (@zdhnarsil)'s Twitter Profile Photo

Your verl & vllm setup is secretly giving you off-policy rollouts (when using quantized rollout), and you should treat it as an off-policy problem! How? As a probabilistic guy, we should say (truncated) importance sampling 😀 Check Feng Yao's tweet here!
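
The correction the tweet points at can be written down in a few lines. As a generic sketch (this is not verl's API or Feng Yao's exact proposal; the tensor names and clip value are illustrative): weight each sampled token by the ratio of the training policy to the quantized rollout policy, truncated to bound the variance.

```python
import torch

def truncated_is_pg_loss(logp_train, logp_rollout, advantages, ratio_clip=2.0):
    """Off-policy REINFORCE with truncated importance sampling.

    logp_train:   log-probs of the sampled tokens under the policy being updated.
    logp_rollout: log-probs under the policy that generated the rollout
                  (e.g. the quantized inference engine), the source of the
                  off-policy mismatch.
    """
    w = torch.exp(logp_train - logp_rollout).detach()  # importance ratio pi_train / pi_rollout
    w = torch.clamp(w, max=ratio_clip)                 # truncation: bounded variance, some bias
    return -(w * advantages * logp_train).mean()
```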

ARC Prize (@arcprize)'s Twitter Profile Photo

Analyzing the Hierarchical Reasoning Model by Guan Wang

We verified scores on hidden tasks, ran ablations, and found that performance comes from an unexpected source.

ARC-AGI Semi Private Scores:
* ARC-AGI-1: 32%
* ARC-AGI-2: 2%

Our 4 findings:
John Langford (@johnclangford)'s Twitter Profile Photo

A new Dion draft arxiv.org/pdf/2504.05295 with a more comprehensive study of use and variations. (Code github.com/microsoft/dion/ ) A new Belief State Transformer draft arxiv.org/pdf/2410.23506 with variations for tractability at somewhat larger scale. (Code github.com/microsoft/BST)