Li Dong (@donglixp) 's Twitter Profile
Li Dong

@donglixp

NLP Researcher at Microsoft Research

ID: 269696571

http://dong.li · Joined 21-03-2011 08:49:12

241 Tweets

4.4K Followers

3.3K Following

Microsoft Research (@msftresearch) 's Twitter Profile Photo

Differential Transformer is a new foundation architecture for LLMs that enhances focus on relevant information while canceling attention noise. It generally outperforms Transformer models w/advantages in long-sequence modeling & hallucination mitigation. msft.it/6010mhpr6

Microsoft Research (@msftresearch) 's Twitter Profile Photo

Microsoft researchers release bitnet.cpp, the official inference framework for 1-bit LLMs like BitNet b1.58. It has optimized kernels for fast, lossless inference on CPUs, achieving impressive speedups on ARM and x86 CPUs and significant energy reductions. msft.it/6016WGy8o
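
The "1-bit" in BitNet b1.58 refers to ternary weights in {-1, 0, +1}, which is what lets an inference engine like bitnet.cpp get away with very cheap kernels. Below is a minimal NumPy sketch of absmean-style ternary quantization; the function names are illustrative and this is not bitnet.cpp's API.

```python
# Illustrative sketch of absmean-style ternary (1.58-bit) weight quantization
# in the spirit of BitNet b1.58; names are placeholders, NOT the bitnet.cpp API.
import numpy as np

def absmean_quantize(W: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a single per-tensor scale."""
    scale = np.abs(W).mean() + eps                  # absmean scaling factor
    W_ternary = np.clip(np.round(W / scale), -1, 1)
    return W_ternary.astype(np.int8), scale

def dequantize(W_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate full-precision matrix for comparison."""
    return W_ternary.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4)).astype(np.float32)
    Wq, s = absmean_quantize(W)
    print(Wq)                                        # ternary weight matrix
    print(np.abs(W - dequantize(Wq, s)).mean())      # mean quantization error
```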

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Cache smarter, not harder: one-time storage trick makes LLMs fly with less memory. 💡

"You Only Cache Once (YOCO) from <a href="/Microsoft/">Microsoft</a> : Decoder-Decoder Architectures for Language Models" ✨

One cache to rule them all: YOCO keeps LLM memory lean by storing key-value pairs just once.
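
The memory win comes from caching key-value pairs once globally instead of once per decoder layer. The back-of-the-envelope sketch below shows that accounting; the layer count, head sizes, and sequence length are made-up defaults for illustration, not the paper's configuration.

```python
# Rough KV-cache memory comparison: a standard decoder keeps one KV cache per
# layer, while a YOCO-style decoder-decoder shares a single global KV cache.
# All configuration numbers below are illustrative, not from the paper.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for K and V over all cached layers, in bytes (fp16 by default)."""
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return n_layers * per_layer

if __name__ == "__main__":
    cfg = dict(seq_len=1_000_000, n_kv_heads=8, head_dim=128)
    standard = kv_cache_bytes(n_layers=32, **cfg)   # every layer keeps its own KV
    yoco_like = kv_cache_bytes(n_layers=1, **cfg)   # one shared global KV cache
    print(f"standard : {standard / 2**30:.1f} GiB")
    print(f"YOCO-like: {yoco_like / 2**30:.1f} GiB (~{standard / yoco_like:.0f}x smaller)")
```
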
PapersAnon (@papers_anon) 's Twitter Profile Photo

Multimodal Latent Language Modeling with Next-Token Diffusion

From MS Research. Employs a variational autoencoder to represent continuous data as latent vectors and introduces next-token diffusion for autoregressive generation of these vectors. Effective across various
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

LatentLM makes multimodal AI speak one language by treating all data types as continuous tokens.

Next-token diffusion: The universal translator between discrete and continuous data in AI models.

LatentLM unifies continuous and discrete data processing in LLMs using causal
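
Putting the two posts above together: continuous inputs are first encoded by a VAE into latent vectors, and "next-token diffusion" means a small diffusion head denoises the next latent conditioned on the sequence model's hidden state, while discrete tokens keep the usual softmax. The sketch below shows only the continuous path, with a GRU standing in for the causal Transformer backbone; every module and size here is a toy placeholder, not the paper's architecture.

```python
# Toy sketch of next-token diffusion for continuous tokens: condition a small
# DDPM-style denoiser on the sequence model's last hidden state to produce the
# next latent vector. Modules, sizes, and the noise schedule are placeholders.
import torch
import torch.nn as nn

D_LATENT, D_MODEL, N_STEPS = 16, 64, 10

# Stand-in for the causal Transformer backbone (a GRU is naturally causal and short).
backbone = nn.GRU(D_LATENT, D_MODEL, batch_first=True)

# Diffusion head: predicts the noise in a noisy latent, given the noisy latent,
# a timestep embedding, and the backbone's hidden state for the current position.
diff_head = nn.Sequential(
    nn.Linear(D_LATENT + 1 + D_MODEL, 128), nn.SiLU(),
    nn.Linear(128, D_LATENT),
)

betas = torch.linspace(1e-4, 0.02, N_STEPS)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def next_latent(prefix_latents: torch.Tensor) -> torch.Tensor:
    """Generate the next continuous latent vector given the latents so far."""
    # 1) Encode the prefix; condition on the hidden state of the last position.
    h = backbone(prefix_latents.unsqueeze(0))[0][0, -1]          # (D_MODEL,)
    # 2) Denoise from Gaussian noise, DDPM-style, conditioned on h.
    x = torch.randn(D_LATENT)
    for t in reversed(range(N_STEPS)):
        t_emb = torch.tensor([t / N_STEPS])
        eps_hat = diff_head(torch.cat([x, t_emb, h]))            # predicted noise
        alpha_t, alpha_bar = 1.0 - betas[t], alphas_cum[t]
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_t)
        if t > 0:                                                # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

prefix = torch.randn(5, D_LATENT)   # e.g. VAE latents of the tokens generated so far
print(next_latent(prefix).shape)    # torch.Size([16])
```
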
Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Long paper thread!

I've held a skeptical opinion of DiffTransformer after skimming it, but after some recent interaction thought that's unfair and I'd give it a proper read. Did so on a recent flight.

Paper TL;DR: pair two attention heads, and do:

 (sm(Q1K1) - λ sm(Q2K2)) V
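
A minimal NumPy sketch of that TL;DR is below: two softmax attention maps from paired query/key projections, subtracted with a weight λ before multiplying by V. Causal masking, the paper's λ reparameterization, and per-head normalization are omitted; shapes are illustrative.

```python
# Minimal sketch of differential attention:
#   (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V
# Causal masking and the paper's extra normalization are omitted for brevity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Subtract two attention maps before applying them to V."""
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 16                      # sequence length, head dimension (illustrative)
    Q1, K1, Q2, K2 = (rng.normal(size=(n, d)) for _ in range(4))
    V = rng.normal(size=(n, d))
    print(diff_attention(Q1, K1, Q2, K2, V).shape)   # (8, 16)
```
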
Chengzu Li (@li_chengzu) 's Twitter Profile Photo

Forget just thinking in words.

🚀 New Era of Multimodal Reasoning🚨
🔍 Imagine While Reasoning in Space with MVoT

Multimodal Visualization-of-Thought (MVoT) revolutionizes reasoning by generating visual "thoughts" that transform how AI thinks, reasons, and explains itself.
Abhishek mishra (@abhishek_mishr2) 's Twitter Profile Photo

GPT-4o's image gen quality, especially with text ✍️, is impressive! But how does it work? Beyond standard diffusion (like Stable Diffusion/DiT), could other architectures be key? Let's explore potential inspiration from recent research, starting with LatentLM. 🧐 #AI #GPT4o (1/6)

Tianzhu Ye ✈️ ICLR Singapore (@ytz2024) 's Twitter Profile Photo

🚨 We have updated our ICLR 2025 Oral paper Differential Transformer with additional results on mathematical reasoning. DIFF consistently outperforms Transformer on o1-style reasoning evaluation, achieving an average accuracy improvement of 7.5% (see Fig. 2). (1/3)

Qingxiu Dong (@qx_dong) 's Twitter Profile Photo

⏰ We introduce Reinforcement Pre-Training (RPT🍒) — reframing next-token prediction as a reasoning task using RLVR

✅ General-purpose reasoning
📑 Scalable RL on web corpus
📈 Stronger pre-training + RLVR results
🚀 Allows allocating more compute to specific tokens
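
The "verifiable" part is what makes this scalable on raw web text: the reward needs no learned judge, only a check of the model's final guess against the corpus's actual next token. Below is a hedged sketch of such a reward function; the PREDICTION: tag and exact-match rule are assumptions for illustration, not the paper's format.

```python
# Sketch of a verifiable reward for RPT-style training: the rollout contains
# some reasoning followed by a final guess, and the reward is 1 only if the
# guess matches the corpus's true next token. Tag format is an assumption.

def next_token_reward(model_output: str, ground_truth_next_token: str) -> float:
    """Return 1.0 if the model's final guess equals the true next token, else 0.0."""
    # Assume the rollout ends with a line like:  PREDICTION: <token>
    guess = model_output.rsplit("PREDICTION:", 1)[-1].strip()
    return 1.0 if guess == ground_truth_next_token.strip() else 0.0

rollout = (
    "The sentence so far is about fruit falling from a tree, so a noun fits...\n"
    "PREDICTION: apple"
)
print(next_token_reward(rollout, "apple"))   # 1.0
print(next_token_reward(rollout, "banana"))  # 0.0
```
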
alphaXiv (@askalphaxiv) 's Twitter Profile Photo

New training paradigm: instead of just predicting tokens, models reason about each prediction using RL

The model thinks through context, considers alternatives, then makes a prediction.

14B model matches 32B baseline, though training costs are significantly higher.
elvis (@omarsar0) 's Twitter Profile Photo

Reinforcement Pre-Training

New pre-training paradigm for LLMs just landed on arXiv!

It incentivises effective next-token reasoning with RL.

This unlocks richer reasoning capabilities using only raw text and intrinsic RL signals.

A must-read! Bookmark it!

Here are my notes:
DailyPapers (@huggingpapers) 's Twitter Profile Photo

Microsoft just dropped VibeVoice on Hugging Face.

A novel framework generating expressive, long-form, multi-speaker conversational audio like podcasts from text.

Synthesizes up to 90 minutes of speech with up to 4 distinct speakers! huggingface.co/microsoft/Vibe…

Vaibhav (VB) Srivastav (@reach_vb) 's Twitter Profile Photo

Microsoft just released VibeVoice - 1.5B SoTA Text to Speech model - MIT Licensed 🔥

> It can generate up to 90 minutes of audio
> Supports simultaneous generation of 4 speakers
> Streaming and larger 7B model incoming
> Capable of cross-lingual and singing synthesis

Love the