Li Dong (@donglixp) 's Twitter Profile
Li Dong

@donglixp

NLP Researcher at Microsoft Research

ID: 269696571

http://dong.li · Joined 21-03-2011 08:49:12

241 Tweets

4.4K Followers

3.3K Following

Microsoft Research (@msftresearch) 's Twitter Profile Photo

Differential Transformer is a new foundation architecture for LLMs that enhances focus on relevant information while canceling attention noise. It generally outperforms Transformer models w/advantages in long-sequence modeling & hallucination mitigation. msft.it/6010mhpr6

Microsoft Research (@msftresearch) 's Twitter Profile Photo

Microsoft researchers release bitnet.cpp, the official inference framework for 1-bit LLMs like BitNet b1.58. It has optimized kernels for fast, lossless inference on CPUs, achieving impressive speedups on ARM and x86 CPUs and significant energy reductions. msft.it/6016WGy8o
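
The "1-bit" in BitNet b1.58 refers to ternary weights in {-1, 0, +1}, which is what lets an inference engine like bitnet.cpp get away with very cheap kernels. Below is a minimal NumPy sketch of absmean-style ternary quantization; the function names are illustrative and this is not bitnet.cpp's API.

```python
# Illustrative sketch of absmean-style ternary (1.58-bit) weight quantization
# in the spirit of BitNet b1.58; names are placeholders, NOT the bitnet.cpp API.
import numpy as np

def absmean_quantize(W: np.ndarray, eps: float = 1e-5):
    """Quantize a weight matrix to {-1, 0, +1} with a single per-tensor scale."""
    scale = np.abs(W).mean() + eps                  # absmean scaling factor
    W_ternary = np.clip(np.round(W / scale), -1, 1)
    return W_ternary.astype(np.int8), scale

def dequantize(W_ternary: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate full-precision matrix for comparison."""
    return W_ternary.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4)).astype(np.float32)
    Wq, s = absmean_quantize(W)
    print(Wq)                                        # ternary weight matrix
    print(np.abs(W - dequantize(Wq, s)).mean())      # mean quantization error
```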

Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Cache smarter, not harder: one-time storage trick makes LLMs fly with less memory. 💡

"You Only Cache Once (YOCO) from <a href="/Microsoft/">Microsoft</a> : Decoder-Decoder Architectures for Language Models" ✨

One cache to rule them all: YOCO keeps LLM memory lean by storing key-value pairs just once.
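
The memory win comes from caching key-value pairs once globally instead of once per decoder layer. The back-of-the-envelope sketch below shows that accounting; the layer count, head sizes, and sequence length are made-up defaults for illustration, not the paper's configuration.

```python
# Rough KV-cache memory comparison: a standard decoder keeps one KV cache per
# layer, while a YOCO-style decoder-decoder shares a single global KV cache.
# All configuration numbers below are illustrative, not from the paper.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Memory for K and V over all cached layers, in bytes (fp16 by default)."""
    per_layer = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem
    return n_layers * per_layer

if __name__ == "__main__":
    cfg = dict(seq_len=1_000_000, n_kv_heads=8, head_dim=128)
    standard = kv_cache_bytes(n_layers=32, **cfg)   # every layer keeps its own KV
    yoco_like = kv_cache_bytes(n_layers=1, **cfg)   # one shared global KV cache
    print(f"standard : {standard / 2**30:.1f} GiB")
    print(f"YOCO-like: {yoco_like / 2**30:.1f} GiB (~{standard / yoco_like:.0f}x smaller)")
```
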
PapersAnon (@papers_anon) 's Twitter Profile Photo

Multimodal Latent Language Modeling with Next-Token Diffusion

From MS Research. Employs a variational autoencoder to represent continuous data as latent vectors and introduces next-token diffusion for autoregressive generation of these vectors. Effective across various
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

LatentLM makes multimodal AI speak one language by treating all data types as continuous tokens.

Next-token diffusion: The universal translator between discrete and continuous data in AI models.

LatentLM unifies continuous and discrete data processing in LLMs using causal
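
Putting the two posts above together: continuous inputs are first encoded by a VAE into latent vectors, and "next-token diffusion" means a small diffusion head denoises the next latent conditioned on the sequence model's hidden state, while discrete tokens keep the usual softmax. The sketch below shows only the continuous path, with a GRU standing in for the causal Transformer backbone; every module and size here is a toy placeholder, not the paper's architecture.

```python
# Toy sketch of next-token diffusion for continuous tokens: condition a small
# DDPM-style denoiser on the sequence model's last hidden state to produce the
# next latent vector. Modules, sizes, and the noise schedule are placeholders.
import torch
import torch.nn as nn

D_LATENT, D_MODEL, N_STEPS = 16, 64, 10

# Stand-in for the causal Transformer backbone (a GRU is naturally causal and short).
backbone = nn.GRU(D_LATENT, D_MODEL, batch_first=True)

# Diffusion head: predicts the noise in a noisy latent, given the noisy latent,
# a timestep embedding, and the backbone's hidden state for the current position.
diff_head = nn.Sequential(
    nn.Linear(D_LATENT + 1 + D_MODEL, 128), nn.SiLU(),
    nn.Linear(128, D_LATENT),
)

betas = torch.linspace(1e-4, 0.02, N_STEPS)
alphas_cum = torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def next_latent(prefix_latents: torch.Tensor) -> torch.Tensor:
    """Generate the next continuous latent vector given the latents so far."""
    # 1) Encode the prefix; condition on the hidden state of the last position.
    h = backbone(prefix_latents.unsqueeze(0))[0][0, -1]          # (D_MODEL,)
    # 2) Denoise from Gaussian noise, DDPM-style, conditioned on h.
    x = torch.randn(D_LATENT)
    for t in reversed(range(N_STEPS)):
        t_emb = torch.tensor([t / N_STEPS])
        eps_hat = diff_head(torch.cat([x, t_emb, h]))            # predicted noise
        alpha_t, alpha_bar = 1.0 - betas[t], alphas_cum[t]
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_t)
        if t > 0:                                                # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x

prefix = torch.randn(5, D_LATENT)   # e.g. VAE latents of the tokens generated so far
print(next_latent(prefix).shape)    # torch.Size([16])
```
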
Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Long paper thread!

I've held a skeptical opinion of DiffTransformer after skimming it, but after some recent interaction thought that's unfair and I'd give it a proper read. Did so on a recent flight.

Paper TL;DR: pair two attention heads, and do:

 (sm(Q1K1) - λ sm(Q2K2)) V
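
A minimal NumPy sketch of that TL;DR is below: two softmax attention maps from paired query/key projections, subtracted with a weight λ before multiplying by V. Causal masking, the paper's λ reparameterization, and per-head normalization are omitted; shapes are illustrative.

```python
# Minimal sketch of differential attention:
#   (softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V
# Causal masking and the paper's extra normalization are omitted for brevity.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam=0.5):
    """Subtract two attention maps before applying them to V."""
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 16                      # sequence length, head dimension (illustrative)
    Q1, K1, Q2, K2 = (rng.normal(size=(n, d)) for _ in range(4))
    V = rng.normal(size=(n, d))
    print(diff_attention(Q1, K1, Q2, K2, V).shape)   # (8, 16)
```
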
Chengzu Li (@li_chengzu) 's Twitter Profile Photo

Forget just thinking in words.

🚀 New Era of Multimodal Reasoning🚨
🔍 Imagine While Reasoning in Space with MVoT

Multimodal Visualization-of-Thought (MVoT) revolutionizes reasoning by generating visual "thoughts" that transform how AI thinks, reasons, and explains itself.
Abhishek mishra (@abhishek_mishr2) 's Twitter Profile Photo

GPT-4o's image gen quality, especially with text ✍️, is impressive! But how does it work? Beyond standard diffusion (like Stable Diffusion/DiT), could other architectures be key? Let's explore potential inspiration from recent research, starting with LatentLM. 🧐 #AI #GPT4o (1/6)

Tianzhu Ye ✈️ ICLR Singapore (@ytz2024) 's Twitter Profile Photo

🚨 We have updated our ICLR 2025 Oral paper Differential Transformer with additional results on mathematical reasoning. DIFF consistently outperforms Transformer on o1-style reasoning evaluation, achieving an average accuracy improvement of 7.5% (see Fig. 2). (1/3)

Qingxiu Dong (@qx_dong) 's Twitter Profile Photo

⏰ We introduce Reinforcement Pre-Training (RPT🍒) — reframing next-token prediction as a reasoning task using RLVR

✅ General-purpose reasoning
📑 Scalable RL on web corpus
📈 Stronger pre-training + RLVR results
🚀 Allows allocating more compute to specific tokens
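
The "verifiable" part is what makes this scalable on raw web text: the reward needs no learned judge, only a check of the model's final guess against the corpus's actual next token. Below is a hedged sketch of such a reward function; the PREDICTION: tag and exact-match rule are assumptions for illustration, not the paper's format.

```python
# Sketch of a verifiable reward for RPT-style training: the rollout contains
# some reasoning followed by a final guess, and the reward is 1 only if the
# guess matches the corpus's true next token. Tag format is an assumption.

def next_token_reward(model_output: str, ground_truth_next_token: str) -> float:
    """Return 1.0 if the model's final guess equals the true next token, else 0.0."""
    # Assume the rollout ends with a line like:  PREDICTION: <token>
    guess = model_output.rsplit("PREDICTION:", 1)[-1].strip()
    return 1.0 if guess == ground_truth_next_token.strip() else 0.0

rollout = (
    "The sentence so far is about fruit falling from a tree, so a noun fits...\n"
    "PREDICTION: apple"
)
print(next_token_reward(rollout, "apple"))   # 1.0
print(next_token_reward(rollout, "banana"))  # 0.0
```
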
alphaXiv (@askalphaxiv) 's Twitter Profile Photo

New training paradigm: instead of just predicting tokens, models reason about each prediction using RL

The model thinks through context, considers alternatives, then makes a prediction.

14B model matches 32B baseline, though training costs are significantly higher.
elvis (@omarsar0) 's Twitter Profile Photo

Reinforcement Pre-Training

New pre-training paradigm for LLMs just landed on arXiv!

It incentivises effective next-token reasoning with RL.

This unlocks richer reasoning capabilities using only raw text and intrinsic RL signals.

A must-read! Bookmark it!

Here are my notes:
DailyPapers (@huggingpapers) 's Twitter Profile Photo

Microsoft just dropped VibeVoice on Hugging Face.

A novel framework generating expressive, long-form, multi-speaker conversational audio like podcasts from text.

Synthesizes up to 90 minutes of speech with up to 4 distinct speakers! huggingface.co/microsoft/Vibe…

Vaibhav (VB) Srivastav (@reach_vb) 's Twitter Profile Photo

Microsoft just released VibeVoice - 1.5B SoTA Text to Speech model - MIT Licensed 🔥

> It can generate up to 90 minutes of audio
> Supports simultaneous generation of 4 speakers
> Streaming and larger 7B model incoming
> Capable of cross-lingual and singing synthesis

Love the