Sangmin Bae (@raymin0223) 's Twitter Profile
Sangmin Bae

@raymin0223

PhD Student @kaist_ai | Prev-Intern @GoogleDeepMind, Kakao | LLM Inference Acceleration, Foundation Model Training, Multimodal Learning

ID: 1454885436896133125

Link: http://www.raymin0223.com | Joined: 31-10-2021 18:58:28

310 Tweets

1.1K Followers

966 Following

Pavlo Molchanov (@pavlomolchanov) 's Twitter Profile Photo

New efficient Hybrid LLMs from @NVIDIA: Nemotron-H! Introducing a family of models combining Mamba-2, Self-Attention & FFNs for 8B, 47B and 56B sizes.

• 3x faster and 1.5x smaller 47B model is on par with Qwen-72B and Llama-70B
• 1.8x faster Hybrid 8B than transformers
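
As a rough illustration of how a hybrid Mamba-2 / self-attention / FFN stack like the one described in the bullets above can be wired, here is a minimal PyTorch sketch that assembles blocks from a pattern string. The pattern, dimensions, and the FFN fallback used when no Mamba-2 implementation is supplied are illustrative assumptions, not Nemotron-H's published configuration.

```python
# Toy hybrid stack: the mixer type per layer follows a pattern string.
# "M" = Mamba-2-style block (supplied externally), "A" = self-attention, "F" = FFN.
import torch
import torch.nn as nn

class FFNBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                 nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return x + self.net(x)

class AttnBlock(nn.Module):
    def __init__(self, d, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)  # no causal mask; sketch only
        return x + out

class HybridLM(nn.Module):
    def __init__(self, d=512, pattern="MFMFAF", repeats=2, mamba_block=None):
        super().__init__()
        blocks = []
        for _ in range(repeats):
            for kind in pattern:
                if kind == "A":
                    blocks.append(AttnBlock(d))
                elif kind == "M" and mamba_block is not None:
                    blocks.append(mamba_block(d))   # e.g. a Mamba-2 layer, if one is available
                else:
                    blocks.append(FFNBlock(d))      # fallback when no Mamba-2 impl is given
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

print(HybridLM()(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```
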
Weijia Shi (@weijiashi2) 's Twitter Profile Photo

Our previous work showed that 𝐜𝐫𝐞𝐚𝐭𝐢𝐧𝐠 𝐯𝐢𝐬𝐮𝐚𝐥 𝐜𝐡𝐚𝐢𝐧‑𝐨𝐟‑𝐭𝐡𝐨𝐮𝐠𝐡𝐭𝐬 𝐯𝐢𝐚 𝐭𝐨𝐨𝐥 𝐮𝐬𝐞 significantly boosts GPT‑4o’s visual reasoning performance. Excited to see this idea incorporated into OpenAI’s o3 and o4‑mini models (openai.com/index/thinking…).

Google DeepMind (@googledeepmind) 's Twitter Profile Photo

Gemini 2.5 Flash just dropped. ⚡ As a hybrid reasoning model, you can control how much it ‘thinks’ depending on your 💰 - making it ideal for tasks like building chat apps, extracting data and more. Try an early version in Google AI Studio → ai.dev

The AI Timeline (@theaitimeline) 's Twitter Profile Photo

🚨This week's top AI/ML research papers:

- BitNet b1.58 2B4T Technical Report
- Reasoning Models Can Be Effective Without Thinking
- ReTool
- Sleep-time Compute
- Nemotron-H
- Kimina-Prover Preview
- CLIMB
- Dynamic Cheatsheet
- How new data permeates LLM knowledge and how to
Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

Dynamic Early Exit in Reasoning Models

- Allows LLMs to self-truncate CoT sequences by dynamic early exit
- Reduces the CoT length by ~35% while improving accuracy by 1% - 10%.
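
As a rough sketch of the self-truncation idea in the two bullets above, the loop below probes, every few tokens, how much probability the model places on an end-of-thinking marker and closes the chain of thought once that passes a threshold. The `model.next_token_logits` helper, the probe interval, and the threshold rule are hypothetical stand-ins, not the paper's actual exit criterion.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def generate_with_early_exit(model, prompt_ids, eot_id, max_cot_tokens=2048,
                             probe_every=64, exit_threshold=0.9):
    """Greedy CoT decoding that can self-truncate (hypothetical model API)."""
    ids = list(prompt_ids)
    for step in range(max_cot_tokens):
        logits = np.asarray(model.next_token_logits(ids), dtype=float)  # (vocab,) scores
        next_id = int(np.argmax(logits))
        ids.append(next_id)
        if next_id == eot_id:                       # model ended its own chain of thought
            break
        if (step + 1) % probe_every == 0:
            probs = softmax(logits)
            if probs[eot_id] >= exit_threshold:     # confident enough: exit early
                ids.append(eot_id)                  # self-truncate the CoT
                break
    return ids
```
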
Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Tina: Tiny Reasoning Models via LoRA

"the best Tina model achieves a >20% reasoning performance increase and 43.33% Pass@1 accuracy on AIME24, at only $9 USD post-training and evaluation cost (i.e., an estimated 260x cost reduction). Our work reveals the surprising effectiveness
Sangmin Bae (@raymin0223) 's Twitter Profile Photo

Sharing our poster at ICLR 2025! Please stop by the poster if you're interested 😊

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

📍 Hall 3 + Hall 2B, Poster #262
🗓️ Saturday, April 26, 10:00 a.m. — 12:30 p.m.
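
Since the title names the core mechanism, here is a hedged toy of the idea behind it: one block's weights are reused at every loop depth, and a small depth-specific low-rank (LoRA-style) delta relaxes the strict parameter tying. The single linear mixer, rank, and sizes are illustrative simplifications, not the architecture in the paper.

```python
# Toy "recursive" stack: the same shared weights are looped over depth,
# with a per-depth low-rank delta relaxing the parameter sharing.
import torch
import torch.nn as nn

class SharedLinearWithLoRA(nn.Module):
    def __init__(self, d, num_loops, rank=8):
        super().__init__()
        self.shared = nn.Linear(d, d)   # tied across all depths
        self.lora_A = nn.ParameterList(nn.Parameter(0.01 * torch.randn(rank, d)) for _ in range(num_loops))
        self.lora_B = nn.ParameterList(nn.Parameter(torch.zeros(d, rank)) for _ in range(num_loops))

    def forward(self, x, depth):
        delta = x @ self.lora_A[depth].T @ self.lora_B[depth].T  # depth-specific low-rank update
        return self.shared(x) + delta

class RecursiveStack(nn.Module):
    def __init__(self, d=256, num_loops=4):
        super().__init__()
        self.block = SharedLinearWithLoRA(d, num_loops)
        self.num_loops = num_loops

    def forward(self, x):
        for depth in range(self.num_loops):   # same block reused; only the LoRA delta varies
            x = x + torch.relu(self.block(x, depth))
        return x

print(RecursiveStack()(torch.randn(2, 10, 256)).shape)  # torch.Size([2, 10, 256])
```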

Reza Bayat (@reza_byt) 's Twitter Profile Photo

Mohammad, my amazing co-author, will present our work today at ICLR (#304 at 3 p.m.). Don't miss it; you can learn far more about fundamental research from him than from any paper. I was really fortunate to work with him and to experience a glimpse of the fundamental research

Google DeepMind (@googledeepmind) 's Twitter Profile Photo

We’ve developed Gemini Diffusion: our state-of-the-art text diffusion model. Instead of predicting text directly, it learns to generate outputs by refining noise, step-by-step. This helps it excel at coding and math, where it can iterate over solutions quickly. #GoogleIO
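
For a concrete sense of what "refining noise, step-by-step" can mean for text, here is a generic masked-diffusion-style decoding loop: start from an all-mask sequence, predict every slot, commit only the most confident ones, and repeat on an unmasking schedule. This is a common pattern for discrete text diffusion, offered purely as a sketch; the `denoiser` callable is hypothetical, and nothing here reflects Gemini Diffusion's actual procedure.

```python
# Generic masked-diffusion text decoding: iteratively unmask the most
# confident positions over a fixed number of refinement steps.
import numpy as np

def masked_diffusion_decode(denoiser, length, mask_id, steps=8):
    seq = np.full(length, mask_id)                       # "noise": everything masked
    for t in range(steps, 0, -1):
        probs = denoiser(seq)                            # (length, vocab) per-slot distributions
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        keep = int(round(length * (steps - t + 1) / steps))  # how many slots to commit this round
        top = np.argsort(-conf)[:keep]                   # most confident positions win
        seq = np.full(length, mask_id)
        seq[top] = pred[top]                             # commit those, re-mask the rest
    return seq
```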

Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

dKV-Cache: The Cache for Diffusion Language Models

- Achieves a 2-10× speedup in inference, largely narrowing the gap between ARs and DLMs
- Can be used for diffusion LMs as well
机器之心 JIQIZHIXIN (@synced_global) 's Twitter Profile Photo

🔥 State-Space Meets Diffusion: A New Era for Video World Models 

This paper tackles a core bottleneck in video-based world modeling: long-term memory. While video diffusion models are great at short-term frame prediction, their memory fades fast—especially when modeling long
Ali Behrouz (@behrouz_ali) 's Twitter Profile Photo

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
Zhihan Yang (@zhihanyangzy) 's Twitter Profile Photo

📢Thrilled to share our new paper: Esoteric Language Models (Eso-LMs)

> 🔀Fuses autoregressive (AR) and masked diffusion (MDM) paradigms
> 🚀First to unlock KV caching for MDMs (65x speedup!)
> 🥇Sets new SOTA on generation speed-vs-quality Pareto frontier

How? Dive in👇
Han Guo (@hanguo97) 's Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
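
To make the "between linear and quadratic" point above concrete, here is a toy numpy version of the structural idea: cover each causal prefix with O(log n) power-of-two (Fenwick-style) buckets, keep one linear-attention summary state per bucket, and let each query read only those summaries. The actual method adds learned per-level weighting and hardware-efficient chunked kernels; this sketch recomputes bucket states for clarity instead of maintaining them incrementally.

```python
import numpy as np

def fenwick_buckets(t):
    """Power-of-two blocks that exactly cover positions [0, t) (Fenwick-style)."""
    blocks, start = [], 0
    while start < t:
        size = 1
        while start + size * 2 <= t and start % (size * 2) == 0:
            size *= 2
        blocks.append((start, start + size))
        start += size
    return blocks

def log_linear_attention(q, k, v):
    """Toy causal variant: position t reads one linear-attention state per bucket.
    q, k, v: (T, d) arrays; each query mixes O(log T) bucket summaries instead of
    attending to every past token individually."""
    T, d = q.shape
    out = np.zeros_like(v)
    for t in range(1, T + 1):
        acc = np.zeros(d)
        for lo, hi in fenwick_buckets(t):
            state = k[lo:hi].T @ v[lo:hi]      # (d, d) summary state of that bucket
            acc += q[t - 1] @ state            # query reads the O(log t) summaries
        out[t - 1] = acc
    return out

print(log_linear_attention(*np.random.randn(3, 16, 8)).shape)  # (16, 8)
```
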
Dongmin Park @ iclr25 (@dongmin_park11) 's Twitter Profile Photo

🚨New Paper Alert

As a game company, KRAFTON AI is actively exploring how to apply LLM agents to video games.

We present Orak—a foundational video gaming benchmark for LLM agents!

Includes Pokémon, StarCraft II, Slay the Spire, Darkest Dungeon, Ace Attorney, and more in🧵
Tri Dao (@tri_dao) 's Twitter Profile Photo

State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical

Avi Chawla (@_avichawla) 's Twitter Profile Photo

ML researchers just built a new ensemble technique. It even outperforms XGBoost, CatBoost, and LightGBM. Here's a complete breakdown (explained visually):

Sundar Pichai (@sundarpichai) 's Twitter Profile Photo

Gemini 2.5 Pro + 2.5 Flash are now stable and generally available. Plus, get a preview of Gemini 2.5 Flash-Lite, our fastest + most cost-efficient 2.5 model yet. 🔦

Exciting steps as we expand our 2.5 series of hybrid reasoning models that deliver amazing performance at the
DailyPapers (@huggingpapers) 's Twitter Profile Photo

Discrete Diffusion in Large Language and Multimodal Models: A Survey just released on Hugging Face

Get an overview of research in discrete diffusion LLMs and MLLMs, which achieve performance comparable to autoregressive models with up to 10x faster inference!