Albert Gu (@_albertgu)'s Twitter Profile
Albert Gu

@_albertgu

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

ID: 1076265378118959104

Joined: 21-12-2018 23:57:16

362 Tweets

14.14K Followers

88 Following

Vaibhav (VB) Srivastav (@reach_vb)

Falcon 3 is out! 1B, 3B, 7B, 10B (Base + Instruct) & 7B Mamba, trained on 14 Trillion tokens and apache 2.0 licensed! 🔥

> 1B-Base surpasses SmolLM2-1.7B and matches gemma-2-2b
> 3B-Base outperforms larger models like Llama-3.1-8B and Minitron-4B-Base
> 7B-Base is on par with
Albert Gu (@_albertgu)

new Mamba-2 model release! strong performance, fast inference, and most importantly strong commitment to transparency and reproducibility to benefit the community 🚀 lots of interesting findings in the blog post :)

Beidi Chen (@beidichen)

🐷 MagicPig was developed during our efforts to create challenging reasoning tasks that showcase the true potential of long-context models—tasks that cannot be solved through simple retrieval. In addition to tackling long-context closed/open LLMs (🔥 more on this coming soon), we

Karan Goel (@krandiash)

A few interesting challenges in extending context windows. A model with a big prompt =/= "infinite context" in my mind. 10M tokens of context is not exactly on the path to infinite context.

Instead, it requires a streaming model that has
- an efficient state with fast

Alex Wang (@heyyalexwang)

did you know you've been doing test-time learning this whole time?

transformers, SSMs, RNNs, are all test-time regressors but with different design choices

we present a unifying framework that derives sequence layers (and higher-order attention👀) from a *single* equation

🧵
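
To make the "test-time regression" framing above concrete, here is a minimal sketch (my own illustration, not the thread's exact formulation): unnormalized causal linear attention can be read as an online regression in which a memory matrix is updated with each key-value pair and then queried. The function name and dimensions below are made up for the example.

```python
import numpy as np

def linear_attention_as_regression(Q, K, V, lr=1.0):
    """Unnormalized causal linear attention viewed as test-time regression:
    a linear map M from keys to values is updated online with each pair and
    then read out, i.e. M_t = M_{t-1} + lr * v_t k_t^T and y_t = M_t q_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    M = np.zeros((d_v, d_k))                # the test-time "regressor" / associative memory
    Y = np.zeros((T, d_v))
    for t in range(T):
        M = M + lr * np.outer(V[t], K[t])   # update the memory with the pair (k_t, v_t)
        Y[t] = M @ Q[t]                     # read the memory with the query q_t
    return Y

# Sanity check: this matches causal, unnormalized linear attention.
rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))
reference = np.stack([(Q[t] @ K[:t + 1].T) @ V[:t + 1] for t in range(T)])
assert np.allclose(linear_attention_as_regression(Q, K, V), reference)
```
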
Tri Dao (@tri_dao)

I've been excited about this for a while: a simple architectural change to the residual connection that allows arbitrary overlapping of computation of one layer and the communication of another layer, leading to ~30% speedup in TP! More on MoE and expert parallel to come soon!
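
A hedged sketch of the idea described above, reconstructed from the tweet alone (not the actual implementation): if each layer reads the residual stream before the previous layer's output has been added, the previous layer's tensor-parallel all-reduce can run while the current layer computes, and its result is folded in one step later. The thread-pool "all-reduce" below is a toy stand-in for something like torch.distributed.all_reduce(..., async_op=True).

```python
from concurrent.futures import ThreadPoolExecutor
import time
import numpy as np

pool = ThreadPoolExecutor(max_workers=1)

def all_reduce_async(x):
    """Simulated communication: returns a future that resolves to the 'reduced' tensor."""
    def comm():
        time.sleep(0.01)   # pretend network latency
        return x           # only one "rank" here, so the reduction is the identity
    return pool.submit(comm)

def layer(x, W):
    return np.tanh(x @ W)  # stand-in for one block's local (tensor-parallel) compute

def forward_standard(x, weights):
    # Standard residual stream: layer i+1 needs layer i's all-reduced output,
    # so communication and compute are serialized.
    for W in weights:
        x = x + all_reduce_async(layer(x, W)).result()
    return x

def forward_overlapped(x, weights):
    # Modified residual stream: each layer reads the stream *before* the previous
    # layer's still-in-flight output has been added, so comm(i) overlaps with
    # compute(i+1); the reduced output is folded in one step later. Note this is
    # an architectural change, not just a scheduling trick.
    pending = None
    for W in weights:
        y = layer(x, W)                # compute of layer i on the lagged stream
        if pending is not None:
            x = x + pending.result()   # fold in layer i-1's reduced output
        pending = all_reduce_async(y)  # launch comm for layer i; next layer starts now
    return x + pending.result()

rng = np.random.default_rng(0)
x = rng.normal(size=8)
weights = [0.1 * rng.normal(size=(8, 8)) for _ in range(4)]
print(forward_standard(x, weights).shape, forward_overlapped(x, weights).shape)
```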

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

Distilling Llama-1B and -3B models with only 8 billion tokens into subquadratic models like Mamba to achieve better and faster scaling of inference-time compute with minimal performance loss.
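
As a rough illustration of what "distilling" means here (a generic logit-distillation sketch with made-up hyperparameters, not necessarily the paper's exact recipe): the subquadratic student is trained to match the Transformer teacher's next-token distribution, usually mixed with the standard cross-entropy on the ground-truth tokens.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * CE on targets."""
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean() * temperature ** 2
    ce = -np.take_along_axis(log_softmax(student_logits),
                             targets[..., None], axis=-1).mean()
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage: random "logits" over a vocabulary of 10 tokens at 5 positions.
rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(2, 5, 10))
targets = rng.integers(0, 10, size=5)
print(distillation_loss(student, teacher, targets))
```
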
Albert Gu (@_albertgu)

Isaac has been interested in a general compression-based theory of intelligence. He explored this on ARC-AGI to get interesting results with a very different approach!

Hunyuan (@tencenthunyuan)

🚀 Introducing Hunyuan-TurboS – the first ultra-large Hybrid-Transformer-Mamba MoE model!
Traditional pure Transformer models struggle with long-text training and inference due to O(N²) complexity and KV-Cache issues. Hunyuan-TurboS combines:
✅ Mamba's efficient long-sequence
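
Some back-of-the-envelope arithmetic behind the KV-cache point (the layer counts, head counts, and dimensions below are hypothetical, chosen only to show the scaling, not Hunyuan-TurboS's actual configuration): an attention KV cache grows linearly with context length, while a recurrent SSM state stays constant.

```python
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Keys and values (the factor of 2) for every layer, growing linearly with length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers=64, d_inner=8192, state_dim=128, bytes_per_elem=2):
    # A fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * d_inner * state_dim * bytes_per_elem

for n_tokens in (8_192, 131_072, 1_048_576):
    print(f"{n_tokens:>9} tokens: KV cache ≈ {kv_cache_bytes(n_tokens) / 2**30:7.1f} GiB, "
          f"SSM state ≈ {ssm_state_bytes() / 2**30:.2f} GiB (constant)")
```
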
Albert Gu (@_albertgu)

Announcing Cartesia’s Series A towards our mission of building real-time intelligence

I’m cooking up some new models in the back - looking for researchers who want to develop the next generation of architectures 👀

Albert Gu (@_albertgu)

Lyra shows that biology rewards the right inductive biases! Careful architectural design can significantly improve performance and efficiency for modeling biological sequences.

Albert Gu (@_albertgu)

We started off investigating applications of SSMs to PDEs but evolved to a broader question of understanding memory in modeling PDEs, finding when combining a sequence model (e.g. S4) with a Markovian neural operator (e.g. FNO) has advantages. Led by CMU students Ricardo and
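
A toy sketch of the distinction being studied (purely illustrative stand-ins, not the paper's models): a Markovian neural operator predicts the next PDE state from the current state alone, while a memory-based sequence model conditions on the whole solution history.

```python
import numpy as np

def markovian_rollout(u0, step, n_steps):
    """u_{t+1} = F(u_t): memoryless, e.g. a neural operator applied autoregressively."""
    traj = [u0]
    for _ in range(n_steps):
        traj.append(step(traj[-1]))
    return np.stack(traj)

def memory_rollout(u0, step_with_history, n_steps):
    """u_{t+1} = G(u_1, ..., u_t): e.g. a sequence model run over the history."""
    traj = [u0]
    for _ in range(n_steps):
        traj.append(step_with_history(np.stack(traj)))
    return np.stack(traj)

# Stand-in "operators": heat-equation-like smoothing for the Markovian step, and a
# step that also sees an exponentially weighted average of the history.
smooth = lambda u: 0.5 * u + 0.25 * (np.roll(u, 1) + np.roll(u, -1))
with_memory = lambda hist: smooth(np.average(hist, axis=0,
                                             weights=0.9 ** np.arange(len(hist))[::-1]))

u0 = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
print(markovian_rollout(u0, smooth, 10).shape)    # (11, 64)
print(memory_rollout(u0, with_memory, 10).shape)  # (11, 64)
```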

Yutong (Kelly) He (@electronickale)

✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵

Raghu Ganti (@raghuganti)

🚀 Bamba v2 (9B) is here: faster, stronger, and smarter! A leaderboard model in just 3T tokens!!
Bamba v1 + 1T tokens of training
Outperforms Llama 3.1 8B on L1 & L2 benchmark scores 📈
2–2.5× faster inference with vLLM than standard transformer based models 🏎️
Open weights +

Albert Gu (@_albertgu)

We dug into in-depth mechanistic differences between Transformers and SSMs:
1. SSMs are very strong at sequence modeling, but worse at certain algorithmic "skills" such as retrieval
2. The gap appears only in a few heads
3. This provides insight and improved designs for hybrid
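
For readers unfamiliar with what "retrieval" means as an algorithmic skill here, a toy associative-recall probe of the kind commonly used (an illustrative setup, not the paper's exact benchmark) looks like this:

```python
import random

def make_recall_example(n_pairs=8, vocab=tuple("abcdefghijklmnop"), seed=None):
    """Show n_pairs key:value pairs, then query one key; the answer is its value."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, n_pairs)
    values = [rng.choice(vocab) for _ in keys]
    query = rng.choice(keys)
    prompt = " ".join(f"{k}:{v}" for k, v in zip(keys, values)) + f" | {query}:"
    return prompt, values[keys.index(query)]

prompt, answer = make_recall_example(seed=0)
print(prompt, "->", answer)
# An attention head can solve this exactly by looking back to the matching key;
# a fixed-size recurrent state has to compress all the pairs, which is where the
# gap (concentrated in a few heads) tends to show up.
```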