Albert Gu (@_albertgu)'s Twitter Profile
Albert Gu

@_albertgu

assistant prof @mldcmu. chief scientist @cartesia_ai. leading the ssm revolution.

ID: 1076265378118959104

Joined: 21-12-2018 23:57:16

362 Tweets

14.14K Followers

88 Following

Vaibhav (VB) Srivastav (@reach_vb)

Falcon 3 is out! 1B, 3B, 7B, 10B (Base + Instruct) & 7B Mamba, trained on 14 Trillion tokens and apache 2.0 licensed! 🔥

> 1B-Base surpasses SmolLM2-1.7B and matches gemma-2-2b
> 3B-Base outperforms larger models like Llama-3.1-8B and Minitron-4B-Base
> 7B-Base is on par with
Albert Gu (@_albertgu)

new Mamba-2 model release! strong performance, fast inference, and most importantly strong commitment to transparency and reproducibility to benefit the community 🚀 lots of interesting findings in the blog post :)

Beidi Chen (@beidichen)

🐷 MagicPig was developed during our efforts to create challenging reasoning tasks that showcase the true potential of long-context models—tasks that cannot be solved through simple retrieval. In addition to tackling long-context closed/open LLMs (🔥 more on this coming soon), we

Karan Goel (@krandiash)

A few interesting challenges in extending context windows. A model with a big prompt =/= "infinite context" in my mind. 10M tokens of context is not exactly on the path to infinite context.

Instead, it requires a streaming model that has
- an efficient state with fast

Alex Wang (@heyyalexwang)

did you know you've been doing test-time learning this whole time?

transformers, SSMs, RNNs, are all test-time regressors but with different design choices

we present a unifying framework that derives sequence layers (and higher-order attention👀) from a *single* equation

🧵
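
To make the "test-time regression" framing above concrete, here is a minimal sketch (my own illustration, not the thread's exact formulation): unnormalized causal linear attention can be read as an online regression in which a memory matrix is updated with each key-value pair and then queried. The function name and dimensions below are made up for the example.

```python
import numpy as np

def linear_attention_as_regression(Q, K, V, lr=1.0):
    """Unnormalized causal linear attention viewed as test-time regression:
    a linear map M from keys to values is updated online with each pair and
    then read out, i.e. M_t = M_{t-1} + lr * v_t k_t^T and y_t = M_t q_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    M = np.zeros((d_v, d_k))                # the test-time "regressor" / associative memory
    Y = np.zeros((T, d_v))
    for t in range(T):
        M = M + lr * np.outer(V[t], K[t])   # update the memory with the pair (k_t, v_t)
        Y[t] = M @ Q[t]                     # read the memory with the query q_t
    return Y

# Sanity check: this matches causal, unnormalized linear attention.
rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = rng.normal(size=(3, T, d))
reference = np.stack([(Q[t] @ K[:t + 1].T) @ V[:t + 1] for t in range(T)])
assert np.allclose(linear_attention_as_regression(Q, K, V), reference)
```
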
Tri Dao (@tri_dao)

I've been excited about this for a while: a simple architectural change to the residual connection that allows arbitrary overlapping of computation of one layer and the communication of another layer, leading to ~30% speedup in TP! More on MoE and expert parallel to come soon!
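
A hedged sketch of the idea described above, reconstructed from the tweet alone (not the actual implementation): if each layer reads the residual stream before the previous layer's output has been added, the previous layer's tensor-parallel all-reduce can run while the current layer computes, and its result is folded in one step later. The thread-pool "all-reduce" below is a toy stand-in for something like torch.distributed.all_reduce(..., async_op=True).

```python
from concurrent.futures import ThreadPoolExecutor
import time
import numpy as np

pool = ThreadPoolExecutor(max_workers=1)

def all_reduce_async(x):
    """Simulated communication: returns a future that resolves to the 'reduced' tensor."""
    def comm():
        time.sleep(0.01)   # pretend network latency
        return x           # only one "rank" here, so the reduction is the identity
    return pool.submit(comm)

def layer(x, W):
    return np.tanh(x @ W)  # stand-in for one block's local (tensor-parallel) compute

def forward_standard(x, weights):
    # Standard residual stream: layer i+1 needs layer i's all-reduced output,
    # so communication and compute are serialized.
    for W in weights:
        x = x + all_reduce_async(layer(x, W)).result()
    return x

def forward_overlapped(x, weights):
    # Modified residual stream: each layer reads the stream *before* the previous
    # layer's still-in-flight output has been added, so comm(i) overlaps with
    # compute(i+1); the reduced output is folded in one step later. Note this is
    # an architectural change, not just a scheduling trick.
    pending = None
    for W in weights:
        y = layer(x, W)                # compute of layer i on the lagged stream
        if pending is not None:
            x = x + pending.result()   # fold in layer i-1's reduced output
        pending = all_reduce_async(y)  # launch comm for layer i; next layer starts now
    return x + pending.result()

rng = np.random.default_rng(0)
x = rng.normal(size=8)
weights = [0.1 * rng.normal(size=(8, 8)) for _ in range(4)]
print(forward_standard(x, weights).shape, forward_overlapped(x, weights).shape)
```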

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)

Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

Distilling Llama-1B and -3B models with only 8 billion tokens into subquadratic models like Mamba to achieve better and faster scaling of inference-time compute with minimal performance loss.
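
As a rough illustration of what "distilling" means here (a generic logit-distillation sketch with made-up hyperparameters, not necessarily the paper's exact recipe): the subquadratic student is trained to match the Transformer teacher's next-token distribution, usually mixed with the standard cross-entropy on the ground-truth tokens.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """alpha * KL(teacher || student) at temperature T + (1 - alpha) * CE on targets."""
    t_logp = log_softmax(teacher_logits / temperature)
    s_logp = log_softmax(student_logits / temperature)
    kl = (np.exp(t_logp) * (t_logp - s_logp)).sum(axis=-1).mean() * temperature ** 2
    ce = -np.take_along_axis(log_softmax(student_logits),
                             targets[..., None], axis=-1).mean()
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage: random "logits" over a vocabulary of 10 tokens at 5 positions.
rng = np.random.default_rng(0)
student, teacher = rng.normal(size=(2, 5, 10))
targets = rng.integers(0, 10, size=5)
print(distillation_loss(student, teacher, targets))
```
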
Albert Gu (@_albertgu)

Isaac has been interested in a general compression-based theory of intelligence. He explored this on ARC-AGI to get interesting results with a very different approach!

Hunyuan (@tencenthunyuan)

🚀 Introducing Hunyuan-TurboS – the first ultra-large Hybrid-Transformer-Mamba MoE model!
Traditional pure Transformer models struggle with long-text training and inference due to O(N²) complexity and KV-Cache issues. Hunyuan-TurboS combines:
✅ Mamba's efficient long-sequence
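
Some back-of-the-envelope arithmetic behind the KV-cache point (the layer counts, head counts, and dimensions below are hypothetical, chosen only to show the scaling, not Hunyuan-TurboS's actual configuration): an attention KV cache grows linearly with context length, while a recurrent SSM state stays constant.

```python
def kv_cache_bytes(seq_len, n_layers=64, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # Keys and values (the factor of 2) for every layer, growing linearly with length.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers=64, d_inner=8192, state_dim=128, bytes_per_elem=2):
    # A fixed-size recurrent state per layer, independent of sequence length.
    return n_layers * d_inner * state_dim * bytes_per_elem

for n_tokens in (8_192, 131_072, 1_048_576):
    print(f"{n_tokens:>9} tokens: KV cache ≈ {kv_cache_bytes(n_tokens) / 2**30:7.1f} GiB, "
          f"SSM state ≈ {ssm_state_bytes() / 2**30:.2f} GiB (constant)")
```
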
Albert Gu (@_albertgu)

Announcing Cartesia’s Series A towards our mission of building real-time intelligence

I’m cooking up some new models in the back - looking for researchers who want to develop the next generation of architectures 👀

Albert Gu (@_albertgu)

Lyra shows that biology rewards the right inductive biases! Careful architectural design can significantly improve performance and efficiency for modeling biological sequences.

Albert Gu (@_albertgu)

We started off investigating applications of SSMs to PDEs but evolved to a broader question of understanding memory in modeling PDEs, finding when combining a sequence model (e.g. S4) with a Markovian neural operator (e.g. FNO) has advantages. Led by CMU students Ricardo and
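
A toy sketch of the distinction being studied (purely illustrative stand-ins, not the paper's models): a Markovian neural operator predicts the next PDE state from the current state alone, while a memory-based sequence model conditions on the whole solution history.

```python
import numpy as np

def markovian_rollout(u0, step, n_steps):
    """u_{t+1} = F(u_t): memoryless, e.g. a neural operator applied autoregressively."""
    traj = [u0]
    for _ in range(n_steps):
        traj.append(step(traj[-1]))
    return np.stack(traj)

def memory_rollout(u0, step_with_history, n_steps):
    """u_{t+1} = G(u_1, ..., u_t): e.g. a sequence model run over the history."""
    traj = [u0]
    for _ in range(n_steps):
        traj.append(step_with_history(np.stack(traj)))
    return np.stack(traj)

# Stand-in "operators": heat-equation-like smoothing for the Markovian step, and a
# step that also sees an exponentially weighted average of the history.
smooth = lambda u: 0.5 * u + 0.25 * (np.roll(u, 1) + np.roll(u, -1))
with_memory = lambda hist: smooth(np.average(hist, axis=0,
                                             weights=0.9 ** np.arange(len(hist))[::-1]))

u0 = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
print(markovian_rollout(u0, smooth, 10).shape)    # (11, 64)
print(memory_rollout(u0, with_memory, 10).shape)  # (11, 64)
```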

Yutong (Kelly) He (@electronickale)

✨ Love 4o-style image generation but prefer to use Midjourney? Tired of manual prompt crafting from inspo images? PRISM to the rescue! 🖼️→📝→🖼️ We automate black-box prompt engineering—no training, no embeddings, just accurate, readable prompts from your inspo images! 1/🧵

Raghu Ganti (@raghuganti)

🚀 Bamba v2 (9B) is here: faster, stronger, and smarter! A leaderboard model in just 3T tokens!!
Bamba v1 + 1T tokens of training
Outperforms Llama 3.1 8B on L1 & L2 benchmark scores 📈
2–2.5× faster inference with vLLM than standard transformer based models 🏎️
Open weights +

Albert Gu (@_albertgu)

We dug into in-depth mechanistic differences between Transformers and SSMs:
1. SSMs are very strong at sequence modeling, but worse at certain algorithmic "skills" such as retrieval
2. The gap appears only in a few heads
3. This provides insight and improved designs for hybrid
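
For readers unfamiliar with what "retrieval" means as an algorithmic skill here, a toy associative-recall probe of the kind commonly used (an illustrative setup, not the paper's exact benchmark) looks like this:

```python
import random

def make_recall_example(n_pairs=8, vocab=tuple("abcdefghijklmnop"), seed=None):
    """Show n_pairs key:value pairs, then query one key; the answer is its value."""
    rng = random.Random(seed)
    keys = rng.sample(vocab, n_pairs)
    values = [rng.choice(vocab) for _ in keys]
    query = rng.choice(keys)
    prompt = " ".join(f"{k}:{v}" for k, v in zip(keys, values)) + f" | {query}:"
    return prompt, values[keys.index(query)]

prompt, answer = make_recall_example(seed=0)
print(prompt, "->", answer)
# An attention head can solve this exactly by looking back to the matching key;
# a fixed-size recurrent state has to compress all the pairs, which is where the
# gap (concentrated in a few heads) tends to show up.
```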