Daniele Paliotta (@danielepaliotta)'s Twitter Profile
Daniele Paliotta

@danielepaliotta

ML PhD @Unige_en, and other things.

ID: 1189887448618328065

https://danielep.xyz · 31-10-2019 12:51:01

415 Tweets

560 Followers

1.1K Following

François Fleuret (@francoisfleuret)'s Twitter Profile Photo

With the awesome Ramón Calvo, Daniele Paliotta, Matteo Pagliardini, and Martin Jaggi (UNIGE Faculty of Science / EPFL Computer and Communication Sciences). TL;DR: you can shuffle the middle layers of a transformer without retraining it. We take advantage of that to compute layers in parallel. arxiv.org/abs/2502.02790
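
The tweet describes executing middle layers in parallel rather than strictly in sequence. Below is a minimal toy sketch (my own illustration, not the authors' code) of that paired scheme: two consecutive middle blocks read the same input and their residual contributions are summed, so they can run concurrently. `Block`, `forward_paired`, and all hyperparameters here are made up for the example.

```python
# Toy sketch (not the authors' code): run middle residual blocks in pairs.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a pre-norm transformer block; returns only the residual
    branch (an MLP here for brevity, no attention)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.mlp(self.norm(x))

def forward_sequential(blocks, x):
    for blk in blocks:
        x = x + blk(x)  # usual depth-wise composition
    return x

def forward_paired(blocks, x, first=2, last=6):
    """Keep early/late blocks sequential; run blocks in [first, last) two at a
    time on a shared input, summing their residual outputs."""
    i = 0
    while i < len(blocks):
        if first <= i and i + 1 < last:
            x = x + blocks[i](x) + blocks[i + 1](x)  # the pair shares one input
            i += 2
        else:
            x = x + blocks[i](x)
            i += 1
    return x

blocks = nn.ModuleList([Block(64) for _ in range(8)])
x = torch.randn(1, 16, 64)
print(forward_sequential(blocks, x).shape, forward_paired(blocks, x).shape)
```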

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

Distilling Llama-1B and -3B models with only 8 billion tokens into subquadratic models like Mamba to achieve better and faster scaling of inference-time compute with minimal performance loss.

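For intuition on the distillation step, here is a hedged, generic sketch of token-level distillation: the subquadratic student is trained to match the Transformer teacher's next-token distribution with a KL objective. Function names, the temperature, and the random tensors are illustrative; this is not necessarily the paper's exact recipe.

```python
# Generic token-level distillation sketch (illustrative, not the paper's recipe):
# the student is pushed toward the teacher's next-token distribution.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student), averaged over token positions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # batchmean averages over positions; t**2 keeps the gradient scale stable
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)

vocab, batch, seq_len = 32000, 2, 8
student_logits = torch.randn(batch * seq_len, vocab, requires_grad=True)  # stand-in for the student
teacher_logits = torch.randn(batch * seq_len, vocab)                      # stand-in for the Llama teacher
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```
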
Tri Dao (@tri_dao)'s Twitter Profile Photo

Mamba's higher inference throughput means you can sample more CoTs for reasoning and get a better model for the same budget. I've been thinking about the ideal architecture for test-time compute, more on this soon!
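
A quick back-of-the-envelope illustration of that argument: under a fixed wall-clock budget, higher throughput buys more sampled chains of thought, and coverage (the chance that at least one sample is correct) grows with the number of samples. All numbers below are made up for illustration.

```python
# Illustrative budget arithmetic (all numbers assumed, not measured):
# more throughput -> more CoT samples in the same time -> higher coverage.
def coverage(p_correct, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

budget_s, tokens_per_cot, p_correct = 60.0, 1000, 0.30
for name, tokens_per_s in [("transformer", 50.0), ("mamba (~3x throughput)", 150.0)]:
    k = int(budget_s * tokens_per_s / tokens_per_cot)
    print(f"{name}: {k} samples -> coverage {coverage(p_correct, k):.2f}")
```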

Simone Scardapane (@s_scardapane)'s Twitter Profile Photo

*Leveraging the true depth of LLMs* by Ramón Calvo, Daniele Paliotta, Matteo Pagliardini, and François Fleuret. Based on the observation that layer shuffling has little impact on the middle layers of LLMs, they propose to parallelize their execution in pairs. arxiv.org/abs/2502.02790

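The observation itself (middle layers tolerate reordering) can be probed with a small harness like the one below: run a residual stack in its original order and again with the middle blocks permuted, then compare the outputs. This toy uses a random model purely to show the measurement; the paper's evidence comes from pretrained LLMs, not from a toy like this.

```python
# Tiny harness for the reordering check (toy random model, illustration only):
# compare outputs with the middle residual blocks in original vs. shuffled order.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_blocks(n=12, d=64):
    return nn.ModuleList([
        nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        for _ in range(n)
    ])

def run(blocks, x, order):
    for i in order:
        x = x + blocks[i](x)  # residual composition in the given order
    return x

random.seed(0)
torch.manual_seed(0)
blocks, x = make_blocks(), torch.randn(1, 16, 64)
baseline = run(blocks, x, range(12))
middle = list(range(2, 10))
random.shuffle(middle)                      # permute only the middle blocks
shuffled_order = [0, 1] + middle + [10, 11]
sim = F.cosine_similarity(baseline.flatten(), run(blocks, x, shuffled_order).flatten(), dim=0)
print(f"cosine similarity, original vs. shuffled middle: {sim.item():.3f}")
```
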
Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

- >3x faster inference over Transformer
- Outperforms distilled R1 under a fixed generation time budget

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)'s Twitter Profile Photo

M1 is a hybrid Mamba-based reasoning model designed to scale test-time generation more efficiently than Transformers.

- built via distillation from LLaMA3.2-3B and enhanced with SFT + RL (GRPO; see the sketch below)
- trained on <50B tokens vs. DeepSeek-R1’s 1T+ MATH tokens
- 3× faster inference vs. …

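Since the recipe above mentions GRPO, here is a hedged sketch of its core ingredient, the group-relative advantage: sample a group of completions per prompt, score them, and standardize the rewards within each group. The policy update, KL penalty, and clipping are omitted; shapes and rewards are made up for illustration.

```python
# Minimal GRPO advantage sketch (core idea only; no policy update, KL term, or clipping).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size); standardize within each group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled completions each, binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```
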
Kevin Li (@kevinyli_)'s Twitter Profile Photo

⏳ Transformer-based LLM reasoning is effective but slow. Distilled Mamba reasoners have ~4x throughput vs. teachers, enabling better coverage and accuracy at many time budgets!

📍 Reasoning and Planning for LLMs Workshop

Paper: arxiv.org/abs/2502.20339 x.com/DanielePaliott… 3/3