Daniele Paliotta (@danielepaliotta)'s Twitter Profile
Daniele Paliotta

@danielepaliotta

ML PhD @Unige_en, and other things.

ID: 1189887448618328065

https://danielep.xyz · 31-10-2019 12:51:01

415 Tweets

560 Followers

1.1K Following

François Fleuret (@francoisfleuret)'s Twitter Profile Photo

With the awesome Ramón Calvo, Daniele Paliotta, Matteo Pagliardini, and Martin Jaggi (UNIGE Faculty of Science / EPFL Computer and Communication Sciences). TL;DR: you can shuffle the middle layers of a transformer without retraining it. We take advantage of that to compute layers in parallel. arxiv.org/abs/2502.02790
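
The tweet describes executing middle layers in parallel rather than strictly in sequence. Below is a minimal toy sketch (my own illustration, not the authors' code) of that paired scheme: two consecutive middle blocks read the same input and their residual contributions are summed, so they can run concurrently. `Block`, `forward_paired`, and all hyperparameters here are made up for the example.

```python
# Toy sketch (not the authors' code): run middle residual blocks in pairs.
import torch
import torch.nn as nn

class Block(nn.Module):
    """Stand-in for a pre-norm transformer block; returns only the residual
    branch (an MLP here for brevity, no attention)."""
    def __init__(self, d):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        return self.mlp(self.norm(x))

def forward_sequential(blocks, x):
    for blk in blocks:
        x = x + blk(x)  # usual depth-wise composition
    return x

def forward_paired(blocks, x, first=2, last=6):
    """Keep early/late blocks sequential; run blocks in [first, last) two at a
    time on a shared input, summing their residual outputs."""
    i = 0
    while i < len(blocks):
        if first <= i and i + 1 < last:
            x = x + blocks[i](x) + blocks[i + 1](x)  # the pair shares one input
            i += 2
        else:
            x = x + blocks[i](x)
            i += 1
    return x

blocks = nn.ModuleList([Block(64) for _ in range(8)])
x = torch.randn(1, 16, 64)
print(forward_sequential(blocks, x).shape, forward_paired(blocks, x).shape)
```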

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr)'s Twitter Profile Photo

Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners

Distilling Llama-1B and -3B models with only 8 billion tokens into subquadratic models like Mamba to achieve better and faster scaling of inference-time compute with minimal performance loss.

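For intuition on the distillation step, here is a hedged, generic sketch of token-level distillation: the subquadratic student is trained to match the Transformer teacher's next-token distribution with a KL objective. Function names, the temperature, and the random tensors are illustrative; this is not necessarily the paper's exact recipe.

```python
# Generic token-level distillation sketch (illustrative, not the paper's recipe):
# the student is pushed toward the teacher's next-token distribution.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student), averaged over token positions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # batchmean averages over positions; t**2 keeps the gradient scale stable
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)

vocab, batch, seq_len = 32000, 2, 8
student_logits = torch.randn(batch * seq_len, vocab, requires_grad=True)  # stand-in for the student
teacher_logits = torch.randn(batch * seq_len, vocab)                      # stand-in for the Llama teacher
loss = distill_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```
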
Tri Dao (@tri_dao)'s Twitter Profile Photo

Mamba's higher inference throughput means you can sample more CoTs for reasoning and get a better model for the same budget. I've been thinking about the ideal architecture for test-time compute, more on this soon!
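
A quick back-of-the-envelope illustration of that argument: under a fixed wall-clock budget, higher throughput buys more sampled chains of thought, and coverage (the chance that at least one sample is correct) grows with the number of samples. All numbers below are made up for illustration.

```python
# Illustrative budget arithmetic (all numbers assumed, not measured):
# more throughput -> more CoT samples in the same time -> higher coverage.
def coverage(p_correct, k):
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k

budget_s, tokens_per_cot, p_correct = 60.0, 1000, 0.30
for name, tokens_per_s in [("transformer", 50.0), ("mamba (~3x throughput)", 150.0)]:
    k = int(budget_s * tokens_per_s / tokens_per_cot)
    print(f"{name}: {k} samples -> coverage {coverage(p_correct, k):.2f}")
```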

Simone Scardapane (@s_scardapane)'s Twitter Profile Photo

*Leveraging the true depth of LLMs* by Ramón Calvo, Daniele Paliotta, Matteo Pagliardini, and François Fleuret. Based on the observation that layer shuffling has little impact on the middle layers of LLMs, they propose to parallelize their execution in pairs. arxiv.org/abs/2502.02790

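The observation itself (middle layers tolerate reordering) can be probed with a small harness like the one below: run a residual stack in its original order and again with the middle blocks permuted, then compare the outputs. This toy uses a random model purely to show the measurement; the paper's evidence comes from pretrained LLMs, not from a toy like this.

```python
# Tiny harness for the reordering check (toy random model, illustration only):
# compare outputs with the middle residual blocks in original vs. shuffled order.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_blocks(n=12, d=64):
    return nn.ModuleList([
        nn.Sequential(nn.LayerNorm(d), nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
        for _ in range(n)
    ])

def run(blocks, x, order):
    for i in order:
        x = x + blocks[i](x)  # residual composition in the given order
    return x

random.seed(0)
torch.manual_seed(0)
blocks, x = make_blocks(), torch.randn(1, 16, 64)
baseline = run(blocks, x, range(12))
middle = list(range(2, 10))
random.shuffle(middle)                      # permute only the middle blocks
shuffled_order = [0, 1] + middle + [10, 11]
sim = F.cosine_similarity(baseline.flatten(), run(blocks, x, shuffled_order).flatten(), dim=0)
print(f"cosine similarity, original vs. shuffled middle: {sim.item():.3f}")
```
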
Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models

- >3x faster inference over Transformer
- Outperforms distilled R1 under a fixed generation time budget

𝚐𝔪𝟾𝚡𝚡𝟾 (@gm8xx8)'s Twitter Profile Photo

M1 is a hybrid Mamba-based reasoning model designed to scale test-time generation more efficiently than Transformers.

- built via distillation from LLaMA3.2-3B and enhanced with SFT + RL (GRPO; see the sketch below)
- trained on <50B tokens vs. DeepSeek-R1’s 1T+ MATH tokens
- 3× faster inference vs. …

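Since the recipe above mentions GRPO, here is a hedged sketch of its core ingredient, the group-relative advantage: sample a group of completions per prompt, score them, and standardize the rewards within each group. The policy update, KL penalty, and clipping are omitted; shapes and rewards are made up for illustration.

```python
# Minimal GRPO advantage sketch (core idea only; no policy update, KL term, or clipping).
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size); standardize within each group."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled completions each, binary correctness rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```
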
Kevin Li (@kevinyli_)'s Twitter Profile Photo

⏳ Transformer-based LLM reasoning is effective but slow. Distilled Mamba reasoners have ~4x throughput vs. teachers, enabling better coverage and accuracy at many time budgets!

📍 Reasoning and Planning for LLMs Workshop

Paper: arxiv.org/abs/2502.20339 x.com/DanielePaliott… 3/3