Benjamin Lefaudeux (@bentheegg) 's Twitter Profile
Benjamin Lefaudeux

@bentheegg

Crafting pixels with PhotoRoom after some time in sunny California and happy Copenhagen. Meta (xformers, FairScale), EyeTribe (acq.). Not using this account anymore.

ID: 65453004

Link: https://bentheegg.bsky.social
Joined: 13-08-2009 20:02:16

7.7K Tweets

1.1K Followers

1.1K Following

Andrew White 🐦‍⬛ (@andrewwhite01) 's Twitter Profile Photo

This paper is amazing. Their architecture comparison with AlphaFold2 really tells the whole story. Just flow matching with a transformer...
vik (@vikhyatk) 's Twitter Profile Photo

first saw this in paligemma, adopted it in moondream and saw sizeable perf wins. the downside is you can't use vLLM, SGLang etc; you have to roll your own inference engine

Awni Hannun (@awnihannun) 's Twitter Profile Photo

Maybe top-k is all you need.

First it came for the MLP - switch-style MoEs.
Now it's coming for attention - DSV3.2 sparse attention.
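A minimal sketch of the "top-k for the MLP" half of this tweet: a switch-style MoE routes each token to its k highest-scoring experts and runs only those MLPs. Expert count, k, dimensions, and the naive dispatch loop are illustrative assumptions, not any particular model's configuration; the attention half is the DSA-style selection discussed further down the feed.

```python
# Hedged sketch: top-k expert routing, switch-style. All sizes are illustrative.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [tokens, d_model]
        gate_logits = self.router(x)                            # [tokens, n_experts]
        weights, expert_ids = gate_logits.topk(self.k, dim=-1)  # keep only the k best experts per token
        weights = weights.softmax(dim=-1)                        # normalize over the kept experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                               # naive dispatch; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64]); only k of n_experts MLPs run per token
```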

wh (@nrehiew_) 's Twitter Profile Photo

Some notes on DeepSeek v3.2's sparse attention mechanism 

DSA can be thought of as a noncontiguous sliding window where each token only attends to 2048 other tokens. This means that both the memory to load and the FLOPs are constant at O(2048) (for decode, at least).
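A minimal sketch of that idea at decode time: score the KV cache cheaply, keep only a fixed budget of top-scoring positions, and run full attention over that noncontiguous subset, so the dense attention cost per token stays tied to the budget rather than the context length. The scoring rule, shapes, and function names here are illustrative assumptions, not DeepSeek's actual indexer.

```python
# Hedged sketch: fixed-budget, noncontiguous sparse attention for one decode step.
import torch

def sparse_decode_attention(q, k_cache, v_cache, budget=2048):
    """One decode step: attend to at most `budget` noncontiguous past tokens."""
    scale = q.shape[-1] ** 0.5
    scores = k_cache @ q / scale                 # [seq_len] cheap relevance scores (stand-in indexer)
    budget = min(budget, scores.shape[0])
    _, idx = scores.topk(budget)                 # the noncontiguous "window" of kept positions
    k_sel, v_sel = k_cache[idx], v_cache[idx]    # only these KV entries are loaded for attention
    attn = torch.softmax(k_sel @ q / scale, dim=-1)
    return attn @ v_sel                          # dense attention cost depends on budget, not seq_len

if __name__ == "__main__":
    d, seq_len = 64, 100_000
    q = torch.randn(d)
    k_cache, v_cache = torch.randn(seq_len, d), torch.randn(seq_len, d)
    out = sparse_decode_attention(q, k_cache, v_cache, budget=2048)
    print(out.shape)  # torch.Size([64])
```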
wh (@nrehiew_) 's Twitter Profile Photo

One possible reason is that DSA is inherently a better long-context algorithm than quadratic attention variants because of softmax.

Since softmax is only done over 2048 tokens, QK attention weight magnitudes are preserved and no "attention budget" is given to useless tokens.
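A small numeric illustration of that "attention budget" point: the same relevant QK logits keep far more attention mass when the softmax runs over a short selected set than over a long context full of irrelevant keys. All numbers below are made up for the demonstration.

```python
# Hedged sketch: softmax dilution over long contexts vs. a 2048-token selection.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
relevant = np.array([4.0, 3.5])                      # two genuinely useful QK logits
noise_short = rng.normal(0.0, 1.0, size=2046)        # rest of a 2048-token selected set
noise_long = rng.normal(0.0, 1.0, size=131072 - 2)   # rest of a 128K-token full context

p_short = softmax(np.concatenate([relevant, noise_short]))
p_long = softmax(np.concatenate([relevant, noise_long]))

print("mass on relevant keys, 2048-token softmax:", p_short[:2].sum())
print("mass on relevant keys, 128K-token softmax:", p_long[:2].sum())
# the excluded useless tokens no longer soak up attention probability
```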
Benjamin Lefaudeux (@bentheegg) 's Twitter Profile Photo

Marketing budget (see thread, this was a great question). There's nothing inherent to stablecoins that yields money, as far as I know, yet very few people question the value structure and a lot believe in a money tree.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Quite the contrary: We're using the language that was designed as a glue language for gluing pieces together that are written in the language(s) that were designed for peak performance. Everything working exactly as designed.

Alexia Jolicoeur-Martineau (@jm_alexia) 's Twitter Profile Photo

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M-parameter neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs.

Blog: alexiajm.github.io/2025/09/29/tin…
Code: github.com/SamsungSAILMon…
Paper: arxiv.org/abs/2510.04871

Benjamin Lefaudeux (@bentheegg) 's Twitter Profile Photo

This is a dog shit bad take on 3 grounds:
- ad hominem is bad
- Sam Gross was already GOAT, core PyTorch, and he did this by the Python governance book
- Python is a great language which falls apart with concurrency; modern CPUs have 300 cores
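A minimal sketch of the concurrency point, assuming a standard GIL-based CPython build: CPU-bound threads do not scale across cores, which is why process pools (or the free-threaded work referenced above) matter. The workload size and pool sizes are arbitrary illustrations.

```python
# Hedged sketch: CPU-bound work under the GIL, threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # pure-Python busy loop: holds the GIL for its whole duration
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(cpu_bound, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # on a GIL build, 4 threads serialize and take roughly 4x one thread's time,
    # while 4 processes actually spread across cores
    print("threads:  ", timed(ThreadPoolExecutor, 4))
    print("processes:", timed(ProcessPoolExecutor, 4))
```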

Saining Xie (@sainingxie) 's Twitter Profile Photo

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right.

today, we introduce Representation Autoencoders (RAE).

>> Retire VAEs. Use RAEs. 👇(1/n)
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

One of the interesting Blackwell bugs that we troubleshot was that the Blackwell vLLM image we started using back in July 2025 would lead to the instance stalling for up to 30 minutes on our bare metal B200 machines. This was especially challenging to replicate and debug as
Keller Jordan (@kellerjordan0) 's Twitter Profile Photo

New CIFAR-10 training speed record: 94% in 1.99 seconds on one A100

Previous record: 2.59 seconds (Nov. 10th 2024)
New record-holder: Algorithmic discovery engine developed by Hiverge

Changelog:
- Muon: Vectorize NS iter and reduce frequency of 'normalize weights' step
1/3
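For context on the "NS iter" item in the changelog above, here is a minimal sketch of Newton-Schulz orthogonalization as used in Muon-style optimizers: it pushes a gradient matrix toward its nearest orthogonal factor without an SVD. This uses the classical cubic iteration; the record-setting Muon code uses a tuned higher-order variant with far fewer steps, so the coefficients and step count here are illustrative assumptions only.

```python
# Hedged sketch: classical cubic Newton-Schulz iteration toward the orthogonal polar factor.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Approximate the orthogonal polar factor of g (the U V^T of its SVD) without an SVD."""
    x = g / (g.norm() + 1e-7)              # scale so singular values are < 1 and the iteration converges
    transpose = x.shape[0] > x.shape[1]
    if transpose:                           # work in the wide orientation so x @ x.T stays small
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x   # cubic update: pushes every singular value toward 1
    return x.T if transpose else x

if __name__ == "__main__":
    g = torch.randn(64, 128)
    o = newton_schulz_orthogonalize(g)
    print(torch.dist(o @ o.T, torch.eye(64)).item())  # close to 0: rows are approximately orthonormal
```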
Q (@qtnx_) 's Twitter Profile Photo

we at the codegen team at mistral are looking for cracked interns in Paris & Palo Alto. phd students & master's graduating in 2026-early 27 preferred. join us to make the best code models and agents. link below

Dominic Gannaway (@trueadm) 's Twitter Profile Photo

Stacked diffs would be absolutely epic. If you have experience using them, then you’ll understand their importance and impact when working in teams/monorepos.