Benjamin Lefaudeux (@bentheegg) 's Twitter Profile
Benjamin Lefaudeux

@bentheegg

Crafting pixels with PhotoRoom after some time in sunny California and happy Copenhagen. Meta (xformers, FairScale), EyeTribe (acq.). Not using this account anymore.

ID: 65453004

Link: https://bentheegg.bsky.social
Joined: 13-08-2009 20:02:16

7.7K Tweets

1.1K Followers

1.1K Following

Andrew White 🐦‍⬛ (@andrewwhite01) 's Twitter Profile Photo

This paper is amazing. Their architecture comparison with AlphaFold2 really tells the whole story. Just flow matching with a transformer...
vik (@vikhyatk) 's Twitter Profile Photo

first saw this in paligemma, adopted it in moondream and saw sizeable perf wins. the downside is you can't use vLLM, SGLang etc; you have to roll your own inference engine

Awni Hannun (@awnihannun) 's Twitter Profile Photo

Maybe top-k is all you need.

First it came for the MLP - switch-style MoEs.
Now it's coming for attention - DSV3.2 sparse attention.
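A minimal sketch of the "top-k for the MLP" half of this tweet: a switch-style MoE routes each token to its k highest-scoring experts and runs only those MLPs. Expert count, k, dimensions, and the naive dispatch loop are illustrative assumptions, not any particular model's configuration; the attention half is the DSA-style selection discussed further down the feed.

```python
# Hedged sketch: top-k expert routing, switch-style. All sizes are illustrative.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [tokens, d_model]
        gate_logits = self.router(x)                            # [tokens, n_experts]
        weights, expert_ids = gate_logits.topk(self.k, dim=-1)  # keep only the k best experts per token
        weights = weights.softmax(dim=-1)                        # normalize over the kept experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                               # naive dispatch; real kernels batch this
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64]); only k of n_experts MLPs run per token
```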

wh (@nrehiew_) 's Twitter Profile Photo

Some notes on DeepSeek v3.2's sparse attention mechanism 

DSA can be thought of as a noncontiguous sliding window where each token only attends to 2048 other tokens. This means that both the memory to load and the FLOPs are constant at O(2048) (for decode, at least).
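A minimal sketch of that idea at decode time: score the KV cache cheaply, keep only a fixed budget of top-scoring positions, and run full attention over that noncontiguous subset, so the dense attention cost per token stays tied to the budget rather than the context length. The scoring rule, shapes, and function names here are illustrative assumptions, not DeepSeek's actual indexer.

```python
# Hedged sketch: fixed-budget, noncontiguous sparse attention for one decode step.
import torch

def sparse_decode_attention(q, k_cache, v_cache, budget=2048):
    """One decode step: attend to at most `budget` noncontiguous past tokens."""
    scale = q.shape[-1] ** 0.5
    scores = k_cache @ q / scale                 # [seq_len] cheap relevance scores (stand-in indexer)
    budget = min(budget, scores.shape[0])
    _, idx = scores.topk(budget)                 # the noncontiguous "window" of kept positions
    k_sel, v_sel = k_cache[idx], v_cache[idx]    # only these KV entries are loaded for attention
    attn = torch.softmax(k_sel @ q / scale, dim=-1)
    return attn @ v_sel                          # dense attention cost depends on budget, not seq_len

if __name__ == "__main__":
    d, seq_len = 64, 100_000
    q = torch.randn(d)
    k_cache, v_cache = torch.randn(seq_len, d), torch.randn(seq_len, d)
    out = sparse_decode_attention(q, k_cache, v_cache, budget=2048)
    print(out.shape)  # torch.Size([64])
```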
wh (@nrehiew_) 's Twitter Profile Photo

One possible reason is that DSA is inherently a better long-context algorithm than quadratic attention variants because of softmax.

Since softmax is only done over 2048 tokens, QK attention weight magnitudes are preserved and no "attention budget" is given to useless tokens.
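A small numeric illustration of that "attention budget" point: the same relevant QK logits keep far more attention mass when the softmax runs over a short selected set than over a long context full of irrelevant keys. All numbers below are made up for the demonstration.

```python
# Hedged sketch: softmax dilution over long contexts vs. a 2048-token selection.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
relevant = np.array([4.0, 3.5])                      # two genuinely useful QK logits
noise_short = rng.normal(0.0, 1.0, size=2046)        # rest of a 2048-token selected set
noise_long = rng.normal(0.0, 1.0, size=131072 - 2)   # rest of a 128K-token full context

p_short = softmax(np.concatenate([relevant, noise_short]))
p_long = softmax(np.concatenate([relevant, noise_long]))

print("mass on relevant keys, 2048-token softmax:", p_short[:2].sum())
print("mass on relevant keys, 128K-token softmax:", p_long[:2].sum())
# the excluded useless tokens no longer soak up attention probability
```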
Benjamin Lefaudeux (@bentheegg) 's Twitter Profile Photo

Marketing budget (see thread, this was a great question). There's nothing inherent to stablecoins that yields money, as far as I know, yet very few people question the value structure and a lot believe in a money tree.

Lucas Beyer (bl16) (@giffmana) 's Twitter Profile Photo

Quite the contrary: We're using the language that was designed as a glue language for gluing pieces together that are written in the language(s) that were designed for peak performance. Everything working exactly as designed.

Alexia Jolicoeur-Martineau (@jm_alexia) 's Twitter Profile Photo

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M-parameter neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs.

Blog: alexiajm.github.io/2025/09/29/tin…
Code: github.com/SamsungSAILMon…
Paper: arxiv.org/abs/2510.04871

Benjamin Lefaudeux (@bentheegg) 's Twitter Profile Photo

This is a dog shit bad take on 3 grounds:
- ad hominem is bad
- Sam Gross was already GOAT, core PyTorch, and he did this by the Python governance book
- Python is a great language which falls apart with concurrency; modern CPUs have 300 cores
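A minimal sketch of the concurrency point, assuming a standard GIL-based CPython build: CPU-bound threads do not scale across cores, which is why process pools (or the free-threaded work referenced above) matter. The workload size and pool sizes are arbitrary illustrations.

```python
# Hedged sketch: CPU-bound work under the GIL, threads vs. processes.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # pure-Python busy loop: holds the GIL for its whole duration
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, workers: int, n: int = 2_000_000) -> float:
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        list(pool.map(cpu_bound, [n] * workers))
    return time.perf_counter() - start

if __name__ == "__main__":
    # on a GIL build, 4 threads serialize and take roughly 4x one thread's time,
    # while 4 processes actually spread across cores
    print("threads:  ", timed(ThreadPoolExecutor, 4))
    print("processes:", timed(ProcessPoolExecutor, 4))
```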

Saining Xie (@sainingxie) 's Twitter Profile Photo

three years ago, DiT replaced the legacy unet with a transformer-based denoising backbone. we knew the bulky VAEs would be the next to go -- we just waited until we could do it right.

today, we introduce Representation Autoencoders (RAE).

>> Retire VAEs. Use RAEs. 👇(1/n)
SemiAnalysis (@semianalysis_) 's Twitter Profile Photo

One of the interesting Blackwell bugs that we troubleshot was that the Blackwell vLLM image we started using back in July 2025 would lead to the instance stalling for up to 30 minutes on our bare metal B200 machines. This was especially challenging to replicate and debug as
Keller Jordan (@kellerjordan0) 's Twitter Profile Photo

New CIFAR-10 training speed record: 94% in 1.99 seconds on one A100

Previous record: 2.59 seconds (Nov. 10th 2024)
New record-holder: Algorithmic discovery engine developed by Hiverge

Changelog:
- Muon: Vectorize NS iter and reduce frequency of 'normalize weights' step
1/3
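For context on the "NS iter" item in the changelog above, here is a minimal sketch of Newton-Schulz orthogonalization as used in Muon-style optimizers: it pushes a gradient matrix toward its nearest orthogonal factor without an SVD. This uses the classical cubic iteration; the record-setting Muon code uses a tuned higher-order variant with far fewer steps, so the coefficients and step count here are illustrative assumptions only.

```python
# Hedged sketch: classical cubic Newton-Schulz iteration toward the orthogonal polar factor.
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """Approximate the orthogonal polar factor of g (the U V^T of its SVD) without an SVD."""
    x = g / (g.norm() + 1e-7)              # scale so singular values are < 1 and the iteration converges
    transpose = x.shape[0] > x.shape[1]
    if transpose:                           # work in the wide orientation so x @ x.T stays small
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x   # cubic update: pushes every singular value toward 1
    return x.T if transpose else x

if __name__ == "__main__":
    g = torch.randn(64, 128)
    o = newton_schulz_orthogonalize(g)
    print(torch.dist(o @ o.T, torch.eye(64)).item())  # close to 0: rows are approximately orthonormal
```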
Q (@qtnx_) 's Twitter Profile Photo

we at the codegen team at mistral are looking for cracked interns in Paris & Palo Alto. phd students & master's graduating in 2026-early 27 preferred. join us to make the best code models and agents. link below

Dominic Gannaway (@trueadm) 's Twitter Profile Photo

Stacked diffs would be absolutely epic. If you have experience using them, then you’ll understand their importance and impact when working in teams/monorepos.