Lucas Beyer (bl16) (@giffmana)

Researcher (now: OpenAI, ex: DeepMind, Brain, RWTH Aachen), Gamer, Hacker, Belgian. Anon feedback: admonymous.co/giffmana
✗DMs → email

ID: 2236047510
Link: http://lucasb.eyer.be
Joined: 08-12-2013 13:31:09

18.18K Tweets · 88.88K Followers · 593 Following

Lucas Beyer (bl16) @giffmana:

nsys looks pretty cool actually, but information overload for a first-time user. Took me a bit to get good at Google's XProf too, so let's get started!

QQ to my nsys expert followers: any specific pro-tips? Biggest bang-for-buck things/views to look at? Any good pytorch…
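A common first step for nsys-with-PyTorch (my own sketch, not from the thread): wrap regions in NVTX ranges so the nsys timeline shows named spans instead of anonymous kernels. The model, shapes, and file name below are made-up placeholders.

```python
# Hypothetical training script (train.py); model and shapes are placeholders.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")

for step in range(10):
    torch.cuda.nvtx.range_push(f"step_{step}")  # named span in the nsys timeline
    torch.cuda.nvtx.range_push("forward")
    loss = model(x).square().mean()
    torch.cuda.nvtx.range_pop()
    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()
    opt.step()
    opt.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()
torch.cuda.synchronize()
```

Then capture with something like `nsys profile -t cuda,nvtx -o trace python train.py`; the NVTX row in the GUI groups kernels under your labels.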
Lucas Beyer (bl16) @giffmana:

torch.profiler.profiler.py
Last commit: 1mo ago "add memory blablabla"
torch.autograd.profiler.py
Last commit: 2mo ago "Induce inductor blablabla"

One of the two is legacy/deprecated, but you only learn that by looking at the docs of the other one, so if you land on the old one by…
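For the record: torch.autograd.profiler is the legacy one, torch.profiler is the current entry point. A minimal sketch of the modern API (the model, schedule values, and output path are arbitrary):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),  # skip 1, warm up 1, record 3
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./log"),
    profile_memory=True,
) as prof:
    for _ in range(5):
        model(x).sum().backward()
        prof.step()  # advances the wait/warmup/active schedule

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```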
Lucas Beyer (bl16) @giffmana:

I like the Encoder-only Mask Transformer (EoMT): basically removing all the bells and whistles, and doing panoptic segmentation with an almost vanilla ViT.

You're sliiiiightly worse for the same encoder size, but it's a lot simpler/faster and (likely) more scalable. I wish they…
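My reading of the recipe, as a heavily simplified sketch (all names, shapes, and the choice of blocks are mine, not the paper's code): append learnable query tokens to the patch tokens, run everything through ordinary ViT blocks, and read masks off as query-patch dot products.

```python
import torch
import torch.nn as nn

class EoMTSketch(nn.Module):
    def __init__(self, blocks, dim=768, num_queries=100, num_classes=133):
        super().__init__()
        self.blocks = blocks                             # plain ViT encoder blocks
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"

    def forward(self, patch_tokens):                     # (B, N, D) patch embeddings
        B = patch_tokens.shape[0]
        q = self.queries.expand(B, -1, -1)
        x = torch.cat([q, patch_tokens], dim=1)          # queries attend with patches
        for blk in self.blocks:
            x = blk(x)
        nq = self.queries.shape[0]
        q, p = x[:, :nq], x[:, nq:]
        mask_logits = torch.einsum("bqd,bnd->bqn", q, p) # per-query patch masks
        return self.cls_head(q), mask_logits

# Stand-in for real ViT blocks, just to make the sketch runnable.
blocks = nn.ModuleList(
    nn.TransformerEncoderLayer(768, 12, batch_first=True) for _ in range(4)
)
cls_logits, masks = EoMTSketch(blocks)(torch.randn(2, 196, 768))
```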
Lucas Beyer (bl16) @giffmana:

Oh wow, did you guys know that torch.compile can compile numpy code? And even run it on GPU?

This is pretty neat for all kinds of "surrounding" code besides the model (like evals and fancy metrics) that I used to do with numba/numexpr (cuz CPU-XLA was pretty meh).

Poll below
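A sketch of what this looks like, assuming the NumPy-compilation support that landed around PyTorch 2.1: the function below is pure NumPy, torch.compile traces it, and, as I understand the feature, running it under a CUDA device context moves the compute to GPU. The metric itself is made up.

```python
import numpy as np
import torch

@torch.compile
def pairwise_sq_dists(x, y):
    # pure NumPy "surrounding" code, e.g. part of an eval metric
    return ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)

x, y = np.random.randn(512, 64), np.random.randn(512, 64)
d_cpu = pairwise_sq_dists(x, y)        # compiled, still CPU, returns np.ndarray

if torch.cuda.is_available():
    with torch.device("cuda"):         # same NumPy code, traced onto the GPU
        d_gpu = pairwise_sq_dists(x, y)
```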
Lucas Beyer (bl16) @giffmana:

The answer is that the name is weird, it's simply (almost) the whole flex_attention bwd compute, not just "zeros" as I thought the name implies.

The way to find out would be looking at the call-stack, opening that generated file with the long name, and then go look at the actual…
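One way to do that digging without clicking through profiler UIs, sketched under my own assumptions: make inductor dump the code it generates and grep it for the kernel name. `TORCH_LOGS="output_code"` is a real knob; the flex_attention snippet below is just a stand-in workload.

```python
# Run with:  TORCH_LOGS="output_code" python this_file.py
# and search the dumped Triton source for the oddly named kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # standard score_mod example: mask out future positions
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

compiled_fa = torch.compile(flex_attention)
q = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
k = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
v = torch.randn(2, 8, 128, 64, device="cuda", requires_grad=True)
out = compiled_fa(q, k, v, score_mod=causal)
out.sum().backward()  # triggers the generated bwd kernel in question
```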
Lucas Beyer (bl16) @giffmana:

Cool work uses "visual anagrams": two images of different objects made out of the same image patches.

A model must classify both correctly to score. Hence, higher-scoring models use global geometry, lower-scoring ones use textures.

SigLIP is GOAT of course, or I wouldn't repost this (jk)
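The scoring rule as I read it from the tweet, in a tiny sketch (names and shapes are placeholders, not the paper's benchmark code):

```python
import torch

def anagram_pair_accuracy(logits_a, logits_b, labels_a, labels_b):
    """logits_*: (N, C) predictions for the two arrangements of each pair."""
    correct_a = logits_a.argmax(-1) == labels_a
    correct_b = logits_b.argmax(-1) == labels_b
    return (correct_a & correct_b).float().mean()  # credit only if both are right

acc = anagram_pair_accuracy(
    torch.randn(8, 10), torch.randn(8, 10),
    torch.randint(10, (8,)), torch.randint(10, (8,)),
)
```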
Lucas Beyer (bl16) @giffmana:

Interesting alternative to multi-token prediction, though the figure is a bit unintuitive. Instead of attaching a head for each +d'th prediction, pass a dummy input token for each extra prediction through the model. This is A LOT more expensive, e.g. doing 2-step prediction…
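Back-of-envelope on the cost claim, purely my own illustration: if each of k prediction steps adds a dummy token per position, the transformer processes roughly k times the tokens, so attention compute grows roughly quadratically in k, while extra prediction heads would only add small output projections.

```python
# Rough arithmetic only; assumes dummy-token k-step prediction inflates the
# processed sequence from L to ~k*L, while extra-head prediction keeps L fixed.
def attn_flops_ratio(seq_len: int, k: int) -> float:
    return (k * seq_len) ** 2 / seq_len ** 2  # attention is ~O(L^2)

print(attn_flops_ratio(1024, 2))  # 2-step prediction -> ~4x attention compute
```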

Lucas Beyer (bl16) @giffmana:

Or, in other words, Gemini 2.5 Pro succeeds at 30% of real-world office tasks.

That's pretty good, considering this is the worst Gemini Pro will ever be.