Arnaud Autef (@arnaud_autef) 's Twitter Profile
Arnaud Autef

@arnaud_autef

Research Engineer @GoogleDeepMind working on Gemini Flash⚡

ID: 1169004884982730753

Joined: 03-09-2019 21:51:09

748 Tweets

240 Followers

346 Following

Andrew Gordon Wilson (@andrewgwils) 's Twitter Profile Photo

My new paper "Deep Learning is Not So Mysterious or Different": arxiv.org/abs/2503.02113. Generalization behaviours in deep learning can be intuitively understood through a notion of soft inductive biases, and formally characterized with countable hypothesis bounds! 1/12

Jyo Pari (@jyo_pari) 's Twitter Profile Photo

What if an LLM could update its own weights?

Meet SEAL🦭: a framework where LLMs generate their own training data (self-edits) to update their weights in response to new inputs.

Self-editing is learned via RL, using the updated model’s downstream performance as reward.
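The loop the tweet describes can be caricatured in a few lines. This is a hedged sketch, not SEAL's implementation: the helpers `finetune`, `evaluate`, and `policy_update` are hypothetical stand-ins for the inner-loop weight update, the downstream evaluation, and the RL step.

```python
def seal_outer_loop(model, tasks, finetune, evaluate, policy_update):
    """Cartoon of a SEAL-style loop: the model writes its own training
    data ('self-edits'), a copy is finetuned on them, and downstream
    performance becomes the RL reward. All helpers are hypothetical."""
    for task in tasks:
        self_edits = model.generate(task.context)      # model-authored data
        updated = finetune(model, self_edits)          # inner-loop update
        reward = evaluate(updated, task)               # downstream score
        model = policy_update(model, task, self_edits, reward)  # RL step
    return model
```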
Aurko Roy (@happylemon56775) 's Twitter Profile Photo

Excited to share what I worked on during my time at Meta.

- We introduce a Triton-accelerated Transformer with *2-simplicial attention*—a tri-linear generalization of dot-product attention

- We show how to adapt RoPE to tri-linear forms

- We show 2-simplicial attention scales
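To make "tri-linear generalization of dot-product attention" concrete, here is a minimal NumPy sketch of one plausible formulation: logits are a trilinear form over one query and two keys, the softmax runs jointly over key pairs, and value pairs are combined elementwise. The `1/d` scaling and the value combination are assumptions, not necessarily the paper's exact choices.

```python
import numpy as np

def two_simplicial_attention(Q, K1, K2, V1, V2):
    """Sketch of 2-simplicial (tri-linear) attention.

    A[i, j, k] = sum_d Q[i,d] * K1[j,d] * K2[k,d]; softmax over (j, k).
    Scaling and value combination are illustrative assumptions.
    """
    n, d = Q.shape
    logits = np.einsum('id,jd,kd->ijk', Q, K1, K2) / d  # assumed 1/d scale
    flat = logits.reshape(n, -1)
    p = np.exp(flat - flat.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)                  # joint softmax over pairs
    p = p.reshape(n, n, n)
    # Combine value pairs elementwise (one possible choice)
    return np.einsum('ijk,jd,kd->id', p, V1, V2)
```

Note the O(n^3) logit tensor: this is why a Triton-accelerated kernel matters at scale.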
Stas Bekman (@stasbekman) 's Twitter Profile Photo

Got a chance to measure Maximum Achievable Matmul TFLOPS on NVIDIA B200.

With each new NVIDIA generation the efficiency keeps on dropping:

A100: 86.9%
H100: 80.3%
B200: 77.6%

The updated table is here: github.com/stas00/ml-engi…
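The measurement idea is simple: time a large matmul, convert to achieved TFLOPS via the ~2·n³ FLOP count of a dense matmul, and divide by the datasheet peak. The sketch below is a hypothetical CPU/NumPy stand-in for the GPU benchmark (real GPU timing needs device synchronization and careful warm-up):

```python
import time
import numpy as np

def achieved_matmul_tflops(n=1024, iters=10):
    """Time an n x n float32 matmul and convert to TFLOPS.
    A dense matmul costs ~2*n^3 FLOPs. Illustrative only."""
    a = np.random.standard_normal((n, n)).astype(np.float32)
    b = np.random.standard_normal((n, n)).astype(np.float32)
    a @ b                                   # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        a @ b
    dt = (time.perf_counter() - t0) / iters
    return 2 * n**3 / dt / 1e12

def efficiency_pct(achieved_tflops, peak_tflops):
    """Efficiency = achieved / theoretical peak, e.g. ~77.6% on B200."""
    return 100 * achieved_tflops / peak_tflops
```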
Deedy (@deedydas) 's Twitter Profile Photo

🚨 BREAKING: Detailed list of all 44 people in Meta's Superintelligence team.

— 50% from China
— 75% have PhDs, 70% Researchers
— 40% from OpenAI, 20% DeepMind, 15% Scale
— 20% L8+ level
— 75% 1st gen immigrants

Each of these people is likely getting paid $10-$100M/yr.
Rohan Paul (@rohanpaul_ai) 's Twitter Profile Photo

Beautiful Google Research paper.

LLMs can learn in context from examples in the prompt, can pick up new patterns while answering, yet their stored weights never change.

That behavior looks impossible if learning always means gradient descent.

The mechanisms through which this
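The thread is cut off here, but the phenomenon it describes is easy to caricature: with frozen parameters, a softmax-attention readout over prompt examples is kernel (Nadaraya-Watson) regression, so changing the prompt changes the prediction with zero gradient steps. This is an assumption-laden toy, not the paper's mechanism:

```python
import numpy as np

def attention_readout(prompt_x, prompt_y, query_x, temp=1.0):
    """In-context 'learning' with frozen parameters: attend over the
    prompt examples and return a weighted average of their labels.
    No weight is ever updated; only the prompt changes behavior."""
    logits = prompt_x @ query_x / temp          # similarity scores
    w = np.exp(logits - logits.max())
    w /= w.sum()                                # attention weights
    return w @ prompt_y                         # weighted label average
```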
mgostIH (@mgostih) 's Twitter Profile Photo

This was a very good read yesterday, they fix not only GRPO, but most RL on sequences. They found a bug in a common definition of importance sampling for sequence models and fix just that in a very simple way. No hack required, just correct math!

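The tweet doesn't name the paper, but the distinction it points at is easy to state generically: for a reward defined on a whole sequence, the correct importance weight is the ratio of full sequence probabilities (the product of per-token ratios), whereas many GRPO-style implementations weight each token's advantage with its own per-token ratio. A minimal illustration of the two definitions, assuming log-probabilities per token:

```python
import numpy as np

def token_level_ratios(logp_new, logp_old):
    """Per-token ratios r_t = pi_new(a_t|.) / pi_old(a_t|.), the form
    many implementations apply token-wise."""
    return np.exp(logp_new - logp_old)

def sequence_level_ratio(logp_new, logp_old):
    """The importance weight for a sequence-level reward:
    pi_new(y|x) / pi_old(y|x) = product of per-token ratios."""
    return np.exp(np.sum(logp_new - logp_old))
```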
David McAllister (@davidrmcall) 's Twitter Profile Photo

Excited to share Flow Matching Policy Gradients: expressive RL policies trained from rewards using flow matching. It’s an easy, drop-in replacement for Gaussian PPO on control tasks.

Jacob Austin (@jacobaustin132) 's Twitter Profile Photo

Today we're putting out an update to the JAX TPU book, this time on GPUs. How do GPUs work, especially compared to TPUs? How are they networked? And how does this affect LLM training? 1/n

@levelsio (@levelsio) 's Twitter Profile Photo

The older I get the more I realize that in most countries these days governments are more like a mafia that just extorts more and more money from the people Before I saw the government as a benevolent group of people that represented the people, and in return for tax payments

tensorqt (@tensorqt) 's Twitter Profile Photo

attention sinks may be a bias in causal transformers. 

as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for which attention sinks may be an
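The thread is truncated, but the quantity under discussion is easy to measure: the average softmax mass each query puts on the first token under a causal mask. A simple diagnostic sketch (my own, not the blogpost's analysis):

```python
import numpy as np

def sink_mass(scores):
    """Given a T x T matrix of attention logits, apply a causal mask,
    softmax per query, and return the mean attention mass on token 0,
    a crude measure of an 'attention sink'."""
    T = scores.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # future positions
    s = np.where(mask, -np.inf, scores)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p[:, 0].mean()
```

With uniform (zero) logits, query t spreads mass evenly over its t+1 visible tokens, so any measured value well above that baseline would indicate a sink.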
Alexia Jolicoeur-Martineau (@jm_alexia) 's Twitter Profile Photo

New paper 📜: Tiny Recursion Model (TRM) is a recursive reasoning approach with a tiny 7M parameters neural network that obtains 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating most LLMs. Blog: alexiajm.github.io/2025/09/29/tin… Code: github.com/SamsungSAILMon… Paper: arxiv.org/abs/2510.04871