Ekaterina Lobacheva (@katelobacheva)'s Twitter Profile
Ekaterina Lobacheva

@katelobacheva

Postdoc @Mila_Quebec @UMontreal
Like to explain unexpected behavior of neural nets 🤯

ID: 1069184356533583872

Website: https://tipt0p.github.io/ · Joined: 02-12-2018 10:59:49

80 Tweets

513 Followers

356 Following

Sara Hooker (@sarahookr)'s Twitter Profile Photo

I'm starting a new project, working on what I consider to be the most important problem: building thinking machines that adapt and continuously learn. We have an incredibly talent-dense founding team and are hiring for engineering, ops, and design. Join us: adaptionlabs.ai

Mila - Institut québécois d'IA (@mila_quebec)'s Twitter Profile Photo

Mila's annual supervision request process is now open to receive MSc and PhD applications for Fall 2026 admission! For more information, visit mila.quebec/en/prospective…

Atli Kosson (@atlikosson)'s Twitter Profile Photo

The Maximal Update Parameterization (µP) allows LR transfer from small to large models, saving costly tuning. But why is independent weight decay (IWD) essential for it to work?

We find µP stabilizes early training (like an LR warmup), but IWD takes over in the long term! 🧵
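For readers new to the terminology, here is a minimal numpy sketch of the usual distinction between learning-rate-coupled and independent (decoupled) weight decay; the variable names are illustrative and this is not the paper's implementation:

```python
import numpy as np

def step_coupled(w, grad, lr, wd):
    # Coupled (L2-style) decay: the shrinkage term is multiplied by lr,
    # so scaling lr down for a wider model also weakens the regularization.
    return w - lr * (grad + wd * w)

def step_independent(w, grad, lr, wd):
    # Independent decay: the shrinkage wd * w does not scale with lr,
    # so weight norms stay controlled even when µP shrinks per-layer lrs.
    return w - lr * grad - wd * w

# Isolate the decay effect with a zero task gradient and a tiny lr:
w, g = np.array([1.0]), np.array([0.0])
print(step_coupled(w, g, lr=1e-4, wd=0.1))      # [0.99999] -> barely decays
print(step_independent(w, g, lr=1e-4, wd=0.1))  # [0.9]     -> still decays
```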
Ekaterina Lobacheva (@katelobacheva)'s Twitter Profile Photo

A talk on our recent paper on zero-sum learning, now with new links to generalization, circuit emergence, and optimizer dynamics! 🚀

Goodfire (@goodfireai)'s Twitter Profile Photo

LLMs memorize a lot of training data, but memorization is poorly understood.

Where does it live inside models? How is it stored? How much is it involved in different tasks?

Jack Merullo & Srihita Vatsavaya's new paper examines all of these questions using loss curvature! (1/7)
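The tweet doesn't spell out the curvature estimator; one standard way to probe loss curvature, shown here as a hedged sketch rather than the paper's method, is a Hessian-vector product via double backprop:

```python
import torch

def hvp(loss, params, v):
    # Hessian-vector product H @ v via double backprop: differentiate
    # (grad . v) with respect to the parameters a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad((flat * v).sum(), params)
    return torch.cat([h.reshape(-1) for h in hv])

# Sanity check on a quadratic, where the Hessian is 2*I, so H @ v = 2v.
w = torch.randn(3, requires_grad=True)
v = torch.randn(3)
print(hvp((w ** 2).sum(), [w], v))  # matches 2 * v
```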
Andrew Lampinen (@andrewlampinen)'s Twitter Profile Photo

Really cool finding (and paper), and makes a ton of sense! Way back in the day when we/others were trying to make sense of memorization vs. generalization in simpler models, we made an argument that generalizing signals will be top eigenvalues due to shared structure, while 1/2

Verna Dankers (@vernadankers)'s Twitter Profile Photo

Ready for day 3 of #EMNLP2025 🎉🎉 I've been on the lookout for memorization, unlearning, interp, memory module papers & more, chat w me if these topics fascinate you too😻 Looking forward to more of Suzhou, the conf & my BlackboxNLP keynote Sunday 1.45PM! blackboxnlp.github.io/2025/

Goodfire (@goodfireai)'s Twitter Profile Photo

New research: are prompting and activation steering just two sides of the same coin?

Eric Bigelow, Daniel Wurgaft, Ekdeep Singh, and coauthors argue they are: ICL and steering have formally equivalent effects. (1/4)
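As background on the steering side of that equivalence, here is a minimal PyTorch sketch of activation steering via a forward hook (a generic setup, not the paper's exact construction):

```python
import torch

def add_steering_hook(layer, steering_vec, alpha=1.0):
    # Generic activation steering: add a fixed vector to a layer's output
    # on every forward pass.
    def hook(module, inputs, output):
        return output + alpha * steering_vec
    return layer.register_forward_hook(hook)

# Toy usage with a linear layer standing in for a transformer block:
layer = torch.nn.Linear(8, 8)
vec = torch.randn(8)
handle = add_steering_hook(layer, vec, alpha=0.5)
x = torch.randn(1, 8)
steered = layer(x)                 # output shifted by 0.5 * vec
handle.remove()                    # back to the unsteered model
assert torch.allclose(steered, layer(x) + 0.5 * vec)
```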
Irina Saparina (@irisaparina)'s Twitter Profile Photo

Reasoning models are powerful, but they burn thousands of tokens on potentially wrong interpretations for ambiguous requests!

👉 We teach models to think about intent first and provide all interpretations and answers in a single response via RL with dual reward.

🧵1/6
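The thread doesn't define the dual reward precisely; purely as an illustration of the shape such a reward could take (the terms, weights, and names below are assumptions, not the paper's recipe):

```python
def dual_reward(pred_interps, pred_answers, gold_interps, gold_answers,
                w_cov=0.5, w_acc=0.5):
    # Illustrative "dual reward" for ambiguous requests: one term for
    # covering the valid interpretations, one for answering each correctly.
    covered = set(pred_interps) & set(gold_interps)
    coverage = len(covered) / len(gold_interps)
    correct = sum(pred_answers.get(i) == gold_answers[i] for i in covered)
    accuracy = correct / len(gold_interps)
    return w_cov * coverage + w_acc * accuracy

# e.g. an ambiguous request "How long is Harry Potter?" with two readings:
gold_i = {"book_pages", "film_runtime"}
gold_a = {"book_pages": "223", "film_runtime": "152 min"}
print(dual_reward({"book_pages"}, {"book_pages": "223"}, gold_i, gold_a))  # 0.5
```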
Vaishnavh Nagarajan (@_vaishnavh)'s Twitter Profile Photo

1/ We found that deep sequence models memorize atomic facts "geometrically" -- not as an associative lookup table as often imagined. This opens up practical questions on reasoning/memory/discovery, and also poses a theoretical "memorization puzzle."

Chandar Lab (@chandarlab)'s Twitter Profile Photo

Excited to share that we have 3 papers accepted at #ICLR2026! 🇧🇷 Our work this year focuses on efficiency and expressivity: deriving theoretical limits for SSMs, achieving linear scaling for reasoning, and modernizing encoder architectures. A summary of our work 👇 🧵

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

🚨 New Paper Alert: Why does LLM training follow a slow power law?

⁉️ We find that the neural scaling law in training time arises intrinsically from softmax and cross-entropy!

(1/6) When learning peaked (or low-temperature or low-entropy) distributions like next-token distributions, these
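As a toy illustration of the claimed mechanism (not the paper's derivation), gradient descent on cross-entropy over softmax logits toward a one-hot, i.e. maximally peaked, target decays roughly as a power law in time rather than exponentially:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=10)    # logits over 10 "tokens"
y = 0                      # one-hot, maximally peaked, target
lr = 1.0
for t in range(1, 10_001):
    p = softmax(z)
    grad = p.copy()
    grad[y] -= 1.0         # d(cross-entropy)/d(logits) = p - onehot(y)
    z -= lr * grad
    if t in (10, 100, 1_000, 10_000):
        # The loss shrinks roughly ~10x per decade of steps (~1/t),
        # not exponentially fast.
        print(t, -np.log(softmax(z)[y]))
```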