L (@codetitanium)'s Twitter Profile
@codetitanium

ID: 1685205841417560064

Joined: 29-07-2023 08:32:01

1.1K Tweets

89 Followers

3.3K Following

Antonio Orvieto (@orvieto_antonio)

Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs.
The community is starting to get the recipe right, but what is the secret sauce?

<a href="/gowerrobert/">Robert M. Gower 🇺🇦</a> and I found that it has to do with the beta parameters and variational inference.
Ali Behrouz (@behrouz_ali)

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
Ali Behrouz (@behrouz_ali)

<a href="/mirrokni/">Vahab Mirrokni</a> <a href="/meisamrr/">Meisam Razaviyayn</a> Even with a powerful surprise metric and enhanced memory capacity, the memory needs to properly be updated and optimized. In fact, a bad update rule can cause the memory to be stuck in local optima and so does not properly memorize the context. While almost all models are based
Mengyue Yang ✈️ ICLR 2025 (@mengyue_yang_)

Curious how training data order impacts LLMs without retraining?

Introducing FUT:

🔍 Estimate the effects of any sample order on model performance
🎯 Design optimal curricula & analyze memorization/generalization
⚡️ Up to 130x faster than retraining, <0.02 error
Read the paper
Grad (@grad62304977)

Finally some really exciting architecture work which focuses on actual training speed, efficiency, and performance. Really feel like this is the path towards continual learning for LLMs. Congrats! (and obv Songlin Yang is on it bruh)

jianlin.su (@jianlin_s)

kexue.fm/archives/11006 introduces the idea of using a matrix together with its msign to perform general operations on its singular values, including singular-value clipping, step functions, and arbitrary polynomials (not just odd polynomials). leloy! You Jiacheng rohan anil
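
For context, msign(W) = U V^T for W = U S V^T, and it can be computed without an SVD by a Newton-Schulz iteration; combining W with its msign then lets you act on the singular values directly. The sketch below uses the simple cubic iteration and a step function on the singular values as the example; the normalization and iteration count are generic choices, not necessarily those in the linked post:

```python
import numpy as np

def msign(W, iters=40):
    """Approximate msign(W) = U V^T (for W = U S V^T) via cubic Newton-Schulz.

    Frobenius normalization puts all singular values in (0, 1], where
    X <- 1.5 X - 0.5 X X^T X drives them to 1. Convergence is slow when
    W has singular values near zero (a caveat of this sketch).
    """
    X = W / (np.linalg.norm(W) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Example of a non-odd operation: a step function on the singular values.
# msign(W - t * msign(W)) = U sign(S - t I) V^T marks which sigma_i exceed t.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
t = 1.0
step = msign(W - t * msign(W))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
print(np.round(np.diag(U.T @ step @ Vt.T), 2))  # should match np.sign(S - t)
```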

Epoch AI (@epochairesearch)

How do reasoning models solve hard math problems? 

We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
Aran Komatsuzaki (@arankomatsuzaki)

Truncated Proximal Policy Optimization

- Improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.

- (1) Enhances GPU throughput, (2) completely eliminates persistent bias in value-function estimation, and (3) substantially
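
For reference, TPPO is built on PPO's clipped surrogate objective; the sketch below shows only that vanilla objective for context, not the truncation scheme or the value-bias fix the paper introduces:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    logp_new / logp_old: per-token log-probs under the current / rollout policy.
    Ratios outside [1 - eps, 1 + eps] are clipped, which caps how far a single
    update can push the policy. This is the baseline that TPPO modifies.
    """
    ratio = np.exp(logp_new - logp_old)                      # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))          # pessimistic bound

# toy check: a ratio far outside the clip range stops contributing
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
logp_new = np.log(np.array([0.6, 0.5, 0.05]))
print(ppo_clip_loss(logp_new, logp_old, np.array([1.0, -0.5, 2.0])))
```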
Kimi.ai (@kimi_moonshot)

Meet Kimi-Researcher - an autonomous agent that excels at multi-turn search and reasoning. Powered by k1.5 and trained with end-to-end agentic RL.

Achieved 26.9% pass@1 on Humanity's Last Exam, 69% pass@1 on xbench.

🔗 Tech blog: moonshotai.github.io/Kimi-Researche…
leloy! (@leloykun)

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

Hi all, I'm bacc. I have a lot to talk about, but let's start with this fun side-project.

Here I'll talk about novel (?) ways to compute:
1. Spectral Clipping (discussed in Rohan's
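
For context, spectral clipping maps W = U S V^T to U min(S, alpha) V^T. One SVD-free route uses min(s, a) = (s + a - |s - a|) / 2 together with msign computed by Newton-Schulz; the construction below is a sketch derived from that identity and is not necessarily the one used in the thread:

```python
import numpy as np

def msign(W, iters=40):
    """msign(W) = U V^T via the cubic Newton-Schulz iteration (SVD-free)."""
    X = W / (np.linalg.norm(W) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def spectral_clip(W, alpha=1.0):
    """Clip singular values of W at alpha, i.e. return U min(S, alpha) V^T.

    Uses min(s, a) = 0.5 * (s + a - |s - a|) and the identity
    U |S - a I| V^T = msign(W) @ M.T @ msign(M) with M = W - a * msign(W).
    Accuracy degrades when some sigma_i is very close to alpha, because
    msign(M) then converges slowly -- a caveat of this sketch.
    """
    O = msign(W)                      # U V^T
    M = W - alpha * O                 # U (S - alpha I) V^T
    abs_part = O @ M.T @ msign(M)     # U |S - alpha I| V^T
    return 0.5 * (W + alpha * O - abs_part)

# check against an explicit SVD
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
ref = U @ np.diag(np.minimum(S, 1.0)) @ Vt
print(np.max(np.abs(spectral_clip(W) - ref)))  # small if no sigma_i sits near alpha
```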
Nouha Dziri (@nouhadziri)

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember, DeepSeek R1 and o1 have impressed us on Olympiad-level math, but they were also failing at simple arithmetic 😬

 We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Sauers (@sauers_)

Anthropic prepretraining pipeline: "As a preliminary step towards training, engineers browsed books and bibliographic metadata to learn what languages the books were written in, what subjects they concerned, whether they were by famous authors or not, and so on — sometimes by

Tilde (@tilderesearch)

Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍

We reverse-engineered them to uncover:
- Novel attention patterns
- Hidden "attention sinks"
- Better performance
- And more

A 🧵… ~1/8~
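
For context, MoBA-style sparse attention routes each query to a small number of key/value blocks, scored against a mean-pooled key per block, instead of attending to every token. A minimal single-head sketch of that gating (the block size, top-k, and mean-pool score follow the commonly described recipe; causal masking and NSA's extra branches are omitted):

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=4, topk=2):
    """Single-head MoBA-style sparse attention (illustrative, not the papers' code).

    Each query scores the mean-pooled keys of every block, keeps the top-k blocks,
    and runs softmax attention only over the keys in those blocks.
    """
    T, d = Q.shape
    n_blocks = T // block
    K_blk = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)  # (n_blocks, d)
    gate = Q @ K_blk.T                                  # (T, n_blocks) block affinities
    keep = np.argsort(-gate, axis=1)[:, :topk]          # top-k block ids per query

    out = np.zeros_like(Q)
    for t in range(T):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep[t]])
        scores = Q[t] @ K[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max()); w /= w.sum() # softmax over selected keys only
        out[t] = w @ V[idx]
    return out

# toy usage
rng = np.random.default_rng(0)
T, d = 16, 8
Q, K, V = rng.normal(size=(3, T, d))
print(block_sparse_attention(Q, K, V).shape)  # (16, 8)
```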

Simo Ryu (@cloneofsimo)

Wait what, this paper has 4 citations and these guys decided to scale this to billion-parameter scale with an efficient Triton implementation? Incredible. Huge respect...