L (@codetitanium)'s Twitter Profile
@codetitanium

ID: 1685205841417560064

Joined: 29-07-2023 08:32:01

1.1K Tweets

89 Followers

3.3K Following

Antonio Orvieto (@orvieto_antonio)

Adam is similar to many algorithms, but cannot be effectively replaced by any simpler variant in LMs.
The community is starting to get the recipe right, but what is the secret sauce?

<a href="/gowerrobert/">Robert M. Gower 🇺🇦</a> and I found that it has to do with the beta parameters and variational inference.
Ali Behrouz (@behrouz_ali)

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to
Ali Behrouz (@behrouz_ali)

<a href="/mirrokni/">Vahab Mirrokni</a> <a href="/meisamrr/">Meisam Razaviyayn</a> Even with a powerful surprise metric and enhanced memory capacity, the memory needs to properly be updated and optimized. In fact, a bad update rule can cause the memory to be stuck in local optima and so does not properly memorize the context. While almost all models are based
Mengyue Yang ✈️ ICLR 2025 (@mengyue_yang_)

Curious how training data order impacts LLMs without retraining?

Introducing FUT:

🔍 Estimate the effects of any sample order on model performance
🎯 Design optimal curricula & analyze memorization/generalization
⚡️ Up to 130x faster than retraining, <0.02 error
Read the paper
Grad (@grad62304977)

Finally some really exciting architecture work which focuses on actual training speed, efficiency, and performance. Really feel like this is the path towards continual learning for LLMs. Congrats! (and obv Songlin Yang is on it bruh)

jianlin.su (@jianlin_s)

kexue.fm/archives/11006 introduces the idea of using a matrix together with its msign to perform general operations on its singular values, including singular-value clipping, step functions, and arbitrary polynomials (not just odd polynomials). leloy! You Jiacheng rohan anil
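
For context, msign(W) = U V^T for W = U S V^T, and it can be computed without an SVD by a Newton-Schulz iteration; combining W with its msign then lets you act on the singular values directly. The sketch below uses the simple cubic iteration and a step function on the singular values as the example; the normalization and iteration count are generic choices, not necessarily those in the linked post:

```python
import numpy as np

def msign(W, iters=40):
    """Approximate msign(W) = U V^T (for W = U S V^T) via cubic Newton-Schulz.

    Frobenius normalization puts all singular values in (0, 1], where
    X <- 1.5 X - 0.5 X X^T X drives them to 1. Convergence is slow when
    W has singular values near zero (a caveat of this sketch).
    """
    X = W / (np.linalg.norm(W) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

# Example of a non-odd operation: a step function on the singular values.
# msign(W - t * msign(W)) = U sign(S - t I) V^T marks which sigma_i exceed t.
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
t = 1.0
step = msign(W - t * msign(W))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
print(np.round(np.diag(U.T @ step @ Vt.T), 2))  # should match np.sign(S - t)
```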

Epoch AI (@epochairesearch)

How do reasoning models solve hard math problems? 

We asked 14 mathematicians to review o3-mini-high’s raw, unsummarized reasoning traces on 29 FrontierMath problems. Here’s what they found:
Aran Komatsuzaki (@arankomatsuzaki)

Truncated Proximal Policy Optimization

- Improves the training efficiency of reasoning LLMs by up to 2.5x and outperforms its existing competitors.

- (1) Enhances GPU throughput, (2) completely eliminates persistent bias in value-function estimation, and (3) substantially
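
For reference, TPPO is built on PPO's clipped surrogate objective; the sketch below shows only that vanilla objective for context, not the truncation scheme or the value-bias fix the paper introduces:

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    logp_new / logp_old: per-token log-probs under the current / rollout policy.
    Ratios outside [1 - eps, 1 + eps] are clipped, which caps how far a single
    update can push the policy. This is the baseline that TPPO modifies.
    """
    ratio = np.exp(logp_new - logp_old)                      # importance ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))          # pessimistic bound

# toy check: a ratio far outside the clip range stops contributing
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
logp_new = np.log(np.array([0.6, 0.5, 0.05]))
print(ppo_clip_loss(logp_new, logp_old, np.array([1.0, -0.5, 2.0])))
```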
Kimi.ai (@kimi_moonshot)

Meet Kimi-Researcher - an autonomous agent that excels at multi-turn search and reasoning. Powered by k1.5 and trained with end-to-end agentic RL.

Achieved 26.9% pass@1 on Humanity's Last Exam, 69% pass@1 on xbench.

🔗 Tech blog: moonshotai.github.io/Kimi-Researche…
leloy! (@leloykun)

Fast, Numerically Stable, and Auto-Differentiable Spectral Clipping via Newton-Schulz Iteration

Hi all, I'm bacc. I have a lot to talk about, but let's start with this fun side-project.

Here I'll talk about novel (?) ways to compute:
1. Spectral Clipping (discussed in Rohan's
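
For context, spectral clipping maps W = U S V^T to U min(S, alpha) V^T. One SVD-free route uses min(s, a) = (s + a - |s - a|) / 2 together with msign computed by Newton-Schulz; the construction below is a sketch derived from that identity and is not necessarily the one used in the thread:

```python
import numpy as np

def msign(W, iters=40):
    """msign(W) = U V^T via the cubic Newton-Schulz iteration (SVD-free)."""
    X = W / (np.linalg.norm(W) + 1e-12)
    for _ in range(iters):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def spectral_clip(W, alpha=1.0):
    """Clip singular values of W at alpha, i.e. return U min(S, alpha) V^T.

    Uses min(s, a) = 0.5 * (s + a - |s - a|) and the identity
    U |S - a I| V^T = msign(W) @ M.T @ msign(M) with M = W - a * msign(W).
    Accuracy degrades when some sigma_i is very close to alpha, because
    msign(M) then converges slowly -- a caveat of this sketch.
    """
    O = msign(W)                      # U V^T
    M = W - alpha * O                 # U (S - alpha I) V^T
    abs_part = O @ M.T @ msign(M)     # U |S - alpha I| V^T
    return 0.5 * (W + alpha * O - abs_part)

# check against an explicit SVD
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 5))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
ref = U @ np.diag(np.minimum(S, 1.0)) @ Vt
print(np.max(np.abs(spectral_clip(W) - ref)))  # small if no sigma_i sits near alpha
```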
Nouha Dziri (@nouhadziri)

📢 Can LLMs really reason outside the box in math? Or are they just remixing familiar strategies? 

Remember, DeepSeek R1 and o1 have impressed us on Olympiad-level math, but they were also failing at simple arithmetic 😬

 We built a benchmark to find out → OMEGA Ω 📐

💥 We found
Sauers (@sauers_)

Anthropic prepretraining pipeline: "As a preliminary step towards training, engineers browsed books and bibliographic metadata to learn what languages the books were written in, what subjects they concerned, whether they were by famous authors or not, and so on — sometimes by

Tilde (@tilderesearch)

Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍

We reverse-engineered them to uncover:
- Novel attention patterns
- Hidden "attention sinks"
- Better performance
- And more

A 🧵… ~1/8~
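
For context, MoBA-style sparse attention routes each query to a small number of key/value blocks, scored against a mean-pooled key per block, instead of attending to every token. A minimal single-head sketch of that gating (the block size, top-k, and mean-pool score follow the commonly described recipe; causal masking and NSA's extra branches are omitted):

```python
import numpy as np

def block_sparse_attention(Q, K, V, block=4, topk=2):
    """Single-head MoBA-style sparse attention (illustrative, not the papers' code).

    Each query scores the mean-pooled keys of every block, keeps the top-k blocks,
    and runs softmax attention only over the keys in those blocks.
    """
    T, d = Q.shape
    n_blocks = T // block
    K_blk = K[: n_blocks * block].reshape(n_blocks, block, d).mean(axis=1)  # (n_blocks, d)
    gate = Q @ K_blk.T                                  # (T, n_blocks) block affinities
    keep = np.argsort(-gate, axis=1)[:, :topk]          # top-k block ids per query

    out = np.zeros_like(Q)
    for t in range(T):
        idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in keep[t]])
        scores = Q[t] @ K[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max()); w /= w.sum() # softmax over selected keys only
        out[t] = w @ V[idx]
    return out

# toy usage
rng = np.random.default_rng(0)
T, d = 16, 8
Q, K, V = rng.normal(size=(3, T, d))
print(block_sparse_attention(Q, K, V).shape)  # (16, 8)
```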

Simo Ryu (@cloneofsimo)

Wait what, this paper has 4 citations and these guys decided to scale this to billion-parameter scale with an efficient Triton implementation? Incredible. Huge respect...