Radek Bartyzal (@radekbartyzal) Twitter Tweets • TwiCopy

Jason Weston

a year ago

🚨 Distilling System 2 into System 1 🚨
- System 2 LLMs spend compute to improve responses (CoT, BSM, RaR, Sys 2 Attention, ..)
- *System 2 distillation* keeps this improvement but distills it back into the base LLM (System 1) outputs
arxiv.org/abs/2407.06023 🧵(1/5)

Likes: 468 · Replies: 1 · Reposts: 91

Daniel Litt

@littmath

7 months ago

In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.

Likes: 773 · Replies: 23 · Reposts: 72

Deedy

@deedydas

7 months ago

Rich Sutton just published his most important essay on AI since The Bitter Lesson: "Welcome to the Era of Experience" Sutton and his advisee Silver argue that the “era of human data,” dominated by supervised pre‑training and RL‑from‑human‑feedback, has hit diminishing returns;

Likes: 1.1K · Replies: 40 · Reposts: 196

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)

@teortaxestex

7 months ago

> Learnable 3d RoPE
> Block-Causal Attention, Parallel Attention Block
how is this real. A completely unknown company comes out and BTFOs all current videogen with a heavily modified DiT. DeepSeek or Kimi levels of alpha in the report, go read it

Likes: 268 · Replies: 8 · Reposts: 22

Ravid Shwartz Ziv

@ziv_ravid

7 months ago

Unfortunately I will not be at ICLR (20h flight, the kids would kill me 😅), but our paper on min-p sampling will be presented as an oral!

Likes: 211 · Replies: 2 · Reposts: 11

Xeophon

@thexeophon

7 months ago

Starting a series about popular Benchmarks / Evals as a lot of people don't really know what the bars are even showing. Starting with GPQA, one of the most popular benches out there. It's well done imo, but you need to know that it's only about bio, physics and chemistry :)

Likes: 414 · Replies: 14 · Reposts: 58

Andrej Karpathy

@karpathy

7 months ago

There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

Likes: 4.4K · Replies: 192 · Reposts: 429

Sara Hooker

@sarahookr

7 months ago

It is critical for scientific integrity that we trust our measure of progress. The lmarena.ai has become the go-to evaluation for AI progress. Our release today demonstrates the difficulty in maintaining fair evaluations on lmarena.ai, despite best intentions.

Likes: 712 · Replies: 21 · Reposts: 132

Lisan al Gaib

@scaling01

7 months ago

I'm back and Gemini 2.5 Pro is still the king (no glaze)
I did some more manual data cleaning and scrapped the shitty "average scaled score" and replaced it with Glicko-2 rating system with params:
INITIAL_RATING = 1500
INITIAL_RD = 350
INITIAL_VOL = 0.06
TAU (τ) =

Likes: 398 · Replies: 29 · Reposts: 35

Nolan Dey

@deynolan

7 months ago

(1/7) Cerebras Paper drop: arxiv.org/abs/2505.01618 TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right). 🧵 👇

Likes: 407 · Replies: 12 · Reposts: 68

Essential AI

@essential_ai

7 months ago

[1/5] Muon has recently emerged as a promising second-order optimizer for LLMs. Prior work (e.g. Moonshot) showed that Muon scales. Our in-depth study addresses the practicality of Muon vs AdamW and demonstrates that Muon expands the Pareto frontier over AdamW on the

Likes: 142 · Replies: 1 · Reposts: 21

Omar Khattab

@lateinteraction

6 months ago

DSPy's biggest strength is also the reason it can admittedly be hard to wrap your head around it. It's basically saying: LLMs & their methods will continue to improve but not equally in every axis, so: - What's the smallest set of fundamental abstractions that allow you to build

Likes: 869 · Replies: 46 · Reposts: 125

AI at Meta

@aiatmeta

6 months ago

We’re releasing model weights for our 8B-parameter Dynamic Byte Latent Transformer, an alternative to traditional tokenization methods with the potential to redefine the standards for language model efficiency and reliability. Learn more about how Dynamic Byte Latent

Likes: 1.1K · Replies: 43 · Reposts: 251

Lisan al Gaib

@scaling01

6 months ago

OPUS 4 NEW SOTA ON ARC-AGI-2 IT'S HAPPENING - I WAS RIGHT Claude 4 models are the first models that effectively use test-time-compute for ARC-AGI-2

Likes: 1.1K · Replies: 60 · Reposts: 79