Radek Bartyzal (@radekbartyzal)'s Twitter Profile
Radek Bartyzal

@radekbartyzal

Recommendation Team Lead at GLAMI. Building production ML systems for millions of users.

ID: 1908291367

Link: https://github.com/BartyzalRadek · Joined: 26-09-2013 15:26:52

348 Tweets

179 Followers

170 Following

Jason Weston (@jaseweston)

🚨 Distilling System 2 into System 1🚨
- System 2 LLMs spend compute to improve responses (CoT, BSM, RaR, Sys 2 Attention, ..)
- *System 2 distillation* keeps this improvement but distills it back into the base LLM (System 1) outputs
arxiv.org/abs/2407.06023
🧵(1/5)
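For readers who want the mechanics: a minimal sketch of the distillation loop the thread describes, assuming hypothetical `generate` and `fine_tune` helpers for your own LLM stack — sample System 2 (e.g. CoT) responses on unlabeled prompts, keep only self-consistent answers, and fine-tune the base model to emit those answers directly.

```python
# Minimal sketch of System 2 -> System 1 distillation (hedged; `generate` and
# `fine_tune` are hypothetical stand-ins for your own LLM stack).
from collections import Counter

def distill_system2(prompts, generate, fine_tune, n_samples=8, min_agreement=0.75):
    sft_pairs = []
    for prompt in prompts:
        # System 2: sample several reasoned (e.g. chain-of-thought) responses,
        # keeping only the final answers.
        answers = [
            generate(prompt, chain_of_thought=True, temperature=0.7).final_answer
            for _ in range(n_samples)
        ]
        answer, votes = Counter(answers).most_common(1)[0]
        # Unsupervised filter: keep prompts where the model agrees with itself.
        if votes / n_samples >= min_agreement:
            sft_pairs.append((prompt, answer))  # target carries no reasoning trace
    # System 1: fine-tune the base model to produce the distilled answer directly.
    return fine_tune(sft_pairs)
```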
Daniel Litt (@littmath)

In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks.

Deedy (@deedydas)

Rich Sutton just published his most important essay on AI since The Bitter Lesson: "Welcome to the Era of Experience"

Sutton and his advisee Silver argue that the “era of human data,” dominated by supervised pre‑training and RL‑from‑human‑feedback, has hit diminishing returns;
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex)

> Learnable 3d RoPE
> Block-Causal Attention, Parallel Attention Block
how is this real. A completely unknown company comes out and BTFOs all current videogen with a heavily modified DiT.
DeepSeek or Kimi levels of alpha in the report, go read it
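For context on the terms: a rough sketch of what factorized "learnable 3D RoPE" could look like for video tokens — split each head's channels across the time/height/width axes and apply 1D rotary embeddings per axis, with learnable per-axis frequencies. This is an illustration of the idea, not the report's code.

```python
# Illustrative factorized 3D RoPE for video tokens (assumes head_dim divisible by 6);
# per-axis frequencies are nn.Parameters to mimic "learnable" RoPE.
import torch
import torch.nn as nn

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

class Rope3D(nn.Module):
    def __init__(self, head_dim):
        super().__init__()
        d = head_dim // 3  # channels devoted to each of the t/h/w axes
        base = 1.0 / (10000 ** (torch.arange(0, d, 2).float() / d))
        self.freqs = nn.ParameterList([nn.Parameter(base.clone()) for _ in range(3)])

    def forward(self, x, t, h, w):
        # x: (..., seq, head_dim); t, h, w: (seq,) integer positions of each token
        outs = []
        for chunk, pos, freq in zip(x.chunk(3, dim=-1), (t, h, w), self.freqs):
            ang = pos[:, None].float() * freq[None, :]          # (seq, d/2)
            cos, sin = ang.cos().repeat(1, 2), ang.sin().repeat(1, 2)
            outs.append(chunk * cos + rotate_half(chunk) * sin)
        return torch.cat(outs, dim=-1)
```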
Ravid Shwartz Ziv (@ziv_ravid)

Unfortunately I will not be at ICLR (20h flight, the kids would kill me 😅), but our paper on min-p sampling will be presented as an oral!

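For anyone unfamiliar with the method: min-p keeps only tokens whose probability is at least a fraction p_base of the most likely token's probability, then renormalizes and samples. A minimal NumPy sketch:

```python
# Minimal min-p sampling over a next-token distribution.
import numpy as np

def min_p_sample(probs, p_base=0.1, rng=None):
    rng = rng or np.random.default_rng()
    probs = np.asarray(probs, dtype=np.float64)
    threshold = p_base * probs.max()                  # scales with the model's confidence
    kept = np.where(probs >= threshold, probs, 0.0)   # truncate the unreliable tail
    kept /= kept.sum()                                # renormalize over surviving tokens
    return rng.choice(len(probs), p=kept)

# e.g. min_p_sample([0.70, 0.15, 0.10, 0.04, 0.01], p_base=0.1)
# can only return index 0, 1 or 2: 0.04 and 0.01 fall below 0.1 * 0.70 = 0.07.
```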
Xeophon (@thexeophon)

Starting a series about popular Benchmarks / Evals as a lot of people don't really know what the bars are even showing. 

Starting with GPQA, one of the most popular benches out there. It's well done imo, but you need to know that it's only about bio, physics and chemistry :)
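Concretely, the bars on such plots are plain multiple-choice accuracy. A hedged sketch of how a GPQA-style score is computed; `ask_model` and the item schema here are hypothetical stand-ins, not the official harness:

```python
# Hedged sketch of a GPQA-style multiple-choice eval.
import random

def multiple_choice_accuracy(items, ask_model, seed=0):
    rng = random.Random(seed)
    letters = "ABCD"
    correct = 0
    for item in items:  # {"question": str, "answer": str, "distractors": [str, str, str]}
        options = [item["answer"]] + list(item["distractors"])
        rng.shuffle(options)  # don't let the gold answer sit in a fixed slot
        prompt = (item["question"] + "\n"
                  + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
                  + "\nAnswer with a single letter.")
        choice = ask_model(prompt).strip().upper()[:1]
        if choice == letters[options.index(item["answer"])]:
            correct += 1
    return correct / len(items)  # this is the number behind the bar
```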
Andrej Karpathy (@karpathy)

There's a new paper circulating looking in detail at LMArena leaderboard: "The Leaderboard Illusion" arxiv.org/abs/2504.20879 I first became a bit suspicious when at one point a while back, a Gemini model scored #1 way above the second best, but when I tried to switch for a few

Sara Hooker (@sarahookr)

It is critical for scientific integrity that we trust our measure of progress. 

The lmarena.ai has become the go-to evaluation for AI progress.

Our release today demonstrates the difficulty in maintaining fair evaluations on lmarena.ai, despite best intentions.
Lisan al Gaib (@scaling01)

I'm back and Gemini 2.5 Pro is still the king (no glaze)

I did some more manual data cleaning, scrapped the shitty "average scaled score", and replaced it with a Glicko-2 rating system with these params:
INITIAL_RATING = 1500
INITIAL_RD     = 350
INITIAL_VOL    = 0.06
TAU (τ)        =
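A hedged sketch of how pairwise wins/losses could be turned into per-model Glicko-2 ratings with these parameters. It assumes the `glicko2` PyPI package; its `Player(rating, rd, vol)` constructor and `update_player(ratings, rds, outcomes)` signature are an assumption here, and any Glicko-2 implementation with the same parameters would do.

```python
# Hedged sketch: pairwise results -> Glicko-2 ratings with the params above.
# The glicko2 package interface used below is an assumption.
from collections import defaultdict
import glicko2

INITIAL_RATING, INITIAL_RD, INITIAL_VOL = 1500, 350, 0.06

def rate_models(matches, model_names):
    players = {m: glicko2.Player(rating=INITIAL_RATING, rd=INITIAL_RD, vol=INITIAL_VOL)
               for m in model_names}
    # Collect each model's opponents and outcomes over one rating period.
    period = defaultdict(lambda: ([], [], []))
    for winner, loser in matches:                     # e.g. ("gemini-2.5-pro", "o3")
        for model, opp, outcome in ((winner, loser, 1), (loser, winner, 0)):
            ratings, rds, outcomes = period[model]
            ratings.append(players[opp].rating)       # pre-period opponent rating
            rds.append(players[opp].rd)
            outcomes.append(outcome)
    for model, (ratings, rds, outcomes) in period.items():
        players[model].update_player(ratings, rds, outcomes)  # one update per period
    return {m: (p.rating, p.rd) for m, p in players.items()}
```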
Nolan Dey (@deynolan)

(1/7) Cerebras Paper drop: arxiv.org/abs/2505.01618

TLDR: We introduce CompleteP, which offers depth-wise hyperparameter (HP) transfer (Left), FLOP savings when training deep models (Middle), and a larger range of compute-efficient width/depth ratios (Right).  🧵 👇
Essential AI (@essential_ai)

[1/5]

Muon has recently emerged as a promising second-order optimizer for LLMs. Prior work (e.g. Moonshot) showed that Muon scales. Our in-depth study addresses the practicality of Muon vs AdamW and demonstrates that Muon expands the Pareto frontier over AdamW on the
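For reference, the update being compared against AdamW: Muon applies momentum to each 2D weight's gradient and then approximately orthogonalizes the update with a few Newton-Schulz iterations. A minimal PyTorch sketch following the commonly shared open-source recipe; an illustration, not the paper's code.

```python
# Illustrative Muon-style update: momentum, then approximate orthogonalization
# of the 2D update via Newton-Schulz iterations.
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315        # quintic iteration coefficients
    X = G / (G.norm() + eps)                 # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the "wide" orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)       # heavy-ball momentum on the raw gradient
    param.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```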
Omar Khattab (@lateinteraction)

DSPy's biggest strength is also the reason it can admittedly be hard to wrap your head around. It basically says: LLMs & their methods will continue to improve, but not equally on every axis, so:
- What's the smallest set of fundamental abstractions that allow you to build
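To make that concrete, a small hedged example using the current dspy API (exact names can differ across versions): a signature declares *what* a step consumes and produces, and a module like ChainOfThought decides *how* it is prompted.

```python
# Hedged sketch with the current dspy API (names may differ across versions).
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))   # any supported backend

# The signature says what goes in and out; the module decides how it's prompted.
qa = dspy.ChainOfThought("question -> answer")
pred = qa(question="What does min-p sampling truncate?")
print(pred.answer)
```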

AI at Meta (@aiatmeta)

We’re releasing model weights for our 8B-parameter Dynamic Byte Latent Transformer, an alternative to traditional tokenization methods with the potential to redefine the standards for language model efficiency and reliability. Learn more about how Dynamic Byte Latent
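The core idea behind the byte-latent approach is to replace a fixed tokenizer with variable-size byte "patches", starting a new patch wherever a small byte-level LM is uncertain about the next byte. A rough sketch, with `next_byte_entropies` as a hypothetical stand-in for that byte LM:

```python
# Rough sketch of entropy-based byte patching; `next_byte_entropies` is hypothetical.
def entropy_patches(byte_seq, next_byte_entropies, threshold=2.0, max_patch=16):
    entropies = next_byte_entropies(byte_seq)   # one predictive entropy per byte
    patches, current = [], []
    for b, h in zip(byte_seq, entropies):
        # Start a new patch at bytes the byte-LM finds hard to predict
        # (or when the current patch gets too long).
        if current and (h > threshold or len(current) >= max_patch):
            patches.append(bytes(current))
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```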

Lisan al Gaib (@scaling01)

OPUS 4 NEW SOTA ON ARC-AGI-2

IT'S HAPPENING - I WAS RIGHT

Claude 4 models are the first models that effectively use test-time-compute for ARC-AGI-2