Kazz (@kazzorr_)'s Twitter Profile
Kazz

@kazzorr_

math, physics, AI/ML

ID: 1970150615881048064

Joined: 22-09-2025 15:38:39

1.1K Tweets

49 Followers

103 Following

Ian Osband (@ianosband)

Something is rotten with policy gradient.

PG has become *the* RL loss for LLMs. But it’s not even good at basic RL.

Even on MNIST with bandit feedback, vanilla PG performs far worse than cross-entropy because it wastes gradient budget.

Delightful Policy Gradient:
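
A minimal sketch of the contrast described above, assuming the usual "MNIST as a 10-armed bandit" setup: the model samples a label and only observes whether it was right. Names and shapes here are illustrative, not from the paper:

```python
import torch
import torch.nn.functional as F

# Toy contrast between vanilla policy gradient and cross-entropy on
# "MNIST with bandit feedback": sample a label, observe reward 1 only
# if it was correct.

def pg_loss(logits, actions, rewards):
    # REINFORCE: only the sampled action's log-prob gets a gradient,
    # and zero-reward samples contribute nothing at all.
    logp = F.log_softmax(logits, dim=-1)
    logp_a = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(rewards * logp_a).mean()

def ce_loss(logits, labels):
    # Full supervision: every logit gets a gradient on every example.
    return F.cross_entropy(logits, labels)

model = torch.nn.Linear(784, 10)         # stand-in for an MNIST classifier
x = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

logits = model(x)
actions = torch.distributions.Categorical(logits=logits).sample()
rewards = (actions == labels).float()    # bandit feedback only

pg_loss(logits, actions, rewards).backward()
```
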
Thariq (@trq212)

I put a lot of heart into my technical writing; I hope it's useful to you all. 📌 Here's a pinned thread of everything I've written. (much of this will be posted on the Claude blog soon as well)

Simplifying Complexity (@simplifyinai)

🚨 BREAKING: Tencent has killed the “next-token” paradigm.

Tencent and Tsinghua have released CALM (Continuous Autoregressive Language Models), and it completely disrupts the next-token paradigm.

LLMs currently waste massive amounts of compute predicting discrete, single tokens
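
A rough sketch of the idea as described: compress chunks of K tokens into continuous vectors with an autoencoder, then autoregress over vectors so one forward step covers K tokens. The modules and MSE objectives below are placeholders, not CALM's actual design:

```python
import torch
import torch.nn as nn

# Illustrative only: compress K tokens into one continuous vector and
# autoregress over vectors, so one forward step covers K tokens.

K, d_tok, d_vec = 4, 64, 128
embed = nn.Embedding(1000, d_tok)
encoder = nn.Linear(K * d_tok, d_vec)              # K tokens -> 1 vector
decoder = nn.Linear(d_vec, K * d_tok)              # 1 vector -> K tokens
predictor = nn.GRU(d_vec, d_vec, batch_first=True) # next-vector model

tokens = torch.randint(0, 1000, (8, 8 * K))        # 8 chunks of K tokens
chunks = embed(tokens).reshape(8, 8, K * d_tok)
vecs = encoder(chunks)                             # (8, 8, d_vec)

ae_loss = ((decoder(vecs) - chunks) ** 2).mean()   # autoencoder fidelity
pred, _ = predictor(vecs[:, :-1])                  # predict the next vector
ar_loss = ((pred - vecs[:, 1:]) ** 2).mean()       # placeholder objective
(ae_loss + ar_loss).backward()
```
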
alphaXiv (@askalphaxiv)

"Foundations of Schrödinger Bridges for Generative Modeling" This paper shows that diffusion models, score-based models, and flow matching are really just different views of the same core idea: a Schrödinger bridge that moves noise into data along the most efficient stochastic

"Foundations of Schrödinger Bridges for Generative Modeling"

This paper shows that diffusion models, score-based models, and flow matching are really just different views of the same core idea: a Schrödinger bridge that moves noise into data along the most efficient stochastic
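
For reference, the standard static Schrödinger bridge problem behind this framing: find the path measure closest in KL to a reference Brownian motion, pinned to the noise and data marginals:

```latex
% Static Schrödinger bridge: the path measure closest (in KL) to a
% reference Brownian motion W, pinned to the noise and data marginals.
\min_{P}\; \mathrm{KL}\!\left(P \,\middle\|\, W\right)
\quad \text{subject to} \quad
P_0 = p_{\mathrm{noise}}, \qquad P_1 = p_{\mathrm{data}}
```
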
Emiel Hoogeboom (@emiel_hoogeboom)

You may think discrete distillation is fundamentally flawed; you are (surprisingly) wrong. 🤯

Meet Discrete Moment Distillation (D-MMD). It is a new method that brings fast, few-step sampling to discrete diffusion models! 🧵👇
Unsloth AI (@unslothai)

You can now train Qwen3.5 with RL in our free notebook!

You just need 8GB VRAM to RL Qwen3.5-2B locally!

Qwen3.5 will learn to solve math problems autonomously via vision GRPO.

RL Guide: unsloth.ai/docs/get-start…
GitHub: github.com/unslothai/unsl…

Qwen3-4B: colab.research.google.com/github/unsloth…
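
The GRPO part is easy to sketch: rewards are standardized within a group of rollouts for the same prompt, so no learned value network is needed. A minimal version of that advantage computation (independent of Unsloth's actual notebook code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: (num_prompts, G) for G rollouts per prompt. Each rollout's
    # advantage is its reward standardized within its own group.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four rollouts each, binary "solved the problem" rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```
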
Lucas Maes (@lucasmaes_)

JEPAs are finally easy to train end-to-end without any tricks! Excited to introduce LeWorldModel: a stable, end-to-end JEPA that learns world models directly from pixels, no heuristics. 15M params, 1 GPU, and full planning in <1 second. 📑: le-wm.github.io
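
For context, the generic JEPA training step this announcement presupposes: predict the target view's embedding in latent space, with the target encoder held out of the gradient (here via an EMA copy). A generic sketch, not LeWorldModel's actual architecture:

```python
import copy
import torch
import torch.nn as nn

# Generic JEPA step: predict the target view's embedding in latent
# space; the target encoder is an EMA copy that gets no gradient,
# which is the usual guard against representation collapse.

encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 15 * 15, 256),
)
target_encoder = copy.deepcopy(encoder)
predictor = nn.Linear(256, 256)

ctx = torch.randn(16, 3, 64, 64)   # context view (e.g., frame t)
tgt = torch.randn(16, 3, 64, 64)   # target view (e.g., frame t+1)

pred = predictor(encoder(ctx))
with torch.no_grad():              # stop-gradient on the target branch
    target = target_encoder(tgt)
((pred - target) ** 2).mean().backward()

# EMA update of the target encoder
for p, tp in zip(encoder.parameters(), target_encoder.parameters()):
    tp.data.mul_(0.996).add_(p.data, alpha=0.004)
```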

Antonio Orvieto (@orvieto_antonio)

Optimization theory for adaptive methods actually predicts most of what we know about hyperparameter scaling in LLM pretraining, and suggests new strategies as well. We did a deep dive here.
alphaXiv (@askalphaxiv)

Yann LeCun and his team can't stop cooking

"LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels"

One of the biggest bottlenecks of JEPAs is that they are hard to train, and this new research changes that.

They propose LeWorldModel, which shows that a
Nav Singh (@heynavsingh)

🚨 Electrical engineers are going to hate this. Someone just turned React into a circuit board factory. Write code. Get a real PCB manufactured and delivered to your door. It's called tscircuit. React for Electronics. No Altium. No $10,000/year licenses. No 6-month learning

Snyk (@snyksec)

Andrej Karpathy The LiteLLM dependency incident didn't "just happen" though. This is part of a larger campaign; the LiteLLM fallout already extends to supply-chain security consequences for other projects: snyk.io/articles/poiso…

Wildminder (@wildmindai)

NVIDIA says: no more "brute force every pixel" video understanding. AutoGaze identifies and removes redundant video patches before they enter a Vision Transformer. Now we can process 4K long video in real time. Works with SigLIP2 and NVILA. autogaze.github.io
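
The general recipe behind this kind of method is easy to sketch: score patch tokens cheaply, keep the top-k, and only the survivors pay the quadratic attention cost. A generic illustration, not AutoGaze's actual scoring rule:

```python
import torch

def prune_patches(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    # tokens: (batch, n_patches, dim). Token norm is a cheap saliency
    # stand-in here; only the kept tokens enter the ViT.
    scores = tokens.norm(dim=-1)                        # (B, N)
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                 # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx)                        # (B, k, dim)

frames = torch.randn(2, 4096, 768)  # e.g., patch tokens from large frames
print(prune_patches(frames).shape)  # torch.Size([2, 1024, 768])
```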

Brian Roemmele (@brianroemmele)

LeWorldModel: Yann LeCun's Radical Simplification of World Models Just Made Physics-Aware AI Practical

In the race for artificial general intelligence, two paths have emerged. One is the familiar scale everything route: bigger LLMs trained on ever-larger text corpora. The other,
Sawyer Hood (@sawyerhood)

Introducing the new dev-browser cli. The fastest way for an agent to use a browser is to let it write code. Just `npm i -g dev-browser` and tell your agent to "use dev-browser"

Om Patel (@om_patel5)

THIS GUY MADE A CLAUDE CODE SKILL THAT CLONES ANY WEBSITE IN ONE PROMPT everyone tries to clone websites by taking screenshots and hoping for the best. that gets you maybe halfway there. there's a better way. Claude Code has a built-in Chrome MCP that goes straight to the

himanshu dubey (@himanshustwts)

nanoGPT by Andrej Karpathy is still the most relevant reference to hack and learn if someone is starting out in ai research.

i tried to look (been a long time!) at what work has been done to beat the baseline:

> Architectural modernization (RoPE, QK-norm, ReLU, RMSnorm etc)
>
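
Of the modernizations listed, RMSNorm is the simplest to show: it rescales by the root-mean-square of the features and drops LayerNorm's mean-centering and bias. A minimal drop-in version:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Rescale by the root-mean-square of the features: no mean
    # subtraction and no bias, unlike LayerNorm.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(4, 16, 64)
print(RMSNorm(64)(x).shape)  # torch.Size([4, 16, 64])
```
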
Yesterday Work (@yesterday_work_)

🚨 BREAKING: HuggingFace just dropped their complete AI engineering playbook to the public.

They released 12 courses that were internal-only until this week.

This covers LLMs, Robotics, and MCP, which is the exact tech stack behind Llama, Mistral, and every major open model.
Martin (@mjbukow)

Andy It's more complex than that. Because the residual stream is purely additive, low-level gradient noise and intralayer communication signals accumulate across layers. The norm of the hidden states steadily increases with depth. In the last few layers, the model turns up the volume
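
The norm-growth claim is easy to check on a toy additive stack; a sketch under simple assumptions (random linear blocks, no normalization):

```python
import torch
import torch.nn as nn

# Because the residual stream is purely additive, each block's output
# is added to the hidden state, so its norm tends to grow with depth.
torch.manual_seed(0)
d, depth = 256, 24
blocks = [nn.Linear(d, d) for _ in range(depth)]

x = torch.randn(8, d)
for i, block in enumerate(blocks):
    x = x + block(torch.tanh(x))   # additive residual update
    print(f"layer {i:2d}  mean ||h|| = {x.norm(dim=-1).mean():.1f}")
```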

Boris Cherny (@bcherny)

I wanted to share a bunch of my favorite hidden and under-utilized features in Claude Code. I'll focus on the ones I use the most. Here goes.