Owen Dugan (@owendugan)'s Twitter Profile
Owen Dugan

@owendugan

CS Ph.D. student @Stanford. Previously @MIT.

@HertzFoundation, @KnightHennessy

ID: 1542654374786338816

http://druidowm.github.io · Joined 30-06-2022 23:40:45

63 Tweets

196 Followers

213 Following

Benjamin F Spector (@bfspector)

(1/6) Joyously announcing ThunderKittens with real support on NVIDIA Blackwell! We've released BF16/FP8 GEMM and attention fwd+bwd kernels, up to 2x faster than cuBLAS GEMMs on H100. Blog: bit.ly/41tuT4Q With Dan Fu, Aaryan Singhal, and @hazyresearch!

Simran Arora (@simran_s_arora)

BASED ✌️ turns 1! One year since its launch at NeurIPS 2023, and it's helped shape the new wave of efficient LMs. ⚡️ Fastest linear attention kernels 🧠 405B models trained on 16 GPUs 💥 Inspired Mamba-v2, RWKVs, MiniMax. Check out our retrospective below!
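For context on what "linear attention" refers to (a sketch of the general technique, not BASED's exact kernels): a feature map replaces the softmax so attention can be computed with running sums in linear rather than quadratic time. A minimal NumPy version of the causal case, using a simple ReLU feature map as a stand-in for BASED's Taylor-approximation map:

import numpy as np

def feature_map(x):
    # Stand-in feature map; BASED uses a Taylor approximation of exp(q.k).
    return np.maximum(x, 0.0) + 1e-6

def causal_linear_attention(Q, K, V):
    # Q, K: (N, d); V: (N, d_v). Carries running sums instead of
    # materializing the N x N attention matrix.
    phi_Q, phi_K = feature_map(Q), feature_map(K)
    S = np.zeros((phi_K.shape[1], V.shape[1]))  # running sum of phi(k) v^T
    z = np.zeros(phi_K.shape[1])                # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(Q.shape[0]):
        S += np.outer(phi_K[t], V[t])
        z += phi_K[t]
        out[t] = (phi_Q[t] @ S) / (phi_Q[t] @ z + 1e-6)
    return out

Q, K, V = (np.random.randn(8, 4) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)  # (8, 4)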

hazyresearch (@hazyresearch)

The Great American AI Race. I wrote something about how we need a holistic AI effort from academia, industry, and the US government to have the best shot at a freer, better educated, and healthier world in AI. I’m a mega bull on the US and open source AI. Maybe we’re cooking

Jerry Liu (@jerrywliu)

I'm at #ICLR2025 this week to present our work on 🔬high-precision algorithm learning🔬 with Transformers! Stop by our poster session Thursday afternoon! 🔗arxiv.org/abs/2503.12295 With Jess Grogan, Owen Dugan, Ashish Rao, Simran Arora, Atri Rudra, and hazyresearch!

Roberto Garcia (@garctrob)

I'm at #ICLR2025 presenting RaNA 🐸, an adaptive compression method to speed up Transformers, with Jerry Liu, Sabri Eyuboglu, and others! 🔗arxiv.org/abs/2503.18216 RaNA is based on the Adaptive Rank Allocation framework, which generalizes prior neuron-adaptive methods,

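A rough sketch of the rank-allocation idea (my illustration, not the RaNA code; the budgeting heuristic and names below are made up): factor each linear layer with a truncated SVD, but choose each layer's rank adaptively from its singular-value spectrum under one shared budget.

import numpy as np

def allocate_ranks(weights, budget_frac=0.5):
    # Spend a global rank budget on the (layer, direction) pairs with the
    # largest singular values, so layers with slowly decaying spectra keep more rank.
    spectra = [np.linalg.svd(W, compute_uv=False) for W in weights]
    budget = int(budget_frac * sum(len(sv) for sv in spectra))
    items = sorted(((s, i) for i, sv in enumerate(spectra) for s in sv), reverse=True)
    ranks = [0] * len(weights)
    for _, i in items[:budget]:
        ranks[i] += 1
    return ranks

def low_rank_factors(W, r):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * S[:r], Vt[:r]   # W ≈ A @ B with rank r

weights = [np.random.randn(256, 256) for _ in range(4)]
ranks = allocate_ranks(weights)
factors = [low_rank_factors(W, r) for W, r in zip(weights, ranks)]
print(ranks)
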
Avanika Narayan (@avanika15)

can you chat privately with a cloud llm—*without* sacrificing speed? excited to release minions secure chat: an open-source protocol for end-to-end encrypted llm chat with <1% latency overhead (even @ 30B+ params!). cloud providers can’t peek—messages decrypt only inside a

Dan Biderman (@dan_biderman)

We secure all communications with a cloud-hosted LLM, running on an H100 in confidential mode. Latency overhead goes away once you cross the 10B model size. This is our first foray into applied cryptography -- help us refine our ideas.

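Mechanically, "end-to-end encrypted chat with a cloud LLM" means the client and the confidential-mode GPU share a key and everything on the wire is AEAD ciphertext. A toy sketch of that shape (illustrative only, not the Minions protocol; the attested key exchange is elided and the session key here is just generated locally), using the `cryptography` package:

import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(bit_length=256)  # would come from key exchange with the enclave
aead = AESGCM(session_key)

def seal(plaintext: str) -> tuple[bytes, bytes]:
    nonce = os.urandom(12)
    return nonce, aead.encrypt(nonce, plaintext.encode(), b"chat-demo")

def unseal(nonce: bytes, ciphertext: bytes) -> str:
    return aead.decrypt(nonce, ciphertext, b"chat-demo").decode()

# client -> cloud: only ciphertext crosses the provider's infrastructure
nonce, ct = seal("summarize my private notes")
# ...inside the confidential-mode H100, the enclave decrypts, runs the model,
# and seals the response the same way...
print(unseal(nonce, ct))
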
Benjamin F Spector (@bfspector)

(1/5) We’ve never enjoyed watching people chop Llamas into tiny pieces. So, we’re excited to be releasing our Low-Latency-Llama Megakernel! We run the whole forward pass in a single kernel. Megakernels are faster & more humane. Here’s how to treat your Llamas ethically: (Joint

Jordan Juravsky (@jordanjuravsky)

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models. (Joint work with Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,

Sabri Eyuboglu (@eyuboglusabri)

When we put lots of text (eg a code repo) into LLM context, cost soars b/c of the KV cache’s size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory on avg 39x

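A very loose sketch of the "train a smaller KV cache offline" idea (my toy version, not the paper's self-study recipe): treat a short bank of key/value vectors as parameters and fit them so attention over the small bank matches attention over the full document's cache on sampled queries.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, doc_len, cache_len = 64, 2048, 64        # ~32x smaller cache

doc_k, doc_v = torch.randn(doc_len, d), torch.randn(doc_len, d)   # stand-in for the full document's KV cache
small_k = torch.randn(cache_len, d, requires_grad=True)
small_v = torch.randn(cache_len, d, requires_grad=True)

def attend(q, k, v):
    return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v

opt = torch.optim.Adam([small_k, small_v], lr=1e-2)
for step in range(500):
    q = torch.randn(32, d)                  # synthetic "self-study" queries
    loss = F.mse_loss(attend(q, small_k, small_v), attend(q, doc_k, doc_v))
    opt.zero_grad(); loss.backward(); opt.step()

print(f"distillation loss after training: {loss.item():.4f}")
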
Jon Saad-Falcon (@jonsaadfalcon)

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning

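A toy sketch of the weak-verifier-combination idea (not Weaver itself; the plain accuracy-weighted vote below is my stand-in for its actual weighting scheme): score every sampled answer with several imperfect verifiers, combine the scores, and keep the argmax.

import numpy as np

def combine_verifiers(scores, weights=None):
    # scores: (num_verifiers, num_candidates), each entry in [0, 1]
    scores = np.asarray(scores, dtype=float)
    weights = np.ones(len(scores)) if weights is None else np.asarray(weights, dtype=float)
    return (weights / weights.sum()) @ scores      # weighted average per candidate

# 3 weak verifiers (say, two reward models + an LM judge) scoring 4 samples
scores = [[0.6, 0.2, 0.7, 0.4],
          [0.5, 0.3, 0.9, 0.4],
          [0.4, 0.1, 0.6, 0.8]]
# weights could come from held-out accuracy or a latent-label (EM-style) fit
combined = combine_verifiers(scores, weights=[0.5, 0.3, 0.2])
print("selected candidate:", int(np.argmax(combined)))
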
Tilde (@tilderesearch)

Sparse attention (MoBA/NSA) trains faster & beats full attention in key tasks. But we’ve had no idea how they truly work…until now. 🔍 We reverse-engineered them to uncover: - Novel attention patterns - Hidden "attention sinks" - Better performance - And more A 🧵… ~1/8~

Jerry Liu (@jerrywliu)

1/10 ML can solve PDEs – but precision🔬is still a challenge. Towards high-precision methods for scientific problems, we introduce BWLer 🎳, a new architecture for physics-informed learning achieving (near-)machine-precision (up to 10⁻¹² RMSE) on benchmark PDEs. 🧵How it works:

Jordan Juravsky (@jordanjuravsky)

Check out Tokasaurus on Modal to make Llama-1B brrr! This repeated sampling example shows off two engine features that are important for serving small models: very low CPU overhead and automatic shared prefix exploitation with Hydragen.
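For flavor, a toy NumPy version of the shared-prefix decomposition Hydragen exploits (my sketch of the published idea, not Tokasaurus code): attention over the shared prefix is computed once, attention over each sequence's suffix is computed separately, and the two partial results are merged with their softmax normalizers.

import numpy as np

def attn_with_lse(q, K, V):
    # softmax attention output plus its log-sum-exp normalizer
    s = q @ K.T / np.sqrt(q.shape[-1])
    lse = np.log(np.sum(np.exp(s)))
    return np.exp(s - lse) @ V, lse

def shared_prefix_attention(q, Kp, Vp, Ks, Vs):
    o_p, lse_p = attn_with_lse(q, Kp, Vp)   # over the shared prefix (reused across the batch)
    o_s, lse_s = attn_with_lse(q, Ks, Vs)   # over this sequence's suffix
    w_p = np.exp(lse_p) / (np.exp(lse_p) + np.exp(lse_s))
    return w_p * o_p + (1.0 - w_p) * o_s

d = 16
q = np.random.randn(d)
Kp, Vp = np.random.randn(100, d), np.random.randn(100, d)
Ks, Vs = np.random.randn(5, d), np.random.randn(5, d)
ref, _ = attn_with_lse(q, np.vstack([Kp, Ks]), np.vstack([Vp, Vs]))
print(np.allclose(shared_prefix_attention(q, Kp, Vp, Ks, Vs), ref))  # True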

Michael Poli (@michaelpoli6)

Life update: I started Radical Numerics with Stefano Massaroli, Armin Thomas, Eric Nguyen, and a fantastic team of engineers and researchers. We are building the engine for recursive self‑improvement (RSI): AI that designs and refines AI, accelerating discovery across science and

Eric Nguyen (@exnx)

✨ Excited to share a few life updates! 🎤 My TED Talk is now live! I shared the origin story of Evo, titled: "How AI could generate new life forms" TED talk: ted.com/talks/eric_ngu… ✍️ I wrote a blog post about what it’s *really* like to deliver a TED talk blog: