Shizhe He (@shizhehe)'s Twitter Profile
Shizhe He

@shizhehe

rethinking computing @ Brains in Silicon; building excavators; prev. STAILab @Stanford

ID: 1524037330000138242

Website: http://shizhehe.com · Joined: 10-05-2022 14:43:52

23 Tweets

152 Followers

339 Following

Avanika Narayan (@avanika15):

The U.S.–China AI race won’t be decided by who builds the most datacenters, but by who deploys the most intelligence.

We call this Gross Domestic Intelligence (GDI): intelligence per watt × usable power.

If the U.S. activates its dense installed base of local AI accelerators in
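
As a back-of-envelope illustration of the GDI definition above (every number here is hypothetical, not from the thread):

```python
# Hypothetical GDI calculation. The quantities below (tokens/sec/watt as an
# "intelligence per watt" proxy, and usable power allocated to AI) are
# illustrative assumptions, not figures from the thread.

intelligence_per_watt = 2.5   # e.g., useful tokens per second per watt
usable_power_watts = 5e9      # e.g., 5 GW of power available to run accelerators

gdi = intelligence_per_watt * usable_power_watts
print(f"GDI proxy: {gdi:.3e} useful tokens/sec")  # 1.250e+10
```
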
Radical Numerics (@radicalnumerics):

Scaling scientific world models requires co-designing architectures, training objectives, and numerics. Today, we share the first posts in our series on low-precision pretraining, starting with NVIDIA's NVFP4 recipe for stable 4-bit training.

Part 1: radicalnumerics.ai/blog/nvfp4-par…
Part
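
For intuition, here is a rough numpy sketch of blockwise 4-bit quantization in the spirit of NVFP4 (FP4 E2M1 elements with a per-16-element block scale). The actual training recipe, including FP8 block scales, a per-tensor scale, and rounding details, is in the linked posts; treat this as an assumption-laden toy:

```python
import numpy as np

# Representable magnitudes of FP4 E2M1 (the element format NVFP4 builds on).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_like(x, block=16):
    """Quantize a 1-D array to an NVFP4-style format: FP4 E2M1 elements with
    one scale per 16-element block. (Real NVFP4 stores the block scale in FP8
    E4M3 plus a per-tensor FP32 scale; here the block scale is kept in full
    precision for clarity.) Returns the dequantized values for inspection."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero on all-zero blocks
    scaled = x / scale
    # Round each scaled element to the nearest grid point of matching sign.
    idx = np.abs(scaled[..., None] - np.sign(scaled[..., None]) * FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).ravel()

x = np.random.randn(64).astype(np.float32)
xq = quantize_nvfp4_like(x)
print("mean abs quantization error:", np.abs(x - xq).mean())
```
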
Flapping Airplanes (@flappyairplanes):

Announcing Flapping Airplanes! We’ve raised $180M from GV, Sequoia, and Index to assemble a new guard in AI: one that imagines a world where models can think at human level without ingesting half the internet.

jack morris (@jxmnop):

at long last, the final paper of my phd

🧮 Learning to Reason in 13 Parameters 🧮

we develop TinyLoRA, a new fine-tuning method. with TinyLoRA + RL, models learn well with dozens or hundreds of parameters

example: we use only 13 parameters to take a 7B Qwen model from 76% to 91% on GSM8K 🤯
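
The tweet doesn't describe TinyLoRA's parameterization. One hypothetical way to fine-tune with a handful of parameters is to freeze random rank-1 directions and train only one scalar per adapted layer; the sketch below is that idea, not the paper's method:

```python
import torch
import torch.nn as nn

class ScalarAdapter(nn.Module):
    """Hypothetical sketch of an extremely low-parameter adapter in the
    spirit of TinyLoRA (the paper's actual method is not described in the
    tweet). A frozen random rank-1 direction u v^T is attached to a frozen
    linear layer; the only trainable parameter is one scalar scaling it."""

    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base.requires_grad_(False)       # frozen pretrained layer
        out_f, in_f = base.weight.shape
        self.register_buffer("u", torch.randn(out_f) / out_f**0.5)
        self.register_buffer("v", torch.randn(in_f) / in_f**0.5)
        self.alpha = nn.Parameter(torch.zeros(()))   # the single trainable scalar

    def forward(self, x):
        # W x + alpha * (v . x) u : a rank-1 perturbation steered by one scalar
        return self.base(x) + self.alpha * (x @ self.v).unsqueeze(-1) * self.u

layer = ScalarAdapter(nn.Linear(16, 16))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 1
```

Adapting 13 layers this way would cost exactly 13 trainable parameters.
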
Mayee Chen (@mayeechen):

Data mixing - determining ratios across your training datasets - matters a lot for model quality. While building Olmo 3, we learned it’s hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development.
Introducing Olmix👇
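
As a generic illustration (not the Olmix method itself) of what a mixing ratio is and why it's brittle as datasets change:

```python
import numpy as np

# A data mix is a probability distribution over source datasets; each
# training example is drawn from a source with that probability. When a
# dataset is added or removed mid-development, the remaining weights must
# be renormalized, silently shifting the effective mix -- part of why
# maintaining a strong mix throughout development is hard.

datasets = {"web": 0.55, "code": 0.25, "math": 0.15, "wiki": 0.05}

def sample_sources(mix: dict, n: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    names, probs = zip(*mix.items())
    return rng.choice(names, size=n, p=probs)

print(sample_sources(datasets, 8))

# Drop "math" and renormalize: every other ratio changes too.
remaining = {k: v for k, v in datasets.items() if k != "math"}
total = sum(remaining.values())
print({k: v / total for k, v in remaining.items()})
```
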
Stuart Sul (@stuart_sul):

(1/7) We're releasing ThunderKittens 2.0! Faster kernels, cleaner code, industry contributions, and new state-of-the-art BF16 / MXFP8 / NVFP4 GEMMs that match or surpass cuBLAS!

Alongside this release, we’re equally excited to share some insights we learned while squeezing every
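
For context on what a tile-based GEMM kernel computes, here is a plain numpy sketch of the blocked loop structure such kernels implement on tensor cores. The real ThunderKittens kernels are CUDA templates; the tile size and layout below are illustrative, not the library's actual choices:

```python
import numpy as np

def tiled_gemm(A, B, tile=64):
    """Reference tiled GEMM: compute C = A @ B one (tile x tile) output block
    at a time, accumulating over K in tile-sized chunks. This mirrors the loop
    structure of tile-granularity kernels, where each block is a register or
    shared-memory tile fed to tensor cores."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % tile == 0 and N % tile == 0 and K % tile == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = np.zeros((tile, tile), dtype=np.float32)  # accumulator tile
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A = np.random.randn(128, 128).astype(np.float32)
B = np.random.randn(128, 128).astype(np.float32)
assert np.allclose(tiled_gemm(A, B), A @ B, atol=1e-3)
```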