Yizhou Liu (@yizhouliu0)'s Twitter Profile
Yizhou Liu

@yizhouliu0

PhD student at @MITMechE | Physics of living systems, Complex systems, Statistical physics

ID: 1577675228469043201

Link: https://liuyz0.github.io/ · Joined: 05-10-2022 15:01:26

43 Tweets

168 Followers

150 Following

Yulu Gan (@yule_gan)'s Twitter Profile Photo

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.

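A minimal sketch of how I read this recipe, assuming PyTorch: perturb the base model's weights once with Gaussian noise, repeat to get several independent copies, and aggregate their outputs. The function names, the noise scale `sigma`, and the deep-copy approach are illustrative assumptions, not the RandOpt reference implementation.

```python
import copy
import torch

def make_noisy_copy(model, sigma=0.01, seed=None):
    """Return a copy of `model` whose weights are perturbed once by
    Gaussian noise -- a single step, no iterations, no learning rate,
    no gradients."""
    if seed is not None:
        torch.manual_seed(seed)
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

def randopt_ensemble(model, n_members=8, sigma=0.01):
    """Build an ensemble of independently perturbed copies of the base
    model. Member outputs would then be aggregated, e.g. by majority
    vote over final answers (one common ensembling choice)."""
    return [make_noisy_copy(model, sigma, seed=i) for i in range(n_members)]
```
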
Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Arrived at APS March meeting 🤩 Will talk about Neural Scaling Laws Trilogy (Mar 19, at 🚨Physics of Learning and Adaptation III)! Happy to schedule chats with physicists interested in LLMs‼️

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Similar to Figs. 20 and 24 in our arxiv.org/abs/2602.05970. When the transition is a smooth ODE, dense connections can be much better. When the transition is an SDE (or when high-order derivatives don't exist), there is no big change. We actually found that LLMs are more like SDEs.
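
A toy numerical analogy for this point (my construction, not from the paper): Heun's second-order integrator stands in for dense connections that exploit smoothness, and plain Euler for residual-style first-order updates. On a smooth ODE the higher-order scheme wins by a large margin; once noise turns the transition into an SDE, both land at roughly the same error.

```python
import numpy as np

rng = np.random.default_rng(0)
EXACT = np.exp(-1.0)  # exact solution of dx/dt = -x at t = 1, from x0 = 1

def solve(n_steps, heun=False, noise=0.0):
    """Integrate dx = -x dt + noise * dW on [0, 1].
    heun=False: plain Euler (think residual connections, first order).
    heun=True:  Heun's predictor-corrector (think dense connections
    exploiting higher-order smoothness of the transition)."""
    dt = 1.0 / n_steps
    x = 1.0
    for _ in range(n_steps):
        drift = -x
        if heun:
            x_pred = x + dt * drift          # predictor step
            drift = 0.5 * (drift - x_pred)   # corrector: average the slopes
        x += dt * drift + noise * np.sqrt(dt) * rng.normal()
    return x

for noise in (0.0, 0.3):   # smooth ODE vs. noisy SDE transition
    for name, h in (("euler", False), ("heun ", True)):
        err = np.mean([abs(solve(20, heun=h, noise=noise) - EXACT)
                       for _ in range(500)])
        print(f"noise={noise}  {name} mean |error| = {err:.4f}")
```

With noise=0, Heun's error is roughly two orders of magnitude below Euler's; with noise=0.3, the noise floor dominates and the two are nearly indistinguishable.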

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Seems to be a systematic scaling study for diffusion language models 👍 Not surprised that the exponents are still similar to Chinchilla's. But what is the origin of the 21.8x speedup? So far, I can imagine that diffusion models enable better hyperparameter choices.
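
For reference, a sketch of fitting the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β, the form such scaling studies report exponents from. The data points here are synthetic, generated from the Hoffmann et al. (2022) fitted constants, purely to show the procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
def chinchilla_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, L) points standing in for measured training losses.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, 40)    # model parameters
D = 10 ** rng.uniform(9, 12, 40)    # training tokens
true = (1.69, 406.4, 0.34, 410.7, 0.28)   # Hoffmann et al. (2022) fit
L = chinchilla_loss((N, D), *true) * (1 + 0.01 * rng.normal(size=40))

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=(1.5, 300.0, 0.3, 300.0, 0.3), maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], popt)))
```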

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Some works have also reported that the optimal batch size scales as dataset size^(1/3). I wonder if these results are related to my 1/3 scaling, arxiv.org/abs/2602.03685, which comes from balancing that loss against an extra loss due to gradient noise. (My theory assumed perfect gradients.)
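
For intuition, here is one hypothetical trade-off that reproduces a 1/3 exponent numerically. The two penalty terms below are my guesses, chosen only so the algebra gives B_opt ∝ D^(1/3); the actual balance in the linked paper may take a different form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical trade-off (illustrative only, not the paper's derivation):
#   excess_loss(B) = a * B**2 / D   (penalty from taking fewer steps, S = D/B)
#                  + b / B          (penalty from gradient noise)
# Setting d/dB = 0 gives B_opt = (b * D / (2 * a))**(1/3), i.e. B_opt ~ D^(1/3).
a, b = 1.0, 1.0
Ds = np.logspace(8, 12, 9)          # dataset sizes (tokens)
B_opt = [minimize_scalar(lambda B, D=D: a * B**2 / D + b / B,
                         bounds=(1.0, 1e9), method="bounded").x
         for D in Ds]
slope, _ = np.polyfit(np.log(Ds), np.log(B_opt), 1)
print(f"fitted log-log slope: {slope:.3f}")   # ~ 0.333
```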