Yizhou Liu (@yizhouliu0)'s Twitter Profile
Yizhou Liu

@yizhouliu0

PhD student at @MITMechE | Physics of living systems, Complex systems, Statistical physics

ID: 1577675228469043201

Link: https://liuyz0.github.io/ · Joined: 05-10-2022 15:01:26

43 Tweets

168 Followers

150 Following

Yulu Gan (@yule_gan)'s Twitter Profile Photo

Simply adding Gaussian noise to LLMs (one step—no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.

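A minimal sketch of how I read this recipe, assuming PyTorch: perturb the base model's weights once with Gaussian noise, repeat to get several independent copies, and aggregate their outputs. The function names, the noise scale `sigma`, and the deep-copy approach are illustrative assumptions, not the RandOpt reference implementation.

```python
import copy
import torch

def make_noisy_copy(model, sigma=0.01, seed=None):
    """Return a copy of `model` whose weights are perturbed once by
    Gaussian noise -- a single step, no iterations, no learning rate,
    no gradients."""
    if seed is not None:
        torch.manual_seed(seed)
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy

def randopt_ensemble(model, n_members=8, sigma=0.01):
    """Build an ensemble of independently perturbed copies of the base
    model. Member outputs would then be aggregated, e.g. by majority
    vote over final answers (one common ensembling choice)."""
    return [make_noisy_copy(model, sigma, seed=i) for i in range(n_members)]
```
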
Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Arrived at APS March meeting 🤩 Will talk about Neural Scaling Laws Trilogy (Mar 19, at 🚨Physics of Learning and Adaptation III)! Happy to schedule chats with physicists interested in LLMs‼️

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Similar to Figs. 20 and 24 in our arxiv.org/abs/2602.05970. When the transition is a smooth ODE, dense connections can be much better. When the transition is an SDE (or when high-order derivatives don't exist), there is no big change. We actually found that LLMs are more like SDEs.
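
A toy numerical analogy for this point (my construction, not from the paper): Heun's second-order integrator stands in for dense connections that exploit smoothness, and plain Euler for residual-style first-order updates. On a smooth ODE the higher-order scheme wins by a large margin; once noise turns the transition into an SDE, both land at roughly the same error.

```python
import numpy as np

rng = np.random.default_rng(0)
EXACT = np.exp(-1.0)  # exact solution of dx/dt = -x at t = 1, from x0 = 1

def solve(n_steps, heun=False, noise=0.0):
    """Integrate dx = -x dt + noise * dW on [0, 1].
    heun=False: plain Euler (think residual connections, first order).
    heun=True:  Heun's predictor-corrector (think dense connections
    exploiting higher-order smoothness of the transition)."""
    dt = 1.0 / n_steps
    x = 1.0
    for _ in range(n_steps):
        drift = -x
        if heun:
            x_pred = x + dt * drift          # predictor step
            drift = 0.5 * (drift - x_pred)   # corrector: average the slopes
        x += dt * drift + noise * np.sqrt(dt) * rng.normal()
    return x

for noise in (0.0, 0.3):   # smooth ODE vs. noisy SDE transition
    for name, h in (("euler", False), ("heun ", True)):
        err = np.mean([abs(solve(20, heun=h, noise=noise) - EXACT)
                       for _ in range(500)])
        print(f"noise={noise}  {name} mean |error| = {err:.4f}")
```

With noise=0, Heun's error is roughly two orders of magnitude below Euler's; with noise=0.3, the noise floor dominates and the two are nearly indistinguishable.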

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Seems to be a systematic scaling study for diffusion language models 👍 Not surprised that the exponents are still similar to Chinchilla's. But what is the origin of the 21.8x speedup? So far, I can imagine that diffusion models enable better hyperparameter choices.
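
For reference, a sketch of fitting the Chinchilla parametric loss L(N, D) = E + A/N^α + B/D^β, the form such scaling studies report exponents from. The data points here are synthetic, generated from the Hoffmann et al. (2022) fitted constants, purely to show the procedure.

```python
import numpy as np
from scipy.optimize import curve_fit

# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta
def chinchilla_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic (N, D, L) points standing in for measured training losses.
rng = np.random.default_rng(0)
N = 10 ** rng.uniform(7, 10, 40)    # model parameters
D = 10 ** rng.uniform(9, 12, 40)    # training tokens
true = (1.69, 406.4, 0.34, 410.7, 0.28)   # Hoffmann et al. (2022) fit
L = chinchilla_loss((N, D), *true) * (1 + 0.01 * rng.normal(size=40))

popt, _ = curve_fit(chinchilla_loss, (N, D), L,
                    p0=(1.5, 300.0, 0.3, 300.0, 0.3), maxfev=20000)
print(dict(zip(["E", "A", "alpha", "B", "beta"], popt)))
```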

Yizhou Liu (@yizhouliu0)'s Twitter Profile Photo

Some works have also reported that the optimal batch size scales as dataset size^(1/3). I wonder if these results are related to my 1/3 scaling, arxiv.org/abs/2602.03685, which comes from balancing that loss against an extra loss due to gradient noise. (My theory assumed perfect gradients.)
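
For intuition, here is one hypothetical trade-off that reproduces a 1/3 exponent numerically. The two penalty terms below are my guesses, chosen only so the algebra gives B_opt ∝ D^(1/3); the actual balance in the linked paper may take a different form.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical trade-off (illustrative only, not the paper's derivation):
#   excess_loss(B) = a * B**2 / D   (penalty from taking fewer steps, S = D/B)
#                  + b / B          (penalty from gradient noise)
# Setting d/dB = 0 gives B_opt = (b * D / (2 * a))**(1/3), i.e. B_opt ~ D^(1/3).
a, b = 1.0, 1.0
Ds = np.logspace(8, 12, 9)          # dataset sizes (tokens)
B_opt = [minimize_scalar(lambda B, D=D: a * B**2 / D + b / B,
                         bounds=(1.0, 1e9), method="bounded").x
         for D in Ds]
slope, _ = np.polyfit(np.log(Ds), np.log(B_opt), 1)
print(f"fitted log-log slope: {slope:.3f}")   # ~ 0.333
```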