Larry Dial (@classiclarryd)'s Twitter Profile
Larry Dial

@classiclarryd

Data Engineer at AWS & AI research.

ID: 1791834924317560832

Link: http://larrydial.com

Joined: 18-05-2024 14:15:22

6 Tweets

117 Followers

19 Following

Excited to share an (unofficial till merged) collaborative WR of 159s on modded-nanogpt! github.com/KellerJordan/m… 180s->159s includes align bos, triton, sparse attn gate, FA3, drop MLP, & dynamic YaRN. Keller Jordan Love how speedrun format enables immediate feedback on ideas!

Down to 146.8s on modded-nanogpt! github.com/KellerJordan/m… Surprising result: Different parameter groups have different sensitivity to batch size. Instead of picking a single batch size, grad accumulation can be managed on a param level to simulate different batch sizes.
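
To make the per-parameter idea concrete, here is a minimal, dependency-free sketch of how gradient accumulation could be tracked per parameter group so that each group sees a different effective batch size. It is an illustration only, not the code from the modded-nanogpt record: the `ParamGroup` class, the `accum_steps` field, the toy gradient, and plain SGD are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ParamGroup:
    """Hypothetical parameter group with its own accumulation period."""
    name: str
    values: list[float]
    accum_steps: int  # micro-batches to accumulate before each update
    lr: float = 0.1
    grad_sum: list[float] = field(default_factory=list)
    pending: int = 0  # micro-batches accumulated since the last update
    updates: int = 0  # number of optimizer updates applied so far

    def __post_init__(self):
        self.grad_sum = [0.0] * len(self.values)

    def accumulate(self, grads):
        # Add this micro-batch's gradient into the running sum.
        for i, g in enumerate(grads):
            self.grad_sum[i] += g
        self.pending += 1

    def maybe_step(self):
        # Update only once enough micro-batches have been accumulated.
        if self.pending < self.accum_steps:
            return False
        for i in range(len(self.values)):
            # Plain SGD on the mean accumulated gradient (an assumption).
            self.values[i] -= self.lr * self.grad_sum[i] / self.pending
            self.grad_sum[i] = 0.0
        self.pending = 0
        self.updates += 1
        return True


def toy_grad(values, target):
    # Gradient of 0.5 * (w - target)^2 with respect to each w.
    return [v - target for v in values]


def run(micro_targets, groups):
    # Every micro-batch feeds every group; each group decides when to step.
    for target in micro_targets:
        for g in groups:
            g.accumulate(toy_grad(g.values, target))
            g.maybe_step()
    return groups


groups = [
    ParamGroup("small_batch", [0.0], accum_steps=1),
    ParamGroup("large_batch", [0.0], accum_steps=4),
]
run([1.0, 2.0, 3.0, 4.0] * 2, groups)
```

Over the same eight micro-batches, the first group updates on every micro-batch while the second updates once per four, which simulates a batch four times larger for that group without changing the data pipeline.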