にょせがわ (@nyosegawa)'s Twitter Profile
にょせがわ

@nyosegawa

Studying machine learning while crying @gyakuse

ID: 1620385553274540032

Joined: 31-01-2023 11:36:40

1.1K Tweets

184 Followers

80 Following

にょせがわ (@nyosegawa)

While I was writing a GPT-2 article, 7shi's article came up as a hit and it cracked me up; people really do keep coming back to this.

Cameron R. Wolfe, Ph.D. (@cwolferesearch)

We use scaling laws for both pretraining and RL, but these domains handle scaling way differently. In reality, labeling both concepts as “scaling laws” is misleading…

Power laws. Most scaling laws are based upon an (inverse) power law, which describes the relationship between
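
The tweet cuts off mid-sentence, but the inverse power law it refers to is the standard scaling-law form: loss equals an irreducible floor plus a term that decays as a power of scale. A minimal illustrative fit of that form follows; the data points and constants are made-up placeholders, not numbers from the thread.

```python
# Illustrative sketch of fitting an (inverse) power-law scaling curve:
#   L(n) = l_inf + a * n**(-alpha)
# The data points below are synthetic placeholders, not real pretraining results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, l_inf):
    """Loss as a function of scale n (parameters, tokens, or compute)."""
    return l_inf + a * n ** (-alpha)

# Hypothetical (scale, loss) pairs.
n = np.array([1e8, 1e9, 1e10, 1e11, 1e12])
loss = np.array([4.2, 3.4, 2.9, 2.6, 2.45])

params, _ = curve_fit(power_law, n, loss, p0=[10.0, 0.1, 2.0], maxfev=10_000)
a, alpha, l_inf = params
print(f"fit: L(n) ~ {l_inf:.2f} + {a:.2f} * n^(-{alpha:.3f})")
```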
にょせがわ (@nyosegawa)

sumo ai: it's been a while since I met a 20-year-old with real muscle, and it was a lot of fun, so I want to keep doing my best as a 17-year-old high school girl.

Kyle Chan (@kyleichan)

Must-listen interview by Chang Che with an ex-ByteDance AI researcher:
- Benchmaxxing
- Distillation on US models
- Poor data quality and infra
- Compute constraints
"I don't even agree with the assumption that Chinese models are catching up — I believe we're still far behind. I

Kouta Nakayama (@nlpingu)

Here are my slides from yesterday's #sumo_ai talk. They cover design principles for LLM-based agents, evaluation and training of AI agents, and LLM-jp's current efforts and what comes next. "What LLM-jp Models Should Be in the Age of AI Agents" speakerdeck.com/k141303/aiezie…

Susan Zhang (@suchenzang)

so that explains the delay...

deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4

the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were the main bandages (mismatched routing + clamping)

but
Google Research (@googleresearch)

Google presents a new Transformer alternative at #ICLR2026! Join Nino Scherrer & Yanick Schimpf at the Google booth (#411) at 10AM to learn about MesaNet, proposing a new linear sequence layer that optimally learns in-context given a fixed memory budget.

Akashi (@akashi203)

big models reason because they're deep. a 70b model has 80 layers, each doing something different.

if you want a small model to do the same, you can take one layer and just run it 80 times. universal transformer did this in 2019. huginn did it in 2025.

problem is, when you run
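
A rough sketch of the weight-tied recurrence described above: one transformer layer reused for many depth steps, in the spirit of Universal Transformer / Huginn-style recurrent depth. This is an illustration of the idea only; the layer sizes and step count are arbitrary and not taken from either paper.

```python
# Sketch: a single shared transformer layer applied repeatedly, instead of a
# deep stack of distinct layers (weight-tied "depth via recurrence").
import torch
import torch.nn as nn

class RecurrentDepthEncoder(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_steps: int = 80):
        super().__init__()
        # One layer whose weights are reused at every depth step.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.num_steps = num_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Running the same layer num_steps times emulates depth with tied weights.
        for _ in range(self.num_steps):
            x = self.shared_layer(x)
        return x

model = RecurrentDepthEncoder(num_steps=8)   # small step count for the demo
tokens = torch.randn(2, 16, 512)             # (batch, sequence, d_model)
print(model(tokens).shape)                   # torch.Size([2, 16, 512])
```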