にょせがわ (@nyosegawa)'s Twitter Profile
にょせがわ

@nyosegawa

Studying machine learning while crying @gyakuse

ID: 1620385553274540032

Joined: 31-01-2023 11:36:40

1.1K Tweets

184 Followers

80 Following

にょせがわ (@nyosegawa)

While I was writing a GPT-2 article, 7shi's article came up as a hit and it cracked me up; people really do keep coming back to this.

Cameron R. Wolfe, Ph.D. (@cwolferesearch)

We use scaling laws for both pretraining and RL, but these domains handle scaling way differently. In reality, labeling both concepts as “scaling laws” is misleading…

Power laws. Most scaling laws are based upon an (inverse) power law, which describes the relationship between
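
The tweet cuts off mid-sentence, but the inverse power law it refers to is the standard scaling-law form: loss equals an irreducible floor plus a term that decays as a power of scale. A minimal illustrative fit of that form follows; the data points and constants are made-up placeholders, not numbers from the thread.

```python
# Illustrative sketch of fitting an (inverse) power-law scaling curve:
#   L(n) = l_inf + a * n**(-alpha)
# The data points below are synthetic placeholders, not real pretraining results.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, l_inf):
    """Loss as a function of scale n (parameters, tokens, or compute)."""
    return l_inf + a * n ** (-alpha)

# Hypothetical (scale, loss) pairs.
n = np.array([1e8, 1e9, 1e10, 1e11, 1e12])
loss = np.array([4.2, 3.4, 2.9, 2.6, 2.45])

params, _ = curve_fit(power_law, n, loss, p0=[10.0, 0.1, 2.0], maxfev=10_000)
a, alpha, l_inf = params
print(f"fit: L(n) ~ {l_inf:.2f} + {a:.2f} * n^(-{alpha:.3f})")
```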
にょせがわ (@nyosegawa)

sumo ai: it's been a while since I met a 20-year-old with real muscle, and it was a lot of fun, so I want to keep doing my best as a 17-year-old high school girl.

Kyle Chan (@kyleichan)

Must-listen interview by Chang Che with an ex-ByteDance AI researcher:
- Benchmaxxing
- Distillation on US models
- Poor data quality and infra
- Compute constraints
"I don't even agree with the assumption that Chinese models are catching up — I believe we're still far behind. I

Kouta Nakayama (@nlpingu)

Here are my slides from yesterday's #sumo_ai talk. They cover design principles for LLM-based agents, evaluation and training of AI agents, and LLM-jp's current efforts and what comes next. "What LLM-jp Models Should Be in the Age of AI Agents" speakerdeck.com/k141303/aiezie…

Susan Zhang (@suchenzang)

so that explains the delay...

deepseek could not fix training instabilities, after doubling from ~15T tokens in v3 to ~33T tokens in v4

the 10+ mentions of "stability" tricks seem to be wildly lacking if these two were the main bandages (mismatched routing + clamping)

but
Google Research (@googleresearch)

Google presents a new Transformer alternative at #ICLR2026! Join Nino Scherrer & Yanick Schimpf at the Google booth (#411) at 10AM to learn about MesaNet, proposing a new linear sequence layer that optimally learns in-context given a fixed memory budget.

Akashi (@akashi203)

big models reason because they're deep. a 70b model has 80 layers, each doing something different.

if you want a small model to do the same, you can take one layer and just run it 80 times. universal transformer did this in 2019. huginn did it in 2025.

problem is, when you run
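
A rough sketch of the weight-tied recurrence described above: one transformer layer reused for many depth steps, in the spirit of Universal Transformer / Huginn-style recurrent depth. This is an illustration of the idea only; the layer sizes and step count are arbitrary and not taken from either paper.

```python
# Sketch: a single shared transformer layer applied repeatedly, instead of a
# deep stack of distinct layers (weight-tied "depth via recurrence").
import torch
import torch.nn as nn

class RecurrentDepthEncoder(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, num_steps: int = 80):
        super().__init__()
        # One layer whose weights are reused at every depth step.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True
        )
        self.num_steps = num_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Running the same layer num_steps times emulates depth with tied weights.
        for _ in range(self.num_steps):
            x = self.shared_layer(x)
        return x

model = RecurrentDepthEncoder(num_steps=8)   # small step count for the demo
tokens = torch.randn(2, 16, 512)             # (batch, sequence, d_model)
print(model(tokens).shape)                   # torch.Size([2, 16, 512])
```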