Adam Zweiger (@adamzweiger)'s Twitter Profile
Adam Zweiger

@adamzweiger

ID: 1571391036416724992

Joined: 18-09-2022 06:50:11

0 Tweets

14 Followers

197 Following

Adam Zweiger (@adamzweiger):

re Scott Alexander on the human analogue of LLM hallucination: There is simply no human equivalent to what GPT-4 does when asked "what does GRPO stand for" or even "What is the capital of France" with nothing else prompted in-context. The closest thing is someone…

Adam Zweiger (@adamzweiger):

Very excited to see what they've cooked up now. My out-there guess: use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but with far fewer tokens and better expressivity for some domains.
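A minimal sketch of what that guess could look like, assuming a fixed chunk size and a GRU as a runnable stand-in for the SSM (both are my assumptions, not part of the tweet):

```python
import torch
import torch.nn as nn

class ChunkThenAttend(nn.Module):
    """Scan tokens with a recurrent encoder (stand-in for an SSM),
    pool each fixed-size chunk into one summary vector, then run
    full attention over the much shorter chunk sequence."""
    def __init__(self, d_model=256, chunk_size=16, n_heads=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.scanner = nn.GRU(d_model, d_model, batch_first=True)  # SSM stand-in
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h, _ = self.scanner(x)                 # linear-time scan over tokens
        # take the last hidden state of each chunk as its summary
        chunks = h[:, self.chunk_size - 1::self.chunk_size, :]  # (B, T//chunk, D)
        out, _ = self.attn(chunks, chunks, chunks)  # quadratic only in #chunks
        return out

x = torch.randn(2, 256, 256)
print(ChunkThenAttend()(x).shape)  # torch.Size([2, 16, 256])
```

With a chunk size of 16, attention runs over 16x fewer positions, so its quadratic cost drops by roughly 256x while the scan stays linear in tokens.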

Jyo Pari (@jyo_pari):

MoE routers are trained a bit strangely, but things seem to still work. minyoung huh (jacob) and I got curious about combining specialized experts at test time through routing… and ended up deep in the weeds of MoE optimization. Here's a blog post! jyopari.github.io/posts/peculiar…
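For context, a minimal sketch of the top-k routing pattern the post digs into; all sizes and names here are illustrative, not the blog's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Standard learned-router MoE layer: a linear gate scores every
    expert, the top-k experts run, and their outputs are mixed by the
    renormalized gate weights."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.gate(x)                   # (n_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # dense loop; real MoEs dispatch sparsely
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(10, 64)
print(TopKRouter()(x).shape)  # torch.Size([10, 64])
```

One commonly noted quirk of this setup: gradients flow only through the softmax over the k selected experts, so the gate gets no direct learning signal about the experts it did not pick.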
Adam Zweiger (@adamzweiger):

Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles.

I will be presenting with Jyo Pari at E-2702, Tuesday 11-1:30!
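The poster itself isn't reproduced here, but the test-time-training half of the idea reduces to a simple loop: treat the in-context demonstrations as a tiny training set and adapt a throwaway copy of the model on them before answering. A toy sketch, where the model, loss, and step count are all placeholders:

```python
import copy
import torch
import torch.nn as nn

def test_time_train(model, demos, query, steps=20, lr=1e-3):
    """Adapt a copy of the model on a task's demonstration pairs,
    then predict the held-out query. The base model is untouched,
    since each task trains its own throwaway copy."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    adapted.train()
    for _ in range(steps):
        for x, y in demos:                 # in-context examples used as a training set
            opt.zero_grad()
            loss_fn(adapted(x), y).backward()
            opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(query)

model = nn.Linear(4, 4)
demos = [(torch.randn(4), torch.randn(4)) for _ in range(3)]
print(test_time_train(model, demos, torch.randn(4)))
```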
Adam Zweiger (@adamzweiger):

This is one of the highest-quality evals I've seen, and it's nice to see it expanding! I love how you can view each model-problem-run datapoint.

Adam Zweiger (@adamzweiger):

A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?
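A quick back-of-envelope check on that figure (all numbers are rough assumptions):

```python
# Sanity check: is 100M tokens over a decade plausible?
years, total_tokens = 10, 100e6
per_day = total_tokens / (years * 365)
print(f"{per_day:,.0f} tokens/day")  # ~27,397 tokens/day
# At roughly 0.75 words per token, that is about 20k words of reading,
# writing, and deliberate thought per day; plausible for sustained work
# on a single problem.
```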

Adam Zweiger (@adamzweiger):

The whole shrimp welfare thing is actually a great reductio ad absurdum for EA/rat. I suggest that if you are interested in that stuff, you focus your efforts on high-impact things that are more aligned with normal human intuition about morality.

Han Guo (@hanguo97):

Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:

- Transformer-PSM by Morris Yau et al., and
- Radial Attention by Xingyang and Muyang Li et…

Adam Zweiger (@adamzweiger):

This is crazy! It makes more sense once you hear it requires both models to have the same initialization. If you can get a method like this to work without that, it would have big implications for data poisoning. I think it's not possible, but someone should look into it more.

Jyo Pari (@jyo_pari):

We have a fun collaboration of GPU MODE x Scale ML coming up!

We're hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what's currently in GPT-OSS!

For example, how can MoEs power…
Adam Zweiger (@adamzweiger):

Interesting work showing concretely why on-policy RL forgets less. It's not quite because of "sparse updates"; rather, it's that RL maintains a smaller KL to the base model.
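The quantity in question is just the KL divergence between the fine-tuned policy and the base model, averaged over tokens. A sketch of how one would measure it (shapes and names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def kl_to_base(policy_logits, base_logits):
    """Mean per-token KL(policy || base), the quantity tied to
    forgetting. Both logit tensors: (batch, seq_len, vocab)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(base_logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
    return (logp.exp() * (logp - logq)).sum(-1).mean()

policy = torch.randn(2, 8, 100)
base = torch.randn(2, 8, 100)
print(kl_to_base(policy, base))
```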

Zitong Yang (@zitongyang0):

📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining

SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch. 🧵
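The thread only gives the high-level recipe, but the first step, mining correlated document pairs, is easy to illustrate. A runnable toy using bag-of-words cosine as a stand-in for whatever similarity measure the paper actually uses:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def mine_related_pairs(docs, threshold=0.3):
    """Step 1 of an SBP-style pipeline: find correlated document pairs
    that could supervise doc-to-doc synthesis. Bag-of-words cosine is
    a toy stand-in; a real pipeline would use learned embeddings."""
    bows = [Counter(d.lower().split()) for d in docs]
    return [(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)
            if cosine(bows[i], bows[j]) >= threshold]

docs = ["gradient descent minimizes a loss",
        "stochastic gradient descent minimizes loss with minibatches",
        "shrimp are crustaceans"]
for a, b in mine_related_pairs(docs):
    print(a, "<->", b)   # only the two related documents pair up
```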