Adam Zweiger (@adamzweiger)'s Twitter Profile
Adam Zweiger

@adamzweiger

ID: 1571391036416724992

Joined: 18-09-2022 06:50:11

0 Tweets

14 Followers

197 Following

Adam Zweiger (@adamzweiger):

re Scott Alexander on the human analogue of LLM hallucination: There is simply no human equivalent to what GPT-4 does when asked "what does GRPO stand for" or even "What is the capital of France" with nothing else prompted in-context. The closest thing is someone…

Adam Zweiger (@adamzweiger):

Very excited to see what they've cooked up now. My out-there guess: use SSMs to "tokenize" text into fewer but more semantic chunks, and do attention over that. State still grows linearly (and compute quadratically), but with far fewer tokens and better expressivity for some domains.
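A minimal sketch of what that guess could look like, assuming a fixed chunk size and a GRU as a runnable stand-in for the SSM (both are my assumptions, not part of the tweet):

```python
import torch
import torch.nn as nn

class ChunkThenAttend(nn.Module):
    """Scan tokens with a recurrent encoder (stand-in for an SSM),
    pool each fixed-size chunk into one summary vector, then run
    full attention over the much shorter chunk sequence."""
    def __init__(self, d_model=256, chunk_size=16, n_heads=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.scanner = nn.GRU(d_model, d_model, batch_first=True)  # SSM stand-in
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h, _ = self.scanner(x)                 # linear-time scan over tokens
        # take the last hidden state of each chunk as its summary
        chunks = h[:, self.chunk_size - 1::self.chunk_size, :]  # (B, T//chunk, D)
        out, _ = self.attn(chunks, chunks, chunks)  # quadratic only in #chunks
        return out

x = torch.randn(2, 256, 256)
print(ChunkThenAttend()(x).shape)  # torch.Size([2, 16, 256])
```

With a chunk size of 16, attention runs over 16x fewer positions, so its quadratic cost drops by roughly 256x while the scan stays linear in tokens.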

Jyo Pari (@jyo_pari):

MoE routers are trained a bit strangely, but things seem to still work. minyoung huh (jacob) and I got curious about combining specialized experts at test time through routing… and ended up deep in the weeds of MoE optimization. Here's a blog post! jyopari.github.io/posts/peculiar…
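For context, a minimal sketch of the top-k routing pattern the post digs into; all sizes and names here are illustrative, not the blog's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Standard learned-router MoE layer: a linear gate scores every
    expert, the top-k experts run, and their outputs are mixed by the
    renormalized gate weights."""
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.gate(x)                   # (n_tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # dense loop; real MoEs dispatch sparsely
            for e in range(len(self.experts)):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

x = torch.randn(10, 64)
print(TopKRouter()(x).shape)  # torch.Size([10, 64])
```

One commonly noted quirk of this setup: gradients flow only through the softmax over the k selected experts, so the gate gets no direct learning signal about the experts it did not pick.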
Adam Zweiger (@adamzweiger):

Come check out our ICML poster on combining Test-Time Training and In-Context Learning for on-the-fly adaptation to novel tasks like ARC-AGI puzzles.

I will be presenting with Jyo Pari at E-2702, Tuesday 11-1:30!
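The poster itself isn't reproduced here, but the test-time-training half of the idea reduces to a simple loop: treat the in-context demonstrations as a tiny training set and adapt a throwaway copy of the model on them before answering. A toy sketch, where the model, loss, and step count are all placeholders:

```python
import copy
import torch
import torch.nn as nn

def test_time_train(model, demos, query, steps=20, lr=1e-3):
    """Adapt a copy of the model on a task's demonstration pairs,
    then predict the held-out query. The base model is untouched,
    since each task trains its own throwaway copy."""
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    adapted.train()
    for _ in range(steps):
        for x, y in demos:                 # in-context examples used as a training set
            opt.zero_grad()
            loss_fn(adapted(x), y).backward()
            opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(query)

model = nn.Linear(4, 4)
demos = [(torch.randn(4), torch.randn(4)) for _ in range(3)]
print(test_time_train(model, demos, torch.randn(4)))
```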
Adam Zweiger (@adamzweiger):

This is one of the highest-quality evals I've seen, and it's nice to see it expanding! I love how you can view each model-problem-run datapoint.

Adam Zweiger (@adamzweiger):

A mathematician can think about a single problem for a full decade (perhaps 100M+ tokens of reading/writing/thinking) before solving it. When will we reach that point with LLMs?
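A quick back-of-envelope check on that figure (all numbers are rough assumptions):

```python
# Sanity check: is 100M tokens over a decade plausible?
years, total_tokens = 10, 100e6
per_day = total_tokens / (years * 365)
print(f"{per_day:,.0f} tokens/day")  # ~27,397 tokens/day
# At roughly 0.75 words per token, that is about 20k words of reading,
# writing, and deliberate thought per day; plausible for sustained work
# on a single problem.
```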

Adam Zweiger (@adamzweiger):

The whole shrimp welfare thing is actually a great reductio ad absurdum for EA/rat. I suggest that if you are interested in that stuff, you focus your efforts on high-impact things that are more aligned with normal human intuition about morality.

Han Guo (@hanguo97):

Since our initial arXiv post, several concurrent papers have introduced new architectures with log-linear properties in various forms. Two personal favorites of mine (among others) are:

- Transformer-PSM by Morris Yau et al., and
- Radial Attention by Xingyang and Muyang Li et…

Adam Zweiger (@adamzweiger):

This is crazy! It makes more sense once you hear it requires both models to have the same initialization. If you can get a method like this to work without that, it would have big implications for data poisoning. I think it's not possible, but someone should look into it more.

Jyo Pari (@jyo_pari):

We have a fun collaboration of GPU MODE x Scale ML coming up!

We're hosting a week-long online bootcamp that explores the core components of GPT-OSS while also diving into cutting-edge research that pushes beyond what's currently in GPT-OSS!

For example, how can MoEs power…
Adam Zweiger (@adamzweiger):

Interesting work showing concretely why on-policy RL forgets less. It's not quite because of "sparse updates"; rather, it's that RL maintains a smaller KL to the base model.
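The quantity in question is just the KL divergence between the fine-tuned policy and the base model, averaged over tokens. A sketch of how one would measure it (shapes and names are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def kl_to_base(policy_logits, base_logits):
    """Mean per-token KL(policy || base), the quantity tied to
    forgetting. Both logit tensors: (batch, seq_len, vocab)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(base_logits, dim=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v))
    return (logp.exp() * (logp - logq)).sum(-1).mean()

policy = torch.randn(2, 8, 100)
base = torch.randn(2, 8, 100)
print(kl_to_base(policy, base))
```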

Zitong Yang (@zitongyang0):

📜 Paper on new pretraining paradigm: Synthetic Bootstrapped Pretraining

SBP goes beyond next-token supervision in a single document by leveraging inter-document correlations to synthesize new data for training — no teacher needed. Validation: 1T data + 3B model from scratch. 🧵
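The thread only gives the high-level recipe, but the first step, mining correlated document pairs, is easy to illustrate. A runnable toy using bag-of-words cosine as a stand-in for whatever similarity measure the paper actually uses:

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[w] * b[w] for w in a.keys() & b.keys())
    return num / (math.sqrt(sum(v * v for v in a.values())) *
                  math.sqrt(sum(v * v for v in b.values())))

def mine_related_pairs(docs, threshold=0.3):
    """Step 1 of an SBP-style pipeline: find correlated document pairs
    that could supervise doc-to-doc synthesis. Bag-of-words cosine is
    a toy stand-in; a real pipeline would use learned embeddings."""
    bows = [Counter(d.lower().split()) for d in docs]
    return [(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)
            if cosine(bows[i], bows[j]) >= threshold]

docs = ["gradient descent minimizes a loss",
        "stochastic gradient descent minimizes loss with minibatches",
        "shrimp are crustaceans"]
for a, b in mine_related_pairs(docs):
    print(a, "<->", b)   # only the two related documents pair up
```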