rishi (@overquantized)'s Twitter Profile
rishi

@overquantized

ml research, physics, control theory

ID: 816668188993867777

Website: http://rishiiyer.com · Joined: 04-01-2017 15:30:44

1.1K Tweets

376 Followers

577 Following

rishi (@overquantized):

reach out if you want to work with me and others on novel architectures for pretraining! DMs are open: jobs.ashbyhq.com/zyphra/e509d43…

jianlin.su (@jianlin_s):

A fun fact: Adam remains the dominant optimizer today, yet even it has had only scant opportunities to be verified on trillion-parameter models; Muon, proposed less than a year ago, has already been used to train at that scale.
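The tweet name-drops Muon without describing it. For context, here is a minimal sketch of the publicly described Muon update: accumulate momentum, then approximately orthogonalize the 2D update matrix with a quintic Newton-Schulz iteration. The iteration coefficients follow the published writeup; the learning-rate and rectangular scaling choices here are assumptions, not a definitive implementation.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # i.e. maps it toward U V^T of its SVD. Coefficients are the ones given
    # in the public Muon writeup.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # Momentum accumulation, then orthogonalize the 2D update direction.
    buf.mul_(momentum).add_(grad)
    update = newton_schulz5(buf)
    # Rectangular scaling is one common choice; treat it as an assumption.
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```

Muon applies only to 2D weight matrices; embeddings, norms, and biases are typically still handled by Adam(W) alongside it.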

Zyphra (@zyphraai):

Zyphra is excited to release Compressed Convolutional Attention (CCA), a novel attention mechanism that:
- Beats MHA, GQA, and MLA for dense and MoE models
- Reduces training/prefill FLOPs
- Uses 3x fewer parameters than MHA
- Matches GQA/MLA KV-cache sizes without a quality penalty

<a href="/ZyphraAI/">Zyphra</a> is excited to release Compressed Convolutional Attention (CCA), a novel attention mechanism that:
- Beats MHA, GQA, MLA for dense and MoE models
- Reduces training/prefill flops
- 3x fewer parameters vs MHA
- Matches GQA/MLA KV-cache sizes without quality penalty
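The tweet lists properties rather than the mechanism, so the sketch below only illustrates the general idea the name suggests: form Q/K/V from a shared compressed latent mixed by a short causal convolution, and run attention in that smaller space. Every dimension, layer, and design choice here is an assumption for illustration, not Zyphra's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedConvAttentionSketch(nn.Module):
    """Illustrative sketch only: attention in a shared compressed latent
    space, with a depthwise causal convolution mixing nearby tokens.
    All dimensions and structure are assumptions, not Zyphra's CCA."""

    def __init__(self, d_model=1024, d_compressed=256, n_heads=8, kernel=4):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_compressed // n_heads
        # One shared down-projection instead of three full d_model x d_model
        # maps: the rough source of a "fewer parameters than MHA" effect.
        self.down = nn.Linear(d_model, d_compressed, bias=False)
        # Short depthwise causal convolution over the sequence axis.
        self.conv = nn.Conv1d(d_compressed, d_compressed, kernel,
                              groups=d_compressed, padding=kernel - 1)
        self.q_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.k_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.v_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.up = nn.Linear(d_compressed, d_model, bias=False)

    def forward(self, x):                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        z = self.down(x)                      # compressed latent; caching it
                                              # gives an MLA-sized KV cache
        z = self.conv(z.transpose(1, 2))[..., :t].transpose(1, 2)

        def heads(u):
            return u.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = heads(self.q_proj(z)), heads(self.k_proj(z)), heads(self.v_proj(z))
        # Attention runs entirely in the compressed space, which is where
        # a training/prefill FLOP reduction would come from.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.up(o.transpose(1, 2).reshape(b, t, -1))
```

This is directional at best; the released paper is the authority on how CCA actually composes compression and convolution.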
Beren Millidge (@berenmillidge):

This one was super fun! For the first time we managed to directly reduce training and prefill attention FLOPs. CCA is the best of both worlds: matching MLA decode speed while substantially outperforming everything in perplexity and prefill speed.
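The decode-speed claim makes sense because decoding is largely bound by how many KV-cache bytes each new token must read, so a compressed cache translates almost directly into faster decode. A back-of-envelope comparison, with every model dimension assumed purely for illustration:

```python
# Per-token KV-cache bytes read at each decode step, for an assumed
# 32-layer model with d_model=4096 in fp16 (2 bytes/element).
layers, d_model, bytes_per = 32, 4096, 2

mha_kv    = layers * 2 * d_model          # full K and V per layer
gqa_kv    = layers * 2 * (d_model // 8)   # assuming 8x fewer KV heads
latent_kv = layers * 512                  # one assumed compressed latent/layer

for name, elems in [("MHA", mha_kv), ("GQA", gqa_kv), ("compressed", latent_kv)]:
    print(f"{name}: {elems * bytes_per / 1024:.0f} KiB read per token")
# MHA: 512 KiB, GQA: 64 KiB, compressed: 32 KiB under these assumptions --
# an order-of-magnitude less memory traffic per decoded token.
```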