rishi (@overquantized)'s Twitter Profile
rishi

@overquantized

ml research, physics, control theory

ID: 816668188993867777

Website: http://rishiiyer.com · Joined: 04-01-2017 15:30:44

1.1K Tweets

376 Followers

577 Following

rishi (@overquantized):

reach out if you want to work with me and others on novel architectures for pretraining! DMs are open: jobs.ashbyhq.com/zyphra/e509d43…

jianlin.su (@jianlin_s):

A fun fact: Adam remains the dominant optimizer today, yet even it has had only scant opportunities to be verified on trillion-parameter models; Muon, proposed less than a year ago, has already been used to train at that scale.
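The tweet name-drops Muon without describing it. For context, here is a minimal sketch of the publicly described Muon update: accumulate momentum, then approximately orthogonalize the 2D update matrix with a quintic Newton-Schulz iteration. The iteration coefficients follow the published writeup; the learning-rate and rectangular scaling choices here are assumptions, not a definitive implementation.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that approximately orthogonalizes G,
    # i.e. maps it toward U V^T of its SVD. Coefficients are the ones given
    # in the public Muon writeup.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = X.size(0) > X.size(1)
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transpose else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # Momentum accumulation, then orthogonalize the 2D update direction.
    buf.mul_(momentum).add_(grad)
    update = newton_schulz5(buf)
    # Rectangular scaling is one common choice; treat it as an assumption.
    scale = max(1.0, param.size(0) / param.size(1)) ** 0.5
    param.add_(update, alpha=-lr * scale)
```

Muon applies only to 2D weight matrices; embeddings, norms, and biases are typically still handled by Adam(W) alongside it.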

Zyphra (@zyphraai):

Zyphra is excited to release Compressed Convolutional Attention (CCA), a novel attention mechanism that:
- Beats MHA, GQA, and MLA for dense and MoE models
- Reduces training/prefill FLOPs
- Uses 3x fewer parameters than MHA
- Matches GQA/MLA KV-cache sizes without a quality penalty

<a href="/ZyphraAI/">Zyphra</a> is excited to release Compressed Convolutional Attention (CCA), a novel attention mechanism that:
- Beats MHA, GQA, MLA for dense and MoE models
- Reduces training/prefill flops
- 3x fewer parameters vs MHA
- Matches GQA/MLA KV-cache sizes without quality penalty
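The tweet lists properties rather than the mechanism, so the sketch below only illustrates the general idea the name suggests: form Q/K/V from a shared compressed latent mixed by a short causal convolution, and run attention in that smaller space. Every dimension, layer, and design choice here is an assumption for illustration, not Zyphra's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressedConvAttentionSketch(nn.Module):
    """Illustrative sketch only: attention in a shared compressed latent
    space, with a depthwise causal convolution mixing nearby tokens.
    All dimensions and structure are assumptions, not Zyphra's CCA."""

    def __init__(self, d_model=1024, d_compressed=256, n_heads=8, kernel=4):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_compressed // n_heads
        # One shared down-projection instead of three full d_model x d_model
        # maps: the rough source of a "fewer parameters than MHA" effect.
        self.down = nn.Linear(d_model, d_compressed, bias=False)
        # Short depthwise causal convolution over the sequence axis.
        self.conv = nn.Conv1d(d_compressed, d_compressed, kernel,
                              groups=d_compressed, padding=kernel - 1)
        self.q_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.k_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.v_proj = nn.Linear(d_compressed, d_compressed, bias=False)
        self.up = nn.Linear(d_compressed, d_model, bias=False)

    def forward(self, x):                     # x: (batch, seq, d_model)
        b, t, _ = x.shape
        z = self.down(x)                      # compressed latent; caching it
                                              # gives an MLA-sized KV cache
        z = self.conv(z.transpose(1, 2))[..., :t].transpose(1, 2)

        def heads(u):
            return u.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = heads(self.q_proj(z)), heads(self.k_proj(z)), heads(self.v_proj(z))
        # Attention runs entirely in the compressed space, which is where
        # a training/prefill FLOP reduction would come from.
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.up(o.transpose(1, 2).reshape(b, t, -1))
```

This is directional at best; the released paper is the authority on how CCA actually composes compression and convolution.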
Beren Millidge (@berenmillidge):

This one was super fun! For the first time we managed to directly reduce training and prefill attention FLOPs. CCA is the best of both worlds: matching MLA decode speed while substantially outperforming everything in perplexity and prefill speed.
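The decode-speed claim makes sense because decoding is largely bound by how many KV-cache bytes each new token must read, so a compressed cache translates almost directly into faster decode. A back-of-envelope comparison, with every model dimension assumed purely for illustration:

```python
# Per-token KV-cache bytes read at each decode step, for an assumed
# 32-layer model with d_model=4096 in fp16 (2 bytes/element).
layers, d_model, bytes_per = 32, 4096, 2

mha_kv    = layers * 2 * d_model          # full K and V per layer
gqa_kv    = layers * 2 * (d_model // 8)   # assuming 8x fewer KV heads
latent_kv = layers * 512                  # one assumed compressed latent/layer

for name, elems in [("MHA", mha_kv), ("GQA", gqa_kv), ("compressed", latent_kv)]:
    print(f"{name}: {elems * bytes_per / 1024:.0f} KiB read per token")
# MHA: 512 KiB, GQA: 64 KiB, compressed: 32 KiB under these assumptions --
# an order-of-magnitude less memory traffic per decoded token.
```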