Seba (@culstory): tweets

Seba

8 months ago

been playing with 2 amd mi50s, okay performance with sglang. qwen3-8B: ~60 t/s gen with tp2 at small ctx (~900 GB/s bw reached), ~50 t/s at 4k ctx. qwen3-4B on a single gpu: ~80 t/s (~600 GB/s bw). individual kernel bws:
- up+gate matmul kernels: ~850 GB/s
- down matmul: ~800 GB/s
- qkv: ~870 GB/s
- o: ~800 GB/s
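Bandwidth figures like these are typically estimated with a back-of-envelope rule: at batch size 1, every weight is read once per generated token, so bytes/s ≈ weight bytes × tokens/s. A minimal sketch, assuming bf16 weights (2 bytes per parameter), a rough 4.0B parameter count for qwen3-4B, weights split evenly across tensor-parallel ranks, and no KV-cache or activation traffic (the function name is hypothetical):

```python
def decode_bandwidth_gbps(n_params, tokens_per_s, bytes_per_param=2.0, tp=1):
    """Per-GPU weight bandwidth (GB/s, 1 GB = 1e9 bytes) implied by a decode rate.

    Assumes batch size 1, each weight read once per generated token, and weights
    split evenly across `tp` GPUs; KV-cache and activation reads are ignored.
    """
    bytes_per_token = n_params * bytes_per_param  # all weights streamed once per token
    return bytes_per_token * tokens_per_s / tp / 1e9


# qwen3-4B on one GPU at ~80 t/s (parameter count is an approximation)
single = decode_bandwidth_gbps(4.0e9, 80)
print(f"{single:.0f} GB/s")  # prints "640 GB/s"
```

That is in the same ballpark as the ~600 GB/s quoted above. Passing `tp=2` gives a per-GPU figure for a tensor-parallel run; multiply by `tp` for the aggregate.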

2 likes · 0 replies · 0 reposts

Seba

@culstory

8 months ago

feel like clipping could be used everywhere to make activations quantization-friendly
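A minimal sketch of the trade-off, using symmetric fake quantization in pure Python. The 4-bit width, the synthetic activations (an even spread over [-1, 1] plus one outlier at 8.0), and the hand-picked clip threshold of 1.0 are illustrative assumptions; in practice the threshold is usually picked per tensor, e.g. from a percentile or by minimizing error on calibration data.

```python
def fake_quant(xs, clip, bits=4):
    """Symmetric fake quantization: clamp to [-clip, clip], snap to an integer grid, dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 7 levels per side for 4-bit, 127 for 8-bit
    scale = clip / qmax
    return [round(max(-clip, min(clip, x)) / scale) * scale for x in xs]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical activations: 1000 values spread evenly over [-1, 1] plus one outlier.
acts = [-1 + 2 * i / 999 for i in range(1000)] + [8.0]

absmax = max(abs(x) for x in acts)                # 8.0: the outlier sets the scale
err_absmax = mse(acts, fake_quant(acts, absmax))
err_clipped = mse(acts, fake_quant(acts, 1.0))    # outlier saturates at 1.0

print(f"absmax scale: {err_absmax:.4f}  clipped: {err_clipped:.4f}")
```

With the absmax scale most of the bulk rounds to 0 or ±8/7; clipping gives up the outlier (a squared error of 49 on that one element) in exchange for an 8× finer grid on everything else. Whether clipping wins depends on how many outliers there are and how large they are.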

0 likes · 0 replies · 0 reposts

Seba

@culstory

7 months ago

twitter perception of openai

0 likes · 0 replies · 0 reposts

Seba

@culstory

7 months ago

zoom out

0 likes · 0 replies · 0 reposts

Seba

@culstory

7 months ago

the curse of causal masks

0 likes · 0 replies · 0 reposts

Seba

@culstory

7 months ago

interesting pair of papers yesterday: bytedance chads with UltraMemV2 (arxiv.org/abs/2508.18756), and Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks (arxiv.org/abs/2508.18672)

2 likes · 0 replies · 1 repost

Seba

@culstory

6 months ago

lots of tts activity

0 likes · 0 replies · 0 reposts

Seba

@culstory

6 months ago

i think i’ll make a sign

1 like · 1 reply · 0 reposts

Seba

@culstory

5 months ago

with language diffusion model research slowly catching up, my biggest hunch is that a param-heavy encoder plus a small-weight, flop-heavy decoder would fit current consumer hw really well. waiting eagerly for what neurips.cc/virtual/2025/p… has to show us
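One way to read this hunch is through arithmetic intensity (the roofline model). A dense layer applied to T tokens in one pass does about 2·T FLOPs per parameter while reading each weight once. Autoregressive decoding (T = 1) is therefore bandwidth-bound, whereas a decoder that denoises many tokens per pass can become compute-bound. A minimal sketch, assuming bf16 weights and ignoring activation/KV traffic; the 25 TFLOPS / 1000 GB/s device and the 64-token block are hypothetical placeholders, not measurements:

```python
def weight_intensity(tokens_per_pass, bytes_per_param=2.0):
    """FLOPs per byte of weight traffic for a dense layer over `tokens_per_pass` tokens.

    Each parameter does one multiply-add (2 FLOPs) per token and is read once per pass.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def ridge_point(peak_tflops, bandwidth_gbps):
    """Roofline ridge: intensity (FLOPs/byte) above which a kernel is compute-bound."""
    return (peak_tflops * 1e12) / (bandwidth_gbps * 1e9)


ridge = ridge_point(25.0, 1000.0)  # hypothetical consumer-class GPU
for name, tokens in [("autoregressive decode", 1), ("parallel denoising block", 64)]:
    ai = weight_intensity(tokens)
    regime = "compute-bound" if ai >= ridge else "bandwidth-bound"
    print(f"{name}: {ai:.0f} FLOP/byte vs ridge {ridge:.0f} -> {regime}")
```

Under these assumptions, small decoder weights keep the per-pass memory traffic low, while the repeated denoising passes spend FLOPs that would otherwise sit idle on bandwidth-limited hardware.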

0 likes · 0 replies · 0 reposts