Aryan (@aryanvs_)'s Twitter Profile
Aryan

@aryanvs_

latent explorer @huggingface

ID: 1722556084462714880

Joined: 09-11-2023 10:06:04

500 Tweets

758 Followers

1.1K Following

Aryan (@aryanvs_)'s Twitter Profile Photo

No wizardry required here... Apply context parallel on compute-bound models for easy speedups. Oh, and don't forget flash attention 3!

Further, you can compile the model and set a few inductor flags for near-perfect compute-communication overlap.

Don't want to use compile?
-
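
As a rough illustration of the compile-plus-inductor-flags part, here is a minimal sketch. The flag and pass names match recent PyTorch builds but are assumptions that may shift between releases, and the model is a toy stand-in:

```python
import torch
import torch.nn as nn

# Sketch only: inductor flags that encourage overlapping collectives with
# compute when the model is compiled. Pass names follow recent PyTorch
# releases and may change; verify against your installed version.
torch._inductor.config.reorder_for_compute_comm_overlap = True
torch._inductor.config.reorder_for_compute_comm_overlap_passes = [
    "sink_waits",                   # delay collective waits as long as possible
    "raise_comms",                  # launch collectives as early as possible
    "reorder_compute_for_overlap",  # schedule independent compute in between
]

# Toy stand-in for a compute-bound transformer block.
model = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
compiled = torch.compile(model, mode="max-autotune")
```
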
Aryan (@aryanvs_)'s Twitter Profile Photo

sometimes i remember how trying to understand diffusion models breaks my brain. want to do fewer steps? just use a better solver, cowboy. RK4 turbo deluxe XL+++. same quality. yeehaw. use a more compressed latent space? no worries mate, text still renders. we drawing thems
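
For the "better solver" bit, one classic RK4 step over a probability-flow ODE looks roughly like this; `velocity` is a toy stand-in for a model's learned drift, not any real sampler's API:

```python
import torch

# Hedged sketch: one classic RK4 step for an ODE sampler, the kind of
# "better solver" the rant name-drops. `velocity` is a toy stand-in.
def velocity(x: torch.Tensor, t: float) -> torch.Tensor:
    return -x * t

def rk4_step(x: torch.Tensor, t: float, dt: float) -> torch.Tensor:
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(x + dt * k3, t + dt)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

x = torch.randn(1, 4, 64, 64)    # toy latent
x = rk4_step(x, t=1.0, dt=-0.1)  # one step, integrating from t=1 toward 0
```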

Aryan (@aryanvs_)'s Twitter Profile Photo

Actually, that just comes from the warmup part of pytorch's cudagraph example. A more useful example can be found here: gist.github.com/a-r-r-o-w/d34c…

Since Flux has a dual stream architecture for text and image tokens, the computation can be parallelized using streams. However,
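
A minimal sketch of that streams idea, with toy linear layers standing in for Flux's text and image branches (requires a CUDA device):

```python
import torch

# Two side streams so the independent text and image branches can execute
# concurrently. The branch modules are hypothetical stand-ins.
s_text, s_img = torch.cuda.Stream(), torch.cuda.Stream()

text_branch = torch.nn.Linear(1024, 1024).cuda()
img_branch = torch.nn.Linear(1024, 1024).cuda()
text_tokens = torch.randn(1, 256, 1024, device="cuda")
img_tokens = torch.randn(1, 4096, 1024, device="cuda")

torch.cuda.synchronize()  # make sure inputs are ready before forking streams
with torch.cuda.stream(s_text):
    text_out = text_branch(text_tokens)
with torch.cuda.stream(s_img):
    img_out = img_branch(img_tokens)
torch.cuda.synchronize()  # join both branches before any fused attention
```
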
Aryan (@aryanvs_)'s Twitter Profile Photo

do yall not read literally every cool repository you come across???

(sorry for the rant that follows but using this post as an answer to DMs and some emails i got yesterday)

firstly, im a complete noob at performance optimization. you might think optimizing Wan or Flux or
Aryan (@aryanvs_)'s Twitter Profile Photo

a lot of pytorch training code i see in diffusion/llm/rl world makes use of compilation for speedup. but they all kinda just do it for the forward pass. you can squeeze in some more speedups by compiling the backward as well:

- loss.backward()
+ torch.compile(lambda:
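
Filled out into a runnable sketch of that diff (the lambda wrapper is the pattern from the tweet; the full benefit on recent builds also involves compiled autograd, which the later tweet covers):

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 512)
loss = model(x).square().mean()

# instead of: loss.backward()
torch.compile(lambda: loss.backward())()
opt.step()
```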

Adithya S K (@adithya_s_k)'s Twitter Profile Photo

I want to rephrase this a bit here. You can do all these things, but not everything at 100%. It's been especially hard to work on a product and do research at the same time. Research is very demanding; there are no fixed time slots where you can just sit down, finish, and move

Aryan (@aryanvs_)'s Twitter Profile Photo

really gotta start learning mandarin. the chinese bros i've interacted with helped with research replication, unfiltered advice and opinions about doing a phd, connecting to other folks and professors, explaining their paper/work, and are also somehow really really humble and

Aryan (@aryanvs_)'s Twitter Profile Photo

being quizzed and asked questions by professors and their phd students is a nice reminder of how out of touch one can be with the theory behind their work. it's one thing to implement something and have a decent grasp of it. it's completely another thing to *actually* understand all

Aryan (@aryanvs_)'s Twitter Profile Photo

inference profit margins are so damn insane for image/video models. with 1.5 (0.8) seconds of latency per image using Flux and 2.1 (1.4) seconds using Qwen (outside the parentheses is latency with lossless quality optimization; inside is lossy), you need 4x H100. each image roughly costs
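
A back-of-envelope version of that margin math; the GPU price below is an assumption for illustration, not a figure from the thread:

```python
# All prices assumed: 4x H100 at a hypothetical $2.50/GPU-hour,
# 1.5 s latency per Flux image from the tweet.
node_cost_per_hour = 4 * 2.50           # $/hour for the 4-GPU node
images_per_hour = 3600 / 1.5            # one image every 1.5 seconds
cost_per_image = node_cost_per_hour / images_per_hour
print(f"${cost_per_image:.4f} per image")  # ~$0.0042 under these assumptions
```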

Tiezhen WANG (@xianbao_qian)'s Twitter Profile Photo

ByteDance Seed OSS is now available on Hugging Face!!!

- 36B, Apache 2.0
- Great for ablation research, including Base, Base without synthetic instruction data, and instruction variants
- Great performance on reasoning & agents
- Native 512K long context
- Flexible thinking budget
Aryan (@aryanvs_)'s Twitter Profile Photo

looking at some world model inference code running on h100 - it seems to be idle a lot when not performing any actions. as the model only supports 4-10 actions, computing the future output of all possible actions (up to some depth) in idle cycles might be a feasible strategy for
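
A toy sketch of that idle-cycle strategy; the model, state size, and action count are all hypothetical stand-ins:

```python
import torch

# Batch every candidate action into one forward pass while the model is
# idle, so the action the user actually takes is already computed.
NUM_ACTIONS = 8
model = torch.nn.Linear(256 + NUM_ACTIONS, 256)  # stand-in dynamics model
state = torch.randn(1, 256)

def precompute_all_actions(state: torch.Tensor) -> torch.Tensor:
    actions = torch.eye(NUM_ACTIONS)  # one-hot encoding of each action
    batched = torch.cat([state.expand(NUM_ACTIONS, -1), actions], dim=-1)
    with torch.no_grad():
        return model(batched)  # [NUM_ACTIONS, 256] predicted next states

cache = precompute_all_actions(state)
chosen = 3                  # whichever action actually arrives
next_state = cache[chosen]  # served from cache with no extra latency
```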

Aryan (@aryanvs_)'s Twitter Profile Photo

if together dot ai stopped publishing their latest advancements, the posts about "we made model [x] run [y] times faster" from a lot of inference companies (who don't really innovate anything) would cease to exist lol. easy to take pytorch and apply all existing ideas and

Aryan (@aryanvs_)'s Twitter Profile Photo

Okay, compiled autograd is bonkers 

About 5.2 minutes to train Flux* for 1000 steps on 1x A100 (yes yes, the images have compilation overhead included atm because I'm too lazy to do more runs today). Resolution is 512px in order to keep comparisons equivalent to paid services (will
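
For reference, a minimal sketch of switching on compiled autograd; the config flag matches recent PyTorch builds but should be treated as an assumption to check against your installed version:

```python
import torch

# Assumption: this flag exists on your PyTorch build (2.4+-era API).
torch._dynamo.config.compiled_autograd = True

model = torch.compile(torch.nn.Linear(512, 512))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):  # tiny stand-in for the 1000-step run
    x = torch.randn(8, 512)
    loss = model(x).square().mean()
    loss.backward()     # backward graph captured by compiled autograd
    opt.step()
    opt.zero_grad()
```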