Aryan (@aryanvs_)'s Twitter Profile
Aryan

@aryanvs_

latent explorer @huggingface

ID: 1722556084462714880

Joined: 09-11-2023 10:06:04

500 Tweets

758 Followers

1.1K Following

Aryan (@aryanvs_)'s Twitter Profile Photo

No wizardry required here... Apply context parallel on compute-bound models for easy speedups. Oh, and don't forget flash attention 3!

Further, you can compile the model and set a few inductor flags for near-perfect compute-communication overlap.

Don't want to use compile?
-
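
As a rough illustration of the compile-plus-inductor-flags part, here is a minimal sketch. The flag and pass names match recent PyTorch builds but are assumptions that may shift between releases, and the model is a toy stand-in:

```python
import torch
import torch.nn as nn

# Sketch only: inductor flags that encourage overlapping collectives with
# compute when the model is compiled. Pass names follow recent PyTorch
# releases and may change; verify against your installed version.
torch._inductor.config.reorder_for_compute_comm_overlap = True
torch._inductor.config.reorder_for_compute_comm_overlap_passes = [
    "sink_waits",                   # delay collective waits as long as possible
    "raise_comms",                  # launch collectives as early as possible
    "reorder_compute_for_overlap",  # schedule independent compute in between
]

# Toy stand-in for a compute-bound transformer block.
model = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)
compiled = torch.compile(model, mode="max-autotune")
```
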
Aryan (@aryanvs_)'s Twitter Profile Photo

sometimes i remember how trying to understand diffusion models breaks my brain. want to do fewer steps? just use a better solver, cowboy. RK4 turbo deluxe XL+++. same quality. yeehaw. use a more compressed latent space? no worries mate, text still renders. we drawing thems
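
For the "better solver" bit, one classic RK4 step over a probability-flow ODE looks roughly like this; `velocity` is a toy stand-in for a model's learned drift, not any real sampler's API:

```python
import torch

# Hedged sketch: one classic RK4 step for an ODE sampler, the kind of
# "better solver" the rant name-drops. `velocity` is a toy stand-in.
def velocity(x: torch.Tensor, t: float) -> torch.Tensor:
    return -x * t

def rk4_step(x: torch.Tensor, t: float, dt: float) -> torch.Tensor:
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
    k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
    k4 = velocity(x + dt * k3, t + dt)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

x = torch.randn(1, 4, 64, 64)    # toy latent
x = rk4_step(x, t=1.0, dt=-0.1)  # one step, integrating from t=1 toward 0
```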

Aryan (@aryanvs_)'s Twitter Profile Photo

Actually, that just comes from the warmup part of pytorch's cudagraph example. A more useful example can be found here: gist.github.com/a-r-r-o-w/d34c…

Since Flux has a dual stream architecture for text and image tokens, the computation can be parallelized using streams. However,
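
A minimal sketch of that streams idea, with toy linear layers standing in for Flux's text and image branches (requires a CUDA device):

```python
import torch

# Two side streams so the independent text and image branches can execute
# concurrently. The branch modules are hypothetical stand-ins.
s_text, s_img = torch.cuda.Stream(), torch.cuda.Stream()

text_branch = torch.nn.Linear(1024, 1024).cuda()
img_branch = torch.nn.Linear(1024, 1024).cuda()
text_tokens = torch.randn(1, 256, 1024, device="cuda")
img_tokens = torch.randn(1, 4096, 1024, device="cuda")

torch.cuda.synchronize()  # make sure inputs are ready before forking streams
with torch.cuda.stream(s_text):
    text_out = text_branch(text_tokens)
with torch.cuda.stream(s_img):
    img_out = img_branch(img_tokens)
torch.cuda.synchronize()  # join both branches before any fused attention
```
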
Aryan (@aryanvs_)'s Twitter Profile Photo

do yall not read literally every cool repository you come across???

(sorry for the rant that follows but using this post as an answer to DMs and some emails i got yesterday)

firstly, im a complete noob at performance optimization. you might think optimizing Wan or Flux or
Aryan (@aryanvs_)'s Twitter Profile Photo

a lot of pytorch training code i see in diffusion/llm/rl world makes use of compilation for speedup. but they all kinda just do it for the forward pass. you can squeeze in some more speedups by compiling the backward as well:

- loss.backward()
+ torch.compile(lambda:
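
Filled out into a runnable sketch of that diff (the lambda wrapper is the pattern from the tweet; the full benefit on recent builds also involves compiled autograd, which the later tweet covers):

```python
import torch

model = torch.nn.Linear(512, 512)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 512)
loss = model(x).square().mean()

# instead of: loss.backward()
torch.compile(lambda: loss.backward())()
opt.step()
```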

Adithya S K (@adithya_s_k)'s Twitter Profile Photo

I want to rephrase this a bit here. You can do all these things, but not everything at 100%. It's been especially hard to work on a product and do research at the same time. Research is very demanding; there are no fixed time slots where you can just sit down, finish, and move

Aryan (@aryanvs_)'s Twitter Profile Photo

really gotta start learning mandarin. the chinese bros i've interacted with helped with research replication, unfiltered advice and opinions about doing a phd, connecting to other folks and professors, explaining their paper/work, and are also somehow really really humble and

Aryan (@aryanvs_)'s Twitter Profile Photo

being quizzed and asked questions by professors and their phd students is a nice reminder of how out of touch one can be with the theory behind their work. it's one thing to implement something and have a decent grasp of it. it's completely another thing to *actually* understand all

Aryan (@aryanvs_)'s Twitter Profile Photo

inference profit margins are so damn insane for image/video models. with 1.5 (0.8) seconds of latency per image using Flux and 2.1 (1.4) seconds using Qwen (outside the parentheses is latency with lossless quality optimization; inside is lossy), you need 4x H100. each image roughly costs
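
A back-of-envelope version of that margin math; the GPU price below is an assumption for illustration, not a figure from the thread:

```python
# All prices assumed: 4x H100 at a hypothetical $2.50/GPU-hour,
# 1.5 s latency per Flux image from the tweet.
node_cost_per_hour = 4 * 2.50           # $/hour for the 4-GPU node
images_per_hour = 3600 / 1.5            # one image every 1.5 seconds
cost_per_image = node_cost_per_hour / images_per_hour
print(f"${cost_per_image:.4f} per image")  # ~$0.0042 under these assumptions
```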

Tiezhen WANG (@xianbao_qian)'s Twitter Profile Photo

ByteDance Seed OSS is now available on Hugging Face!!!

- 36B, Apache 2.0
- Great for ablation research, including Base, Base without synthetic instruction data, and instruction variants
- Great performance on reasoning & agents
- Native 512K long context
- Flexible thinking budget
Aryan (@aryanvs_)'s Twitter Profile Photo

looking at some world model inference code running on h100 - it seems to be idle a lot when not performing any actions. as the model only supports 4-10 actions, computing the future output of all possible actions (up to some depth) in idle cycles might be a feasible strategy for
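
A toy sketch of that idle-cycle strategy; the model, state size, and action count are all hypothetical stand-ins:

```python
import torch

# Batch every candidate action into one forward pass while the model is
# idle, so the action the user actually takes is already computed.
NUM_ACTIONS = 8
model = torch.nn.Linear(256 + NUM_ACTIONS, 256)  # stand-in dynamics model
state = torch.randn(1, 256)

def precompute_all_actions(state: torch.Tensor) -> torch.Tensor:
    actions = torch.eye(NUM_ACTIONS)  # one-hot encoding of each action
    batched = torch.cat([state.expand(NUM_ACTIONS, -1), actions], dim=-1)
    with torch.no_grad():
        return model(batched)  # [NUM_ACTIONS, 256] predicted next states

cache = precompute_all_actions(state)
chosen = 3                  # whichever action actually arrives
next_state = cache[chosen]  # served from cache with no extra latency
```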

Aryan (@aryanvs_)'s Twitter Profile Photo

if together dot ai stopped publishing their latest advancements, the posts about "we made model [x] run [y] times faster" from a lot of inference companies (who don't really innovate anything) would cease to exist lol. easy to take pytorch and apply all existing ideas and

Aryan (@aryanvs_)'s Twitter Profile Photo

Okay, compiled autograd is bonkers 

About 5.2 minutes to train Flux* for 1000 steps on 1x A100 (yes yes, the images have compilation overhead included atm because I'm too lazy to do more runs today). Resolution is 512px in order to keep comparisons equivalent to paid services (will
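
For reference, a minimal sketch of switching on compiled autograd; the config flag matches recent PyTorch builds but should be treated as an assumption to check against your installed version:

```python
import torch

# Assumption: this flag exists on your PyTorch build (2.4+-era API).
torch._dynamo.config.compiled_autograd = True

model = torch.compile(torch.nn.Linear(512, 512))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):  # tiny stand-in for the 1000-step run
    x = torch.randn(8, 512)
    loss = model(x).square().mean()
    loss.backward()     # backward graph captured by compiled autograd
    opt.step()
    opt.zero_grad()
```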