Stas Bekman (@stasbekman)'s Twitter Profile
Stas Bekman

@stasbekman

Toolmaker. Software creator, optimizer and harmonizer.

Makes ML systems work and fly @ Snowflake.

ID: 1068360975898660864

Link: https://stasosphere.com/machine-learning/ · Joined: 30-11-2018 04:28:00

2.2K Tweets

8.8K Followers

282 Following

Stas Bekman (@stasbekman)'s Twitter Profile Photo

Some time back I asked the DeepSpeed team to add torch.compile support - they did, and went beyond that by creating DeepCompile, which can now massively speed up your training workloads - check out those plots! Wow! From talking to the developers, this is just the beginning, and more

Stas Bekman (@stasbekman)'s Twitter Profile Photo

Very long context models are coming, e.g. NVIDIA's UltraLong 4M-token series: huggingface.co/nvidia/Llama-3… But how do you finetune for such long sequence lengths? Soon we will post working code that can finetune with multi-million-token sequence lengths for HF Transformers.

Stas Bekman (@stasbekman)'s Twitter Profile Photo

Modern art. 

Artist: PyTorch memory profiler

Model: Llama-8B

The piece on the left is the Forward pass 

The piece on the right is the Backward pass
Stas Bekman (@stasbekman)'s Twitter Profile Photo

Have you figured out how to estimate FLOPs for Flash Attention 2 w/ packed samples?

The formula the paper gives leads to about 2-3x what it should be

Sections 4.1 and 4.2 of the FA2 paper can't decide what the right formula should be :( it suggests 14x in 4.1 and 6x or
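Whatever the right per-layer constant turns out to be, one part is uncontroversial: attention cost grows with the square of each segment's length, so packed samples must be counted as the sum of the per-sample s_i^2, not (sum of s_i)^2. A back-of-envelope sketch (the function name and the 4*s^2*d matmul count per head are my assumptions, not the FA2 paper's formula):

```python
def attn_flops(seqlens, head_dim, n_heads, causal=True):
    """Rough matmul FLOPs for one attention layer over packed samples.

    Each packed segment of length s contributes two s x s matmuls
    (QK^T and PV), i.e. ~4 * s^2 * head_dim * n_heads FLOPs; a causal
    mask roughly halves that. The key point for packing: cost scales
    with sum(s_i^2), not (sum(s_i))^2.
    """
    total = sum(4 * s * s * head_dim * n_heads for s in seqlens)
    return total // 2 if causal else total

# Two packed 2k-token samples cost half of one 4k-token sample:
packed = attn_flops([2048, 2048], head_dim=128, n_heads=32)
single = attn_flops([4096], head_dim=128, n_heads=32)
assert single == 2 * packed
```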
Stas Bekman (@stasbekman)'s Twitter Profile Photo

I have just realized I have never mentioned github.com/stas00/make-to… When I need to release a new package I just type: make release and have it bump the version, update CHANGES.md, tag the release, start a new dev branch, commit all that, and build the pip/conda
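The real recipe lives at the truncated link above; purely as an illustration of the steps such a target chains together (every name and file below is hypothetical, not the actual make-tools code):

```shell
#!/bin/sh
# Hypothetical release helper - illustrative only, not the make-tools recipe.

bump_patch() {  # "1.2.3" -> "1.2.4"
    major=${1%%.*}; rest=${1#*.}
    minor=${rest%%.*}; patch=${rest#*.}
    echo "$major.$minor.$((patch + 1))"
}

release() {
    old=$(cat VERSION)
    new=$(bump_patch "$old")
    echo "$new" > VERSION
    # prepend a new section to the changelog
    printf '## %s\n\n' "$new" | cat - CHANGES.md > CHANGES.tmp && mv CHANGES.tmp CHANGES.md
    git commit -am "release $new"     # commit version bump + changelog
    git tag "v$new"                   # tag the release
    git checkout -b "dev-$new"        # start the next dev branch
    python -m build                   # build the pip sdist/wheel
}
```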

Dwarak Rajagopal (@dwarak)'s Twitter Profile Photo

Snowflake AI Research team is on fire! 🔥 Thrilled for our breakthroughs across embeddings, inference, & SQL generation - pioneering practical research that directly tackles critical real-world challenges for enterprise users! #AI #SnowflakeAI

Stas Bekman (@stasbekman)'s Twitter Profile Photo

In inference you usually get either high throughput or low latency, but not both - enter shift parallelism, which automatically adapts for the best performance!