Abhi Venigalla

@abhi_venigalla

Researcher @Databricks. Former @MosaicML, @CerebrasSystems. Addicted to all things compute.

Joined 12-10-2018 04:27:26

609 Tweets

2.9K Followers

1.0K Following

Abhi Venigalla (@abhi_venigalla):

Ready for GPU independence weekend?

PyTorch 2.0 and LLM Foundry now work out of the box on **AMD GPUs!** We profiled MPT 1B-13B models on AMD MI250 and saw performance at about 80% of A100-40GB, which could rise to ~94% with better software.

It. Just. Works.

mosaicml.com/blog/amd-mi250
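[Editor's note: a quick aside on what "out of the box" means here. The ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda API, so device-selection code doesn't change. A minimal sketch, not from the thread, assuming a ROCm or CUDA PyTorch build:]

```python
# Minimal sketch: PyTorch's ROCm build answers to the torch.cuda API, so the
# usual device-selection code picks up an AMD MI250 with no edits.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# On a ROCm build, torch.version.hip is set and torch.version.cuda is None.
backend = f"ROCm {torch.version.hip}" if torch.version.hip else f"CUDA {torch.version.cuda}"
print("backend:", backend)
if device.type == "cuda":
    print("device:", torch.cuda.get_device_name(0))  # e.g. an AMD Instinct name on ROCm

# The same ops dispatch to rocBLAS/MIOpen on AMD, cuBLAS/cuDNN on NVIDIA.
model = torch.nn.Linear(4096, 4096).to(device=device, dtype=torch.bfloat16)
x = torch.randn(8, 4096, device=device, dtype=torch.bfloat16)
print(model(x).shape)  # torch.Size([8, 4096])
```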

Abhi Venigalla (@abhi_venigalla):

Here's MPT-1B training for 1B tokens on NVIDIA A100 (green) vs. AMD MI250 (red). Can you spot a difference?
Both runs use the exact same LLM Foundry code: github.com/mosaicml/llm-f…
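[Editor's note: the launch code isn't shown in the thread. Below is a hedged, toy sketch of the Composer Trainer pattern that LLM Foundry builds on, with a stand-in linear model instead of MPT-1B and an arbitrary seed; the point is that nothing in it names a GPU vendor:]

```python
# Toy sketch of a Composer training run (stand-in model, not MPT-1B).
# device="gpu" is vendor-neutral: it resolves to A100 or MI250 alike.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

model = ComposerClassifier(
    torch.nn.Sequential(torch.nn.Linear(64, 10)), num_classes=10
)
dataset = TensorDataset(torch.randn(256, 64), torch.randint(0, 10, (256,)))

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(dataset, batch_size=32),
    max_duration="100ba",  # 100 batches
    device="gpu",          # same setting on NVIDIA or AMD nodes
    precision="amp_bf16",
    seed=17,
)
trainer.fit()
```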

Abhi Venigalla (@abhi_venigalla):

If we zoom in on the first 100 batches, we get nearly overlapping loss curves. This is crazy given that the runs are on two totally different hardware stacks!

StreamingDataset and Composer do a lot of the heavy lifting for determinism in the dataloader and train loop.
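[Editor's note: neither knob is shown in the thread. A hedged sketch of the two pieces being credited, Composer's seeding utility plus StreamingDataset's seeded shuffle; the remote/local paths and seed values are placeholders:]

```python
# Hedged determinism sketch: seed every RNG once, and let StreamingDataset
# derive its global sample order from a fixed shuffle_seed rather than from
# filesystem or hardware state.
from composer.utils import reproducibility
from streaming import StreamingDataset
from torch.utils.data import DataLoader

reproducibility.seed_all(17)  # seeds Python, NumPy, and torch RNGs

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset-mds",  # placeholder remote store
    local="/tmp/my-dataset-mds",             # placeholder local cache
    shuffle=True,
    shuffle_seed=9176,  # fixes the sample order across clusters
)
loader = DataLoader(dataset, batch_size=8)
```

[With both seeds pinned, runs on different hardware see the same samples in the same order, which is why the loss curves can overlap batch for batch.]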

Abhi Venigalla (@abhi_venigalla):

What about perf? We only had 1 node of 4xMI250, so to compare with our 8xA100 systems we measured per-GPU metrics.

With no code changes, perf on MI250 looks really strong! About 80% of A100-40GB. Better FlashAttention support for AMD may close the gap (we predict ~94% of A100-40GB).
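[Editor's note: for context on how per-GPU comparisons like this are usually computed, model FLOPS utilization (MFU) divides achieved training FLOPS (roughly 6 × parameters × tokens/sec for a decoder-only model) by the chip's datasheet peak. A hedged sketch with placeholder token rates, not the blog's measurements:]

```python
# Hedged MFU sketch. 6 * n_params * tokens/sec approximates training FLOPS
# for a decoder-only LLM, ignoring the attention term.
def mfu(n_params: float, tokens_per_sec_per_gpu: float, peak_tflops: float) -> float:
    achieved_tflops = 6 * n_params * tokens_per_sec_per_gpu / 1e12
    return achieved_tflops / peak_tflops

# Datasheet bf16 peaks: A100 ~312 TFLOPS; MI250 ~362 TFLOPS (full card).
# The token rates below are placeholders, not measured numbers.
print(f"A100-40GB: {mfu(1.3e9, 30_000, 312):.1%}")
print(f"MI250:     {mfu(1.3e9, 24_000, 362):.1%}")
```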

Abhi Venigalla (@abhi_venigalla):

This is all made possible by a software and hardware stack that AMD has been building for years and that is now bearing fruit.

Seeing MI250 work so well today brings hope that the MI300x will too when it arrives!

Abhi Venigalla (@abhi_venigalla):

And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run.

It's Christmas in July!๐ŸŽ„
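[Editor's note: why this works, sketched with plain PyTorch and placeholder paths. Checkpoints are just tensors with no vendor tag, so a state dict saved on an A100 run loads cleanly on an MI250 run, and vice versa. In Composer terms, this is the Trainer's save_folder on one run and load_path on the next.]

```python
# Hedged sketch: stop a run on one vendor's GPUs, resume on the other's.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)
optim = torch.optim.AdamW(model.parameters())

# On cluster A (say, NVIDIA): save model and optimizer state as usual.
torch.save({"model": model.state_dict(), "optim": optim.state_dict()},
           "/tmp/step_1000.pt")  # placeholder path

# On cluster B (say, AMD): map_location lands the tensors on whatever the
# local "cuda" device is; the ROCm build answers to the same name.
state = torch.load("/tmp/step_1000.pt", map_location=device)
model.load_state_dict(state["model"])
optim.load_state_dict(state["optim"])
```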
