Abhi Venigalla

@abhi_venigalla

Researcher @Databricks. Former @MosaicML, @CerebrasSystems. Addicted to all things compute.

Joined 12-10-2018 04:27:26

609 Tweets

2.9K Followers

1.0K Following

Abhi Venigalla (@abhi_venigalla):

Ready for GPU independence weekend?

PyTorch 2.0 and LLM Foundry now work out of the box on **AMD GPUs!** We profiled MPT 1B-13B models on AMD MI250 and saw performance at about 80% of A100-40GB, which could rise to ~94% with better software.

It. Just. Works.

mosaicml.com/blog/amd-mi250
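[Editor's note: a quick aside on what "out of the box" means here. The ROCm build of PyTorch exposes AMD GPUs through the same torch.cuda API, so device-selection code doesn't change. A minimal sketch, not from the thread, assuming a ROCm or CUDA PyTorch build:]

```python
# Minimal sketch: PyTorch's ROCm build answers to the torch.cuda API, so the
# usual device-selection code picks up an AMD MI250 with no edits.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# On a ROCm build, torch.version.hip is set and torch.version.cuda is None.
backend = f"ROCm {torch.version.hip}" if torch.version.hip else f"CUDA {torch.version.cuda}"
print("backend:", backend)
if device.type == "cuda":
    print("device:", torch.cuda.get_device_name(0))  # e.g. an AMD Instinct name on ROCm

# The same ops dispatch to rocBLAS/MIOpen on AMD, cuBLAS/cuDNN on NVIDIA.
model = torch.nn.Linear(4096, 4096).to(device=device, dtype=torch.bfloat16)
x = torch.randn(8, 4096, device=device, dtype=torch.bfloat16)
print(model(x).shape)  # torch.Size([8, 4096])
```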

Abhi Venigalla (@abhi_venigalla):

Here's MPT-1B training for 1B tokens on NVIDIA A100 (green) vs. AMD MI250 (red). Can you spot a difference?
Both runs use the exact same LLM Foundry code: github.com/mosaicml/llm-f…
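[Editor's note: the launch code isn't shown in the thread. Below is a hedged, toy sketch of the Composer Trainer pattern that LLM Foundry builds on, with a stand-in linear model instead of MPT-1B and an arbitrary seed; the point is that nothing in it names a GPU vendor:]

```python
# Toy sketch of a Composer training run (stand-in model, not MPT-1B).
# device="gpu" is vendor-neutral: it resolves to A100 or MI250 alike.
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

model = ComposerClassifier(
    torch.nn.Sequential(torch.nn.Linear(64, 10)), num_classes=10
)
dataset = TensorDataset(torch.randn(256, 64), torch.randint(0, 10, (256,)))

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(dataset, batch_size=32),
    max_duration="100ba",  # 100 batches
    device="gpu",          # same setting on NVIDIA or AMD nodes
    precision="amp_bf16",
    seed=17,
)
trainer.fit()
```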

Abhi Venigalla (@abhi_venigalla):

If we zoom in on the first 100 batches, we get nearly overlapping loss curves. This is crazy given that the runs are on two totally different hardware stacks!

StreamingDataset and Composer do a lot of the heavy lifting for determinism in the dataloader and train loop.
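[Editor's note: neither knob is shown in the thread. A hedged sketch of the two pieces being credited, Composer's seeding utility plus StreamingDataset's seeded shuffle; the remote/local paths and seed values are placeholders:]

```python
# Hedged determinism sketch: seed every RNG once, and let StreamingDataset
# derive its global sample order from a fixed shuffle_seed rather than from
# filesystem or hardware state.
from composer.utils import reproducibility
from streaming import StreamingDataset
from torch.utils.data import DataLoader

reproducibility.seed_all(17)  # seeds Python, NumPy, and torch RNGs

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset-mds",  # placeholder remote store
    local="/tmp/my-dataset-mds",             # placeholder local cache
    shuffle=True,
    shuffle_seed=9176,  # fixes the sample order across clusters
)
loader = DataLoader(dataset, batch_size=8)
```

[With both seeds pinned, runs on different hardware see the same samples in the same order, which is why the loss curves can overlap batch for batch.]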

Abhi Venigalla (@abhi_venigalla):

What about perf? We only had 1 node of 4xMI250, so to compare with our 8xA100 systems we measured per-GPU metrics.

With no code changes, perf on MI250 looks really strong! About 80% of A100-40GB. Better FlashAttention support for AMD may close the gap (we predict ~94% of A100-40GB).
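[Editor's note: for context on how per-GPU comparisons like this are usually computed, model FLOPS utilization (MFU) divides achieved training FLOPS (roughly 6 × parameters × tokens/sec for a decoder-only model) by the chip's datasheet peak. A hedged sketch with placeholder token rates, not the blog's measurements:]

```python
# Hedged MFU sketch. 6 * n_params * tokens/sec approximates training FLOPS
# for a decoder-only LLM, ignoring the attention term.
def mfu(n_params: float, tokens_per_sec_per_gpu: float, peak_tflops: float) -> float:
    achieved_tflops = 6 * n_params * tokens_per_sec_per_gpu / 1e12
    return achieved_tflops / peak_tflops

# Datasheet bf16 peaks: A100 ~312 TFLOPS; MI250 ~362 TFLOPS (full card).
# The token rates below are placeholders, not measured numbers.
print(f"A100-40GB: {mfu(1.3e9, 30_000, 312):.1%}")
print(f"MI250:     {mfu(1.3e9, 24_000, 362):.1%}")
```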

Abhi Venigalla (@abhi_venigalla):

This is all made possible by a software and hardware stack that AMD has been building for years and that is now bearing fruit.

Seeing MI250 work so well today brings hope that the MI300x will too when it arrives!

Abhi Venigalla (@abhi_venigalla):

And yes, you can switch back and forth between NVIDIA and AMD, even within a single training run.

It's Christmas in July!๐ŸŽ„
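[Editor's note: why this works, sketched with plain PyTorch and placeholder paths. Checkpoints are just tensors with no vendor tag, so a state dict saved on an A100 run loads cleanly on an MI250 run, and vice versa. In Composer terms, this is the Trainer's save_folder on one run and load_path on the next.]

```python
# Hedged sketch: stop a run on one vendor's GPUs, resume on the other's.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 16).to(device)
optim = torch.optim.AdamW(model.parameters())

# On cluster A (say, NVIDIA): save model and optimizer state as usual.
torch.save({"model": model.state_dict(), "optim": optim.state_dict()},
           "/tmp/step_1000.pt")  # placeholder path

# On cluster B (say, AMD): map_location lands the tensors on whatever the
# local "cuda" device is; the ROCm build answers to the same name.
state = torch.load("/tmp/step_1000.pt", map_location=device)
model.load_state_dict(state["model"])
optim.load_state_dict(state["optim"])
```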
