Andrej Karpathy

@karpathy

Building @EurekaLabsAI. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥

ID: 33836629

https://karpathy.ai · Joined 21-04-2009 06:49:15

9.9K Tweets

1.2M Followers

972 Following

a year ago

A few new CUDA hacker friends joined the effort and now llm.c is only 2X slower than PyTorch (fp32, forward pass) compared to 4 days ago, when it was at 4.2X slower 📈

The biggest improvements were:
- turn on TF32 (NVIDIA TensorFloat-32) instead of FP32 for matmuls. This is a…
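For readers unfamiliar with TF32, here is a rough sketch of what the format does to matmul inputs. TF32 keeps FP32's 8 exponent bits but only 10 of its 23 mantissa bits; products are still accumulated at higher precision. This is a plain-Python emulation for illustration only, not llm.c code: the helper names `to_tf32` and `dot_tf32` are made up here, the round-half-away-from-zero choice is an assumption, and the accumulator is a Python float standing in for the FP32 accumulator on real hardware.

```python
import math
import struct

def to_tf32(x: float) -> float:
    """Round x to FP32, then keep only the top 10 mantissa bits (TF32-like).

    Hypothetical helper for illustration; the rounding mode (half away from
    zero) is an assumption, and non-finite values pass through unchanged.
    """
    if not math.isfinite(x):
        return x
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, then clear the low 13 mantissa bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def dot_tf32(a, b):
    """Dot product with TF32-rounded inputs and a higher-precision accumulator."""
    acc = 0.0  # Python float (FP64) standing in for an FP32 accumulator
    for x, y in zip(a, b):
        acc += to_tf32(x) * to_tf32(y)
    return acc

if __name__ == "__main__":
    x = 1.0 + 2.0 ** -12          # representable in FP32, not in TF32
    print(x, "->", to_tf32(x))    # the small increment is rounded away
    print(dot_tf32([x, 2.0], [1.0, 3.0]))
```

The precision lost on each input is the price of running matmuls on Tensor Cores. In frameworks this is typically a switch rather than hand-written code, for example `torch.backends.cuda.matmul.allow_tf32 = True` in PyTorch or `cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH)` in cuBLAS; which mechanism llm.c uses is not stated in the post.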

4.4K likes · 111 replies · 358 retweets