Andrej Karpathy (@karpathy)'s Twitter Profile
Andrej Karpathy

@karpathy

🧑‍🍳. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥

ID: 33836629

Link: https://karpathy.ai · Joined: 21-04-2009 06:49:15

8.6K Tweets

978.1K Followers

904 Following

Andrej Karpathy (@karpathy):

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)
github.com/karpathy/llm.c…

On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32,

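An aside on measurement: kernel launches are asynchronous, so per-iteration timings like the ones above are usually taken with CUDA events rather than naive wall-clock calls around the launch. A minimal sketch of that pattern; the dummy kernel is a hypothetical stand-in for one full training step, not llm.c's actual loop:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Trivial stand-in kernel; in llm.c this slot would be the whole
// forward/backward/update for one batch.
__global__ void dummy_step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_step<<<(n + 255) / 256, 256>>>(x, n);  // "one iteration"
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // launches are async; wait for completion

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("iteration time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```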
Andrej Karpathy (@karpathy):

The model card has some more interesting info too:
github.com/meta-llama/lla…

Note that Llama 3 8B is actually somewhere in the territory of Llama 2 70B, depending on where you look. This might seem confusing at first but note that the former was trained for 15T tokens, while the
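One way to make the 8B-vs-70B comparison concrete: total training compute lands in the same ballpark under the standard ~6·N·D FLOPs approximation (N parameters, D training tokens). A back-of-the-envelope sketch; the 15T figure is from the tweet, while the ~2T token count for Llama 2 is the publicly reported number, not something stated here:

```c
#include <stdio.h>

// Rough training-compute estimate via the common ~6*N*D FLOPs rule
// (N = parameter count, D = training tokens).
int main(void) {
    double llama3_8b  = 6.0 * 8e9  * 15e12;  // 15T tokens (per the tweet)
    double llama2_70b = 6.0 * 70e9 * 2e12;   // ~2T tokens (publicly reported)
    printf("Llama 3 8B : %.2e train FLOPs\n", llama3_8b);   // ~7.2e23
    printf("Llama 2 70B: %.2e train FLOPs\n", llama2_70b);  // ~8.4e23
    return 0;
}
```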

Andrej Karpathy (@karpathy):

Congrats to AI at Meta on Llama 3 release!! 🎉
ai.meta.com/blog/meta-llam…
Notes:

- Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @ lmsys.org :))
- 400B is still training, but already encroaching

Andrej Karpathy (@karpathy):

Consider being a labeler for an LLM. The prompt is “give me a random number between 1 and 10”. What SFT & RM labels do you contribute? What does this do to the network when trained on?

In a subtle way, this problem is present in every prompt that does not have a single unique answer.
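A toy illustration of why: suppose the labelers' answers cluster on one "favorite" number. Cross-entropy training then drains probability mass from every other answer. A minimal sketch in plain C, with a made-up 10-way softmax standing in for the next-token distribution (the choice of "7" is hypothetical):

```c
#include <stdio.h>
#include <math.h>

#define K 10  // outcomes 1..10

// Softmax over logits z into probabilities p.
static void softmax(const double* z, double* p) {
    double m = z[0], s = 0.0;
    for (int i = 1; i < K; i++) if (z[i] > m) m = z[i];
    for (int i = 0; i < K; i++) { p[i] = exp(z[i] - m); s += p[i]; }
    for (int i = 0; i < K; i++) p[i] /= s;
}

int main(void) {
    double z[K] = {0};  // uniform start: p(i) = 0.10 for every number
    double p[K];
    int label = 6;      // hypothetical: labelers keep answering "7"
    double lr = 0.5;
    for (int step = 0; step < 200; step++) {
        softmax(z, p);
        // Gradient of cross-entropy wrt logits: p - onehot(label)
        for (int i = 0; i < K; i++)
            z[i] -= lr * (p[i] - (i == label ? 1.0 : 0.0));
    }
    softmax(z, p);
    for (int i = 0; i < K; i++)
        printf("p(%d) = %.3f\n", i + 1, p[i]);
    // nearly all mass ends up on "7": the model stops being "random"
    return 0;
}
```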

Andrej Karpathy (@karpathy):

The history of computing is repeating in an echo, except replace computers that do precise arithmetic on bytes with computers that do statistical arithmetic on tokens.

Andrej Karpathy (@karpathy):

# scheduling workloads to run on humans

Some computational workloads in human organizations are best 'run on a CPU': take one single, highly competent person and assign them a task to complete in a single-threaded fashion, without synchronization. Usually the best fit when

Andrej Karpathy (@karpathy):

🧠: “Let’s buy this (text)book! Nice and now… instead of reading it… let’s buy another one!” 💡

All of the dopamine is generated only at the point of resolving to read something. After that there is no juice left 😅

Andrej Karpathy (@karpathy):

A few new CUDA hacker friends joined the effort and now llm.c is only 2X slower than PyTorch (fp32, forward pass) compared to 4 days ago, when it was at 4.2X slower 📈

The biggest improvements were:
- turn on TF32 (NVIDIA TensorFloat-32) instead of FP32 for matmuls. This is a

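For reference, TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, which lets FP32-in/FP32-out matmuls run on tensor cores. In cuBLAS it is a one-line opt-in on the handle; a minimal sketch (matrices are left zero-initialized, since only the math-mode setting matters here):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int N = 256;  // square matrices for simplicity
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));
    cudaMemset(A, 0, N * N * sizeof(float));
    cudaMemset(B, 0, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to TF32: FP32 inputs/outputs, but the matmul itself
    // runs on tensor cores with a 10-bit mantissa.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, A, N, B, N, &beta, C, N);
    cudaDeviceSynchronize();
    printf("sgemm done (TF32 math mode)\n");

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```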
Andrej Karpathy (@karpathy):

torch.compile is cool but
LLM compile: takes your .py repo as a string and outputs a brand new, custom, from scratch, minimal code repository directly running your network in highly optimized CUDA

Andrej Karpathy (@karpathy):

Okay I did a first quick pass of naive CUDA kernels for the forward pass of GPT-2 and pushed everything to one file in llm.c. Still only ~1000 lines of code:
github.com/karpathy/llm.c…

Current per iteration timings on my Lambda box <3 A100 40GB PCIe, B=4, T=1024:
- llm.c: 111ms
-

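To give a feel for what "naive" means here: one thread per output element, no shared memory, no fusion. A sketch in that spirit, using the tanh-approximate GELU that GPT-2 uses (illustrative, not llm.c's actual kernel):

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

// One thread per element: the simplest possible elementwise kernel.
__global__ void gelu_forward_kernel(float* out, const float* inp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        // 0.7978845608f ~= sqrt(2 / pi), the GPT-2 tanh approximation
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}

int main(void) {
    const int n = 1024;
    float *inp, *out;
    cudaMallocManaged(&inp, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) inp[i] = (i - n / 2) / 128.0f;

    gelu_forward_kernel<<<(n + 255) / 256, 256>>>(out, inp, n);
    cudaDeviceSynchronize();
    printf("gelu(%f) = %f\n", inp[0], out[0]);

    cudaFree(inp); cudaFree(out);
    return 0;
}
```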
Andrej Karpathy (@karpathy):

Btw writing the llm.c training code would imo be a very interesting, impressive, self-contained and very meta challenge for LLM agents. The prompt is:

Take the PyTorch code train_gpt2.py
And write, compile and unit test a single .c file that reproduces the training: train_gpt2.c
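The grading half of that challenge is easy to pin down: dump reference tensors from the PyTorch side and compare the C side against them within a tolerance. A sketch of that check (the names and values are hypothetical, just illustrating the shape of the test):

```c
#include <stdio.h>
#include <math.h>

// Compare a C-side tensor against a reference dumped from PyTorch,
// elementwise, within a tolerance.
int check_tensor(const float* a, const float* ref, int n,
                 float tol, const char* label) {
    int ok = 1;
    float maxdiff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - ref[i]);
        if (d > maxdiff) maxdiff = d;
        if (d > tol) ok = 0;
    }
    printf("%s: %s (max |diff| = %g)\n",
           label, ok ? "OK" : "MISMATCH", maxdiff);
    return ok;
}

int main(void) {
    // Hypothetical values: from train_gpt2.c vs dumped by train_gpt2.py
    float c_logits[4]   = {0.10f, -1.20f, 0.33f, 2.01f};
    float ref_logits[4] = {0.10f, -1.20f, 0.33f, 2.01f};
    return check_tensor(c_logits, ref_logits, 4, 1e-4f, "logits") ? 0 : 1;
}
```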

Andrej Karpathy (@karpathy):

I added a quick crappy tutorial on how PyTorch layers are moved to C, with a few possibly helpful pointers:
github.com/karpathy/llm.c…
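For flavor, the kind of translation such a tutorial covers: PyTorch's nn.LayerNorm forward re-expressed as plain C loops over an (N, C) batch of activations. A sketch in that spirit, not the tutorial's exact code:

```c
#include <stdio.h>
#include <math.h>

// LayerNorm forward: normalize each row of inp over its C channels,
// then scale by weight and shift by bias (eps matches PyTorch's default).
void layernorm_forward(float* out, const float* inp,
                       const float* weight, const float* bias,
                       int N, int C) {
    const float eps = 1e-5f;
    for (int n = 0; n < N; n++) {
        const float* x = inp + n * C;
        // mean and (biased) variance over the channel dimension
        float mean = 0.0f;
        for (int c = 0; c < C; c++) mean += x[c];
        mean /= C;
        float var = 0.0f;
        for (int c = 0; c < C; c++) {
            float d = x[c] - mean;
            var += d * d;
        }
        var /= C;
        float rstd = 1.0f / sqrtf(var + eps);
        // normalize, then scale and shift
        for (int c = 0; c < C; c++)
            out[n * C + c] = (x[c] - mean) * rstd * weight[c] + bias[c];
    }
}

int main(void) {
    float inp[4] = {1, 2, 3, 4}, out[4];
    float w[4] = {1, 1, 1, 1}, b[4] = {0, 0, 0, 0};
    layernorm_forward(out, inp, w, b, 1, 4);
    for (int c = 0; c < 4; c++) printf("%f\n", out[c]);
    return 0;
}
```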
