Andrej Karpathy (@karpathy)'s Twitter Profile
Andrej Karpathy

@karpathy

🧑‍🍳. Previously Director of AI @ Tesla, founding team @ OpenAI, CS231n/PhD @ Stanford. I like to train large deep neural nets 🧠🤖💥

ID: 33836629

Link: https://karpathy.ai · Joined: 21-04-2009 06:49:15

8.6K Tweets

978.1K Followers

904 Following

Andrej Karpathy (@karpathy):

🔥llm.c update: Our single file of 2,000 ~clean lines of C/CUDA code now trains GPT-2 (124M) on GPU at speeds ~matching PyTorch (fp32, no flash attention)
github.com/karpathy/llm.c…

On my A100 I'm seeing 78ms/iter for llm.c and 80ms/iter for PyTorch. Keeping in mind this is fp32,

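An aside on measurement: kernel launches are asynchronous, so per-iteration timings like the ones above are usually taken with CUDA events rather than naive wall-clock calls around the launch. A minimal sketch of that pattern; the dummy kernel is a hypothetical stand-in for one full training step, not llm.c's actual loop:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Trivial stand-in kernel; in llm.c this slot would be the whole
// forward/backward/update for one batch.
__global__ void dummy_step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummy_step<<<(n + 255) / 256, 256>>>(x, n);  // "one iteration"
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // launches are async; wait for completion

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("iteration time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```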
Andrej Karpathy (@karpathy):

The model card has some more interesting info too:
github.com/meta-llama/lla…

Note that Llama 3 8B is actually somewhere in the territory of Llama 2 70B, depending on where you look. This might seem confusing at first but note that the former was trained for 15T tokens, while the
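One way to make the 8B-vs-70B comparison concrete: total training compute lands in the same ballpark under the standard ~6·N·D FLOPs approximation (N parameters, D training tokens). A back-of-the-envelope sketch; the 15T figure is from the tweet, while the ~2T token count for Llama 2 is the publicly reported number, not something stated here:

```c
#include <stdio.h>

// Rough training-compute estimate via the common ~6*N*D FLOPs rule
// (N = parameter count, D = training tokens).
int main(void) {
    double llama3_8b  = 6.0 * 8e9  * 15e12;  // 15T tokens (per the tweet)
    double llama2_70b = 6.0 * 70e9 * 2e12;   // ~2T tokens (publicly reported)
    printf("Llama 3 8B : %.2e train FLOPs\n", llama3_8b);   // ~7.2e23
    printf("Llama 2 70B: %.2e train FLOPs\n", llama2_70b);  // ~8.4e23
    return 0;
}
```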

Andrej Karpathy (@karpathy):

Congrats to AI at Meta on Llama 3 release!! 🎉
ai.meta.com/blog/meta-llam…
Notes:

- Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @ lmsys.org :))
- 400B is still training, but already encroaching

Andrej Karpathy (@karpathy):

Consider being a labeler for an LLM. The prompt is “give me a random number between 1 and 10”. What SFT & RM labels do you contribute? What does this do to the network when trained on?

In a subtle way, this problem is present in every prompt that does not have a single unique answer.
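A toy illustration of why: suppose the labelers' answers cluster on one "favorite" number. Cross-entropy training then drains probability mass from every other answer. A minimal sketch in plain C, with a made-up 10-way softmax standing in for the next-token distribution (the choice of "7" is hypothetical):

```c
#include <stdio.h>
#include <math.h>

#define K 10  // outcomes 1..10

// Softmax over logits z into probabilities p.
static void softmax(const double* z, double* p) {
    double m = z[0], s = 0.0;
    for (int i = 1; i < K; i++) if (z[i] > m) m = z[i];
    for (int i = 0; i < K; i++) { p[i] = exp(z[i] - m); s += p[i]; }
    for (int i = 0; i < K; i++) p[i] /= s;
}

int main(void) {
    double z[K] = {0};  // uniform start: p(i) = 0.10 for every number
    double p[K];
    int label = 6;      // hypothetical: labelers keep answering "7"
    double lr = 0.5;
    for (int step = 0; step < 200; step++) {
        softmax(z, p);
        // Gradient of cross-entropy wrt logits: p - onehot(label)
        for (int i = 0; i < K; i++)
            z[i] -= lr * (p[i] - (i == label ? 1.0 : 0.0));
    }
    softmax(z, p);
    for (int i = 0; i < K; i++)
        printf("p(%d) = %.3f\n", i + 1, p[i]);
    // nearly all mass ends up on "7": the model stops being "random"
    return 0;
}
```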

Andrej Karpathy (@karpathy):

The history of computing is repeating in an echo, except replace computers that do precise arithmetic on bytes with computers that do statistical arithmetic on tokens.

Andrej Karpathy (@karpathy):

# scheduling workloads to run on humans

Some computational workloads in human organizations are best 'run on a CPU': take one single, highly competent person and assign them a task to complete in a single-threaded fashion, without synchronization. Usually the best fit when

Andrej Karpathy (@karpathy):

🧠: “Let’s buy this (text)book! Nice and now… instead of reading it… let’s buy another one!” 💡

All of the dopamine is generated only at the point of resolving to read something. After that there is no juice left 😅

Andrej Karpathy (@karpathy):

A few new CUDA hacker friends joined the effort and now llm.c is only 2X slower than PyTorch (fp32, forward pass) compared to 4 days ago, when it was at 4.2X slower 📈

The biggest improvements were:
- turn on TF32 (NVIDIA TensorFloat-32) instead of FP32 for matmuls. This is a

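For reference, TF32 keeps FP32's 8-bit exponent but truncates the mantissa to 10 bits, which lets FP32-in/FP32-out matmuls run on tensor cores. In cuBLAS it is a one-line opt-in on the handle; a minimal sketch (matrices are left zero-initialized, since only the math-mode setting matters here):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int N = 256;  // square matrices for simplicity
    float *A, *B, *C;
    cudaMalloc(&A, N * N * sizeof(float));
    cudaMalloc(&B, N * N * sizeof(float));
    cudaMalloc(&C, N * N * sizeof(float));
    cudaMemset(A, 0, N * N * sizeof(float));
    cudaMemset(B, 0, N * N * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Opt in to TF32: FP32 inputs/outputs, but the matmul itself
    // runs on tensor cores with a 10-bit mantissa.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, A, N, B, N, &beta, C, N);
    cudaDeviceSynchronize();
    printf("sgemm done (TF32 math mode)\n");

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```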
Andrej Karpathy (@karpathy):

torch.compile is cool but
LLM compile: takes your .py repo as a string and outputs a brand new, custom, from scratch, minimal code repository directly running your network in highly optimized CUDA

Andrej Karpathy (@karpathy):

Okay I did a first quick pass of naive CUDA kernels for the forward pass of GPT-2 and pushed everything to one file in llm.c. Still only ~1000 lines of code:
github.com/karpathy/llm.c…

Current per iteration timings on my Lambda box <3 A100 40GB PCIe, B=4, T=1024:
- llm.c: 111ms
-

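To give a feel for what "naive" means here: one thread per output element, no shared memory, no fusion. A sketch in that spirit, using the tanh-approximate GELU that GPT-2 uses (illustrative, not llm.c's actual kernel):

```cuda
#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

// One thread per element: the simplest possible elementwise kernel.
__global__ void gelu_forward_kernel(float* out, const float* inp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        // 0.7978845608f ~= sqrt(2 / pi), the GPT-2 tanh approximation
        out[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube)));
    }
}

int main(void) {
    const int n = 1024;
    float *inp, *out;
    cudaMallocManaged(&inp, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; i++) inp[i] = (i - n / 2) / 128.0f;

    gelu_forward_kernel<<<(n + 255) / 256, 256>>>(out, inp, n);
    cudaDeviceSynchronize();
    printf("gelu(%f) = %f\n", inp[0], out[0]);

    cudaFree(inp); cudaFree(out);
    return 0;
}
```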
Andrej Karpathy (@karpathy):

Btw writing the llm.c training code would imo be a very interesting, impressive, self-contained and very meta challenge for LLM agents. The prompt is:

Take the PyTorch code train_gpt2.py
And write, compile and unit test a single .c file that reproduces the training: train_gpt2.c
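The grading half of that challenge is easy to pin down: dump reference tensors from the PyTorch side and compare the C side against them within a tolerance. A sketch of that check (the names and values are hypothetical, just illustrating the shape of the test):

```c
#include <stdio.h>
#include <math.h>

// Compare a C-side tensor against a reference dumped from PyTorch,
// elementwise, within a tolerance.
int check_tensor(const float* a, const float* ref, int n,
                 float tol, const char* label) {
    int ok = 1;
    float maxdiff = 0.0f;
    for (int i = 0; i < n; i++) {
        float d = fabsf(a[i] - ref[i]);
        if (d > maxdiff) maxdiff = d;
        if (d > tol) ok = 0;
    }
    printf("%s: %s (max |diff| = %g)\n",
           label, ok ? "OK" : "MISMATCH", maxdiff);
    return ok;
}

int main(void) {
    // Hypothetical values: from train_gpt2.c vs dumped by train_gpt2.py
    float c_logits[4]   = {0.10f, -1.20f, 0.33f, 2.01f};
    float ref_logits[4] = {0.10f, -1.20f, 0.33f, 2.01f};
    return check_tensor(c_logits, ref_logits, 4, 1e-4f, "logits") ? 0 : 1;
}
```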

Andrej Karpathy (@karpathy):

I added a quick crappy tutorial on how PyTorch layers are moved to C, with a few possibly helpful pointers:
github.com/karpathy/llm.c…
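For flavor, the kind of translation such a tutorial covers: PyTorch's nn.LayerNorm forward re-expressed as plain C loops over an (N, C) batch of activations. A sketch in that spirit, not the tutorial's exact code:

```c
#include <stdio.h>
#include <math.h>

// LayerNorm forward: normalize each row of inp over its C channels,
// then scale by weight and shift by bias (eps matches PyTorch's default).
void layernorm_forward(float* out, const float* inp,
                       const float* weight, const float* bias,
                       int N, int C) {
    const float eps = 1e-5f;
    for (int n = 0; n < N; n++) {
        const float* x = inp + n * C;
        // mean and (biased) variance over the channel dimension
        float mean = 0.0f;
        for (int c = 0; c < C; c++) mean += x[c];
        mean /= C;
        float var = 0.0f;
        for (int c = 0; c < C; c++) {
            float d = x[c] - mean;
            var += d * d;
        }
        var /= C;
        float rstd = 1.0f / sqrtf(var + eps);
        // normalize, then scale and shift
        for (int c = 0; c < C; c++)
            out[n * C + c] = (x[c] - mean) * rstd * weight[c] + bias[c];
    }
}

int main(void) {
    float inp[4] = {1, 2, 3, 4}, out[4];
    float w[4] = {1, 1, 1, 1}, b[4] = {0, 0, 0, 0};
    layernorm_forward(out, inp, w, b, 1, 4);
    for (int c = 0; c < 4; c++) printf("%f\n", out[c]);
    return 0;
}
```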
