Stuart Sul (@stuart_sul)'s Twitter Profile
Stuart Sul

@stuart_sul

cs @ stanford

ID: 1811402960288751616

Joined: 11-07-2024 14:12:07

0 Tweets

8 Followers

52 Following

Stuart Sul (@stuart_sul):

GPU kernel launches are expensive, so we fused the entire Llama-1B into a single kernel. Very excited to kick off our megakernel framework series with ThunderKittens (Hazy Research). More coming soon!

Andrej Karpathy (@karpathy):

So so so cool. Llama 1B batch one inference in one single CUDA kernel, deleting synchronization boundaries imposed by breaking the computation into a series of kernels called in sequence. The *optimal* orchestration of compute and memory is only achievable in this way.
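
The point both tweets make is that per-kernel launch and synchronization overhead adds up when a model runs as a long sequence of small kernels. The sketch below is only a minimal PyTorch timing illustration of that overhead, not the ThunderKittens megakernel itself; the tensor size, iteration counts, and timing setup are arbitrary assumptions, and it assumes a CUDA-capable GPU.

```python
# Minimal sketch: time 1000 tiny back-to-back kernel launches vs. the same
# arithmetic folded into one launch. Sizes and counts are illustrative only.
import torch

assert torch.cuda.is_available(), "this sketch assumes a CUDA GPU"
x = torch.zeros(1 << 20, device="cuda")

def time_ms(fn, iters=10):
    fn()  # warm-up so lazy initialization is not measured
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def many_launches():
    # Each "+ 1.0" is a separate kernel the CPU must dispatch and the GPU must sync.
    y = x
    for _ in range(1000):
        y = y + 1.0
    return y

def one_launch():
    # Same arithmetic result, a single kernel launch.
    return x + 1000.0

print(f"1000 small launches: {time_ms(many_launches):8.3f} ms")
print(f"   one fused launch: {time_ms(one_launch):8.3f} ms")
```

A megakernel takes this to the extreme: the whole forward pass lives inside one kernel, so there are no launch gaps or inter-kernel synchronization boundaries left to pay for.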

Jordan Juravsky (@jordanjuravsky):

Happy Throughput Thursday! We’re excited to release Tokasaurus: an LLM inference engine designed from the ground up for high-throughput workloads with large and small models.

(Joint work with Ayush Chakravarthy, Ryan Ehrlich, Sabri Eyuboglu, Bradley Brown, Joseph Shetaye,

Sabri Eyuboglu (@eyuboglusabri):

When we put lots of text (e.g. a code repo) into LLM context, cost soars because of the KV cache’s size.

What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe we call self-study, we find that this can reduce cache memory by 39x on average.
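
As a rough illustration of the idea (not the authors' self-study recipe, which trains against a real LLM on data about the corpus), the toy sketch below learns a small set of key/value slots whose attention output approximates that of a much larger cache. All dimensions, the single-head setup, and the training loop are assumptions made for illustration.

```python
# Toy sketch: distill a large KV cache (n entries) into a small learnable one
# (m entries) by matching attention outputs on random queries. All shapes,
# sizes, and the single-head setup are illustrative assumptions.
import torch

torch.manual_seed(0)
d, n, m = 64, 2048, 64                     # head dim, full cache size, small cache size

# Stand-in for the KV cache a long document would produce in one attention head.
K_full, V_full = torch.randn(n, d), torch.randn(n, d)

# Learnable compressed cache: 32x fewer entries than the full cache.
K_small = torch.nn.Parameter(0.02 * torch.randn(m, d))
V_small = torch.nn.Parameter(0.02 * torch.randn(m, d))

def attend(q, K, V):
    # Standard scaled dot-product attention for a batch of queries q: (b, d).
    w = torch.softmax(q @ K.t() / d ** 0.5, dim=-1)
    return w @ V

opt = torch.optim.Adam([K_small, V_small], lr=1e-2)
for step in range(2001):
    q = torch.randn(256, d)                # synthetic "study" queries
    loss = torch.nn.functional.mse_loss(
        attend(q, K_small, V_small), attend(q, K_full, V_full)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.5f}")
```
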
Stuart Sul (@stuart_sul):

We worked closely with the OpenAI team to make sure GPT-5 is the best coding agent ever on Cursor. For me, it’s the first AI model that actually provides meaningful help with GPU kernels (especially at finding race conditions). Everyone should give it a try.