Nick Comly US (@ncomly_nvidia)'s Twitter Profile

Nick Comly US

@ncomly_nvidia

ID: 1537914501818814464

Joined: 17-06-2022 21:46:17

3 Tweets

6 Followers

44 Following

NVIDIA AI Developer (@nvidiaaidev)

👀 Accelerate performance of AI at Meta Llama 4 Maverick and Llama 4 Scout using our optimizations in #opensource TensorRT-LLM.⚡

✅ NVIDIA Blackwell B200 delivers over 42,000 tokens per second on Llama 4 Scout, over 32,000 tokens per second on Llama 4 Maverick.

✅ 3.4X more
NVIDIA AI Developer (@nvidiaaidev)

🎉 A new generation of the AI at Meta Llama models is here with Llama 4 Scout and Llama 4 Maverick.🦙

⚡ Accelerated for TensorRT-LLM, you can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs.

Tech blog to learn more ➡️ developer.nvidia.com/blog/nvidia-ac…
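
The linked blog has the full recipe; purely as illustration, here is a minimal sketch of offline inference with TensorRT-LLM's Python LLM API. The Llama 4 Scout checkpoint name is an assumption; substitute any supported model ID.

```python
# Minimal sketch: offline inference with TensorRT-LLM's Python LLM API.
# The model ID below is assumed for illustration; any supported
# Hugging Face checkpoint can be substituted.
from tensorrt_llm import LLM, SamplingParams

def main():
    # Builds (or loads a cached) TensorRT engine for the model on first use.
    llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")

    prompts = ["The main advantage of a mixture-of-experts model is"]
    params = SamplingParams(temperature=0.8, max_tokens=64)

    for output in llm.generate(prompts, params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```
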
NVIDIA AI Developer (@nvidiaaidev)

🔢 ✨ Bring your data and try out the new Llama 4 Maverick and Scout multimodal, multilingual MoE models from AI at Meta.

🎉 Available now on the free multimodal playground for Llama 4 using our NVIDIA NIM demo environment on the API catalog. ➡️ build.nvidia.com/meta
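
For programmatic access, API catalog endpoints are OpenAI-compatible. A minimal sketch follows; the Maverick model ID is an assumption (check build.nvidia.com/meta for the exact name), and an NVIDIA API key is expected in the environment.

```python
# Hedged sketch: call a Llama 4 NIM endpoint from the API catalog through
# its OpenAI-compatible interface. The model ID is assumed, not confirmed.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)

response = client.chat.completions.create(
    model="meta/llama-4-maverick-17b-128e-instruct",  # assumed catalog ID
    messages=[{"role": "user",
               "content": "Summarize mixture-of-experts in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```
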

Modal (@modal_labs)

When you're serving tokens in a chatbot, low latency is everything. We just optimized our TensorRT-LLM example to achieve a 4x speedup, and wrote up the key steps we took. Read on for the tl;dr. We promise it'll be fast 😉
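
The write-up has Modal's actual steps; purely as a hedged illustration of how such latency wins are typically measured, here is a tiny time-to-first-token probe against any OpenAI-compatible endpoint. The base URL and model name are placeholders, not Modal's.

```python
# Hypothetical harness: measure time-to-first-token (TTFT) on a streaming
# chat completion. Base URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="my-model",  # placeholder
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - start:.3f}s")
        break
```
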

Baseten (@basetenco)

🚀 You can now use NVIDIA B200s on Baseten and get higher throughput, lower latency, and better cost per token! 🚀

From benchmarks on models like DeepSeek R1, Llama 4, and Qwen, we’re already seeing:

• 5x higher throughput
• Over 2x better cost per token
• 38% lower latency
vLLM (@vllm_project)

vLLM🤝🤗! You can now deploy any Hugging Face language model with vLLM's speed. This integration makes it possible to maintain one consistent implementation of the model in HF for both training and inference. 🧵 blog.vllm.ai/2025/04/11/tra…
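
Per the linked post, the bridge is vLLM's Transformers backend. A minimal sketch, assuming a model whose Hugging Face implementation the backend supports (the model ID is illustrative):

```python
# Minimal sketch of vLLM's Transformers backend: model_impl="transformers"
# tells vLLM to run the Hugging Face implementation of the model, so the
# same model code serves training (in HF) and inference (in vLLM).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # illustrative model ID
    model_impl="transformers",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
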

dstack (@dstackai)

TensorRT-LLM delivers fast, flexible LLM inference, but the full pipeline can be complex. This dstack example simplifies it end-to-end: build the container, convert the model, and deploy on any cloud or on-prem. It showcases both DeepSeek R1 and its distilled Llama variant.

Baseten (@basetenco)

We’ve seen a lot of interest in B200s after our launch.

Our lead DevRel, Philip Kiely, wrote a blog explaining some of their performance benefits and the components needed to build an inference platform on top of B200 GPUs.

More details in 🧵
LMSYS Org (@lmsysorg)

Thank you NVIDIA AI Developer, Nebius, and DataCrunch_io for providing the H100 and H200 development machines. Your support greatly contributed to the speed of SGLang's optimization work!

NVIDIA AI Developer (@nvidiaaidev)

🎉 Huge congrats to LMSYS Org on 5x faster DeepSeek R1 performance on NVIDIA Hopper with disaggregated serving, large-scale expert parallelism, and more. Great to see collaboration across the industry to redefine what's possible on NVIDIA.
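
As a hedged sketch of the parallelism knobs mentioned here (flag names follow recent SGLang releases but vary by version; disaggregated prefill/decode needs a multi-node setup and is omitted):

```python
# Hedged sketch: SGLang's offline engine with tensor and expert parallelism.
# Kwargs mirror SGLang server args and may differ across versions.
import sglang as sgl

llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-R1",  # full model needs many GPUs
    tp_size=8,           # tensor parallelism across 8 GPUs
    enable_ep_moe=True,  # expert parallelism for the MoE layers
)
out = llm.generate(
    "Explain disaggregated serving in one sentence.",
    {"temperature": 0.0, "max_new_tokens": 64},
)
print(out["text"])
```
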

LMSYS Org (@lmsysorg)

The SGLang team is honored to receive recognition from the NVIDIA team for optimizing the performance of DeepSeek R1! 🤗 26x speedup 🚀 SGLang rocks 🚀