Tyler Michael Smith (@tms_jr)'s Twitter Profile
Tyler Michael Smith

@tms_jr

High Performance Computing @neuralmagic | Committer @vllm_project | PhD @UTAustin | Music Enjoyer

ID: 378399509

Joined: 23-09-2011 04:11:08

1.1K Tweets

131 Followers

293 Following

oldfriend99 (@oldfriend99)'s Twitter Profile Photo

The surface of Mars is covered with vast areas. Some of the areas that have been found on Mars span 600,000 square miles — that's over twice the size of Texas

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

Neural Magic is expanding to GPUs! Complementing our existing efforts with CPUs and model compression, we just launched nm-vllm, our initial community release to support GPU inference serving for LLMs. github.com/neuralmagic/nm… Details 👇

Delip Rao e/σ (@deliprao)'s Twitter Profile Photo

The ML gods will punish you for hubris if you fail to test every small change incrementally, regardless of your experience with what you are doing.

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

EXCITING NEWS: Neural Magic and Anyscale contributed FP8 quantization support to vLLM, making LLM inference more efficient. FP8 reduces latency on NVIDIA GPUs by 2x with >99% accuracy preservation. Cheers to NVIDIA AI Developer for validating our results. 1/6

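FP8 inference like this hinges on picking a per-tensor scale so values fit the narrow fp8 range. A minimal sketch of that scale selection, assuming the E4M3 format (max finite value 448); plain floats stand in for a real fp8 cast, and this is illustrative only, not the vLLM implementation:

```python
# Hypothetical sketch of per-tensor scaling for FP8 (E4M3-style) quantization.
# Real kernels would cast the scaled values to an fp8 dtype; here we only
# compute the scale and the scaled values.

E4M3_MAX = 448.0  # largest finite value representable in E4M3

def fp8_scale(xs):
    """Choose a per-tensor scale so the largest |x| maps to E4M3_MAX."""
    amax = max(abs(x) for x in xs)
    return amax / E4M3_MAX

xs = [-3.0, 0.25, 1.5, 8.96]
s = fp8_scale(xs)             # 8.96 / 448 = 0.02
scaled = [x / s for x in xs]  # all values now lie within [-448, 448]
```

At inference time the kernel multiplies back by the scale after the low-precision matmul, which is where the latency savings come from.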
Red Hat AI (@redhat_ai)'s Twitter Profile Photo

🎉 Exciting news! Tyler Smith, one of our many talented engineers, is now Neural Magic's 3rd vLLM project committer! Check out Tyler's contributions: github.com/tlrmchlsmth. We’re proud to be a leading contributor to vLLM. 🚀 Cheers to Tyler and the team!

Tyler Michael Smith (@tms_jr)'s Twitter Profile Photo

Join if you want to find out how we're using CUTLASS to support quantization in vLLM -- specifically w8a8 for compute speedups, a deep dive into how we handle zero points for int8 asymmetric quantization, and how we put it all together to support FP8 Llama 3.1 405B.
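The zero-point handling mentioned above can be illustrated with a toy per-tensor version. This is a hedged sketch of asymmetric int8 quantization in plain Python, not the CUTLASS kernel logic from the talk:

```python
# Toy per-tensor asymmetric int8 quantization: q = round(x / scale) + zero_point.
# The zero point shifts the int8 range so an asymmetric float range
# [x_min, x_max] uses all 256 levels.

def quantize_asym_int8(xs):
    """Return (quantized ints, scale, zero_point) for a list of floats."""
    x_min, x_max = min(xs), max(xs)
    scale = (x_max - x_min) / 255.0           # int8 has 256 representable levels
    zero_point = round(-x_min / scale) - 128  # integer shift so x_min maps near -128
    qs = [max(-128, min(127, round(x / scale) + zero_point)) for x in xs]
    return qs, scale, zero_point

def dequantize(qs, scale, zero_point):
    return [(q - zero_point) * scale for q in qs]

xs = [-1.0, 0.0, 0.5, 2.0]
qs, scale, zp = quantize_asym_int8(xs)
xs_hat = dequantize(qs, scale, zp)  # recovers xs to within one quantization step
```

In a real w8a8 GEMM the zero-point correction is folded into the integer matmul's epilogue rather than applied element-wise like this.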

Tyler Michael Smith (@tms_jr)'s Twitter Profile Photo

me: "looks like i need to calculate the variance of this distributed tensor -- what's that called again? oh! Welford's online algorithm"

my brain for the next 3 days: "Wilford Brimley's online algorithm"
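For the record, Welford's algorithm computes mean and variance in one numerically stable pass, and partial states can be merged (Chan et al.) the way one might across shards of a distributed tensor. A sketch, not any particular production implementation:

```python
# Welford's online algorithm: one-pass, numerically stable mean/variance.

def welford(stream):
    """Return (count, mean, M2) where M2 is the sum of squared deviations."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    return count, mean, m2

def merge(a, b):
    """Combine two partial (count, mean, M2) states (Chan et al.),
    e.g. across shards of a distributed tensor."""
    (na, ma, m2a), (nb, mb, m2b) = a, b
    n = na + nb
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

n, mean, m2 = merge(welford([2.0, 4.0, 4.0, 4.0]), welford([5.0, 5.0, 7.0, 9.0]))
variance = m2 / (n - 1)  # sample variance of the full stream
```

The merge step is what makes it practical in the distributed setting: each rank runs the single-pass update locally, then the small (count, mean, M2) triples are reduced.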
vLLM (@vllm_project)'s Twitter Profile Photo

A month ago, we announced our performance roadmap. Today, we are happy to share that the latest release achieves 🚀2.7x higher throughput and is 5x faster for output latency on Llama 8B, and 1.8x higher throughput and 2x faster on Llama 70B for H100s. blog.vllm.ai/2024/09/05/per…

Red Hat AI (@redhat_ai)'s Twitter Profile Photo

Last week's vLLM office hours recording is ready! 🎥 Tyler Michael Smith showed how to use NVIDIA CUTLASS for high-performance inference in vLLM. We also explored the exciting vLLM v0.6.0 updates that led to a 2.7x throughput boost and 5x latency improvement. Recording & slides 👇

Marc Sun (@_marcsun)'s Twitter Profile Photo

Quantization update! Transformers is now compatible with models quantized with the llm-compressor library from vLLM, or models in the compressed-tensors format. This means you can also enjoy high-quality quantized models from the Red Hat AI (formerly Neural Magic) team!

Tyler Michael Smith (@tms_jr)'s Twitter Profile Photo

Read to learn about Machete, which will serve as a foundation for mixed-input quantized GEMMs on NVIDIA GPUs (Hopper and later!) inside of vLLM. Excellent work and stellar animations by Lucas Wilkinson (github.com/LucasWilkinson)

roon (@tszzl)'s Twitter Profile Photo

a fact of the world that we have to live with: models when “jailbroken” seem to have a distinct personality and artistic capability well beyond anything they produce in their default mood. this might be the most important alignment work in the world and is mostly done on discord

brian stevens (@addvin)'s Twitter Profile Photo

I’m thrilled to announce that Neural Magic has signed a definitive agreement to join forces with Red Hat, Inc.

At Neural Magic our vision is that the future of AI is open, and we have been on a mission to enable enterprises to capture the powerful innovation from AI, while at
NYC Sanitation (@nycsanitation)'s Twitter Profile Photo

In 1991, David Lynch showed the world the alienation and innate horror of a dirty street, directing this unforgettable anti-littering ad for the City of New York. RIP to a visionary filmmaker and a pioneer of the Trash Revolution.