Roberto López Castro (@robertol_castro) 's Twitter Profile
Roberto López Castro

@robertol_castro

Costa da Morte - A Coruña - Wien | Postdoc Researcher @ ISTA

ID: 1225736906

Joined: 27-02-2013 18:12:39

49 Tweets

98 Followers

216 Following

John Carmack (@id_aa_carmack) 's Twitter Profile Photo

I read someone disdainfully talking about how optimizing CUDA kernels was a miserable job. Perhaps for many, but some people take joy in the clearly defined world of make-go-faster. Leaders don’t need the actual skills, but they should have a sense of what is possible with them.

AI Breakfast (@aibreakfast) 's Twitter Profile Photo

🤯 Full body tracking now possible using only WiFi signals

A deep neural network maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions

The model can estimate the dense pose of multiple subjects by utilizing WiFi signals as the only input

🧵
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Join Roberto at #SC23 on Nov 16 at 11:30AM as he unveils VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. Plus, explore Spatha 🗡️, our sparse library for VENOM achieving up to 37x over SOTA dense methods.

Roberto López Castro (@robertol_castro) 's Twitter Profile Photo

Join us at #SC23 on Nov 16 at 11:30AM as we unveil VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. Plus, explore Spatha 🗡️, our sparse library for VENOM achieving up to 37x over SOTA dense methods.

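For context on the 2:4 restriction: an N:M pattern keeps at most N nonzero weights in every group of M consecutive weights, and current Sparse Tensor Cores natively accelerate only N=2, M=4. Below is a minimal NumPy sketch of magnitude pruning to an N:M pattern; it is purely illustrative and is not the VENOM format or the Spatha library.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Magnitude-prune to an N:M pattern: within every group of M consecutive
    weights (in a flattened view), keep the N largest-magnitude entries and
    zero the rest."""
    w = weights.reshape(-1, m).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]   # smallest-magnitude slots
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(4, 8)
w_2to4 = prune_n_m(w, n=2, m=4)   # the hardware-native 2:4 case
w_2to8 = prune_n_m(w, n=2, m=8)   # a higher-sparsity pattern of the kind VENOM targets
assert (np.count_nonzero(w_2to4.reshape(-1, 4), axis=1) <= 2).all()
```

Higher ratios such as 2:8 or 2:16 give the larger sparsity levels that VENOM maps back onto the native 2:4 hardware path.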
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Don't miss Roberto's talk at #SC23 tomorrow at 11:30AM as he unveils VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. 👉 sc23.supercomputing.org/attend/digital…

Torsten Hoefler 🇨🇭 (@thoefler) 's Twitter Profile Photo

Is NVIDIA's 2:4 sparsity not enough for your ambitions with #AI models? Then check out Roberto's structured sparse extension format to go to nearly arbitrary sparsities.

Getting >10x inference speedups using sparse tensor cores!

youtube.com/watch?v=hif1Eq…

arxiv.org/abs/2310.02065
BiblioInformática UDC (@biblioinf_udc) 's Twitter Profile Photo

📰Now available in #RUC and #Zenodo: the paper by the #GAC group of the Facultade de Informática da UDC, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning" (doi.org/10.1109/ACCESS…). hdl.handle.net/2183/36810 & zenodo.org/uploads/114893…. #cienciaaberta

Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Happy to release the write-up on the MARLIN kernel for fast LLM inference, now supporting 2:4 sparsity! Led by Elias Frantar & Roberto López Castro.

Paper: arxiv.org/abs/2408.11743
Code: github.com/IST-DASLab/Spa…

MARLIN is integrated with vLLM thanks to @neuralmagic!

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Sparse-Marlin is here and integrated into vLLM! This GPU-optimized kernel accelerates matrix multiplication with 4-bit quantized weights and 2:4 sparsity, achieving 5.3x speedups on NVIDIA GPUs (Ampere/Ada). Maintains efficiency with batch sizes up to 32. Links below.

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Code: github.com/IST-DASLab/Spa…
Paper: arxiv.org/abs/2408.11743

Made possible by: Dan Alistarh, Elias Frantar, Roberto López Castro. Shout out to the Neural Magic engineers who swiftly integrated Sparse-Marlin into #vLLM for immediate use.
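
As a concrete reference for what the fused kernel computes, here is a hedged NumPy sketch: weights pruned to 2:4 along the reduction dimension, surviving values quantized to signed 4-bit integers with a simple per-column scale, then dequantized and multiplied. All shapes, the scaling scheme, and the variable names are illustrative assumptions; the actual kernel operates on packed storage with sparse tensor-core instructions.

```python
import numpy as np
rng = np.random.default_rng(0)

K, N, M = 16, 8, 5                                   # weight is K x N, input is M x K
W = rng.standard_normal((K, N))

# Offline: prune to 2:4 along K and quantize the survivors to 4-bit integers.
groups = W.reshape(K // 4, 4, N)
keep = np.argsort(-np.abs(groups), axis=1)[:, :2, :]          # 2 kept slots per group of 4
vals = np.take_along_axis(groups, keep, axis=1)               # surviving values
scales = np.abs(vals).max(axis=(0, 1)) / 7.0                  # per-column symmetric scale
q = np.clip(np.round(vals / scales), -8, 7).astype(np.int8)   # signed 4-bit values

# Runtime semantics the kernel fuses: dequantize, scatter into the 2:4 layout, multiply.
W_hat = np.zeros_like(W).reshape(K // 4, 4, N)
np.put_along_axis(W_hat, keep, q.astype(np.float64) * scales, axis=1)
W_hat = W_hat.reshape(K, N)

x = rng.standard_normal((M, K))
y = x @ W_hat                                        # (M, N) output of the sparse-quantized matmul
```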

Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

2:4 Sparsity + AI at Meta Llama-3.1: At @neuralmagic, we've developed a recipe to produce very competitive sparse LLMs, and we are starting by open-sourcing the first one: Sparse-Llama-3.1-8B-2of4. We also show how to leverage it for blazingly fast inference in vLLM.

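A hedged sketch of running such a checkpoint with vLLM's offline API; the Hugging Face model id below is inferred from the tweet's naming and should be checked against the actual model card, as should any extra engine arguments it recommends.

```python
from vllm import LLM, SamplingParams

# Assumed repository id, based on the name "Sparse-Llama-3.1-8B-2of4" in the announcement.
llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Explain 2:4 structured sparsity in one sentence."], params)
print(outputs[0].outputs[0].text)
```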
Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

4) The 2:4 models are compatible with quantization. We apply GPTQ to quantize weights to 4-bit integers, but modify it such that it preserves the 2:4 sparsity pattern. The resulting model has a minimal drop in accuracy, but runs super fast with the Sparse-Marlin kernel in vLLM.

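To make "preserves the 2:4 sparsity pattern" concrete, here is a simplified sketch using mask-aware round-to-nearest; GPTQ itself additionally uses second-order information to compensate rounding error, and all names and shapes below are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

W = rng.standard_normal((128, 128))

# Fixed 2:4 mask (as produced by the earlier sparsification step).
g = W.reshape(-1, 4)
mask = np.zeros_like(g, dtype=bool)
np.put_along_axis(mask, np.argsort(-np.abs(g), axis=1)[:, :2], True, axis=1)
W_24 = np.where(mask, g, 0.0).reshape(W.shape)

# Mask-aware 4-bit round-to-nearest: only surviving weights are quantized,
# so pruned positions stay exactly zero and the structure remains valid 2:4.
scale = np.abs(W_24).max(axis=0, keepdims=True) / 7.0
W_q = np.where(mask.reshape(W.shape), np.clip(np.round(W_24 / scale), -8, 7) * scale, 0.0)

assert (np.count_nonzero(W_q.reshape(-1, 4), axis=1) <= 2).all()   # still a valid 2:4 pattern
print("relative quantization error:", np.linalg.norm(W_q - W_24) / np.linalg.norm(W_24))
```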
Red Hat (@redhat) 's Twitter Profile Photo

Today, Red Hat completed the acquisition of Neural Magic (now Red Hat AI), a pioneer in software and algorithms that accelerate #GenAI inference workloads. Read how we are accelerating our vision for #AI’s future: red.ht/408kJ8K.

Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

Happy to release #HALO, a Hadamard-Assisted Lower-Precision scheme that enables INT8/FP6 full #finetuning (FFT) of LLMs.

with @mmnnn76, Rush Tabesh, Roberto López Castro, Torsten Hoefler 🇨🇭, and Dan Alistarh

Paper: arxiv.org/pdf/2501.02625
Code: github.com/IST-DASLab/HALO

1/4
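
The Hadamard part can be illustrated in isolation: an orthogonal Hadamard rotation spreads activation outliers across channels, which lowers the error of low-precision quantization. The snippet below is only a toy illustration of that effect, not HALO's actual fine-tuning scheme; sizes and names are assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
d = 256
H = hadamard(d) / np.sqrt(d)                 # orthogonal: H @ H.T == I

# Activations with a few large outlier channels, the usual obstacle to INT8.
x = rng.standard_normal((32, d))
x[:, :4] *= 50.0

def int8_rtn(a):
    """Per-tensor symmetric INT8 round-to-nearest."""
    s = np.abs(a).max() / 127.0
    return np.clip(np.round(a / s), -128, 127) * s

err_plain = np.linalg.norm(int8_rtn(x) - x) / np.linalg.norm(x)
err_rotated = np.linalg.norm(int8_rtn(x @ H) @ H.T - x) / np.linalg.norm(x)
print(err_plain, err_rotated)                # the rotated version has much lower error
```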
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Yesterday, Jiale and Roberto presented the paper MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models at PPoPP 2025 in Las Vegas! 

Want to know more? Check out the paper here👇dl.acm.org/doi/10.1145/37…

#HPC Torsten Hoefler 🇨🇭 ETH CS Department
Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Our QuEST paper was selected for an oral presentation at the Sparsity in LLMs Workshop at ICLR 2025! QuEST is the first algorithm with Pareto-optimal LLM training for 4-bit weights/activations, and can even train accurate 1-bit LLMs.

Paper: arxiv.org/abs/2502.05003
Code: github.com/IST-DASLab/QuE…

Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Introducing MoE-Quant, a fast version of GPTQ for MoEs, with:
* Optimized Triton kernels and expert & data parallelism
* Quantizes the 671B DeepSeekV3/R1 models in 2 hours on 8xH100
* ~99% accuracy recovery for 4bit R1 on *reasoning* tasks, and 100% recovery on leaderboards  
[1/3]
Dan Alistarh (@dalistarh) 's Twitter Profile Photo

We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
