Roberto López Castro (@robertol_castro) 's Twitter Profile
Roberto López Castro

@robertol_castro

Costa da Morte - A Coruña - Wien | Postdoc Researcher @ ISTA

ID: 1225736906

Joined: 27-02-2013 18:12:39

49 Tweets

98 Followers

216 Following

John Carmack (@id_aa_carmack) 's Twitter Profile Photo

I read someone disdainfully talking about how optimizing CUDA kernels was a miserable job. Perhaps for many, but some people take joy in the clearly defined world of make-go-faster. Leaders don’t need the actual skills, but they should have a sense of what is possible with them.

AI Breakfast (@aibreakfast) 's Twitter Profile Photo

🤯 Full body tracking now possible using only WiFi signals

A deep neural network maps the phase and amplitude of WiFi signals to UV coordinates within 24 human regions

The model can estimate the dense pose of multiple subjects by utilizing WiFi signals as the only input

🧵
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Join Roberto at #SC23 on Nov 16 at 11:30AM as he unveils VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. Plus, explore Spatha 🗡️, our sparse library for VENOM achieving up to 37x over SOTA dense methods.

Roberto López Castro (@robertol_castro) 's Twitter Profile Photo

Join us at #SC23 on Nov 16 at 11:30AM as we unveil VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. Plus, explore Spatha 🗡️, our sparse library for VENOM achieving up to 37x over SOTA dense methods.

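For context on the 2:4 restriction: an N:M pattern keeps at most N nonzero weights in every group of M consecutive weights, and current Sparse Tensor Cores natively accelerate only N=2, M=4. Below is a minimal NumPy sketch of magnitude pruning to an N:M pattern; it is purely illustrative and is not the VENOM format or the Spatha library.

```python
import numpy as np

def prune_n_m(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Magnitude-prune to an N:M pattern: within every group of M consecutive
    weights (in a flattened view), keep the N largest-magnitude entries and
    zero the rest."""
    w = weights.reshape(-1, m).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, : m - n]   # smallest-magnitude slots
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

w = np.random.randn(4, 8)
w_2to4 = prune_n_m(w, n=2, m=4)   # the hardware-native 2:4 case
w_2to8 = prune_n_m(w, n=2, m=8)   # a higher-sparsity pattern of the kind VENOM targets
assert (np.count_nonzero(w_2to4.reshape(-1, 4), axis=1) <= 2).all()
```

Higher ratios such as 2:8 or 2:16 give the larger sparsity levels that VENOM maps back onto the native 2:4 hardware path.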
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Don't miss Roberto's talk at #SC23 tomorrow at 11:30AM as he unveils VENOM - a new sparse matrix format enabling arbitrary N:M patterns on Sparse Tensor Cores, natively restricted to 2:4. 👉 sc23.supercomputing.org/attend/digital…

Torsten Hoefler 🇨🇭 (@thoefler) 's Twitter Profile Photo

Is NVIDIA's 2:4 sparsity not enough for your ambitions with #AI models? Then check out Roberto's structured sparse extension format to go to nearly arbitrary sparsities.

Getting >10x inference speedups using sparse tensor cores!

youtube.com/watch?v=hif1Eq…

arxiv.org/abs/2310.02065
BiblioInformática UDC (@biblioinf_udc) 's Twitter Profile Photo

📰Now available in #RUC and #Zenodo: the paper by the #GAC group of the Facultade de Informática da UDC, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning" (doi.org/10.1109/ACCESS…). hdl.handle.net/2183/36810 & zenodo.org/uploads/114893…. #cienciaaberta

Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Happy to release the write-up on the MARLIN kernel for fast LLM inference, now supporting 2:4 sparsity! Led by Elias Frantar & Roberto López Castro.

Paper: arxiv.org/abs/2408.11743
Code: github.com/IST-DASLab/Spa…

MARLIN is integrated with vLLM thanks to @neuralmagic!

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Sparse-Marlin is here and integrated into vLLM! This GPU-optimized kernel accelerates matrix multiplication with 4-bit quantized weights and 2:4 sparsity, achieving 5.3x speedups on NVIDIA GPUs (Ampere/Ada). Maintains efficiency with batch sizes up to 32. Links below.

Red Hat AI (@redhat_ai) 's Twitter Profile Photo

Code: github.com/IST-DASLab/Spa…
Paper: arxiv.org/abs/2408.11743

Made possible by: Dan Alistarh, Elias Frantar, Roberto López Castro. Shout out to the Neural Magic engineers who swiftly integrated Sparse-Marlin into #vLLM for immediate use.
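
As a concrete reference for what the fused kernel computes, here is a hedged NumPy sketch: weights pruned to 2:4 along the reduction dimension, surviving values quantized to signed 4-bit integers with a simple per-column scale, then dequantized and multiplied. All shapes, the scaling scheme, and the variable names are illustrative assumptions; the actual kernel operates on packed storage with sparse tensor-core instructions.

```python
import numpy as np
rng = np.random.default_rng(0)

K, N, M = 16, 8, 5                                   # weight is K x N, input is M x K
W = rng.standard_normal((K, N))

# Offline: prune to 2:4 along K and quantize the survivors to 4-bit integers.
groups = W.reshape(K // 4, 4, N)
keep = np.argsort(-np.abs(groups), axis=1)[:, :2, :]          # 2 kept slots per group of 4
vals = np.take_along_axis(groups, keep, axis=1)               # surviving values
scales = np.abs(vals).max(axis=(0, 1)) / 7.0                  # per-column symmetric scale
q = np.clip(np.round(vals / scales), -8, 7).astype(np.int8)   # signed 4-bit values

# Runtime semantics the kernel fuses: dequantize, scatter into the 2:4 layout, multiply.
W_hat = np.zeros_like(W).reshape(K // 4, 4, N)
np.put_along_axis(W_hat, keep, q.astype(np.float64) * scales, axis=1)
W_hat = W_hat.reshape(K, N)

x = rng.standard_normal((M, K))
y = x @ W_hat                                        # (M, N) output of the sparse-quantized matmul
```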

Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

2:4 Sparsity + AI at Meta Llama-3.1: At @neuralmagic, we've developed a recipe to produce very competitive sparse LLMs, and we are starting by open-sourcing the first one: Sparse-Llama-3.1-8B-2of4. We also show how to leverage it for blazingly fast inference in vLLM.

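A hedged sketch of running such a checkpoint with vLLM's offline API; the Hugging Face model id below is inferred from the tweet's naming and should be checked against the actual model card, as should any extra engine arguments it recommends.

```python
from vllm import LLM, SamplingParams

# Assumed repository id, based on the name "Sparse-Llama-3.1-8B-2of4" in the announcement.
llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Explain 2:4 structured sparsity in one sentence."], params)
print(outputs[0].outputs[0].text)
```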
Eldar Kurtic (@_eldarkurtic) 's Twitter Profile Photo

4) The 2:4 models are compatible with quantization. We apply GPTQ to quantize weights to 4-bit integers, but modify it such that it preserves the 2:4 sparsity pattern. The resulting model has a minimal drop in accuracy, but runs super fast with the Sparse-Marlin kernel in vLLM.

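To make "preserves the 2:4 sparsity pattern" concrete, here is a simplified sketch using mask-aware round-to-nearest; GPTQ itself additionally uses second-order information to compensate rounding error, and all names and shapes below are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

W = rng.standard_normal((128, 128))

# Fixed 2:4 mask (as produced by the earlier sparsification step).
g = W.reshape(-1, 4)
mask = np.zeros_like(g, dtype=bool)
np.put_along_axis(mask, np.argsort(-np.abs(g), axis=1)[:, :2], True, axis=1)
W_24 = np.where(mask, g, 0.0).reshape(W.shape)

# Mask-aware 4-bit round-to-nearest: only surviving weights are quantized,
# so pruned positions stay exactly zero and the structure remains valid 2:4.
scale = np.abs(W_24).max(axis=0, keepdims=True) / 7.0
W_q = np.where(mask.reshape(W.shape), np.clip(np.round(W_24 / scale), -8, 7) * scale, 0.0)

assert (np.count_nonzero(W_q.reshape(-1, 4), axis=1) <= 2).all()   # still a valid 2:4 pattern
print("relative quantization error:", np.linalg.norm(W_q - W_24) / np.linalg.norm(W_24))
```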
Red Hat (@redhat) 's Twitter Profile Photo

Today, Red Hat completed the acquisition of Neural Magic (now Red Hat AI), a pioneer in software and algorithms that accelerate #GenAI inference workloads. Read how we are accelerating our vision for #AI’s future: red.ht/408kJ8K.

Saleh Ashkboos (@ashkboossaleh) 's Twitter Profile Photo

Happy to release #HALO, a Hadamard-Assisted Lower-Precision scheme that enables INT8/FP6 full #finetuning (FFT) of LLMs.

with @mmnnn76, Rush Tabesh, Roberto López Castro, Torsten Hoefler 🇨🇭, and Dan Alistarh

Paper: arxiv.org/pdf/2501.02625
Code: github.com/IST-DASLab/HALO

1/4
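
The Hadamard part can be illustrated in isolation: an orthogonal Hadamard rotation spreads activation outliers across channels, which lowers the error of low-precision quantization. The snippet below is only a toy illustration of that effect, not HALO's actual fine-tuning scheme; sizes and names are assumptions.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
d = 256
H = hadamard(d) / np.sqrt(d)                 # orthogonal: H @ H.T == I

# Activations with a few large outlier channels, the usual obstacle to INT8.
x = rng.standard_normal((32, d))
x[:, :4] *= 50.0

def int8_rtn(a):
    """Per-tensor symmetric INT8 round-to-nearest."""
    s = np.abs(a).max() / 127.0
    return np.clip(np.round(a / s), -128, 127) * s

err_plain = np.linalg.norm(int8_rtn(x) - x) / np.linalg.norm(x)
err_rotated = np.linalg.norm(int8_rtn(x @ H) @ H.T - x) / np.linalg.norm(x)
print(err_plain, err_rotated)                # the rotated version has much lower error
```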
SPCL@ETH (@spcl_eth) 's Twitter Profile Photo

Yesterday, Jiale and Roberto presented the paper MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models at PPoPP 2025 in Las Vegas! 

Want to know more? Check out the paper here👇dl.acm.org/doi/10.1145/37…

#HPC Torsten Hoefler 🇨🇭 ETH CS Department
Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Our QuEST paper was selected for an oral presentation at the Sparsity in LLMs Workshop at ICLR 2025! QuEST is the first algorithm with Pareto-optimal LLM training for 4-bit weights/activations, and can even train accurate 1-bit LLMs.

Paper: arxiv.org/abs/2502.05003
Code: github.com/IST-DASLab/QuE…

Dan Alistarh (@dalistarh) 's Twitter Profile Photo

Introducing MoE-Quant, a fast version of GPTQ for MoEs, with:
* Optimized Triton kernels and expert & data parallelism
* Quantizes the 671B DeepSeekV3/R1 models in 2 hours on 8xH100
* ~99% accuracy recovery for 4bit R1 on *reasoning* tasks, and 100% recovery on leaderboards  
[1/3]
Dan Alistarh (@dalistarh) 's Twitter Profile Photo

We are introducing Quartet, a fully FP4-native training method for Large Language Models, achieving optimal accuracy-efficiency trade-offs on NVIDIA Blackwell GPUs! Quartet can be used to train billion-scale models in FP4 faster than FP8 or FP16, at matching accuracy. [1/4]
