Steffen Röcker (@sroecker)'s Twitter Profile
Steffen Röcker

@sroecker

OG local LLaMA shill. Sr. Solution Architect @RedHat, ex particle physicist. Born @ 347 ppm CO₂. Personal account, potentially unaligned.

ID: 22403036

Joined: 01-03-2009 20:33:59

3.3K Tweets

1.1K Followers

6.6K Following

Cohere Labs (@cohere_labs):

Join us for a deep dive into Zero-Shot Named Entity Recognition with GLiNER, presented by Ihor Stepanov on Tuesday, August 26th.

Thanks to our Retrieval and Search program leads Mayank Rakesh and Avinab Neogy for organizing this session ✨

Learn more: cohere.com/events/Cohere-…
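For a taste of what zero-shot NER with GLiNER looks like, here is a minimal sketch using the gliner Python package; the checkpoint, label set, and threshold are illustrative choices, not details from the session:

```python
# Zero-shot NER: GLiNER matches spans against arbitrary label strings,
# so changing the entity schema needs no retraining.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")  # illustrative checkpoint

text = "Ihor Stepanov presents GLiNER at the Cohere Labs session on August 26th."
labels = ["person", "model", "organization", "date"]  # any labels you like

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']!r} -> {ent['label']}")
```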
vLLM (@vllm_project):

🚀 LLM Compressor v0.7.0 is here! This release brings powerful new features for quantizing large language models, including transform support (QuIP, SpinQuant), mixed precision compression, improved MoE handling with Llama4 support, and more. Full blog: developers.redhat.com/articles/2025/…
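For context, a one-shot quantization run with LLM Compressor is a short script. Below is a minimal sketch assuming the oneshot API and the FP8-dynamic recipe from the project's examples; import paths and arguments may vary by release:

```python
# Hedged sketch of one-shot quantization with llmcompressor; FP8_DYNAMIC
# needs no calibration data. Import paths follow recent examples and may
# differ in older releases.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all linear layers...
    scheme="FP8_DYNAMIC",  # ...to FP8 weights with dynamic activation scales
    ignore=["lm_head"],    # keep the output head in full precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-FP8-Dynamic",
)
```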

Red Hat AI (@redhat_ai):

Let's break down intelligent inference serving.

Traditional serving uses basic round-robin load balancing where requests are sent to the "next" pod.

Intelligent inference serving makes scheduling decisions based on AI-specific workload signals.

Let's dig into what this means.
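To make the contrast concrete, here is a toy Python sketch; the Pod fields and scoring weights are invented for illustration and are not the actual inference-gateway scheduler:

```python
# Hypothetical contrast between round-robin and signal-aware scheduling.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Pod:
    name: str
    queue_depth: int      # requests already waiting on this replica
    kv_cache_util: float  # fraction of KV-cache memory in use

pods = [Pod("pod-a", 4, 0.9), Pod("pod-b", 1, 0.3), Pod("pod-c", 7, 0.7)]

# Traditional: blindly take the "next" pod in rotation.
rotation = cycle(pods)
def round_robin() -> Pod:
    return next(rotation)

# Intelligent: score pods on AI-specific signals, pick the least loaded.
def signal_aware() -> Pod:
    return min(pods, key=lambda p: p.queue_depth + 10 * p.kv_cache_util)

print(round_robin().name)   # whatever is next in rotation
print(signal_aware().name)  # pod-b: short queue, cool KV cache
```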
Dan Alistarh (@dalistarh):

🚀 Excited to announce QuTLASS v0.1.0 🎉

QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS.

The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!)

[1/N]
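For intuition, here is a NumPy sketch of the microscaling idea: a shared scale per small block of values, each value snapped to the FP4 (E2M1) grid. Real NVFP4 stores FP8 block scales and QuTLASS implements this in fused GPU kernels, so this is purely illustrative:

```python
# Toy 4-bit microscaling quantizer: one scale per 16-element block,
# values rounded to the nearest representable E2M1 magnitude.
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
BLOCK = 16  # NVFP4 uses 16-element blocks

def quantize_microscale(x: np.ndarray) -> np.ndarray:
    x = x.reshape(-1, BLOCK)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID.max()
    scaled = x / np.where(scales == 0, 1, scales)
    # snap each value to the nearest FP4 magnitude, keeping the sign
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scales  # dequantize back for error comparison

x = np.random.randn(4, BLOCK).astype(np.float32)
err = np.abs(x - quantize_microscale(x)).mean()
print(f"mean abs quantization error: {err:.4f}")
```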
Omer Cheema (@omercheeema):

Someone at a16z claimed a few weeks ago that 80% of Bay Area startups are building on Chinese open source models. The graphic below shows Chinese model downloads exceeding US models on HuggingFace.
Red Hat AI (@redhat_ai):

🚀 Thrilled to announce GuideLLM v0.3.0!

This release is highlighted by a brand new Web UI, containerized benchmarking, and powerful dataset preprocessing.

GuideLLM GitHub: github.com/vllm-project/g…

(Thread 👇)
merve (@mervenoyann):

IBM just released a small Swiss army knife for document models: granite-docling-258M 🔥

not only a document converter, it can also do document question answering and understands multiple languages 🤯

with an Apache 2.0 license 👏
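For reference, the docling package drives this kind of conversion from Python. A minimal sketch of the generic pipeline follows; wiring granite-docling specifically may need a VLM pipeline option, and the source URL is illustrative:

```python
# Convert a document (PDF, DOCX, image, ...) and export it as Markdown.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("https://arxiv.org/pdf/2408.09869")  # illustrative source
print(result.document.export_to_markdown()[:500])  # first 500 chars
```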
Julian Schrittwieser (@mononofu):

As a researcher at a frontier lab, I'm often surprised by how unaware public discussions are of current AI progress. I wrote a post summarizing studies of recent progress and what we should expect in the next 1-2 years: julian.ac/blog/2025/09/2…

Zichen Liu @ ICLR2025 (@zzlccc):

much more convinced after getting my own results:
LoRA with rank=1 learns (and generalizes) as well as full fine-tuning while saving 43% of vRAM usage! This allows me to RL bigger models with limited resources 😆

script: github.com/sail-sg/oat/bl…
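For reference, configuring rank-1 LoRA takes only a few lines with Hugging Face PEFT; the base model and target modules below are illustrative choices, not taken from the linked oat script:

```python
# Rank-1 LoRA: each adapted weight gets a single outer-product update,
# so the trainable parameter count (and optimizer memory) is tiny.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
config = LoraConfig(
    r=1,  # rank-1 adapters
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the full model
```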
Red Hat AI (@redhat_ai):

LLM Compressor 0.8.0 is here, with extended support for Qwen3-Next and Qwen3-VL models, improved GPTQ accuracy, and more flexible quantization workflows. Explore what’s new in this release 👇
Red Hat AI (@redhat_ai):

If you’re building with open source AI, join Red Hat AI Day of Learning next week (Oct 16) for deep dives into vLLM, LLM Compressor, agentic AI, scaling inference, and more. Free & virtual → redhat.com/en/events/webi…

merve (@mervenoyann):

IBM Granite team released Granite 4 Nano models

1B variant outperforms Qwen3-1.7B with fewer params on a mix of tasks from math to coding 👏
Percy Liang (@percyliang):

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:
Red Hat AI (@redhat_ai):

Good news - we'll be live streaming the first official vLLM meetup in Europe from Zürich.

Thu, Nov 6 at 11:30am ET / 8:30am PT / 5:30pm CET

Hear from vLLM maintainers and contributors at Red Hat, IBM, and Mistral AI covering quantization, hybrid models, distributed

PyTorch (@pytorch):

Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM!

Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1.

🔗
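From the user's side, running one of these hybrids looks like any other vLLM model. A minimal sketch with vLLM's offline API, where the checkpoint id is a hypothetical Granite 4.0 hybrid:

```python
# Offline inference with a hybrid (attention + SSM) model in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-h-tiny")  # hypothetical checkpoint id
outputs = llm.generate(
    ["Explain why hybrid attention/Mamba models need special KV-cache handling."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```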
Red Hat AI (@redhat_ai):

The latest Kimi K2 Thinking model is officially released in the compressed-tensors (INT4A16) format, enabling faster, more efficient reasoning and tool-use at scale.

INT4A16 delivers ~2x speedup with minimal accuracy loss. It's ideal for 256K context and agentic tasks.

Kudos to
Eldar Kurtic (@_eldarkurtic):

Today, we are officially open-sourcing a set of high-quality speculator models on the Hugging Face Hub.

Our first release includes Llamas, Qwens, and gpt-oss. In practice, you can expect 1.5–2.5× speedups on average, with some workloads seeing more than 4× improvements!
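Plugging a speculator into vLLM is essentially a one-line config change. A minimal sketch, where the speculator hub id is hypothetical and the speculative_config keys follow recent vLLM docs, so check your installed version:

```python
# Speculative decoding: a small draft model proposes tokens that the
# target model verifies in a single forward pass.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "RedHatAI/Llama-3.1-8B-Instruct-speculator",  # hypothetical id
        "num_speculative_tokens": 3,  # draft tokens proposed per step
    },
)
out = llm.generate(["What is speculative decoding?"], SamplingParams(max_tokens=48))
print(out[0].outputs[0].text)
```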