Steffen Röcker (@sroecker) Twitter Tweets • TwiCopy

Cohere Labs

4 months ago

Join us for a deep dive into Zero-Shot Named Entity Recognition with GLiNeR presented by Ihor Stepanov on Tuesday, August 26th. Thanks to our Retrieval and Search program leads Mayank Rakesh and Avinab Neogy for organizing this session ✨ Learn more: cohere.com/events/Cohere-…

Likes: 21 · Replies: 0 · Reposts: 6

vLLM

@vllm_project

3 months ago

🚀 LLM Compressor v0.7.0 is here! This release brings powerful new features for quantizing large language models, including transform support (QuIP, SpinQuant), mixed precision compression, improved MoE handling with Llama4 support, and more. Full blog: developers.redhat.com/articles/2025/…

Likes: 247 · Replies: 3 · Reposts: 45

Red Hat AI

@redhat_ai

3 months ago

Let's break down intelligent inference serving. Traditional serving uses basic round-robin load balancing where requests are sent to the "next" pod. Intelligent inference serving makes scheduling decisions based on AI-specific workload signals. Let's dig into what this means.

Likes: 36 · Replies: 2 · Reposts: 11
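
The contrast drawn in the post above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular serving stack schedules requests: the `Pod` type, the two signals (`queued_requests`, `kv_cache_usage`), and the `cache_weight` scoring are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Pod:
    name: str
    queued_requests: int   # hypothetical signal: requests waiting on this pod
    kv_cache_usage: float  # hypothetical signal: fraction of KV cache in use (0.0 to 1.0)


def round_robin(pods):
    """Return an iterator that hands out pods in fixed rotation, ignoring load."""
    return cycle(pods)


def pick_least_loaded(pods, cache_weight=10.0):
    """Pick the pod with the lowest illustrative load score.

    The score (queue depth plus weighted cache usage) is an assumption
    made for this sketch, not a real scheduler's formula.
    """
    return min(pods, key=lambda p: p.queued_requests + cache_weight * p.kv_cache_usage)


pods = [
    Pod("pod-a", queued_requests=8, kv_cache_usage=0.90),
    Pod("pod-b", queued_requests=1, kv_cache_usage=0.20),
    Pod("pod-c", queued_requests=4, kv_cache_usage=0.50),
]

# Round-robin sends four requests to a, b, c, a regardless of how busy each pod is.
rr = round_robin(pods)
rr_choices = [next(rr).name for _ in range(4)]

# The load-aware picker looks at the signals and selects the least busy pod.
aware_choice = pick_least_loaded(pods).name
```

With these made-up numbers, round-robin still sends the first and fourth requests to the busiest pod, while the signal-based picker routes to the pod with the shortest queue and the most free cache.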

Dan Alistarh

@dalistarh

3 months ago

🚀 Excited to announce QuTLASS v0.1.0 🎉 QuTLASS is a high-performance library for low-precision deep learning kernels, following NVIDIA CUTLASS. The new release brings 4-bit NVFP4 microscaling and fast transforms to NVIDIA Blackwell GPUs (including the B200!) [1/N]

Likes: 220 · Replies: 3 · Reposts: 36

vLLM

@vllm_project

3 months ago

Wow, thanks to Charles 🎉 Frye, you can understand the internals of vLLM with a live notebook from Modal 🥰

Likes: 335 · Replies: 3 · Reposts: 32

Omer Cheema

@omercheeema

3 months ago

Someone at a16z claimed a few weeks ago that 80% of Bay Area startups are building on Chinese open source models. The graphic below shows Chinese model downloads exceeding US models on HuggingFace.

Likes: 2.2K · Replies: 69 · Reposts: 358

Red Hat AI

@redhat_ai

3 months ago

🚀 Thrilled to announce GuideLLM v0.3.0! This release is highlighted by a brand new Web UI, containerized benchmarking, and powerful dataset preprocessing. GuideLLM GitHub: github.com/vllm-project/g… (Thread 👇)

Likes: 21 · Replies: 1 · Reposts: 11

merve

@mervenoyann

3 months ago

IBM just released a small Swiss Army knife for document models: granite-docling-258M 🔥 It is not only a document converter; it can also do document question answering and understands multiple languages 🤯 with an Apache 2.0 license 👏

Likes: 809 · Replies: 15 · Reposts: 120

Alexander Doria

@dorialexander

3 months ago

small models are the frontier now.

Likes: 466 · Replies: 5 · Reposts: 43

Julian Schrittwieser

@mononofu

2 months ago

As a researcher at a frontier lab I’m often surprised by how unaware of current AI progress public discussions are. I wrote a post to summarize studies of recent progress, and what we should expect in the next 1-2 years: julian.ac/blog/2025/09/2…

Likes: 5.5K · Replies: 231 · Reposts: 810

Zichen Liu @ ICLR2025

@zzlccc

2 months ago

much more convinced after getting my own results: LoRA with rank=1 learns (and generalizes) as well as full-tuning while saving 43% vRAM usage! allows me to RL bigger models with limited resources😆 script: github.com/sail-sg/oat/bl…

Likes: 801 · Replies: 8 · Reposts: 95

Red Hat AI

@redhat_ai

2 months ago

LLM Compressor 0.8.0 is here, with extended support for Qwen3-Next and Qwen3-VL models, improved GPTQ accuracy, and more flexible quantization workflows. Explore what’s new in this release 👇

Likes: 16 · Replies: 1 · Reposts: 2

Red Hat AI

@redhat_ai

2 months ago

If you’re building with open source AI, join Red Hat AI Day of Learning next week (Oct 16) for deep dives into vLLM, LLM Compressor, agentic AI, scaling inference, and more. Free & virtual → redhat.com/en/events/webi…

Likes: 11 · Replies: 0 · Reposts: 3

merve

@mervenoyann

a month ago

The IBM Granite team released the Granite 4 Nano models. The 1B variant outperforms Qwen3-1.7B with fewer params on a mix of tasks from math to coding 👏

Likes: 278 · Replies: 13 · Reposts: 29

Percy Liang

@percyliang

a month ago

⛵Marin 32B Base (mantis) is done training! It is the best open-source base model (beating OLMo 2 32B Base) and it’s even close to the best comparably-sized open-weight base models, Gemma 3 27B PT and Qwen 2.5 32B Base. Ranking across 19 benchmarks:

Likes: 562 · Replies: 19 · Reposts: 84

Red Hat AI

@redhat_ai

a month ago

Good news: we'll be live streaming the first official vLLM meetup in Europe from Zürich. Thu, Nov 6 at 11:30am ET / 8:30am PT / 5:30pm CET. Hear from vLLM maintainers and contributors at Red Hat, IBM, and Mistral AI covering quantization, hybrid models, distributed

Likes: 24 · Replies: 1 · Reposts: 6

PyTorch

@pytorch

a month ago

Hybrid models like Qwen3-Next, Nemotron Nano 2 and Granite 4.0 are now fully supported in vLLM! Check out our latest blog from the vLLM team at IBM to learn how the vLLM community has elevated hybrid models from experimental hacks in V0 to first-class citizens in V1. 🔗

Likes: 146 · Replies: 2 · Reposts: 37

Red Hat AI

@redhat_ai

a month ago

The latest Kimi K2 Thinking model is officially released in the compressed-tensors (INT4A16) format, enabling faster, more efficient reasoning and tool-use at scale. INT4A16 delivers ~2x speedup with minimal accuracy loss. It's ideal for 256K context and agentic tasks. Kudos to

Likes: 133 · Replies: 4 · Reposts: 11

Eldar Kurtic

@_eldarkurtic

19 days ago

Today, we are officially open-sourcing a set of high-quality speculator models on the Hugging Face Hub. Our first release includes Llamas, Qwens, and gpt-oss. In practice, you can expect 1.5–2.5× speedups on average, with some workloads seeing more than 4× improvements!

Likes: 35 · Replies: 2 · Reposts: 6

Luca Soldaini ✈️ ICLR 25

@soldni

7 days ago

So cool to see Artificial Analysis add openness in their analysis!

Likes: 16 · Replies: 1 · Reposts: 1