y7xyz (@y7xyz_) 's Twitter Profile
y7xyz

@y7xyz_

ID: 1917477549774626819

Joined: 30-04-2025 07:14:22

33 Tweets

13 Followers

78 Following

David Hendrickson (@teksedge):

🚀 IBM just soft-launched Granite 4.1! 🔥

It's a new family of dense, open-source models (Apache 2.0) built for real enterprise workloads.

Another personal inferencing candidate.

📦 Full Family (128K context):
• 30B: Highest performance
• 8B: Sweet spot (GSM8K 92.5% •
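
A quick way to kick the tires on a dense family like this is plain transformers. The sketch below is a minimal example; the repo id is a guess at the naming scheme rather than a confirmed model card, so treat it as a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id: the exact Granite 4.1 model names are not confirmed in the post.
model_id = "ibm-granite/granite-4.1-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # a dense 8B model is roughly 16 GB of weights at bf16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the key risks in this quarter's incident log."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
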
🧬Maxpein🧬 (@maximumpain333):

WE LIVE IN A MIND MATTER UNIVERSE. Without a deep understanding of our own energy field we are missing out on the number one contributor to our health and well-being. Even more, your field holds all of your trauma, memory, loops, patterns, and is essentially the fingerprint of your

Mario Nawfal’s Roundtable (@roundtablespace):

Someone just publicly committed to beating Claude Code with a fully local alternative by end of year. They're building vllm-studio - a control panel for VLLM, SGLang, llama.cpp, and exllamav3. The local AI war just got a named target.

AJ 💙 (@itsmeajaykv):

Qwen3.6-35B-A3B (TQ3_4S ~4bpw) on RTX 3060 (12GB) via llama.cpp-tq3 (TurboQuant):

• ~619 t/s prompt (4K ctx)
• ~60 t/s generation (128K ctx)
• fits in ~12.4GB VRAM
128K context with usable decode speed on a single 3060 is kind of wild
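
Throughput claims like this are easy to sanity-check. The sketch below times a running llama.cpp llama-server through its OpenAI-compatible endpoint: a one-token request approximates prompt-processing speed, a longer request approximates decode speed. The port and prompt are placeholders, the method is a rough approximation (prompt caching between the two calls will skew it), and it is not the setup the author used.

import time, requests

URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumes llama-server is already running here
PROMPT = "lorem ipsum " * 2000                     # crude stand-in for a ~4K-token prompt

def timed(max_tokens):
    body = {"messages": [{"role": "user", "content": PROMPT}],
            "max_tokens": max_tokens, "temperature": 0}
    t0 = time.time()
    resp = requests.post(URL, json=body, timeout=600).json()
    return time.time() - t0, resp["usage"]

# A 1-token generation is dominated by prefill (prompt processing).
prefill_s, usage = timed(1)
print(f"prompt: ~{usage['prompt_tokens'] / prefill_s:.0f} t/s")

# Subtract the prefill estimate from a longer run to approximate pure decode speed.
total_s, usage = timed(512)
print(f"generation: ~{usage['completion_tokens'] / max(total_s - prefill_s, 1e-6):.0f} t/s")
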
ollama (@ollama):

🤯 Ollama now supports Claude Desktop via Claude’s built-in third party inference.

ollama launch claude-desktop

This allows all models from Ollama's Cloud to be used across Claude Cowork and Claude Code from the Claude Desktop app.
kosovi (@nimses1010):

The Rocket Sorcerer: the secret the intelligence agencies hid about "opening the gates of the dimensions" in 1946!

A rocket scientist and founder of NASA's JPL laboratory, he led rocket-fuel research by day and practiced "Thelema magic" by night under the guidance of "Aleister Crowley".
🔹The document: FBI files (No. 100-245448) confirm his membership in a cult that practiced
ハカセ アイ(Ai-Hakase)🐾最新トレンドAIのためのX 🐾 (@ai_hakase_):

Llama.cpp finally gets MTP support! Local AI generation speed moves to another dimension 🚀

Long-awaited beta support for Multi-Token Prediction (MTP) has landed in Llama.cpp! By predicting multiple tokens at once, it makes local LLMs run dramatically faster.

🌟 Highlights
・Generation speed jumps by up to 1.5-2.0x
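
For context on why predicting multiple tokens at once speeds things up: extra tokens are drafted cheaply and then checked against the full model, and every draft token that survives verification is a token you did not pay a full sequential decode step for. The toy loop below illustrates only that accept/reject idea; it is not llama.cpp's actual MTP code, and a real implementation verifies all draft positions in one batched forward pass instead of looping.

def draft_and_verify(prompt, full_model_next, draft_next_k, n_new, k=4):
    """Toy accept/reject loop behind MTP-style speculative decoding.

    full_model_next(tokens) -> the full model's next token for this context
    draft_next_k(tokens, k) -> k cheaply drafted candidate tokens
    """
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        drafts = draft_next_k(out, k)          # cheap guesses for the next k tokens
        ctx = list(out)
        for guess in drafts:
            target = full_model_next(ctx)      # in practice one batched pass checks all k
            ctx.append(target)
            if target != guess:                # first mismatch: keep the correction, stop
                break
        out = ctx                              # accepts up to k tokens per full-model pass
    return out
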
Witcheer | b/era (@0xwitcheer):

benched 5 open-source models on my windows tower (RTX 4060 Ti 8GB, Ryzen 5 7600x, 32GB DDR5). 

all q4_k_m, LM Studio, full GPU offload, 16K context.

results:

> nemotron-3-nano-4b

80.7 t/s, 3.6GB VRAM. fastest small model I've benched on 8GB.

> gemma-4-e4b

68.5 t/s, 6.0GB
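
To run a comparison like this yourself, one rough approach is to loop over whatever models the LM Studio local server exposes and time a fixed generation per model. A hedged sketch, assuming the server is on its default port with the OpenAI-compatible API enabled; VRAM use still has to be read separately (e.g. from nvidia-smi), and wall-clock t/s here lumps prompt processing in with decode.

import time, requests

BASE = "http://localhost:1234/v1"   # LM Studio local server default address (assumption)

models = [m["id"] for m in requests.get(f"{BASE}/models").json()["data"]]

for model_id in models:
    body = {
        "model": model_id,
        "messages": [{"role": "user", "content": "Explain KV caching in two short paragraphs."}],
        "max_tokens": 256,
        "temperature": 0,
    }
    t0 = time.time()
    resp = requests.post(f"{BASE}/chat/completions", json=body, timeout=600).json()
    elapsed = time.time() - t0
    gen_tokens = resp["usage"]["completion_tokens"]
    print(f"{model_id}: ~{gen_tokens / elapsed:.1f} t/s ({elapsed:.1f}s wall clock)")
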
Unsloth AI (@unslothai):

We made a guide on how to run open LLMs in Claude Code, Codex and OpenClaw.

Use Gemma 4 and Qwen3.6 GGUFs for local agentic coding on 24GB RAM

Run with self-healing tool calls, code execution, web search via the Unsloth API endpoint and llama.cpp

Guide: unsloth.ai/docs/basics/api
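
The guide is the authoritative reference; purely as a sketch of the local serving half, the snippet below starts llama.cpp's llama-server on a GGUF and waits for its health check, which yields an OpenAI-compatible endpoint that coding agents can be pointed at. The model path and port are placeholders.

import subprocess, time, requests

MODEL = "models/Qwen3.6-coder-Q4_K_M.gguf"   # placeholder path to a downloaded GGUF
PORT = 8080

# -ngl 99 offloads all layers to the GPU; -c sets the context window the agent can use.
server = subprocess.Popen(
    ["llama-server", "-m", MODEL, "-c", "32768", "-ngl", "99", "--port", str(PORT)]
)

# Wait for the model to finish loading before pointing a client at it.
for _ in range(180):
    try:
        if requests.get(f"http://127.0.0.1:{PORT}/health", timeout=2).ok:
            print(f"OpenAI-compatible endpoint ready at http://127.0.0.1:{PORT}/v1")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(1)
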
Hugging Models (@huggingmodels):

Meet Qwen3.6-27B-Claude-Opus-Reasoning-Distill-v2-int4-AutoRound. A massive 27B parameter model that combines Qwen3.5's reasoning with Claude Opus distillation. This is advanced text and image reasoning compressed into a 4-bit quantized package, making it runnable on consumer
Hugging Models (@huggingmodels):

Just dropped: storagejuju/kimi-k2.6-ud-q8-k-xl-juju. A custom compressed model using kimi_k25 architecture. It's designed for efficient, high-performance inference in the US region. Get ready for a new level of AI speed.
Google AI Developers (@googleaidevs):

Speed up your Gemma 4 workflows by up to 3x with Multi-Token Prediction (MTP) drafters.

Standard LLM inference is fundamentally memory-bandwidth bound, creating a latency bottleneck as billions of parameters travel from VRAM just to generate a single token. We're working to ease
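
The bandwidth-bound point can be made with simple arithmetic: at batch size 1, each decoded token needs roughly one full read of the weights from VRAM, so memory bandwidth divided by model size bounds tokens per second. The numbers below are illustrative assumptions, not published Gemma 4 or GPU specs.

# Back-of-envelope decode ceiling: tokens/s <= memory bandwidth / bytes read per token.
weights_gb = 9 * 2        # hypothetical 9B-parameter model at bf16 (2 bytes per parameter)
bandwidth_gb_s = 960      # hypothetical GPU memory bandwidth

print(f"~{bandwidth_gb_s / weights_gb:.0f} tokens/s upper bound per sequence")  # ~53 t/s

# An MTP drafter lets one pass of the big model accept several drafted tokens,
# raising the effective tokens produced per weight read, which is where "up to 3x" comes from.
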
Ton Incubator (@ton_incubator):

Elon Musk said he once had dinner with a top physicist and a top computer scientist and asked them what they thought the probability was that we are living in a simulation. They answered simultaneously, 0% and 100% respectively. It was like a double-slit experiment, but with
vLLM (@vllm_project):

🚀 Day-0 MTP support for Gemma4 now available at vLLM with ready-to-use docker image!

⚡️Enjoy up to 3x faster decoding performance to supercharge your development with zero quality degradation!

Check out the full vLLM recipes for Gemma 4 model series👇
recipes.vllm.ai/Google/gemma-4…
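
For the non-Docker route, here is a minimal offline sketch with the vLLM Python API; the model id is a placeholder, and the MTP/speculative settings that deliver the advertised decode speedup are deliberately left to the linked recipes rather than guessed at here.

from vllm import LLM, SamplingParams

# Placeholder model id; see the recipes page for the exact Gemma 4 checkpoints
# and the speculative/MTP configuration that enables the faster decode path.
llm = LLM(model="google/gemma-4-instruct-placeholder", max_model_len=8192)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Write a short note on speculative decoding."], params)
print(outputs[0].outputs[0].text)
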
طريق البيتكوين (@bitcoin_way):

📶 A new Silicon Data index tracks the cost of "digital fuel", that is, the cost of running AI models per million tokens 📊 The core idea: just as there are indices for oil or metals prices, this is an index for AI.

This index measures "spending on AI large language model tokens" (LLM Token
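
As a concrete reading of the unit being tracked, here is a trivial cost calculation; the price and volume are made-up numbers purely to show how "per million tokens" translates into spend.

# Hypothetical numbers, only to illustrate the "per million tokens" unit.
price_per_million_usd = 0.60        # made-up price per 1M generated tokens
tokens_per_month = 2_000_000_000    # made-up monthly volume (2B tokens)

monthly_spend = price_per_million_usd * tokens_per_month / 1_000_000
print(f"${monthly_spend:,.0f} per month")   # -> $1,200 per month
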
Akshay 🚀 (@akshay_pachaar):

NVIDIA + Unsloth just dropped a guide on making fine-tuning 25% faster.

this is hands-down the cleanest systems-level writeup i've read.

you'll learn how 3 optimizations help your gpu train models faster:

1. packed-sequence metadata caching
2. double-buffered checkpoint
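
The list is cut off above, but the first optimization is easy to illustrate: sequence packing concatenates short tokenized samples into fixed-length rows so the GPU spends almost no compute on padding, and the per-sample boundary metadata that packing requires is exactly what the caching trick avoids recomputing. The greedy sketch below shows the packing idea only; it is not the NVIDIA/Unsloth implementation.

def pack_sequences(tokenized_samples, max_len, pad_id=0):
    """Greedy packing: fill each row of max_len tokens with as many samples as fit."""
    rows, current = [], []
    for sample in sorted(tokenized_samples, key=len, reverse=True):
        sample = sample[:max_len]                 # truncate anything longer than a row
        if len(current) + len(sample) <= max_len:
            current.extend(sample)
        else:
            rows.append(current)
            current = list(sample)
    if current:
        rows.append(current)
    # Only the tail of each packed row is padded; a real trainer also builds
    # per-sample position ids / attention boundaries so samples don't attend to each other.
    return [row + [pad_id] * (max_len - len(row)) for row in rows]

# Example: three short samples packed into one 16-token row instead of three padded rows.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=16))
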
Perplexity (@perplexity_ai):

We’ve developed our own inference engine Runtime-Optimized Serving Engine (ROSE) to serve models ranging from embeddings to trillion-parameter LLMs.

With CuTeDSL integrated into our inference engine, Perplexity can build the specialized GPU kernels faster to bring models up to