Sudo su (@sudoingx)'s Twitter Profile
Sudo su

@sudoingx

Solo dev founder crafting fast ML/AI compute platform & Style Assistant. Sharing code snippets, founder lessons, & build-in-public wins 🔥

ID: 1555661341914198016

Joined: 05-08-2022 21:05:57

185 Tweets

83 Followers

153 Following


i've been wanting to run this comparison for weeks. dense vs MoE. same param count. same GPU. completely different architecture.

here's what caught my eye. hermes 4.3. 36B dense. 93.8% on MATH-500. 512K context. every single parameter active on every forward pass. no routing.

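the dense-vs-MoE gap above can be put in numbers. a minimal sketch, assuming the usual MoE naming convention where "A3B" in Qwen3.5-35B-A3B means roughly 3B parameters active per token (exact routing details vary by model):

```python
# Per-token compute scales with *active* parameters, not total.
# Assumption: "A3B" ~= 3B params routed/active per forward pass.

def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters touched on each forward pass."""
    return active_b / total_b

dense = active_fraction(36, 36)  # hermes 4.3: every param, every token
moe = active_fraction(35, 3)     # Qwen3.5-35B-A3B: only routed experts

print(f"dense active fraction: {dense:.0%}")  # 100%
print(f"MoE active fraction:   {moe:.0%}")    # ~9%
```

same order of total params, roughly an order of magnitude apart in per-token compute. that's the whole trade: dense buys consistency, MoE buys speed.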

qwen just dropped 4 new models. 0.8B runs on a phone. 9B runs on 6GB RAM. same Qwen3.5 family, same 256K context, all the way down.

i just finished benchmarking the 35B MoE across 15 GPUs and published the full breakdown. now the entire family is here. 0.8B, 2B, 4B, 9B.
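a quick sanity check on why 9B fits in 6GB. this is a rule-of-thumb sketch, assuming roughly 4.5 bits per parameter for a Q4_K_M-style quant (actual file sizes vary by quant mix), with the remaining headroom going to KV cache and runtime buffers:

```python
# Rough memory needed just to hold the weights at ~4-bit quantization.
# Assumption: ~4.5 bits/param, a common ballpark for Q4_K_M-style quants.

def weights_gb(params_b: float, bits_per_param: float = 4.5) -> float:
    """Approximate weight footprint in GB for a given param count (billions)."""
    return params_b * bits_per_param / 8

for size in (0.8, 2, 4, 9):
    print(f"{size}B -> ~{weights_gb(size):.1f} GB of weights")
```

9B lands around ~5.1 GB of weights, which is how it squeezes onto a 6GB device once you keep the context modest.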


124 tok/s on vLLM with AWQ 4-bit. beating the llama.cpp 112 tok/s number on the same RTX 3090. Qwen3.5-35B-A3B. different engine, different quant, fp8 KV cache instead of q8_0. same GPU. same model. haven't tested vLLM myself yet but it's on the list. if anyone else can verify
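for anyone who wants to try reproducing that number, here's the kind of launch line it implies. a sketch, not a verified repro: the model path is a placeholder for whichever AWQ build of Qwen3.5-35B-A3B you're testing, and the flags are standard vLLM options for AWQ quantization and an fp8 KV cache:

```shell
# Hypothetical launch fragment -- substitute your own AWQ model path.
vllm serve <awq-model-path> \
  --quantization awq \
  --kv-cache-dtype fp8 \
  --max-model-len 32768
```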


testing hermes 4.3 36B through opencode harness. 24GB VRAM. 22K usable context. within 10 tool calls it hit the wall.

the model is 21.8GB at Q4_K_M. on 24GB that forces 32K context for usable speed. but the agent eats 10K tokens for system prompt and tool definitions. leaves 22K


hey if you're thinking about running hermes 4.3 36B as a coding agent on a single RTX 3090, let me save you 24 minutes. the model is 21.8GB at Q4_K_M. on 24GB VRAM that leaves room for 32K context. sounds workable until the agent eats 10K tokens for system prompt and tools. you