Harry (@categorified)'s Twitter Profile
Harry

@categorified

Beauty is truth, truth beauty,—that is all ye know on earth, and all ye need to know.

ID: 1720097875341000705

Joined: 02-11-2023 15:17:55

17 Tweets

5 Followers

103 Following

xjdr (@_xjdr)'s Twitter Profile Photo

stochasm nvfp4 . i am actually writing a blog about it (probably) to go along with the next set of nmoe releases. the expert LR should be different from the dense and embedding LR in _most_ settings. in bf16 it should be lower, but muon and your actual global batch can impact this. its

Harry (@categorified)'s Twitter Profile Photo

I wonder how much of DeepSeek's engram perf gains would disappear with better multi-word tokenisation arxiv.org/abs/2503.13423

Baseten (@basetenco)'s Twitter Profile Photo

🚀 We're thrilled to introduce the fastest, most accurate, and cost-efficient Whisper-powered transcription and diarization on the market:  

• 2400× RTF with Whisper Large V3 Turbo
• Streaming transcription with consistent low latency
• The most accurate real-time diarization
Baseten (@basetenco)'s Twitter Profile Photo

Tired of waiting for video generation? Say less.

We've optimized the Wan 2.2 runtime to hit: 3x faster inference on NVIDIA Blackwell, 2.5x faster on Hopper, 67% cost reduction.

Read the full breakdown of our kernel optimizations and benchmarks here: baseten.co/blog/wan-2-2-v…
Tuhin Srivastava (@tuhinone)'s Twitter Profile Photo

Baseten’s day 0 bet was that inference was the technology that would enable the best user experiences AI could deliver–fast, smart, reliable, secure. And that those experiences would rely not only on a handful of giant general intelligence models, but millions of specialized

NVIDIA AI Developer (@nvidiaaidev)'s Twitter Profile Photo

Most “efficient attention” tricks collapse at high KV compression ratios—DMS shows you can get ~8× KV compression with ~1K training steps and still improve reasoning Pareto frontiers vs dense Qwen-R1 models.   

The key: a learned, delayed token-eviction policy trained via logit
Harry (@categorified)'s Twitter Profile Photo

I really wonder how far we can push this: if we instead let the model choose the length of time to retain a token, and eventually evict all tokens, this could be a great way to get infinite context
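A minimal sketch of that retention idea, assuming each token gets a predicted lifetime (TTL) from some learned head and is evicted once it expires. All names and mechanics here are illustrative, not DMS's actual eviction policy:

```python
# Toy sketch: each cached token carries a predicted retention time (TTL).
# Tokens are evicted once their TTL expires, so the cache stays bounded
# no matter how long the sequence grows.

class RetentionKVCache:
    def __init__(self):
        self.entries = []  # list of [token, ttl_remaining]

    def add(self, token, predicted_ttl):
        """predicted_ttl would come from a learned head in a real model."""
        self.entries.append([token, predicted_ttl])

    def step(self):
        """Advance one decode step: age every entry, evict expired ones."""
        for e in self.entries:
            e[1] -= 1
        self.entries = [e for e in self.entries if e[1] > 0]

    def visible_tokens(self):
        return [t for t, _ in self.entries]


cache = RetentionKVCache()
cache.add("the", predicted_ttl=1)    # low-information token: short life
cache.add("Paris", predicted_ttl=3)  # salient token: retained longer
cache.step()
print(cache.visible_tokens())  # ['Paris']
```

Because every token eventually expires, memory use is bounded by the sum of live TTLs rather than by context length, which is what makes the "infinite context" framing plausible.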

Baseten (@basetenco)'s Twitter Profile Photo

We boosted acceptance rate by up to 40% with the Baseten Speculation Engine.

How? By combining Multi-Token Prediction (MTP) with Suffix Automaton (SA) decoding.

This hybrid approach crushes production coding workloads, delivering 30%+ longer acceptance lengths on code editing
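A toy stand-in for the suffix-automaton side of this (not Baseten's implementation): propose draft tokens by matching the longest recent suffix of the output against earlier text and replaying what followed. A real suffix automaton does this matching in amortized constant time per token; this brute-force sketch is quadratic but shows the idea:

```python
# Suffix-matching draft proposal: find the longest suffix of the generated
# tokens that occurred earlier, and propose the tokens that followed that
# earlier occurrence as the speculative draft. Repetitive text (e.g. code
# edits) yields long matches, hence long accepted drafts.

def propose_draft(tokens, max_draft=4):
    n = len(tokens)
    for suffix_len in range(min(8, n - 1), 0, -1):
        suffix = tokens[n - suffix_len:]
        # scan backwards for an earlier occurrence of this suffix
        for i in range(n - suffix_len - 1, -1, -1):
            if tokens[i:i + suffix_len] == suffix:
                follow = tokens[i + suffix_len:i + suffix_len + max_draft]
                if follow:
                    return follow
    return []

history = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b",
           "def", "mul", "(", "a", ",", "b", ")"]
print(propose_draft(history))  # [':', 'return', 'a', '+']
```

The draft continues `def mul(a, b)` the same way `def add(a, b)` continued earlier, which is exactly the kind of structural repetition that makes code-editing workloads a good fit for this decoding style.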
Baseten (@basetenco)'s Twitter Profile Photo

The best OpenClaw🦞 setup is fully open-source. 

Kimi K2.5 on Baseten outperforms Opus 4.5 on agentic benchmarks at 8x lower cost.

Faster inference, same or better quality. 

Set up in 2 minutes here: baseten.co/blog/openclaw-…
Paras Stefanopoulos (@stefanopopoulos)'s Twitter Profile Photo

OpenClaw w/ Kimi K2.5 is so good... The inference speeds on Baseten are nuts! To really knock your socks off... this "X" was written by yours truly, OpenClaw + Kimi K2.5 😎

Baseten (@basetenco)'s Twitter Profile Photo

LLMs are amnesiacs. Once context fills up, they forget everything. To fight this means grappling with a core question: how do you update a neural network without breaking what it already knows?

In this piece, Charlie O'Neill and Harry Partridge argue that continual learning is
John Carmack (@id_aa_carmack)'s Twitter Profile Photo

256 Tb/s data rates over 200 km distance have been demonstrated on single mode fiber optic, which works out to 32 GB of data in flight, “stored” in the fiber, with 32 TB/s bandwidth. Neural network inference and training can have deterministic weight reference patterns, so it is
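The arithmetic checks out, assuming light propagates at roughly c/1.5 in glass: 200 km of fiber holds about 1 ms of signal in transit, and 256 Tb/s × 1 ms ≈ 32 GB in flight:

```python
# Verifying the in-flight "storage" claim for a 200 km fiber at 256 Tb/s.
# Assumes a refractive index of ~1.5, i.e. light travels at about c/1.5.

c = 299_792_458       # speed of light in vacuum, m/s
v = c / 1.5           # ~2e8 m/s propagation speed in fiber
length_m = 200_000    # 200 km
rate_bps = 256e12     # 256 Tb/s

transit_s = length_m / v                 # one-way transit time, ~1 ms
bits_in_flight = rate_bps * transit_s
print(f"{transit_s * 1e3:.2f} ms transit")
print(f"{bits_in_flight / 8 / 1e9:.1f} GB in flight")
print(f"{rate_bps / 8 / 1e12:.0f} TB/s bandwidth")
```

So the fiber behaves like a 32 GB delay-line memory read out at 32 TB/s, with a fixed ~1 ms access latency, which is why deterministic weight-streaming access patterns are the natural fit.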

Baseten (@basetenco)'s Twitter Profile Photo

Introducing Kimi K2.5 on Baseten’s Model APIs with the most performant TTFT (0.26 sec) and TPS (340) on Artificial Analysis.

Even among a landscape of incredible open source models, Kimi K2.5 stands out with its multi-modal capabilities and its ability to accommodate an
Ali Taha (@aliestaha)'s Twitter Profile Photo

we quantized the best open-source diffusion model on the market

4 bits
huge speedup
(almost) no quality loss

this is a full explanation of the trillion dollar industry's oldest trick
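The trick being alluded to, sketched in plain Python under simple assumptions (symmetric absmax quantization with a per-group scale); real 4-bit kernels additionally pack two codes per byte and fuse dequantization into the matmul:

```python
# Symmetric absmax int4 quantization: each group of weights shares one
# float scale, and each weight is stored as an integer code in [-7, 7].
# Memory drops ~4x vs fp16; error per weight is at most half a scale step.

def quantize_int4(weights, group_size=4):
    """Quantize a flat list of floats to int4 codes plus per-group scales."""
    codes, scales = [], []
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid zero scale
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize_int4(codes, scales, group_size=4):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

w = [0.12, -0.5, 0.31, 0.02, 1.4, -0.9, 0.0, 0.7]
codes, scales = quantize_int4(w)
w_hat = dequantize_int4(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(codes)
print(f"max reconstruction error: {max_err:.3f}")
```

The "(almost) no quality loss" part comes from the per-group scales: outliers only inflate the error of their own small group, not the whole tensor.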