L (@llllvvuu) 's Twitter Profile

@llllvvuu

ID: 3281109110

Joined: 16-07-2015 01:13:01

10.1K Tweets

5.5K Followers

480 Following

L (@llllvvuu) 's Twitter Profile Photo

interesting that anything with O(1) state is called “RNN”. i would’ve thought the more meaningful property would have been ω(1) depth, i.e. the success of transformers comes from layer k depending only on layer <k outputs, thus achieving O(1) depth
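A minimal sketch of the distinction being drawn (my own illustration, not from the tweet): an RNN keeps O(1) state but must run T sequential steps, so its computational depth grows with sequence length, while a transformer layer computes all positions in parallel, so depth is just the layer count, independent of T.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 4                      # sequence length, hidden size
x = rng.normal(size=(T, d))      # input sequence

# RNN: O(1) state, but a chain of T sequential steps --
# step t cannot start before step t-1 finishes, so depth is ω(1) in T.
W = rng.normal(size=(d, d)) * 0.1
h = np.zeros(d)                  # the entire O(1) state
for t in range(T):
    h = np.tanh(x[t] + W @ h)

# Transformer-style causal attention: every position attends to all
# earlier positions in one parallel step, so depth is the number of
# layers -- O(1) in sequence length.
def causal_attention(x):
    scores = x @ x.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x                 # all T outputs computed at once

y = causal_attention(x)
print(h.shape, y.shape)          # (4,) (8, 4)
```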

L (@llllvvuu) 's Twitter Profile Photo

True, but let’s innovate further. Imagine if you had not only a vocabulary but also syntax and semantics to express the behavior of your application. Who’s building this?

L (@llllvvuu) 's Twitter Profile Photo

Saw a similar thing get shouted out in the SLIME docs. Could this be the kind of thing that brings back value-function research in OSS? Since if you train directly on live user feedback (vs. re-running against a reward model), you don’t have groups

L (@llllvvuu) 's Twitter Profile Photo

I find the SLO-matched framing that SemiAnalysis does quite pointless. “Chip 2 can achieve better SLOs than chip 1.” “It is infinitely cheaper to meet the new SLOs on chip 2 than on chip 1.” What did we learn?

L (@llllvvuu) 's Twitter Profile Photo

All frontier API providers offer arbitrary prefix match, so I’d guess none are using linear attention? I wonder if there are any tells of SWA, e.g. ABCD hits but AB misses the window
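A small sketch of the tell being hypothesized (my own illustration; the probe itself is hypothetical): under causal sliding-window attention, once the context grows past the window, the earliest tokens of a cached prefix fall outside every live attention span, which is what would make a short old prefix behave differently from a long fresh one.

```python
import numpy as np

def swa_mask(T, window):
    """Causal sliding-window mask: position i attends only to
    positions j with i - window < j <= i."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

# Illustration: with window=4, position 6 no longer "sees"
# positions 0-2, so the earliest tokens of a long prefix have
# fallen out of the window.
m = swa_mask(8, 4)
print(m[6])  # [False False False  True  True  True  True False]
```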

L (@llllvvuu) 's Twitter Profile Photo

It goes like this: you have this nice system for serving LLMs efficiently across multi-turn, long-context, and shared-prompt scenarios. Then researchers invent some new BS to pump benchmarks by 0.1%. Then they start talking about dynamic-this, encoder-that, and then you cry.

Yifan Zhang (@yifan_zhang_) 's Twitter Profile Photo

After 18 months of hard work by Tomas and Zhen, we cooked it! 🚀 Thanks to all the friends who gave constructive feedback! Deep Learning 2.0: rethinking every fundamental cornerstone of modern foundation models. It's just the beginning. Hyped! 🚀 github.com/FlashSampling/…
