TianyLin (@tianylin) 's Twitter Profile
TianyLin

@tianylin

DL practitioner #Ihavecats

ID: 1266668735554711558

Link: http://nil9.net
Joined: 30-05-2020 09:52:24

321 Tweets

320 Followers

441 Following

Edward Milsom (@edward_milsom) 's Twitter Profile Photo

Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇

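For context, the quantity being estimated can be measured directly (if inefficiently) by comparing network outputs before and after an update. A minimal sketch of that naive two-forward-pass measurement; all names here are illustrative, and this is not the paper's efficient estimator:

```python
import torch

def function_space_change(model, batch, update, eps=1e-12):
    """Directly measure how much a weight update moves the network's outputs.

    `update` maps parameter name -> delta tensor. This does two forward
    passes, which is the naive baseline the paper improves on.
    """
    with torch.no_grad():
        out_before = model(batch)
        # Apply the candidate update in place.
        for name, p in model.named_parameters():
            if name in update:
                p.add_(update[name])
        out_after = model(batch)
        # Undo the update so the model is left unchanged.
        for name, p in model.named_parameters():
            if name in update:
                p.sub_(update[name])
    # Relative change in function space, over the batch.
    return (out_after - out_before).norm() / (out_before.norm() + eps)
```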
Piotr Nawrot (@p_nawrot) 's Twitter Profile Photo

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
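As a rough illustration of what "training-free" means here (sparsity applied at inference to a model trained with dense attention), a toy top-k variant; the actual patterns studied in the paper differ:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    """Dense softmax attention, except each query only attends to its
    k_keep highest-scoring keys. Shapes: (batch, heads, seq, dim)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    k_keep = min(k_keep, scores.shape[-1])
    # Threshold = k-th largest score per query; mask everything below it.
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```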
You Jiacheng (@youjiacheng) 's Twitter Profile Photo

arxiv.org/abs/2505.16932
Everyone told me even SVD won't improve loss, so I only spent a bit of effort on improving the iters.
After reading leloy!'s post last year, I had the intuition that greedy "contraction" would be a good solution, but didn't know it was optimal.
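If this is about the matrix-orthogonalization iterations used in Muon-style optimizers (my reading of the context), the contrast is between exact orthogonalization via SVD and cheaper polynomial iterations. A minimal sketch of both, with a plain Newton-Schulz cubic standing in for the paper's optimized polynomials:

```python
import torch

def orthogonalize_svd(g):
    """Exact polar factor: replace the singular values of g with 1."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return u @ vh

def orthogonalize_newton_schulz(g, steps=10):
    """Plain Newton-Schulz cubic X <- 1.5 X - 0.5 X X^T X (illustrative;
    not the optimized polynomial schedule from the paper)."""
    x = g / (g.norm() + 1e-7)  # scale so the spectral norm is <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.transpose(-2, -1) @ x
    return x
```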
Jason Lee (@jasondeanlee) 's Twitter Profile Photo

PSA: theorists should really be more careful when working with continuous-time gradient flow (GF) vs gradient descent (GD) (continuous time ≠ runtime). I have similar remarks about working with GD vs SGD (GD steps ≠ sample complexity). The worst is when you have a result on GF and interpret it as if it were for discrete-time GD.
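A toy illustration of the "continuous time ≠ runtime" point: simulating gradient flow up to time T with step size eta costs roughly T/eta gradient evaluations, so a convergence bound stated in continuous time says nothing about iteration count until the admissible step size is pinned down. Names and the toy loss below are illustrative:

```python
def grad(x):
    # Gradient of the toy loss L(x) = 0.5 * x^2.
    return x

def gradient_flow_steps(x0, T, eta):
    """Euler-discretized gradient flow dx/dt = -grad L(x) up to time T.
    Returns the final iterate and how many gradient evaluations "time T" cost."""
    x, steps = x0, round(T / eta)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x, steps

# Same continuous time T = 10, very different runtimes:
print(gradient_flow_steps(1.0, T=10.0, eta=1e-1))  # ~100 steps
print(gradient_flow_steps(1.0, T=10.0, eta=1e-4))  # ~100,000 steps
```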

You Jiacheng (@youjiacheng) 's Twitter Profile Photo

If you cite Muon, I think you should definitely cite SSD (proceedings.mlr.press/v38/carlson15.…) by Volkan Cevher et al. (sorry, I can't find the handles of the other authors) -- which proposed spectral descent.
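For readers who haven't seen spectral descent: as I understand it, the update is the steepest-descent direction under the spectral norm, which (notation mine, for a full-rank gradient) coincides with the orthogonalized gradient that Muon approximates with Newton-Schulz iterations:

```latex
\Delta W^{\star}
  \;=\; \arg\min_{\|\Delta\|_{2} \le \eta} \langle G, \Delta \rangle
  \;=\; -\eta\, U V^{\top},
\qquad G = U \Sigma V^{\top}\ \text{(SVD of the gradient)}.
```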
Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:

Zhihao Jia (@jiazhihao) 's Twitter Profile Photo

One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀 Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized megakernels.
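A very rough way to picture the persistent-megakernel idea (purely conceptual, not MPK's actual implementation or API): a fixed pool of workers is launched once and keeps pulling fine-grained tasks from a shared queue, instead of the host launching one kernel per operation:

```python
import threading, queue

def run_persistent(tasks, num_workers=4):
    """Conceptual sketch: persistent workers drain a shared task queue."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return      # all work consumed; the "kernel" finally exits
            t()             # e.g. a matmul tile, an all-reduce chunk, ...

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads: th.start()
    for th in threads: th.join()

run_persistent([lambda i=i: print(f"task {i}") for i in range(8)])
```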
Tilde (@tilderesearch) 's Twitter Profile Photo

~8/8~ We release our NSA kernel for experimentation and research here: github.com/tilde-research… At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!

Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Join our ML Theory group next week as they welcome Tony S.F. on July 3rd for a presentation on "Training neural networks at any scale"

Thanks to Andrej Jovanović, Anier Velasco Sotomayor, and Thang Chu for organizing this session 👏

Learn more: cohere.com/events/Cohere-…
TianyLin (@tianylin) 's Twitter Profile Photo

It’s not a bug if you use it right! Some works use a similar property to accelerate model inference, e.g. SliceGPT and LaRoSa (except they use orthogonal matrices to avoid the P inverse here).
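The property being referred to: you can insert a matrix P and its inverse (or, as in the SliceGPT-style methods mentioned, an orthogonal Q and Qᵀ) between consecutive linear maps and fold them into the weights without changing the function. A tiny sketch under those assumptions, with all names illustrative:

```python
import torch

torch.manual_seed(0)
d = 8
W1, W2 = torch.randn(d, d), torch.randn(d, d)
x = torch.randn(3, d)

# Original two-layer linear computation.
y = x @ W1.T @ W2.T

# Fold an orthogonal Q between the layers: W1 -> Q W1, W2 -> W2 Q^T.
# Because Q^T Q = I, the composed function is unchanged.
Q, _ = torch.linalg.qr(torch.randn(d, d))
W1_rot, W2_rot = Q @ W1, W2 @ Q.T
y_rot = x @ W1_rot.T @ W2_rot.T

print(torch.allclose(y, y_rot, atol=1e-5))  # True: same outputs, rotated weights
```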