TianyLin (@tianylin) 's Twitter Profile
TianyLin

@tianylin

DL practitioner #Ihavecats

ID: 1266668735554711558

Link: http://nil9.net
Joined: 30-05-2020 09:52:24

321 Tweets

320 Followers

441 Following

Edward Milsom (@edward_milsom) 's Twitter Profile Photo

Our paper "Function-Space Learning Rates" is on arXiv! We give an efficient way to estimate the magnitude of changes to NN outputs caused by a particular weight update. We analyse optimiser dynamics in function space, and enable hyperparameter transfer with our scheme FLeRM! 🧵👇

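For context, the quantity being estimated can be measured directly (if inefficiently) by comparing network outputs before and after an update. A minimal sketch of that naive two-forward-pass measurement; all names here are illustrative, and this is not the paper's efficient estimator:

```python
import torch

def function_space_change(model, batch, update, eps=1e-12):
    """Directly measure how much a weight update moves the network's outputs.

    `update` maps parameter name -> delta tensor. This does two forward
    passes, which is the naive baseline the paper improves on.
    """
    with torch.no_grad():
        out_before = model(batch)
        # Apply the candidate update in place.
        for name, p in model.named_parameters():
            if name in update:
                p.add_(update[name])
        out_after = model(batch)
        # Undo the update so the model is left unchanged.
        for name, p in model.named_parameters():
            if name in update:
                p.sub_(update[name])
    # Relative change in function space, over the batch.
    return (out_after - out_before).norm() / (out_before.norm() + eps)
```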
Piotr Nawrot (@p_nawrot) 's Twitter Profile Photo

Sparse attention is one of the most promising strategies to unlock long-context processing and long generation reasoning in LLMs.

We performed the most comprehensive study on training-free sparse attention to date.

Here is what we found:
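As a rough illustration of what "training-free" means here (sparsity applied at inference to a model trained with dense attention), a toy top-k variant; the actual patterns studied in the paper differ:

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, k_keep=64):
    """Dense softmax attention, except each query only attends to its
    k_keep highest-scoring keys. Shapes: (batch, heads, seq, dim)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    k_keep = min(k_keep, scores.shape[-1])
    # Threshold = k-th largest score per query; mask everything below it.
    thresh = scores.topk(k_keep, dim=-1).values[..., -1:]
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```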
You Jiacheng (@youjiacheng) 's Twitter Profile Photo

arxiv.org/abs/2505.16932
Everyone told me even SVD won't improve loss, so I only spent a bit of effort on improving the iters.
After reading leloy!'s post last year, I had the intuition that greedy "contraction" would be a good solution, but didn't know it was optimal.
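If this is about the matrix-orthogonalization iterations used in Muon-style optimizers (my reading of the context), the contrast is between exact orthogonalization via SVD and cheaper polynomial iterations. A minimal sketch of both, with a plain Newton-Schulz cubic standing in for the paper's optimized polynomials:

```python
import torch

def orthogonalize_svd(g):
    """Exact polar factor: replace the singular values of g with 1."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return u @ vh

def orthogonalize_newton_schulz(g, steps=10):
    """Plain Newton-Schulz cubic X <- 1.5 X - 0.5 X X^T X (illustrative;
    not the optimized polynomial schedule from the paper)."""
    x = g / (g.norm() + 1e-7)  # scale so the spectral norm is <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.transpose(-2, -1) @ x
    return x
```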
Jason Lee (@jasondeanlee) 's Twitter Profile Photo

PSA: theorists should really be more careful when working with continuous-time gradient flow (GF) vs gradient descent (GD) (continuous time ≠ runtime). I have similar remarks about working with GD vs SGD (GD steps ≠ sample complexity). The worst is when you have a result on GF and interpret it as if it were for discrete-time GD.
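A toy illustration of the "continuous time ≠ runtime" point: simulating gradient flow up to time T with step size eta costs roughly T/eta gradient evaluations, so a convergence bound stated in continuous time says nothing about iteration count until the admissible step size is pinned down. Names and the toy loss below are illustrative:

```python
def grad(x):
    # Gradient of the toy loss L(x) = 0.5 * x^2.
    return x

def gradient_flow_steps(x0, T, eta):
    """Euler-discretized gradient flow dx/dt = -grad L(x) up to time T.
    Returns the final iterate and how many gradient evaluations "time T" cost."""
    x, steps = x0, round(T / eta)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x, steps

# Same continuous time T = 10, very different runtimes:
print(gradient_flow_steps(1.0, T=10.0, eta=1e-1))  # ~100 steps
print(gradient_flow_steps(1.0, T=10.0, eta=1e-4))  # ~100,000 steps
```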

You Jiacheng (@youjiacheng) 's Twitter Profile Photo

If you cite Muon, I think you should definitely cite SSD (proceedings.mlr.press/v38/carlson15.…) by Volkan Cevher et al. (sorry, I can't find the handles of the other authors) -- which proposed spectral descent.
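For readers who haven't seen spectral descent: as I understand it, the update is the steepest-descent direction under the spectral norm, which (notation mine, for a full-rank gradient) coincides with the orthogonalized gradient that Muon approximates with Newton-Schulz iterations:

```latex
\Delta W^{\star}
  \;=\; \arg\min_{\|\Delta\|_{2} \le \eta} \langle G, \Delta \rangle
  \;=\; -\eta\, U V^{\top},
\qquad G = U \Sigma V^{\top}\ \text{(SVD of the gradient)}.
```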
Songlin Yang (@songlinyang4) 's Twitter Profile Photo

Flash Linear Attention (github.com/fla-org/flash-…) will no longer maintain support for the RWKV series (existing code will remain available). Here’s why:

Zhihao Jia (@jiazhihao) 's Twitter Profile Photo

One of the best ways to reduce LLM latency is by fusing all computation and communication into a single GPU megakernel. But writing megakernels by hand is extremely hard.

🚀 Introducing Mirage Persistent Kernel (MPK), a compiler that automatically transforms LLMs into optimized megakernels.
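A very rough way to picture the persistent-megakernel idea (purely conceptual, not MPK's actual implementation or API): a fixed pool of workers is launched once and keeps pulling fine-grained tasks from a shared queue, instead of the host launching one kernel per operation:

```python
import threading, queue

def run_persistent(tasks, num_workers=4):
    """Conceptual sketch: persistent workers drain a shared task queue."""
    q = queue.Queue()
    for t in tasks:
        q.put(t)

    def worker():
        while True:
            try:
                t = q.get_nowait()
            except queue.Empty:
                return      # all work consumed; the "kernel" finally exits
            t()             # e.g. a matmul tile, an all-reduce chunk, ...

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads: th.start()
    for th in threads: th.join()

run_persistent([lambda i=i: print(f"task {i}") for i in range(8)])
```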
Tilde (@tilderesearch) 's Twitter Profile Photo

~8/8~ We release our NSA kernel for experimentation and research here: github.com/tilde-research… At Tilde, we believe interpretability is the path towards building better models. If that sounds cool, reach out!

Cohere Labs (@cohere_labs) 's Twitter Profile Photo

Join our ML Theory group next week as they welcome Tony S.F. on July 3rd for a presentation on "Training neural networks at any scale"

Thanks to Andrej Jovanović, Anier Velasco Sotomayor, and Thang Chu for organizing this session 👏

Learn more: cohere.com/events/Cohere-…
TianyLin (@tianylin) 's Twitter Profile Photo

It’s not a bug if you use it right! Some works use a similar property to accelerate model inference, e.g. SliceGPT and LaRoSa (except they use orthogonal matrices to avoid the P inverse here).
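The property being referred to: you can insert a matrix P and its inverse (or, as in the SliceGPT-style methods mentioned, an orthogonal Q and Qᵀ) between consecutive linear maps and fold them into the weights without changing the function. A tiny sketch under those assumptions, with all names illustrative:

```python
import torch

torch.manual_seed(0)
d = 8
W1, W2 = torch.randn(d, d), torch.randn(d, d)
x = torch.randn(3, d)

# Original two-layer linear computation.
y = x @ W1.T @ W2.T

# Fold an orthogonal Q between the layers: W1 -> Q W1, W2 -> W2 Q^T.
# Because Q^T Q = I, the composed function is unchanged.
Q, _ = torch.linalg.qr(torch.randn(d, d))
W1_rot, W2_rot = Q @ W1, W2 @ Q.T
y_rot = x @ W1_rot.T @ W2_rot.T

print(torch.allclose(y, y_rot, atol=1e-5))  # True: same outputs, rotated weights
```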