Shawn Tan (@tanshawn)'s Twitter Profile
Shawn Tan

@tanshawn

MIT-IBM Watson AI Lab / PhD student, Mila, UdeM.

ID: 84874624

Link: http://blog.wtf.sg · Joined: 24-10-2009 15:35:04

1.1K Tweets

1.1K Followers

451 Following

Xinyi Wang @ ICLR (@xinyiwang98)'s Twitter Profile Photo

Happy to share our new preprint for my MIT-IBM internship project! arxiv.org/abs/2504.03635 In a controlled synthetic pretraining setup, we uncover a surprising twist in scaling laws: Bigger models can hurt reasoning.

Chun Kai Ling (@chunkailing1)'s Twitter Profile Photo

Pleased to have Noam Brown give a Distinguished Lecture at the NUS Artificial Intelligence Institute (NAII)!

Register here: forms.cloud.microsoft/r/SxydPPcPE9 now!

[Reposting from earlier to correct institution name. Those who have already registered do not need to do so again!]
Mayank Mishra (@mayankmish98)'s Twitter Profile Photo

FlashAttention-3 is now available in dolomite-engine github.com/IBM/dolomite-e… thanks to Tri Dao for answering all my dumb questions 🤣

William Merrill (@lambdaviking)'s Twitter Profile Photo

Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀

New work with Ashish Sabharwal addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
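To make the idea concrete, here is a minimal sketch (not the paper's code) of what "padding with blank tokens" can look like at inference time: extra filler tokens are appended to the prompt so the model spends more forward-pass compute before producing its answer. The model name, the choice of filler token, and the pad count are all illustrative assumptions.

```python
# Minimal sketch of padding-as-test-time-compute, assuming a HuggingFace causal LM.
# Not the paper's implementation; "gpt2" and the eos-token filler are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: ... A:"
ids = tok(prompt, return_tensors="pt").input_ids

# Append n_pad "blank" tokens after the prompt. The paper studies idealized
# padding tokens; a real tokenizer may not have one, so we reuse eos here.
blank_id = tok.eos_token_id
n_pad = 16
padded = torch.cat([ids, torch.full((1, n_pad), blank_id)], dim=1)

with torch.no_grad():
    logits = model(padded).logits  # prediction is read off the final position
next_token = logits[0, -1].argmax()
print(tok.decode([int(next_token)]))
```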
Han Guo (@hanguo97)'s Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
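For context, here is a minimal sketch of the two endpoints the thread contrasts: quadratic softmax attention and the O(n)-time recurrent form of (unnormalized) linear attention. This is illustrative background only, not the log-linear method itself; shapes and names are assumptions.

```python
# The two endpoints that log-linear attention sits between (illustrative only).
import torch

def softmax_attention(q, k, v):
    # Standard attention: O(n^2) time and memory in sequence length n.
    # (Non-causal here, for brevity.)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Unnormalized linear attention as a recurrence: O(n) time with a
    # constant-size d x d state carried across positions.
    n, d = q.shape
    state = torch.zeros(d, d)
    out = torch.empty(n, d)
    for t in range(n):
        state = state + torch.outer(k[t], v[t])  # accumulate key-value outer products
        out[t] = q[t] @ state
    return out

q = k = v = torch.randn(8, 4)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```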
William Merrill (@lambdaviking)'s Twitter Profile Photo

I'll be defending my dissertation at NYU next Monday, June 16 at 4pm ET!

I've definitely missed inviting some people who might be interested, so please email me if you'd like to attend (NYC or Zoom)
Chin-Wei Huang (@chinwei_h)'s Twitter Profile Photo

🚀 After two years of intense research, we’re thrilled to introduce Skala — a scalable DL density functional that hits chemical accuracy on atomization energies and matches hybrid-level performance on main group chemistry — all at the cost of a semi-local functional. ⚛️🔥🧪⚗️🧬

Albert Gu (@_albertgu)'s Twitter Profile Photo

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.

Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
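As a rough illustration of the chunking idea only (not the architecture announced above), the toy sketch below scores each byte with a boundary probability and mean-pools the bytes between boundaries into higher-level vectors; the module, threshold, and dimensions are all invented for illustration, and the predictor is left untrained.

```python
# Toy illustration of dynamic chunking: group low-level units (bytes) into
# variable-length chunks via per-byte boundary scores. All names are made up.
import torch
import torch.nn as nn

class ToyChunker(nn.Module):
    def __init__(self, d_model=64, n_bytes=256):
        super().__init__()
        self.embed = nn.Embedding(n_bytes, d_model)
        self.boundary = nn.Linear(d_model, 1)  # per-byte boundary score (untrained here)

    def forward(self, byte_ids, threshold=0.5):
        x = self.embed(byte_ids)                         # (seq, d_model)
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)  # boundary probabilities
        # Cut a chunk wherever the boundary probability exceeds the threshold,
        # then mean-pool each chunk into one higher-level vector.
        chunks, start = [], 0
        for t in range(len(byte_ids)):
            if p[t] > threshold or t == len(byte_ids) - 1:
                chunks.append(x[start:t + 1].mean(dim=0))
                start = t + 1
        return torch.stack(chunks)                       # (n_chunks, d_model)

chunker = ToyChunker()
bytes_in = torch.randint(0, 256, (32,))
print(chunker(bytes_in).shape)
```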
Shital Shah (@sytelus)'s Twitter Profile Photo

It's the year 2018. You walk into an Algorithms and Data Structures class. You tell students to just use whatever algorithm first comes to their mind, throw in a ton of compute, and call it scaling. "That's a bitter lesson for you all," you say, and leave the classroom.

Cedric Chin (@ejames_c)'s Twitter Profile Photo

How to resist thinking of LLMs as friends. 

Or sentient things. 

Or intelligences you have to treat like god.

This is actually very easy, lol. You need to hold a model in your head that explains what you see in front of you.
ARC Prize (@arcprize)'s Twitter Profile Photo

Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer A drop-in transformer comes within a few points without any hyperparameter optimization. See our full post: arcprize.org/blog/hrm-analy…

Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer

A drop-in transformer comes within a few points without any hyperparameter optimization.

See our full post: arcprize.org/blog/hrm-analy…