Shawn Tan (@tanshawn)'s Twitter Profile
Shawn Tan

@tanshawn

MIT-IBM Watson AI Lab / PhD student, Mila, UdeM.

ID: 84874624

Link: http://blog.wtf.sg · Joined: 24-10-2009 15:35:04

1.1K Tweets

1.1K Followers

451 Following

Xinyi Wang @ ICLR (@xinyiwang98)'s Twitter Profile Photo

Happy to share our new preprint for my MIT-IBM internship project! arxiv.org/abs/2504.03635 In a controlled synthetic pretraining setup, we uncover a surprising twist in scaling laws: Bigger models can hurt reasoning.

Chun Kai Ling (@chunkailing1)'s Twitter Profile Photo

Pleased to have Noam Brown give a Distinguished Lecture at the NUS Artificial Intelligence Institute (NAII)!

Register here: forms.cloud.microsoft/r/SxydPPcPE9 now!

[Reposting from earlier to correct institution name. Those who have already registered do not need to do so again!]
Mayank Mishra (@mayankmish98)'s Twitter Profile Photo

FlashAttention-3 is now available in dolomite-engine github.com/IBM/dolomite-e… thanks to Tri Dao for answering all my dumb questions 🤣

William Merrill (@lambdaviking)'s Twitter Profile Photo

Padding a transformer’s input with blank tokens (...) is a simple form of test-time compute. Can it increase the computational power of LLMs? 👀

New work with Ashish Sabharwal addresses this with *exact characterizations* of the expressive power of transformers with padding 🧵
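To make the idea concrete, here is a minimal sketch (not the paper's code) of what "padding with blank tokens" can look like at inference time: extra filler tokens are appended to the prompt so the model spends more forward-pass compute before producing its answer. The model name, the choice of filler token, and the pad count are all illustrative assumptions.

```python
# Minimal sketch of padding-as-test-time-compute, assuming a HuggingFace causal LM.
# Not the paper's implementation; "gpt2" and the eos-token filler are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Q: ... A:"
ids = tok(prompt, return_tensors="pt").input_ids

# Append n_pad "blank" tokens after the prompt. The paper studies idealized
# padding tokens; a real tokenizer may not have one, so we reuse eos here.
blank_id = tok.eos_token_id
n_pad = 16
padded = torch.cat([ids, torch.full((1, n_pad), blank_id)], dim=1)

with torch.no_grad():
    logits = model(padded).logits  # prediction is read off the final position
next_token = logits[0, -1].argmax()
print(tok.decode([int(next_token)]))
```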
Han Guo (@hanguo97)'s Twitter Profile Photo

We know Attention and its linear-time variants, such as linear attention and State Space Models. But what lies in between?

Introducing Log-Linear Attention with:

- Log-linear time training
- Log-time inference (in both time and memory)
- Hardware-efficient Triton kernels
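For context, here is a minimal sketch of the two endpoints the thread contrasts: quadratic softmax attention and the O(n)-time recurrent form of (unnormalized) linear attention. This is illustrative background only, not the log-linear method itself; shapes and names are assumptions.

```python
# The two endpoints that log-linear attention sits between (illustrative only).
import torch

def softmax_attention(q, k, v):
    # Standard attention: O(n^2) time and memory in sequence length n.
    # (Non-causal here, for brevity.)
    scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Unnormalized linear attention as a recurrence: O(n) time with a
    # constant-size d x d state carried across positions.
    n, d = q.shape
    state = torch.zeros(d, d)
    out = torch.empty(n, d)
    for t in range(n):
        state = state + torch.outer(k[t], v[t])  # accumulate key-value outer products
        out[t] = q[t] @ state
    return out

q = k = v = torch.randn(8, 4)
print(softmax_attention(q, k, v).shape, linear_attention(q, k, v).shape)
```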
William Merrill (@lambdaviking)'s Twitter Profile Photo

I'll be defending my dissertation at NYU next Monday, June 16 at 4pm ET!

I've definitely missed inviting some people who might be interested, so please email me if you'd like to attend (NYC or Zoom)
Chin-Wei Huang (@chinwei_h)'s Twitter Profile Photo

🚀 After two years of intense research, we’re thrilled to introduce Skala — a scalable DL density functional that hits chemical accuracy on atomization energies and matches hybrid-level performance on main group chemistry — all at the cost of a semi-local functional. ⚛️🔥🧪⚗️🧬

Albert Gu (@_albertgu)'s Twitter Profile Photo

Tokenization is just a special case of "chunking" - building low-level data into high-level abstractions - which is in turn fundamental to intelligence.

Our new architecture, which enables hierarchical *dynamic chunking*, is not only tokenizer-free, but simply scales better.
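As a rough illustration of the chunking idea only (not the architecture announced above), the toy sketch below scores each byte with a boundary probability and mean-pools the bytes between boundaries into higher-level vectors; the module, threshold, and dimensions are all invented for illustration, and the predictor is left untrained.

```python
# Toy illustration of dynamic chunking: group low-level units (bytes) into
# variable-length chunks via per-byte boundary scores. All names are made up.
import torch
import torch.nn as nn

class ToyChunker(nn.Module):
    def __init__(self, d_model=64, n_bytes=256):
        super().__init__()
        self.embed = nn.Embedding(n_bytes, d_model)
        self.boundary = nn.Linear(d_model, 1)  # per-byte boundary score (untrained here)

    def forward(self, byte_ids, threshold=0.5):
        x = self.embed(byte_ids)                         # (seq, d_model)
        p = torch.sigmoid(self.boundary(x)).squeeze(-1)  # boundary probabilities
        # Cut a chunk wherever the boundary probability exceeds the threshold,
        # then mean-pool each chunk into one higher-level vector.
        chunks, start = [], 0
        for t in range(len(byte_ids)):
            if p[t] > threshold or t == len(byte_ids) - 1:
                chunks.append(x[start:t + 1].mean(dim=0))
                start = t + 1
        return torch.stack(chunks)                       # (n_chunks, d_model)

chunker = ToyChunker()
bytes_in = torch.randint(0, 256, (32,))
print(chunker(bytes_in).shape)
```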
Shital Shah (@sytelus)'s Twitter Profile Photo

It's the year 2018. You walk into an Algorithms and Data Structures class. You tell students to just use whatever algorithm first comes to their mind, throw in a ton of compute, and call it scaling. "That's a bitter lesson for you all," you say, and leave the classroom.

Cedric Chin (@ejames_c)'s Twitter Profile Photo

How to resist thinking of LLMs as friends. 

Or sentient things. 

Or intelligences you have to treat like god.

This is actually very easy, lol. You need to hold a model in your head that explains what you see in front of you.
ARC Prize (@arcprize)'s Twitter Profile Photo

Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer A drop-in transformer comes within a few points without any hyperparameter optimization. See our full post: arcprize.org/blog/hrm-analy…

Finding #1: The "hierarchical" architecture had minimal performance impact when compared to a similarly sized transformer

A drop-in transformer comes within a few points without any hyperparameter optimization.

See our full post: arcprize.org/blog/hrm-analy…