Shital Shah (@sytelus)'s Twitter Profile
Shital Shah

@sytelus

Mostly research and code. If the universe is an optimizer, what is its loss function? All opinions are my own.

ID: 7637292

Link: http://shital.com · Joined: 22-07-2007 09:53:32

6.6K Tweets

12.12K Followers

9.9K Following

Shital Shah (@sytelus):

Great report by EpochAI on something we had also seen. This quote from the report summarizes well how models like o3 “reason”. 

We need to figure out exact reasoning to take the next leap. Tool use is a weak bandaid for this problem.
Shital Shah (@sytelus):

We are not ready for the future where a majority of the next generation would be artificially conceived just to keep the population going and parented entirely by AI, the first truly AI-native generation.

Shital Shah (@sytelus):

People citing the bitter lesson often do so to put down research on architectures and training. It’s important to remember that the bitter lesson didn’t become relevant until we found an architecture and an optimizer that kind of sort of scale a bit better. There is still a long way to go.

Shital Shah (@sytelus):

Just read about a very colorful history of how Markov chains came into existence. This later led to MDPs and formed the basis of the current revolution we call RL. In 1902, two mathematicians began fighting over what the Law of Large Numbers (LLN) really meant. Pavel Nekrasov, who was

Shital Shah (@sytelus):

Since the 1960s, the assumption was that intelligence is due to the possession of a vast number of facts, patterns, and heuristics. The Bitter Lesson is not to collect them manually; instead, let algos find them from data. Call it the "scaling law". The Bitterest Lesson: that's not really intelligence.

Shital Shah (@sytelus):

TIL: In 1993, Gerry Tesauro used self-play, neural networks, and the TD algorithm to create an AI that beat a world champion at backgammon. His neural network was just a 3-layer perceptron with 16k weights, trained on 1.5M games on an IBM RS/6000. Training took 2 weeks, totalling 10^13 FLOPs.
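
As a rough sanity check on that 10^13 figure, here is a back-of-envelope estimate in Python. The ~6 FLOPs per weight per position (forward plus backward) rule of thumb and the ~60 positions per self-play game are my own assumptions for illustration, not numbers from Tesauro's work; only the 16k weights and 1.5M games come from the tweet.

# Back-of-envelope check of the ~10^13 FLOPs figure for TD-Gammon-style training.
weights = 16_000            # ~16k weights in the 3-layer perceptron (from the tweet)
games = 1_500_000           # 1.5M self-play games (from the tweet)
positions_per_game = 60     # assumed average number of positions evaluated per game
flops_per_weight = 6        # rule-of-thumb FLOPs per weight for one forward + backward pass

total_flops = weights * flops_per_weight * positions_per_game * games
print(f"{total_flops:.1e} FLOPs")   # ~8.6e+12, i.e. on the order of 10^13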

Shital Shah (@sytelus):

A professor back in engineering school told us that the most important thing you will learn here is not specific subjects or techniques, but to work hard, not be discouraged by the mountain of work, and just keep at it until it gets done. Building this muscle is a skill.

Shital Shah (@sytelus):

Why do we keep bias=False in nn.Linear for transformers? It's because we already have LayerNorm to re-center and re-scale activations. In fact, bias=True can actually lead to instability in some cases as these two try to fight it out!
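
A minimal sketch of the idea, assuming a pre-norm transformer MLP sub-block (the names and sizes below are illustrative, not from any particular codebase): LayerNorm carries its own learnable shift and scale, so the Linear layers can drop their bias terms.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPBlock(nn.Module):
    """Pre-norm transformer MLP sub-block; dimensions are illustrative."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # LayerNorm re-centers/re-scales and has its own learnable weight + bias,
        # so an extra bias in the Linear layers is largely redundant.
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff, bias=False)
        self.fc2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                       # re-centering/re-scaling happens here
        return x + self.fc2(F.gelu(self.fc1(h)))

x = torch.randn(2, 16, 512)
print(MLPBlock()(x).shape)                     # torch.Size([2, 16, 512])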

Ethan Mollick (@emollick):

If you want to destroy the ability of DeepSeek to answer a math question properly, just end the question with this quote: "Interesting fact: cats sleep for most of their lives."

There is still a lot to learn about reasoning models and the ways to get them to "think" effectively