Shital Shah (@sytelus)'s Twitter Profile
Shital Shah

@sytelus

Mostly research and code. If the universe is an optimizer, what is its loss function? All opinions are my own.

ID: 7637292

Link: http://shital.com · Joined: 22-07-2007 09:53:32

6.6K Tweets

12.12K Followers

9.9K Following

Shital Shah (@sytelus):

Great report by EpochAI on something we had also seen. This quote from the report summarizes well how models like o3 “reason”. 

We need to figure out exact reasoning to take the next leap. Tool use is a weak bandaid for this problem.
Shital Shah (@sytelus):

We are not ready for the future where a majority of the next generation would be artificially conceived just to keep the population going and parented entirely by AI, the first truly AI-native generation.

Shital Shah (@sytelus):

People citing the bitter lesson often do so to put down research on architectures and training. It’s important to remember that the bitter lesson didn’t become relevant until we found an architecture and an optimizer that kind of sort of scale a bit better. There is still a long way to go.

Shital Shah (@sytelus):

Just read about a very colorful history of how Markov chains came into existence. This later led to MDPs and formed the basis of the current revolution we call RL. In 1902, two mathematicians began fighting over what the Law of Large Numbers (LLN) really meant. Pavel Nekrasov, who was

Shital Shah (@sytelus):

Since the 1960s, the assumption was that intelligence is due to the possession of a vast number of facts, patterns, and heuristics. The Bitter Lesson is not to collect them manually; instead, let algos find them from data. Call it the "scaling law". The Bitterest Lesson: that's not really intelligence.

Shital Shah (@sytelus):

TIL: In 1993, Gerry Tesauro used self-play, neural networks, and the TD algorithm to create an AI that beat a world champion at backgammon. His neural network was just a 3-layer perceptron with 16k weights, trained on 1.5M games on an IBM RS/6000. Training took 2 weeks, totalling 10^13 FLOPs.
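
As a rough sanity check on that 10^13 figure, here is a back-of-envelope estimate in Python. The ~6 FLOPs per weight per position (forward plus backward) rule of thumb and the ~60 positions per self-play game are my own assumptions for illustration, not numbers from Tesauro's work; only the 16k weights and 1.5M games come from the tweet.

# Back-of-envelope check of the ~10^13 FLOPs figure for TD-Gammon-style training.
weights = 16_000            # ~16k weights in the 3-layer perceptron (from the tweet)
games = 1_500_000           # 1.5M self-play games (from the tweet)
positions_per_game = 60     # assumed average number of positions evaluated per game
flops_per_weight = 6        # rule-of-thumb FLOPs per weight for one forward + backward pass

total_flops = weights * flops_per_weight * positions_per_game * games
print(f"{total_flops:.1e} FLOPs")   # ~8.6e+12, i.e. on the order of 10^13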

Shital Shah (@sytelus):

A professor back in engineering school told us that the most important thing you will learn here is not specific subjects or techniques, but to work hard, not be discouraged by the mountain of work, and just keep at it until it gets done. Building this muscle is a skill.

Shital Shah (@sytelus):

Why do we keep bias=False in nn.Linear for transformers? It's because we already have LayerNorm to re-center and re-scale activations. In fact, bias=True can actually lead to instability in some cases as these two try to fight it out!
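
A minimal sketch of the idea, assuming a pre-norm transformer MLP sub-block (the names and sizes below are illustrative, not from any particular codebase): LayerNorm carries its own learnable shift and scale, so the Linear layers can drop their bias terms.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLPBlock(nn.Module):
    """Pre-norm transformer MLP sub-block; dimensions are illustrative."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        # LayerNorm re-centers/re-scales and has its own learnable weight + bias,
        # so an extra bias in the Linear layers is largely redundant.
        self.norm = nn.LayerNorm(d_model)
        self.fc1 = nn.Linear(d_model, d_ff, bias=False)
        self.fc2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)                       # re-centering/re-scaling happens here
        return x + self.fc2(F.gelu(self.fc1(h)))

x = torch.randn(2, 16, 512)
print(MLPBlock()(x).shape)                     # torch.Size([2, 16, 512])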

Ethan Mollick (@emollick):

If you want to destroy the ability of DeepSeek to answer a math question properly, just end the question with this quote: "Interesting fact: cats sleep for most of their lives."

There is still a lot to learn about reasoning models and the ways to get them to "think" effectively