Arindam (@halg0rithmist)'s Twitter Profile
Arindam

@halg0rithmist

Aspiring theoretician
Interested in decision making algorithms

ID: 1875940172774387713

Joined: 05-01-2025 16:19:49

45 Tweets

27 Followers

1.1K Following

Aaron Defazio (@aaron_defazio)

L1 regularization for sparse solutions - as usually taught - is actually terrible in practice! I’m always surprised how few people know this. To get good results, retrain with the sparsity pattern found from the initial L1 run, but without the regularizer. Works much better.
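
A minimal sketch of this two-stage recipe (an L1 fit to find the support, then an unregularized refit on the selected features, sometimes called a relaxed or debiased lasso), assuming scikit-learn; the dataset and the alpha value are purely illustrative and not from the tweet.

```python
# Sketch of the two-stage recipe: L1 fit to find the sparsity pattern,
# then an unregularized refit restricted to the selected features.
# Assumes scikit-learn; data and alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Stage 1: L1-regularized fit, used only to pick the support.
lasso = Lasso(alpha=0.5).fit(X, y)
support = np.flatnonzero(lasso.coef_)

# Stage 2: refit without the regularizer on the selected features,
# removing the shrinkage bias that L1 puts on the surviving coefficients.
refit = LinearRegression().fit(X[:, support], y)

coef_debiased = np.zeros(X.shape[1])
coef_debiased[support] = refit.coef_
print(f"selected {support.size} features; refit coefficients are unshrunk")
```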

Daniel Han (@danielhanchen)

My Gemma-3 analysis:
1. 1B text only, 4, 12, 27B Vision + text. 14T tokens
2. 128K context length further trained from 32K
3. Removed attn softcapping. Replaced with QK norm
4. 5 sliding + 1 global attn
5. 1024 sliding window attention
6. RL - BOND, WARM, WARP

Detailed analysis:
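
A rough sketch of the attention pattern in points 4-5 above: layers cycle through five sliding-window layers followed by one global layer, with a 1024-token window. The 5:1 ratio and the window size come from the tweet; everything else here (which layers are global, the mask construction, the attention_mask function itself) is an assumption for illustration.

```python
# Illustrative sketch of the Gemma-3 attention pattern in points 4-5:
# layers cycle through 5 local (sliding-window) layers and 1 global layer.
# Window size 1024 and the 5:1 ratio come from the tweet; the rest is assumed.
import numpy as np

WINDOW = 1024          # sliding-window span per the tweet
PATTERN = 6            # assume every 6th layer is global (5 sliding + 1 global)

def attention_mask(seq_len: int, layer_idx: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: True where query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                        # standard causal masking
    if (layer_idx + 1) % PATTERN == 0:     # global layer: full causal attention
        return causal
    return causal & (i - j < WINDOW)       # sliding layer: only the last WINDOW tokens

mask = attention_mask(seq_len=4096, layer_idx=0)    # a sliding-window layer
print(mask[2047, 1000], mask[2047, 2000])           # False (too far back), True (in window)
```
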
Daniel Han (@danielhanchen)

Nathan Lambert, I know you all uploaded GGUFs, but we also just uploaded other GGUF formats + dynamic 4bit bitsandbytes and general 4bit BnB versions!
Dynamic 4bit BnB: huggingface.co/unsloth/OLMo-2…
4bit BnB: huggingface.co/unsloth/OLMo-2…
GGUFs: huggingface.co/unsloth/OLMo-2…
Fantastic true open source model!
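
A minimal sketch of loading one of these 4bit bitsandbytes checkpoints with transformers. The repo id below is a placeholder because the links above are truncated, and the quantization settings shown are common defaults, not necessarily what Unsloth used.

```python
# Sketch of loading a 4-bit bitsandbytes checkpoint with transformers.
# The repo id is a placeholder (the links above are truncated); the
# quantization settings are common defaults, not necessarily Unsloth's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "unsloth/OLMo-2-..."  # placeholder: substitute the actual repo from the links above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Open models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```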

Nathan Lambert (@natolambert)

This paper is also recommended for understanding GRPO. TL;DR: the output-length normalization in DeepSeek's GRPO means models are barely penalized for repetitive behaviors while shorter correct responses are rewarded more strongly. Same intuition as the last RL paper I posted. Writeup soon.
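
A toy numerical sketch of the effect being described, assuming the GRPO objective weights each token of response i by its advantage divided by the response length |o_i|; the advantages and lengths below are made up purely for illustration.

```python
# Toy numbers illustrating the length-normalization effect described above.
# Assume each token of response i is weighted by adv_i / |o_i|, as in GRPO.
# With a negative advantage (wrong answer), a longer response spreads the
# penalty thinner per token; with a positive advantage (correct answer),
# a shorter response concentrates the reward. Numbers are illustrative.
import numpy as np

advantages = np.array([+1.0, +1.0, -1.0, -1.0])   # correct, correct, wrong, wrong
lengths    = np.array([  20,  200,   20,  200])   # tokens in each sampled response

per_token_weight = advantages / lengths
for a, L, w in zip(advantages, lengths, per_token_weight):
    kind = "correct" if a > 0 else "wrong"
    print(f"{kind:7s} len={L:3d}  per-token weight = {w:+.4f}")

# The short correct answer gets the strongest per-token reward, while the
# long wrong answer gets the weakest per-token penalty, so inflating wrong
# (often repetitive) answers is barely discouraged.
```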

Goodfire (@goodfireai)

Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.

Sean Welleck (@wellecks)

And to finish off, Lectures 21-23:
- AI for Mathematics: youtu.be/ToY57HgQKXA
- Multimodal I (CLIP / Llava): youtu.be/5uI5WOpq8LQ
- Multimodal II (VQVAE / Chameleon): youtu.be/VismiXpCs_Y

Andrej Karpathy (@karpathy)

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it, possibly it has a name - system prompt learning? Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human

Anthropic (@anthropicai)

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

Percy Liang (@percyliang)

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, and Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

Gokul Swamy (@g_k_swamy)

It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public! 🧵

Laura Ruis (@lauraruis)

LLMs can be programmed by backprop 🔎

In our new preprint, we show they can act as fuzzy program interpreters and databases. After being ‘programmed’ with next-token prediction, they can retrieve, evaluate, and even *compose* programs at test time, without seeing I/O examples.