Arindam (@halg0rithmist)'s Twitter Profile
Arindam

@halg0rithmist

Aspiring theoretician
Interested in decision making algorithms

ID: 1875940172774387713

Joined: 05-01-2025 16:19:49

45 Tweets

27 Followers

1.1K Following

Aaron Defazio (@aaron_defazio)

L1 regularization for sparse solutions - as usually taught - is actually terrible in practice! I’m always surprised how few people know this. To get good results, retrain with the sparsity pattern found from the initial L1 run, but without the regularizer. Works much better.
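
A minimal sketch of this two-stage recipe (an L1 fit to find the support, then an unregularized refit on the selected features, sometimes called a relaxed or debiased lasso), assuming scikit-learn; the dataset and the alpha value are purely illustrative and not from the tweet.

```python
# Sketch of the two-stage recipe: L1 fit to find the sparsity pattern,
# then an unregularized refit restricted to the selected features.
# Assumes scikit-learn; data and alpha are illustrative.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Stage 1: L1-regularized fit, used only to pick the support.
lasso = Lasso(alpha=0.5).fit(X, y)
support = np.flatnonzero(lasso.coef_)

# Stage 2: refit without the regularizer on the selected features,
# removing the shrinkage bias that L1 puts on the surviving coefficients.
refit = LinearRegression().fit(X[:, support], y)

coef_debiased = np.zeros(X.shape[1])
coef_debiased[support] = refit.coef_
print(f"selected {support.size} features; refit coefficients are unshrunk")
```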

Daniel Han (@danielhanchen)

My Gemma-3 analysis:
1. 1B text only, 4, 12, 27B Vision + text. 14T tokens
2. 128K context length further trained from 32K
3. Removed attn softcapping. Replaced with QK norm
4. 5 sliding + 1 global attn
5. 1024 sliding window attention
6. RL - BOND, WARM, WARP

Detailed analysis:
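
A rough sketch of the attention pattern in points 4-5 above: layers cycle through five sliding-window layers followed by one global layer, with a 1024-token window. The 5:1 ratio and the window size come from the tweet; everything else here (which layers are global, the mask construction, the attention_mask function itself) is an assumption for illustration.

```python
# Illustrative sketch of the Gemma-3 attention pattern in points 4-5:
# layers cycle through 5 local (sliding-window) layers and 1 global layer.
# Window size 1024 and the 5:1 ratio come from the tweet; the rest is assumed.
import numpy as np

WINDOW = 1024          # sliding-window span per the tweet
PATTERN = 6            # assume every 6th layer is global (5 sliding + 1 global)

def attention_mask(seq_len: int, layer_idx: int) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask: True where query i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i                        # standard causal masking
    if (layer_idx + 1) % PATTERN == 0:     # global layer: full causal attention
        return causal
    return causal & (i - j < WINDOW)       # sliding layer: only the last WINDOW tokens

mask = attention_mask(seq_len=4096, layer_idx=0)    # a sliding-window layer
print(mask[2047, 1000], mask[2047, 2000])           # False (too far back), True (in window)
```
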
Daniel Han (@danielhanchen)

Nathan Lambert, I know you all uploaded GGUFs, but we also just uploaded other GGUF formats + dynamic 4bit bitsandbytes and general 4bit BnB versions!
Dynamic 4bit BnB: huggingface.co/unsloth/OLMo-2…
4bit BnB: huggingface.co/unsloth/OLMo-2…
GGUFs: huggingface.co/unsloth/OLMo-2…
Fantastic true open source model!
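
A minimal sketch of loading one of these 4bit bitsandbytes checkpoints with transformers. The repo id below is a placeholder because the links above are truncated, and the quantization settings shown are common defaults, not necessarily what Unsloth used.

```python
# Sketch of loading a 4-bit bitsandbytes checkpoint with transformers.
# The repo id is a placeholder (the links above are truncated); the
# quantization settings are common defaults, not necessarily Unsloth's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "unsloth/OLMo-2-..."  # placeholder: substitute the actual repo from the links above

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=bnb_config,
    device_map="auto",
)

inputs = tokenizer("Open models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```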

Nathan Lambert (@natolambert)

This paper is also recommended for understanding GRPO. TL;DR: the output-length normalization in DeepSeek's GRPO means models are barely penalized for repetitive behaviors while shorter correct responses are rewarded more strongly. Same intuition as the last RL paper I posted. Writeup soon.
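
A toy numerical sketch of the effect being described, assuming the GRPO objective weights each token of response i by its advantage divided by the response length |o_i|; the advantages and lengths below are made up purely for illustration.

```python
# Toy numbers illustrating the length-normalization effect described above.
# Assume each token of response i is weighted by adv_i / |o_i|, as in GRPO.
# With a negative advantage (wrong answer), a longer response spreads the
# penalty thinner per token; with a positive advantage (correct answer),
# a shorter response concentrates the reward. Numbers are illustrative.
import numpy as np

advantages = np.array([+1.0, +1.0, -1.0, -1.0])   # correct, correct, wrong, wrong
lengths    = np.array([  20,  200,   20,  200])   # tokens in each sampled response

per_token_weight = advantages / lengths
for a, L, w in zip(advantages, lengths, per_token_weight):
    kind = "correct" if a > 0 else "wrong"
    print(f"{kind:7s} len={L:3d}  per-token weight = {w:+.4f}")

# The short correct answer gets the strongest per-token reward, while the
# long wrong answer gets the weakest per-token penalty, so inflating wrong
# (often repetitive) answers is barely discouraged.
```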

Goodfire (@goodfireai)

Today, we're announcing our $50M Series A and sharing a preview of Ember - a universal neural programming platform that gives direct, programmable access to any AI model's internal thoughts.

Sean Welleck (@wellecks)

And to finish off, Lectures 21-23:
- AI for Mathematics: youtu.be/ToY57HgQKXA
- Multimodal I (CLIP / Llava): youtu.be/5uI5WOpq8LQ
- Multimodal II (VQVAE / Chameleon): youtu.be/VismiXpCs_Y

Andrej Karpathy (@karpathy)

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it, possibly it has a name - system prompt learning? Pretraining is for knowledge. Finetuning (SL/RL) is for habitual behavior. Both of these involve a change in parameters but a lot of human

Anthropic (@anthropicai)

Our interpretability team recently released research that traced the thoughts of a large language model. Now we’re open-sourcing the method. Researchers can generate “attribution graphs” like those in our study, and explore them interactively.

Percy Liang (@percyliang)

Wrapped up Stanford CS336 (Language Models from Scratch), taught with an amazing team: Tatsunori Hashimoto, Marcel Rød, Neil Band, and Rohith Kuditipudi. Researchers are becoming detached from the technical details of how LMs work. In CS336, we try to fix that by having students build everything:

Gokul Swamy (@g_k_swamy)

It was a dream come true to teach the course I wish existed at the start of my PhD. We built up the algorithmic foundations of modern-day RL, imitation learning, and RLHF, going deeper than the usual "grab bag of tricks". All 25 lectures + 150 pages of notes are now public! 🧵

Laura Ruis (@lauraruis)

LLMs can be programmed by backprop 🔎

In our new preprint, we show they can act as fuzzy program interpreters and databases. After being ‘programmed’ with next-token prediction, they can retrieve, evaluate, and even *compose* programs at test time, without seeing I/O examples.