Tejesh Bhalla (@og_tejeshbhalla) 's Twitter Profile
Tejesh Bhalla

@og_tejeshbhalla

@theagentic

ID: 1188159594318594048

Joined: 26-10-2019 18:25:04

972 Tweets

51 Followers

315 Following

kalomaze (@kalomaze) 's Twitter Profile Photo

teaching people quickly & in an information dense way is hard. i take for granted how much implicit knowledge i've learned. when i started i didn't know how to use tmux, how to calculate effective batch size, or that gradient accumulation should be loss-equivalent to a higher native batch size, or...
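
For reference, a minimal PyTorch-style sketch of that gradient-accumulation point (model, optimizer, and micro_batches are placeholder names): dividing each micro-batch loss by the number of accumulation steps makes the summed gradients match the mean-loss gradient of one larger native batch.

```python
import torch
import torch.nn.functional as F

def accumulated_step(model, optimizer, micro_batches):
    """Gradient accumulation that is loss-equivalent to a larger native batch.

    With equal-sized micro-batches, summing (loss / num_steps).backward()
    gives the same gradient as averaging the loss over one big batch.
    """
    num_steps = len(micro_batches)           # effective batch = micro_bs * num_steps
    optimizer.zero_grad()
    for x, y in micro_batches:
        loss = F.cross_entropy(model(x), y)  # mean loss over this micro-batch
        (loss / num_steps).backward()        # scale so gradients average, not sum
    optimizer.step()
```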

Siyan Zhao (@siyan_zhao) 's Twitter Profile Photo

Thanks AK for sharing our work! 

Unlike autoregressive LLMs, diffusion LLMs can be conditioned on future reasoning hints during generation through inpainting 🧩, enabling guided exploration toward correct solutions. 

We show that applying inpainting-guided exploration in RL
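
A rough sketch of the inpainting idea for a masked-diffusion LM (denoise_step and mask_id are hypothetical stand-ins, not the paper's code): the reasoning-hint tokens are planted at fixed positions and re-clamped after every denoising step, so the rest of the sequence is generated conditioned on them.

```python
import torch

def inpaint_generate(denoise_step, seq_len, hint_tokens, hint_positions,
                     mask_id, num_steps):
    """Condition a masked-diffusion LM on future hint tokens via inpainting.

    denoise_step: hypothetical callable mapping a partially masked sequence
    to an updated sequence (the model fills in masked slots each call).
    """
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    x[0, hint_positions] = hint_tokens       # plant the reasoning hints up front
    for _ in range(num_steps):
        x = denoise_step(x)                  # model proposes tokens for masked slots
        x[0, hint_positions] = hint_tokens   # re-clamp hints so they are never overwritten
    return x
```
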
Matt Beton (@mattbeton) 's Twitter Profile Photo

Finetune DeepSeek 🐳 with two Mac Studios + MLX 🚀

We use pipeline parallelism to split the full 671GB model across two devices connected by a single TB5 cable.

LoRA reduces the number of parameters to train from 671 billion down to 37 million, reducing the memory overhead from
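
A generic PyTorch sketch of why LoRA shrinks the trainable-parameter count (plain torch here, not the MLX code from the tweet): the full weight stays frozen and only a rank-r pair of matrices is trained, so a layer goes from in·out trainable parameters to r·(in + out).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                     # freeze full weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable vs 16,777,216 frozen for this one layer
```
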
Wenhao Yu (@wyu_nd) 's Twitter Profile Photo

RL often causes 𝐞𝐧𝐭𝐫𝐨𝐩𝐲 𝐜𝐨𝐥𝐥𝐚𝐩𝐬𝐞: generations become shorter, less diverse, and brittle.

A simple fix is a 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 reward to boost exploration.

I use it in many of my projects — surprisingly effective!

Details in our NEW paper: arxiv.org/abs/2509.15194
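
An illustrative sketch of one way to bolt a diversity bonus onto GRPO-style group rewards (the Jaccard-overlap measure and the weight are assumptions for illustration, not the paper's exact formulation): completions that differ more from their sampled siblings get a slightly higher reward, which props up entropy.

```python
def add_diversity_bonus(rewards, completions, weight=0.1):
    """Boost the reward of completions that are dissimilar to their group.

    rewards: list of floats, one per sampled completion for the same prompt.
    completions: the corresponding decoded strings.
    """
    token_sets = [set(c.split()) for c in completions]
    adjusted = []
    for i, (r, s) in enumerate(zip(rewards, token_sets)):
        sims = [len(s & t) / max(1, len(s | t))          # Jaccard similarity to each sibling
                for j, t in enumerate(token_sets) if j != i]
        mean_sim = sum(sims) / max(1, len(sims))
        adjusted.append(r + weight * (1.0 - mean_sim))   # more distinct -> bigger bonus
    return adjusted
```
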
Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

FlowRL: Reward Distribution Matching for LLM RL

• Shifts from reward maximization → distribution matching
• +10.0% vs GRPO, +5.1% vs PPO on math; strong gains on code
• Minimizes reverse KL to cover all valid reasoning paths (avoids mode collapse)
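
A rough schematic of what "reward maximization → distribution matching" means (notation mine, not taken from the paper): rather than pushing all probability mass onto the single highest-reward mode, the policy is fit to the reward-tilted target distribution.

```latex
% Schematic reward-distribution-matching objective (illustrative notation):
% the policy is matched to the reward-tilted target p*(y|x) ∝ exp(beta * r(x,y))
% instead of maximizing E[r] directly.
\[
  \min_{\theta}\;
  \mathrm{KL}\!\left(
    \pi_{\theta}(y \mid x)
    \,\middle\|\,
    \frac{\exp\!\big(\beta\, r(x, y)\big)}{Z_{\beta}(x)}
  \right),
  \qquad
  Z_{\beta}(x) \;=\; \sum_{y} \exp\!\big(\beta\, r(x, y)\big).
\]
```
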
Sully (@sullyomarr) 's Twitter Profile Photo

so let me get this right:

Oracle says OpenAI committed $300B for cloud compute → Oracle stock jumps 36% (best day since 1992)

Oracle runs on Nvidia GPUs → has to buy billions in chips from Nvidia

Nvidia just announced they're investing $100B into OpenAI

OpenAI uses that

Da Yu (@dayu85201802) 's Twitter Profile Photo

✨ Internship Opportunity @ Google Research ✨ We are seeking a self-motivated student researcher to join our team at Google Research starting around January 2026. 🚀 In this role, you will contribute to research projects advancing agentic LLMs through tool use and RL, with the

Elliot Arledge (@elliotarledge) 's Twitter Profile Photo

if you're gonna stick to deep kernel work, you'll need to know cuda, triton, cute-dsl, cutlass, etc.

if you think kernels are cool but want to see them primarily in infrastructure and get to work on cool tricks like spec decoding, and model-level problems as opposed to kernel

Daniel Han (@danielhanchen) 's Twitter Profile Photo

DeepSeek V3.2 breakdown
1. Sparse attention via lightning indexer + top_k attention
2. Uses V3.1 Terminus + 1T continued pretraining tokens
3. 5 specialized models (coding, math etc) via RL then distillation for final ckpt
4. GRPO. Reward functions for length penalty, language
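
A schematic of the top-k selection in point 1 (shapes, the index_scores input, and the single-head layout are illustrative assumptions; batching and causal masking are omitted): a lightweight indexer scores every past token per query, and only the top-k keys/values take part in full attention.

```python
import torch

def topk_sparse_attention(q, k, v, index_scores, k_top=2048):
    """Attend to only the k_top highest-scoring positions per query.

    q, k, v: (T, d) single-head tensors; index_scores: (T, T) cheap
    relevance scores from a lightning-indexer-style scorer.
    """
    T, d = q.shape
    k_top = min(k_top, T)
    top_idx = index_scores.topk(k_top, dim=-1).indices        # (T, k_top) selected positions
    k_sel, v_sel = k[top_idx], v[top_idx]                     # (T, k_top, d)
    scores = (q.unsqueeze(1) * k_sel).sum(-1) / d ** 0.5      # (T, k_top) scaled dot products
    attn = scores.softmax(dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(dim=1)            # (T, d) sparse attention output
```
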
Ahmad (@theahmadosman) 's Twitter Profile Photo

DeepSeek casually unlocked 50x attention efficiency in ~1 year

> MLA is ~5.6x faster than MHA
> DSA is 9x faster than MLA

never doubted you, you big beautiful whale

Tejesh Bhalla (@og_tejeshbhalla) 's Twitter Profile Photo

DeepSeek V3.2 sparse attention seems so shady: no LongBench or RULER evals were done, so how do I trust GPQA accuracy to validate whether sparse attention works? 2048 top-k per query token seems too low; need to eval it

George Grigorev (@iamgrigorev) 's Twitter Profile Photo

Just trained a 70M-param LLM to <20 perplexity on DCLM in 5 hours – on a single consumer GPU. All my convergence + sample-efficiency tricks are stacking beautifully. I now have a separate repo with my research on sample-efficient GPT training; link in reply.

Yulu Gan (@yule_gan) 's Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
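
For intuition, a toy evolution-strategies-style update that explores in parameter space rather than action space (a generic illustration, not the framework proposed in the thread): the parameters themselves are perturbed, and the reward signal reweights those perturbations into an update direction.

```python
import numpy as np

def parameter_space_step(params, reward_fn, sigma=0.02, lr=0.01, pop=32):
    """One ES-style update: perturb parameters directly and follow rewards.

    params: flat numpy array of model parameters.
    reward_fn: evaluates a parameter vector and returns a scalar reward.
    """
    noise = np.random.randn(pop, params.size)                       # Gaussian perturbations
    rewards = np.array([reward_fn(params + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize for stability
    grad_est = (noise * rewards[:, None]).mean(axis=0) / sigma      # score-function gradient estimate
    return params + lr * grad_est
```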

Yacine Mahdid (@yacinelearning) 's Twitter Profile Photo

if you ever wondered how diffusion models can be used for text generation (like in those blazing fast coding demos) check out Julia Turc's latest tutorial

in 24 min you'll get the main strategy in D3PM/LLaDA, their inference/training tradeoff and the math intuition behind them
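
A schematic of the LLaDA-style decoding loop such tutorials cover (the model callable, mask_id, and the confidence-based schedule are hypothetical stand-ins): start fully masked, predict every masked slot each step, and commit only the most confident tokens before the next step.

```python
import torch

@torch.no_grad()
def masked_diffusion_decode(model, seq_len, mask_id, steps=8):
    """Iteratively unmask a sequence with a masked-diffusion text model.

    model: hypothetical callable returning (1, seq_len, vocab) logits.
    """
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_step = max(1, seq_len // steps)
    for _ in range(steps):
        still_masked = (x == mask_id)
        if not still_masked.any():
            break
        conf, preds = model(x).softmax(-1).max(-1)             # per-position confidence + argmax
        conf = conf.masked_fill(~still_masked, -1.0)           # only rank still-masked slots
        commit = conf.topk(min(per_step, int(still_masked.sum())), dim=-1).indices
        x[0, commit[0]] = preds[0, commit[0]]                  # commit the most confident tokens
    return x
```
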
wh (@nrehiew_) 's Twitter Profile Photo

I think we are all in agreement that it is literally impossible for a <10M model to be more useful or intelligent than Gemini 2.5 Pro or o3 mini. Therefore, any benchmark which allows for a 10M model to come out on top is useless and not worth anyone’s time.