Tejesh Bhalla (@og_tejeshbhalla) 's Twitter Profile
Tejesh Bhalla

@og_tejeshbhalla

@theagentic

ID: 1188159594318594048

Joined: 26-10-2019 18:25:04

972 Tweets

51 Followers

315 Following

kalomaze (@kalomaze) 's Twitter Profile Photo

teaching people quickly & in an information dense way is hard. i take for granted how much implicit knowledge i've learned. when i started i didn't know how to use tmux, how to calculate effective batch size, or that gradient accumulation should be loss-equivalent to a higher native batch size, or...
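
For reference, a minimal PyTorch-style sketch of that gradient-accumulation point (model, optimizer, and micro_batches are placeholder names): dividing each micro-batch loss by the number of accumulation steps makes the summed gradients match the mean-loss gradient of one larger native batch.

```python
import torch
import torch.nn.functional as F

def accumulated_step(model, optimizer, micro_batches):
    """Gradient accumulation that is loss-equivalent to a larger native batch.

    With equal-sized micro-batches, summing (loss / num_steps).backward()
    gives the same gradient as averaging the loss over one big batch.
    """
    num_steps = len(micro_batches)           # effective batch = micro_bs * num_steps
    optimizer.zero_grad()
    for x, y in micro_batches:
        loss = F.cross_entropy(model(x), y)  # mean loss over this micro-batch
        (loss / num_steps).backward()        # scale so gradients average, not sum
    optimizer.step()
```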

Siyan Zhao (@siyan_zhao) 's Twitter Profile Photo

Thanks AK for sharing our work! 

Unlike autoregressive LLMs, diffusion LLMs can be conditioned on future reasoning hints during generation through inpainting 🧩, enabling guided exploration toward correct solutions. 

We show that applying inpainting-guided exploration in RL
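
A rough sketch of the inpainting idea for a masked-diffusion LM (denoise_step and mask_id are hypothetical stand-ins, not the paper's code): the reasoning-hint tokens are planted at fixed positions and re-clamped after every denoising step, so the rest of the sequence is generated conditioned on them.

```python
import torch

def inpaint_generate(denoise_step, seq_len, hint_tokens, hint_positions,
                     mask_id, num_steps):
    """Condition a masked-diffusion LM on future hint tokens via inpainting.

    denoise_step: hypothetical callable mapping a partially masked sequence
    to an updated sequence (the model fills in masked slots each call).
    """
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    x[0, hint_positions] = hint_tokens       # plant the reasoning hints up front
    for _ in range(num_steps):
        x = denoise_step(x)                  # model proposes tokens for masked slots
        x[0, hint_positions] = hint_tokens   # re-clamp hints so they are never overwritten
    return x
```
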
Matt Beton (@mattbeton) 's Twitter Profile Photo

Finetune DeepSeek 🐳 with two Mac Studios + MLX 🚀

We use pipeline parallelism to split the full 671GB model across two devices connected by a single TB5 cable.

LoRA reduces the number of parameters to train from 671 billion down to 37 million, reducing the memory overhead from
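
A generic PyTorch sketch of why LoRA shrinks the trainable-parameter count (plain torch here, not the MLX code from the tweet): the full weight stays frozen and only a rank-r pair of matrices is trained, so a layer goes from in·out trainable parameters to r·(in + out).

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + scale * B (A x), with W frozen and only A, B trainable."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)                     # freeze full weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable vs 16,777,216 frozen for this one layer
```
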
Wenhao Yu (@wyu_nd) 's Twitter Profile Photo

RL often causes 𝐞𝐧𝐭𝐫𝐨𝐩𝐲 𝐜𝐨𝐥𝐥𝐚𝐩𝐬𝐞: generations become shorter, less diverse, and brittle.

A simple fix is a 𝐝𝐢𝐯𝐞𝐫𝐬𝐢𝐭𝐲 reward to boost exploration.

I use it in many of my projects — surprisingly effective!

Details in our NEW paper: arxiv.org/abs/2509.15194
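
An illustrative sketch of one way to bolt a diversity bonus onto GRPO-style group rewards (the Jaccard-overlap measure and the weight are assumptions for illustration, not the paper's exact formulation): completions that differ more from their sampled siblings get a slightly higher reward, which props up entropy.

```python
def add_diversity_bonus(rewards, completions, weight=0.1):
    """Boost the reward of completions that are dissimilar to their group.

    rewards: list of floats, one per sampled completion for the same prompt.
    completions: the corresponding decoded strings.
    """
    token_sets = [set(c.split()) for c in completions]
    adjusted = []
    for i, (r, s) in enumerate(zip(rewards, token_sets)):
        sims = [len(s & t) / max(1, len(s | t))          # Jaccard similarity to each sibling
                for j, t in enumerate(token_sets) if j != i]
        mean_sim = sum(sims) / max(1, len(sims))
        adjusted.append(r + weight * (1.0 - mean_sim))   # more distinct -> bigger bonus
    return adjusted
```
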
Aran Komatsuzaki (@arankomatsuzaki) 's Twitter Profile Photo

FlowRL: Reward Distribution Matching for LLM RL

• Shifts from reward maximization → distribution matching
• +10.0% vs GRPO, +5.1% vs PPO on math; strong gains on code
• Minimizes reverse KL to cover all valid reasoning paths (avoids mode collapse)
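
A rough schematic of what "reward maximization → distribution matching" means (notation mine, not taken from the paper): rather than pushing all probability mass onto the single highest-reward mode, the policy is fit to the reward-tilted target distribution.

```latex
% Schematic reward-distribution-matching objective (illustrative notation):
% the policy is matched to the reward-tilted target p*(y|x) ∝ exp(beta * r(x,y))
% instead of maximizing E[r] directly.
\[
  \min_{\theta}\;
  \mathrm{KL}\!\left(
    \pi_{\theta}(y \mid x)
    \,\middle\|\,
    \frac{\exp\!\big(\beta\, r(x, y)\big)}{Z_{\beta}(x)}
  \right),
  \qquad
  Z_{\beta}(x) \;=\; \sum_{y} \exp\!\big(\beta\, r(x, y)\big).
\]
```
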
Sully (@sullyomarr) 's Twitter Profile Photo

so let me get this right:

Oracle says OpenAI committed $300B for cloud compute → Oracle stock jumps 36% (best day since 1992)

Oracle runs on Nvidia GPUs → has to buy billions in chips from Nvidia

Nvidia just announced they're investing $100B into OpenAI

OpenAI uses that

Da Yu (@dayu85201802) 's Twitter Profile Photo

✨ Internship Opportunity @ Google Research ✨ We are seeking a self-motivated student researcher to join our team at Google Research starting around January 2026. 🚀 In this role, you will contribute to research projects advancing agentic LLMs through tool use and RL, with the

Elliot Arledge (@elliotarledge) 's Twitter Profile Photo

if you're gonna stick to deep kernel work, you'll need to know cuda, triton, cute-dsl, cutlass, etc.

if you think kernels are cool but want to see them primarily in infrastructure and get to work on cool tricks like spec decoding, and model-level problems as opposed to kernel

Daniel Han (@danielhanchen) 's Twitter Profile Photo

DeepSeek V3.2 breakdown
1. Sparse attention via lightning indexer + top_k attention
2. Uses V3.1 Terminus + 1T continued pretraining tokens
3. 5 specialized models (coding, math etc) via RL then distillation for final ckpt
4. GRPO. Reward functions for length penalty, language
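
A schematic of the top-k selection in point 1 (shapes, the index_scores input, and the single-head layout are illustrative assumptions; batching and causal masking are omitted): a lightweight indexer scores every past token per query, and only the top-k keys/values take part in full attention.

```python
import torch

def topk_sparse_attention(q, k, v, index_scores, k_top=2048):
    """Attend to only the k_top highest-scoring positions per query.

    q, k, v: (T, d) single-head tensors; index_scores: (T, T) cheap
    relevance scores from a lightning-indexer-style scorer.
    """
    T, d = q.shape
    k_top = min(k_top, T)
    top_idx = index_scores.topk(k_top, dim=-1).indices        # (T, k_top) selected positions
    k_sel, v_sel = k[top_idx], v[top_idx]                     # (T, k_top, d)
    scores = (q.unsqueeze(1) * k_sel).sum(-1) / d ** 0.5      # (T, k_top) scaled dot products
    attn = scores.softmax(dim=-1)
    return (attn.unsqueeze(-1) * v_sel).sum(dim=1)            # (T, d) sparse attention output
```
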
Ahmad (@theahmadosman) 's Twitter Profile Photo

DeepSeek casually unlocked 50x attention efficiency in ~1 year

> MLA is ~5.6x faster than MHA
> DSA is 9x faster than MLA

never doubted you, you big beautiful whale

Tejesh Bhalla (@og_tejeshbhalla) 's Twitter Profile Photo

DeepSeek V3.2 sparse attention seems so shady: no LongBench or RULER evals were done, so how do I trust GPQA accuracy to validate whether sparse attention works? 2048 top-k per query token seems too low; need to eval it

George Grigorev (@iamgrigorev) 's Twitter Profile Photo

Just trained a 70M-param LLM to <20 perplexity on DCLM in 5 hours – on a single consumer GPU. All my convergence + sample-efficiency tricks are stacking beautifully. I now have a separate repo with my research on sample-efficient GPT training; link in reply.

Yulu Gan (@yule_gan) 's Twitter Profile Photo

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for
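
For intuition, a toy evolution-strategies-style update that explores in parameter space rather than action space (a generic illustration, not the framework proposed in the thread): the parameters themselves are perturbed, and the reward signal reweights those perturbations into an update direction.

```python
import numpy as np

def parameter_space_step(params, reward_fn, sigma=0.02, lr=0.01, pop=32):
    """One ES-style update: perturb parameters directly and follow rewards.

    params: flat numpy array of model parameters.
    reward_fn: evaluates a parameter vector and returns a scalar reward.
    """
    noise = np.random.randn(pop, params.size)                       # Gaussian perturbations
    rewards = np.array([reward_fn(params + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalize for stability
    grad_est = (noise * rewards[:, None]).mean(axis=0) / sigma      # score-function gradient estimate
    return params + lr * grad_est
```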

Yacine Mahdid (@yacinelearning) 's Twitter Profile Photo

if you ever wondered how diffusion models can be used for text generation (like in those blazing fast coding demos) check out Julia Turc's latest tutorial

in 24 min you'll get the main strategy in D3PM/LLaDA, their inference/training tradeoff and the math intuition behind them
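
A schematic of the LLaDA-style decoding loop such tutorials cover (the model callable, mask_id, and the confidence-based schedule are hypothetical stand-ins): start fully masked, predict every masked slot each step, and commit only the most confident tokens before the next step.

```python
import torch

@torch.no_grad()
def masked_diffusion_decode(model, seq_len, mask_id, steps=8):
    """Iteratively unmask a sequence with a masked-diffusion text model.

    model: hypothetical callable returning (1, seq_len, vocab) logits.
    """
    x = torch.full((1, seq_len), mask_id, dtype=torch.long)
    per_step = max(1, seq_len // steps)
    for _ in range(steps):
        still_masked = (x == mask_id)
        if not still_masked.any():
            break
        conf, preds = model(x).softmax(-1).max(-1)             # per-position confidence + argmax
        conf = conf.masked_fill(~still_masked, -1.0)           # only rank still-masked slots
        commit = conf.topk(min(per_step, int(still_masked.sum())), dim=-1).indices
        x[0, commit[0]] = preds[0, commit[0]]                  # commit the most confident tokens
    return x
```
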
wh (@nrehiew_) 's Twitter Profile Photo

I think we are all in agreement that it is literally impossible for a <10M model to be more useful or intelligent than Gemini 2.5 Pro or o3 mini. Therefore, any benchmark which allows for a 10M model to come out on top is useless and not worth anyone’s time.