We'd love our flow-based generative models to learn the optimal transport from noise to data... but they rarely do ❌.
Mini-batch Optimal Transport methods aim to fix this — but they're costly and require large batch sizes to work well. Can we approximate this behaviour?
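The core trick in mini-batch OT is to re-pair noise and data samples *within a batch* so the coupling is closer to optimal transport before training on it. A minimal 1-D sketch (my own illustration, not any specific paper's code): under squared cost, the 1-D OT coupling is just the monotone pairing, so sorting both mini-batches gives the exact in-batch OT plan.

```python
import random

def minibatch_ot_pairing_1d(noise, data):
    """In 1D, the optimal transport coupling under squared Euclidean cost
    is the monotone (sorted) pairing, so mini-batch OT reduces to sorting
    the noise batch and the data batch and zipping them together."""
    return list(zip(sorted(noise), sorted(data)))

random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]   # samples from the prior
data = [random.gauss(3.0, 1.0) for _ in range(8)]    # samples from the "data"
pairs = minibatch_ot_pairing_1d(noise, data)

# The sorted pairing never costs more than the arbitrary (index-wise) pairing.
cost_ot = sum((a - b) ** 2 for a, b in pairs)
cost_naive = sum((a - b) ** 2 for a, b in zip(noise, data))
```

In higher dimensions the in-batch plan requires solving an assignment problem (e.g. the Hungarian algorithm), which is where the cost and the large-batch requirement come from.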
gpt-oss is probably the most standard MoE transformer that ever was. A couple of details worth noting:
- Uses attention sinks (a.k.a. registers)
- Sliding window attention in every second layer
- YaRN context window extension
- RMSNorm without biases
- No QK norm, no attn. softcap
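Of the details above, RMSNorm without biases is the easiest to show concretely. A minimal sketch (my own, not the gpt-oss source): normalize by the root-mean-square of the activations, then apply a learned per-channel scale — no mean subtraction, no bias term.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the activations and
    apply a learned scale. Unlike LayerNorm there is no mean subtraction,
    and (as noted for gpt-oss) no bias term."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

out = rms_norm([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

With a unit scale, the output has mean-square ≈ 1 regardless of the input's magnitude.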
GRPO vs GSPO, or DeepSeek vs Qwen - a workflow breakdown of the main Chinese reinforcement learning algorithms
➡️ Group Relative Policy Optimization (GRPO): Learning by comparison
GRPO is tailored for reasoning-heavy tasks where relative quality matters more than absolute quality.
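The "learning by comparison" part can be sketched in a few lines: GRPO samples a group of completions per prompt and normalizes each completion's reward by the group's mean and standard deviation, so advantages come from within-group comparison instead of a learned value network (a simplified sketch of the advantage step only, not a full training loop):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as in GRPO: normalize each sampled
    completion's reward by the mean and std of its own group, so the
    policy is updated from comparisons within the group rather than
    from a separate critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = (var + eps) ** 0.5
    return [(r - mean) / std for r in rewards]

# One group of 4 completions for the same prompt, with verifiable 0/1 rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's mean get positive advantage, the rest get negative — which is exactly the "relative, not absolute" behaviour described above.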
RLVR/RLHF libraries:
• verl - ByteDance
• TRL - HuggingFace
• slime - Zhipu AI
• prime-rl - Prime Intellect
• ROLL - Alibaba
• Nemo-RL - NVIDIA
• AReaL - Ant Research
• SkyRL - UC Berkeley
• open-instruct - Allen AI
• torchtune - PyTorch
Any I am missing? Which do you use?
I think I found a based Substack on low-level GPU programming by accident.
He has some extensive articles on CUDA programming, building LLM inference engines, looking inside GPUs and much more.
even the name is cool: "From Scratch". bro.
🧠 [Primer] Model Compression • compress.aman.ai
- Model compression techniques make it possible to run powerful AI models efficiently on edge devices by reducing memory, compute, and energy/power demands without severely sacrificing accuracy.
- This primer explores
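To make the "reducing memory without severely sacrificing accuracy" point concrete, here is a minimal sketch of one classic compression technique — symmetric per-tensor int8 quantization (my own toy illustration, not code from the primer):

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: map floats into [-127, 127]
    using a single scale, a 4x memory reduction vs float32."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid div-by-zero
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float weights; error per weight <= scale / 2."""
    return [v * scale for v in q]

weights = [0.5, -1.0, 0.25]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

The round-trip error is bounded by half the quantization step, which is why accuracy often survives compression surprisingly well.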