silbux (@silbux120824) 's Twitter Profile
silbux

@silbux120824

grokking math and ml.

ID: 1824958243929464832

calendar_today17-08-2024 23:55:46

319 Tweet

39 Takipçi

100 Takip Edilen

silbux (@silbux120824) 's Twitter Profile Photo

Key Related Works - Origins to 2018: RL on Preferences. - 2019 to 2022: RL from Human Preferences on Language Models. - 2023 to Present: ChatGPT Era.

Key Related Works
- Origins to 2018: RL on Preferences.
- 2019 to 2022: RL from Human Preferences on Language Models.
- 2023 to Present: ChatGPT Era.
Eden Chan (@onlychans1) 's Twitter Profile Photo

I respect people who prioritize with intensity and articulate clearly. Prioritizing with intensity means saying no to almost everything so the important things get done. Most people don't do this. Clear articulation means explaining messy situations so simply that everyone

Yifan Zhang (@yifan_zhang_) 's Twitter Profile Photo

Scaling KL-Regularized Policy Gradient and REINFORCE Is All You Need. Our ICLR 2026 paper, “On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning,” will be presented at Pavilion 4, Riocentro Convention and Event Center, today! Glad to see that V4 and V3.2

Scaling KL-Regularized Policy Gradient and REINFORCE Is All You Need.

Our ICLR 2026 paper, “On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning,” will be presented at Pavilion 4, Riocentro Convention and Event Center, today!

Glad to see that V4 and V3.2
silbux (@silbux120824) 's Twitter Profile Photo

Instruction Fine-tuning - Chat templates and the structure of instructions - Best practices of instruction tuning - Implementation

Instruction Fine-tuning
- Chat templates and the structure of instructions
- Best practices of instruction tuning
- Implementation
Microsoft Learn (@microsoftlearn) 's Twitter Profile Photo

On the job hunt? Know these terms: • Certification stack: Combining many credentials to show expertise • Competency framework: Set of skills and behaviors needed for success • Job matching: Aligning abilities and interests to a role • Learning roadmap: Planned path for skill

silbux (@silbux120824) 's Twitter Profile Photo

Reward Modeling - Training a Bradley-Terry Reward Model - The Default Reward Model Architecture - Implementation Example - Reward Model Variants - Outcome RMs - Process RMs - Comparing Reward Model Types (and Value Functions) - Generative Reward Modeling (a.k.a. LLM-as-a-judge)

Reward Modeling
- Training a Bradley-Terry Reward Model
- The Default Reward Model Architecture
- Implementation Example
- Reward Model Variants
- Outcome RMs
- Process RMs
- Comparing Reward Model Types (and Value Functions)
- Generative Reward Modeling (a.k.a. LLM-as-a-judge)
Michael Nicolls (@michaelnicollsx) 's Twitter Profile Photo

Stunning first-sat views from Starlink launch G10-38 on May 1, deployed from SpaceX's Falcon rocket. Watch as the Starlink sats cruise over an entire orbit, through sunrise and sunset, and slowly separate from each as they complete their post-launch deployment sequence before

Adithya S K (@adithya_s_k) 's Twitter Profile Photo

Excited to release the Ultimate guide to RL environments! Definitions of RL environments differ wildly in the LLM era, so we spent the last month building several RL environments across 6 different frameworks, domains and complexities to map out which are easiest to build with

silbux (@silbux120824) 's Twitter Profile Photo

Reinforcement Learning (i.e., Policy Gradient Algorithms) - Vanilla Policy Gradient - REINFORCE - REINFORCE Leave One Out (RLOO) - Proximal Policy Optimization (PPO) - Group Relative Policy Optimization (GRPO) - Group Sequence Policy Optimization (GSPO)

Reinforcement Learning (i.e., Policy Gradient Algorithms)
- Vanilla Policy Gradient
- REINFORCE
- REINFORCE Leave One Out (RLOO)
- Proximal Policy Optimization (PPO)
- Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO)
kache (@yacinemtb) 's Twitter Profile Photo

You get good at anything by doing it a lot. Just do things a lot It's not time spent. It's volume of doing. Make sure the time it takes you to do a single thing is as fast as possible. Your cycle time is sacred

Jessica Meir (@astro_jessica) 's Twitter Profile Photo

Toward the end of a recent nighttime timelapse from a SpaceX Dragon window, I caught this impressive lightning flash. I was astounded by the size and intensity of this monster thundercloud. The things we witness from our vantage point on the International Space Station never cease to amaze

Toward the end of a recent nighttime timelapse from a <a href="/SpaceX/">SpaceX</a> Dragon window, I caught this impressive lightning flash. I was astounded by the size and intensity of this monster thundercloud. The things we witness from our vantage point on the <a href="/Space_Station/">International Space Station</a> never cease to amaze
jason liu - vacation mode (@jxnlco) 's Twitter Profile Photo

looks like i passed 60k followers, so figured it’d be a nice time to do another intro i’m jason. i work on developer experience at openai. since joining, i’ve been working on things like: - helping developers move to codex - openai cli, agents, realtime - building codex for