silbux (@silbux120824) Twitter Tweets • TwiCopy

silbux

@silbux120824

19 days ago

Key Related Works
- Origins to 2018: RL on Preferences.
- 2019 to 2022: RL from Human Preferences on Language Models.
- 2023 to Present: ChatGPT Era.

I respect people who prioritize with intensity and articulate clearly. Prioritizing with intensity means saying no to almost everything so the important things get done. Most people don't do this. Clear articulation means explaining messy situations so simply that everyone

silbux

@silbux120824

18 days ago

RL is indeed hard!

silbux

@silbux120824

17 days ago

Training Overview
- Problem Formulation
- Canonical Training Recipes

Yifan Zhang

@yifan_zhang_

16 days ago

Scaling KL-Regularized Policy Gradient and REINFORCE Is All You Need. Our ICLR 2026 paper, “On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning,” will be presented at Pavilion 4, Riocentro Convention and Event Center, today! Glad to see that V4 and V3.2

silbux

@silbux120824

14 days ago

Instruction Fine-tuning
- Chat templates and the structure of instructions
- Best practices of instruction tuning
- Implementation

Microsoft Learn

@microsoftlearn

12 days ago

On the job hunt? Know these terms:
• Certification stack: Combining many credentials to show expertise
• Competency framework: Set of skills and behaviors needed for success
• Job matching: Aligning abilities and interests to a role
• Learning roadmap: Planned path for skill

silbux

@silbux120824

11 days ago

Reward Modeling
- Training a Bradley-Terry Reward Model
- The Default Reward Model Architecture
- Implementation Example
- Reward Model Variants
- Outcome RMs
- Process RMs
- Comparing Reward Model Types (and Value Functions)
- Generative Reward Modeling (a.k.a. LLM-as-a-judge)
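
For readers new to the first item in that outline, here is a minimal sketch of the pairwise Bradley-Terry objective commonly used to train reward models: the loss is -log sigmoid(r_chosen - r_rejected). The function name and the use of plain floats are assumptions for illustration; the tweet itself gives no implementation, and a real reward model would compute these scores with a neural network over batches.

```python
import math


def bradley_terry_loss(chosen_reward: float, rejected_reward: float) -> float:
    """Pairwise loss -log(sigmoid(chosen_reward - rejected_reward)).

    Hypothetical helper for illustration: the two arguments stand in for the
    scalar scores a reward model assigns to the preferred and the rejected
    completion of the same prompt.
    """
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss shrinks as the preferred completion is scored further above
    # the rejected one.
    for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
        print(chosen, rejected, round(bradley_terry_loss(chosen, rejected), 4))
```

When both completions receive the same score the loss is log 2, and it grows roughly linearly once the rejected completion is scored well above the chosen one.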

𝗿𝗮𝗺𝗮𝗸𝗿𝘂𝘀𝗵𝗻𝗮— 𝗲/𝗮𝗰𝗰

@techwith_ram

7 days ago

Created using GPT image 2.0. These infographic designs look really classy.

Michael Nicolls

@michaelnicollsx

7 days ago

Stunning first-sat views from Starlink launch G10-38 on May 1, deployed from SpaceX's Falcon rocket. Watch as the Starlink sats cruise over an entire orbit, through sunrise and sunset, and slowly separate from each other as they complete their post-launch deployment sequence before

Adithya S K

@adithya_s_k

6 days ago

Excited to release the Ultimate guide to RL environments! Definitions of RL environments differ wildly in the LLM era, so we spent the last month building several RL environments across 6 different frameworks, domains and complexities to map out which are easiest to build with

silbux

@silbux120824

6 days ago

Reinforcement Learning (i.e., Policy Gradient Algorithms)
- Vanilla Policy Gradient
- REINFORCE
- REINFORCE Leave One Out (RLOO)
- Proximal Policy Optimization (PPO)
- Group Relative Policy Optimization (GRPO)
- Group Sequence Policy Optimization (GSPO)
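
Two of the methods listed above differ mainly in how they turn a group of sampled rewards into advantages, which a short sketch can make concrete. The code below assumes one prompt with several sampled completions and a scalar reward per completion. The function names are hypothetical, and the GRPO variant here uses the population standard deviation plus a small epsilon, which is one common choice rather than the only one.

```python
from statistics import mean, pstdev
from typing import List


def rloo_advantages(rewards: List[float]) -> List[float]:
    """RLOO: baseline each sample with the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style: subtract the group mean, then divide by the group std.

    Assumption: population std (pstdev) and an additive eps to avoid division
    by zero when every sample in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 1.0]  # e.g. pass/fail rewards for four samples
    print(rloo_advantages(group))
    print(grpo_advantages(group))
```

In both cases the advantages of a group sum to roughly zero, so completions are only rewarded relative to their siblings for the same prompt.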

Owen

@oswinnerlol

6 days ago

sillianź b.c Great cheat sheet. Saved this for my next RL project, perfect reference.

Owen

@oswinnerlol

6 days ago

sillianź b.c Generative RM is so underrated right now. This breakdown is perfect for anyone getting into alignment.

Sandro

@pupposandro

5 days ago

Reminder that this is the future of humanity if open source AI doesn’t win

𝗿𝗮𝗺𝗮𝗸𝗿𝘂𝘀𝗵𝗻𝗮— 𝗲/𝗮𝗰𝗰

@techwith_ram

4 days ago

CNN . RNN . LSTM

kache

@yacinemtb

3 days ago

You get good at anything by doing it a lot. Just do things a lot. It's not time spent. It's volume of doing. Make sure the time it takes you to do a single thing is as fast as possible. Your cycle time is sacred

Jessica Meir

@astro_jessica

3 days ago

Toward the end of a recent nighttime timelapse from a SpaceX Dragon window, I caught this impressive lightning flash. I was astounded by the size and intensity of this monster thundercloud. The things we witness from our vantage point on the International Space Station never cease to amaze

jason liu - vacation mode

@jxnlco

2 days ago

looks like i passed 60k followers, so figured it’d be a nice time to do another intro i’m jason. i work on developer experience at openai. since joining, i’ve been working on things like: - helping developers move to codex - openai cli, agents, realtime - building codex for
