@iscienceluvr: Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

Tanishq Mathew Abraham, Ph.D.

@iscienceluvr

ID: 441465751

https://tanishq.ai · 20-12-2011 03:45:50

16,16K Tweets

75,75K Followers

1,1K Following

a month ago

Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning

"we demonstrate that employing only two techniques, i.e., advantage normalization (group-level mean, batch-level std) and token-level loss aggregation, can unlock the learning capability of critic-free policies using …
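To make the two techniques named in the quote concrete, here is a minimal sketch in plain Python. It is not the paper's code. The function names, the `eps` constant, the use of the population standard deviation, and the layout of rewards as one list per prompt group are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


def normalize_advantages(grouped_rewards, eps=1e-6):
    """Advantage = (reward - group mean) / (batch std + eps).

    grouped_rewards: list of groups; each group holds the scalar rewards of
    the responses sampled for one prompt (assumed layout).
    The mean is taken per group, the std over every reward in the batch.
    eps and the population std are illustrative choices, not from the source.
    """
    all_rewards = [r for group in grouped_rewards for r in group]
    batch_std = pstdev(all_rewards)
    advantages = []
    for group in grouped_rewards:
        group_mean = fmean(group)
        advantages.append([(r - group_mean) / (batch_std + eps) for r in group])
    return advantages


def token_level_loss(per_token_losses):
    """Average over all tokens in the batch, so each token weighs the same."""
    total = sum(sum(seq) for seq in per_token_losses)
    n_tokens = sum(len(seq) for seq in per_token_losses)
    return total / n_tokens


def sequence_level_loss(per_token_losses):
    """Contrast case: average inside each sequence, then across sequences."""
    return fmean([fmean(seq) for seq in per_token_losses])


if __name__ == "__main__":
    rewards = [[1.0, 0.0], [1.0, 1.0]]   # two prompts, two responses each
    print(normalize_advantages(rewards))
    losses = [[1.0, 1.0, 1.0, 1.0], [4.0]]  # a long and a short response
    print(token_level_loss(losses), sequence_level_loss(losses))
```

In the second group every response gets the same reward, so its advantages are zero, but the shared batch-level std still gives a finite scale. The loss example shows how token-level aggregation lets the four-token response contribute four terms, whereas sequence-level averaging would weigh the two responses equally.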
