@get_writer : New research from the Writer team: We trained LLMs to improve their reasoning by rewarding the skill of self-reflection using reinforcement learning (GRPO) on a model’s reflection tokens, not its output. The result: 👉 Improved performance on verifiable tasks (up to 34.7% on • TwiCopy

Writer

@get_writer

+ Follow

Writer is where the world’s leading enterprises orchestrate AI-powered work | Dream Big, Build Fast | Fueled by our Palmyra LLMs

ID: 329933902

linkhttp://writer.com calendar_today05-07-2011 21:01:10

1,1K Tweet

7,7K Followers

35 Following

Writer

@get_writer

2 months ago

New research from the Writer team: We trained LLMs to improve their reasoning by rewarding the skill of self-reflection using reinforcement learning (GRPO) on a model’s reflection tokens, not its output. The result: 👉 Improved performance on verifiable tasks (up to 34.7% on

thumb_up_off_alt1

chat_bubble_outline0

repeat0

shareShare