Rui Yang (@ruiyang70669025)'s Twitter Profile
Rui Yang

@ruiyang70669025

PhD student @ HKUST

ID: 1597825781937291265

Link: https://yangrui2015.github.io · Joined: 30-11-2022 05:32:31

54 Tweets

145 Followers

243 Following

Rui Yang (@ruiyang70669025)'s Twitter Profile Photo

DPO is shown to be a promising reward model on the new benchmark, while fine-tuned sequence classifiers can reach similar performance at smaller model sizes.
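
For readers who want to try this, here is a minimal sketch of using a DPO-trained policy as a reward model by scoring a response with its implicit reward β·(log π(y|x) − log π_ref(y|x)). The checkpoint names and the β value are placeholders/assumptions, not the benchmark's setup, and tokenization boundary effects are ignored for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints (assumptions): a DPO-tuned policy and its reference model.
policy = AutoModelForCausalLM.from_pretrained("your-dpo-policy")
reference = AutoModelForCausalLM.from_pretrained("your-reference-model")
tokenizer = AutoTokenizer.from_pretrained("your-dpo-policy")

def sequence_logprob(model, prompt: str, response: str) -> torch.Tensor:
    """Sum of log-probabilities of the response tokens given the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum()               # keep only response positions

def dpo_reward(prompt: str, response: str, beta: float = 0.1) -> torch.Tensor:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (sequence_logprob(policy, prompt, response)
                   - sequence_logprob(reference, prompt, response))
```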

Hanze Dong (@hendrydong)'s Twitter Profile Photo

Wanna train a SOTA reward model? 🌟New Blog Alert: "Reward Modeling for RLHF" (with Wei Xiong & Rui Yang) is live this weekend! 🌐✨ We delve into the insights behind achieving groundbreaking performance on the RewardBench (by Nathan Lambert). efficient-unicorn-451.notion.site/Reward-Modelin…
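
As background for the blog, a sequence-classifier reward model is typically trained with the Bradley-Terry pairwise loss on (chosen, rejected) completion pairs. The sketch below is a generic version of that objective with assumed variable names, not the blog's code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar reward-head outputs for a batch of (chosen, rejected) completions.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen outscores rejected
```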

Rui (@rui4research)'s Twitter Profile Photo

Excited to share LISA, which enables
- 7B tuning on a 24GB GPU
- 70B tuning on 4x80GB GPUs

and obtains better performance than LoRA in ~50% less time 🚀
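
As a rough illustration of the layerwise-sampling idea behind LISA, the sketch below freezes all decoder layers and randomly unfreezes a small subset every few steps, keeping embeddings and the LM head trainable. The `model.model.layers` attribute follows common LLaMA-style Hugging Face models and is an assumption; this is not LMFlow's actual implementation.

```python
import random

def resample_active_layers(model, n_active: int = 2):
    """Freeze all decoder layers, then unfreeze a random subset of them.
    Assumes a LLaMA-style `model.model.layers` ModuleList; embeddings and
    the LM head are left trainable throughout."""
    layers = model.model.layers
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for idx in random.sample(range(len(layers)), n_active):
        for p in layers[idx].parameters():
            p.requires_grad = True

# Typical usage inside a training loop (every `interval` optimizer steps):
# if step % interval == 0:
#     resample_active_layers(model, n_active=2)
```
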
Rafael Rafailov (@rm_rafailov)'s Twitter Profile Photo

We have a new preprint out - your language model is not a reward, it’s a Q function!
1. The likelihood of the preferred answer must go down - it’s a policy divergence
2. MCTS guided decoding on language is equivalent to likelihood search on DPO
3. DPO learns credit assignment
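
A small sketch of the credit-assignment view in point 3: the per-token implicit reward β·(log π − log π_ref) can be read as a token-level credit signal. Tensor shapes and names here are assumptions for illustration, not the preprint's code.

```python
import torch

def per_token_credit(policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Token-level implicit rewards beta * (log pi - log pi_ref).

    Both inputs have shape [batch, seq_len] and hold the log-probability
    of the token actually generated at each position."""
    return beta * (policy_logprobs - ref_logprobs)

# Toy example: three generated tokens; the middle token receives most credit.
pi = torch.tensor([[-1.0, -0.2, -1.5]])
ref = torch.tensor([[-1.1, -1.0, -1.4]])
print(per_token_credit(pi, ref))  # tensor([[ 0.0100,  0.0800, -0.0100]])
```
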
fly51fly (@fly51fly)'s Twitter Profile Photo

[LG] DPO Meets PPO: Reinforced Token Optimization for RLHF
arxiv.org/abs/2404.18922
- This paper models RLHF as an MDP, offering a token-wise characterization of the LLM's generation process. It theoretically demonstrates the advantages of the token-wise MDP over the sentence-wise bandit formulation.
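
To make the token-wise MDP concrete: the state is the prompt plus the tokens generated so far, the action is the next token, and the transition simply appends it. The tiny environment below is an illustrative sketch of that formulation, not code from the paper; the reward function is a pluggable assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class TokenMDP:
    """Token-level MDP for LLM generation: s_t = prompt + y_<t, a_t = next token."""
    prompt: List[int]
    reward_fn: Callable[[List[int], int], float]  # assumption: token-wise reward
    eos_id: int
    generated: List[int] = field(default_factory=list)

    def state(self) -> List[int]:
        return self.prompt + self.generated

    def step(self, token: int) -> Tuple[List[int], float, bool]:
        reward = self.reward_fn(self.state(), token)
        self.generated.append(token)
        done = token == self.eos_id
        return self.state(), reward, done

# In the sentence-wise bandit view, reward_fn would return 0 until EOS and the
# full sequence reward at the end; the token-wise MDP allows dense rewards instead.
```
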
Haoran Xu (@ryanxhr)'s Twitter Profile Photo

I will attend #ICLR2024 next week, hoping to meet old and new friends in Vienna!🇦🇹

I will present "ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update" ✨spotlight✨

A simple modification (<20 lines of code) to DICE that makes it work!
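
The core trick, as the title suggests, is an orthogonal-gradient update: one loss term's gradient is projected onto the component orthogonal to another term's gradient before the step. The flat-vector sketch below shows that projection in generic form; it is not ODICE's actual update rule, and the two-gradient split is an assumption for illustration.

```python
import torch

def orthogonal_gradient(g_main: torch.Tensor, g_other: torch.Tensor) -> torch.Tensor:
    """Remove from g_other its component along g_main:
    g_other_perp = g_other - (<g_other, g_main> / ||g_main||^2) * g_main."""
    denom = g_main.dot(g_main).clamp_min(1e-12)
    return g_other - (g_other.dot(g_main) / denom) * g_main

# Toy check: the projected gradient is orthogonal to g_main.
g1 = torch.tensor([1.0, 0.0])
g2 = torch.tensor([1.0, 1.0])
g2_perp = orthogonal_gradient(g1, g2)
print(g2_perp, torch.dot(g2_perp, g1))  # tensor([0., 1.]) tensor(0.)
```
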
Rui Yang (@ruiyang70669025)'s Twitter Profile Photo

I will present our #ICLR2024 spotlight paper Robust IQL next week in Vienna! 

Looking forward to discussing RL and RL for LLMs!
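
Robust IQL builds on Implicit Q-Learning; as background, the sketch below shows the standard IQL expectile-regression loss for fitting the value function. This is generic IQL, not the paper's robust objective, and the variable names are assumptions.

```python
import torch

def expectile_loss(q_values: torch.Tensor, v_values: torch.Tensor,
                   tau: float = 0.7) -> torch.Tensor:
    """IQL's expectile regression: E[|tau - 1{u < 0}| * u^2] with u = Q - V.
    tau > 0.5 pushes V(s) toward the upper expectile of Q(s, a)."""
    diff = q_values - v_values
    weight = torch.where(diff > 0,
                         torch.full_like(diff, tau),
                         torch.full_like(diff, 1.0 - tau))
    return (weight * diff.pow(2)).mean()

# Toy usage with a small batch of Q(s, a) and V(s) estimates.
q = torch.tensor([1.0, 2.0, 0.5])
v = torch.tensor([0.8, 1.5, 0.9])
print(expectile_loss(q, v))
```
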
Stefano Albrecht (UoE Agents Group) (@uoe_agents)'s Twitter Profile Photo

[1/6] After recent fantastic visits to London and Edmonton, Canada, I am excited to announce that my next stop is China! 🌏 From June 10-21 I will present to universities, businesses, and the UK embassy in Beijing, Shanghai, Shenzhen, and Hong Kong. See the thread for the schedule and details⬇️

Shizhe Diao (@shizhediao)'s Twitter Profile Photo

🥰Happy to share LMFlow got accepted to the #NAACL2024 demo track! arxiv.org/abs/2306.12420 now hosts the camera-ready, covering:
- A one-stop lightweight toolkit for LLM fine-tuning
- Support for SOTA techniques like LISA
- Streamlined scientific LLM development, e.g., AstroLLaMA-Chat and MarineGPT

Seohong Park (@seohong_park)'s Twitter Profile Photo

This excellent lecture from Nan Jiang's RL theory class is really informative! mediaspace.illinois.edu/media/t/1_pb42…

It covers Bellman completeness, the "double-sampling" issue with the Bellman operator, and "virtual" stochasticity caused by a limited function class.
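
A quick numeric illustration of the double-sampling issue: with a single next-state sample per transition, the squared TD error estimates the squared Bellman error plus a variance term, so minimizing it penalizes environment stochasticity; two independent samples of the target remove the bias. The toy example below is my own construction, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# One state-action pair with a stochastic target y = r + gamma * V(s'):
# y is 0 or 2 with equal probability, so E[y] = 1. With Q = 1 the true
# Bellman error is exactly 0.
def sample_target(n):
    return rng.choice([0.0, 2.0], size=n)

q = 1.0
y1, y2 = sample_target(100_000), sample_target(100_000)

single_sample = np.mean((y1 - q) ** 2)        # biased: ~ Var(y) = 1, not 0
double_sample = np.mean((y1 - q) * (y2 - q))  # unbiased: ~ 0
print(single_sample, double_sample)
```
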
Zhihui Xie (@_zhihuixie)'s Twitter Profile Photo

Why are aligned LLMs so vulnerable to adversarial attacks? Our work attributes this vulnerability to reward misspecification during the alignment process. By exploiting this loophole, we find fundamentally misaligned prompts, leading to more effective automated red teaming. 🧵

renjie pi (@renjiepi)'s Twitter Profile Photo

🔥Introducing Image Textualization (IT), an automatic framework for generating detailed and accurate image descriptions.
We release 220K high-quality image descriptions using IT.

⭐️Paper: arxiv.org/pdf/2406.07502
⭐️Code: github.com/sterzhang/imag…
⭐️Data: huggingface.co/datasets/Sterz…