Parallel Experiments
Stay informed. Stay authentic.

Welcome to the public part of my brain. Here I share curations and thoughts.

Created with ❤️ by @linghao.
A really good and concise deep dive into RLHF in LLM post-training, Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO):
https://yugeten.github.io/posts/2025/01/ppogrpo/
#llm
https://arxiv.org/abs/2305.18290 #llm #ai

Did a deep dive into DPO today, and was once again reminded of how much a solid math foundation matters for AI/ML research...

The original RLHF recipe uses pairwise human preference data (which of A and B is better?) to train a reward model, then trains the main model with RL, where the objective is to maximize reward / minimize negative log likelihood plus a regularization term. PPO, for example, regularizes via the KL divergence between the new and old policies, and it additionally needs a critic model to predict rewards. The whole pipeline involves multiple models, and RL is notoriously hard to get right.
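
As a concrete anchor, here is a minimal sketch (PyTorch; the function and tensor names are my own, not from the post or any library) of the KL-penalized variant of the PPO objective described above. The `advantages` tensor would come from the critic model, and the KL term keeps the updated policy close to the old one; the more common PPO-clip variant replaces the explicit KL penalty with ratio clipping.

```python
import torch

def ppo_kl_objective(advantages, logp_new, logp_old, beta=0.1):
    """KL-penalized PPO-style objective (a sketch, to be maximized).

    advantages: critic-based advantage estimates for sampled responses
    logp_new:   log-probs of those responses under the current policy
    logp_old:   log-probs under the old (data-collecting) policy
    beta:       strength of the KL penalty
    """
    # Importance ratio between the updated policy and the old policy
    ratio = torch.exp(logp_new - logp_old)
    # Per-sample Monte Carlo estimate of KL(old || new) on sampled responses
    approx_kl = logp_old - logp_new
    # Advantage-weighted surrogate minus the KL regularizer
    return (ratio * advantages - beta * approx_kl).mean()
```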

DPO's idea: observing that the RLHF objective is essentially minimizing a loss over a (latent) reward function, it uses a reparameterization and some mathematical derivation to design an objective that instead minimizes a loss over the policy directly, bypassing the intermediate reward model altogether. Gradient updates then directly raise the probability of the winner response and lower that of the loser response, which greatly simplifies the pipeline.
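
And a minimal sketch of the resulting DPO loss (PyTorch; argument names are hypothetical): it operates only on log-probabilities of the winner and loser responses under the trained policy and a frozen reference model, matching the loss in the paper linked above.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """DPO loss for a batch of (winner, loser) response pairs.

    logp_*: summed log-probs of the winner/loser responses under the
            trained policy and the frozen reference model
    beta:   controls how far the policy may drift from the reference
    """
    # Implicit reward of each response: beta * log(pi / pi_ref)
    chosen_rewards = beta * (logp_w_policy - logp_w_ref)
    rejected_rewards = beta * (logp_l_policy - logp_l_ref)
    # Logistic loss pushes winner probability up and loser probability down
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that no reward model or RL loop appears anywhere: the loss is a plain logistic regression over log-probability ratios, trainable with standard supervised fine-tuning machinery.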

Further reading:
- KTO: goes one step further; it doesn't need pairwise comparisons and can learn preferences from simple upvotes/downvotes on individual examples.
- IPO: addresses DPO's tendency to overfit.