A really good and concise deep dive into RLHF in LLM post-training, Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO)
https://yugeten.github.io/posts/2025/01/ppogrpo/
#llm
https://www.anthropic.com/research/tracing-thoughts-language-model
This LLM interpretability research from Anthropic arrives at quite a few interesting conclusions. For the TL;DR, read the blog post; if you're interested, check out the two companion papers, which have much more detail and nicely done interactive pages. #llm
https://transformer-circuits.pub/2025/attribution-graphs/biology.html
https://transformer-circuits.pub/2025/attribution-graphs/methods.html
Anthropic
Tracing the thoughts of a large language model
Anthropic's latest interpretability research: a new microscope to understand Claude's internal mechanisms
https://newsletter.pragmaticengineer.com/p/the-philosophy-of-software-design
John Ousterhout, author of A Philosophy of Software Design, is a guest on The Pragmatic Engineer. #podcast #software_design
Pragmaticengineer
The Philosophy of Software Design – with John Ousterhout
Stanford professor John Ousterhout explains why thoughtful software design matters more than ever as AI tools transform coding practices and developer workflows.
https://arxiv.org/abs/2305.18290 #llm #ai
Took a deep dive into DPO today, and was once again reminded how much a solid mathematical foundation matters for AI/ML research...
Vanilla RLHF uses pairwise human preference data (which of A or B is better) to train a reward model, then trains the main model with RL; the objective is to maximize reward / minimize negative log-likelihood, plus regularization. PPO, for example, regularizes via the KL divergence between the new and old policies, and it also needs a critic model to estimate expected reward. The whole pipeline involves several models, and RL is notoriously hard to get right.
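For reference, here is a minimal rendering of the KL-regularized RLHF objective in the notation of the DPO paper, where β is the KL penalty coefficient and π_ref is the frozen SFT/reference model:

```latex
% KL-regularized RLHF objective: maximize the learned reward r_phi while
% keeping the policy pi_theta close to the frozen reference model pi_ref.
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
\;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \right]
```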
DPO's insight is that the RLHF objective is essentially minimizing a loss over a (latent) reward function. Through reparameterization and some mathematical derivation, it rewrites this as an objective that minimizes a loss over the policy directly, bypassing the intermediate reward model altogether: the gradient update directly increases the probability of the winner response and decreases that of the loser, which greatly simplifies the pipeline.
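As a concrete sketch of what this buys you, the DPO loss can be computed from nothing but log-probabilities under the policy and the frozen reference model. Below is a minimal PyTorch rendering of the loss from the cited paper; the function and argument names are my own, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    pi_logp_w: torch.Tensor,   # summed log-probs of winner responses under the policy, shape (batch,)
    pi_logp_l: torch.Tensor,   # summed log-probs of loser responses under the policy
    ref_logp_w: torch.Tensor,  # same, under the frozen reference model
    ref_logp_l: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # beta * log(pi / ref) plays the role of the implicit reward of each response.
    margin = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # Bradley-Terry log-likelihood of the observed preference: maximizing it pushes
    # the policy up on winner responses and down on losers, relative to the reference.
    return -F.logsigmoid(margin).mean()
```

Note that each preference pair only requires forward passes through two models (the policy and the frozen reference); there is no reward model, critic, or RL loop, which is exactly the simplification described above.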
Further reading:
- KTO: goes a step further; no pairwise comparisons are needed, as preferences can be learned from upvote/downvote labels on individual examples alone.
- IPO: addresses DPO's tendency to overfit.
arXiv.org
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely...