Imagining the possible forms of machine intelligence in terms of human intelligence is a very narrow way of thinking. It reminds me of this article: https://zhuanlan.zhihu.com/p/26253133 #ai
Zhihu Column
脱碳入硅
by 鲍捷, 2017-02-25. Ask not what the machine has done for you; ask what you have done for the machine. Humans are software-defined animals. A human has thirty thousand genes, and a difference of a few hundred genes is enough to separate two species. Yet a person's life is really shaped by memes, imprinted over a lifetime (a kind of mental stamp) with…
Spent two days of driving time listening to this Latent Space episode, a ninety-minute conversation with the legendary Bret Taylor. Got a lot out of it! #podcast #ai
https://www.latent.space/p/bret
www.latent.space
The AI Architect — Bret Taylor
The legendary CEO of Sierra, Chairman of OpenAI, and creator of Google Maps/Facebook Likes on the future of Software Engineering, and building great products and teams at the break of the dawn of AGI.
Forwarded from C’s Random Collection
https://ai-2027.com "We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution." Either way, the interactions on this page are superb #ai
Ai-2027
AI 2027
A research-backed AI scenario forecast.
Truly a thought-provoking piece, from the author of τ-bench.
https://ysymyth.github.io/The-Second-Half/ #ai
ysymyth.github.io
The Second Half
tldr: We’re at AI’s halftime.
So what’s suddenly different now?
In three words: RL finally works. More precisely: RL finally generalizes. After several major detours and a culmination of milestones, we’ve landed on a working recipe to solve a wide range of RL tasks using language and reasoning.
The second half of AI — starting now — will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, “Can we train a model to solve X?”, we’re asking, “What should we be training AI to do, and how do we measure real progress?” To thrive in this second half, we’ll need a timely shift in mindset and skill set, ones perhaps closer to a product manager.
It turned out the most important part of RL might not even be the RL algorithm or environment, but the priors, which can be obtained in a way totally unrelated from RL (LLMs).
https://arxiv.org/abs/2305.18290 #llm #ai
Spent today digging into DPO, and once again I am struck by how much solid math fundamentals matter for AI/ML research…
The original RLHF recipe trains a reward model on pairwise human preference data (which of A and B is better) by minimizing a negative log-likelihood loss, then uses RL to train the main policy model to maximize that reward under a regularization term (PPO-based RLHF, for instance, penalizes the KL divergence between the new policy and the old/reference policy). The downside: RL is notoriously hard to get right, and PPO additionally needs a critic (value) model on top of the reward model, so the overall system is quite complex.
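For concreteness, these are the two objectives being described, in the notation of the DPO paper ($\pi_\theta$ is the trainable policy, $\pi_{\mathrm{ref}}$ the frozen reference model, $r_\phi$ the reward model, $\sigma$ the sigmoid):

$$\mathcal{L}_R(r_\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]$$

$$\max_{\pi_\theta}\;\; \mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta(y\mid x)\,\big\|\,\pi_{\mathrm{ref}}(y\mid x)\right]$$

The first is the Bradley–Terry negative log-likelihood used to fit the reward model on (winner, loser) pairs; the second is the KL-regularized reward maximization that PPO is then used to optimize.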
DPO's key observation is that the RLHF objective is essentially a loss minimized over a (latent) reward function. Through a reparameterization and a bit of algebra, it rewrites this as a loss minimized directly over the policy, bypassing the intermediate reward model entirely: each gradient update directly raises the policy's probability of generating the winning response and lowers that of the losing response, which simplifies the pipeline dramatically. (A minimal code sketch of the loss follows the reading list below.)
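The resulting single-stage objective, in the same notation as above, uses the implicit reward $\hat r_\theta(x, y) = \beta \log\frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}$ in place of a learned reward model (the partition-function term cancels in the pairwise difference):

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$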
Further reading:
- KTO: goes one step further; it needs no pairwise comparisons and learns preferences from per-example upvotes/downvotes alone.
- IPO: addresses DPO's tendency to overfit.
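To make the gradient behavior concrete, here is a minimal PyTorch sketch of the DPO loss (my own illustration, not a reference implementation); it assumes you have already summed the per-token log-probabilities of the chosen and rejected responses under both the trainable policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Minimal DPO loss over per-sequence log-probabilities.

    Each argument is a 1-D tensor of shape (batch,): log pi(y|x) summed
    over the response tokens, for the winning ("chosen") and losing
    ("rejected") responses, under the policy and the reference model.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # -log sigmoid(margin): pushes the policy toward the chosen response
    # and away from the rejected one, with no reward model in the loop.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random numbers standing in for real model log-probs.
batch = 4
pc = torch.randn(batch, requires_grad=True)
pr = torch.randn(batch, requires_grad=True)
rc, rr = torch.randn(batch), torch.randn(batch)
loss = dpo_loss(pc, pr, rc, rr)
loss.backward()  # gradients raise chosen log-probs, lower rejected ones
print(float(loss))
```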
arXiv.org
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely...