Parallel Experiments

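Imagining the possible forms of machine intelligence through the lens of human intelligence is a very narrow view. It reminds me of this article: https://zhuanlan.zhihu.com/p/26253133 #ai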

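Zhihu Column (知乎专栏)

脱碳入硅 (Leaving Carbon for Silicon)

by 鲍捷 2017-02-25. Don't ask what machines have done for you; ask what you have done for machines. Humans are software-defined animals. Humans have thirty thousand genes. A difference of a few hundred genes is enough to separate two species. But a person's life is actually shaped by memes, imprinted (思想钢印, a "mental seal") throughout life with…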

301 views · edited 07:22

Parallel Experiments

Over two days of driving, I finished listening to Latent Space's hour-and-a-half interview with the legendary Bret Taylor. I learned a lot! #podcast #ai
https://www.latent.space/p/bret

Latent

The AI Architect — Bret Taylor

The legendary CEO of Sierra, Chairman of OpenAI, and creator of Google Maps/Facebook Likes on the future of Software Engineering, and building great products and teams at the break of the dawn of AGI.

1.76K views · Linghao Zhang, 22:05

Parallel Experiments

Forwarded from C’s Random Collection

https://ai-2027.com “We predict that the impact of superhuman AI over the next decade will be enormous, exceeding that of the Industrial Revolution.” Either way, the interaction design of this page is great. #ai

Ai-2027

AI 2027

A research-backed AI scenario forecast.

864 views · Linghao Zhang, 06:41

Parallel Experiments

ysymyth.github.io

The Second Half

tldr: We’re at AI’s halftime.

Truly a thought-provoking piece, from the author of τ-bench.
https://ysymyth.github.io/The-Second-Half/ #ai

So what’s suddenly different now?

In three words: RL finally works. More precisely: RL finally generalizes. After several major detours and a culmination of milestones, we’ve landed on a working recipe to solve a wide range of RL tasks using language and reasoning.

The second half of AI — starting now — will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, “Can we train a model to solve X?”, we’re asking, “What should we be training AI to do, and how do we measure real progress?” To thrive in this second half, we’ll need a timely shift in mindset and skill set, ones perhaps closer to a product manager.

It turned out the most important part of RL might not even be the RL algorithm or environment, but the priors, which can be obtained in a way totally unrelated from RL (LLMs).

799 views · Linghao Zhang, edited 05:22

Parallel Experiments

https://arxiv.org/abs/2305.18290 #llm #ai

Did a deep dive into DPO today, and was struck once again by how much a solid math foundation matters for AI/ML research...

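The original RLHF uses pairwise human preference data (which of A and B is better) to train a reward model, then uses RL to train the main policy model. The reward model is fit by minimizing a negative log likelihood over the preference pairs, and the RL objective is to maximize the learned reward plus a regularization term (typically a KL divergence penalty that keeps the policy close to a reference policy; PPO additionally limits how far each update can move from the previous policy). The downside is that RL is notoriously hard to get right, and PPO also needs a critic model to estimate expected reward, which makes the whole system quite complex.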
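As a rough illustration of the reward-model step described above: the pairwise negative log likelihood is commonly written in the Bradley-Terry form, where the probability that A beats B is the sigmoid of the difference of their scalar rewards. The sketch below assumes that form; the function names and the plain-float inputs are hypothetical and chosen only for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    # Branch on the sign so that exp() never receives a large positive argument.
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(reward_winner: float, reward_loser: float) -> float:
    """Negative log likelihood that the preferred response wins.

    Assumes the Bradley-Terry model:
    P(winner > loser) = sigmoid(reward_winner - reward_loser).
    """
    return -log_sigmoid(reward_winner - reward_loser)


if __name__ == "__main__":
    # Equal rewards mean the model is undecided, so the loss is log(2).
    print(reward_model_loss(0.0, 0.0))
    # Scoring the winner higher lowers the loss; scoring it lower raises it.
    print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
```

In a real pipeline the two rewards would come from a neural network scoring each (prompt, response) pair, and the loss would be averaged over a batch of preference pairs.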

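DPO's idea: observing that the RLHF objective is essentially minimizing a loss over a (latent) reward function, it applies a reparameterization and some further math to derive a new objective that minimizes a loss over the policy instead. This bypasses the intermediate reward model and lets the gradient update directly increase the probability that the policy model generates the winner response and decrease the probability of the loser response, which greatly simplifies the pipeline.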
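A minimal sketch of the resulting per-example loss, following the form given in the DPO paper: the implicit reward of a response is beta times the log ratio between the policy and a frozen reference model, and the loss is the same pairwise negative log likelihood applied to those implicit rewards. The function and argument names are hypothetical, and the inputs are assumed to be summed token log probabilities of each full response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_logp_winner: float,
    policy_logp_loser: float,
    ref_logp_winner: float,
    ref_logp_loser: float,
    beta: float = 0.1,  # illustrative default, not a recommendation
) -> float:
    """DPO loss for one preference pair, given sequence log probabilities."""
    # Log ratio of policy to reference for each response (the implicit reward / beta).
    winner_logratio = policy_logp_winner - ref_logp_winner
    loser_logratio = policy_logp_loser - ref_logp_loser
    # Pairwise negative log likelihood on the implicit reward margin.
    return -log_sigmoid(beta * (winner_logratio - loser_logratio))


if __name__ == "__main__":
    # Policy identical to the reference: zero margin, loss is log(2).
    print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
    # Policy shifted toward the winner and away from the loser: lower loss.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

No reward model or critic appears anywhere in the computation; the only extra requirement is a forward pass through the frozen reference model to get its log probabilities.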

Further reading:
- KTO: goes a step further. It needs no pairwise comparisons and can learn preferences from just upvotes/downvotes on individual examples.
- IPO: addresses DPO's tendency to overfit.
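To make the IPO point concrete, here is a hedged sketch of the squared-error loss as I understand it from the IPO paper: instead of pushing the log-ratio margin toward infinity, it regresses the margin toward a finite target of 1 / (2 * tau). The names below are hypothetical, and the inputs are assumed to be summed log probabilities of each response under the policy and a frozen reference model.

```python
def ipo_loss(
    policy_logp_winner: float,
    policy_logp_loser: float,
    ref_logp_winner: float,
    ref_logp_loser: float,
    tau: float = 0.1,  # illustrative value only
) -> float:
    """IPO-style loss for one preference pair (sketch, assumed form)."""
    # Same log-ratio margin that DPO feeds into a sigmoid.
    margin = (policy_logp_winner - ref_logp_winner) - (
        policy_logp_loser - ref_logp_loser
    )
    # Squared distance to a finite target, so overshooting is penalized too.
    target = 1.0 / (2.0 * tau)
    return (margin - target) ** 2


if __name__ == "__main__":
    # A margin exactly at the target gives zero loss.
    print(ipo_loss(-5.0, -10.0, -7.5, -7.5))
    # A larger margin is penalized, unlike a sigmoid loss that keeps decreasing.
    print(ipo_loss(-2.5, -12.5, -7.5, -7.5))
```

The bounded target is what limits overfitting here: once the policy separates winner and loser by the target margin, further separation increases the loss.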

arXiv.org

Direct Preference Optimization: Your Language Model is Secretly a...

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely...

2.66K views · Linghao Zhang, edited 05:31
