Parallel Experiments

2.2K views · Linghao Zhang, edited 04:41

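I've been busy buying a home, moving, and renovating lately, so updates have been sparse. 🙏

Sharing one of the best surveys of the Transformer I've come across: https://deeprevision.github.io/posts/001-transformer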
#llm

deeprevision.github.io

AI Research Blog - The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture

A deep dive into Transformer, a neural network architecture that was introduced in the famous paper “attention is all you need” in 2017, its applications, impacts, challenges and future directions

2.1K views · Linghao Zhang, edited 07:36

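TL;DR: When building applications on top of LLMs, you most likely don't need to fine-tune the model. Supplying domain-specific context through various techniques is the more effective and cheaper approach.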

https://www.tidepool.so/2023/08/17/why-you-probably-dont-need-to-fine-tune-an-llm/

#llm
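
As a rough illustration of "supplying domain-specific context" instead of fine-tuning, here is a minimal sketch that picks the most relevant snippets and places them in the prompt. The function name `build_prompt`, the word-overlap scoring, and the prompt wording are all assumptions made for this example; a real system would more likely use embeddings or a search index.

```python
def build_prompt(question: str, documents: list[str], top_k: int = 2) -> str:
    """Rank documents by word overlap with the question and put the best ones in the prompt."""
    q_words = set(question.lower().split())
    # Hypothetical relevance score: number of words shared with the question.
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    context = "\n".join(f"- {d}" for d in ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


docs = [
    "refund policy lasts 30 days",
    "shipping takes 5 days",
    "our office is in Berlin",
]
prompt = build_prompt("how long is the refund policy", docs, top_k=1)
print(prompt)
```

The model's weights never change here; only the prompt does, which is why this route tends to be cheaper to iterate on than fine-tuning.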

2.0K views · Linghao Zhang, 06:12

👍 Impressive step-by-step visualization of how GPTs work
https://bbycroft.net/llm
#llm

bbycroft.net

LLM Visualization

A 3D animated visualization of an LLM with a walkthrough.

2.0K views · Linghao Zhang, edited 02:00

👍 https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/ #llm

O’Reilly Media

What We Learned from a Year of Building with LLMs (Part I)

2.0K views · Linghao Zhang, edited 03:30

I wish I had found this sooner! One of the best pieces of research into the fundamentals of LLM capabilities I’ve seen. This talk is a great condensed version. #llm

https://youtu.be/yBL7J0kgldU

YouTube

ICML 2024 Tutorial: Physics of Language Models

Project page (with further readings): https://physics.allen-zhu.com/

Abstract: We divide "intelligence" into multiple dimensions (like language structures, knowledge, reasoning, etc.). For each dimension, we create synthetic data for LLM pretraining to understand…

1.3K views · Linghao Zhang, edited 02:00

https://www.anthropic.com/research/building-effective-agents

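I really appreciate Anthropic's style of technical writing: grounded and free of hype. This post on agents opens by clearly defining and distinguishing workflows from agents, and recommends that 1) if a simple workflow solves the problem, don't reach for agents; and 2) there is no need to start with an agent framework such as LangChain, because the core logic is not that complicated and many wrappers hide too much detail, which gets in the way of development and debugging. I spent a few months working on agents and strongly agree with both points. The common workflow patterns summarized in the post are also very representative and explained concisely.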
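
To make the "core logic is not that complicated" point concrete, here is a minimal sketch of a fixed workflow (prompt chaining with a programmatic gate) written as plain Python with no framework. Everything in it is an assumption made for illustration: `LLM` is a hypothetical callable type, `fake_llm` is a stand-in so the sketch runs offline, and the prompts and the gate condition are invented.

```python
from typing import Callable

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def prompt_chain(llm: LLM, topic: str) -> str:
    """Workflow: a fixed sequence of LLM calls with a code-level check in between."""
    outline = llm(f"Write a three-point outline about: {topic}")
    # Gate: ordinary code, not the model, decides whether the chain continues.
    if outline.count("\n") < 2:
        raise ValueError("outline too short, stopping early")
    return llm(f"Expand this outline into a paragraph:\n{outline}")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: a real client would go here)."""
    if prompt.startswith("Write a three-point outline"):
        return "1. A\n2. B\n3. C"
    return "PARAGRAPH based on: " + prompt.splitlines()[-1]


result = prompt_chain(fake_llm, "flash attention")
print(result)
```

Because the control flow is ordinary code, every intermediate prompt and output can be logged or stepped through in a debugger, which is the debugging benefit mentioned above.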

#llm

Anthropic

Building Effective AI Agents

Discover how Anthropic approaches the development of reliable AI agents. Learn about our research on agent capabilities, safety considerations, and technical framework for building trustworthy AI.

1.7K views · Linghao Zhang, 23:53

The best explanation of Flash Attention I’ve read. #llm

https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad

Medium

ELI5: FlashAttention

Step by step explanation of how one of the most important MLSys breakthroughs work — in gory detail.

919 views · Linghao Zhang, 08:13

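The events around DeepSeek over the past few days have been fascinating. I happened to start following them when V3 was released at the end of last year and have been reading their papers on and off. Over the past month I watched most people in the West outside the industry go from indifference and skepticism to curiosity and praise, until R1's release in the last couple of days somehow wiped $600 billion off NVDA's market cap in a single day. Along the way I saw many different perspectives and a lot of human nature on display. It was quite a show.

Now that the noise has died down, a few takeaways:

1. My strongest impression from reading the V3 and R1 technical reports is how many conclusions are stated in passing that could only have been established through extensive experimentation, and that exploration largely requires a great deal of hardcore research engineering. There must be a team with extremely high talent density behind this, and that is a company's real moat when large models are almost certain to become a commodity sooner or later. As Liang Wenfeng himself said in an interview: "In the face of disruptive technology, a moat built on closed source is short-lived. Even if OpenAI stays closed source, it cannot stop others from catching up. So we let the value settle in the team: our colleagues grow through this process, accumulate a lot of know-how, and form an organization and culture capable of innovation. That is our moat."

2. Gemini's disastrous early PR is still dragging it down. We don't get a second chance at first impressions. To this day, any mention of LLMs brings up ChatGPT and Claude, perhaps Llama in an open-source context, and now of course DeepSeek. Gemini often does not even make the list of models being compared, even though recent releases such as Gemini 2.0 Flash Thinking are impressive on both performance and cost (see the attached image, source: https://x.com/swyx/status/1882933368444309723).

3. Stratechery's analysis is on point as always. If you are not a subscriber, this [DeepSeek FAQ](https://stratechery.com/2025/deepseek-faq/) is free to read and recommended; if you are, I think the criticism of OpenAI in the last few analyses is well aimed, especially regarding OpenAI's (or rather Sam's own) desire to entrench its position through regulation and the mistake of hiding o1's chain of thought.

4. Reasoning looks to have enormous potential, and practitioners should reflect on their research/product roadmaps. For users, a possibly useful tip when switching from a regular model to a reasoning model: the more your prompt reads like a paper, the better the answer. In other words, reasoning models are not necessarily good chat models; and you might be disappointed if you use them like chat models.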

Disclaimer: I work at Google and opinions are my own. #llm

1.7K views · Linghao Zhang, edited 09:21

https://jax-ml.github.io/scaling-book/
Well worth studying. Several people from the core Gemini team are on the author list 😃 Sholto, Jacob, Sharad, and others are all top-tier research engineers 🙏
#llm

jax-ml.github.io

How To Scale Your Model

Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models on TPUs: how TPUs work and how they communicate with each other, how…

1.5K views · Linghao Zhang, edited 07:23

https://huggingface.co/spaces/nanotron/ultrascale-playbook
Hugging Face released a playbook on scaling LLM training on GPUs, which should be more broadly applicable than DeepMind's TPU-focused scaling book. #llm

huggingface.co

The Ultra-Scale Playbook - a Hugging Face Space by nanotron

The ultimate guide to training LLM on large GPU Clusters

1.1K views · Linghao Zhang, 20:32

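I was preparing for ML interviews (with a focus on LLMs) a while ago and went through quite a few learning resources. Sharing some here:

CMU 11-711 Advanced NLP

An overview of language modeling.

The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture

A fairly good survey of the Transformer.

3Blue1Brown: Attention in transformers, step-by-step

The best video explaining attention, bar none.

Hugging Face: Mixture of Experts Explained

Hugging Face: RLHF

Hugging Face: Introduction to Deep Reinforcement Learning

Hugging Face: Multimodal Models

These HF resources are great for quickly filling gaps on the related topics.

Lilian Weng: Agents

Still one of the best surveys of agents.

Understanding Reasoning LLMs

Some post-training details, with a focused analysis of DeepSeek R1 and R1 Zero.

Designing Machine Learning Systems notes by @tms_ur_way

Good for quickly filling gaps on the key points of ML in practice.

Stable Diffusion Explained From Scratch

An explanation of the basic principles of diffusion.

Beyond these, content from the following people is all very good; consume it selectively by topic.

- Andrej Karpathy's YouTube videos
- Lilian Weng's blog
- Chip Huyen's blog

Most of what is recommended here is fairly introductory / high level, meant mainly for filling gaps. To dig deep into a specific topic, you still need to go on to further resources, papers, and so on. #ml #llm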

1.7K views · Linghao Zhang, edited 19:22

A really good and concise deep dive into RLHF in LLM post-training, Proximal Policy Optimization (PPO), and Group Relative Policy Optimization (GRPO)
https://yugeten.github.io/posts/2025/01/ppogrpo/
#llm

473 views · Linghao Zhang, edited 02:24

https://www.anthropic.com/research/tracing-thoughts-language-model
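This LLM interpretability research from Anthropic reaches quite a few interesting conclusions. For the TL;DR, read this blog post; if you want more, look at the two accompanying papers, which have more detail and nicely done interactive pages. #llm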

https://transformer-circuits.pub/2025/attribution-graphs/biology.html
https://transformer-circuits.pub/2025/attribution-graphs/methods.html

Anthropic

Tracing the thoughts of a large language model

Anthropic's latest interpretability research: a new microscope to understand Claude's internal mechanisms

495 views · Linghao Zhang, 21:37

https://arxiv.org/abs/2305.18290 #llm #ai

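I studied DPO in depth today, and was once again struck by how important solid math fundamentals are for AI/ML research...

The original RLHF uses pairwise human preference data (which of A and B is better) to train a reward model by minimizing a negative log-likelihood loss, then uses RL to train the main policy model to maximize that reward subject to regularization (for example, PPO-based RLHF adds a KL divergence penalty between the policy being trained and the reference policy it started from). The drawback is that RL is notoriously hard to get working, and PPO also needs a critic (value) model to estimate expected reward, which makes the whole system quite complex.

DPO's idea: observing that the RLHF objective is essentially minimizing a loss over a (latent) reward function, it applies a reparameterization and some further math to derive an objective that minimizes a loss over the policy directly. This bypasses the intermediate reward model and lets gradient updates directly raise the policy model's probability of generating the winner response and lower its probability of generating the loser response, greatly simplifying the pipeline.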
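
For reference, the per-pair loss in the DPO paper is $-\log \sigma\big(\beta[\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}]\big)$. Below is a minimal sketch of that formula for a single preference pair in plain Python. The function names, the default `beta=0.1`, and the example log-probabilities are assumptions made for illustration; real training code would compute the log-probabilities from model outputs and batch the computation in a tensor library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (winner, loser) pair.

    Each argument is the summed token log-probability of a whole response
    under the policy being trained or under the frozen reference model.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Made-up numbers: the policy starts equal to the reference, then shifts toward the winner.
loss_at_init = dpo_loss(-5.0, -7.0, -5.0, -7.0)
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
print(loss_at_init, loss_after_shift)
```

No reward model appears anywhere: the only inputs are log-probabilities from the policy and the reference, and the loss falls as the policy assigns relatively more probability to the winner than the reference does.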

Further reading:
- KTO: goes a step further. It needs no pairwise comparisons and can learn preferences from upvotes/downvotes on individual examples.
- IPO: addresses DPO's tendency to overfit.

arXiv.org

Direct Preference Optimization: Your Language Model is Secretly a...

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely...

475 views · Linghao Zhang, edited 05:31
