Eric Jang – Building AlphaGo from scratch
Summary
Eric Jang walks through how to build AlphaGo from scratch, but with modern AI tools.

Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.

Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better – naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo’s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.

Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). Informative to all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.

Watch on YouTube. Read the transcript.

And check out the flashcards I wrote to retain the insights.

Sponsors

* Cursor’s agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor’s SDK made it easy. Check out the cards at flashcards.dwarkesh.com and get started with the SDK at cursor.com/dwarkesh
* Jane Street gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street’s tech group, and Dan Pontecorvo, who runs Jane Street’s physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. Check out the full tour at janestreet.com/dwarkesh

Timestamps

(00:00:00) – Basics of Go
(00:08:17) – Monte Carlo Tree Search
(00:32:04) – What the neural network does
(01:00:33) – Self-play
(01:25:38) – Alternative RL approaches
(01:45:47) – Why doesn't MCTS work for LLMs
(02:01:09) – Off-policy training
(02:12:02) – RL is even more information inefficient than you thought
(02:22:16) – Automated AI researchers

Get full access to Dwarkesh Podcast at www.dwarkesh.com/subscribe
Transcript
Today I'm here with Eric Jang, who was most recently vice president of AI at 1X Technologies, and before that a senior research scientist at what is now Google DeepMind Robotics. And you've been on sabbatical for the last few months. One of the things you've been doing is rebuilding and improving and hacking on AlphaGo. And so today what we're going to do is you're going to explain building AlphaGo from scratch and what it tells us about the future of AI research and development. But before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just hanging out at the beach?
Sure, yeah. I like making things, and AlphaGo and Go AI is one of those things that really got me into the field. When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016, and so forth, it was just profound to see how smart AI systems could become and the kind of computational complexity class that they could tackle with deep learning. This is a problem that had long been understood to be kind of intractable for search, and yet it was solved through deep learning.
And so that was quite mysterious to me, and I've always wanted to understand that phenomenon a little bit better. My training is mostly in deep neural nets for robotics, where the decisions made by the neural networks are a bit more intuitive. But AlphaGo is a sort of problem where the decisions are actually the result of a very, very deep search. And it's always been very mysterious to me how, like, a 10-layer network can sort of amortize the simulation of something so deep in the game tree.
So if you plot out how much compute it took to build various iterations of strong Go bots over the years, you can see that in 2020 there was an open-source project called KataGo by David Wu from Jane Street, who basically achieved a 40x reduction in the compute needed to train a really strong Go bot tabula rasa. I'm not certain if it's stronger than AlphaGo Zero or AlphaZero or MuZero, but it's very, very strong. And this is what most Go practitioners today train against when they're playing an AI.
And thanks to LLM coding, what took a whole team of research scientists at DeepMind and, you know, millions of dollars of research and compute can now be done for a few thousand dollars of rented compute. By the way, if you're listening to this on an audio platform, this is a blackboard lecture. So I highly recommend switching over to a video platform like YouTube, if you can, to look at the math and the graphs and the Go board.
Okay, I guess we should first discuss how Go works. Great. So, yeah, how does the game work? So the game of Go is a very simple one that can be implemented quickly and easily on a computer. The objective of the game is basically to put down black and white stones and try to occupy as much territory as possible. So I might start by putting down a black stone. Black always goes first. So go ahead.
And so the way you capture an opponent's stones is that for every intersection, if you can surround all four of its neighbors with your stones, then this one is sort of cut off from oxygen, if you will, and then it is a dead stone. So now I control these four stones as well as this empty intersection here. So there are slight variations between Chinese, Japanese, and what are called Tromp-Taylor rules. Tromp-Taylor rules are designed to be completely unambiguous for Go.
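As a rough illustration of the capture rule just described, the sketch below counts the empty orthogonal neighbors ("liberties") of a connected group and removes the group once it has none. It assumes a board stored as a list of lists with the encoding 0 = empty, 1 = black, 2 = white; the function names are hypothetical and are not taken from Eric's code.

```python
from typing import List, Set, Tuple

EMPTY, BLACK, WHITE = 0, 1, 2  # assumed encoding, not from the episode's code

Board = List[List[int]]
Point = Tuple[int, int]


def neighbors(board: Board, p: Point) -> List[Point]:
    """Orthogonal (not diagonal) neighbors of p that lie on the board."""
    r, c = p
    size = len(board)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < size and 0 <= j < size]


def group_and_liberties(board: Board, start: Point) -> Tuple[Set[Point], Set[Point]]:
    """Flood-fill the group of stones containing start and collect its empty neighbors.

    Assumes start holds a stone (BLACK or WHITE).
    """
    color = board[start[0]][start[1]]
    group, liberties, stack = {start}, set(), [start]
    while stack:
        p = stack.pop()
        for q in neighbors(board, p):
            value = board[q[0]][q[1]]
            if value == EMPTY:
                liberties.add(q)
            elif value == color and q not in group:
                group.add(q)
                stack.append(q)
    return group, liberties


def remove_if_captured(board: Board, start: Point) -> int:
    """Remove the group at start if it has no liberties; return how many stones were removed."""
    group, liberties = group_and_liberties(board, start)
    if liberties:
        return 0
    for r, c in group:
        board[r][c] = EMPTY
    return len(group)


board = [
    [0, 1, 0],
    [1, 2, 1],
    [0, 0, 0],
]
_, libs = group_and_liberties(board, (1, 1))  # white stone surrounded on three sides
board[2][1] = BLACK                           # black fills the last empty neighbor
captured = remove_if_captured(board, (1, 1))
```

Here the white stone starts with a single liberty, which is the "check"-like situation from the demonstration game; once black fills it, the stone comes off the board.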
So this is what all Go AIs train against and resolve against. So in typical Go, like when humans play, you're actually not allowed to put this white stone down here. It would be instant suicide. In Tromp-Taylor, it's actually fine. You put it down and then it immediately resolves to death. So the outcome is sort of the same. Let's go ahead and start over and play a few stones and then I'll explain some more.
So I'll just start there. I'm basically playing randomly here, but I'm trying to get around your stone. I'm going to go around your stones and see if I can close in on them. Sounds good. Yep. So this move basically leaves one empty neighbor for your white stone, and it's very akin to a check in chess, where if you don't respond immediately by putting one here, then I can immediately capture this.
I see, okay. Because it is sort of the diagonals that determine whether you're surrounded. The cross-section, not the diagonals. So this one is surrounded on three sides. Yep. And so you're at risk of losing that stone if you don't play one immediately there. Now you can see that I'm starting to pressure you, because by putting a stone here, now you are forced to put one here. Otherwise, you won't be able to block this.
Yes. To yourself. And then if you think through what happens if you were to respond here, you can probably search into the future and deduce what I'll do in response once you do that. You have a lot of confidence in my abilities, but I'm guessing you'd put the black here. That's right. And then I would capture all three of the stones. So I should just assume that this is gone. This little block is gone. Yes. So in Go, it's actually okay to let an opponent capture some stones if, for example, it allows you to position yourself to capture more stones somewhere else on the board. And this is what makes Go a very beautiful game: you can kind of lose the battle but win the war. Interesting. And as the board size increases, the complexity of these micro versus macro dynamics gets more interesting.
Presumably, you'd put one here. And so now I would capture this entire group. Okay. And this would be mine. Okay, there's one more case that I want to demonstrate, where I actually had a bug in my code recently, which is the following situation. So let's consider a formation like this, right? And then, you know, we have other pieces on the board in play or whatever. And so let's talk a little bit about how the game ends, right? In this territory, who controls these areas? Is it white or is it black? White. It's actually black, because I have surrounded this whole area.
Yeah. And assuming I have other black stones here, it's actually very hard for you to break this out of the control of these stones. So when the final score is tallied, would these ones also count as being in? Yeah, great question. So this is where different rule sets have different ways of scoring. And so we should talk a little bit about how you resolve scores between humans and how you resolve scores in computer code. Because there's actually some ambiguity in how humans evaluate this. So most humans would look at this board configuration and conclude that black has kind of totally surrounded white.
And so white has no chance of life. We could play out more here, but then at the end, I would capture everything. However, if you have a way of breaking this formation and connecting white to something outside of it, then it can flip, right? And so this is where it's a little bit hard for a computer to decide these kinds of things, right? So how do humans do it? It's worth thinking a little bit about how humans resolve this, because this will actually map later to how we think about the deep neural network. Humans basically say, I think the game is done.
And then you have to also say, I think the game is done. And then we'll say, I think these are my stones. And then you have to agree. If you don't agree, then we keep playing. Yeah. So essentially, once the two humans' so-called value functions agree on a consensus, then the Chinese rules resolve that. Yeah, interesting. So in Tromp-Taylor scoring, it's perfectly unambiguous. So it can be decided algorithmically by a computer. So if, let's say, you have this at the end of the game, the way you score this is that you first count how many stones you control, and that's unambiguous.
Then you count how many empty intersections are not touched by your opponent's stones. So these intersections would not count for either player, because all of these intersections are connected to both white stones and black stones, right? If this were like this, then white would get three points. Now, this is a little odd, because a human would know that white is actually losing these points. Yeah. But Tromp-Taylor scoring would consider white to have all of these points as well as these points. Got it. Okay. So that is a very big difference in how computer Go scores things and how humans score things.
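The scoring rule described above can be sketched as a flood fill over empty regions: each player scores their stones plus every empty region that touches only their color, and regions touching both colors count for nobody. This is a minimal sketch under the assumed encoding 0 = empty, 1 = black, 2 = white; `tromp_taylor_score` is a hypothetical name, and komi is left out.

```python
from typing import List, Tuple

EMPTY, BLACK, WHITE = 0, 1, 2  # assumed encoding


def tromp_taylor_score(board: List[List[int]]) -> Tuple[int, int]:
    """Return (black_points, white_points) by area scoring, without komi."""
    size = len(board)
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(size):
        for c in range(size):
            value = board[r][c]
            if value != EMPTY:
                score[value] += 1  # every stone on the board counts for its owner
                continue
            if (r, c) in seen:
                continue
            # Flood-fill this empty region and record which colors border it.
            region_size, borders, stack = 0, set(), [(r, c)]
            seen.add((r, c))
            while stack:
                i, j = stack.pop()
                region_size += 1
                for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                    if not (0 <= a < size and 0 <= b < size):
                        continue
                    neighbor = board[a][b]
                    if neighbor == EMPTY:
                        if (a, b) not in seen:
                            seen.add((a, b))
                            stack.append((a, b))
                    else:
                        borders.add(neighbor)
            # The region scores only if it reaches exactly one color.
            if len(borders) == 1:
                score[borders.pop()] += region_size
    return score[BLACK], score[WHITE]


board = [
    [0, 1, 2, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 1, 2, 2, 2],
    [0, 1, 0, 0, 0],
]
black_points, white_points = tromp_taylor_score(board)
```

On this toy 5x5 position, the three empty points in the bottom-right touch both colors and go to neither player, which is the neutral case from the discussion. Note that the function never asks whether a group is "alive", which is why it can disagree with how humans would score the same board.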
How does the game end? The game ends when either a player chooses to resign or both players pass consecutively. Cool. Yep. So those are the rules. Nice. All right. Now, help me correct this with AI. Great. Okay. Let's understand how AlphaGo actually works and how somebody in the audience might be able to implement it. Great. Yeah. Let's start with an intuition about the underlying search process used to make moves, and we'll layer on ideas from deep learning to make it much more efficient and tractable.
So Go is a game where there are just two players. We're going to draw a person here and we're going to draw an AI here. And let's say this person is playing black, so they go first. So we're going to draw. They go here. And then now the AI is going to make a move based on what it sees here. So there's a question of how you encode these inputs into the AI. Maybe you can use ones and zeros, but you want to represent black, white, and empty. So you would need at least three different values here, right? So maybe you can use zeros, ones, and twos or something.
So the AI might see something like, you know, zero, zero, zero, zero, one. Great. So this is the input to the AI on its turn. Yeah. So the AI can choose; let's just pick three possible moves that it can make, and I just drew these at random. And so which move is best here, right? Well, we don't know until the game ends. Go does not have any kind of local reward for which move here is good. And this is what makes Go a very difficult game: you don't actually know who won until you really get to the end of the game.
So how deep is this tree, right? Well, on a 19 by 19 Go board, there are roughly on the order of 361 possible moves at any given turn. And of course, as the board fills up, you have fewer moves. And the number of steps in the game can be somewhere from 250 to 300 moves. And maybe experts might decide to end the game well before that. But under Tromp-Taylor scoring, you actually have to play things all the way to the end. So this could be like 300 moves or something, right? So the depth of the tree is like 300.
Yeah. So if you keep on expanding possible moves here, so in this move, the AI is going. And then here the human would go. And so on and so forth. You find that essentially what you end up with is an enormous explosion in the possible game outcomes originating from just this one state. So this is something on the order of 361 to the power of 300, which is far more than the number of atoms in the universe, right? And of course, actually there are redundancies and symmetries.
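To put a rough number on that comparison, the naive count can be converted to a power of ten. The figure of about 10^80 atoms is a commonly cited order-of-magnitude estimate for the observable universe and is an assumption here, not something stated in the episode.

```latex
\[
361^{300} \;=\; 10^{\,300 \log_{10} 361} \;\approx\; 10^{\,300 \times 2.5575} \;\approx\; 10^{767}
\qquad \text{versus} \qquad
\text{atoms in the observable universe} \;\sim\; 10^{80}.
\]
```

Even after discounting heavily for shrinking branching factors, symmetries, and repeated positions, the naive tree is many hundreds of orders of magnitude too large to enumerate.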
So it's not actually 361 to the 300, but if you were to do a naive tree where there was no merging of children, then you actually end up with a tree about this big. What do you mean by merging of children? Right. Let me use this board here. So if we start here and then you play here and then I play here and then you play here, that is equivalent to: I start here, you play here, I play here. Yeah. And then you play here, right?
Yeah. So both of them arrived at the same spot, but through different paths. So this child node can be thought of as a shared descendant. And I guess it's not 361 each time; it starts at 361, but the branching factor decreases by one each time. Yeah. Yes, yes. But in any case, this is a very, very, very large tree. Yeah. And this is also why computer scientists for many years thought that Go was not a tractable problem this century.
Because the amount of compute you would need to exhaustively search every possibility is just too large. Yeah. Go is actually a deterministic game, so if you could do that search, then in any given state you could actually compute the best possible strategy in order to win the game. You can search all the possible futures where you win and then just make sure your moves always stay in that set of futures.
So AlphaGo's core conceptual breakthrough was using neural nets to make this search problem tractable. So before we get into how neural networks are involved, let's talk a little bit about how we can, assuming we had a powerful enough computer, search this tree to find the best move, right? So in the beginning, you're not going to build out the whole tree, because storing that tree would be very expensive.
Instead, you might do something like interactively figure out which leaves of this tree are worth exploring and expanding into the future to see what else is there. So there are some early algorithms in the bandit literature like UCB1, which is not exactly appropriate for a sequential game like Go, but very much inspired the action selection algorithm used in AlphaGo.
So UCB1 looks like this: on every move, we're going to take the best action, the argmax over a of Q(a), and I'll explain what Q(a) is in a moment, plus some sort of exploration bonus. So on every node, we're going to track a few quantities. So let's consider each of these a node. This is the root node, where you're making decisions from, and these are the children of the root node.
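For reference, the textbook bandit form of UCB1 picks the arm maximizing Q(a) + c * sqrt(ln N / n_a), where N is the total number of pulls and n_a the pulls of arm a. The sketch below implements that standard form; the function name and the default constant c = sqrt(2) are assumptions for illustration, since the exact expression on the blackboard is not captured in the transcript.

```python
import math
from typing import Sequence


def ucb1_select(q: Sequence[float], n: Sequence[int], c: float = math.sqrt(2)) -> int:
    """Return the index of the arm with the highest UCB1 score.

    q[a] is the mean payoff observed for arm a, n[a] is how often it was tried.
    Untried arms are selected first, since their bonus is unbounded.
    """
    total = sum(n)
    best_arm, best_score = 0, float("-inf")
    for arm, (mean, count) in enumerate(zip(q, n)):
        if count == 0:
            return arm
        score = mean + c * math.sqrt(math.log(total) / count)
        if score > best_score:
            best_arm, best_score = arm, score
    return best_arm


# Early on, the rarely tried arm wins on its exploration bonus...
early_choice = ucb1_select(q=[0.6, 0.5], n=[9, 1])
# ...but once both arms are well sampled, the higher mean payoff wins.
late_choice = ucb1_select(q=[0.6, 0.5], n=[500, 500])
```

The two calls show the shift from exploration to exploitation: with equal visit counts the bonus is identical for both arms, so the choice comes down to Q alone.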
And we're going to say each node is basically a data structure that stores a visit count for this node, this child node. That is, how often the parent visited this node? Yes. And we'll call this an action. So one thing that is easy to trip on, if you come from robotics or other kinds of reinforcement learning, is: where are the actions, right? I'm only talking about nodes.
Nodes here represent states. And because this is a perfectly deterministic game with no randomness, you can actually just infer the action based on the child. So if I go here, that implies an action. And this is the state that it resolves to, right? So if you ask an LLM to vibe code an MCTS implementation, it'll most likely design the right data structure here. But it's sort of a chef's choice. You can actually rewrite the tree structure however you like. This was what Claude 4.6 wrote for me when I asked it. And it was a very reasonable choice.
So then Q represents the mean action value of this action. And I'll use a subscript a to denote that this corresponds to taking a specific action to get here, right, from the root node. So if we have the root, basically taking a gets us to this node here. And then we're going to also store the probability of taking this action. Again, from the parent. From the parent, yes. Like, what are the odds that we sampled this one? Yeah. And this will become relevant later. We've talked about a deterministic tree for now, so I'll bring probabilities into this later.
And then finally, we have a sort of dictionary of children, which is just more of these nodes, in a sort of classic linked-list-style reference tree. So this is the basic data structure to implement a tree. And in AlphaGo, they use a slightly different action selection criterion called PUCT. It's short for predictor upper confidence bound applied to trees. And this is basically, when you select which child to take, you do the argmax over a of Q(s, a) plus a constant times an exploration term. So the equations and forms are actually pretty similar. These are both scoring criteria, right? You want to argmax this quantity, and you want to argmax this quantity, to determine which action to take.
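A minimal sketch of the node described here (visit count, mean action value, prior probability, and a dictionary of children) together with a PUCT-style selection rule. The exploration term uses the form commonly given for AlphaGo Zero, c_puct * P(s, a) * sqrt(sum of sibling visits) / (1 + N(s, a)); that formula, the field names, and the value c_puct = 1.5 are assumptions for illustration and are not taken from Eric's implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    prior: float = 1.0       # P: probability of choosing this move from the parent
    visit_count: int = 0     # N: how often the parent selected this child
    value_sum: float = 0.0   # running sum of backed-up outcomes
    children: Dict[int, "Node"] = field(default_factory=dict)  # move -> child node

    @property
    def q(self) -> float:
        """Mean action value; 0.0 for a child that has never been visited."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_select(node: Node, c_puct: float = 1.5) -> int:
    """Return the move whose child maximizes Q plus the prior-weighted exploration bonus."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(item):
        _, child = item
        bonus = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q + bonus

    return max(node.children.items(), key=score)[0]


root = Node()
root.children = {
    0: Node(prior=0.7, visit_count=10, value_sum=6.0),  # well explored, Q = 0.6
    1: Node(prior=0.3, visit_count=0),                  # never tried yet
}
first_pick = puct_select(root)

root.children[1].visit_count = 10
root.children[1].value_sum = 2.0                        # now explored, Q = 0.2
second_pick = puct_select(root)
```

Because the game is deterministic, the dictionary key (the move) identifies the action and the child node is the resulting state, matching the point that actions can be inferred from children. With zero total visits every bonus is zero in this sketch; some implementations add one under the square root to avoid that.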
So let's break down the intuition of how you select actions here. This is the mean action value. So how good is a given child on average? And if you actually knew the whole tree, then this is all you need to select the best action. You don't really need to do more than that. But if you're interactively building this tree as you're figuring out what the Q values should be, then what you have to do is occasionally try some other actions, as a sort of explore versus exploit tradeoff. So in both UCB and PUCT, there is this term here that basically rewards taking actions that you haven't taken before.
So as we mentioned before, each node stores the visit count of taking that specific action, right? So everything is initialized to zero. And so for a given action, let's just call it action A, initially it's zero. And so as N is increasing, if, let's say, we've already made 10 action selections from that root node but we haven't picked A yet, then this term actually starts to become quite large for A. Yeah. Right. And conversely, if we have chosen A 10 times out of 10, then now this term is quite small. It diminishes very quickly. And the same thing is actually true here.
Just to make sure I'm understanding it, maybe I can put it in my own words. Let's just focus on UCB. What we're saying here, you can think of it conceptually as two different things, the Q and then this exploration term. Let's just be clear about what Q is. Q is basically saying, hey, once we do these rollouts, so you're actually running all these simulations, you go down the tree, and then you figure out, okay, if I end up at a terminal node of this tree, do I win this game or not? And then you average whether I win this game or not across all the leaves of this tree, starting from this node, and that average you put in Q. Correct.
And so you're saying the Q is basically representing, will I win this game or not? What is the probability that I'll win this game starting at this node? That is your sort of exploit. That is like saying, I've run these simulations, I think this is a good move or not. And then this other term is saying, have I explored this branch enough yet relative to the other actions I could be exploring? If I haven't explored this branch yet, you know, maybe I think it has a low score, but I just haven't explored that many leaves down this node in this tree.
So I should maybe try this, even though the Q, this sort of exploit term, is telling me that this is not that valuable. And because ln of n grows slower than n, basically over time you will move from the argmax being dominated by this exploration term, which is the second term here, to the argmax being dominated by the Q term, which is like, okay, I've done enough simulations, I'm quite confident that this is the branch to go down. Yes, that's right.
So the motivation for UCB was to come up with an algorithm where, if you don't know the payoff of the arms, the different actions you can select, to begin with, this strategy, with some given exploration term here, bounds your regret in terms of how wrong you can possibly be. I don't know the proof. I also don't know if this one is proved to have logarithmic or, you know, square-root bounded regret or anything, but I think the algorithm was just derived to look something like this. And you can tell that these terms grow a little bit differently. And this is actually just to account for the fact that Go has many more actions at every given move compared to your standard bandit problem.
So one small clarification to make is that you talked a little about simulations and probabilities and so forth. We should remember that Go fundamentally is a deterministic game. So where does the notion of probability come from here, right? If you had a very powerful computer, there are no probabilities. You can just compute the true average of what the mean action value is. So where does the probability come in?
Well, it turns out that in computer Go before AlphaGo, we've always done some sort of Monte Carlo method where we take the expected Q value averaged over a randomly selected tree. And that randomly selected tree is where probabilities come in. So the interpretation of Q is: what is the expected action value under the random distribution induced by some random search process? And so where does the random search process come in? That's where P of action comes in.
Yeah. So if we assume a very naive algorithm where you have a uniform probability of taking any valid action, then this would just be one over the number of valid moves in this setup. And you would be taking this average over this very diffuse tree, right? And this is a valid integral you can take, but it's very slow because you're going to consider a lot of trees that have very low value. And it's essentially almost like an importance sampling problem, where there are only a few actions and paths that contribute high value, and almost everything else is low value. So that's sort of a tricky problem here.
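To make the uniform-rollout estimate concrete without a full Go engine, the sketch below swaps in a tiny stand-in game, which is an assumption and not from the episode: players alternately take 1 to 3 stones from a pile, and whoever takes the last stone wins. Q for each first move is estimated by averaging the outcomes of rollouts in which every later move is chosen uniformly at random. All names are hypothetical.

```python
import random
from typing import Dict, List


def legal_moves(pile: int) -> List[int]:
    """Stand-in game: a player may take 1, 2, or 3 stones, but no more than remain."""
    return [k for k in (1, 2, 3) if k <= pile]


def random_rollout(pile: int, my_turn: bool, rng: random.Random) -> float:
    """Play uniformly random moves until the pile is empty (requires pile > 0).

    Returns 1.0 if 'I' take the last stone and 0.0 if the opponent does.
    """
    while True:
        pile -= rng.choice(legal_moves(pile))
        if pile == 0:
            return 1.0 if my_turn else 0.0
        my_turn = not my_turn


def estimate_q(pile: int, n_rollouts: int, rng: random.Random) -> Dict[int, float]:
    """Monte Carlo estimate of the win rate of each first move under uniform random play."""
    q = {}
    for move in legal_moves(pile):
        remaining = pile - move
        if remaining == 0:
            q[move] = 1.0  # taking the last stone wins immediately
            continue
        wins = sum(random_rollout(remaining, False, rng) for _ in range(n_rollouts))
        q[move] = wins / n_rollouts
    return q


rng = random.Random(0)  # seeded only so repeated runs give the same estimates
q = estimate_q(pile=3, n_rollouts=200, rng=rng)
```

In a game this small every line gets sampled many times. In Go the same uniform average is spread over an enormous number of low-value continuations, which is the inefficiency being described.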
Okay, so this is the action selection criterion for how you decide which moves to move down. Now, as you move down in tree search, you will eventually run into a node where it's quite clear you've won or lost, right? At the very, very end of the game, when there are no valid moves left to play under Tromp-Taylor scoring, you can decide whether you won or lost, right? So you either won or you lost.
And so this is basically the final return of the whole game, right? And so the question here is, we can assign a value, u, to a terminal leaf node of the tree, but how do we assign the values for nodes prior to that, the parents? And it turns out that what you simply do is take an average: your mean action value is essentially your average.
So let's suppose these were leaf nodes. Sorry, these were all leaf nodes. The mean action value of this node, this action here, is just the average of whether you won or lost at the leaf nodes. Correspondingly, you can walk up the chain and say that the mean action value of this node, let's call it Q_B for action B, is just a weighted average of these ones here.
Yeah. The weighted average could depend on whether you have a different sampling distribution, but the basic intuition is that you want to resolve the game to a deterministic win or loss, and then go backwards, which is called the backup step, and assign values to these nodes or actions corresponding to the average over the final terminal leaves.
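The backup step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Node` class and `backup` function are hypothetical names, and every outcome is kept from a single player's perspective, which is a simplification (two-player implementations commonly flip the sign of the value at alternating levels). Each node keeps a visit count and a running sum, so its mean action value is the average of the terminal outcomes reached through it.

```python
class Node:
    """Hypothetical search-tree node holding running statistics."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.value_sum = 0.0

    def mean_value(self):
        # Average terminal outcome over every playout that passed through this node.
        return self.value_sum / self.visits if self.visits else 0.0


def backup(leaf, outcome):
    """Walk from a resolved leaf up to the root, adding the outcome at each node."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.value_sum += outcome
        node = node.parent


# Root -> action B -> three terminal leaves with outcomes win, loss, win.
root = Node()
action_b = Node(parent=root)
leaves = [Node(parent=action_b) for _ in range(3)]
for leaf, u in zip(leaves, [1.0, 0.0, 1.0]):
    backup(leaf, u)

q_b = action_b.mean_value()  # two wins out of three playouts
print(q_b)
```

If leaves are reached with different frequencies, the same running average becomes the weighted average mentioned above, because frequently sampled branches contribute more terms to the sum.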
Okay, so if you were to do this without neural networks, it would still be intractable. You would have trouble finding which actions to sample. A lot of the actions would contribute very low value, especially if you're trying to fight your way out of a losing position, and only a few actions give you high value, so the search in practice is still very expensive.
But the idea is that, because Go follows a tree structure, you can form a very good estimate of the value of this node based on the values downstream, assuming they're all correct and assuming you've searched deep enough. Your explanation earlier about the sorts of states where it's obvious to a human who's going to win, but where you would still have to play it out to settle it deterministically, drove home the intuition of why the value function is, one, trainable and, two, necessary in order to learn this game effectively.
I mean, it's worth defining value in the first place. Sounds good. So we talked about this u value being your final resolution of whether you won or lost, and this is the terminal leaf node condition. Now, humans don't play all the way to the edges of the tree, the leaves of the tree. In high-level play they stop some dozens of moves before, maybe even a hundred moves before. So how do they know?
You can think about humans as implicitly having a neural network called a value function that takes in a board state and evaluates whether they will win. The human glances at the board and knows, "I'm probably going to lose." They're essentially running a neural network that looks at a board, and implicitly they are amortizing a huge number of possible game playouts, taking that average, and then deciding whether the board is winnable or not, and whether they should concede or keep playing.
And this is remarkable. The beauty of something like this is that a neural network in a human can somehow do all of this simulation at a glance and just know within a few seconds, without actually playing out every single game, based on crystallized knowledge and experience. This gives us a hint that in games like Go there are ways to radically speed up the search process. One of the fundamental intuitions behind why AlphaGo works is that you can train a value function to look at a board and quickly resolve the game without playing out all of these trees to a very deep search depth.
Yep, makes sense. I will say for the audience: for previous episodes, when I was prepping and it seemed somewhat relevant to understand how AlphaGo works, I found it very confusing. But it's the kind of thing where, once you understand the problem in this way and then build the next few pieces, it is actually much more understandable and will make a lot of sense. It's okay to be confused right now, but it's probably simpler to understand by the end of this lecture than you anticipate. So I'll just make that note for the audience.
Yeah. To step back about where we're going with all this, the important intuition at a high level is that classically, for games like Go, you could build a tree, but we don't have computers powerful enough for that. Estimating the value of every action you could possibly take is also hard, because you don't know the outcome until the end of the game. You could take averages by playing games to the end, but that's also hard because you don't know which actions to take to sample these averages.
So conceptually, there are two problems: the breadth of the tree and the depth of the tree. AlphaGo gives us a way to shrink both of those until they are very tractable. That's essentially the core idea behind it. So we take this idea that humans can glance at a board and instantly predict whether they'll win, and maybe that gives us the opportunity to really truncate how deep we search.
We also know that humans can look at a Go board and decide, intuitively at a glance, what moves might be good. So these are the two things we can use deep neural networks for to accelerate the search process. Let's go back. Before we talk about neural nets, let's go back to how this playout works.
We've only talked about making one move. The AI looks at this encoded Go board. It has a tree. It searches deeply into the tree to find out which of its actions might be the best, and then it takes that action. Then it goes back to the human. So maybe now the human sees a Go board that looks like this, and then they make their move. Maybe they put their stone here.
Then we go back to the AI, which now looks at a new encoded board. I've used 2 to denote the AI playing as white, 1 to denote the human playing as black, and 0 as empty. On the AI's turn, it does the MCTS tree search all over again from scratch. It throws away the old tree that it searched last round, there's a new root node, and it begins to search anew. And so on and so forth. So you can think about MCTS as a search algorithm, aided by neural networks, that decides what move is best to play. And it's done on every move.
Okay, great. So let's talk about the neural network part of this. And while you're erasing, another thing that was important for me to understand was that this MCTS data structure, with nodes and children of nodes and so on, is built per move and reinstantiated once a move is made. So a human makes a move, then the AI looks at this and runs a bunch of simulations to figure out what move it should make next. A simulation is basically exploring one more node in this MCTS tree. At the end, once you've run, say, a thousand simulations, that informs, as I guess you will explain, the probability of what move to make next. That's what you store.
You choose the best move given those probabilities. You discard all of that. Then the next player makes a move, and you restart this process at the beginning of every move. Correct. One small addendum: you don't discard all of that. You keep one thing behind that we'll use later.
Just like I did for Reiner, I wanted to make flashcards for this episode so that people could retain these concepts. Ideally, an LLM could generate some candidates for me to then refine. But to actually get high-quality suggestions, I needed to design a whole pipeline where the AI could ingest screenshots of the blackboard at the right timestamps, make SVG diagrams in case visuals were helpful, run its writing and drawing through a critic, and then revise the card in response to this feedback.
It's very hard to accomplish this just by stacking LLM calls. This sort of step-by-step recipe works much better if you have a durable agent that's been engaging with the task across all the previous stages. So I used the Cursor SDK to spin up an agent for each card. The Cursor harness saved me a bunch of work in designing some custom context scaffold or figuring out how to design tool calls for taking screenshots or making animations. These agents all run in the cloud, so I don't have to worry about leaving my laptop open. I just get an email when I have candidates to review. You can check out my cards at flashcards.dwarkesh.com. You can start building with the agent SDK at cursor.com/dwarkesh.
Okay. So now we have a basic intuition of how moves are made with search. We're going to talk about how neural networks can speed this up by providing an analog to human intuition. There are two networks. There is the value network, which takes in a state and predicts whether I'm going to win or lose. It's a binary classification problem. Then we have a policy network, which induces a distribution over good actions to take. I'm going to draw a one-dimensional flattened move distribution, but this is really a square grid. So this is the kind of probability distribution over good actions that it might produce.
And both of these are categorical classification problems. So you can train this like any classifier with deep learning: cross-entropy loss, that kind of stuff. The specific architecture does not actually matter too much. I tried a few different architectures; transformers work, ResNets work. For small data regimes, my experience is that ResNets still outperform transformers and give you more bang for the buck at lower budgets, but this may not hold in general. Wait, why is that? They provide the inductive bias of local convolutions, and generally transformers start to outperform residual convolutional networks when you want more global context.
One interesting finding from the KataGo paper was that it's actually quite useful to pool global features together and aggregate them throughout the network, to give the network a global sense of how to connect value from one side of the board to another. What does it mean to aggregate global features? If you have a very large 19 by 19 Go board, and you've got some battles going on here and some battles going on there, when you pass this through a convolutional neural network, the receptive fields of the convolutional network are going to be good at computing local things and making that invariant, but they won't be able to connect these two features easily.
They need to be pulled together and attend to each other somehow. So the argument for why transformers are good for computer vision tasks, as with vision transformers and so forth, is that because they have global attention across the whole input, they can more easily draw these connections. But you do need more data there, so that you can learn the invariant local features from data. I've tried very hard to make transformers work for this problem, because I was curious whether transformers would present some sort of breakthrough and remove a lot of those tricks. But try as I might, I haven't figured out a way to make transformers better than ResNets for now.
Sorry, one more tangential question. It makes sense why transformers, with their global pooling of information, would be better if you need to consider information that is not just spatially local. CNNs give you a bias that the things next to you are especially relevant, and then they're aggregated up. Yeah, exactly. So for games where what is happening locally isn't that relevant and you have to consider the whole thing, you're saying transformers would work better.
We've been talking about the spatial dimension. How about the temporal dimension? Right now we're only considering the previous move, because it is a deterministic, full-information game. But what if it was something like poker or Diplomacy, where a bluff they made a while back is relevant to understanding the situation now and deciding your next move? You need to consider all those previous states. Would that change the consideration of what inductive bias and what architecture are most relevant? Right. Great question.
So Go is a perfect information game. In perfect information games, there exists a Nash equilibrium strategy that does no worse than any other strategy. If you know that your opponent has a particular bias, like they love to play aggressively, you can in principle counter that specific strategy better than a Nash equilibrium policy would. But to hold up against any given strategy, there does exist a single Nash equilibrium policy that can be decided solely using the current state. That is a design choice that most Go agents, AlphaGo included, made, which in hindsight turned out to work very well because the Nash equilibrium seems to be superhuman.
Yeah. No human strategy seems to be able to beat it. Now, there are variations of this where you would actually need to consider temporal history. This is a very exciting research area, and I would encourage people to fork my repo and try these things out. If you were to play, let's say, 2v2 Go, then you actually need to model your partner's behavior, and you may not have information on how they play. So you need to aggregate some information on how they play so that you can respond accordingly.
Yeah. Right. These are situations where it's no longer a perfect information game. In those cases, in games of imperfect information or partial observability, you do need some context to build a model. I think that's a place where things can get very exciting in terms of self-play or Diplomacy-style games.
Yeah, interesting. Okay, so returning to the neural network: the architecture, again, is not super important. You can get it to work with transformers. You can get it to work with ResNets. I found that for low-budget experiments, ResNets work a little better. You can also use a Karpathy-style autoresearch hyperparameter tuning loop to make your architecture pretty good, so you don't have to worry too much about that. You just need to set up the problem so that you have a clear optimization target.
Yeah. Okay, so we're going to pick a somewhat arbitrary architecture that worked for what I did. But again, this part is not super important. You have your encoded board state. Similar to an RGB image, we're going to have three channels: one channel to encode black, one channel to encode white, and one channel to encode empties, or maybe a masked region if you want to train on multiple board sizes.
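A minimal sketch of that three-channel encoding, in plain Python. The function name `encode_planes` and the constant names are assumptions for illustration; the 0/1/2 cell codes follow the empty/black/white convention used on the blackboard earlier, and the masked-region variant for multiple board sizes is left out.

```python
EMPTY, BLACK, WHITE = 0, 1, 2


def encode_planes(board):
    """Turn a 2D grid of 0/1/2 cell codes into three binary planes: black, white, empty."""
    black = [[1 if cell == BLACK else 0 for cell in row] for row in board]
    white = [[1 if cell == WHITE else 0 for cell in row] for row in board]
    empty = [[1 if cell == EMPTY else 0 for cell in row] for row in board]
    return [black, white, empty]


# Small 3x3 example board: two black stones and one white stone.
board = [
    [0, 1, 0],
    [2, 0, 0],
    [0, 0, 1],
]
planes = encode_planes(board)
for name, plane in zip(["black", "white", "empty"], planes):
    print(name, plane)
```

Every point is set in exactly one plane, so the stack works like a one-hot image that a convolutional network can consume directly.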
I'm not going to talk about multiple board sizes for now. That's a little too complicated. So we'll just say we've got this two- or three-channel RGB-like image. Then we go into a ResNet, and then we have two branching heads. One head predicts the value function, and this is a single logit, so it lives in R^1. Then we have the policy head, which lives in R^361, one logit per point on the 19 by 19 board. So this is the architecture, and we're going to train it to predict the outcomes of games given the board state. We're also going to train it to predict what the good moves are.
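To make the two output shapes concrete, here is a small sketch of how the raw head outputs could be turned into a win probability and a move distribution. It assumes a sigmoid on the single value logit and a softmax over the 361 policy logits, which is the usual choice for classifiers trained with cross-entropy. The logits are made-up numbers, not outputs of any trained network.

```python
import math


def sigmoid(x):
    """Map a single logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Map a list of logits to a probability distribution (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


BOARD_SIZE = 19

# Hypothetical head outputs: one value logit, and one policy logit per board point.
value_logit = 0.0
policy_logits = [0.0] * (BOARD_SIZE * BOARD_SIZE)
policy_logits[72] = 2.0  # pretend the network prefers flattened index 72

win_prob = sigmoid(value_logit)
move_probs = softmax(policy_logits)

print(win_prob)         # a logit of zero maps to an even chance
print(len(move_probs))  # one probability per point on the 19x19 board
```

Both heads would share the same ResNet trunk. Only the final projection differs: one number for the value, 361 numbers for the policy.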
Yeah. Right. So the OG AlphaGo paper, also called AlphaGo Lee, initialized this network with a supervised learning dataset of expert human play. Later, they removed this restriction by having the model teach itself how to play well. But as a matter of implementation, for your audience, I find it super nice to always initialize your experiments to something that's easy, and get the problem working before trying to bite off the whole thing and learn tabula rasa.
Just as in deep learning, initialization is everything. You always want to initialize your research project to something as close to success as possible, especially if you're doing something new that you haven't done before. Always pick something that works and then get it to do something better, rather than start from something that doesn't work at all and then try to make it work. So, under that philosophy, it's a great idea to start from something that has a good initialization.
So we're going to take human expert games and train this model to predict good actions. We take all of the winning games, all the moves from games which a human, sorry, an expert won, and predict those actions. And then, regardless of whether the game was won or lost, you predict the outcome from the board state.
Yeah. So you might be wondering: for some of the early boards, where basically only one stone has been put down, how could you possibly know who the winner of this game is? Well, if you have hundreds of thousands of games, then on average you'll probably see that, of the games that branch off from a board like this, about half will be wins and half will be losses. So that'll actually be fine.
When you train this model to predict those, the predicted win probability will converge to about 0.5. So it's expected that once you train the model, a starting board state will look like 0.5, and then as you progress towards the end of the game, if this line is 0.5, the win probability will either go up like this or go down like this. And this axis is your move number.
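A quick numerical check of why the prediction for an early board settles near 0.5. This is a sketch with made-up labels: it assumes the same opening position appears in 1,000 games, half won and half lost, and scans candidate constant predictions for the one that minimizes binary cross-entropy. The names `bce` and `candidates` are illustrative.

```python
import math


def bce(p, labels):
    """Mean binary cross-entropy of a constant prediction p against 0/1 labels."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps) for y in labels
    ) / len(labels)


# Hypothetical data: the same opening board seen in 1000 games, half won, half lost.
labels = [1, 0] * 500

# Try constant predictions 0.01, 0.02, ..., 0.99 and keep the one with the lowest loss.
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: bce(p, labels))
print(best_p)
```

With labels split evenly, the loss is lowest at the empirical win rate, so a network trained with cross-entropy is pushed towards 0.5 on positions whose outcomes are a coin flip in the data.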
Yeah. As you get hundreds of steps into the game, it becomes much more clear who's more likely to win or lose under your expert data distribution. I didn't understand why this way of thinking about value is especially relevant to the expert data. It is not specific to the expert data. It's true for any data that you train it on.
Yeah. So if you were to learn tabula rasa, you would also expect this to fall out. So imagine you're vibe coding AlphaGo: you gather some expert datasets of Go games online, or you have a dataset of human players, and you train this model. It turns out this model is already a pretty good Go player. It'll most likely beat most human players.
If you just take this policy recommendation and take the argmax over the probabilities, and play that action as your Go move, it'll be a very fast Go player that doesn't think in terms of reasoning steps. It just shoots from the hip, and it'll be a very strong Go player, which is already quite miraculous if you think about how 10 neural network layers, maybe under 3 million parameters, can do something that impressive.
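The "shoot from the hip" player amounts to an argmax over the flattened policy output. A minimal sketch, assuming a row-major flattening of the board. The optional `legal_mask` argument is an addition for illustration and is not something described in the episode.

```python
def argmax_move(move_probs, board_size, legal_mask=None):
    """Return (row, col) of the highest-probability move, optionally among legal points only."""
    candidates = (
        i for i in range(len(move_probs)) if legal_mask is None or legal_mask[i]
    )
    best_index = max(candidates, key=lambda i: move_probs[i])
    # Undo the row-major flattening: index = row * board_size + col.
    return divmod(best_index, board_size)


# Hypothetical policy output for a 3x3 board, flattened to length 9.
probs = [0.05, 0.10, 0.05, 0.40, 0.00, 0.10, 0.10, 0.10, 0.10]

greedy = argmax_move(probs, board_size=3)

# Same policy, but pretend flattened index 3 is occupied and therefore illegal.
mask = [True, True, True, False, True, True, True, True, True]
greedy_legal = argmax_move(probs, board_size=3, legal_mask=mask)

print(greedy, greedy_legal)
```

No search happens here: one forward pass produces `probs`, and the move is read off directly.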
And so you can start this way, and it's important when implementing this to verify that this actually holds. It's good to verify that your Go rules are implemented correctly and that you can run these simulations relatively quickly. Treat it almost as a checkpoint: you want to make sure that you can do this basic step before you try to layer on more complex things like search.
Yeah. But we can do a lot better than taking the raw neural network and playing its moves, and this is how we apply it to Monte Carlo Tree Search. So let's apply the neural network to improve Monte Carlo Tree Search. We start with our root node, and we now have a four-step iterative process to do MCTS. This tripped me up when I was first reading the paper and trying to understand it, but essentially what we're going to do is choose a number of simulations, call it num_simulations. This number varies. It can be somewhere between 200 and 2048. I believe in the AlphaGo Lee match they used tens of thousands of simulations per move, because they really wanted to boost the strength of the model as much as possible.
Yeah. But in training, you don't actually need too many. And KataGo, I think, uses something on this order as well. If you watch the documentary, they had a laptop out during the game. Do you know if they ran it on that? They didn't use the laptop itself. It was on some TPU pod, I think. Cool. I think that's kind of unfair. Lee is not using 1e22 FLOPs to make a move, you know? Fair enough. Interestingly enough, modern Go bots don't need that much compute at test time. What we'll find, as we talk about how MCTS policy improvement works, is that over time the raw network takes on all of the burden of that big TPU pod and just pushes it into the network. You can actually do all of that work with one neural network forward pass.
But the TPU pod will always add that extra oomph on top, and that's what they wanted for the match. So we pick this num_simulations value, and for every simulation we do several things. We see which moves are the best in the current tree. We add extra leaves to the tree if we get to a point where we need to add a leaf. And we update the action values for the tree. So every simulation involves a four-step process: selection, expansion, evaluation, and backup.
At the beginning of our Monte Carlo tree search, our tree is very basic. It only has the root node, the current board that our AI wants to play on, and we're going to select the best action for it. When this root node is created, we can evaluate it under our neural network and get the quantities V_theta as well as our probability over actions for the root. For all of the actions here, we can create a bunch of children. In this case, I'm drawing a three-by-three board with one point already occupied, so there are eight possible children associated with this root node.
Each of these has an associated probability of taking that action: P1, P2, and so on up to P8. Okay, so at the beginning of our Monte Carlo tree search, we have our root node, and we can initialize it with some children. On a three-by-three board with one existing stone placed, the policy network evaluated on the root node gives us eight possible children that this AI could play. For each of the children, the policy network also gives us the probability of selecting that child. So the first step is selection in the tree. And again, this is a very shallow tree.
All we have so far is a tree of depth one, essentially. So our first move is to select by arg-maxing the PUCT criterion, which is basically \( Q(s, a) + c_{\text{puct}} \, P(a) \, \frac{\sqrt{N}}{1 + N_a} \). Initially \( N_a \) is zero for all the actions and \( N \) is zero, and we just pick according to this. At the start, the chosen action is most likely going to be biased towards the highest-likelihood action, because the other terms are uniform across all the actions.
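A minimal sketch of the selection rule described above, under stated assumptions. The function name `puct_select`, the default `c_puct=1.5`, and the convention that the parent count includes the current visit (so the prior still matters before any child has been visited) are illustrative choices, not details taken from the talk.

```python
import math

def puct_select(priors, visit_counts, q_values, c_puct=1.5):
    """Return the index of the child that maximizes Q + U."""
    # Assumption: count the current visit in the parent total, so the
    # exploration term is non-zero even when no child has been visited.
    parent_visits = 1 + sum(visit_counts)
    best_action, best_score = None, -math.inf
    for action, (p, n, q) in enumerate(zip(priors, visit_counts, q_values)):
        # Exploration bonus: large for high-prior, rarely visited children.
        u = c_puct * p * math.sqrt(parent_visits) / (1 + n)
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action
```

With no visits and all Q values at zero, `puct_select([0.1, 0.6, 0.3], [0, 0, 0], [0.0, 0.0, 0.0])` returns 1, the highest-prior child. Once a child has been visited many times with a poor Q, the bonus shrinks and other children get their turn.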
So let's suppose P1 was the highest-probability node, and you selected this one. Now you get to this node and realize that it's not a terminal node: there is more game to be played, so you cannot resolve the final outcome yet. The next step is expansion. You run this node, this board state, through the policy network. Note that this is the AI's move: the AI is making this move. So when we expand the tree from here, we're now thinking about what the human, or any opponent, might do. This is your opponent's turn.
When we evaluate the node here, we evaluate it from the perspective of this player. This node has possible actions that could be taken, and we expand the leaf nodes here. For each of the nodes we could arrive at, we now check how good those nodes are. From here, the human could play here, or here, or here. And we store the \( V_{\theta} \) for each of these: \( V_{\theta} \) of node 1, of node 1 prime, and so on.
So we're basically using our neural network to make an intuitive guess of how good this board is from the perspective of this player. Fortunately, because it's a zero-sum game, it's easy to deduce that the value for this player at this step is just one minus the value from the other player's perspective. So it's easy to flip the search process depending on which player you're at. That is the expansion step: you've taken a non-terminal node, expanded it, and evaluated the value.
And this is essentially a quick guess as to whether, if I were to play to the end, I'm going to win or not. So you can almost think of \( V_{\theta} \) as a shortcut for searching to the end of the tree for any given simulation. This is the evaluation step: we're evaluating the quality of each of these boards. In the original AlphaGo Lee, they actually did something kind of interesting, which is that they took this value and averaged it with the value of a real Go playout.
So they actually played a real game from here all the way to the end. I'm just going to draw this squiggly line to indicate some path. They play this all the way to a Tromp-Taylor resolution of a full board, so this is a zero or a one. And they took this value and averaged it with this one here. The formula was \( \alpha \cdot V_{\theta}(\text{node}) + (1 - \alpha) \cdot (\text{outcome of a randomly sampled playout}) \).
And you might be wondering how they play this out. It would be very costly to do another search on this playout, almost like a tree within a tree. So they don't do this. Instead, they take the policy network and play it against itself: they use it as both players and play all the way to the end. This helps ground the estimates here in reality, because you get a single-sample estimate of whether you win or not.
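The blend of value estimate and playout result described above can be written as a one-line helper. This is only a sketch: the name `mixed_leaf_value` and the default `alpha=0.5` (a plain average) are assumptions for illustration, and the playout itself is taken as given.

```python
def mixed_leaf_value(v_theta, playout_outcome, alpha=0.5):
    """Blend the value-head estimate with one playout result (0 or 1)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # alpha = 1 trusts the network alone; alpha = 0 trusts the playout alone.
    return alpha * v_theta + (1.0 - alpha) * playout_outcome
```

Setting `alpha=1.0` recovers the later design in which the playout is dropped and only the value head is used.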
You can think about the endgame, where the board is almost resolved: there this term becomes quite useful, because play according to the policy will most likely give a pretty reasonable guess of the game's outcome. So you're not facing a problem where the value estimate becomes untethered from reality. It turns out this is totally unnecessary. In all subsequent papers after AlphaGo Lee, they got rid of it. In my implementation I did the same, and it speeds things up a lot because you don't have to roll these games out on every single simulation.
Okay. So again, just to reinforce my own understanding and to re-explain it. For the audience, by the way, in case it's not obvious, the P there in the selection criterion is the probability coming from the network in this case. Correct, the policy network here. Okay, so fundamentally, a simulation: just think of it as rolling out one more node in the search process. Almost. A simulation is easy to think about when the whole tree already exists: you just walk down the tree using the PUCT selection criterion, and you keep going.
Now, in AlphaGo, the data structure is such that we begin with a tree of basically only depth one, just the root and its children, and you want to iteratively build out the tree as you're also selecting actions down the tree. That's the core thing here: because Go is such a combinatorially complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree.
Right. So let me finish with the last step, which is the backup. Once you've scored these things, you basically take the mean: the Q value assigned to the node here for taking this action is now just the average across your evaluated values. You take a running mean over all of the simulations that you've run, averaging the values of the child nodes.
That's what is known as the backup step. And once you evaluate this, you can recursively go back up: if you know the action value of this node, you can then update the average on its parent, and so on and so forth. So you have this four-step process. You choose the best action that you know of so far. Then you may run into a node you haven't been to before, so you need to grow the tree a bit.
Then you run it through the network to guess whether you're going to win or not. And then you walk all the way back up to the root node to update your values on what the best moves are. As you do this iteratively, because you're always selecting according to this criterion, you're always going to be selecting the action you think is best at any given branch.
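The four steps (selection, expansion, evaluation, backup) can be sketched end to end on a toy game. Everything specific here is an assumption for illustration: the game is a tiny stone-taking game rather than Go, `evaluate` is a stub standing in for the policy and value heads (uniform priors, value 0.5), and names such as `Node`, `simulate`, and `run_mcts` are hypothetical. Values are win probabilities in [0, 1], flipped with one minus the value between plies, as in the discussion above.

```python
import math

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(a) from the policy head
        self.visits = 0         # N_a
        self.value_sum = 0.0    # running total behind the mean Q
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

# Toy stand-in game (not Go): players alternately take 1 or 2 stones and
# whoever takes the last stone wins. A state is the number of stones left.
def legal_moves(stones):
    return [m for m in (1, 2) if m <= stones]

def evaluate(stones):
    """Hypothetical network stub: uniform priors, value 0.5 for the mover."""
    moves = legal_moves(stones)
    return {m: 1.0 / len(moves) for m in moves}, 0.5

def simulate(root, root_stones, c_puct=1.5):
    node, stones, path = root, root_stones, [root]
    # 1. Selection: walk down while the current node already has children.
    while node.children:
        parent = node
        move, node = max(
            parent.children.items(),
            key=lambda kv: kv[1].q()
            + c_puct * kv[1].prior * math.sqrt(parent.visits) / (1 + kv[1].visits),
        )
        stones -= move
        path.append(node)
    if stones == 0:
        value_to_move = 0.0  # terminal: the player to move has already lost
    else:
        # 2. Expansion and 3. evaluation of the newly reached leaf.
        priors, value_to_move = evaluate(stones)
        node.children = {m: Node(p) for m, p in priors.items()}
    # 4. Backup: each node stores value for the player who moved into it.
    value = 1.0 - value_to_move
    for visited in reversed(path):
        visited.visits += 1
        visited.value_sum += value
        value = 1.0 - value  # zero-sum flip between plies

def run_mcts(stones, num_simulations=200):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        simulate(root, stones)
    return {m: child.visits for m, child in root.children.items()}
```

Calling `run_mcts(4)` returns the visit count of each root move. From four stones, taking one stone leaves the opponent in a losing position, so that move accumulates most of the visits even though the stub network knows nothing about the game; the exact terminal values do the work.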
So the final visit counts, how often you chose these things, will reflect your policy distribution as induced through this search process. The visit count that you stored in the node earlier actually becomes the vote for which action we should finally select here. As a test of understanding, it's worth thinking a little about whether we could make this even simpler.
Could we maybe even get rid of this one and still make the thing work? Recall that when you do an expansion and then an evaluation at, let's say, this node, you are checking the win probability of each of the child nodes. If this one is one and these are zero, you do know something about which action might be better to take. So why would you still need this?
Why not just normalize this one into some distribution and call that your policy distribution? This is fine, you can do this, and it probably does work. But in practice, having a single forward pass that gives you a pretty good guess is how the breadth is pruned. There is a sort of duality here. It would be weird if, let's say, the policy recommended an action that disagreed with the value. If the policy said this was very high probability, but the value said it was low value, then there's something fundamentally wrong between your policy head and your value head.
So they are linked, and you probably could get rid of this if you came up with a different way to recover it from just the value evaluations. Right. But just to make sure I understand, the reason you don't do that is so that you don't have to do 360 independent forward passes to say, "Hey, here's the value of everything, let's take the argmax over it," right? Instead, you can do one forward pass and get the probabilities of all of them.
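For comparison, the alternative being discussed (recovering a policy from child values alone) might look like the sketch below. The function name and the choice of plain normalization are assumptions; a softmax would be another reasonable option. The input needs one value evaluation per legal move, which is the cost the question above is pointing at.

```python
def policy_from_child_values(child_win_probs):
    """Normalize per-child win probabilities into a move distribution."""
    total = sum(child_win_probs.values())
    if total == 0:
        # Every move looks lost; fall back to a uniform distribution.
        return {a: 1.0 / len(child_win_probs) for a in child_win_probs}
    return {a: v / total for a, v in child_win_probs.items()}
```

The win probabilities passed in are assumed to be from the perspective of the player choosing the move.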
You can usually batch these somewhat efficiently, so it probably is not a huge computational burden in practice. But yes, you would have to pass up to 361 boards in a single mini-batch to evaluate all the values here, then normalize them. Now, there's actually a more important reason why we still do this, which is how Monte Carlo tree search is used to feed back on itself.
And it recursively improves its own predictions and search capabilities. That's where having this as an explicit entity you're modeling, rather than an implicit normalization over your value, is a good idea. Makes sense. Okay, so we talked about the simulations. What you end up with as you roll out the number of simulations is a tree that looks like this. I'm drawing a very low-dimensional version of it.
Of course, in the real game it's much more high-dimensional. But you'll end up with a tree structure that has a lot of leaves that terminate and are not visited again because their value is deemed too low. Then, along one path, there will be a set of actions with very high visit counts, and the search gravitates towards that one set of decisions as you increase N. That's the mental picture of what the tree in Monte Carlo tree search looks like. You should contrast this with an exhaustive tree, like in tic-tac-toe, where you could say there are nine actions, then eight, then seven, then six, so it's a nine-factorial-sized tree. The Monte Carlo tree search in Go is very sparse: it only considers the paths on which you've expanded child nodes.
Okay. So now that we have the search algorithm that applies the value function as well as the policy function, we can talk about how the Monte Carlo tree search algorithm can act as an improvement operator on top of these.
20 years ago, Jane Street's data center fit in the corner of an office. Ron Minsky, who co-leads the tech group there, told me about how it all got started. One of our compute clusters we called the Hive. And I remember the first version of the Hive was literally six Dell boxes stacked on top of each other at the end of the row. The trading systems themselves we also had there, because we actually wanted the ability to make sure we could turn the damn thing off. There were ups and downs: at some point, one of the people who was cleaning the office unplugged one of the trading systems in the middle of the day as they were vacuuming. So in the end, it is in fact better to have it all in a data center.
Jane Street's data centers have come a long way since those six Dells. I got to tour one of them in Texas with Ron and Dan Fontocorvo, who leads Jane Street's physical engineering team. You know, these GB300 cabinets consume at peak about 140 KWE. Compare that to traditional air-cooled cabinets, where you're talking about 10 to 40 KWE. It's a lot more. We got deep into the details of running one of these data centers, things that I had never considered before. It's filled with a liquid, a mix of distilled or deionized water and propylene glycol, 25% propylene glycol. That's to inhibit any bacteria or algae growth. I don't love the world where we have to worry about bacteria growing in our servers. I got to see way more of what actually happens in the data center than I've ever seen before. Jane Street was willing to literally pull up the floorboards, take out the racks, and take me to the back where all the chillers are. You can check all of this out at janestreet.com/dwarkesh, where we posted the full tour.
Okay, so now we talk about the RL part: how this thing gets stronger by playing itself. Let's say we play a game against the AI, and you make a move. The AI computes the search, and this is the visit count distribution. Let's say this is your initial policy recommendation at this node. Then after MCTS, it gets more confident about one of these actions, so maybe the distribution looks a bit more peaky, like this, based on the search. Of course, you can tune the search process so that it ends up more diffuse, but that's probably not a good idea. MCTS should get more confident about specific actions than others. But it might, of course, place a lot of weight on other actions initially.
And then as you increase the number of sims, it should converge to a very peaky distribution. So this is your new policy. Let's call it \( \pi \), wrapped in an MCTS operator, for a given \( s \). After applying the MCTS process, your recommended policy distribution looks like this, a bit more peaky than the previous one. Then you take the argmax, or maybe you just sample from it. It doesn't have to be the argmax. Then you make your move, throw away the tree, and begin anew on the next move, computing a new distribution. So initially, maybe your guess looks like this, and then you refine it through MCTS. There should be one more X on the board, right? Sorry, that's correct, yes.
You refine it to something that looks like this. So on every move, you have your initial guess from your policy network, and then the search process, which combines your policy network and your value network, arrives at a more confident action that you take. And so on and so forth. Then the game ends, and one person wins and one person loses.
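Turning visit counts into a move, by argmax or by sampling, could be sketched as follows. The `temperature` parameter is an assumption added for illustration (values below 1 sharpen the distribution); the talk only says that you either take the argmax or sample. The function names are hypothetical.

```python
import random

def visit_policy(visit_counts, temperature=1.0):
    """Convert root visit counts into a probability distribution."""
    weights = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}

def choose_move(visit_counts, greedy=True, rng=random):
    """Pick the most visited move, or sample in proportion to visits."""
    policy = visit_policy(visit_counts)
    if greedy:
        return max(policy, key=policy.get)
    moves = list(policy)
    return rng.choices(moves, weights=[policy[m] for m in moves], k=1)[0]
```

The dictionary returned by `visit_policy` is also the natural training target for the policy network, since it is the distribution the search arrived at.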
The beauty of how AlphaGo trains itself is that it can take the outcome of the search process and tell the policy network, "Hey, instead of having MCTS do all this legwork to arrive here, why don't you just predict that from the get-go? Why don't you skip this initial guess and predict this to begin with?" And if your policy network starts with this guess, then MCTS has to do a lot less work to get things to work.
So if we draw a test-time scaling plot, let's say this axis is the number of simulations. At zero simulations, your implicit win rate is, I don't know, here: without any simulation, if you just take the raw action, this is your win rate. And as we increase the number of sims, maybe you have a win rate that looks like this. So when you search for, let's say, a thousand simulation steps, that gets you to a policy here, which is great.
But if you were to distill this MCTS policy back into your shoot-from-the-hip policy network, then you could actually start here. If this point were your zero-simulation win rate after distillation, then if you spend another 1,000 sim steps, you get to here. It's almost as if you could amortize the first 1,000 steps into the policy network instead of the search process; then you begin at a much better starting point and get a much better result for the number of sims that you put in.
About the diminishing-returns nature of test-time scaling, where the increase in win rate gets smaller as the number of simulations increases: is that true even for the distilled network? That is to say, if we start from the distilled network, do we get those early gains again? Or is that just inherent to the nature of MCTS? To be honest, I don't actually know the test-time scaling behavior of MCTS simulations, and I believe it might be quite sensitive to how strong this one is in practice. I'm just drawing a monotonically increasing function that gets to one.
Okay, cool. So don't pay too much attention to the shape of the curve. Just know that it's monotonic with respect to the number of sims. So the idea of MCTS is quite brilliant: we got something better by applying search, and on our next iteration of updating this network, we train it to approximate the outcome of 1,000 steps of search.
So instead of starting here, we now have a neural network that starts here, and the play gets stronger once we apply another 1,000 steps on top of it. And you can keep going. So the training algorithm for AlphaGo is basically to take the games where you've applied the search on every move that the policy encountered, whether you won or lost, and that's quite important, and train the model to imitate the search process.
There's actually an analogy to robotics, which is the DAgger algorithm. First, I'm going to draw a schematic of the states: S0, S1, S2, S3. Let's say we took a series of actions in an MDP to get a trajectory. These actions may be suboptimal; maybe we lost at the end of this game. There is a family of algorithms that take trajectories and relabel the actions to give better trajectories.
So maybe a better action here would have been a0 prime, a better action here would have been a1 prime, and likewise a2 prime and a3 prime. What MCTS is doing is basically saying: you played this game where you eventually lost, but on every single action, I'm going to give you a strictly better action that you should have taken instead.
It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data, so that you retrain your policy network to predict these actions instead of those, you're going to do better. This is closely related to DAgger in robotics and imitation learning, where you want to collect an intervention here. Even if you're in a not-great state, for example a self-driving car that veers off the side of the road, there is still a valid action that corrects you and brings you back.
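The relabeling idea can be made concrete with a small sketch of how one self-play game might be turned into training tuples. The record layout `(state, player, mcts_policy)` and the function name are assumptions for illustration. The point it shows is that the policy target is the search distribution at each state, not the move that was played, and that every position is kept whether the game was won or lost.

```python
def build_training_examples(trajectory, winner):
    """Turn one finished game into (state, policy_target, value_target) tuples.

    trajectory: list of (state, player, mcts_policy) recorded at each move,
    where mcts_policy is the visit-count distribution from the search.
    """
    examples = []
    for state, player, mcts_policy in trajectory:
        # Value target: did the player to move at this state win the game?
        outcome = 1.0 if player == winner else 0.0
        examples.append((state, mcts_policy, outcome))
    return examples
```

Positions from the losing side still contribute useful policy targets, because the search result at each state is relabeled independently of the final outcome.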
Okay, so, a pedantic question. Is there a guarantee that MCTS must be better than the policy? For example, MCTS is informed by the value network, so you could imagine that early in training, when the value network hasn't been well trained on finished games, MCTS is worse than a randomly initialized policy. So is it just a heuristic that MCTS is better than the policy, or is there some guarantee? In practice, it is a heuristic, and it does work in practice. But let me illustrate an example where MCTS can give you a worse distribution than your policy network.
This can often happen if your self-play algorithm has trained to a good point but then somehow collapses because it's not trained on diverse data or something. Let's say we have a board state where the policy recommendations are very good, so \( \pi(a \mid s) \) is great. But somehow, maybe because we're training on a lot of games where the bots just resign instead of playing all the way to the Tromp-Taylor resolution, they forget how to evaluate those late-stage playouts.
As in the case we showed with the corner play, maybe 100% of the training data in our replay buffer has lost the examples of how to evaluate the value function at those states. So you might end up in a scenario where your terminal values are very bad. And if the values at the leaves are not good, this will propagate all the way up and cause your PUCT selection criterion and your backups to be off.
And then you end up visiting a very different distribution than what your policy initially recommended. Also, if your number of sims is low, you might have a variance issue where you just don't explore enough. It's only guaranteed to converge as you take N to infinity. So variance in your search process, as well as inaccuracies in your evaluation, can definitely degrade the quality of your policy network.
And so that's why it's not guaranteed to improve. And that is, I suspect, why AlphaGo Lee had the playouts to the end in their training algorithm, so they could ground this thing in real playouts. In practice, what you could also do is, for 10% of the games, prevent the bots from resigning and have them resolve the game to the end. That way you get some training data in your replay buffer to really resolve those late-stage playouts that normal human players would not play out.
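One simple way to implement the "no resignation in 10% of games" idea is a per-game coin flip made before self-play starts. This is a sketch only: the function name and the `no_resign_fraction` parameter are hypothetical, and a real pipeline might choose the games differently.

```python
import random

def resignation_allowed(rng, no_resign_fraction=0.1):
    """Decide once per game whether the bots may resign.

    Returns False for roughly `no_resign_fraction` of games, which are
    then played out to final scoring.
    """
    return rng.random() >= no_resign_fraction
```

Passing an explicit `random.Random` instance keeps the choice reproducible across self-play runs.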
So this is why MCTS gives you a better policy if you assume that the value function is correct, and it's a very critical chain of assumptions. Assuming the value is accurate, your search process should give you a better recommendation than your initial guess. Okay. So consider a cold-started policy, an AlphaZero-type thing.
Really, what's happening for the first few epochs is that the policy is kind of useless. What you're really doing is saying, "Hey, let's play full games." Once we have played full games, we'll have labeled, for each of the preceding moves, who won and who didn't. And the loss for AlphaZero has two components: how good is the policy relative to MCTS, and how good is the value prediction relative to who actually won the game from this move? You can think of this as being applied to every single move.
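The two-part loss can be sketched for a single position as below. The specific forms are assumptions: cross-entropy against the MCTS distribution for the policy term, squared error against a 0-or-1 outcome for the value term, equal weighting, and no regularization. The function name is hypothetical.

```python
import math

def alphazero_loss(policy_probs, mcts_probs, value_pred, outcome):
    """Per-position loss: policy cross-entropy plus value squared error."""
    # Policy term: how far the network's distribution is from the search's.
    policy_loss = -sum(
        target * math.log(max(p, 1e-12))
        for target, p in zip(mcts_probs, policy_probs)
    )
    # Value term: how far the predicted win probability is from the result.
    value_loss = (value_pred - outcome) ** 2
    return policy_loss + value_loss
```

In training, this would be averaged over every recorded move of every self-play game, using an automatic differentiation framework rather than plain Python floats.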
Correct. And really, what's happening at the beginning of AlphaZero training is that we're trying to get the value function to predict who will win the game if you find yourself in this state as this player. Functionally, that's all that's happening. Later on, once that's well trained, the policy also improves. Correct.
One trick I did find to be pretty useful, and this is not a peer-reviewed claim, so take it with a grain of salt: I found it useful in my own implementation to do the following. You want to first make sure that the value function is good before you invest a lot of cycles doing MCTS. It doesn't make a lot of sense to do search on garbage value predictions.
So you want to start at a good place where this works. AlphaGo Lee does a very good thing here: it just takes human games, you train on them, and it just works. You can also take an open-source Go bot, play it against itself, and generate data; that also works. So if you have some offline dataset that has realistic good play, you can learn the late-stage value function pretty well.
And that's what you need to start the search process. Sorry, can you say that one more time? Sure. It's quite easy to evaluate a late-stage Go game, when almost all the pieces are on the board. Yeah. It's almost like a decidable problem, right? Yeah, because there's less and less uncertainty as the remaining depth of the tree shrinks. So most games played to the end by reasonable people will be good training data for a good value function at the terminal parts of the tree. Got it, okay. Then, as you play more games, the search will back up good values into the intermediate nodes of the tree.
Right. And as you increase the amount of data, your value head gets a good intuition for what a healthy board state is versus an unhealthy one. Yeah. Those are much more subtle to judge in the midgame than at the beginning or the end. Yeah. So the most difficult part to score is not the beginning or the end: the beginning is just obviously 0.5, and at the end it's pretty obvious who's winning. Yeah. The hard part you want the value function to learn is who is winning in the middle.
And this is actually very analogous to TD learning. Yes. There's a beautiful connection to TD learning that we can talk about in a bit, in contrast with Monte Carlo Tree Search. Okay. So you first want to get good value functions, and expert data can give you a quick shortcut. I recommend that practitioners just do that first, to initialize to a good starting point. Then, if you want to do the AlphaZero or KataGo kind of tabula rasa learning, what you can try is playing random games on a small board.
Just take a random agent. If you play, say, 50,000 games, you'll actually learn a pretty good value function as well, because on a 9x9 board you can see enough of the common patterns with random play. Then, if you train a model that can train on both 9x9 and 19x19 data, and KataGo proposed one of these architectures, there's some pretty good transfer from the value head evaluated at 9x9 to 19x19.
Right. Because unlike other games, there's not a new kind of piece that gets introduced when you increase the board size. If we take it to the limit and consider a very tiny 4x4 Go board, yeah, then if you play 50,000 games you're going to have a lot of end states that look like human play, right? It's just like tic-tac-toe at that point. So if you broaden this a little to 5x5 or 9x9, it's not unrealistic to imagine that purely random play will generate pretty reasonable-looking boards.
And you can score those pretty easily. That is what gives you the bootstrapping to then improve your policy with search. But it's very, very critical that MCTS has accurate value estimates. Yeah. You need to ground the value; ultimately, MCTS will fall apart if you don't have a grounding function for the value. I'd be curious how much compute you save by training the value and the policy on the same network: because they share the same representations, how much more efficient is learning?
That would be interesting, because we've just talked about how they're making similar predictions, or at least should be in line with each other. Yeah. So I'd be curious whether you're actually halving the amount of compute by giving them the same network. Right. AlphaGo Lee, the original AlphaGo paper, had two separate networks. Yeah. In all subsequent papers, they merged them into one network with two heads.
Yeah, and presumably this saves compute. But answering that question in a rigorous scientific way... it's a simple question, but if you really want to chase it down to its limit, it takes quite a bit of work to resolve. Yeah. But intuitively, yes, they share a lot of representations. And as we mentioned, your policy network and your value network should kind of agree when doing evaluation, right?
So there really should be this sort of consistency between them. Yeah. Can I ask whether this is the wrong way to think about it? When I learn how an LLM works and how simple RLVR is, at least as an algorithm, I'm sort of stunned by the kinds of things it can do: it can learn to build very complicated code repositories and whatever, simply from getting a yes/no.
And here, the more deeply you understand it as just predicting MCTS, the less impressive AlphaGo seems in retrospect, because you're putting in a lot of bias: you're telling it how it should titrate exploration as things go on, and you're building this very explicit tree search for it.
So I don't know if you share that intuition, where the more you understand it, the less impressive the accomplishment in 2017 seems. I personally disagree. I think they're profound for different reasons. And I don't understand LLM RL well enough to comment on it on your podcast.
Yeah. But why is AlphaGo a profound accomplishment? Maybe it's worth stepping back a little. It is different from modern RL, and we can talk a bit about some of the algorithmic choices there. But I think the most profound thing here is a single pass through a 10-layer neural network.
So basically 10 steps of reasoning. Yeah. Of course, the reasoning is not just one trail of thought; with distributed representations, a lot of thoughts could be going on at the same time. But by construction, let's say a 10-layer neural network can only do 10 sequential steps of thinking, right?
Yeah. Ten steps of parallelized, distributed-representation thinking in a neural network are able to amortize and approximate, to very high fidelity, a nearly intractable search problem. Yeah. This was a breakthrough, and I think most people even today don't fully comprehend how profound that accomplishment is.
And this is also what undergirds AlphaFold, for example, right? You have a very difficult physical simulation process where you would need to roll out many microscale simulations, and yet 10 steps of a somewhat small neural network can somehow capture what feels like an NP-class problem in a single pass.
It actually makes me wonder whether our understanding of problems like P versus NP, these very fundamental computational hardness problems, is incomplete. Obviously this is not a proof that P equals NP or anything, but there's something very disturbing about it: what felt like a very hard problem can fall to a very simple macroscopic simulation.
Yeah. That is a very interesting insight: a lot of problems that are proven to be NP-hard (I don't know if Go is proven to be NP-hard, but protein folding, et cetera) can be solved by neural networks because they're NP-hard in the worst case, but we're not dealing with the worst case.
We're usually not concerned with the worst case, since these problems have a lot of structure to them. Yeah, I think the question we should be asking ourselves is this: we've been formulating solutions to NP-hard problems in terms of worst-case complexity.
Mm-hm. And I wouldn't say this solves Go, right? It doesn't give us an exact solution for the optimum. Yeah. But in practice it is extremely useful. Yeah. The same thing has been shown in AlphaTensor and AlphaFold: yes, there is a very hard problem that in the worst case seems intractable.
Yeah. And yet we're able to make almost arbitrary amounts of progress. So, in the limit, what might this look like? Well, suppose you want to simulate something very complex, like weather, or predict the future, or ask whether we live in a simulation or not.
The computing resources you need to build a very complex simulation might be much smaller than you think, given our ability to amortize a lot of that computation into the forward pass of a single network. Interesting. So to me, AlphaGo was the first paper that really showed this profound level of simulation being compressed into a small amount of compute.
I feel not at all qualified to comment on the computational complexity math here. But I wonder if chaos plays an important role. What is the problem with weather, and why does it take 10x the resources to predict the weather a day further out? Right.
And it continues like that for every additional day out, because it's a chaotic system: small perturbations can totally change the final estimate as time goes on. I guess you would expect that for Go and protein folding as well. So here's an analogy to weather that might be relevant in Go. Take the problem: here's our current board state. Yeah. Given what we know about both players, what is the exact board state in the future? Yeah. This is extremely sensitive to initial conditions. Yeah. A single stone placed here can disrupt the entire prediction, right? So this is hard; intuitively, it's the chaotic problem. And yet somehow we can predict who's going to win. Yeah. And that captures a lot of possibilities.
So there's this more macroscopic quantity that we really care about: the average, or expectation, or some sort of global macro structure over a lot of possible futures. That's an interesting way to think about it. In weather it could be the same thing, right? We don't exactly care what the wind velocity is 6,000 feet above a specific latitude and longitude. We care where the hurricane is, things like that. And in chaos there's the classic Lorenz attractor, which kind of looks like this, right? If you start anywhere on the Lorenz attractor, you don't know where you're going to end up, but you do know that the thing looks like this. Yeah, yeah. Right.
So there's this kind of beauty: sometimes we don't care about the microscale things, we care about the macroscopic structure. That's interesting. And these things can be predictable. Contrast that with something like a hash function, which is also incredibly dependent on initial conditions but doesn't have a macro structure. Or at least hopefully, if the arguments work. Yes, one would hope. So there's no equivalent there of a value function, or of "broadly, how's the weather going to be?". It's really just about what the board is going to look like exactly 100 moves from now. Yes, intuitively that seems correct.
Again, this is out of my area of expertise, but I find it interesting that the tools of cryptography and hashing have not been able to prove that you cannot come up with fast approximations. Right. Yeah. If they were able to do that, then you could prove P is not equal to NP. Yeah, yeah. In fact, we know there's structure in many cryptographic protocols, RSA being the obvious example, and that structure is what quantum computers exploit to break them, right?
I see. Reiner has a very interesting blog post, which we talked about in the episode, about how, at a high level, cryptographic protocols and neural networks look extremely similar: you have sequential layers jumbling information together. That's because there's convergent evolution in the algorithms. In cryptography, you want the final state to be incredibly sensitive to initial conditions, so that it comes out looking jumbled if you change anything. In neural networks, you similarly want everything to depend on all the information, because you want to process all of it and consider how it relates to itself.
Yeah, you get the maximum power of a neural network at the edge of chaos. I think there are some research papers from Jascha Sohl-Dickstein on this. Yeah. There's something quite fundamental about chaos: it's not just hopeless noise. There's something useful in chaotic systems, at least at that boundary. But this is just my philosophical take; I don't know the math well enough to comment on it. Anyway, we'll talk about LLM RL in a little bit, because there are some connections there. But let's go back to MCTS. What is it doing? Crucially, it is not saying we're going to increase the probability of winning directly.
It's not saying we're going to upweight all the actions in games that won and downweight all the actions in games that didn't. Yeah. What it is doing, importantly, is saying: for every action we took, we did a pretty exhaustive search with MCTS to see if we could do better, and we're going to make every action better by having the policy network predict that search outcome instead. This is a very nice idea because you have one supervision target for every single action. Yeah. So the variance of your learning signal is very low compared to the naive RL alternative.
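To make "one supervision target for every single action" concrete, here is a small sketch of how per-move training examples might be assembled from a finished self-play game. The record layout `(state, player, visit_counts)` and both function names are hypothetical. Normalizing root visit counts, optionally with a temperature, is the scheme the AlphaZero papers describe, but this is not presented as the code discussed in the episode.

```python
def visit_counts_to_policy_target(visit_counts, temperature=1.0):
    """Normalize MCTS root visit counts into a move distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    if total == 0:
        raise ValueError("search produced no visits")
    return [s / total for s in scaled]


def build_training_examples(game_record, winner):
    """Attach a search-derived policy target and an outcome label to every move.

    game_record: list of (state, player_to_move, root_visit_counts) tuples
    winner:      identifier of the player who won the game
    """
    examples = []
    for state, player, visit_counts in game_record:
        policy_target = visit_counts_to_policy_target(visit_counts)
        outcome = 1.0 if player == winner else -1.0  # from the mover's point of view
        examples.append((state, policy_target, outcome))
    return examples
```

Every position gets its own dense policy target regardless of who won, which is why the learning signal does not depend on untangling which moves in a winning game actually mattered.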
So let's consider a very naive algorithm that looks a lot more like modern LLM RL, where we do something like take the winner of a self-play game and encourage it to do more of that. It's worth thinking a bit about what alternatives to MCTS we could use to train self-play agents, right? We use a lot of LLM RL these days; is that relevant? Could we do that instead?
So let's think through this. Suppose we have a very naive algorithm where we take a league of agents from different checkpoints and play them against each other. For the games a given player wins, we reinforce those actions and retrain the policy network to imitate them, instead of using the MCTS objective. What ends up happening? Let's say you have a chain of actions that led to a win, and a matchup between two agents that are basically the same.
In fact, let's just assume policy A and policy B are evenly matched, so their true win rate is 50%. Say you play 100 games, and each game lasts 300 moves. You're doing some sort of evolution strategy, or some other way of perturbing these policies to get them to do different things. Or maybe you don't, and you just play them against each other. And you see that occasionally this one might actually have a better strategy than that one, right?
So let's say policy A wins 51 games and policy B wins 49. This is just due to random luck, or maybe you perturbed policy A in some way that let it do this. To keep the model very simple, let's pretend that for 50 of the games they played exactly equally, and on the one game where this one won, it played slightly differently.
It made one critical move that it would normally have played differently, but due to some exploration or random noise it just happened to make a smarter move than before. So you have one true supervision signal for your policy network. And then you have 99 games times 300 moves for which imitating the actions gives you exactly the same policy you had before.
So the scale of your variance is actually very bad, because you only have one useful label in this enormous dataset of supervised actions. Actually, sorry, let me clarify a little. Okay, so we're talking about how the good move, the out-of-distribution move, is a small fraction of all the moves played across all the games you'd want to train on.
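The arithmetic behind this toy example is worth spelling out. The sketch below uses only the numbers given in the conversation (100 games, 300 moves per game, one genuinely better move); the function name is made up for illustration.

```python
def informative_label_fraction(num_games, moves_per_game, informative_moves):
    """Return (fraction of move labels that carry new signal, total move labels)."""
    total_moves = num_games * moves_per_game
    return informative_moves / total_moves, total_moves


fraction, total = informative_label_fraction(
    num_games=100, moves_per_game=300, informative_moves=1
)
print(f"{total} move labels, of which a fraction {fraction:.2e} is informative")
```

Under these assumptions, one label in 30,000 carries new information, and the rest just reinforce the policy you already had.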
And this, of course, reminds me of how LLMs are trained with policy gradient methods. Karpathy, when he was on the podcast, called it sucking supervision through a straw. So it's interesting that the thing you're describing, which would be intractable and would prevent you from getting beyond a certain level in Go, is by default how LLMs are trained, question mark?
Right. This is not to say it doesn't work. If you imagine increasing the number of games to millions of samples, you actually can get some meaningful supervision, so long as you find a way to mask out the supervision from the uninformative samples. And this is where things start to connect to RL concepts like advantages and baselines.
Yeah. So let's look at the gradient variance of a very naive approach like this, which I'm just going to call gradient RL. It's basically the sum of rewards. Okay, I see what you're saying. The sum of rewards is the return, right? In our naive setup here, the return is just an indicator variable: either you won or you lost.
In the case where you lost, your gradient is zero, so you just don't train on those examples. When you won, you try to predict those moves, right? So you can think of this setup as a special case of the general formula here. The trouble is that it has very high variance. The variance of the gradient is equal to the expectation of its square minus the square of its expectation, and for simplicity we can pretend the expectation is zero on average, as if you're centering it at no signal. So the variance basically means taking the square of this product term, and you end up with a term that grows quadratically with T. In a setup like this, the return term acts as a coupling effect on top of these per-step terms.
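The formula on the blackboard is not in the transcript, so the following is a reconstruction under stated assumptions: the standard score-function (REINFORCE) estimator, a scalar parameter θ for readability, and the "pretend the mean is zero" simplification used in the conversation.

```latex
% Naive policy-gradient estimator over a T-step episode
\hat{g} = R \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t),
\qquad R = \sum_{t=1}^{T} r_t

% Variance, with the mean treated as approximately zero
\mathrm{Var}(\hat{g}) = \mathbb{E}[\hat{g}^2] - \big(\mathbb{E}[\hat{g}]\big)^2
\approx \mathbb{E}[\hat{g}^2]

% Squaring the product produces a double sum over time steps
\mathbb{E}[\hat{g}^2] = \mathbb{E}\Big[ R^2 \sum_{t=1}^{T} \sum_{t'=1}^{T}
  \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
  \nabla_\theta \log \pi_\theta(a_{t'} \mid s_{t'}) \Big]
```

The double sum has T² terms, all multiplied by the same R². Because that shared factor couples the per-step terms, the cross-terms need not cancel, which is one way to read the claim that the variance can grow quadratically with T.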
So let's map this to the LLM case, and we can answer why LLMs do one-step RL instead of multi-step RL. In LLMs you have a decoder that might predict some words, like "hello world". Current LLM RL treats this entire sequence as a single action: there is just one a_t, and big T is one. And yes, because transformers are formulated as a product of conditional probabilities, the log probability of the whole sequence is equal to the sum of the log probabilities of the individual tokens. So in this case I would write something like log p(hel) plus log p(lo) plus log p(world). So this is true, and if this term were one, they would be the same thing.
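The identity being used here, that a product of conditional token probabilities becomes a sum of log probabilities, is easy to check numerically. The token probabilities below are made-up numbers, and `sequence_level_surrogate` is a hypothetical name for the "whole sequence as one action" objective, whose gradient with respect to the model parameters would be the single-step policy-gradient term.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of per-token conditional log probabilities."""
    return sum(math.log(p) for p in token_probs)


def sequence_level_surrogate(token_probs, reward):
    """Single-action view: one reward multiplies one sequence log probability."""
    return reward * sequence_log_prob(token_probs)


# Assumed conditional probabilities for three tokens, e.g. "hel", "lo", " world"
token_probs = [0.5, 0.8, 0.25]
log_p = sequence_log_prob(token_probs)
joint = math.exp(log_p)  # equals the product of the conditionals
surrogate = sequence_level_surrogate(token_probs, reward=1.0)
```

With T = 1 there is a single log-probability term and a single reward, so no per-token reward terms get multiplied against each other.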
However, when sampling, if you have a reward term assigned to every specific token, you now have interaction effects from the cross-multiplication of these terms with those terms. Right. So the problem becomes: how do you ascribe the credit associated with each episode to all these different terms? I guess what I'm confused about is what it would even look like to do it that way in LLMs, because in LLMs you only get a reward at the end of the episode. You could imagine a reward that gives you some process supervision, yeah, where you get a reward for each of these actions at every step. Okay. So instead of doing it that way... well, I guess the way you've written it, it would be a sum at the end anyway.
So they wouldn't have to be multiplied. You're saying that instead of doing it that way, you would just add up the process rewards at the end and treat that as one single reward signal? Correct, for one single log-prob action. I know, but isn't that how it's written to begin with, as the sum of the rewards? The thing that's a little hidden in the math is that when you decompose the problem into a multi-step problem, you introduce correlations between your actions through the computation of this term. So if you separate these things out, this will magnify the variance of that one.
All right. In the case where you don't separate it out, where you just have T equals one, you have a single estimate of the log prob and a single estimate of the reward. This term still shows up in LLMs: the naive REINFORCE estimator looks like the return of the single action times the gradient of its log probability. That's the very basic form, but it's still a contributor to variance. Similar to how in the Go case we were training on a lot of neutral labels, you want to make sure you're subtracting out the labels that don't help and only rewarding the ones that actually make you better.
Right. So intuitively, the analogy is: can we find a term in our training objective such that these moves don't have any effect on the gradient, and this one does? Right. I guess if you apply that here, the only thing you could do is eliminate 49 of the games, so the way you have it written there, it would be 51 times... Actually, the optimal case is to discard all of these moves and only get a gradient on the single move where you got better. Yeah, but how would you do that? Right, this is a pretty tricky problem in practice.
And this is where advantage estimation comes in, in reinforcement learning. Instead of an indicator function of one and zero, you want to subtract a term from your multiplier, so that you get something that behaves like a zero for all of these moves and a one for those. You could do that if you can say, hey, I won this game, so this is slightly above baseline performance. Well, you won a lot of games. Exactly, but you don't know which ones you won because you were truly better versus winning by accident. Right. So how would you design a baseline that tells you what's truly better? Yeah, this is where people in RL use things like TD learning to better approximate the quality function, the Q we mentioned earlier, and you can try to subtract that from your return. I see. So ideally, what you want to do in RL is push up the actions that make you better than average, yeah, and push down the actions that make you worse than average. Right. And they call this the advantage.
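A tiny worked example shows why subtracting a baseline helps. It is a sketch under simplifying assumptions: a one-step problem with two actions, a Bernoulli policy with p = sigmoid(θ) so the score is (a − p), and made-up rewards where both actions usually "win". Mean and variance are computed exactly by enumerating both actions rather than by sampling.

```python
def score_function_gradient_stats(p, rewards, baseline):
    """Exact mean and variance of g = (a - p) * (r(a) - baseline), a ~ Bernoulli(p)."""
    action_probs = {1: p, 0: 1.0 - p}
    grads = {a: (a - p) * (rewards[a] - baseline) for a in (0, 1)}
    mean = sum(action_probs[a] * grads[a] for a in (0, 1))
    second_moment = sum(action_probs[a] * grads[a] ** 2 for a in (0, 1))
    return mean, second_moment - mean ** 2


p = 0.5
rewards = {1: 1.0, 0: 0.9}  # assumed: action 1 is only slightly better
average_reward = p * rewards[1] + (1.0 - p) * rewards[0]

mean_plain, var_plain = score_function_gradient_stats(p, rewards, baseline=0.0)
mean_adv, var_adv = score_function_gradient_stats(p, rewards, baseline=average_reward)
```

Both estimators have the same expected gradient, but the raw-return version spends almost all of its magnitude on the shared part of the reward, while the baselined version keeps only the difference between the actions. That difference from average is the advantage.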
There are multiple ways to compute it. I highly recommend John Schulman's Generalized Advantage Estimation paper as a good treatment of how to think about the various ways to compute it. But at the end of the day, you want to reduce variance by trying to make this smaller, so that it doesn't magnify the variance of this one. Right, makes sense. But this requires you to have a very good estimate of what average performance from a state would look like. Yes. And this gets us back to the value function thing we were talking about earlier. Right. And keep in mind that in this case, the model-free RL setting is trying to solve a credit assignment problem where you don't know which actions were actually good. Yeah. And which ones were bad.
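As one illustration of "various ways to compute it", here is a small sketch of the generalized advantage estimator for a single finished episode. The function name, the default `gamma` and `lam`, and the toy numbers are assumptions for illustration, not taken from the conversation.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one episode.

    `values` has one more entry than `rewards`: the value estimate of every
    visited state plus the state after the last step (0.0 if terminal).
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error: how much better this step went than the critic predicted.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# Toy episode: a win (reward 1) on the last of three moves, critic predicts 0.5 everywhere.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
print(gae(rewards, values, gamma=1.0, lam=1.0))  # Monte Carlo return minus value
print(gae(rewards, values, gamma=1.0, lam=0.0))  # one-step TD errors only
```

With `lam=1.0` every move inherits the full surprise of the final win. With `lam=0.0` only the step where the critic was wrong gets credit. Values in between trade variance against reliance on the critic.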
Monte Carlo Tree Search is doing something fundamentally different, which is that it's not trying to do credit assignment on wins. It's trying to improve the label for any given action you took. And so we can actually think about a completely different algorithm called neural fictitious self-play, which was used to great effect in systems like AlphaStar and OpenAI's Dota. So let me talk a little bit about how you can unify some of these RL ideas in the model-free setting as well as the self-play setting. Okay. So what happens if you don't have the ability to easily search a tree, right? Go is a perfectly observable game. You can easily construct a pretty deep tree that completely captures the game state. In a game like StarCraft, where you don't really have complete control over the binary, it's a little bit hard to do this. I'm not even sure if it's a deterministic game, right? So that makes this kind of difficult from a data structures perspective.
So what is done instead is that the basic idea of supervising your actions with a better teacher is still there, right? So we're going to talk a little bit about how neural fictitious self-play works. Same idea: we're going to come up with better labels for each of the actions we took, just like in MCTS. But how do we derive the better labels? In MCTS, we perform search, and assuming we have a good value function, the search will give us a better result than our initial guess. In a game where you can't easily simulate a search process, what they do instead is train what is known as a best response policy. So you fix your opponent. Let's say you're currently training pi A against a strong opponent pi B. In StarCraft, maybe these are the Zerg and you're playing Protoss or something.
So you fix your opponent and you treat this as a classic model-free RL problem where your goal is just to beat this guy. And so here you use your standard TD learning style tricks, or use PPO, or really any model-free RL algorithm, to try to hill climb on winning against this player. Basically you have a reward function where the return is one if you win against pi B, and zero otherwise. So this is no longer a self-play kind of problem, right? This is just a fixed opponent, and you're just trying to maximize a score against them. So you have a sort of fixed environment where all you care about is just beating this guy. And then you train a good policy with your favorite model-free RL algorithm: PPO, SAC, V-MPO, or whatever.
You now have a good policy that gives you a good label for what this one should do when playing against that player. And when you train multiple best response policies, you can basically then distill the RL algorithms into the labels for a given opponent. So you might have, let's say, a best response policy against pi B. And then maybe you have a league of opponents like pi B, pi C, pi D. And you're going to take the best response policy that you train against each of these fixed opponents. And for this one, you're going to supervise it with the label that this one would provide. So this is almost like a proxy for your MCTS teacher, right? Instead of an MCTS teacher, you use a model-free RL algorithm to find the best action you could take to beat your opponent.
And then finally, you're distilling the policy here into what is known as a mixed strategy, where it's trying to basically average across all possible opponents you could play against. And this is what gives you something that can do no worse than an average opponent selected from the league. And so this gets around the problem of having to derive a teaching signal from MCTS. But it's still fundamentally about relabeling your states with better actions so that they improve your policy. And just to make sure I understand this: if you win against this other policy, you sort of reinforce all the actions.
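The best-response-then-average loop can be sketched with classical fictitious play on rock-paper-scissors. This is a deliberate simplification and not the neural version described above: an exact argmax over a payoff matrix stands in for the RL-trained best response, and a running count of actions stands in for the distilled mixed strategy. The payoff matrix, the starting counts, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

# Row player's payoff for (rock, paper, scissors) against each column action.
PAYOFF = np.array([[0, -1, 1],
                   [1, 0, -1],
                   [-1, 1, 0]], dtype=float)

def best_response(opponent_mix):
    """Pure action with the highest expected payoff against a fixed opponent mix."""
    return int(np.argmax(PAYOFF @ opponent_mix))

def fictitious_self_play(steps=20000):
    """Repeatedly best-respond to the running average strategy, then fold the response in."""
    counts = np.array([2.0, 1.0, 1.0])  # start from a rock-heavy average strategy
    for _ in range(steps):
        average_strategy = counts / counts.sum()
        counts[best_response(average_strategy)] += 1
    return counts / counts.sum()

print(fictitious_self_play())  # drifts toward the uniform mix over the three actions
```

Each individual best response is a pure, exploitable strategy. It is the average over all of them that approaches the mixed strategy no fixed opponent can beat, which is the role the distilled policy plays in the league setup.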
Yes. On that trajectory. Yes. So here you can use a number of algorithms like PPO. Yeah. Q-learning even, if you want. The specific algorithm here is usually a model-free thing because you don't have search. But there's an interesting connection between MCTS and Q-learning that I want to bring up. So in MCTS you do something where you have a tree. And through the resolution of your value function at the leaves of the tree, or your approximate leaves of the tree, you can back up through many sequences and then obtain some sort of mean value estimate, right? Your Q is derived from the average of a bunch of simulations.
In model-free algorithms, there is often a component of estimating a Q value. And Q values are often learned through TD learning. Although in PPO, the way that they do advantage estimation is not necessarily through a Bellman backup. But in Q-learning, there's this very cool trick where Q(s, a) is backed up as r plus some discount factor times the max over a of Q at your next step. So intuitively, how this works is: if you have an MDP, and this is terminal, what this is saying is that the value of the best action you can take at this state is equal to the reward you get for taking this action plus the best that you can do at the next state.
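Here is a small sketch of that backup on a made-up four-state chain. The environment, the discount of 0.9, and the choice to overwrite each entry with its target (a step size of one, swept over every state and action) are assumptions that keep the example deterministic. Practical Q-learning would use sampled transitions and a smaller step size.

```python
import numpy as np

N_STATES, GAMMA = 4, 0.9  # states 0..3 in a row, state 3 is terminal
LEFT, RIGHT = 0, 1

def step(s, a):
    """Deterministic chain: reward 1 only on entering the terminal state."""
    s_next = max(s - 1, 0) if a == LEFT else s + 1
    done = s_next == N_STATES - 1
    return s_next, (1.0 if done else 0.0), done

def q_backup_sweeps(sweeps=50):
    q = np.zeros((N_STATES, 2))
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            for a in (LEFT, RIGHT):
                s_next, r, done = step(s, a)
                # Q(s, a) <- r + gamma * max over a' of Q(s', a'), with no bootstrap at terminal.
                q[s, a] = r if done else r + GAMMA * q[s_next].max()
    return q

q = q_backup_sweeps()
policy = q[:-1].argmax(axis=1)  # greedy policy recovered from the Q values
print(q[:-1])
print(policy)
```

The value of moving right decays by one factor of the discount per step away from the goal, which is the knowledge at the terminal state propagating backward. Taking the argmax over each row recovers a policy without modeling one explicitly.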
So there's a sort of recursive, dynamic programming property of MDPs. And you can train neural networks to basically try to enforce this consistency, right? So you can say, well, once I know the Q value of this action, I can then use that to compute something about the Q value. So when earlier I was like, hey, why are we training a policy? Why don't we just train the value alone? That is what this is. This is an algorithm for recovering value estimates of intermediate steps when you don't have the ability to do forward search. So you must collect a trajectory of n steps first before you're able to do this trick.
But the intuition is kind of the same, which is that knowing something about the Q value here can tell you something about the Q value here. And indeed, you can recover a policy from a Q value, right? So you don't need to explicitly model the policy distribution. You can actually recover the policy distribution by doing an argmax over your Q values. Right. So Q-learning, or this kind of approximate dynamic programming, propagates what you know about the future Qs backward like this, right? And you can see that there's a similar structure that goes on here where, in this case, you're planning over trajectories your agent hasn't actually been to yet.
Whereas in this case, you're planning over trajectories your agent has visited. So importantly, why was Q-learning a big deal, right? It's because historically, we just haven't had the ability to do search on fairly high-dimensional problems like robotics. So for a long time, we kind of made the assumption that, okay, if we can't model the dynamics with a world model or something, we're going to instead just collect trajectories and then plan with respect to the only number that really matters, which is reward. Okay, so this is very interesting.
And then to unify this with our discussion of LLMs. So with LLMs, you don't have Q values, but you're doing this sort of backwards learning where, hey, let's find the trajectories which pass some unit tests in some coding environment, and then let's reinforce those trajectories. And then there's a huge difference between that and this forward approach with MCTS. And it's much more preferable to do MCTS where you can, because you can do it per move and make each move better rather than having to learn per trajectory.
And hope, as Karpathy said, to learn this... Through a straw. Yeah, so you get the supervision through a straw. Basically, you just upweight all the tokens in a trajectory, which might or might not have been relevant to getting the answer right. So the reason you can do this much more sample-efficient, much more favorable thing with Go is that, because MCTS works in Go, you basically know that, hey, I can just do search locally here, and this search is truncated at the end by this value function that works, even if I haven't unfolded my whole trajectory.
I can just say, this is my new policy, and I can improve in a more iterative, local way rather than having to unfold all these trajectories. So there was some research, I think, from Google in 2023, 2024, where they did try to apply tree structures to reasoning. Yeah. And I think the jury's still out as to whether this can ever work. So I would say we probably will see a revisiting of this idea of forward search in the future.
But there are two things that make MCTS very simple for Go. One is that value estimation is kind of concrete, and you can determine it for real. Yeah. And then you can use it to truncate depth, as you said. Yeah. And then the breadth is also determined. And what's critical is that the action selection algorithm, where you iteratively visit and grow the tree, is well suited for the size of problem that Go is.
Yeah. And the depth of the problem. But for something like LLM reasoning, PUCT might actually not be a good enough heuristic. Right. It might be too greedy with local tokens, and it might do something like only give you sort of obvious thoughts that are correct, but not really solve your final problem. Yeah. So I would say the jury is probably still out on what the final instantiation of reasoning for LLMs would look like.
And I wouldn't rule out that this stuff could come back. But it's been hard. Don't LLMs sort of natively learn to do MCTS, where they'll try an approach and be like, oh, that doesn't work, let's back up, let's try this other thing, and then go in the direction that proves to be more fruitful? Yeah, certainly I think that LLMs manage to do something that looks like real human reasoning without having to do an explicit tree structure.
Yeah. That being said, I think the idea of doing forward search and simulation to get a better sense of what is valuable might make a comeback, even though not exactly in the same instantiation as AlphaGo. But just to make sure I understand the crux of it: the breadth problem comes from the number of legal actions being wider, and the depth problem from not being able to train a value function as easily.
Because... So here's an example where it breaks down for LLMs. The PUCT rule involves the square root of N over 1 plus N_a. In an LLM, you're most likely never going to sample the same child more than once, right? So if you have, let's say, multiple steps of thinking, because language is so broad and open-ended, a discrete set of actions is not really an appropriate choice for an LLM.
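A rough sketch of the counting argument follows. The scoring function is the commonly written PUCT form, Q plus an exploration bonus proportional to the prior times the square root of N over 1 plus N_a. The constant 1.5, the flat prior, the all-zero Q values, and the two branching factors are assumptions picked only to isolate the effect of breadth.

```python
import numpy as np

def puct_scores(q, prior, visits, c_puct=1.5):
    """Q + c_puct * P * sqrt(N) / (1 + N_a), with N counted as one parent visit plus all child visits."""
    parent_visits = 1.0 + visits.sum()
    return q + c_puct * prior * np.sqrt(parent_visits) / (1.0 + visits)

def max_child_visits(num_actions, num_sims):
    """Run selection at a single node with a flat prior and no value signal."""
    q = np.zeros(num_actions)
    prior = np.full(num_actions, 1.0 / num_actions)
    visits = np.zeros(num_actions)
    for _ in range(num_sims):
        a = int(np.argmax(puct_scores(q, prior, visits)))
        visits[a] += 1
    return int(visits.max())

print(max_child_visits(362, 800))      # Go-sized move set: children get revisited
print(max_child_visits(100_000, 800))  # vocabulary-sized action set: no child is revisited
```

When the simulation budget exceeds the number of children, visit counts accumulate and the 1 plus N_a denominator starts steering the search. When the children outnumber the budget, every count stays at zero or one and the rule has nothing to work with.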
Yeah. Even though they're discrete tokens. Yeah. It's just such a large number that this type of exploration heuristic is probably not the right thing to do to guide how to search down a tree. Right. But I guess the crux comes down to the fact that in Go, you know that the MCTS is almost certainly better than your current policy, even though you haven't explored to the end of any trajectory.
Correct. And then in normal reasoning for LLMs or robotics, there's no way to just locally evaluate and improve your next move in a way that's independent of actually solving the problem. "No way" is a strong word. I think lots of people have thought about how to try to apply MCTS or its successors like MuZero to continuous control spaces. Yeah. And I'm sure very cool research work is still ongoing to try to crack that problem. But yes, the seeming challenge right now is that most problems in much higher-dimensional action spaces, or something that's combinatorially much bigger like language, don't seem as amenable to the kind of discrete action selection heuristics and game evaluation type stuff that Go allows.
But that's not to say the idea of thinking into the future along multiple parallel tracks might not give you some information about which way to search, right? If you think about mathematics, I think mathematics often occupies a little bit more of a logical search kind of procedure where you can back up. You can see which paths seem good or not. There's more of a rigid structure there, whereas maybe in a business negotiation or something, it's less of a tree and maybe something a bit different.
Okay, so we're now sitting, so I can ask you some more questions about AlphaGo and about AI research more generally. In 2021, Andy Jones had a paper called Scaling Scaling Laws with Board Games. And he basically anticipated inference compute, or inference scaling, by showing that you can trade off test-time compute and training compute. That is to say, you can spend more compute on the forward part, the searching through the MCTS. And if you do that, you can get performance equivalent to having spent more time training the model. And so if you see this pattern, you might think, okay, well, with LLMs, you might do something like that in the future. In fact, that's what ended up happening.
Okay, so what is a fun exploration one could do now to explore other axes of scaling in toy settings, which will be important to understanding what AI development might be like in a few years? Sure, yeah. I think that, indeed, test-time scaling and reasoning and how they interact with model size are quite profound when it comes to how much needs to actually be done as explicit search versus how much can be packed into the forward pass of a neural network, right? And how does a forward pass of a neural network learn to do something that should be a sequential and recursive step? That's quite interesting.
Yeah. So, yeah, the Andy Jones scaling laws for board games paper is quite cool. There's another really nice result from that paper where he showed that not only can you predict scaling laws of the sort of LLM variety where as you increase parameters, you can decrease the amount of compute for search or vice versa. He also showed that you can actually predict how much compute is needed to solve a larger version of the board game, for example. And so with Go, which can scale from three by three to an infinitely sized Go board, you might actually be able to revisit this question and try to reproduce whether this shows up.
You know, I actually started this project with this sort of motivation: does the bitter lesson, or does our knowledge of scaling laws, allow us to execute a lot better on a compute-optimal Go bot? Mm-hm. And can we build a strong Go bot without all of the KataGo tricks, right? Just by really focusing on the bitter lesson and the scaling laws. Yeah. I have not been successful so far, but I think it's sort of a fact that usually when you want scaling laws to work, you want to be in the regime where the recipe already works and the data sets are good, rather than trying to figure out how to do scaling while also trying to figure out what the right data set is.
Okay. So the scientific understanding component in research often follows a step where you get something to work first, and then you use that system to collect data that then helps you build a mental model of how things work, such as scaling laws, right? And so usually, if you want to build a strong Go bot using scaling laws, you actually have to make a strong Go bot first and then use the scaling laws to extrapolate a bit farther into the future. Say more, just so I understand. First of all, are you saying scaling laws did not work, or that there was no scaling laws pattern that you could see in your Go bot?
Yeah. So a mistake I made initially, when I had some bugs around how MCTS labeling was working, was that I would collect a bunch of data with an expert policy and then treat it as a supervised learning problem and try to identify scaling laws with expert data sets. So you can indeed plot things that look kind of like this, but if you're in a regime where your policy is not working well, you might be just studying scaling laws on bad data, right? So one important implementation detail is that if you want to study a scaling laws problem, you have to have a problem for which the data is good, the architecture is good, and there are no bugs, and then you solve it there. Ex ante, I wasn't able to apply scaling laws to direct what to look at until I had the rest of the system working.
And this sounds obvious to researchers: of course you want to have a working, bug-free system before you study scaling. But just as advice for practitioners, where I actually tripped up when I started this project was that you don't necessarily want to jump into the science of studying your man-made artifact before your man-made artifact is interesting enough to be studied. Speaking of compute, you can look at these charts of compute used to train the best AI model in the world over time, going back 10 years. And it's a very smooth line in log space that is exponentially growing year over year. Except there's this huge aberration, and that aberration is AlphaGo Zero, which was trained on way more compute than any other AI model at the time. It was like 3e23 FLOPs. It's sort of comparable to a frontier LLM. I mean, orders of magnitude off, but still.
And so, yeah, the question is, especially with you being able to get something off the ground, did you train it on your own? I got a donation from Prime Intellect for about 10K, and then I spent maybe the first 4K doing exploratory research. Yeah. And then about 3K on the final run. Yeah. And then some of it remaining for serving the model. Cool. Yeah, is there a sense that they just did a bad job of training it, if you can do it in 10K now? The compute required to be the first to do something is always much larger than the compute it takes to catch up. And it's the same story playing out in LMs, right? Once someone else has done it, you could use tricks like distillation. You could use all sorts of crutches to bootstrap your way to success.
So with my own bot that I've hosted online, I actually used best response training against the KataGo models to get a strong level of performance. And as of the time of recording, I'm validating whether I can do that first step, which is to do it tabula rasa, right? Yeah. But importantly for research, you often want to start from a good init, right? So the simple thing I did first was train best response agents against KataGo. Yeah. The AlphaZero team did not have any policy that they could train against, right? Because they were trying to do everything tabula rasa.
Yeah. And being the first to do it means that you're prioritizing getting the thing working rather than, let's say, the most compute-efficient possible implementation. So this actually plays out in robotics as well. If you look at the frontier of large models trained for robotics, the scatter plot is all over the place and there isn't a very clean line the way that there is for frontier LMs. And that is because the folks training these models often are not at the scale where every FLOP counts and they need to squeeze out the performance of every single FLOP as the dominating, deciding factor in pre-training, right?
Instead, their focus is more like: we want a certain capability to show up, so we optimize the training setup to make it easy to derive that capability. And once you have that capability, invariably if you scale up the compute, you are forced to make it compute-efficient, because this is hundreds of millions of dollars we're talking about. But in the past, when compute for experiments was more plentiful, or not accounted for in a way that the researcher was really responsible for, you end up with people optimizing for things besides being on the compute-optimal Pareto frontier.
I see, like speed or something. Yeah, like time to result, or just getting it to work. I think with the first AlphaGo, they probably had lots of compute and they didn't need to worry too much about making it the most compute-optimal thing. Yeah, and how much of the improvement in compute efficiency comes from methods that did not exist as of 2017 versus things which they could have done in 2017? Yeah, great question.
So, going into this project, I kind of knew in the back of my mind that things always get easier to do over time. And I wanted to see where Go is at, given that it didn't seem like there had been any major open-source strong bot after KataGo in 2020. And then, reading the KataGo paper, there are a lot of clever ideas. I was wondering, okay, let's see if the bitter lesson has happened, where a lot of these tricks just sort of go away because NVIDIA made faster GPUs, right? And so, roughly, where are we on that?
So, again, this is not a peer-reviewed claim. This is just my preliminary vibe guess based on what I've seen in my own experiments. But it seems like architecture choices don't matter that much. Transformer versus ResNet: we're at the sort of GPU speed where the size of the model is not so big that this really matters. You can actually simplify this setup quite a lot.
So, instead of doing a distributed asynchronous RL setup with replay buffers and pushers and collectors, you can do a dumb synchronous thing where you collect, you just train a supervised learning model, and then you collect again. And so there are opportunities to simplify infrastructure. NVIDIA GPUs have indeed gotten faster. So, whereas KataGo was trained on V100s, you can train on half the number of desktop Blackwell GPUs, and it still works. And some of the auxiliary supervision objectives that KataGo developed aren't really necessary if you have a strong initialization.
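The "dumb synchronous thing" amounts to a plain loop. The sketch below uses hypothetical `collect` and `train` callables and toy stand-ins so that it runs. None of these names come from the project being discussed, and a real version would run self-play with search inside `collect` and a supervised fit inside `train`.

```python
def synchronous_loop(policy, collect, train, rounds):
    """Alternate blocking data collection and supervised training, one after the other."""
    for _ in range(rounds):
        data = collect(policy)        # play games with the current policy and wait for all of them
        policy = train(policy, data)  # fit the next policy to the search-improved labels
    return policy

# Toy stand-ins so the sketch runs: the "policy" is just a count of games trained on.
def toy_collect(policy):
    return ["game"] * 8  # pretend each round yields 8 labeled games

def toy_train(policy, data):
    return policy + len(data)

print(synchronous_loop(0, toy_collect, toy_train, rounds=3))  # prints 24
```

Because collection finishes before training starts, every example in a round comes from the same policy, and there is no replay buffer, queue, or staleness bookkeeping to maintain. The cost is that the GPUs doing training sit idle during collection, and vice versa.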
Right? So, if you're initializing with best response training against KataGo itself, then your own model actually needs none of the tricks that KataGo needs. Yeah, yeah. So then the core thing is, how can you get as quickly as possible to some strong opponents? And that matters a lot more than the specific architectural innovations. But there are still some nice compute multipliers. So, I found that training on 9x9 boards was very nice for resolving end game value functions. And then, if you can co-train that on an architecture that can transfer between 9x9 and 19x19, you can really cut down the warm start time compared to learning that from scratch.
I think in AlphaGo Zero's plot, the first 30 hours or so are spent basically catching up to the supervised learning baseline. You can cut that time down a lot by pre-training on a small board and then warm-starting that into your 19x19 play. There was some other stuff, like varying the number of sims between episodes. This turns out not to be that sensitive, actually. You can fix it or increase it; it doesn't matter too much.
So anyway, from a scientific perspective it's just nice to revisit an old paper and see what really matters. Wait, this is sort of a tangential question, but why is it okay to have a buffer in AlphaGo? Every time I talk to an AI researcher, they tell me how bad it is to be off-policy. But in a naive implementation of AlphaGo Zero, most of the moves in a given backward step, or in a batch of backward steps, would not be ones made by the most recently trained model. So why is that okay?
Great question. This gets into the fundamental off-policy versus on-policy questions in reinforcement learning. As you recall, in MCTS you take the actions you took and relabel them with different actions on the same states. The off-policy part comes in here: what if you're relabeling states that your new policy would never visit? What's the point? You're kind of wasting capacity.
In the extreme limit, imagine that the states in your training buffer are all states you would never visit. Then you're basically supervising the policy to take good actions on states you would never reach, and therefore your policy can get really bad. So this is where off-policy can really hurt AlphaGo. However, if you interpret this from the DAgger perspective, which is basically a way to correct yourself back to the optimal trajectory given some data, what you want in an algorithm like this is mostly states that you would visit.
But then you have a small percentage, or maybe a reasonable percentage, of states in a kind of high-dimensional tube around your optimal trajectories. Each of those states is given a supervision target that funnels you back into your optimal trajectory. Maybe I can just draw this quickly. In a DAgger-style setup, your optimal training data distribution looks like this. Here are your optimal states and actions: you want to be in this state, then this state, then this state, and then you win here. And these are your optimal policy actions. These are the things you definitely want to train on. But to make it robust to disturbances, you want to make sure that if you happen to drift off into some other states, you can funnel yourself back into-
But why isn't this a fully general argument for off-policy training? This is actually why you want to do off-policy training sometimes: you don't want compounding error, where you make a mistake and have no data on how to return to your optimal distribution. Optimal control doesn't say much about how to avoid accidentally ending up here, because it assumes that once you learn the policy, you're going to get here. But in applications like robotics, a gust of wind blows you slightly off and now you need to correct. Or the friction on one of your tires is a little lower than on the others, and now your car is drifting and you've got to correct it.
These kinds of things often happen in more real environments. Actually, there's a funny quote about chess and Go: the problem with Go and chess is that the other player is always trying to do some shit. So things can drift off, and you always want to be able to correct back to your winning condition. So your replay buffer really should have the states your policy would visit, plus some distribution of states you might drift to, along with how to return to your optimal states.
Now, take this to the extreme and say we don't have any of this data, and we're just going to use MCTS to label states that are far away from our optimal behavior, like this bag of states over here. Each of them gets an MCTS label, and your policy learns how to take the best possible action here, but you never get here. You're training your model on states you would never reach. So this is a problem, and this is where off-policy can really hurt.
Actually, as part of this project, I did try an experiment where I took a bunch of trajectories and tried to saturate the GPU as much as possible. I took random states from the dataset and re-ran MCTS on just those states. So instead of playing a whole game where I'm doing MCTS on every move, I ignore the causality of moves, pick random board states, and label those with my current network. And I might revisit old states that I've labeled before and relabel them with my current network.
In practice, this actually does work. You can take some states that are reasonable and constantly relabel them while you're training. This starts to converge on a very common robotics-like setup. You have your dataset of trajectories, and these are off-policy, offline trajectories. Then you have something like a replay buffer pusher, which pushes transition tuples to the replay buffer. And then you have some job that's continuously replanning what the best action would have been instead of the action that was taken.
In robotics, it's very common to minimize TD error. Your Bellman updater constantly pulls things from here and tries to satisfy Q(s, a). Then from here you have your trainer, which tries to fit s to a, or fit Q to the Q target. You can think of this as a sort of planner: you revisit old states you've been to, take your current model, and rethink what you could have done better if you visited this state. This is how off-policy robotic learning systems are usually trained.
These days there's a simpler recipe, but in the Google QT-Opt days we did things like this. So what is the trainer? The trainer tries to minimize the difference between Q(s, a) and the Q target. Wait, can you explain the whole setup again, at a high level? Yeah. You have your off-policy data that came from various policies. You're constantly pushing transitions you saw before to a replay buffer. Then you've got this thing called a Bellman updater, which basically replans: instead of this action, what action should I have taken at s to get a better value? And the way you enforce that is by minimizing the TD error.
Given this, you have s'. You compute Q at s' and find the action at s' that makes this Q value as high as possible. Then you add that to the reward here, and that gives you your actual target. So for this current s and a, your Q target is this. Now you send the Q target back to this transition, so this tuple is paired with a Q target. Then on the trainer, you simply use supervised learning to minimize the difference between your current network's Q(s, a) and its target. Got it, okay.
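The target computation just described can be written out in a few lines. This is a tabular sketch under stated assumptions, not the system discussed: a dictionary stands in for the Q network, the discount factor `gamma` is an assumption (the discussion only mentions adding the reward), and `fit_step` is a hypothetical name for one supervised step toward the target.

```python
from typing import Dict, Hashable, Sequence, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def q_target(q: QTable, reward: float, next_state: Hashable,
             actions: Sequence[Hashable], gamma: float = 0.99) -> float:
    """Reward plus the (discounted) Q value of the best action at s'."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    return reward + gamma * best_next


def fit_step(q: QTable, state: Hashable, action: Hashable,
             target: float, lr: float = 0.5) -> float:
    """Move Q(s, a) toward its target; returns the TD error before the update."""
    current = q.get((state, action), 0.0)
    td_error = target - current
    q[(state, action)] = current + lr * td_error
    return td_error


# One stored transition: (s0, "left", reward=1.0, s1).
q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
target = q_target(q, reward=1.0, next_state="s1", actions=["left", "right"], gamma=0.5)
error = fit_step(q, "s0", "left", target)
```

With a neural network, `fit_step` would become a regression loss between the network's Q(s, a) and the stored target.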
So in the background, you're just thinking through how valuable all these actions actually were: under a more optimal policy where you're trying to maximize this, what is the Q target of this transition? It's sort of like daydreaming. Exactly. You can think of it as going back in hindsight and asking, given what I've seen in the historical buffer, was there a better action I could have taken?
Now, the connection to Go here, which I tried and which was moderately successful but too complex to open source, is that you replace this with an MCTS relabeler. Instead of doing this target network computation, you run MCTS on your transition. In this case, you have your state, your action, and whether or not you won the game. You can actually just toss the last two; you don't care about those. You just take your state and plan with MCTS on your current network to get your best policy, pi.
Not the network that took this action, but your current best policy network: you just rerun your search offline on these transitions. If these are transitions your policy can get to, this acts as a very nice stabilizer. The other benefit is that you can saturate your GPU more fully, because you're not blocking on the Go game to give you board states. You simply search across all board states, at any depth, in parallel.
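A rough sketch of this offline relabeling idea follows. It is an illustration only: the buffer layout, the function name `relabel_random_states`, and the `search_policy` callable (standing in for MCTS run with the current network) are all hypothetical, and the toy search simply returns a uniform policy.

```python
import random
from typing import Callable, Dict, List, Sequence


def relabel_random_states(
    buffer: List[Dict],
    search_policy: Callable[[object], Sequence[float]],  # hypothetical: MCTS with the current network
    batch_size: int,
    rng: random.Random,
) -> List[int]:
    """Overwrite the policy targets of a random batch of stored states."""
    indices = rng.sample(range(len(buffer)), min(batch_size, len(buffer)))
    for i in indices:
        # Only the state is used; the stored action and game outcome are ignored.
        buffer[i]["policy_target"] = list(search_policy(buffer[i]["state"]))
    return indices


# Toy usage with a stand-in search that always returns a uniform two-move policy.
buffer = [{"state": s, "action": 0, "won": True, "policy_target": None} for s in range(5)]


def uniform_search(state: object) -> Sequence[float]:
    return [0.5, 0.5]


touched = relabel_random_states(buffer, uniform_search, batch_size=3, rng=random.Random(0))
```

Because states are sampled independently of game order, each call is a batch of unrelated searches, which is what makes it easy to run them in parallel.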
Then the trainer here would just predict the MCTS label as well as possible. Again, this kind of works, and it's quite relevant in robotics, where you just have a lot of offline data and you can't simulate things the way MCTS does. But in practice, it does run into the problem that if the current model is looking at states it would never reach, it's wasting capacity.
So you have to be a little bit careful here. Much of RL has converged to a much more on-policy setup, where they don't really try to train directly on off-policy data. At best, they use off-policy data as a way to reduce variance, but not to directly influence the objective. I'm sorry, why have they converged to that? It's just more stable. Okay.
So you might use the off-policy Q as a way to do advantage computation: Q minus the average of Q over the N actions. So this is your value, and these are your current Q values. Your advantage for an action is its current Q value minus the average value.
So people can try to estimate Q in an off-policy way and then just use the advantage here. Then if there's a problem in these dynamics, it doesn't blow up your loss as much. In robotics, there's a convergence towards using off-policy data just to shape your rewards, rather than using it directly in the objective.
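As a small illustration of the advantage computation mentioned here, the sketch below subtracts the average Q value over the N actions from each action's Q value. Treating the plain average as the value baseline is an assumption taken from the spoken description; in general the baseline would be the policy-weighted average. The function name `advantages` is hypothetical.

```python
from typing import List


def advantages(q_values: List[float]) -> List[float]:
    """A(s, a) = Q(s, a) minus the mean of Q(s, a') over the N actions."""
    baseline = sum(q_values) / len(q_values)
    return [q - baseline for q in q_values]


adv = advantages([1.0, 2.0, 6.0])  # baseline is 3.0
```

Because the baseline is subtracted, a constant error shared by every Q estimate in a state cancels out of the advantages, which is one way to read the remark that problems in the Q estimates do not blow up the loss as much.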
I'm reminded now of our earlier conversation about why MCTS is so favorable compared to the REINFORCE-style policy gradient that LLMs do. This might be totally wrong, but I wrote a blog post a few months ago about how RL, at least policy gradient RL, is even more inefficient than you might think. The inefficiency one thinks about naively is that you have to roll out a whole trajectory in order to get any learning signal at all.
As these trajectories become longer and longer, an agent goes from just completing the next word in a sentence to doing two days' worth of work before it can even tell whether it did the project correctly. The amount of information per flop has been decreasing. Since you have to unroll two days' worth of thinking to check something like "did I implement this feature?", the number of samples per flop has been decreasing. As you learn, you're trying to maximize bits per flop, and you can think of bits per flop as samples per flop times bits per sample. What I just mentioned is that samples per flop goes down as RL becomes more and more long-horizon.
But this kind of naive RL is also terrible from a bits-per-sample perspective, at least compared to supervised learning. Here's what I mean. Early in training, say your LLM has a vocabulary of 100K tokens, so there are 100K possible tokens it could answer with. You have a totally untrained model and a prompt like "The sky is". With supervised learning, the model has some probability distribution over all the things it could say, and there's a label that says the token here is actually "blue". Through the cross-entropy loss, it learns exactly how far its distribution is from correctly saying "blue".
Now, if you're doing this through RL, the model would try "The sky is halicon". Nope, that's wrong. "The sky is told". Nope, that's wrong. This is a totally untrained model, so you would have to do this on the order of 100,000 times just to stumble on "blue" and get some learning signal from it. In the supervised learning regime, you have your distribution of probabilities, you get told that it's "blue", and you figure out how far off you are. The amount you learn is a function of your pass rate: the further away you are from "blue", the more you learn to go towards "blue" using the cross-entropy loss.
You can think of your pass rate as your prior probability of having said "blue". As a function of that, in supervised learning with the cross-entropy loss, you learn -log(p) bits once you get the label, where p is the pass rate. Whereas in RL, if you're just randomly guessing shit and seeing whether it works, what you learn is basically the entropy of a binary random variable. What's also tough here is that the distribution you're sampling under is your policy's distribution. So if your policy has no chance of sampling "blue", you will never get a signal. Exactly.
That's modeled by the fact that your probability of sampling "blue" is extremely low. If you do sample it, you learn as much as you would have in supervised learning. In all other cases, which is 99.99% of the time for an untrained model, you learn incredibly little from seeing that "halicon" is not the correct word or "told" is not the correct word. That's what happens most of the time, so you learn very little. So try to graph it: on the x-axis you put your pass rate, and on the y-axis the bits you're learning from a sample, with 0% here, 50% here, and 100% here.
So at the end of training, you're here. With supervised learning, negative log pass rate would look something like this, and the entropy of a binary random variable would look like this. It depends on whether you're using nats or bits; in bits, it's one right here at the peak. This is like a coin flip: you learn the most from a coin flip. This is supervised learning, and this is RL. The problem, however, is that you spend most of training in this regime, the low pass rate regime.
In fact, how fast you're learning is a function of how many bits per sample you're getting, and you're getting very little signal here. Chart the pass rate with the x-axis on a log scale: at the beginning of training, with a vocab size of 100K, the pass rate is 1 over 100,000, then 1 over 10,000, 1 over 1,000, 1 over 100. On this graph, supervised learning would look like this. And RL, if you basically crunch what I just showed there, would look like that.
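The two curves being drawn can be computed directly. The sketch below evaluates the supervised quantity, -log2(p), and the binary-entropy quantity used for RL at the pass rates named in the discussion. The function names are hypothetical, and the bits-per-sample framing is taken from the conversation rather than from a formal result.

```python
import math


def supervised_bits(pass_rate: float) -> float:
    """Surprisal of the correct label, -log2(p): the supervised curve."""
    return -math.log2(pass_rate)


def rl_bits(pass_rate: float) -> float:
    """Entropy in bits of a pass/fail outcome with success probability p: the RL curve."""
    if pass_rate in (0.0, 1.0):
        return 0.0
    p = pass_rate
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


# Pass rates on a log scale, from an untrained 100K-vocab model up to a coin flip.
for p in (1e-5, 1e-4, 1e-3, 1e-2, 0.5):
    print(f"pass rate {p:g}: supervised {supervised_bits(p):.2f} bits, RL {rl_bits(p):.5f} bits")
```

At a pass rate of 0.5 both quantities equal one bit, while at 1 in 100,000 the supervised value is above 16 bits and the binary entropy is a small fraction of a bit, which is the gap the plot is meant to show.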
And arguably you spend all your time here. Exactly, potentially never getting even a single success. So it's a depressing plot in the sense that once you're here, it's not at all obvious how you get to here. Once you're here, you have something, but in many RL problems you actually spend all your time here. So there's a question of how to initialize so that you're at least not at zero, but at a non-zero pass rate.
One more thing I'd like to add about bits per sample, which is relevant to any kind of machine learning problem: there's a connection to soft targets and distillation, where you have access to the logits and not just the one-hot label. This is a one-hot token answer. If you have access to the soft targets, the entropy of that distribution is far, far higher than that of the one-hot. So there's actually way more information, way more bits per sample, in a soft label.
So that's why distillation is so effective: it's giving you way more information per sample. Ah, yeah. I wonder what the equation would be. It would just be the entropy of this distribution. Right, so the entropy of this one is zero, and the entropy of this one is given by the entropy equation. This is also why AlphaGo is quite beautiful. In AlphaGo, you don't train the policy network to imitate the MCTS action. You train it to imitate the MCTS distribution.
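To make the one-hot versus soft-target comparison concrete, the sketch below computes the entropy of both kinds of target. The visit counts are made-up numbers for illustration, and the helper names are hypothetical; normalizing visit counts into a distribution is a common way to form an MCTS policy target, used here as an assumption.

```python
import math
from typing import List


def entropy_bits(dist: List[float]) -> float:
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return sum(-p * math.log2(p) for p in dist if p > 0)


def visits_to_distribution(visit_counts: List[int]) -> List[float]:
    """Normalize search visit counts into a soft policy target."""
    total = sum(visit_counts)
    return [c / total for c in visit_counts]


def one_hot_of_argmax(dist: List[float]) -> List[float]:
    """Hard target: all probability mass on the most-visited move."""
    best = dist.index(max(dist))
    return [1.0 if i == best else 0.0 for i in range(len(dist))]


visits = [40, 40, 10, 10]  # made-up visit counts over four moves
soft_target = visits_to_distribution(visits)
hard_target = one_hot_of_argmax(soft_target)
soft_entropy = entropy_bits(soft_target)
hard_entropy = entropy_bits(hard_target)
```

The hard target has zero entropy, while the soft target here carries roughly 1.7 bits, including the fact that two moves were judged about equally good.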
Interesting. But both of these are actually valid. If you wanted to do a scientific experiment on how important this kind of soft-label, dark-knowledge distillation is, you can run an experiment where you train the policy network on the action MCTS selected rather than on the soft targets. Interesting. Earlier I was stumbling around this intuitively: why this ability to do iterative search matters, where you don't necessarily need to be able to win the game at the beginning. You just need to be able to improve your current policy.
Why is that such a powerful capability for learning, compared to how LLMs currently do RL? It's exactly this thing of considering your pass rate over the entire trajectory. I actually don't know a formal way to think about this, so maybe you can help me out here. Why is AlphaGo an elegant RL algorithm?
The major reason is that you never have to initialize at a 0% success rate and solve the exploration problem of getting to a non-zero success rate. That is what allows you to hill-climb on this beautiful supervised learning signal. And if you look at the actual implementation of AlphaGo, there is no TD-error learning or dynamic programming at any step, at least not explicitly.
It's just supervised learning: a value classification plus a KL minimization for the policy. So it's just a supervised learning problem on improved labels, and the training is very stable. You can train as big a network as you want, you can retrain it on the dataset, and everything will go stably. The infrastructure is very simple to implement as well.
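The supervised objective described here can be sketched as a sum of two terms. This follows the spoken description (value classification plus policy KL) rather than any specific codebase: the binary win/loss value head, the equal weighting of the two terms, and the function names are all assumptions made for illustration.

```python
import math
from typing import List


def policy_kl(target: List[float], predicted: List[float]) -> float:
    """KL(target || predicted) in nats; zero-mass target entries contribute nothing."""
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


def value_loss(predicted_win_prob: float, won: bool) -> float:
    """Binary cross-entropy for a win/loss value head (an assumed form)."""
    p = predicted_win_prob if won else 1.0 - predicted_win_prob
    return -math.log(p)


def training_loss(mcts_policy: List[float], net_policy: List[float],
                  predicted_win_prob: float, won: bool) -> float:
    """Policy term toward the MCTS distribution plus value term toward the game outcome."""
    return policy_kl(mcts_policy, net_policy) + value_loss(predicted_win_prob, won)


# When the network already matches the MCTS distribution, only the value term remains.
loss = training_loss([0.7, 0.2, 0.1], [0.7, 0.2, 0.1], predicted_win_prob=0.5, won=True)
```

Both terms are ordinary supervised losses against fixed labels, which is the sense in which no TD bootstrapping appears in the training step.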
You don't need a complex distributed system to keep everything on-policy. At the end of the day, you're just saying: I have some improved labels, so let's retrain my supervised model on these targets. So you're always in this beautiful regime where you're just trying to improve the policy, rather than trying to escape a kind of local minimum where the signal is flat all around you.
One way to draw the curve is to plot the win rate of an MCTS policy versus the raw network. Say the dotted line is the raw network; the MCTS policy looks like this. So at every step of the way, the supervision signal is very clean. Right, you're never in a situation where MCTS gives you no signal.
Unless your MCTS distribution converges to exactly what your policy network gives. Yeah. Okay, that's a great way to explain it. Cool. Maybe we sit down and I ask some questions about automated research. Sounds good. One thing I really wanted to talk to you about is that you did a bunch of the research for this project through this kind of automated LLM coding assistant loop.
There's an idea that if you fully automated AI research, you could have some sort of singularity. Obviously we're not there yet, but to the extent that we have early indications of what this process might look like, I'm curious about your observations: what the AI is good at, what it's not good at, and what you think about this scenario and its eventual likelihood. What thoughts do you have about this in general? For sure. I think automated scientific research is one of the most exciting skills the frontier labs are developing right now. And I think it's important for everyone doing any kind of research to get a good intuition for what it can do now and what it can't.
And how might the science process work in the future, once we have AIs automating a lot of this investigation? In brief, I mostly used Opus 4.6 and 4.7 while working on this. What works is that the models can do a very good job of hyperparameter optimization. In the past, people would come up with a search space of hyperparameters, like learning rate, weight decay, and maybe how many layers are in your network. They would do a grid search or some Bayesian hyperparameter optimization approach, and it would find some tuned parameters.
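For reference, the traditional grid search mentioned here looks roughly like the sketch below. The search space values and the `toy_loss` function are made up for illustration; in practice `evaluate` would train a model with the given configuration and return a validation metric.

```python
import itertools
from typing import Callable, Dict, Sequence, Tuple


def grid_search(space: Dict[str, Sequence],
                evaluate: Callable[[Dict], float]) -> Tuple[Dict, float]:
    """Try every combination in the grid and return the lowest-loss configuration."""
    names = list(space)
    best_config: Dict = {}
    best_loss = float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        loss = evaluate(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


# Made-up search space and a toy objective with a known optimum.
space = {"learning_rate": [1e-3, 1e-2], "weight_decay": [0.0, 0.1], "num_layers": [4, 8]}


def toy_loss(config: Dict) -> float:
    return (abs(config["learning_rate"] - 1e-2)
            + config["weight_decay"]
            + abs(config["num_layers"] - 8))


best_config, best_loss = grid_search(space, toy_loss)
```

The limitation is visible in the signature: the search can only move within the axes someone wrote down in `space`, which is the contrast with the more open-ended changes a coding agent can try.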
The really cool thing that automated coding can do now is search a much more open-ended set of problems. It can say: I've identified that the gradients are kind of small in this layer, so let me change it up here. Let me rewrite the code so the data loader has a new augmentation I came up with. Let's try to find the best way to fit the constraints of the optimization problem. You end up with this much more flexible, high-level, almost grad-student-like ability to just grind a performance metric.
This can squeeze out quite a lot of performance. On a fixed dataset with a fixed time budget, you can improve perplexity by quite a lot on a classification-style problem like LMs or Go. It is also fantastic now at basically executing any experiment. I have a Claude skill I wrote called experiment, where I give it a description of what I want to plot: here's the x-axis I want, here's the y-axis, answer this question for me. It'll run off and do all the experiments, compile the plot, make a report, and suggest what might have caused the result, and so forth.
So that's what works quite well today, and I think we can expect these abilities to get better in the future. It's also useful to know what it is not doing so well today. On the blog version of this tutorial, I have a plot of basically all the experiments I did, grouped in a sort of tree. Every node represents a failed, successful, or mixed experimental result, and from there it branches off into a child, which is the follow-on experiment.
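The experiment tree described here maps naturally onto a small recursive data structure. The sketch below is an assumption about how one might represent it, not the code behind the blog's plot; the `Experiment` class and the example node names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Experiment:
    name: str
    outcome: str  # one of "success", "failure", "mixed"
    children: list = field(default_factory=list)

    def follow_up(self, name, outcome):
        """Attach a follow-on experiment as a child node and return it."""
        child = Experiment(name, outcome)
        self.children.append(child)
        return child

def count_outcomes(node):
    """Tally outcomes over the node and all of its descendants."""
    counts = Counter({node.outcome: 1})
    for child in node.children:
        counts += count_outcomes(child)
    return counts

def render(node, depth=0):
    """Return the tree as indented text lines, one per experiment."""
    lines = ["  " * depth + f"[{node.outcome}] {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical example tree.
root = Experiment("baseline", "success")
relabel = root.follow_up("off-policy MCTS relabeling", "mixed")
relabel.follow_up("relabeling follow-up", "failure")
root.follow_up("different track", "success")

print("\n".join(render(root)))
print(count_outcomes(root))
```

Abandoning a track and jumping to a different one corresponds to adding a new child higher up the tree, rather than extending the current branch.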
Occasionally I'll rabbit-hole down a track, like this off-policy MCTS relabeling one, do a few experiments, and then realize it's probably not worth it. So then I'll jump to a completely different track. I call these kinds of things rows. What I find is that the current closed models the public can access today don't seem to be that great at selecting what the next experiment should be in a given track. And they don't seem to be able to step back and do the lateral thinking of, wait a minute, this track doesn't really make sense.
Like, let's go back to first principles and think about what the bottleneck might be, or what we are trying to achieve. And so often I had to catch infra bugs myself by prompting Claude with the right question, to investigate what is causing this discrepancy. And then it'll answer the question. I think with Mythos-class models or Mythos-plus-plus models coming online, maybe this just completely changes and these problems just fall to improved scaling.
But at the same time, I think there's a lot of rich opportunity to develop RL environments that might incentivize this kind of lateral thinking. One of the motivations for setting up this Go environment was that I think Go captures a lot of very interesting research problems, often overlapping with LMs or robotics, and yet it's very quick to verify. The outer loop is ultimately, does the agent do what I think it does? You can check the outcome of a Go game quite easily. And then the inner loop involves all this research engineering around distributed systems.
It also involves predicting whether an idea is going to work or not, and predicting the difference a particular modification to your training algorithm might make. I think there's a rich library of subtasks and sub-environments that you can train an automated scientist to work on, with Go as a sort of outer verification loop. Then once you acquire these skills, maybe you can apply them to other domains like biosciences or robotics. Or automating AI research. Which is the real crux, or the scary slash incredible thing: AIs making future versions of AIs.
And you're suggesting the outer loop here could just be your win rate against KataGo, basically? That's one of them. I think there are a lot of deeper questions that one could tackle. For example, let's say you have an idea for how to improve a scaling-law compute multiplier. The outcome isn't necessarily that I achieved the best Go bot ever. The outcome might just be, can I predict what the win rate of my Go bot will be? Or can I predict the scaling-law plots that emerge from my idea? But then you can verify that you haven't reward hacked anything by using a very verifiable game like Go on the outer loop.
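One way to make "can I predict what the win rate of my Go bot will be?" checkable is to score the prediction against actual game results. The sketch below uses the average log-likelihood of the observed wins and losses under the predicted win rate. That scoring rule is an assumption chosen for illustration, not something proposed in the conversation, and `win_rate_prediction_score` is a hypothetical helper.

```python
import math

def win_rate_prediction_score(predicted_win_rate, wins, games):
    """Average per-game log-likelihood of the results under the prediction.

    Higher is better. For a fixed set of results, the score is maximized
    when the predicted win rate equals the observed rate wins / games.
    """
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    # Clamp so a prediction of exactly 0 or 1 does not produce log(0).
    p = min(max(predicted_win_rate, 1e-6), 1 - 1e-6)
    losses = games - wins
    return (wins * math.log(p) + losses * math.log(1 - p)) / games

# Hypothetical match: 60 wins out of 100 games.
well_calibrated = win_rate_prediction_score(0.60, wins=60, games=100)
overconfident = win_rate_prediction_score(0.90, wins=60, games=100)
print(well_calibrated, overconfident)
```

Because the games themselves decide the score, a researcher or an agent cannot improve it by reinterpreting the results, only by predicting better.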
I think there are a couple of interesting follow-on questions. There are questions on the inner loop and the outer loop. On the inner loop, there's a question of how locally verifiable any modification you might make is. That is to say, would you know whether some idea you try out is actually an improvement or a degradation? If something isn't working, would you know whether that is the result of a bug or the result of the idea itself being wrong? Ilya was talking about one of the things he thinks makes him a good researcher, which is that he has a strong belief in what the correct idea is.
And he is able to persevere through bugs and know which things are bugs versus mistakes in the fundamental idea, based on his high-level belief that this idea should work. So therefore there has to be a bug, versus the other way around. Why don't we start with that question, actually? How locally verifiable are things which are good ideas? I think in the case of the success story for deep learning, you can think about it as a decades-long idea that took a lot of faith to get to work.
And so this presents a very challenging long-horizon RL problem where, every step of the way, you have a committee telling you that this is a bad idea, and then ultimately you break through. How do you design RL environments that maybe give you some feedback earlier? I think this is a very tough open question that I don't have an answer to. But ultimately, to build a very strong Go bot, you probably did need to discover deep learning.
And so I think that having a challenging game that cannot be cheated easily on the outer loop could be used as a sort of outer-loop signal for something like discovering the principles of deep learning. Now, of course, you have to make it tractable, and this is where research taste really matters. You have to come up with ways to initialize your problems so that you aren't trying to solve a very intractable problem.
Maybe you can leverage LLMs as a sort of universal grammar in the middle to give you some local feedback. The fact that LLMs are a universal grammar means that they can move at almost any level of the stack. They can think very locally, as well as step back and think in very broad steps.
And I think that's where a lot of the lateral thinking ability of humans comes from: how to know if the track or the objective that you're pursuing is not right, and that you should be asking a different question. The other question is how stackable local improvements are in the attempt to get to a better result on the outer loop. I've heard rumors that at some AI labs, the thing that has gone wrong is that people will individually pursue good ideas, but those don't end up stacking well. And so the training run fails because of some weird interaction between two seemingly good ideas. Having a single top-down vision of how things should work is very important. Having worked at different AI labs, and also having played around with parallel agents trying different ideas, what is your sense of how parallelizable AI innovation is?
Yeah, great question. I think the research taste for executing well on the bitter lesson is that you need to know how much the bitter lesson can buy you, and how much is too much to ask for at any given moment. Of course, in the fullness of time, compute is the single most important determinant of how things work. It's almost inevitable that as you scale up energy, compute, and parameters, intelligence will just fall out of that. That's super beautiful, super profound. No algorithmic detail really matters beyond that. But in the present day, we don't have infinite compute and parameters and arbitrarily good initialization, so we have to come up with heuristics that give us that. But these heuristics are probably somewhat redundant. That's probably why you see this effect where a lot of these compute multipliers don't necessarily stack: they might have some correlated benefit.
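A toy calculation can illustrate the point about correlated benefits. Suppose two tricks each act as a 1.5x compute multiplier on their own. If their benefits were independent, they would multiply to 2.25x. If they partly solve the same problem, the combined gain is smaller. The overlap model below, which subtracts a fraction of the smaller gain in log space, is purely an assumption for illustration and is not taken from the conversation or from any paper.

```python
import math

def stacked_multiplier(gain_a, gain_b, overlap):
    """Combined gain of two tricks under a toy log-space overlap model.

    overlap = 0.0: the gains are independent and multiply.
    overlap = 1.0: the smaller gain is entirely redundant with the larger.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    log_a, log_b = math.log(gain_a), math.log(gain_b)
    return math.exp(log_a + log_b - overlap * min(log_a, log_b))

for overlap in (0.0, 0.5, 1.0):
    combined = stacked_multiplier(1.5, 1.5, overlap)
    print(f"overlap={overlap:.1f}: combined multiplier {combined:.2f}x")
```

Under this model, full independence gives 2.25x and full redundancy gives only 1.5x, so measuring each trick in isolation can overstate what the stack delivers.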
And then three years down the line, when the NVIDIA GPUs have gotten even stronger, maybe they stack even less well. At any given point in time, the benefit of any given compute multiplier may be transitory. Which is what I sort of suspected with the KataGo paper. There were many algorithmic ideas applied. And you can see that with modern Blackwell GPUs and Ada-class GPUs, which are much better than the V100-grade GPUs that paper used, some of these algorithmic tricks to speed up convergence just don't matter so much compared to something else. And I think that's a matter of taste in the present time.
Yeah, interesting. How about the outer loop? How verifiable is it for making AI smarter? With Go, you do have this outer loop of win rate against the best open-source model out there. And even there, as you were saying, there are other outer loops, like, did you discover a new phenomenon? Which is actually very hard to verify. When were the Chinchilla or Kaplan scaling laws released? Like, 2019. So if you're back in 2015 and you didn't know scaling laws were important, there's not an automated procedure one can easily imagine for knowing which paper is the scaling laws paper versus which is just another random plot. So even in the Go case, the outer loop is hard to verify. And the whole idea of an outer loop is to have some backstop on improvement.
Let alone for general AGI, where, of course, we have a bunch of these benchmarks, but there's the problem that we know the things we can measure, and we improve on the things we can measure, but what we care about is this broader ability to do economically useful work, which, at least until you automate everything, is not super easy to measure. So there's a question of, okay, how good is the outer verification loop for AI self-improvement, and does that matter?
Yeah, I'm going to give a non-rigorous argument, but one that I intuitively believe. DeepMind, the AI research lab, started with a focus on games. They used games as their outer loop, and the researchers learned from the experience of solving games, and now they're working on LMs. Presumably, there was some positive transfer from their time working on games like Atari, Go, and StarCraft that now helps them make good LMs. I assume there's positive transfer in some regard, whether it's coding, or general research ability, or project management. All these things probably help them do well.
And so, if that's the case, why wouldn't it also be true for automated AI researchers? They should be able to positively transfer experience from tackling quick-to-verify, quick-to-iterate-on environments to something more ambitious and economically useful, like automating drug discovery. I mean, I don't know. Until Gemini 3 or whatever, wasn't that historically the issue? A couple of years ago, people were saying, look, Google isn't catching up in LLMs because they're too tied to the old approach. There are gains, but there are also ways in which it actively hinders you. So it's actually not obvious to me that there's... The jury's still out, right?
Yeah. Let's say Google is currently doing quite well. Who knows if the initialization of training on games is ultimately going to hobble their ability to be the winner in the long term? It's hard to say for sure. And likewise, who knows if the seemingly late start was really just them pre-training for longer on how to scale up TPUs? They invested all their tech tree in getting TPUs to be good, which seemed not that useful in the short term, but in the long term maybe becomes a...
Yeah. So it's hard even for humans to reason about what the optimal research strategy should be, even with the data we have today. Cool. Okay, we should let people know how they can find out more about this project, whether to fork it themselves or check out your blog post. You've done an excellent job explaining many of these ideas. Where do people go next? Great. My website is evjang.com. There's a blog post that links to an interactive version of this tutorial.
And on my GitHub, where the username is just ericjang, there's an auto go repo that people can fork to reproduce the training results. And I also highly recommend people check out this blog post, As Rocks May Think. We touched on some of its ideas in this conversation, but it's this grander thesis of what happens when you have thinking as a primitive in computer science. Exactly. So I highly recommend people check out that blog post as well.
Yeah. And I encourage the audience to think about the relationship between thinking and Go, via MCTS and search, and how it relates to LMs. I think there's something quite profound there, and probably underexplored, just because Go has been relatively underexplored compared to the boom in LMs. It's not to say that I think we should have trees in our LMs, but there is some very interesting duality between them, and you can actually do a lot of research on Go, MCTS, and reasoning with very small budgets.
So that's very exciting. Cool. Awesome, Eric. Thanks for doing this. It's an honor to be on the podcast.