Value targets in off-policy AlphaZero: A new greedy backup
This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games use sample-based planning, such as Monte Carlo Tree Search (MCTS), combined with deep neural networks (NN) to approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train, due to their reliance on many neural network evaluations. This limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we juxtapose tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and compare the effects of learning targets therein.
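To make the contrast between value targets concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `Node` class, a single-player reward perspective (no sign flips between players), and that "greedy" means following the most-visited child at each level; the abstract does not specify these details. It compares a game-outcome target, which is affected by an exploratory move made during self-play, with a target obtained by following the greedy path through the search tree to a leaf.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Node:
    """Hypothetical search-tree node; field names are illustrative assumptions."""
    value_estimate: float                 # leaf evaluation (e.g. NN output or terminal reward)
    visits: int = 0                       # visit count accumulated during search
    children: Dict[str, "Node"] = field(default_factory=dict)


def greedy_backup_target(root: Node) -> float:
    """Follow the most-visited child from the root down to a leaf
    and return that leaf's value estimate (assumed notion of 'greedy')."""
    node = root
    while node.children:
        node = max(node.children.values(), key=lambda child: child.visits)
    return node.value_estimate


def outcome_target(episode_return: float) -> float:
    """Outcome-style target: the final return of the self-play episode,
    including the consequences of any exploratory moves."""
    return episode_return


# Toy tree: action "a" is clearly preferred by the search, "b" is rarely visited.
root = Node(value_estimate=0.0, visits=100, children={
    "a": Node(value_estimate=0.4, visits=90, children={
        "a1": Node(value_estimate=0.5, visits=60),
        "a2": Node(value_estimate=-0.2, visits=30),
    }),
    "b": Node(value_estimate=-1.0, visits=10),
})

# Suppose self-play explored "b" and the episode was lost (return -1).
z = outcome_target(-1.0)
greedy = greedy_backup_target(root)
print(z, greedy)
```

In this toy case the outcome target reflects the exploratory move, while the greedy target reflects the value along the path the search prefers, which is the kind of on-policy versus off-policy difference the abstract alludes to.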
|Neural Computing and Applications|
|Organisation||Intelligent and autonomous systems|
Willemsen, J.D., Baier, H.J.S., & Kaisers, M. (2021). Value targets in off-policy AlphaZero: A new greedy backup. Neural Computing and Applications. doi:10.1007/s00521-021-05928-5