Value targets in off-policy AlphaZero: A new greedy backup

Willemsen, Daniël; Baier, Hendrik; Kaisers, Michael

doi:10.1007/s00521-021-05928-5

J.D. Willemsen (Daniël), H.J.S. Baier (Hendrik) and M. Kaisers (Michael)

2021-06-16

Neural Computing and Applications, Volume 34, p. 1801-1814

This article presents and evaluates a family of AlphaZero value targets, subsuming previous variants and introducing AlphaZero with greedy backups (A0GB). Current state-of-the-art algorithms for playing board games use sample-based planning, such as Monte Carlo Tree Search (MCTS), combined with deep neural networks (NN) to approximate the value function. These algorithms, of which AlphaZero is a prominent example, are computationally extremely expensive to train, due to their reliance on many neural network evaluations. This limits their practical performance. We improve the training process of AlphaZero by using more effective training targets for the neural network. We introduce a three-dimensional space to describe a family of training targets, covering the original AlphaZero training target as well as the soft-Z and A0C variants from the literature. We demonstrate that A0GB, using a specific new value target from this family, is able to find the optimal policy in a small tabular domain, whereas the original AlphaZero target fails to do so. In addition, we show that soft-Z, A0C and A0GB achieve better performance and faster training than the original AlphaZero target on two benchmark board games (Connect-Four and Breakthrough). Finally, we juxtapose tabular learning with neural network-based value function approximation in Tic-Tac-Toe, and compare the effects of learning targets therein.

Additional Metadata
Keywords	Reinforcement learning, Sample-based planning, AlphaZero, MCTS
Persistent URL	doi.org/10.1007/s00521-021-05928-5
Journal	Neural Computing and Applications
Project	Flexible Assets Bid Across Markets
Grant	This work was funded by the CWI PPS samenwerking; grant id pps/TEUE117015 - Flexible Assets Bid Across Markets (FABAM)
Organisation	Intelligent and autonomous systems
Citation	Willemsen, D., Baier, H., & Kaisers, M. (2021). Value targets in off-policy AlphaZero: A new greedy backup. Neural Computing and Applications, 34, 1801–1814. doi:10.1007/s00521-021-05928-5
