Double Q-learning

Hado van Hasselt

In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values, which result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.
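
The bias described above, and the update it motivates, can be illustrated with a minimal Python sketch. It is not taken from the paper's text: the function names, the dictionary-based tables, the Gaussian test setup and all parameter values are assumptions chosen for illustration. The first part compares the maximum of sample means (the single estimator implicit in Q-learning) with a double estimator that selects the best variable on one half of the samples and evaluates it on the other half. The second part shows the tabular Double Q-learning update as it is commonly stated, where one table selects the greedy next action and the other table evaluates it. Terminal-state handling is omitted.

```python
import random
from statistics import mean


def single_estimator(samples):
    """Maximum of the sample means: tends to overestimate max_i E[X_i]."""
    return max(mean(s) for s in samples)


def double_estimator(samples):
    """Select the argmax on one half of the data, evaluate it on the other half."""
    half_a = [s[: len(s) // 2] for s in samples]
    half_b = [s[len(s) // 2:] for s in samples]
    best = max(range(len(samples)), key=lambda i: mean(half_a[i]))
    return mean(half_b[best])


def average_estimates(n_vars=10, n_samples=10, n_trials=2000, seed=0):
    """Average both estimators over many trials.

    Assumed setup: every variable is N(0, 1), so the true maximum
    expected value is exactly 0.
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(n_trials):
        samples = [
            [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_vars)
        ]
        single_total += single_estimator(samples)
        double_total += double_estimator(samples)
    return single_total / n_trials, double_total / n_trials


def double_q_update(q_a, q_b, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.95, rng=random):
    """One tabular Double Q-learning step on dict-based tables (hypothetical layout).

    With probability 0.5 table A is updated using B's value of A's greedy
    next action; otherwise the roles are swapped. Missing entries count as 0.
    """
    if rng.random() < 0.5:
        q_sel, q_eval = q_a, q_b
    else:
        q_sel, q_eval = q_b, q_a
    # Greedy next action according to the table being updated.
    best = max(actions, key=lambda x: q_sel.get((s_next, x), 0.0))
    # That action's value is read from the other table.
    target = r + gamma * q_eval.get((s_next, best), 0.0)
    old = q_sel.get((s, a), 0.0)
    q_sel[(s, a)] = old + alpha * (target - old)


if __name__ == "__main__":
    single_avg, double_avg = average_estimates()
    print("single estimator average:", single_avg)
    print("double estimator average:", double_avg)
```

In this assumed setup the single estimator's average should come out clearly positive although the true value is 0, while the double estimator's average should stay near 0. When the variables have different means, the double estimator can fall below the true maximum, which is the underestimation the abstract mentions.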

Additional Metadata
Keywords	reinforcement learning, double Q-learning, Q-learning, bias
ACM	Problem Solving, Control Methods, and Search (acm I.2.8), Learning (acm I.2.6)
MSC	Stochastic learning and adaptive control (msc 93E35)
THEME	Software (theme 1), Logistics (theme 3), Energy (theme 4)
Publisher	The MIT Press
Series	Advances in Neural Information Processing Systems
Conference	Annual Conference on Advances in Neural Information Processing Systems
Organisation	Intelligent and autonomous systems
Citation	van Hasselt, H. (2010, December). Double Q-learning. Advances in Neural Information Processing Systems.

