Q-learning is a popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning exhibits artifacts in non-stationary environments: for example, the probability of playing the optimal action may decrease if the Q-values deviate significantly from the true values, a situation that may arise in the initial learning phase as well as after changes in the environment. These artifacts were addressed in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL suffers from practical concerns that limit the policy subspace for which the behavior is improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state of the art). Experimental results verify the theoretical insights and show how RUQL outperforms both FAQL and Q-learning in non-stationary environments.
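For readers unfamiliar with the algorithms named in the abstract, the following Python sketch contrasts the three update rules at a single state. It is an illustration only, not the paper's reference implementation: the standard Q-learning update is textbook, while the FAQL step-size scaling min(beta/pi, 1) and the RUQL closed form (equivalent to repeating the standard update 1/pi(a|s) times against the same target) follow the common formulations in the literature; the function names and parameters (alpha, gamma, beta, pi_a) are ours.

```python
import numpy as np

def q_update(q, a, reward, q_next, alpha=0.1, gamma=0.95):
    """Standard Q-learning update for the Q-values of one state.

    q      : 1-D array of Q-values for the current state
    a      : index of the action that was taken
    reward : observed reward
    q_next : 1-D array of Q-values for the successor state
    """
    td_target = reward + gamma * np.max(q_next)
    q[a] += alpha * (td_target - q[a])
    return q

def faq_update(q, a, reward, q_next, pi_a, alpha=0.1, gamma=0.95, beta=0.01):
    """Frequency Adjusted Q-learning (FAQL) sketch: the step size is scaled by
    min(beta / pi(a), 1), so rarely played actions are updated as strongly as
    frequently played ones; beta caps the effective step size."""
    td_target = reward + gamma * np.max(q_next)
    q[a] += min(beta / pi_a, 1.0) * alpha * (td_target - q[a])
    return q

def ruql_update(q, a, reward, q_next, pi_a, alpha=0.1, gamma=0.95):
    """Repeated Update Q-learning (RUQL) sketch: the standard update is applied
    1/pi(a) times toward the same target; summing the geometric series gives
    the closed form below, so no explicit loop is needed."""
    td_target = reward + gamma * np.max(q_next)
    keep = (1.0 - alpha) ** (1.0 / pi_a)  # weight retained from the old estimate
    q[a] = keep * q[a] + (1.0 - keep) * td_target
    return q
```

In all three sketches, pi_a denotes the probability with which the current policy selects the chosen action; the key difference is how that probability modulates the update: FAQL shrinks or preserves the step size, whereas RUQL effectively repeats the update more often for rarely selected actions.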
Additional Metadata
Keywords Q-learning, Non-stationary Environment, Dynamics
MSC Population dynamics (general) (msc 92D25), Stochastic learning and adaptive control (msc 93E35), Artificial intelligence (msc 68Txx), Learning and adaptive systems (msc 68T05)
Editor T. Ito, C.M. Jonker (Catholijn), M. Gini, O. Shehory (Onn)
Conference International Joint Conference on Autonomous Agents and Multiagent Systems
Citation
Abdallah, S., & Kaisers, M. (2013). Addressing the Policy-bias of Q-learning by Repeating Updates. In T. Ito, C.M. Jonker, M. Gini, & O. Shehory (Eds.), Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems.