Q-learning is a very popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning shows artifacts in non-stationary environments: for example, the probability of playing the optimal action may decrease if Q-values deviate significantly from the true values, a situation that may arise in the initial phase as well as after changes in the environment. These artifacts were resolved in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL suffers from practical concerns that limit the policy subspace for which the behavior is improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state-of-the-art). Experimental results verify the theoretical insights and show how RUQL outperforms FAQL and Q-learning (QL) in non-stationary environments.
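The idea of repeating updates can be illustrated with a minimal sketch. The abstract does not state the update rule, so the following is an assumption rather than the authors' exact formulation: it supposes that the standard Q-learning update for an action is repeated about 1/π(s,a) times, where π(s,a) is the probability of selecting that action, and that the target stays fixed during the repeats. Under that assumption the repeats collapse into a closed form. The function names, the parameter `pi_a`, and the default values of `alpha` and `gamma` are hypothetical.

```python
def q_update(q, reward, next_max, alpha=0.1, gamma=0.9):
    """One standard Q-learning update of a single Q-value."""
    target = reward + gamma * next_max
    return (1 - alpha) * q + alpha * target


def repeated_q_update(q, reward, next_max, pi_a, alpha=0.1, gamma=0.9):
    """Closed form of applying q_update 1/pi_a times with a fixed target.

    Assumption (not stated in the abstract): the number of repeats is
    the inverse of the action's selection probability pi_a, so rarely
    chosen actions receive a proportionally larger update.
    """
    target = reward + gamma * next_max
    keep = (1 - alpha) ** (1.0 / pi_a)  # weight left on the old estimate
    return keep * q + (1 - keep) * target


# An action chosen with probability 0.5 is updated as if it had been
# updated twice in a row.
twice = q_update(q_update(0.0, 1.0, 0.0), 1.0, 0.0)
closed = repeated_q_update(0.0, 1.0, 0.0, pi_a=0.5)
```

With `pi_a=1.0` the closed form reduces to the ordinary update, so under this assumption the variant only differs from Q-learning for actions that are not played every time.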
T. Ito (Tsuyoshi), C.M. Jonker (Catholijn), M. Gini, O. Shehory (Onn)
International Joint Conference on Autonomous Agents and Multiagent Systems
Intelligent and autonomous systems

Abdallah, S., & Kaisers, M. (2013). Addressing the Policy-bias of Q-learning by Repeating Updates. In T. Ito, C.M. Jonker, M. Gini, & O. Shehory (Eds.), International Joint Conference on Autonomous Agents and Multiagent Systems.