Addressing the Policy-bias of Q-learning by Repeating Updates
Presented at the International Joint Conference on Autonomous Agents and Multiagent Systems, St. Paul, MN, USA
Q-learning is a popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning shows artifacts in non-stationary environments, e.g., the probability of playing the optimal action may decrease if Q-values deviate significantly from the true values, a situation that may arise in the initial phase as well as after changes in the environment. These artifacts were resolved in the literature by the variant Frequency Adjusted Q-learning (FAQL). However, FAQL also suffered from practical concerns that limited the policy subspace for which the behavior was improved. Here, we introduce Repeated Update Q-learning (RUQL), a variant of Q-learning that resolves the undesirable artifacts of Q-learning without the practical concerns of FAQL. We show, both theoretically and experimentally, the similarities and differences between RUQL and FAQL (the closest state-of-the-art). Experimental results verify the theoretical insights and show how RUQL outperforms FAQL and Q-learning in non-stationary environments.
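The policy bias mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the paper. It assumes a tabular setting and assumes that "repeating updates" means applying the standard Q-learning update roughly 1/pi times, where pi is the probability with which the current policy selected the action. This record does not spell out that rule, and the function names and default parameters below are hypothetical.

```python
def q_update(q, reward, next_max, alpha=0.1, gamma=0.9):
    """One standard Q-learning update for a single state-action value."""
    target = reward + gamma * next_max
    return (1 - alpha) * q + alpha * target


def repeated_q_update(q, reward, next_max, pi, alpha=0.1, gamma=0.9):
    """Apply q_update about 1/pi times (assumed repetition rule).

    pi is the probability that the policy chose this action. Rarely
    chosen actions receive more repetitions per visit, which offsets
    the fact that they are visited, and hence updated, less often.
    """
    repeats = max(1, int(1 / pi))  # assumption: floor of 1/pi, at least once
    for _ in range(repeats):
        q = q_update(q, reward, next_max, alpha, gamma)
    return q


if __name__ == "__main__":
    # An action chosen with probability 0.25 is updated four times per visit.
    print(repeated_q_update(0.0, reward=1.0, next_max=0.0, pi=0.25))
```

Because the target stays fixed across the repetitions, n repetitions are equivalent to a single update with effective learning rate 1 - (1 - alpha)^n, so the loop could be replaced by that closed form under the same assumptions.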
|Q-learning, Non-stationary Environment, Dynamics|
|Population dynamics (general) (msc 92D25), Stochastic learning and adaptive control (msc 93E35), Artificial intelligence (msc 68Txx), Learning and adaptive systems (msc 68T05)|
|Null option (theme 11)|
|T. Ito (Tsuyoshi), C.M. Jonker (Catholijn), M. Gini, O. Shehory (Onn)|
|International Joint Conference on Autonomous Agents and Multiagent Systems|
|Organisation||Intelligent and autonomous systems|
Abdallah, S., & Kaisers, M. (2013). Addressing the Policy-bias of Q-learning by Repeating Updates. In T. Ito, C.M. Jonker, M. Gini, & O. Shehory (Eds.), .