We propose a novel hybrid stochastic policy gradient estimator that combines an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased, but enjoys a variance-reduction property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm, then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity O(ε^{-3}) to attain a first-order stationary point of the composite problem, which improves on the O(ε^{-4}) complexity of REINFORCE/GPOMDP and the O(ε^{-10/3}) complexity of SVRPG in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite settings indeed have some advantages over the non-composite ones on certain problems.
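To make the estimator concrete, below is a minimal, self-contained sketch of one hybrid update on a toy problem, not the authors' implementation. Everything problem-specific is an assumption for illustration: a one-dimensional Gaussian policy, a quadratic reward, an ℓ2 regularizer as the composite term, the mixing weight `beta`, step size `eta`, and regularization `lam` are all hypothetical choices, and for brevity the sketch reuses a single sampled action for both mixture components, whereas the estimator can mix independent samples.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 1.0    # fixed policy standard deviation
TARGET = 2.0   # reward peak: r(a) = -(a - TARGET)^2

def sample_action(theta):
    """Draw one 'trajectory' (a single action) from the Gaussian policy N(theta, SIGMA^2)."""
    return rng.normal(theta, SIGMA)

def reward(a):
    return -(a - TARGET) ** 2

def log_prob_grad(a, theta):
    """d/dtheta of log N(a; theta, SIGMA^2)."""
    return (a - theta) / SIGMA ** 2

def reinforce_grad(a, theta):
    """Unbiased REINFORCE gradient estimate at theta, for an action sampled under theta."""
    return log_prob_grad(a, theta) * reward(a)

def importance_weight(a, theta_eval, theta_sample):
    """pi_{theta_eval}(a) / pi_{theta_sample}(a), needed because the action was
    sampled under theta_sample but the gradient is evaluated at theta_eval."""
    return np.exp((-(a - theta_eval) ** 2 + (a - theta_sample) ** 2) / (2 * SIGMA ** 2))

def prox_l2(theta, eta, lam):
    """Proximal operator of the (assumed) regularizer R(theta) = (lam/2) * theta^2."""
    return theta / (1.0 + eta * lam)

def hybrid_step(theta, theta_prev, v_prev, beta=0.7, eta=0.02, lam=0.01):
    """One sketched hybrid update: v_t is a convex combination of a fresh
    REINFORCE estimate (weight 1 - beta) and a SARAH-style recursive estimate
    (weight beta), followed by a proximal gradient-ascent step."""
    a = sample_action(theta)
    g_now = reinforce_grad(a, theta)
    # Gradient at the previous iterate, importance-weighted to correct the sampling distribution.
    g_prev = importance_weight(a, theta_prev, theta) * log_prob_grad(a, theta_prev) * reward(a)
    v = (1.0 - beta) * g_now + beta * (v_prev + g_now - g_prev)
    theta_next = prox_l2(theta + eta * v, eta, lam)  # ascent on the return, prox of R
    return theta_next, theta, v

theta_prev = theta = 0.0
v = reinforce_grad(sample_action(theta), theta)  # initialize v_0 with plain REINFORCE
for _ in range(5000):
    theta, theta_prev, v = hybrid_step(theta, theta_prev, v)
print(f"learned policy mean: {theta:.2f} (regularized optimum lies just below {TARGET})")
```

Because the SARAH-style correction is folded in with a fixed weight at every step, the sketch runs as a single loop with no full-gradient checkpoints, which mirrors the single-loop structure described in the abstract; the restarting variant would periodically reset v to a fresh unbiased estimate.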

Additional Metadata
Conference International Conference on Artificial Intelligence and Statistics
Citation
Pham, N. H., Nguyen, L. M., Phan, D. T., Nguyen, P. H., van Dijk, M. E., & Tran-Dinh, Q. (2020). A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics.