We propose a novel hybrid stochastic policy gradient estimator that combines an unbiased policy gradient estimator, the REINFORCE estimator, with a biased one, an adapted SARAH estimator, for policy optimization. The hybrid policy gradient estimator is shown to be biased, but it has a variance-reduced property. Using this estimator, we develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization problem that allows us to handle constraints or regularizers on the policy parameters. We first propose a single-loop algorithm, then introduce a more practical restarting variant. We prove that both algorithms achieve the best-known trajectory complexity O(ε^{-3}) to attain a first-order stationary point of the composite problem, which improves on the O(ε^{-4}) complexity of REINFORCE/GPOMDP and the O(ε^{-10/3}) complexity of SVRPG in the non-composite setting. We evaluate the performance of our algorithm on several well-known examples in reinforcement learning. Numerical results show that our algorithm outperforms two existing methods on these examples. Moreover, the composite setting indeed has some advantages over the non-composite one on certain problems.
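To make the abstract's ingredients concrete, the sketch below illustrates the structure of the hybrid estimator and the proximal step on a deliberately simple toy problem. It is not the paper's algorithm: it uses a one-dimensional Gaussian policy on a single-step "bandit" (so a trajectory is a single action), a quadratic reward, an l2 regularizer standing in for the composite term, and illustrative hyperparameters (beta, eta, lam, batch, and all function names are assumptions made here for the example). What it does show is the key idea: a convex combination of an unbiased REINFORCE term with an importance-weighted, SARAH-style difference term, followed by a proximal gradient-ascent update.

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.5   # fixed std of the Gaussian policy (assumption for this toy)
A_STAR = 2.0  # optimum of the toy quadratic reward

def sample_action(theta, n):
    """Draw n actions from the Gaussian policy N(theta, SIGMA^2)."""
    return theta + SIGMA * rng.standard_normal(n)

def reward(a):
    return -(a - A_STAR) ** 2

def grad_log_pi(theta, a):
    """Score function of the Gaussian policy."""
    return (a - theta) / SIGMA ** 2

def reinforce_grad(theta, a):
    """Unbiased REINFORCE estimator over a batch of one-step 'trajectories'."""
    return np.mean(reward(a) * grad_log_pi(theta, a))

def importance_weight(theta_prev, theta, a):
    """Likelihood ratio pi_{theta_prev}(a) / pi_{theta}(a) for a ~ pi_{theta},
    used to correct the SARAH-style term for the shifted trajectory distribution."""
    log_w = (-(a - theta_prev) ** 2 + (a - theta) ** 2) / (2 * SIGMA ** 2)
    return np.exp(log_w)

def prox_l2(theta, eta, lam):
    """Proximal operator of the regularizer R(theta) = (lam/2) * theta^2,
    a stand-in for the composite term handled by ProxHSPGA."""
    return theta / (1.0 + eta * lam)

def prox_hspga_sketch(theta0, beta=0.9, eta=0.05, lam=0.01, batch=8, iters=200):
    theta = theta0
    # Initialize the estimate with plain REINFORCE.
    v = reinforce_grad(theta, sample_action(theta, batch))
    for _ in range(iters):
        theta_prev = theta
        # Proximal gradient-ascent step on the regularized expected return.
        theta = prox_l2(theta + eta * v, eta, lam)
        a = sample_action(theta, batch)  # fresh trajectories at the new iterate
        # SARAH-style recursive term: previous estimate plus an importance-weighted
        # gradient difference between the new and old iterates.
        w = importance_weight(theta_prev, theta, a)
        sarah = v + np.mean(reward(a) * (grad_log_pi(theta, a)
                                         - w * grad_log_pi(theta_prev, a)))
        unbiased = reinforce_grad(theta, sample_action(theta, batch))
        # Hybrid estimator: convex combination of the biased SARAH-style term
        # and the unbiased REINFORCE term, trading a small bias for lower variance.
        v = beta * sarah + (1.0 - beta) * unbiased
    return theta

print(prox_hspga_sketch(theta0=0.0))  # drifts toward A_STAR, slightly shrunk by the prox
```

With beta = 0 this collapses to proximal REINFORCE; with beta = 1 it is a purely recursive SARAH-type update. Intermediate values interpolate between the two, which is the bias-variance trade-off the abstract refers to.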

IBM Research, Thomas J. Watson Research Center, USA
International Conference on Artificial Intelligence and Statistics

Pham, N., Nguyen, L., Phan, D., Nguyen, P. H., van Dijk, M., & Tran-Dinh, Q. (2020). A hybrid stochastic policy gradient algorithm for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 374–385).