Classical regret minimization in a bandit framework involves a number of probability distributions, or arms, that are not known to the learner but that can be sampled from, or pulled. The learner's aim is to sequentially pull these arms so as to maximize the number of times the best arm is pulled, or equivalently, to minimize the regret associated with sub-optimal pulls. The best arm is classically defined as the one with the largest mean. Lower bounds on expected regret are well known, and lately, in great generality, efficient algorithms matching these lower bounds have been developed. In this paper we extend this methodology to a more general risk-reward set-up in which the best arm is the one with the lowest average loss (negative of the reward) plus a multiple of the Conditional Value-at-Risk (CVaR) of the loss distribution. CVaR is a popular tail-risk measure. Settings where risk is an important consideration typically involve heavy-tailed distributions. Unlike most of the previous literature, we allow all distributions with a known uniform bound on the moment of order 1+ε, thereby allowing for heavy-tailed bandits. We extend the lower bound of the classical regret-minimization setup to this setting and develop an index-based algorithm. Like the popular KL-UCB algorithm for the mean setting, our index is derived from the proposed lower bound and is based on the empirical-likelihood principle. We also propose anytime-valid confidence intervals for the mean-CVaR trade-off metric. En route, we develop concentration inequalities which may be of independent interest.
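To make the risk-reward objective concrete, the following sketch computes an empirical CVaR and the mean-plus-CVaR score that defines the best arm. The function names, the confidence level `alpha`, and the trade-off weight `c` are illustrative choices, not the paper's notation, and the plug-in tail estimator shown here is only one simple option (the paper's algorithm is instead built on an empirical-likelihood index).

```python
import numpy as np

def empirical_cvar(losses, alpha=0.95):
    """Empirical CVaR at level alpha: average of the losses at or above
    the empirical alpha-quantile (the Value-at-Risk)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)  # empirical VaR
    tail = losses[losses >= var]      # losses in the upper tail
    return tail.mean()

def risk_adjusted_loss(losses, c=1.0, alpha=0.95):
    """Mean loss plus c times CVaR: an illustrative version of the
    mean-CVaR trade-off metric; the best arm minimizes this quantity."""
    return np.mean(losses) + c * empirical_cvar(losses, alpha)

# Two simulated arms: one light-tailed, one heavy-tailed (Student-t with
# df = 2.5 has a finite moment of order 1+eps only for eps < 1.5).
rng = np.random.default_rng(0)
arms = [rng.normal(0.0, 1.0, size=10_000),
        rng.standard_t(df=2.5, size=10_000)]
scores = [risk_adjusted_loss(a, c=0.5) for a in arms]
best = int(np.argmin(scores))  # index of the arm with lowest risk-adjusted loss
```

A risk-neutral learner (c = 0) would rank arms by mean loss alone; increasing c penalizes arms whose loss distributions have heavy upper tails, which is exactly the regime the paper addresses.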
7th Indian Control Conference, ICC 2021
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands

Agrawal, S., Juneja, S., & Koolen-Wijkstra, W. M. (2021). Regret-minimization in risk-averse bandits. In Proceedings of the Indian Control Conference, ICC (pp. 195–200). doi:10.1109/ICC54714.2021.9703134