We present a Bayesian value-iteration framework for contextual multi-armed bandit problems that treats the agent's posterior distribution over the payoff as the state of a Markov decision process. We place finite-dimensional priors on the unknown reward parameters and on the exogenous context transition kernel. Value iteration on the resulting belief-MDP yields a Bayes-optimal policy. We illustrate the approach in an airline seat-pricing simulation. To address the curse of dimensionality, we approximate the value function with a dual-stream deep neural network and benchmark the resulting deep value-iteration algorithm on a standard contextual bandit instance.
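The belief-MDP construction in the abstract can be sketched in miniature. The following is a hypothetical illustration, not the paper's method: it assumes a context-free two-armed Bernoulli bandit, so the belief state reduces to a tuple of Beta-posterior parameters, and exact value iteration is a finite-horizon Bellman backup over posterior updates.

```python
from functools import lru_cache

HORIZON = 5  # number of remaining pulls (illustrative choice)

@lru_cache(maxsize=None)
def value(t, beliefs):
    """Bayes-optimal value at time t for a belief state given as a
    tuple of (alpha, beta) Beta-posterior parameters, one per arm."""
    if t == HORIZON:
        return 0.0
    best = float("-inf")
    for i, (a, b) in enumerate(beliefs):
        p = a / (a + b)  # posterior mean reward of arm i
        # Bellman backup over the two possible observations
        # (success/failure), each updating the pulled arm's posterior.
        succ = list(beliefs); succ[i] = (a + 1, b)
        fail = list(beliefs); fail[i] = (a, b + 1)
        q = p * (1.0 + value(t + 1, tuple(succ))) \
            + (1.0 - p) * value(t + 1, tuple(fail))
        best = max(best, q)
    return best

# Expected total reward over HORIZON pulls from uniform Beta(1,1) priors;
# exceeds the myopic baseline of 0.5 * HORIZON because the policy
# values the information gained from each pull.
v0 = value(0, ((1, 1), (1, 1)))
```

With context, the state would additionally carry the current context and the backup would integrate over the context transition kernel, which is what makes the exact recursion intractable and motivates the deep approximation described in the abstract.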

39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: MLxOR: Mathematical Foundations and Operational Integration of Machine Learning for Uncertainty-Aware Decision-Making

Duijndam, K., Koole, G., & van der Mei, R. (2025). Contextual value iteration and deep approximation for Bayesian contextual bandits. In Proceedings NeurIPS (Annual Conference on Neural Information Processing Systems) (pp. 18:1–18:5).