Working Paper
Poisson Limits of Bernoulli Bandits
Authors: Wenjia Ba, Lin Fan, Peter Glynn, J. Michael Harrison
We study Bernoulli bandits in the “hard luck” parameter regime. That is, the probability of success on any given trial is assumed to be small, regardless of what action the experimenter may choose, and regardless of what exogenous context may be observed. Adopting a general model formulation propounded by Kuang and Wager (2023), we prove a limit theorem that justifies approximating a Bernoulli bandit in the hard luck regime by a corresponding Poisson bandit. The latter is a continuous-time treatment selection model in which cumulative reward is driven by one of several possible Poisson processes, depending on the action or treatment selected by the experimenter; each of the Poisson processes has an average success rate that is initially unknown but can be learned through experimentation. Poisson bandits have several advantages over the Bernoulli bandits that they replace or approximate, including both sharper theoretical insights and ease of computation. To illustrate the computational advantage, we study the use of policy gradient methods to optimize the parameters of randomized treatment selection policies in both Bernoulli bandit and Poisson bandit settings. Our theory and simulation experiments indicate that gradient estimators for Poisson bandit models have much lower variance and yield significantly better policies than those for the Bernoulli bandit models to which they correspond.
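To make the two objects related by the limit theorem concrete, here is a minimal numerical sketch. The rates, trial counts, and softmax policy below are illustrative assumptions, not values or constructions from the paper. The first part checks the law of rare events underlying the approximation: the cumulative reward of n Bernoulli trials with per-trial success probability λ/n is close in distribution to a Poisson(λ) count. The second part draws generic score-function (REINFORCE) gradient samples for a randomized two-armed policy under the Poisson reward model; it illustrates the textbook estimator, not the authors' specific gradient construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Part 1: the Poisson limit (law of rare events) ---------------------
# With n trials and per-trial success probability p = lam / n ("hard
# luck" regime), the total success count Binomial(n, p) is close to
# Poisson(lam).  Illustrative numbers, not taken from the paper.
n, lam = 10_000, 2.5
p = lam / n

bern_totals = rng.binomial(n, p, size=100_000)   # cumulative Bernoulli reward
pois_totals = rng.poisson(lam, size=100_000)     # Poisson-limit surrogate

for k in range(6):
    print(f"P(total = {k}):  Bernoulli {np.mean(bern_totals == k):.4f}"
          f"  vs  Poisson {np.mean(pois_totals == k):.4f}")

# --- Part 2: a generic score-function (REINFORCE) gradient sample -------
# Hypothetical two-armed setup with a softmax policy pi(theta).  This is
# the standard estimator, not the paper's specific construction.
lams = np.array([2.0, 3.0])                      # arm reward rates (assumed)

def grad_sample(theta):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()                               # softmax action probabilities
    a = rng.choice(2, p=pi)                      # randomized treatment choice
    reward = rng.poisson(lams[a])                # cumulative Poisson reward
    score = -pi                                  # grad of log pi(a; theta) ...
    score[a] += 1.0                              # ... equals e_a - pi
    return reward * score

grads = np.array([grad_sample(np.zeros(2)) for _ in range(5_000)])
print("gradient mean:", grads.mean(axis=0), " variance:", grads.var(axis=0))
```

Matching the Poisson rate to n times the per-trial success probability is exactly the coupling that the limit theorem formalizes; the empirical distributions in Part 1 should agree to within Monte Carlo error.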
Date Published: 2025
Citation: Ba, Wenjia, Lin Fan, Peter Glynn, and J. Michael Harrison. 2025. Poisson Limits of Bernoulli Bandits. Working paper.