P. Mertikopoulos and W. H. Sandholm. Mathematics of Operations Research, vol. 41, no. 4, pp. 1297–1324, November 2016.
We investigate a class of reinforcement learning dynamics in which players adjust their strategies based on their actions' cumulative payoffs over time; specifically, by playing mixed strategies that maximize their expected cumulative payoff minus a strongly convex regularizing penalty term. In contrast to the class of penalty functions used to define smooth best responses in models of stochastic fictitious play, the regularizers used in this paper need not be infinitely steep at the boundary of the simplex; in fact, dropping this requirement gives rise to an important dichotomy between steep and nonsteep cases. In this general setting, our main results extend several properties of the replicator dynamics, such as the elimination of dominated strategies, the asymptotic stability of strict Nash equilibria, and the convergence of time-averaged trajectories to interior Nash equilibria in zero-sum games.
arXiv link: https://arxiv.org/abs/1407.6267
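
As an illustrative sketch of the dynamics described in the abstract (the notation below is ours, not necessarily the paper's): writing $y$ for a player's vector of cumulative payoffs, $h$ for the strongly convex penalty on the mixed-strategy simplex $\mathcal{X}$, and $v$ for the player's payoff field, the regularized best-response scheme takes the form

% Sketch only: Q is the choice map induced by the penalty h; y aggregates payoffs over time.
\[
  x \;=\; Q(y) \;=\; \operatorname*{arg\,max}_{x' \in \mathcal{X}} \big\{ \langle y, x' \rangle - h(x') \big\},
  \qquad
  \dot{y} \;=\; v(x).
\]

For instance, the (steep) entropic penalty $h(x) = \sum_i x_i \log x_i$ makes $Q$ the logit choice map and recovers the replicator dynamics, whereas a (nonsteep) quadratic penalty such as $h(x) = \tfrac{1}{2}\sum_i x_i^2$ induces projection-type dynamics whose trajectories can reach and leave the boundary of the simplex.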