[C31] - Learning with bandit feedback in potential games

J. Cohen, A. Héliou, and P. Mertikopoulos. In NeurIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017.

Abstract

This paper examines the equilibrium convergence properties of no-regret learning with exponential weights in potential games. To establish convergence with minimal information requirements on the players' side, we focus on two frameworks: the partial information case (where players have access to a noisy estimate of their payoff vectors, including strategies they did not play), and the bandit case (where players are only able to observe their in-game, realized payoffs). In the partial information case, we show that the induced sequence of play converges almost surely to a Nash equilibrium at a quasi-exponential rate. In the bandit case, the same result holds for $\varepsilon$-approximations of Nash equilibria if we introduce an exploration factor $\varepsilon > 0$ that guarantees that action choice probabilities never fall below $\varepsilon$. In particular, if the algorithm is run with a suitably decreasing exploration factor, the sequence of play converges to a bona fide Nash equilibrium with probability $1$.
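The bandit-feedback mechanism described in the abstract can be sketched as follows: a single player maintains cumulative score estimates, samples an action from an exponential-weights distribution mixed with uniform exploration (so no action probability falls below a floor tied to ε), and forms an importance-weighted estimate of the payoff vector from the one realized payoff. This is a minimal illustration, not the paper's exact scheme; the function names, the step size `eta`, and the uniform-mixing form of the exploration are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def ew_bandit_step(scores, payoff_fn, eta=0.1, eps=0.05):
    """One step of exponential weights under bandit feedback (illustrative sketch).

    `scores` holds cumulative payoff estimates, one per action.
    `payoff_fn(a)` returns the realized payoff of action `a`, assumed in [0, 1].
    Mixing with the uniform distribution keeps every action probability
    at least eps / num_actions, which bounds the variance of the
    importance-weighted estimate below.
    """
    num_actions = len(scores)
    logits = eta * scores
    weights = np.exp(logits - logits.max())          # numerically stable softmax
    probs = (1 - eps) * weights / weights.sum() + eps / num_actions
    action = rng.choice(num_actions, p=probs)
    payoff = payoff_fn(action)
    estimate = np.zeros(num_actions)
    estimate[action] = payoff / probs[action]        # unbiased payoff estimate
    return scores + estimate, action
```

In a game, each player would run this update on their own action set, with `payoff_fn` given by their realized in-game payoff; letting `eps` decay over the run corresponds to the vanishing exploration factor that the abstract says recovers convergence to an exact Nash equilibrium.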
