A. Giannou, K. Lotidis, P. Mertikopoulos, and E. V. Vlatakis-Gkaragkounis. In NeurIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

Multi-agent learning in stochastic $N$-player games is a notoriously difficult problem because, in addition to their changing strategic decisions, the players of the game must also contend with the fact that the game itself evolves over time, possibly in a very complicated manner. Because of this, the equilibrium convergence properties of popular learning algorithms – like policy gradient and its variants – are poorly understood, except in specific classes of games (such as potential or two-player, zero-sum games). In view of all this, we examine the long-run behavior of policy gradient methods with respect to Nash equilibrium policies that are second-order stationary (SOS) in a sense similar to the type of KKT sufficiency conditions used in optimization. Our analysis shows that SOS policies are locally attracting with high probability, and we show that policy gradient trajectories with gradient estimates provided by the Reinforce algorithm achieve an $\mathcal{O}(1/\sqrt{n})$ convergence rate to such equilibria if the method’s step-size is chosen appropriately. On the other hand, when the equilibrium in question is *deterministic*, we show that this rate can be improved dramatically and, in fact, policy gradient methods converge within a finite number of iterations in that case.