A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, and V. Cevher. In ICML '20: Proceedings of the 37th International Conference on Machine Learning, 2020.
In this paper, we focus on a theory-practice gap for Adam and its variants (AMSgrad, AdamNC, etc.). In practice, these algorithms are used with a constant first-order moment parameter $\beta_1$ (typically between 0.9 and 0.99). In theory, regret guarantees for online convex optimization require a rapidly decaying $\beta_1\to0$ schedule. We show that this is an artifact of the standard analysis and propose a novel framework that allows us to derive optimal, data-dependent regret bounds with a constant $\beta_1$, without further assumptions. We also demonstrate the flexibility of our analysis on a wide range of different algorithms and settings.
arXiv link: https://arxiv.org/abs/2003.09729
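To make the gap described in the abstract concrete, the following is a minimal sketch, not the paper's code. It assumes the standard AMSGrad update with step size alpha/sqrt(t) and a box constraint, where the weighted projection reduces to coordinate-wise clipping. The function name `amsgrad`, the `beta1_schedule` hook, the default hyperparameters, and the toy quadratic loss are all hypothetical choices made for illustration. The same routine runs either with a constant $\beta_1$ (the practical setting) or with a decaying $\beta_1$ schedule (the setting required by earlier regret analyses).

```python
import math


def amsgrad(grad_fn, x0, steps, alpha=0.1, beta1=0.9, beta2=0.999,
            eps=1e-8, lo=-1.0, hi=1.0, beta1_schedule=None):
    """Illustrative AMSGrad on the box [lo, hi]^d (assumed setup, not the paper's code).

    grad_fn(x, t) returns the gradient of the round-t loss at x.
    beta1_schedule(t), if given, overrides the constant beta1 at round t.
    """
    x = list(x0)
    m = [0.0] * len(x)      # first-moment (momentum) estimate
    v = [0.0] * len(x)      # second-moment estimate
    v_hat = [0.0] * len(x)  # running maximum of v (the AMSGrad correction)
    iterates = [list(x)]
    for t in range(1, steps + 1):
        g = grad_fn(x, t)
        b1 = beta1 if beta1_schedule is None else beta1_schedule(t)
        step = alpha / math.sqrt(t)
        for i in range(len(x)):
            m[i] = b1 * m[i] + (1.0 - b1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            v_hat[i] = max(v_hat[i], v[i])
            # With a diagonal metric, projecting onto a box is plain clipping.
            x[i] = x[i] - step * m[i] / (math.sqrt(v_hat[i]) + eps)
            x[i] = min(hi, max(lo, x[i]))
        iterates.append(list(x))
    return iterates


def quadratic_grad(x, t):
    # Toy loss f(x) = (x - 0.5)^2 at every round (an assumption for the demo).
    return [2.0 * (xi - 0.5) for xi in x]


# Practical setting: constant beta1.
constant_run = amsgrad(quadratic_grad, [-1.0], steps=2000, beta1=0.9)

# Setting assumed by earlier analyses: beta1 decaying to zero.
decaying_run = amsgrad(quadratic_grad, [-1.0], steps=2000,
                       beta1_schedule=lambda t: 0.9 / t)
```

Both runs use identical code apart from how $\beta_1$ is chosen at each round, which is exactly the parameter whose treatment differs between practice and the earlier theory.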