Towards better generalization of adaptive gradient methods

Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li

Research output: Contribution to journal › Conference article › peer-review



Adaptive gradient methods such as AdaGrad, RMSprop, and Adam have become the optimizers of choice for deep learning due to their fast training speed. However, it was recently observed that their generalization performance is often worse than that of SGD for over-parameterized neural networks. While new algorithms (such as AdaBound) have been proposed to improve the situation, the accompanying analyses address only optimization bounds for the training objective, leaving the critical question of generalization capacity unexplored. To close this gap, we propose Stable Adaptive Gradient Descent (SAGD) for non-convex optimization, which leverages differential privacy to boost the generalization performance of adaptive gradient methods. Theoretical analyses show that SAGD converges with high probability to a population stationary point. We further conduct experiments on a range of popular deep learning tasks and models. The experimental results illustrate that SAGD is empirically competitive with, and often better than, the baselines.
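The abstract's core idea can be illustrated with a minimal sketch: an Adam-style adaptive step for contrast, and a hypothetical variant that applies the standard differential-privacy mechanism (gradient clipping plus Gaussian noise) before the adaptive update. All names and hyperparameter defaults here are illustrative assumptions, not the paper's actual SAGD algorithm.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update (shown for contrast)."""
    m = b1 * m + (1 - b1) * g            # first-moment estimate
    v = b2 * v + (1 - b2) * g * g        # second-moment estimate
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def noisy_adaptive_step(w, g, v, rng, lr=1e-3, b2=0.999,
                        clip=1.0, sigma=0.01, eps=1e-8):
    """Hypothetical sketch: clip the gradient and inject Gaussian noise
    (the usual differential-privacy mechanism) before an RMSprop-style
    adaptive update. Illustrative only; not the paper's SAGD."""
    g = g / max(1.0, np.linalg.norm(g) / clip)     # bound sensitivity
    g = g + sigma * rng.standard_normal(g.shape)   # Gaussian mechanism
    v = b2 * v + (1 - b2) * g * g                  # adaptive 2nd moment
    return w - lr * g / (np.sqrt(v) + eps), v
```

The noise perturbs each step, which is what allows differential-privacy-style stability arguments to bound how much the learned model depends on any single training example.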

Original language: English (US)
Journal: Advances in Neural Information Processing Systems
State: Published - 2020
Externally published: Yes
Event: 34th Conference on Neural Information Processing Systems, NeurIPS 2020 - Virtual, Online
Duration: Dec 6, 2020 - Dec 12, 2020

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing


