
Deep Neural Network Training With Distributed K-FAC

  • J. Gregory Pauloski
  • Lei Huang
  • Weijia Xu
  • Kyle Chard
  • Ian T. Foster
  • Zhao Zhang

Research output: Contribution to journal › Article › peer-review

Abstract

Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is challenging. Increases in training scale have made natural gradient optimization methods a reasonable alternative to stochastic gradient descent and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher information matrix to improve per-iteration progress when optimizing an objective function. Here we propose a scalable K-FAC algorithm and investigate K-FAC's applicability to large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. We evaluate the convergence and scaling properties of our K-FAC gradient preconditioner for image classification, object detection, and language modeling applications. In all applications, our implementation converges to baseline performance targets in 9-25% less time than standard first-order optimizers on GPU clusters across a variety of scales.
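To make the preconditioning step described in the abstract concrete, the following NumPy sketch shows layer-wise Kronecker-factored preconditioning for a single fully connected layer. It is a minimal illustration of the general K-FAC technique (Martens & Grosse, 2015), not the paper's distributed implementation; the function name `kfac_precondition` and the damping value are assumptions made for illustration.

```python
import numpy as np

def kfac_precondition(grad_w, a, g, damping=1e-3):
    """Precondition a layer's weight gradient with Kronecker factors.

    K-FAC approximates a fully connected layer's Fisher block as
    F ~= A (x) G, where A = E[a a^T] is built from layer inputs and
    G = E[g g^T] from back-propagated pre-activation gradients. The
    preconditioned gradient is then G^{-1} dW A^{-1}, computed without
    ever forming the full Kronecker product F.

    grad_w : (out, in) weight gradient
    a      : (batch, in) layer input activations
    g      : (batch, out) gradients w.r.t. layer pre-activations
    """
    batch = a.shape[0]
    A = a.T @ a / batch  # input covariance factor
    G = g.T @ g / batch  # output-gradient covariance factor
    # Tikhonov damping (illustrative constant) keeps the factors invertible.
    A += damping * np.eye(A.shape[0])
    G += damping * np.eye(G.shape[0])
    # Kronecker identity: (G (x) A)^{-1} vec(dW) == vec(G^{-1} dW A^{-1}).
    return np.linalg.solve(G, grad_w) @ np.linalg.inv(A)

# Toy usage: one 4 -> 3 layer, batch of 8.
rng = np.random.default_rng(0)
a = rng.standard_normal((8, 4))
g = rng.standard_normal((8, 3))
grad_w = g.T @ a / 8
print(kfac_precondition(grad_w, a, g).shape)  # (3, 4)
```

Because each layer's factors are small relative to the full Fisher matrix, this per-layer structure is what makes the distribution strategies studied in the paper possible: different workers can compute and invert the factors for different layers.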

Original language: English (US)
Pages (from-to): 3616-3627
Number of pages: 12
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 33
Issue number: 12
State: Published - Dec 1 2022
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Keywords

  • Optimization methods
  • high-performance computing
  • neural networks
  • scalability
