Surpassing gradient descent provably

A cyclic incremental method with linear convergence rate

Aryan Mokhtari, Mert Gurbuzbalaban, Alejandro Ribeiro

Research output: Contribution to journal › Article

Abstract

Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and using the average of the n available gradients as an approximation of the full gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. In contrast to classic incremental methods, which utilize gradient averages but not iterate averages, the iterates of the proposed DIAG method use averages of both iterates and gradients. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.
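
The abstract describes the DIAG update precisely enough to sketch it: the method keeps one stored iterate and one stored gradient per component function, refreshes exactly one pair per iteration in cyclic order, and steps from the average of the stored iterates along the average of the stored gradients. The Python sketch below follows only that description; the function name diag, its arguments, and the fixed step size alpha are illustrative assumptions rather than the paper's exact specification, which ties the step size to the smoothness and strong-convexity constants of the component functions.

import numpy as np

def diag(grads, x0, alpha, num_iters):
    """Sketch of a double incremental aggregated gradient (DIAG) loop.

    grads: list of n callables; grads[i](x) returns the gradient of f_i at x.
    x0:    initial point (NumPy array).
    alpha: step size (assumed fixed here; in practice chosen from the
           strong-convexity and smoothness constants of the f_i).
    """
    n = len(grads)
    y = [x0.copy() for _ in range(n)]      # stored iterate y_i for each f_i
    g = [gi(x0) for gi in grads]           # stored gradient of f_i at y_i
    y_avg = x0.copy()                      # running average of stored iterates
    g_avg = sum(g) / n                     # running average of stored gradients

    x = x0.copy()
    for k in range(num_iters):
        i = k % n                          # cyclic choice of the index to refresh
        # Step from the average of stored iterates along the average of stored
        # gradients: uses both iterate and gradient averages, as the abstract states.
        x = y_avg - alpha * g_avg

        # Refresh only the i-th stored pair; update both averages incrementally.
        new_gi = grads[i](x)
        y_avg = y_avg + (x - y[i]) / n
        g_avg = g_avg + (new_gi - g[i]) / n
        y[i] = x.copy()
        g[i] = new_gi
    return x

# Toy usage on n regularized least-squares components (all names hypothetical).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, mu = 10, 5, 0.1
    A = rng.standard_normal((n, d, d))
    b = rng.standard_normal((n, d))
    grads = [lambda x, Ai=A[i], bi=b[i]: Ai.T @ (Ai @ x - bi) + mu * x
             for i in range(n)]
    x_hat = diag(grads, np.zeros(d), alpha=0.01, num_iters=2000)
    print(x_hat)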

Original language: English (US)
Pages (from-to): 1420-1447
Number of pages: 28
Journal: SIAM Journal on Optimization
Volume: 28
Issue number: 2
DOI: 10.1137/16M1101702
State: Published - Jan 1 2018

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science

Keywords

  • Finite sum minimization
  • Incremental methods
  • Large-scale optimization
  • Linear convergence rate
  • Worst-case analysis

Cite this

@article{85e5b67d64f64e46b5cd1ed9ef0f90db,
title = "Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate∗",
abstract = "Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and executing the average of the n available gradients as an approximate gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. The iterates of the proposed DIAG method uses averages of both iterates and gradients in contrast to classic incremental methods that utilize gradient averages but do not utilize iterate averages. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.",
keywords = "Finite sum minimization, Incremental methods, Large-scale optimization, Linear convergence rate, Worst-case analysis",
author = "Aryan Mokhtari and Mert Gurbuzbalaban and Alejandro Ribeiro",
year = "2018",
month = "1",
day = "1",
doi = "10.1137/16M1101702",
language = "English (US)",
volume = "28",
pages = "1420--1447",
journal = "SIAM Journal on Optimization",
issn = "1052-6234",
publisher = "Society for Industrial and Applied Mathematics Publications",
number = "2",

}

Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate. / Mokhtari, Aryan; Gurbuzbalaban, Mert; Ribeiro, Alejandro.

In: SIAM Journal on Optimization, Vol. 28, No. 2, 01.01.2018, p. 1420-1447.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Surpassing gradient descent provably

T2 - A cyclic incremental method with linear convergence rate

AU - Mokhtari, Aryan

AU - Gurbuzbalaban, Mert

AU - Ribeiro, Alejandro

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and using the average of the n available gradients as an approximation of the full gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. In contrast to classic incremental methods, which utilize gradient averages but not iterate averages, the iterates of the proposed DIAG method use averages of both iterates and gradients. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.

AB - Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and using the average of the n available gradients as an approximation of the full gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. In contrast to classic incremental methods, which utilize gradient averages but not iterate averages, the iterates of the proposed DIAG method use averages of both iterates and gradients. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.

KW - Finite sum minimization

KW - Incremental methods

KW - Large-scale optimization

KW - Linear convergence rate

KW - Worst-case analysis

UR - http://www.scopus.com/inward/record.url?scp=85049696599&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049696599&partnerID=8YFLogxK

U2 - 10.1137/16M1101702

DO - 10.1137/16M1101702

M3 - Article

VL - 28

SP - 1420

EP - 1447

JO - SIAM Journal on Optimization

JF - SIAM Journal on Optimization

SN - 1052-6234

IS - 2

ER -