### Abstract

Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and executing the average of the n available gradients as an approximate gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. The iterates of the proposed DIAG method uses averages of both iterates and gradients in contrast to classic incremental methods that utilize gradient averages but do not utilize iterate averages. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.

Original language | English (US) |
---|---|

Pages (from-to) | 1420-1447 |

Number of pages | 28 |

Journal | SIAM Journal on Optimization |

Volume | 28 |

Issue number | 2 |

DOIs | |

State | Published - Jan 1 2018 |

### Fingerprint

### All Science Journal Classification (ASJC) codes

- Software
- Theoretical Computer Science

### Keywords

- Finite sum minimization
- Incremental methods
- Large-scale optimization
- Linear convergence rate
- Worst-case analysis

### Cite this

^{∗}.

*SIAM Journal on Optimization*,

*28*(2), 1420-1447. https://doi.org/10.1137/16M1101702

}

^{∗}',

*SIAM Journal on Optimization*, vol. 28, no. 2, pp. 1420-1447. https://doi.org/10.1137/16M1101702

**Surpassing gradient descent provably : A cyclic incremental method with linear convergence rate ^{∗}.** / Mokhtari, Aryan; Gurbuzbalaban, Mert; Ribeiro, Alejandro.

Research output: Contribution to journal › Article

TY - JOUR

T1 - Surpassing gradient descent provably

T2 - A cyclic incremental method with linear convergence rate∗

AU - Mokhtari, Aryan

AU - Gurbuzbalaban, Mert

AU - Ribeiro, Alejandro

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and executing the average of the n available gradients as an approximate gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. The iterates of the proposed DIAG method uses averages of both iterates and gradients in contrast to classic incremental methods that utilize gradient averages but do not utilize iterate averages. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.

AB - Recently, there has been growing interest in developing optimization methods for solving large-scale machine learning problems. Most of these boil down to the problem of minimizing an average of a finite set of smooth and strongly convex functions where the number of functions n is large. The gradient descent (GD) method is successful in minimizing convex problems at a fast linear rate; however, it is not applicable to the considered large-scale optimization setting because of the high computational complexity. Incremental methods resolve this drawback of gradient methods by replacing the required gradient for the descent direction with an incremental gradient approximation. They operate by evaluating one gradient per iteration and executing the average of the n available gradients as an approximate gradient. Although incremental methods reduce the computational cost of GD, their convergence rates do not justify their advantage relative to GD in terms of the total number of gradient evaluations until convergence. In this paper, we introduce a double incremental aggregated gradient method (DIAG) that computes the gradient of only one function at each iteration, which is chosen based on a cyclic scheme, and uses the aggregated average gradient of all the functions to approximate the full gradient. The iterates of the proposed DIAG method uses averages of both iterates and gradients in contrast to classic incremental methods that utilize gradient averages but do not utilize iterate averages. We prove that not only does the proposed DIAG method converge linearly to the optimal solution, but also its linear convergence factor justifies the advantage of incremental methods over GD. In particular, we prove that the worst-case performance of DIAG is better than the worst-case performance of GD. Numerical experiments on quadratic programming and logistic regression problems showcase the advantage of DIAG relative to GD and other incremental methods.

KW - Finite sum minimization

KW - Incremental methods

KW - Large-scale optimization

KW - Linear convergence rate

KW - Worst-case analysis

UR - http://www.scopus.com/inward/record.url?scp=85049696599&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049696599&partnerID=8YFLogxK

U2 - 10.1137/16M1101702

DO - 10.1137/16M1101702

M3 - Article

VL - 28

SP - 1420

EP - 1447

JO - SIAM Journal on Optimization

JF - SIAM Journal on Optimization

SN - 1052-6234

IS - 2

ER -

^{∗}. SIAM Journal on Optimization. 2018 Jan 1;28(2):1420-1447. https://doi.org/10.1137/16M1101702