A Systematic Methodology for Continuous-Time Analysis of Accelerated Gradient Methods

SNU AI · Published in SNU AIIS Blog · Jul 24, 2022

In machine learning, we often minimize a loss function, driving the iterates toward a solution point so that the gap between the two converges to zero.

We impose the reasonable structural assumption that the function is convex, which has the convenient property that every local minimum is also a global minimum. For a convex and differentiable function f, our optimization problem is to minimize f, where x_★ denotes a minimizer (if one exists) and f_★ denotes the optimal value of the problem.
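In symbols, the problem reads (we state the standard form, assuming the domain is ℝⁿ):

\[ \underset{x \in \mathbb{R}^n}{\text{minimize}} \;\; f(x), \qquad f_\star = f(x_\star) = \inf_{x \in \mathbb{R}^n} f(x). \]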

Gradient descent is a basic first-order algorithm for solving this convex optimization problem. However, naive gradient descent often zig-zags across the level sets and converges slowly.
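With step size α > 0, its update is

\[ x_{k+1} = x_k - \alpha \nabla f(x_k). \]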

Developing efficient first-order methods with faster rates of convergence has received great interest in convex optimization, leading to many modifications of gradient descent. The celebrated accelerated gradient method (AGM) of Nesterov (1983) makes the function value decrease at an accelerated rate of O(1/k²). That is, the error at the k-th iterate x_k (LHS) is bounded by a function of k (RHS) of order O(1/k²).

O(1/k²) rate of convergence achieved by AGM. Note that the numerator is a constant where X_0 is the initial point.
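For an L-smooth convex f, the bound has the following well-known form (the exact constant varies slightly across variants of the method):

\[ f(x_k) - f_\star \le \frac{2L \, \| x_0 - x_\star \|^2}{(k+1)^2} = O(1/k^2). \]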

How can we arrive at such a proof of the rate of convergence? A widely used technique is to construct a Lyapunov function and exploit the fact that it is nonincreasing. In the context of first-order convex optimization, a typical Lyapunov function V(k) has the form below, where C(k) is some positive function. The primary work in this procedure is dedicated to arranging the terms that constitute the function C(k).

An example Lyapunov function V(k), and the following Lyapunov analysis to prove AGM
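Schematically (z_k here is a placeholder for whatever auxiliary sequence the analysis tracks, not notation from the paper), the argument runs:

\[ V(k) = C(k)\big(f(x_k) - f_\star\big) + \|z_k - x_\star\|^2, \qquad V(k+1) \le V(k) \;\Longrightarrow\; f(x_k) - f_\star \le \frac{V(0)}{C(k)}. \]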

Analyzing accelerated gradient methods from a continuous perspective

The gradient algorithm discussed above is presented in a discrete setting, since we analyze a sequence of points generated by a recursive rule. A continuous-time model, on the other hand, describes quantities that change continuously in time, using an ordinary differential equation (ODE) as its mathematical tool. Continuous-time models are much simpler to analyze than their discrete counterparts (one reason is that we can use differentials and integrals), and the extensively explored ODE modeling ideas of mathematical physics can contribute intuition toward better optimization methods.

Given t = 𝛼k, we can transform the discrete gradient descent recursion into an equivalent ODE in continuous time. It is implicitly assumed that the step size 𝛼 is sufficiently small.
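Concretely, rewriting the gradient descent step and sending α → 0 with X(αk) ≈ x_k yields the gradient flow ODE:

\[ \frac{x_{k+1} - x_k}{\alpha} = -\nabla f(x_k) \quad \xrightarrow{\;\alpha \to 0\;} \quad \dot{X}(t) = -\nabla f(X(t)). \]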

Taking advantage of the continuous framework, continuous-time analysis of accelerated gradient methods is composed of the following steps:

  1. Model an ODE that has a fast convergence rate.
  2. Analyze and prove the ODE’s convergence rate in the continuous setting using a Lyapunov function.
  3. Discretize, or transfer the ODE into its discrete counterpart, in order to implement the algorithm for actual computation.
  4. Perform convergence analysis on the discretized version as well (see the numerical sketch after this list).
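To make steps 3 and 4 concrete, here is a minimal numerical sketch, not the paper's discretization: we forward-Euler discretize the AGM ODE introduced below on a simple quadratic and check that the error decays roughly like 1/k². All names and parameters are illustrative.

```python
import numpy as np

# Forward-Euler discretization of the AGM ODE
#   X''(t) + (3/t) X'(t) + grad f(X(t)) = 0
# on the quadratic f(x) = 0.5 * sum(d_i * x_i^2), whose minimum value is 0.

def agm_ode_euler(x0, d, h=1e-2, num_steps=5_000):
    x, v = x0.astype(float), np.zeros_like(x0, dtype=float)
    t = h  # start just after 0 to avoid the 3/t singularity
    errors = []
    for _ in range(num_steps):
        x, v = x + h * v, v + h * (-(3.0 / t) * v - d * x)
        t += h
        errors.append(0.5 * np.dot(d * x, x))  # f(x_k) - f_star
    return np.array(errors)

d = np.array([1.0, 4.0])            # curvatures of the quadratic
errors = agm_ode_euler(np.ones(2), d)
k = np.arange(1, errors.size + 1)
# If the accelerated rate survives discretization, errors * k^2 stays bounded.
print("max of errors * k^2 over the last half:", (errors * k**2)[errors.size // 2:].max())
```

Naive Euler steps are not guaranteed to preserve acceleration in general, which is exactly why a principled discretization (step 3) and its own convergence analysis (step 4) matter.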

Although the procedure is simpler by nature, continuous-time convergence analysis still suffers from a bottleneck: finding an appropriate Lyapunov function. The analyses in prior work often start by stating a Lyapunov function of unclear origin. In truth, these Lyapunov functions are obtained through many hours of trial and error.

In a paper accepted at ICML 2022, we provide a systematic methodology for obtaining such Lyapunov functions. Using the novel methodology, we make the following contributions:

  1. We recover many known continuous-time analyses of Nesterov-type acceleration in a streamlined manner.
  2. We perform the first continuous-time analysis of OGM-G, an acceleration mechanism that was far less understood than Nesterov’s AGM.
  3. We show that the central idea of our method enables a direct and “natural” discretization based on numerical analysis that preserves the accelerated rate of convergence.

In this blog post, we will give a step-by-step explanation of our main methodology and show how it works for OGM-G.

Our idea: conservation laws from dilated coordinates

Our proposed recipe for continuous-time analysis is to (1) perform a coordinate change and then (2) obtain a conservation law. We will explain our approach by applying it to the classic AGM ODE, an ODE model of Nesterov’s AGM introduced by Su et al. (2014).

(1) Perform a coordinate change

1. The given AGM ODE in the X coordinate is as follows, for r = 3 (the numerator of the coefficient r/t of Ẋ) and t ∈ (0, ∞).

AGM ODE in the X coordinate. For the sake of notational brevity, we write X in place of X(t).
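Written out, this is the ODE of Su et al. (2014):

\[ \ddot{X} + \frac{r}{t}\dot{X} + \nabla f(X) = 0, \qquad r = 3, \quad X(0) = x_0, \;\; \dot{X}(0) = 0. \]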

2. Perform the following coordinate change from X to W. We call W a dilated coordinate, since we multiply the original coordinate by a power of time t, an increasing quantity.
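Mirroring the OGM-G dilation given later in this post, the coordinate change takes the form

\[ W(t) = t^{\alpha}\big(X(t) - x_\star\big). \]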

3. We choose 𝛼 = 2 in anticipation of the O(1/t²) rate achieved by AGM. This results in the following AGM ODE in the W coordinate.

AGM ODE in the W coordinate with 𝛼 = 2
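With α = 2, one can verify by direct substitution (up to the paper's exact normalization) that W = t²(X − x_⋆) satisfies

\[ \ddot{W} - \frac{1}{t}\dot{W} + t^{2}\,\nabla f(X) = 0. \]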

(2) Obtain a conservation law

Our goal is to form the Lyapunov functions used in prior analyses in a systematic way. We discover that after deriving a conservation law from the ODE in the dilated coordinate system, the resulting constant function contains exactly the terms that constitute the Lyapunov function.

In physics, the law of conservation of mechanical energy states that the total energy, the sum of kinetic and potential energy, remains constant. Such a relation is derived from a differential equation by taking an inner product with the velocity and integrating both sides from t_0 to t. In the same spirit, we take an inner product with a “velocity” and integrate with respect to time, obtaining a constant as the result.
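For example, for Newton's equation mẌ = −∇U(X), pairing with the velocity Ẋ gives

\[ \big\langle \dot{X},\; m\ddot{X} + \nabla U(X) \big\rangle = \frac{d}{dt}\Big[ \frac{1}{2} m \|\dot{X}\|^{2} + U(X) \Big] = 0, \]

so kinetic plus potential energy is constant in time.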

To continue with our example,

4. Take the inner product between Ẇ (the “velocity”) and the AGM ODE in the W coordinate. Note that the chain rule is applied to the second term.
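As a sketch, under the normalization above (dividing the W-coordinate ODE by t² so that the terms integrate cleanly), the computation reads:

\[ \Big\langle \dot{W},\; \frac{1}{t^{2}}\ddot{W} - \frac{1}{t^{3}}\dot{W} + \nabla f(X) \Big\rangle = \frac{d}{dt}\Big[ \frac{1}{2}\Big\|\frac{\dot{W}}{t}\Big\|^{2} + t^{2}\big(f(X) - f_\star\big) \Big] + 2t\Big( \big\langle \nabla f(X), X - x_\star \big\rangle - \big(f(X) - f_\star\big) \Big) = 0. \]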

5. Integrate both sides from t_0 to t to obtain the corresponding conservation law. The first two terms of the conservation law form exactly the Lyapunov function presented by the authors of the AGM ODE.

Conservation law for the AGM ODE
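Integrating the identity above from t₀ to t gives, up to the paper's exact normalization,

\[ \frac{1}{2}\Big\|\frac{\dot{W}(t)}{t}\Big\|^{2} + t^{2}\big(f(X(t)) - f_\star\big) + \int_{t_0}^{t} 2s \Big( \big\langle \nabla f(X(s)), X(s) - x_\star \big\rangle - \big(f(X(s)) - f_\star\big) \Big)\, ds = \text{constant}. \]

The first two terms recover the energy of Su et al. (2014), since ½‖Ẇ/t‖² = 2‖X + (t/2)Ẋ − x_⋆‖².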

6. Since f is convex, the integrand is nonnegative, and we conclude with the promised O(1/t²) convergence rate.

Proof of O(1/t²) convergence rate of AGM
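In symbols: convexity gives ⟨∇f(x), x − x_⋆⟩ ≥ f(x) − f_⋆, so the integral term is nonnegative and

\[ t^{2}\big(f(X(t)) - f_\star\big) \le \text{constant} \quad\Longrightarrow\quad f(X(t)) - f_\star \le \frac{\text{constant}}{t^{2}} = O(1/t^{2}). \]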

General form of conservation laws

We have verified our methodology by using a generalized form of the conservation law to recover the analyses of prior works. In fact, we could simplify the analyses of the AGM ODE with r > 3 by Attouch et al. (2018), the AGM ODE with r < 3 by Attouch et al. (2019), the AGM ODE under a growth condition by Aujol et al. (2019), SC-AGM by Wilson et al. (2021), and the gradient flow ODE.

Canonical form of ODE used in our analysis
General form of conservation law used in our analysis

Novel continuous-time analysis of OGM-G

We now present a novel ODE model of OGM-G by Kim & Fessler, a method that optimally reduces the squared gradient magnitude (rather than the function value) in smooth convex minimization.

1. Obtain the OGM-G ODE.

OGM-G ODE
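As we read it, with the damping written as r/(t − T) to parallel the r/t coefficient of the AGM ODE, the case r = −3 gives

\[ \ddot{X} + \frac{3}{T - t}\dot{X} + \nabla f(X) = 0, \qquad t \in [0, T). \]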

2. Choose the dilated coordinate W = (T − t)^𝛼 (X − X_c) for some X_c ∈ ℝ^n. Choose 𝛼 = −2 in anticipation of the O(1/T²) rate.

3. Obtain the corresponding conservation law.

Conservation law for OGM-G ODE

Note that each term in the conservation law has a power of T − t in its denominator, which vanishes as t approaches the terminal time T.

4. To avoid undefined terms due to division by zero, we characterize the dynamics of the solution X to the OGM-G ODE near the terminal time t = T.

Continuously extend X(t), Ẋ(t), and Ẍ(t) to t = T

5. Define the Lyapunov function Φ(t) from the first three terms of the conservation law, with X_c = X(T). This function is monotonically nonincreasing.

Lyapunov function for OGM-G

6. Using the limit of Φ(t) as t → T, computed via L’Hôpital’s rule, we prove that the extended solution X exhibits the promised rate of convergence.

Proof of the O((f(x_0) − f_★)/T²) convergence rate of the OGM-G ODE

7. Similarly, extend the analysis to r < −3 to obtain a generalized proof for OGM-G.

Takeaways

In this post, we outlined the preliminaries for proving convergence rates of convex optimization methods. We also discussed how continuous-time models are used to analyze accelerated gradient methods. Motivated by the difficulty of finding a Lyapunov function during such analyses, we presented a methodology that derives conservation laws in dilated coordinate systems. Using this methodology, we conducted a novel continuous-time analysis of OGM-G. We hypothesize that our dilated coordinates can simplify analyses in setups beyond our exploration, such as stochastic differential equations modeling stochastic optimization.

written by Sue Hyun Park

Acknowledgement

This blog post covers the following paper from Seoul National University:

Jaewook Suh, Gyumin Roh, and Ernest K. Ryu. Continuous-Time Analysis of Accelerated Gradient Methods via Conservation Laws in Dilated Coordinate Systems. ICML 2022.

Thanks to Jaewook Suh for providing background and sharing valuable feedback on this blog post. The views and opinions expressed in this post are solely of the authors.
