Non-asymptotic Analysis of Momentum for Quadratic Functions
(Polyak’s) Momentum is one of the most popular optimization techniques. When minimizing $f(x)$, gradient descent with momentum is simply
\[x_{t+1}\gets x_t - \eta\nabla f(x_t) + \gamma(x_t-x_{t-1}).\]It is well known that when $f$ is quadratic, i.e. when $f(x)=\frac{1}{2}x^TAx-b^Tx$, gradient descent with momentum converges linearly, assuming that $A$ is positive definite. In this case the algorithm is a linear fixed point iteration, so one only needs to analyze the spectrum of the iteration matrix; see, for instance, the analyses in the two lecture notes listed in the references.
However, these results are asymptotic: what they show is that
\[\Vert x_T-x^*\Vert\le C\left(1-\frac{1}{\sqrt{Q}}\right)^T\Vert x_0-x^*\Vert,\]where $C$ is a constant that is independent of $T$ but is not explicitly determined. In this post, I will fix this by bounding $C$ explicitly, which turns out not to be much trouble.
Basic setup
The objective function is $f(x)=\frac{1}{2}x^TAx-b^Tx$, where $A\in\mathbb{R}^{d\times d}$ satisfies $mI \preccurlyeq A\preccurlyeq LI$ with $m>0$. $Q=L/m$ is the condition number of $A$.
We choose the learning rate $\eta = 1/L$ and the momentum parameter $\gamma = 1+\frac{1}{2Q}-\sqrt{\frac{2}{Q}}$.
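For concreteness, here is a minimal NumPy sketch of the iteration with these parameter choices. The matrix `A`, vector `b`, dimension, and iteration count are arbitrary toy choices for illustration, not anything prescribed by the analysis:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x with a positive definite A.
rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)                    # positive definite by construction
b = rng.standard_normal(d)

eigs = np.linalg.eigvalsh(A)               # eigenvalues in ascending order
m, L = eigs[0], eigs[-1]                   # mI <= A <= LI
Q = L / m                                  # condition number
eta = 1.0 / L                              # learning rate
gamma = 1 + 1 / (2 * Q) - np.sqrt(2 / Q)   # momentum parameter

x_star = np.linalg.solve(A, b)             # the unique minimizer
x_prev = x = np.zeros(d)
for t in range(200):
    # x_{t+1} = x_t - eta * grad f(x_t) + gamma * (x_t - x_{t-1})
    x, x_prev = x - eta * (A @ x - b) + gamma * (x - x_prev), x
print(np.linalg.norm(x - x_star))          # should be essentially zero
```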
Now, observe that GD with momentum can be written in the matrix form
\[\left[\begin{matrix} x_{t+1}\\ x_{t} \end{matrix}\right]\gets \left[\begin{matrix} (1+\gamma)I & -\gamma I\\ I & 0 \end{matrix}\right] \left[\begin{matrix} x_{t}\\ x_{t-1} \end{matrix}\right]-\left[\begin{matrix} \eta (Ax_t-b)\\ 0 \end{matrix}\right].\]Some simple manipulation will show that this is a fixed point iteration:
\[\left[\begin{matrix} x_{t+1}-x^*\\ x_{t}-x^* \end{matrix}\right]\gets \left[\begin{matrix} (1+\gamma)I-\eta A & -\gamma I\\ I & 0 \end{matrix}\right] \left[\begin{matrix} x_{t}-x^*\\ x_{t-1}-x^* \end{matrix}\right].\]Define $M:=\left[\begin{matrix} (1+\gamma)I-\eta A & -\gamma I\\ I & 0 \end{matrix}\right]$ to be the iteration matrix. Then
\[\Vert x_{t}-x^*\Vert_2\le \left\Vert\left[\begin{matrix} x_{t}-x^*\\ x_{t-1}-x^* \end{matrix}\right]\right\Vert= \left\Vert M^t\left[\begin{matrix} x_{0}-x^*\\ x_{0}-x^* \end{matrix}\right]\right\Vert_2\le 2\left\Vert M^t\right\Vert_2 \Vert x_0-x^*\Vert.\]So it all boils down to upper bounding $\Vert M^t\Vert_2$.
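This reduction is easy to sanity-check numerically. In the self-contained sketch below (same arbitrary toy problem as above), running the two-term recursion for $T$ steps matches applying $M^T$ to the stacked initial error:

```python
import numpy as np

# Same toy problem as in the previous sketch.
rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)
b = rng.standard_normal(d)
eigs = np.linalg.eigvalsh(A)
m, L = eigs[0], eigs[-1]
Q = L / m
eta, gamma = 1 / L, 1 + 1 / (2 * Q) - np.sqrt(2 / Q)
x_star = np.linalg.solve(A, b)

# The iteration matrix M from the fixed-point form.
I = np.eye(d)
M = np.block([[(1 + gamma) * I - eta * A, -gamma * I],
              [I, np.zeros((d, d))]])

# Run momentum GD for T steps...
T = 50
x_prev = x = np.zeros(d)
for _ in range(T):
    x, x_prev = x - eta * (A @ x - b) + gamma * (x - x_prev), x

# ...and compare with M^T applied to the stacked initial error
# [x_0 - x*; x_{-1} - x*], where x_0 = x_{-1} = 0.
z0 = np.concatenate([-x_star, -x_star])
zT = np.linalg.matrix_power(M, T) @ z0
print(np.linalg.norm(zT[:d] - (x - x_star)))   # ~0 up to rounding
```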
Eigenvalues
First, what can we say about the eigenvalues of $M$? Observe that (when $\lambda\neq 0$),
\[\begin{align*} \det(\lambda I-M) &= \det\left(\left[\begin{matrix} (\lambda-1-\gamma) I+\eta A & \gamma I\\ -I & \lambda I \end{matrix}\right]\right)\\ &= \det\left(\left[\begin{matrix} (\lambda-1-\gamma+\frac{\gamma}{\lambda}) I+\eta A & \gamma I\\ 0 & \lambda I \end{matrix}\right]\right)\\ &= \lambda^d\det\left((\lambda-1-\gamma+\frac{\gamma}{\lambda}) I+\eta A\right)\\ &= \det\left((\lambda^2-(1+\gamma)\lambda+\gamma)I+\lambda \eta A\right). \end{align*}\]If $\lambda$ is an eigenvalue of $M$, then $\det\left((\lambda^2-(1+\gamma)\lambda+\gamma)I+\lambda \eta A\right)=0$; in other words, for some eigenvalue $r$ of $\eta A$,
\[\lambda^2-(1+\gamma)\lambda+\gamma+\lambda r= \lambda^2-(1+\gamma- r)\lambda+\gamma=0.\]Recall that the eigenvalues of $A$ fall in $[m,L]$ and $\eta=1/L$, so $r\in [1/Q,1]$. As a result,
\[\begin{align*} 4\gamma-(1+\gamma-r)^2 &\ge 4\gamma-\left(1+\gamma-\frac{1}{Q}\right)^2 \\ &= 4\left(1+\frac{1}{2Q}-\sqrt{\frac{2}{Q}}\right)-\left(2-\frac{1}{2Q}-\sqrt{\frac{2}{Q}}\right)^2\\ &\ge \frac{1}{3Q}>0. \end{align*}\]Therefore, the solutions to $\lambda^2-(1+\gamma- r)\lambda+\gamma=0$ are
\[\frac{1+\gamma-r\pm\sqrt{4\gamma-(1+\gamma-r)^2}i}{2},\]whose magnitude is exactly $\sqrt{\gamma}$. Therefore, $\rho(M)=\sqrt{\gamma}\le 1-\frac{1}{3\sqrt{Q}}$. This implies that $\lim_{t\to\infty} M^t=0$, and hence the (asymptotic) linear convergence of momentum. However, since $M$ is not symmetric, this doesn’t directly give us a meaningful upper bound on $\Vert M\Vert_2$ or $\Vert M^t\Vert_2$.
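A quick numerical check of this eigenvalue claim, again on the arbitrary toy problem from the earlier sketches:

```python
import numpy as np

# Same toy problem; check that every eigenvalue of M has magnitude sqrt(gamma).
rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)
eigs = np.linalg.eigvalsh(A)
m, L = eigs[0], eigs[-1]
Q = L / m
eta, gamma = 1 / L, 1 + 1 / (2 * Q) - np.sqrt(2 / Q)

I = np.eye(d)
M = np.block([[(1 + gamma) * I - eta * A, -gamma * I],
              [I, np.zeros((d, d))]])
lams = np.linalg.eigvals(M)
print(np.abs(lams).min(), np.abs(lams).max(), np.sqrt(gamma))  # all three agree
print(np.sqrt(gamma) <= 1 - 1 / (3 * np.sqrt(Q)))              # True
```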
Eigenvectors
What’s missing in the previous analysis is actually the eigenvectors of $M$. Consider the equation
\[\left[\begin{matrix} (1+\gamma)I-\eta A & -\gamma I\\ I & 0 \end{matrix}\right]\left[\begin{matrix} u\\ v \end{matrix}\right]=\left[\begin{matrix} \lambda u\\\lambda v \end{matrix}\right],\]where $\lambda$ is a complex eigenvalue of $M$. Then we can see that $u=\lambda v$, and that
\[(1+\gamma)\lambda v-\eta\lambda Av-\gamma v= \lambda^2v.\]Dividing by $\lambda$ and using $\lambda^2-(1+\gamma-r)\lambda+\gamma=0$ (i.e. $\lambda+\gamma/\lambda=1+\gamma-r$), this is equivalent to $\eta A v = r v$; in other words, $v$ is an eigenvector of $\eta A$ with eigenvalue $r$. Since $A$ has a complete set of eigenvectors, we can write the eigenvectors of $M$ as
\[U:=\left[\begin{matrix} \lambda_1 v_1 & \lambda_1^* v_1&\lambda_2 v_2&\cdots& \lambda_{d}v_{d}&\lambda_{d}^*v_{d}\\ v_1 & v_1 & v_2&\cdots& v_{d}&v_{d} \end{matrix}\right].\]Since $Im(\lambda_i)>0$ for every $i$, we have $\lambda_i\neq\lambda_i^*$, so the two columns in each pair are linearly independent; columns from different pairs are independent because the $v_i$ are orthogonal. Hence this is a complete set of eigenvectors for $M$. Assume that the $v_i$ are normalized such that $\Vert [\lambda_i v_i;v_i]\Vert^2 = (1+\vert\lambda_i\vert^2)\Vert v_i\Vert^2=1.$ Then we can diagonalize $M$ as $M=UDU^{-1}$, where
\[D=diag([\lambda_1\;\lambda_1^{*}\;\cdots\;\lambda_{d}\;\lambda_{d}^{*}]).\]Since $\Vert M^t\Vert_2\le \Vert U\Vert_2\Vert U^{-1}\Vert_2\cdot \Vert D\Vert^t$, what remains is to bound the norm of $U$ and $U^{-1}$. The key observation is that,
\[U^\dagger U=\left[\begin{matrix} 1 & \frac{1+\left(\lambda_1^*\right)^2}{1+|\lambda_1|^2} & 0& 0& \cdots & 0& 0\\ \frac{1+\lambda_1^2}{1+|\lambda_1|^2} & 1 & 0 & 0 & \cdots & & \\ 0 & 0 & 1 & \frac{1+\left(\lambda_2^*\right)^2}{1+|\lambda_2|^2} & & \vdots& \vdots\\ 0 & 0 & \frac{1+\lambda_2^2}{1+|\lambda_2|^2} & 1 & & &\\ \vdots & \vdots& & & \ddots& &\\ 0 & &\cdots & & &1 &\frac{1+\left(\lambda_{d}^*\right)^2}{1+|\lambda_{d}|^2}\\ 0 & &\cdots & & &\frac{1+\lambda_{d}^2}{1+|\lambda_{d}|^2} & 1 \end{matrix}\right],\]which is a block diagonal matrix with $2\times 2$ blocks (the off-diagonal blocks vanish because the $v_i$ are orthogonal).
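Before continuing, here is a numerical sanity check of this structure on the same toy problem; it also verifies, ahead of time, the $1/(96Q)$ lower bound on the smallest block eigenvalue derived below:

```python
import numpy as np

# Same toy problem; build U from the eigenpairs of A.
rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)
a, V = np.linalg.eigh(A)                     # columns of V: orthonormal eigenvectors
m, L = a[0], a[-1]
Q = L / m
eta, gamma = 1 / L, 1 + 1 / (2 * Q) - np.sqrt(2 / Q)

cols, lam_mins = [], []
for i in range(d):
    r = eta * a[i]
    lam = ((1 + gamma - r) + 1j * np.sqrt(4 * gamma - (1 + gamma - r) ** 2)) / 2
    v = V[:, i] / np.sqrt(1 + abs(lam) ** 2)          # normalization from the text
    cols.append(np.concatenate([lam * v, v]))         # eigenvector for lambda_i
    cols.append(np.concatenate([lam.conj() * v, v]))  # eigenvector for its conjugate
    # Smallest eigenvalue of the 2x2 block [[1, c*], [c, 1]] is 1 - |c|.
    lam_mins.append(1 - abs(1 + lam ** 2) / (1 + abs(lam) ** 2))
U = np.stack(cols, axis=1)

Gram = U.conj().T @ U
off = Gram.copy()
for i in range(0, 2 * d, 2):                 # zero out the 2x2 diagonal blocks
    off[i:i+2, i:i+2] = 0
print(np.abs(off).max())                     # ~0: U^dagger U is block diagonal
print(min(lam_mins) >= 1 / (96 * Q))         # expected True, per the bound below
```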
Hence,
\[\Vert U^\dagger U\Vert_2 \le \max_i\left\Vert\left[\begin{matrix} 1 &\frac{1+\left(\lambda_{i}^*\right)^2}{1+|\lambda_{i}|^2}\\ \frac{1+\lambda_{i}^2}{1+|\lambda_{i}|^2} & 1 \end{matrix}\right]\right\Vert_2 \le 2.\]The more difficult part is
\[\Vert U^{-1}(U^{-1})^\dagger\Vert_2=\Vert (U^\dagger U)^{-1}\Vert_2\le \max_i\left\Vert\left[\begin{matrix} 1 &\frac{1+\left(\lambda_{i}^*\right)^2}{1+|\lambda_{i}|^2}\\ \frac{1+\lambda_{i}^2}{1+|\lambda_{i}|^2} & 1 \end{matrix}\right]^{-1}\right\Vert_2.\]Recall that
\[Im(\lambda_i)>\frac{1}{2}\sqrt{\frac{1}{3Q}}\ge \frac{1}{4\sqrt{Q}},\]and that $\vert\lambda_i\vert=\sqrt{\gamma}<1$. Let $\theta:=Arg(\lambda_i)$, then $\sin\theta>\frac{1}{4\sqrt{Q}}$. Thus
\[\lambda_{\min}\left(\left[\begin{matrix} 1 &\frac{1+\left(\lambda_{i}^*\right)^2}{1+|\lambda_{i}|^2}\\ \frac{1+\lambda_{i}^2}{1+|\lambda_{i}|^2} & 1 \end{matrix}\right]\right)= 1-\frac{| 1+\lambda_i^2|}{1+|\lambda_{i}|^2}=1-\sqrt{1-\frac{2\gamma (1-\cos(2\theta))}{(1+\gamma)^2}}\ge\frac{1}{96Q}.\](The factor $96$ is likely a vast overestimate.) Now we can see that
\[\Vert U\Vert_2 \le \sqrt{2},\quad \Vert U^{-1}\Vert_2\le \sqrt{96Q}.\]Therefore
\[\Vert M^t\Vert_2\le \Vert U\Vert_2\Vert U^{-1}\Vert_2\cdot \Vert D\Vert^t\le 14\sqrt{Q}\left(1-\frac{1}{3\sqrt{Q}}\right)^t\](using $\sqrt{2}\cdot\sqrt{96Q}=\sqrt{192Q}\le 14\sqrt{Q}$). This leads to an explicit convergence bound for momentum:
\[\Vert x_T-x^*\Vert\le 28\sqrt{Q}\left(1-\frac{1}{3\sqrt{Q}}\right)^T\Vert x_0-x^*\Vert.\]
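As a final sanity check, the sketch below (same arbitrary toy problem as before) verifies the explicit envelope on $\Vert M^t\Vert_2$, and hence the bound above, numerically:

```python
import numpy as np

# Same toy problem; compare ||M^t||_2 against 14 sqrt(Q) (1 - 1/(3 sqrt(Q)))^t.
rng = np.random.default_rng(0)
d = 20
G = rng.standard_normal((d, d))
A = G @ G.T + np.eye(d)
eigs = np.linalg.eigvalsh(A)
m, L = eigs[0], eigs[-1]
Q = L / m
eta, gamma = 1 / L, 1 + 1 / (2 * Q) - np.sqrt(2 / Q)

I = np.eye(d)
M = np.block([[(1 + gamma) * I - eta * A, -gamma * I],
              [I, np.zeros((d, d))]])

Mt = np.eye(2 * d)
for t in range(1, 101):
    Mt = Mt @ M
    envelope = 14 * np.sqrt(Q) * (1 - 1 / (3 * np.sqrt(Q))) ** t
    assert np.linalg.norm(Mt, 2) <= envelope   # spectral norm vs. the bound
print("bound holds for t = 1..100")
```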
Closing remarks
A natural question is: is this bound on the constant factor tight? I have no idea, but the factor at least seems natural: the convergence bound of Conjugate Gradient (stated in terms of the Euclidean norm of the error) also carries a $\sqrt{Q}$ factor.
Another interesting question is: does the block-diagonal structure of $U^\dagger U$ have additional meaning? Here, we only used it to show that $M$ is “not too far away” from a normal matrix, and that it is a contraction in a norm that is “not too different” from the Euclidean norm.
Finally, I want to point out that if one chooses $\gamma=1+1/Q-2/\sqrt{Q}$ (or something very close to it, as in the lecture notes referenced below), $U^\dagger U$ can be singular or nearly singular. In that case the bound on $\Vert U^{-1}\Vert$ drastically worsens (at the critical value, $M$ is not even diagonalizable), but the asymptotic convergence rate, determined by $\rho(M)$, is still $1-\Theta(Q^{-0.5})$. I wonder whether this reflects an actual “failure mode” of the algorithm.
References
- Ioannis Mitliagkas. IFT 6085 Lecture 5: Accelerated Methods - Polyak's Momentum. http://mitliagkas.github.io/ift6085-2019/ift-6085-lecture-5-notes.pdf
- Yuxin Chen. Accelerated gradient methods. http://www.princeton.edu/~yc5/ele522_optimization/lectures/accelerated_gradient.pdf