Non-asymptotic Analysis of Momentum for Quadratic Functions

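Polyak's momentum (also known as the heavy-ball method) is one of the most popular optimization techniques. When minimizing $f(x)$ with step size $\eta>0$ and momentum parameter $\gamma\in[0,1)$, gradient descent with momentum is simply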

\[x_{t+1}\gets x_t - \eta\nabla f(x_t) + \gamma(x_t-x_{t-1}).\]

It is well known that when $f$ is quadratic, i.e. when $f(x)=\frac{1}{2}x^TAx-b^Tx$, gradient descent with momentum converges linearly, provided that $A$ is positive definite and that $\eta$ and $\gamma$ are chosen appropriately. In this case the algorithm is a linear fixed-point iteration, so one only needs to analyze the spectrum of the iteration matrix. For instance, see the analysis in this and this lecture note.
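To make the fixed-point view concrete, here is a minimal NumPy sketch. The test problem (a diagonal $A$ with eigenvalues 1, 4, 25), the helper names `heavy_ball` and `iteration_matrix`, and the choice of the classical tuning $\eta = 4/(\sqrt{L}+\sqrt{\mu})^2$, $\gamma = \big((\sqrt{L}-\sqrt{\mu})/(\sqrt{L}+\sqrt{\mu})\big)^2$ are all assumptions made for illustration, not something taken from the post itself. Stacking $(x_t - x^\ast,\, x_{t-1} - x^\ast)$ turns the update into multiplication by a $2d\times 2d$ matrix, and its spectral radius governs the asymptotic rate.

```python
import numpy as np


def heavy_ball(A, b, eta, gamma, num_steps):
    """Run x_{t+1} = x_t - eta * (A x_t - b) + gamma * (x_t - x_{t-1}) from x_0 = x_{-1} = 0."""
    x_prev = np.zeros_like(b)
    x = np.zeros_like(b)
    for _ in range(num_steps):
        grad = A @ x - b  # gradient of 0.5 * x^T A x - b^T x
        x, x_prev = x - eta * grad + gamma * (x - x_prev), x
    return x


def iteration_matrix(A, eta, gamma):
    """Matrix mapping (x_t - x*, x_{t-1} - x*) to (x_{t+1} - x*, x_t - x*)."""
    d = A.shape[0]
    I = np.eye(d)
    return np.block([[(1 + gamma) * I - eta * A, -gamma * I],
                     [I, np.zeros((d, d))]])


# Hypothetical test problem: smallest eigenvalue mu = 1, largest L = 25.
A = np.diag([1.0, 4.0, 25.0])
b = np.array([1.0, 2.0, 3.0])
mu, L = 1.0, 25.0

# Classical tuning (an assumption of this sketch, not a claim about the post).
eta = 4.0 / (np.sqrt(L) + np.sqrt(mu)) ** 2
gamma = ((np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))) ** 2

# Spectral radius of the iteration matrix; with this tuning it equals sqrt(gamma).
rho = np.max(np.abs(np.linalg.eigvals(iteration_matrix(A, eta, gamma))))

x_star = np.linalg.solve(A, b)
error = np.linalg.norm(heavy_ball(A, b, eta, gamma, 200) - x_star)
print(f"spectral radius = {rho:.6f}, sqrt(gamma) = {np.sqrt(gamma):.6f}, error = {error:.2e}")
```

Note that the spectral radius only describes the asymptotic behaviour: the iteration matrix is not symmetric, so the error need not shrink by a factor of `rho` at every single step, which is exactly why a non-asymptotic analysis is interesting.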


Apr 29, 2020

SGD, Growth Condition, and Finite Sums

Recently, I came across this paper by Vaswani et al. and some other related work showing the linear convergence of SGD. The only additional assumption is, essentially, “zero training loss”. This differs so much from the $\tilde\Theta(1/T)$ rate I learnt in class that I decided it was worth writing a blog post about.
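As a quick sanity check of the "zero training loss" regime, here is a small NumPy sketch on a consistent least-squares problem, where a single point minimizes every per-sample loss simultaneously. The problem sizes, the random seed, the helper name `sgd_least_squares`, and the constant step size $1/\max_i\|a_i\|^2$ are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np


def sgd_least_squares(A, y, eta, num_steps, rng):
    """Constant-step SGD on f(x) = (1/n) * sum_i 0.5 * (a_i^T x - y_i)^2, started from 0."""
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_steps):
        i = rng.integers(n)          # sample one data point uniformly at random
        residual = A[i] @ x - y[i]   # scalar residual of the sampled point
        x = x - eta * residual * A[i]  # stochastic gradient step
    return x


rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
y = A @ x_true  # no noise, so x_true drives every per-sample loss to zero

# Step size 1 / (largest per-sample smoothness constant); an assumption of this sketch.
eta = 1.0 / np.max(np.sum(A ** 2, axis=1))

x_hat = sgd_least_squares(A, y, eta, num_steps=5000, rng=rng)
sgd_error = np.linalg.norm(x_hat - x_true)
print(f"distance to the interpolating solution: {sgd_error:.2e}")
```

Because the stochastic gradients all vanish at `x_true`, the gradient noise shrinks as the iterate approaches it, so a constant step size suffices. If noise were added to `y`, the same loop would stall at a noise floor unless the step size were decreased.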


Sep 06, 2019