SGD, Growth Condition, and Finite Sums
Recently, I came across this paper by Vaswani et al. and some other related work showing the linear convergence of SGD. The only additional assumption is, essentially, "zero training loss". This differs so much from the $\tilde\Theta(1/T)$ rate I learned in class that I decided it was worth writing a blog post about.
\[\newcommand{\E}{\mathbb{E}}\]

The Textbook Treatment of SGD
For simplicity, let’s consider the unconstrained minimization of $f(x)$ where $f$ is $\beta$-smooth and $\alpha$-strongly convex. Let $x^*$ denote the global minimizer, and $\kappa:=\frac{\beta}{\alpha}$ denote the condition number of the Hessian. The standard algorithm, gradient descent, is given by \(x_{t+1}\gets x_t -\eta_t \nabla f(x_t).\) It is well known that gradient descent converges linearly.
When $\eta_t=\frac{1}{\beta}$, \(\Vert x_{t}-x^*\Vert^2\le(1-\frac{1}{\kappa})^t\Vert x_{0}-x^*\Vert^2\). In other words, it takes $O\left(\kappa \log\frac{1}{\epsilon}\right)$ iterations to get $\epsilon$-close to $x^*$.
This can be proven by combining three facts, namely the update-rule expansion, strong convexity, and smoothness: \(\Vert x_{t+1}-x^*\Vert^2=\Vert x_{t}-x^*\Vert^2-2\eta_t \nabla f(x_t) ^T (x_t-x^*)+\eta_t^2\Vert \nabla f(x_t)\Vert^2,\) \(\nabla f(x_t) ^T (x_t-x^*)\ge f(x_t)-f(x^*)+\frac{\alpha}{2}\Vert x_t-x^*\Vert^2,\) \(\Vert \nabla f(x_t)\Vert^2\le 2\beta\left[f(x_t)-f(x^*)\right].\)
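As a quick sanity check, here is a minimal NumPy sketch of gradient descent on a toy diagonal quadratic (the instance is my own choice), verifying the claimed contraction numerically:

```python
import numpy as np

# Toy instance: f(x) = 0.5 * x^T A x with A diagonal, so x* = 0,
# alpha = smallest eigenvalue = 1, beta = largest eigenvalue = 10, kappa = 10.
d = 20
A = np.diag(np.linspace(1.0, 10.0, d))
alpha, beta = 1.0, 10.0
kappa = beta / alpha

rng = np.random.default_rng(0)
x0 = rng.normal(size=d)

x, eta = x0.copy(), 1.0 / beta
for t in range(1, 101):
    x -= eta * (A @ x)  # gradient step: grad f(x) = A x
    # check ||x_t - x*||^2 <= (1 - 1/kappa)^t * ||x_0 - x*||^2
    assert x @ x <= (1 - 1 / kappa) ** t * (x0 @ x0) + 1e-12
```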
On the other hand, the standard SGD formulation is \(x_{t+1}\gets x_t - \eta_t g_t,\) where $\eta_t$ is the learning rate for the $t$-th iteration, and $g_t$ is a stochastic gradient, i.e. a random vector with $E[g_t]=\nabla f(x_t)$. The textbook treatment is typically the following.
Suppose that $E\left[\Vert g_t\Vert^2\right]\le \rho^2$. When $\eta_t=\frac{1}{\alpha t}$, SGD converges at an $O(\frac{\rho^2\log T}{\alpha T})$ rate.
The proof is not complicated either. It follows from the update rule that
\[\Vert x_{t+1}-x^*\Vert^2=\Vert x_t-x^*\Vert^2 -2\eta_t g_t ^T (x_t-x^*)+\eta_t^2\Vert g_t\Vert^2.\]By strong convexity
\[\begin{align*} f(x_t)-f(x^*)&\le \nabla f(x_t)^T(x_t-x^*)-\frac{\alpha}{2}\Vert x_t-x^*\Vert^2 \\ &=\frac{1}{2\eta_t}E\left[\Vert x_t-x^*\Vert^2-\Vert x_{t+1}-x^*\Vert^2+\eta_t^2\Vert g_t\Vert^2\right]-\frac{\alpha}{2}\Vert x_t-x^*\Vert^2\\ &\le \frac{\eta_t \rho^2}{2}+E\left[\left(\frac{1}{2\eta_t}-\frac{\alpha}{2}\right)\Vert x_t-x^*\Vert^2 - \frac{1}{2\eta_t}\Vert x_{t+1}-x^*\Vert^2\right]. \end{align*}\]Let us set $\eta_t=\frac{1}{\alpha t}$. In that case $\frac{1}{2\eta_t}-\frac{\alpha}{2}=\frac{1}{2\eta_{t-1}}$. Therefore
\[\begin{align*} E\left[f\left(\frac{\sum_{t=1}^T x_t}{T}\right)-f(x^*)\right]&\le \frac{1}{T}\sum_{t=1}^T E\left[f(x_t)-f(x^*)\right]\\ &\le \frac{1}{T}\sum_{t=1}^T\frac{\eta_t\rho^2}{2}+\frac{1}{T}\sum_{t=1}^T\left(\frac{1}{2\eta_{t-1}}E\left[\Vert x_t-x^*\Vert^2\right]-\frac{1}{2\eta_{t}}E\left[\Vert x_{t+1}-x^*\Vert^2\right]\right)\\ &\le \frac{\rho^2(1+\ln T)}{2\alpha T}, \end{align*}\]where the first step is Jensen's inequality and the telescoping sum is non-positive because $\frac{1}{2\eta_0}=0$. Albeit simple, the analysis above is not improvable in general (see here for instance). With an elegant, unimprovable upper bound, the story of SGD should probably end here, right?
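Before answering, here is the textbook schedule in code, on a hypothetical noisy quadratic of my own (the additive Gaussian noise plays the role of $g_t-\nabla f(x_t)$):

```python
import numpy as np

# Textbook SGD: f(x) = 0.5 * x^T A x, gradient oracle g = A x + Gaussian noise,
# so E[g] = grad f(x) and E[||g||^2] stays bounded even at the optimum.
d = 20
A = np.diag(np.linspace(1.0, 10.0, d))
alpha = 1.0

rng = np.random.default_rng(0)
x = rng.normal(size=d)
T = 100_000
avg = np.zeros(d)
for t in range(1, T + 1):
    g = A @ x + rng.normal(size=d)  # unbiased stochastic gradient
    x -= g / (alpha * t)            # eta_t = 1 / (alpha * t)
    avg += x / T                    # average iterate, as in the bound above

print(0.5 * avg @ A @ avg)          # f(avg) - f(x*); shrinks roughly like log(T)/T
```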
Finite Sums
Probably not. To start with, at least in a large number (perhaps the majority?) of machine learning applications, SGD isn't actually about stochastic gradients; it's about finite sums. That is, many loss functions in machine learning can be written as \(f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x),\) where $n$ is huge. Typically $n$ is the number of data points, and the total loss is the average of a per-example loss over the data points. (Note: $x$ denotes the weights here.)
In these applications, people have access to a deterministic gradient oracle for each $f_i$, which is considered cheap. What people don't have is a simple way to compute $\nabla f(x)$ other than computing each $\nabla f_i(x)$ and taking the average. Therefore, in terms of oracle queries, the cost of gradient descent is no longer $O(\kappa \log \frac{1}{\epsilon})$, but $O\left(n\kappa \log\frac{1}{\epsilon}\right)$. Not only is the complexity high, gradient descent is also intuitively suboptimal: when computing $\nabla f(x)$, one passes through the entire dataset but makes no updates to $x$ until the pass is finished. Intuitively, the optimization can be much faster if one makes updates during the pass.
This leads precisely to the SGD that people usually talk about in ML contexts. The update rule is \(x_{t+1}\gets x_t-\eta_t\nabla f_{i_t}(x_t),\) where $i_t$ is chosen uniformly at random from $[n]$ at each iteration. Clearly $E[\nabla f_{i_t}(x_t)\mid x_t]=\nabla f(x_t)$, so we can apply the analysis above and get a $\tilde O(\frac{1}{\epsilon})$ bound. There is no (explicit) dependence on $n$, but the dependence on $\epsilon$ becomes much worse. How are finite sums really different?
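Before answering, here is this update rule in code, on a hypothetical least-squares finite sum (the constant step size is only for illustration):

```python
import numpy as np

# Finite sum: f(x) = (1/n) * sum_i f_i(x) with f_i(x) = (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))                 # row i is a_i
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # noisy labels

x = np.zeros(d)
eta = 1e-3                                  # constant step, for illustration only
for t in range(10 * n):                     # roughly 10 passes over the data
    i = rng.integers(n)                     # i_t uniform on [n]
    g = 2 * (A[i] @ x - b[i]) * A[i]        # grad f_{i_t}(x_t): one oracle call
    x -= eta * g
```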
SGD with Zero Training Loss
The answer lies in the assumption $E\left[\Vert g_t\Vert^2\right]\le \rho^2$. This seems reasonable when thinking of SGD as gradient descent with noise, but it may not be that reasonable when it comes to finite sums.
Suppose that $f(x)$ is a finite sum $f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i(x)$ is convex and $\beta$-smooth, and $f(x)$ is $\alpha$-strongly convex. Without loss of generality, assume that $\min_x f_i(x)=0$. Then we can say the following about the stochastic gradient:
$E\left[\Vert g_t\Vert^2\right]\le 2\beta f(x_t)$.
In other words, for finite-sum problems, the noise level decreases as the loss decreases. In particular, when $f(x^*)=0$, that is, when zero training loss is achievable, the gradient noise decreases to zero. Similar properties are called *growth conditions* in Vaswani's paper. With them, SGD converges linearly under a constant learning rate.
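Before stating the theorem, the noise bound itself is easy to check numerically. A short sketch on a hypothetical least-squares finite sum, where $f_i(x)=(a_i^Tx-b_i)^2$, so $\min_x f_i(x)=0$ and each $f_i$ is $2\Vert a_i\Vert^2$-smooth:

```python
import numpy as np

# Numerically check E[||g_t||^2] <= 2 * beta * f(x) for a finite sum of
# squared residuals, at a random point x.
rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
x = rng.normal(size=d)

beta = 2 * np.max(np.sum(A**2, axis=1))      # max_i smoothness of f_i
resid = A @ x - b
f_val = np.mean(resid**2)                    # f(x)
grads = 2 * resid[:, None] * A               # row i is grad f_i(x)
mean_sq = np.mean(np.sum(grads**2, axis=1))  # E over uniform i of ||grad f_i(x)||^2
assert mean_sq <= 2 * beta * f_val + 1e-9
```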
(Vaswani et al. Theorem 5) When $f(x^*)=0$, with a constant learning rate $\eta_t=\frac{1}{2\beta}$, \(E\left[\Vert x_{t}-x^*\Vert^2\right]\le(1-\frac{1}{2\kappa})^t\Vert x_{0}-x^*\Vert^2.\)
Let’s try to prove this, and something slightly more general.
Let us start from the unproven claim about gradient noise. By $\beta$-smoothness of each $f_i$,
\[\Vert \nabla f_i(x_t)\Vert^2 \le 2\beta \left[f_i(x_t)-\min_x f_i(x)\right]=2\beta f_i(x_t).\]It directly follows that
\[E\left[\Vert g_t\Vert^2\right]=\frac{1}{n}\sum_{i=1}^n \Vert \nabla f_i(x_t)\Vert^2\le 2\beta f(x_t).\]We will now try to plug this into the analysis of SGD. As above,
\[\Vert x_{t+1}-x^*\Vert^2=\Vert x_t-x^*\Vert^2 -2\eta_t g_t ^T (x_t-x^*)+\eta_t^2\Vert g_t\Vert^2.\]Therefore
\[\begin{align*} E\left[\Vert x_{t+1}-x^*\Vert^2\right]= \Vert x_t-x^*\Vert^2-2\eta_t \nabla f(x_t)^T(x_t-x^*)+\eta_t^2E\left[\Vert g_t\Vert^2\right]. \end{align*}\]By strong convexity, \(f(x^*)\ge f(x_t)+\nabla f(x_t)^T(x^*-x_t)+\frac{\alpha}{2}\Vert x_t-x^*\Vert^2.\) It follows that (plugging in $\eta_t=\frac{1}{2\beta}$ for the last equality) \(\begin{align*} E\left[\Vert x_{t+1}-x^*\Vert^2\right]&\le (1-\eta_t\alpha)\Vert x_t-x^*\Vert^2-2\eta_t(f(x_t)-f(x^*))+\eta_t^2E\left[\Vert g_t\Vert^2\right]\\ &\le (1-\eta_t\alpha)\Vert x_t-x^*\Vert^2-2\eta_t(f(x_t)-f(x^*))+2\beta\eta_t^2 f(x_t)\\ &=\left(1-\frac{1}{2\kappa}\right)\Vert x_t-x^*\Vert^2-\frac{1}{2\beta}f(x_t)+\frac{1}{\beta}f(x^*). \end{align*}\)
Therefore, as long as $f(x_t)\ge 2f(x^*)$,
\[E\left[\Vert x_{t+1}-x^*\Vert^2\right]\le \left(1-\frac{1}{2\kappa}\right)\Vert x_t-x^*\Vert^2.\]In other words, in the initial phase of optimization (when $f(x_t)\ge 2f(x^*)$), SGD actually converges linearly, almost at the same rate as full-batch GD. Note that essentially no additional assumption is needed for this. Once SGD enters the second phase, the observation that $E\left[\Vert g_t\Vert^2\right]\le 2\beta f(x_t)$ loses its power and becomes almost the same as a constant upper bound $E\left[\Vert g_t\Vert^2\right]\le \rho^2$.
When $f(x^*)=0$, there is no second phase, so SGD simply converges linearly to zero! This is precisely Theorem 5 in Vaswani's paper. Although \(f(x^*)=0\) is usually justified via over-parameterization, I think there is a small caveat here. If the problem is over-parameterized, $f(x)$ should have multiple global minimizers; in that case, strong convexity of $f(x)$, which is crucial for the linear convergence, is no longer a reasonable assumption.
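Caveats aside, the theorem is easy to see in action. Here is a minimal sketch on an interpolating least-squares problem, where $b=Ax^*$ exactly so that $f(x^*)=0$ (the instance and constants are my own choices, not from the paper):

```python
import numpy as np

# Interpolation: b = A x_star exactly, so f(x_star) = 0 and every
# f_i(x) = (a_i^T x - b_i)^2 attains its minimum value 0 at x_star.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star                           # zero training loss is achievable

beta = 2 * np.max(np.sum(A**2, axis=1))  # smoothness of the individual f_i
x = np.zeros(d)
eta = 1 / (2 * beta)                     # constant learning rate, as in the theorem
for t in range(20_000):
    i = rng.integers(n)
    x -= eta * 2 * (A[i] @ x - b[i]) * A[i]

print(np.linalg.norm(x - x_star))        # decays geometrically, despite constant eta
```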
A Unified View
We can now see that $f(x^*)$ is actually an important parameter for SGD. Why don’t we try to incorporate it into the complexity bound?
Consider again a finite-sum problem $f(x)=\frac{1}{n}\sum_{i=1}^n f_i(x)$, where each $f_i(x)$ is convex and $\beta$-smooth, and $f(x)$ is $\alpha$-strongly convex. Assume that $\min_x f_i(x)=0$. Suppose that we use the following learning rate schedule for SGD: \(\eta_t=\begin{cases} \frac{1}{2\beta},\quad (f(x_t)>2f(x^*))\\ \frac{1}{\alpha t},\quad (f(x_t)\le 2f(x^*)) \end{cases}.\) Because \(\begin{align*} \frac{\alpha}{2}\Vert x_0-x^*\Vert^2 \le &f(x_0)-f(x^*), &f(x_t)-f(x^*)\le \frac{\beta}{2}\Vert x_t-x^*\Vert^2, \end{align*}\) the factor by which the squared distance needs to shrink in the first phase is at most \(\frac{\Vert x_0-x^*\Vert^2}{\Vert x_t-x^*\Vert^2}\le \frac{\kappa (f(x_0)-f(x^*))}{2f(x^*)-f(x^*)}\le \frac{\kappa f(x_0)}{f(x^*)}.\) Therefore, the length of the first phase is \(O\left(\kappa\log\frac{\kappa f(x_0)}{f(x^*)}\right)\). (Let us ignore the dependence on probability here.) Meanwhile, assume that in the second phase \(f(x_t)\) never exceeds \(cf(x^*)\) for some constant $c$; then \(E\left[\Vert g_t\Vert^2 \right]\le 2\beta cf(x^*)\). Using the analysis at the start, we can show that the length of the second phase is $O\left(\frac{\kappa f(x^*)}{\epsilon}\right)$. So the total number of iterations needed for \(E\left[f(x_t)-f(x^*)\right]<\epsilon\) is
\[O\left(\kappa\log\frac{\kappa f(x_0)}{\max\{\epsilon,f(x^*)\}}+\frac{\kappa f(x^*)}{\epsilon}\right).\]We may compare this to the bound before, which is (substituting $\rho^2$ with $2\beta f(x_0)$)
\[O\left(\frac{\kappa f(x_0)}{\epsilon}\right).\]We can see that the asymptotic dependence on $\epsilon$ is the same. However, when \(\epsilon/f(x^*)\) is not too small, the first term may be dominant; in that case, SGD is faster than you might think.
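The schedule above is not directly implementable, since it compares $f(x_t)$ against the unknown $f(x^*)$. Purely as an illustration, here is a sketch that cheats by computing $f(x^*)$ in closed form; evaluating $f(x_t)$ at every step is also far too expensive in practice, and the phase-two step size is capped at $\frac{1}{2\beta}$ to avoid unstable early steps (a deviation from the schedule as stated):

```python
import numpy as np

# Two-phase schedule on least squares with small label noise, so f(x*) > 0.
rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.05 * rng.normal(size=n)

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
f = lambda z: np.mean((A @ z - b) ** 2)
f_star = f(x_star)                           # the "cheat": f(x*) is known here

alpha = 2 * np.min(np.linalg.eigvalsh(A.T @ A / n))  # strong convexity of f
beta = 2 * np.max(np.sum(A**2, axis=1))              # smoothness of each f_i

x, t2 = np.zeros(d), 0                       # t2 counts phase-two steps
for t in range(1, 20_001):
    if f(x) > 2 * f_star:                    # phase one: constant step
        eta = 1 / (2 * beta)
    else:                                    # phase two: decaying step,
        t2 += 1                              # capped at 1/(2 beta) for stability
        eta = min(1 / (2 * beta), 1 / (alpha * t2))
    i = rng.integers(n)
    x -= eta * 2 * (A[i] @ x - b[i]) * A[i]

print(f(x) - f_star)
```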