On the different regimes of Stochastic Gradient Descent
Antonio Sclocchi (EPFL Lausanne)
The success of modern deep learning relies on the way neural networks are trained, which consists of optimizing a high-dimensional loss landscape. This is done with the Stochastic Gradient Descent (SGD) algorithm, where the loss gradients are estimated on a small batch of the data at each time step. The choice of the batch size and the step size (or learning rate) is known to be important for achieving good performance in real applications, but it remains poorly understood theoretically and relies heavily on expensive grid-search procedures. In this work, we clarify how the batch size and the learning rate affect the training dynamics of neural networks, leading to a phase diagram with three distinct dynamical regimes: (i) a noise-dominated phase, where SGD is described by a stochastic process, (ii) a large-first-step-dominated phase, and (iii) a phase where it is equivalent to simple Gradient Descent (GD). We obtain these results in a teacher-student perceptron model and show empirically that our predictions still apply to deep networks on benchmark tasks, such as image classification. Our results lead to new predictions on how the size of the training dataset and the hardness of the task affect the training dynamics, and they open the way to understanding its relationship with neural network performance.
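As a minimal illustration of the setup the abstract describes, the sketch below trains a linear student on labels produced by a random teacher using plain SGD. This is an illustrative toy (squared loss, Gaussian inputs), not the paper's exact model; the batch size and learning rate are the two knobs whose interplay determines which dynamical regime the training falls into.

```python
import random

random.seed(0)

# Toy teacher-student setup (illustrative assumption, not the paper's model):
# a teacher vector w* generates labels y = sign(w* . x); a linear student w
# is trained with SGD on the squared loss.
d = 20                                   # input dimension
n = 500                                  # number of training samples
teacher = [random.gauss(0, 1) for _ in range(d)]
data = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
labels = [1.0 if sum(t * x for t, x in zip(teacher, xv)) > 0 else -1.0
          for xv in data]

def mse(w):
    """Mean squared error of the student w over the whole dataset."""
    return sum((sum(wj * xj for wj, xj in zip(w, xv)) - yv) ** 2
               for xv, yv in zip(data, labels)) / n

def sgd(batch_size, lr, steps=200):
    """Run SGD for a given batch size and learning rate; return final MSE."""
    w = [0.0] * d
    for _ in range(steps):
        batch = random.sample(range(n), batch_size)   # mini-batch of indices
        grad = [0.0] * d
        for i in batch:
            err = sum(wj * xj for wj, xj in zip(w, data[i])) - labels[i]
            for j in range(d):
                grad[j] += err * data[i][j] / batch_size
        w = [wj - lr * gj for wj, gj in zip(w, grad)]  # gradient step
    return mse(w)

# Small batches make the gradient estimate noisy (noise-dominated regime);
# batch_size = n recovers full-batch Gradient Descent.
loss_small = sgd(batch_size=1, lr=0.05)
loss_full = sgd(batch_size=n, lr=0.05)
```

Varying `batch_size` and `lr` over a grid and tracking the loss trajectory is the kind of experiment that reveals the phase diagram discussed in the talk.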
This seminar will also be available on Zoom:
Meeting ID: 936 6839 7465