It is not yet clear why ADAM-alike adaptive gradient algorithms suffer from worse generalization performance than SGD despite their faster training speed. This work aims to provide an understanding of this generalization gap by analyzing the local convergence behaviors of these algorithms. Specifically, we observe heavy tails in the gradient noise of these algorithms. This motivates us to analyze them through their Lévy-driven stochastic differential equations (SDEs), since an algorithm and its SDE share similar convergence behaviors. We then establish the escaping time of these SDEs from a local basin. The results show that (1) the escaping time of both SGD and ADAM~depends positively on the Radon measure of the basin and negatively on the heaviness of the gradient noise; (2) for the same basin, SGD enjoys a smaller escaping time than ADAM, mainly because (a) the geometry adaptation in ADAM, which adaptively scales each gradient coordinate, diminishes the anisotropic structure in the gradient noise and results in a larger Radon measure of the basin, and (b) the exponential gradient averaging in ADAM~smooths its gradient and leads to lighter gradient-noise tails than SGD. Consequently, SGD is more locally unstable than ADAM~at sharp minima, defined as minima whose local basins have small Radon measure, and can better escape from them to flatter minima with larger Radon measure. As the flat minima here, which often refer to minima lying in flat or asymmetric basins/valleys, tend to generalize better than sharp ones~\cite{keskar2016large,he2019asymmetric}, our results explain the better generalization performance of SGD over ADAM. Finally, experimental results confirm our heavy-tailed gradient-noise assumption and corroborate our theoretical results.
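To make the Lévy-driven SDE framework referenced above concrete, the following is a minimal schematic of such an SDE; the symbols $F$, $\varepsilon$, $\boldsymbol{\Sigma}_t$ and $L_t^{\alpha}$ are generic placeholders rather than the exact processes analyzed in this work:
\begin{equation*}
  \mathrm{d}\boldsymbol{\theta}_t \;=\; -\nabla F(\boldsymbol{\theta}_t)\,\mathrm{d}t \;+\; \varepsilon\,\boldsymbol{\Sigma}_t\,\mathrm{d}L_t^{\alpha},
\end{equation*}
where $F$ denotes the training objective, $\varepsilon$ is a small noise scale tied to the learning rate, $\boldsymbol{\Sigma}_t$ captures the (possibly anisotropic) structure of the gradient noise, and $L_t^{\alpha}$ is an $\alpha$-stable Lévy process whose tail index $\alpha \in (0,2]$ quantifies the heaviness of the noise: smaller $\alpha$ corresponds to heavier tails and hence, in this framework, faster escape from a local basin.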