Lock-Free Optimization for Non-Convex Problems

Shen-Yi Zhao, Gong-Duo Zhang and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy, zhanggd}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of LF-PSGD methods for non-convex problems. In this paper, we provide theoretical proofs of the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.
Introduction

Many machine learning models can be formulated as the following optimization problem:

    \min_{w} \frac{1}{n} \sum_{i=1}^{n} f_i(w),    (1)

where w is the parameter to learn (optimize), n is the number of training instances, and f_i(w) is the loss defined on instance i. For example, assuming we are given a set of labeled instances \{(x_i, y_i) \mid i = 1, 2, \ldots, n\}, where x_i \in \mathbb{R}^d is the feature vector and y_i \in \{1, -1\} is the label of x_i, f_i(w) can be \log(1 + e^{-y_i x_i^T w}) + \frac{\lambda}{2}\|w\|^2, which is known as the regularized loss in logistic regression (LR). We can also take f_i(w) to be \max\{0, 1 - y_i x_i^T w\} + \frac{\lambda}{2}\|w\|^2, which is known as the regularized loss in support vector machine (SVM). Here, \lambda is the regularization hyper-parameter. Moreover, many other machine learning models, including neural networks (Krizhevsky, Sutskever, and Hinton 2012), matrix factorization (Koren, Bell, and Volinsky 2009), and principal component analysis (PCA) (Shamir 2015), can also be formulated as that in (1).

When the problem in (1) is large-scale, i.e., n is large, researchers have recently proposed stochastic gradient descent (SGD) and its variants, like SVRG (Johnson and Zhang 2013), to solve it. Many works (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013) have found that SGD-based methods can achieve promising performance in large-scale learning problems. According to the implementation platforms or systems, existing SGD-based methods can be divided into three categories: sequential SGD (SSGD) methods, parallel SGD (PSGD) methods, and distributed SGD (DSGD) methods. SSGD methods are designed for a single thread on a single machine, PSGD methods are designed for multi-core (multi-thread) settings on a single machine with a shared memory¹, and DSGD methods are designed for multiple machines.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
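As a concrete illustration of problem (1), the following sketch implements the regularized logistic-regression loss from the text and runs plain sequential SGD on a tiny toy dataset. This is only an illustrative sketch: the function names, hyper-parameter values, and data are our own choices, not taken from the paper.

```python
import numpy as np

def f_i(w, x_i, y_i, lam=0.1):
    # Per-instance loss from the text:
    # log(1 + exp(-y_i * x_i^T w)) + (lam / 2) * ||w||^2
    return np.log1p(np.exp(-y_i * (x_i @ w))) + 0.5 * lam * (w @ w)

def grad_f_i(w, x_i, y_i, lam=0.1):
    # Gradient of f_i with respect to w.
    coeff = -y_i / (1.0 + np.exp(y_i * (x_i @ w)))
    return coeff * x_i + lam * w

def sgd(X, y, lr=0.1, epochs=50, lam=0.1, seed=0):
    # Sequential SGD on min_w (1/n) * sum_i f_i(w), as in (1):
    # each step uses the gradient of a single randomly chosen instance.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            w -= lr * grad_f_i(w, X[i], y[i], lam)
    return w

# Toy linearly separable data, purely illustrative.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd(X, y)
```

Swapping the hinge loss \max\{0, 1 - y_i x_i^T w\} (with its subgradient) into the same loop would give the SVM objective instead.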
When the problem in (1) is convex, the SGD methods, including SSGD (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013), PSGD (Recht et al. 2011) and DSGD (Jaggi et al. 2014; Li et al. 2014; Xing et al. 2015; Zhang, Zheng, and Kwok 2016), have achieved very promising empirical performance. Furthermore, good theoretical results about the convergence of these SGD methods are also provided by these existing works.

In many real applications, the problems to optimize can be non-convex. For example, the problems for neural networks are typically non-convex. Because many researchers (Li et al. 2014; Xing et al. 2015) have found that SGD methods can also achieve good empirical results for non-convex problems, theoretical proof of the convergence of SGD methods for non-convex problems has recently attracted much attention. Some progress has been achieved. For example, the works in (Ghadimi and Lan 2013; Reddi et al. 2016; Li et al. 2016; Allen-Zhu and Hazan 2016; Allen-Zhu and Yuan 2016) have proved the convergence of sequential SGD and its variants for non-convex problems. There are also some other theoretical results for particular non-convex problems, such as PCA (Shamir 2015; 2016a; 2016b) and matrix factorization (Sa, Re, and Olukotun 2015). But all these works are only for SSGD methods. Only two works (Lian et al. 2015; Huo and Huang 2016) have proposed PSGD methods for non-convex problems with theoretical proofs of convergence.

¹ In some literature, PSGD refers to methods implemented on both multi-core and multi-machine systems. In this paper, PSGD only refers to methods implemented on multi-core systems with a shared memory.
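To make the lock-free idea concrete, here is a minimal, hypothetical sketch of a Hogwild!-style update loop: several threads share one parameter vector and each performs read-modify-write SGD steps on it with no lock at all, tolerating occasional overwrites between threads. This only illustrates the access pattern (Python threads do not give true parallel speedups; practical Hogwild! implementations use native threads), and all names and constants below are our own choices.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, lr=0.05, epochs=20, lam=0.1):
    # Lock-free parallel SGD sketch: the parameter vector w lives in
    # shared memory and every thread updates it WITHOUT any lock.
    n, d = X.shape
    w = np.zeros(d)  # shared state, accessed lock-free by all threads

    def grad(i):
        # Regularized logistic-loss gradient on instance i (reads shared w).
        coeff = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ w)))
        return coeff * X[i] + lam * w

    def worker(seed):
        rng = np.random.default_rng(seed)
        for _ in range(epochs):
            for i in rng.permutation(n):
                w[:] -= lr * grad(i)  # in-place update, no lock taken

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

# Toy data, purely illustrative.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = hogwild_sgd(X, y)
```

The key design point the sketch shows is that correctness is not enforced by synchronization: threads may read a w that another thread is mid-update on, and the convergence analyses discussed in this paper bound the effect of such inconsistent reads.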