Lock-Free Optimization for Non-Convex Problems

Shen-Yi Zhao, Gong-Duo Zhang and Wu-Jun Li
National Key Laboratory for Novel Software Technology
Department of Computer Science and Technology, Nanjing University, China
{zhaosy, zhanggd}@lamda.nju.edu.cn, liwujun@nju.edu.cn

Abstract

Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof of the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.
Introduction

Many machine learning models can be formulated as the following optimization problem:

    min_w (1/n) ∑_{i=1}^{n} f_i(w),    (1)

where w is the parameter to learn (optimize), n is the number of training instances, and f_i(w) is the loss defined on instance i. For example, assuming we are given a set of labeled instances {(x_i, y_i) | i = 1, 2, ..., n}, where x_i ∈ R^d is the feature vector and y_i ∈ {1, −1} is the label of x_i, f_i(w) can be log(1 + e^{−y_i x_i^T w}) + (λ/2)‖w‖², which is known as the regularized loss in logistic regression (LR). We can also take f_i(w) to be max{0, 1 − y_i x_i^T w} + (λ/2)‖w‖², which is known as the regularized loss in support vector machine (SVM). Here, λ is the regularization hyper-parameter. Moreover, many other machine learning models, including neural networks (Krizhevsky, Sutskever, and Hinton 2012), matrix factorization (Koren, Bell, and Volinsky 2009), and principal component analysis (PCA) (Shamir 2015), can also be formulated as (1).

When the problem in (1) is large-scale, i.e., when n is large, researchers have recently proposed stochastic gradient descent (SGD) and its variants, such as SVRG (Johnson and Zhang 2013), to solve it. Many works (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013) have found that SGD-based methods can achieve promising performance in large-scale learning problems. According to the implementation platforms or systems, existing SGD-based methods can be divided into three categories: sequential SGD (SSGD) methods, parallel SGD (PSGD) methods, and distributed SGD (DSGD) methods. SSGD methods are designed for a single thread on a single machine, PSGD methods are designed for multi-core (multi-thread) execution on a single machine with a shared memory¹, and DSGD methods are designed for multiple machines.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
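As an illustration of problem (1), the sketch below runs plain sequential SGD on the ℓ2-regularized logistic-regression loss f_i(w) = log(1 + e^{−y_i x_i^T w}) + (λ/2)‖w‖² defined above. This is our own minimal example, not code from the paper; the function names, step size, and toy data are illustrative choices.

```python
import numpy as np

def logistic_loss_grad(w, x, y, lam):
    """Gradient of f_i(w) = log(1 + exp(-y * x^T w)) + (lam/2) * ||w||^2."""
    z = -y * x.dot(w)
    sig = 1.0 / (1.0 + np.exp(-z))  # sigmoid(z)
    return -y * sig * x + lam * w

def sgd(X, y, lam=0.01, eta=0.1, epochs=20, seed=0):
    """Sequential SGD: one stochastic gradient step per sampled instance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):  # one pass over shuffled instances
            w -= eta * logistic_loss_grad(w, X[i], y[i], lam)
    return w

# Toy data: labels in {-1, +1} determined by the sign of the first feature.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = np.sign(X[:, 0])
w = sgd(X, y)
acc = np.mean(np.sign(X.dot(w)) == y)  # training accuracy of the learned w
```

Since the toy labels depend only on the first feature, the learned w should put most of its weight on that coordinate and separate the training set almost perfectly.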
When the problem in (1) is convex, the SGD methods, including SSGD (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013), PSGD (Recht et al. 2011), and DSGD (Jaggi et al. 2014; Li et al. 2014; Xing et al. 2015; Zhang, Zheng, and Kwok 2016), have achieved very promising empirical performance. Furthermore, good theoretical results about the convergence of these SGD methods are also provided by the existing works.

In many real applications, the problems to optimize can be non-convex. For example, the problems for neural networks are typically non-convex. Because many researchers (Li et al. 2014; Xing et al. 2015) have found that SGD methods can also achieve good empirical results for non-convex problems, theoretical proof of the convergence of SGD methods for non-convex problems has recently attracted much attention. Some progress has been achieved. For example, the works in (Ghadimi and Lan 2013; Reddi et al. 2016; Li et al. 2016; Allen-Zhu and Hazan 2016; Allen-Zhu and Yuan 2016) have proved the convergence of sequential SGD and its variants for non-convex problems. There are also some other theoretical results for particular non-convex problems, like PCA (Shamir 2015; 2016a; 2016b) and matrix factorization (Sa, Re, and Olukotun 2015). But all these works are only for SSGD methods. Only two works (Lian et al. 2015; Huo and Huang 2016) have proposed PSGD methods for non-convex problems with theoretical proof of convergence.

¹ In some literature, PSGD refers to methods implemented on both multi-core and multi-machine systems. In this paper, PSGD only refers to methods implemented on multi-core systems with a shared memory.
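The lock-free strategy behind Hogwild!-style PSGD can be sketched as follows: several threads share the parameter vector w and each applies SGD updates to it without taking any lock. This is our own illustrative code, not the authors' implementation; in a real Hogwild! system each thread updates only the (sparse) coordinates its instance touches, and here Python's GIL means the sketch demonstrates the lock-free update pattern rather than a true parallel speedup.

```python
import threading
import numpy as np

def hogwild_sgd(X, y, lam=0.01, eta=0.05, n_threads=4,
                updates_per_thread=2000, seed=0):
    """Hogwild!-style lock-free SGD sketch: every thread reads and writes
    the shared parameter vector w with no synchronization at all."""
    n, d = X.shape
    w = np.zeros(d)  # shared state, updated concurrently by all threads

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(updates_per_thread):
            i = rng.integers(n)
            xi, yi = X[i], y[i]
            sig = 1.0 / (1.0 + np.exp(yi * xi.dot(w)))  # sigmoid(-y x^T w)
            # In-place update of the shared array -- no lock is held.
            w[:] -= eta * (-yi * sig * xi + lam * w)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
y = np.sign(X[:, 0])
w = hogwild_sgd(X, y)
acc = np.mean(np.sign(X.dot(w)) == y)
```

Because no thread ever waits for another, stale reads and overwritten updates can occur; the point of the convergence analyses discussed above is that, under suitable assumptions, such unsynchronized updates still converge.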