
Course teaching resources for "Artificial Intelligence, Machine Learning and Big Data" (reference reading): Robust Frequent Directions with Application in Online Learning



Journal of Machine Learning Research 20 (2019) 1-41. Submitted 12/17; Revised 8/18; Published 2/19.

Robust Frequent Directions with Application in Online Learning

Luo Luo  RICKY@SJTU.EDU.CN
Cheng Chen  JACK_CHEN1990@SJTU.EDU.CN
Department of Computer Science and Engineering
Shanghai Jiao Tong University
800 Dongchuan Road, Shanghai, China 200240

Zhihua Zhang*  ZHZHANG@MATH.PKU.EDU.CN
National Engineering Lab for Big Data Analysis and Applications
School of Mathematical Sciences
Peking University
5 Yiheyuan Road, Beijing, China 100871

Wu-Jun Li  LIWUJUN@NJU.EDU.CN
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology
Nanjing University
163 Xianlin Avenue, Nanjing, China 210023

Tong Zhang  TONGZHANG@TONGZHANG-ML.ORG
Computer Science & Mathematics
Hong Kong University of Science and Technology
Hong Kong

Editor: Qiang Liu

Abstract

The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank-deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.

Keywords: Matrix approximation, sketching, frequent directions, online convex optimization, online Newton algorithm

1. Introduction

The sketching technique is a powerful tool to deal with large scale matrices (Ghashami et al., 2016; Halko et al., 2011; Woodruff, 2014), and it has been widely used to speed up machine learning

*. Corresponding author.
©2019 Luo Luo, Cheng Chen, Zhihua Zhang, Wu-Jun Li and Tong Zhang. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-773.html.


algorithms such as second order optimization algorithms (Erdogdu and Montanari, 2015; Luo et al., 2016; Pilanci and Wainwright, 2017; Roosta-Khorasani and Mahoney, 2016a,b; Xu et al., 2016; Ye et al., 2017). There exist several families of matrix sketching strategies, including sparsification, column sampling, random projection (Achlioptas, 2003; Indyk and Motwani, 1998; Kane and Nelson, 2014; Wang et al., 2016), and frequent directions (FD) (Desai et al., 2016; Ghashami et al., 2016; Huang, 2018; Liberty, 2013; Mroueh et al., 2017; Ye et al., 2016).

Sparsification techniques (Achlioptas and McSherry, 2007; Achlioptas et al., 2013; Arora et al., 2006; Drineas and Zouzias, 2011) generate a sparse version of the matrix by element-wise sampling, which makes matrix multiplication more efficient and reduces storage. Column (row) sampling algorithms (Mahoney, 2011) include importance sampling (Drineas et al., 2006a,b; Frieze et al., 2004) and leverage score sampling (Drineas et al., 2012, 2008; Papailiopoulos et al., 2014); they assign a probability to each row (column) and sample a subset according to these probabilities to construct the estimate. Random projection maps the rows (columns) of the matrix into a lower dimensional space by a projection matrix. The projection matrix can be constructed in various ways (Woodruff, 2014), such as Gaussian random projections (Johnson and Lindenstrauss, 1984; Sarlos, 2006), fast Johnson-Lindenstrauss transforms (Ailon and Chazelle, 2006; Ailon and Liberty, 2009, 2013; Kane and Nelson, 2014) and sparse random projections (Clarkson and Woodruff, 2013; Nelson and Nguyen, 2013). Frequent directions (Ghashami et al., 2016; Liberty, 2013) is a deterministic sketching algorithm and achieves an optimal tradeoff between approximation error and space.

In this paper we are especially concerned with FD sketching (Desai et al., 2016; Ghashami et al., 2016; Liberty, 2013), because it is a stable online sketching approach. FD considers matrix approximation in the streaming setting, where the data arrive in a sequential order and should be processed immediately; typically, streaming algorithms can only use limited memory at any time. The FD algorithm extends the method of frequent items (Misra and Gries, 1982) to matrix approximation and has a tight approximation error bound. However, FD usually leads to a rank-deficient approximation, which in turn makes its applications less robust. For example, Newton-type algorithms require a non-singular and well-conditioned approximate Hessian matrix, but FD sketching usually generates low-rank matrices. An intuitive and simple way to bridge this gap is to introduce a regularization term to enforce the matrix to be invertible (Luo et al., 2016; Roosta-Khorasani and Mahoney, 2016a,b). Typically, the regularization parameter is regarded as a hyperparameter and its choice is separable from the sketching procedure. Since the regularization parameter affects the performance heavily in practice, it should be chosen carefully.

To overcome the weakness of the FD algorithm, we propose a new sketching approach that we call robust frequent directions (RFD). Unlike conventional sketching methods, which only approximate the matrix with a low-rank structure, RFD constructs the low-rank part and updates the regularization term simultaneously. In particular, the update procedure of RFD can be regarded as solving an optimization problem (see Theorem 2).
This method is different from the standard FD, giving rise to a tighter error bound.

Note that Zhang (2014) proposed matrix ridge approximation (MRA) to approximate a positive semi-definite matrix using an idea similar to RFD. There are two main differences between RFD and MRA. First, RFD is designed for the case that data samples come sequentially and memory is limited, while MRA has to access the whole data set. Second, MRA aims to minimize the approximation error with respect to the Frobenius norm, while RFD tries to minimize the spectral-norm approximation error. In general, the spectral norm error bound is more meaningful than the Frobenius norm error bound (Tropp, 2015).


In a recent study, Luo et al. (2016) proposed an FD-based sketched online Newton (SON) algorithm (FD-SON) to accelerate the standard online Newton algorithms. Owing to the shortcoming of FD, the performance of FD-SON is significantly affected by the choice of the hyperparameter. Naturally, we can leverage RFD to improve online Newton algorithms. Accordingly, we propose a sketched online Newton step based on RFD (RFD-SON). Different from conventional sketched Newton algorithms, RFD-SON is hyperparameter-free. Setting the regularization parameter to be zero initially, RFD-SON will adaptively increase the regularization term, and the approximate Hessian will be well-conditioned after a few iterations. Moreover, we prove that RFD-SON has a more robust regret bound than FD-SON, and the experimental results also validate the better performance of RFD-SON.

The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we review the background of second order online learning and its sketched variants. In Sections 4 and 5 we propose our robust frequent directions (RFD) method and its applications in online learning, with the related theoretical analysis. In Section 6 we demonstrate empirical comparisons with baselines on several real-world data sets to show the superiority of our algorithms. Finally, we conclude our work in Section 7.

2. Notation and Preliminaries

We let $I_d$ denote the $d \times d$ identity matrix. For a matrix $A = [A_{ij}] \in \mathbb{R}^{n \times d}$ of rank $r$, where $r \le \min(n, d)$, we let the condensed singular value decomposition (SVD) of $A$ be $A = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ are column orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ places the nonzero singular values on its diagonal entries. We use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest non-zero singular value. Thus, the condition number of $A$ is $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. The matrix pseudoinverse of $A$ is defined by $A^\dagger = V \Sigma^{-1} U^\top \in \mathbb{R}^{d \times n}$.

Additionally, we let $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_{i=1}^r \sigma_i^2}$ be the Frobenius norm and $\|A\|_2 = \sigma_{\max}(A)$ be the spectral norm. A matrix norm $|||\cdot|||$ is said to be unitarily invariant if $|||PAQ||| = |||A|||$ for any unitary matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{d \times d}$. It is easy to verify that both the Frobenius norm and the spectral norm are unitarily invariant. We define $[A]_k = U_k \Sigma_k V_k^\top$ for $k \le r$, where $U_k \in \mathbb{R}^{n \times k}$ and $V_k \in \mathbb{R}^{d \times k}$ are the first $k$ columns of $U$ and $V$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k) \in \mathbb{R}^{k \times k}$. Then $[A]_k$ is the best rank-$k$ approximation to $A$ in both the Frobenius and spectral norms, that is,

$$[A]_k = \mathop{\mathrm{argmin}}_{\mathrm{rank}(\widehat{A}) \le k} \|A - \widehat{A}\|_F = \mathop{\mathrm{argmin}}_{\mathrm{rank}(\widehat{A}) \le k} \|A - \widehat{A}\|_2.$$

Given a positive semidefinite matrix $H \in \mathbb{R}^{d \times d}$, the notation $\|x\|_H$ is called the $H$-norm of the vector $x \in \mathbb{R}^d$, that is, $\|x\|_H = \sqrt{x^\top H x}$. If matrices $A$ and $B$ have the same size, we let $\langle A, B \rangle$ denote $\sum_{i,j} A_{ij} B_{ij}$.

2.1. Frequent Directions

We give a brief review of frequent directions (FD) (Ghashami et al., 2016; Liberty, 2013), because it is closely related to our proposed method. FD is a deterministic matrix sketching method in the row-update model. For any input matrix $A \in \mathbb{R}^{T \times d}$ whose rows come sequentially, it maintains a sketch matrix $B \in \mathbb{R}^{(m-1) \times d}$ with $m \ll T$ to approximate $A^\top A$ by $B^\top B$.
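To make the notation above concrete before turning to Algorithm 1, the following short NumPy snippet (our illustration; the paper itself contains no code) computes the best rank-$k$ approximation $[A]_k$ from the truncated SVD and checks that its spectral-norm error equals $\sigma_{k+1}(A)$:

```python
import numpy as np

def best_rank_k(A, k):
    """Best rank-k approximation [A]_k via the truncated SVD; it is optimal
    in both the Frobenius norm and the spectral norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
k = 5
Ak = best_rank_k(A, k)
s = np.linalg.svd(A, compute_uv=False)
# The spectral-norm error of [A]_k equals the (k+1)-th singular value of A.
print(np.linalg.norm(A - Ak, 2), s[k])
```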


We present the detailed implementation of FD in Algorithm 1. The intuition behind FD is similar to that of frequent items. FD periodically shrinks the orthogonal vectors by roughly the same amount (Line 5 of Algorithm 1). The shrinking step reduces the squared Frobenius norm of the sketch by a reasonable amount and guarantees that no direction is reduced too much.

Algorithm 1 Frequent Directions
1: Input: $A = [a^{(1)}, \dots, a^{(T)}]^\top \in \mathbb{R}^{T \times d}$, $B^{(m-1)} = [a^{(1)}, \dots, a^{(m-1)}]^\top$
2: for $t = m, \dots, T$ do
3:   $\widehat{B}^{(t-1)} = \begin{bmatrix} B^{(t-1)} \\ (a^{(t)})^\top \end{bmatrix}$
4:   Compute the SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
5:   $B^{(t)} = \sqrt{\big(\Sigma^{(t-1)}_{m-1}\big)^2 - \big(\sigma^{(t-1)}_m\big)^2 I_{m-1}} \cdot \big(V^{(t-1)}_{m-1}\big)^\top$
6: end for
7: Output: $B = B^{(T)}$

FD has the following error bound for any $k < m$:

$$\|A^\top A - B^\top B\|_2 \le \frac{1}{m-k} \|A - [A]_k\|_F^2. \qquad (1)$$

The above result means that the space complexity of FD is optimal regardless of streaming issues, because any algorithm satisfying $\|A^\top A - B^\top B\|_2 \le \|A - [A]_k\|_F^2/(m-k)$ requires $O(md)$ space to represent the matrix $B$ (Ghashami et al., 2016). The dominant computation of the algorithm is the SVD of $\widehat{B}^{(t-1)}$, which costs $O(m^2 d)$ with the standard SVD implementation. However, the total cost can be reduced from $O(Tm^2 d)$ to $O(Tmd)$ by doubling the space (Algorithm 4 in Appendix A) or using the Gu-Eisenstat procedure (Gu and Eisenstat, 1993).

Desai et al. (2016) proposed some extensions of FD. More specifically, Parameterized FD (PFD) uses an extra hyperparameter to describe the proportion of singular values shrunk in each iteration. PFD improves the performance empirically, but has a worse error bound than FD by a constant factor. Compensative FD (CFD) modifies the output of FD by increasing the singular values and keeps the same error guarantees as FD.
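The following NumPy sketch (our illustration, not the authors' code) implements Algorithm 1 directly with one SVD per row, i.e. the plain $O(m^2 d)$-per-step variant rather than the doubled-space speedup, and checks the error bound (1) numerically:

```python
import numpy as np

def frequent_directions(A, m):
    """Frequent directions (Algorithm 1): returns B with m-1 rows such that
    B^T B approximates A^T A."""
    T, d = A.shape
    B = A[:m - 1].copy()                       # B^(m-1): the first m-1 rows of A
    for t in range(m - 1, T):
        B_hat = np.vstack([B, A[t]])           # append the new row a^(t)
        _, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
        shrink = np.sqrt(np.maximum(s[:m - 1] ** 2 - s[m - 1] ** 2, 0.0))
        B = shrink[:, None] * Vt[:m - 1]       # line 5 of Algorithm 1
    return B

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30))
m, k = 10, 5
B = frequent_directions(A, m)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
s = np.linalg.svd(A, compute_uv=False)
bound = np.sum(s[k:] ** 2) / (m - k)           # ||A - [A]_k||_F^2 / (m - k), cf. (1)
print(err <= bound + 1e-8)                     # True
```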


3. Online Newton Methods

For ease of demonstrating our work, we introduce sketching techniques in the online learning scenario. First of all, we present the background of convex online learning, including online Newton step algorithms. Then we discuss the connection between online learning and sketched second order methods, which motivates us to propose a more robust sketching algorithm.

3.1. Convex Online Learning

Online learning is performed in a sequence of consecutive rounds (Shalev-Shwartz, 2011). We consider the problem of convex online optimization as follows. For a sequence of examples $\{x^{(t)} \in \mathbb{R}^d\}$ and convex smooth loss functions $\{f_t : \mathcal{K}_t \to \mathbb{R}\}$, where $f_t(w) \triangleq \ell_t(w^\top x^{(t)})$ and the $\mathcal{K}_t \subset \mathbb{R}^d$ are convex compact sets, the learner outputs a predictor $w^{(t)}$ and suffers the loss $f_t(w^{(t)})$ at the $t$-th round. The cumulative regret at round $T$ is defined as

$$R_T(w^*) = \sum_{t=1}^T f_t(w^{(t)}) - \sum_{t=1}^T f_t(w^*),$$

where $w^* = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}} \sum_{t=1}^T f_t(w)$ and $\mathcal{K} = \bigcap_{t=1}^T \mathcal{K}_t$.

We make the following assumptions on the loss functions.

Assumption 1 The loss functions $\ell_t$ satisfy $|\ell_t'(z)| \le L$ whenever $|z| \le C$, where $L$ and $C$ are positive constants.

Assumption 2 There exists a $\mu_t \ge 0$ such that for all $u, w \in \mathcal{K}$, we have
$$f_t(w) \ge f_t(u) + \nabla f_t(u)^\top (w - u) + \frac{\mu_t}{2} \big(\nabla f_t(u)^\top (w - u)\big)^2.$$

Note that for a loss function $f_t$ whose domain and gradient have bounded diameter, Assumption 2 only requires the exp-concave property, which is more general than strong convexity (Hazan, 2016). For example, the square loss function $f_t(w) = (w^\top x_t - y_t)^2$ satisfies Assumption 2 with $\mu_t = \frac{1}{8C^2}$ if it is subject to the constraints $|w^\top x_t| \le C$ and $|y_t| \le C$ (Luo et al., 2016), but it is not strongly convex.

One typical online learning algorithm is online gradient descent (OGD) (Hazan et al., 2007; Zinkevich, 2003). At the $(t+1)$-th round, OGD exploits the following update rules:
$$u^{(t+1)} = w^{(t)} - \beta_t g^{(t)}, \qquad w^{(t+1)} = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}_{t+1}} \|w - u^{(t+1)}\|,$$
where $g^{(t)} = \nabla f_t(w^{(t)})$ and $\beta_t$ is the learning rate. The algorithm has linear computation cost and achieves an $O(\frac{L^2}{H} \log T)$ regret bound for $H$-strongly convex losses.

In this paper, we are more interested in online Newton step algorithms (Hazan et al., 2007; Luo et al., 2016). The standard online Newton step keeps the curvature information in a matrix $H^{(t)} \in \mathbb{R}^{d \times d}$ that is updated sequentially, and iterates as follows:
$$u^{(t+1)} = w^{(t)} - \beta_t (H^{(t)})^{-1} g^{(t)}, \qquad w^{(t+1)} = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}_{t+1}} \|w - u^{(t+1)}\|_{H^{(t)}}. \qquad (2)$$
The matrix $H^{(t)}$ is constructed from the outer products of historical gradients (Duchi et al., 2011; Luo et al., 2016), such as
$$H^{(t)} = \sum_{i=1}^t g^{(i)} (g^{(i)})^\top + \alpha_0 I_d, \qquad (3)$$
or
$$H^{(t)} = \sum_{i=1}^t (\mu_i + \eta_i) g^{(i)} (g^{(i)})^\top + \alpha_0 I_d, \qquad (4)$$
where $\alpha_0 \ge 0$ is a fixed regularization parameter, $\mu_t$ is the constant in Assumption 2, and $\eta_t$ is typically chosen as $O(1/t)$.
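As a toy illustration of update (2) with the curvature matrix (3) (a minimal sketch of ours, assuming the unconstrained case $\mathcal{K}_t = \mathbb{R}^d$ so that the projection step disappears, and a fixed learning rate):

```python
import numpy as np

def online_newton_step(grad_fn, d, T, beta=1.0, alpha0=1.0):
    """Unconstrained online Newton step: H^(t) is built as in (3) and the
    projection of (2) is dropped, so the update is w <- w - beta * H^{-1} g.
    It costs O(d^2)-O(d^3) per round, which is what motivates sketching."""
    w = np.zeros(d)
    H = alpha0 * np.eye(d)                 # H^(0) = alpha_0 * I_d
    for t in range(T):
        g = grad_fn(w, t)                  # g^(t) = grad f_t(w^(t))
        H += np.outer(g, g)                # rank-one update, equation (3)
        w = w - beta * np.linalg.solve(H, g)
    return w

# Example: online squared loss f_t(w) = (w^T x_t - y_t)^2 on random data.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((200, 10)), rng.standard_normal(200)
grad = lambda w, t: 2.0 * (X[t] @ w - y[t]) * X[t]
print(online_newton_step(grad, d=10, T=200)[:3])
```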


The second order algorithms enjoy a logarithmic regret bound without the strong convexity assumption, but they require quadratic space and computation cost. Some variants of online Newton algorithms have been applied to optimize neural networks (Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017), but they do not provide theoretical guarantees in nonconvex cases.

3.2. Efficient Algorithms by Sketching

To make the online Newton step scalable, it is natural to use sketching techniques (Woodruff, 2014). The matrix $H^{(t)}$ in online learning has the form $H^{(t)} = (A^{(t)})^\top A^{(t)} + \alpha_0 I_d$, where $A^{(t)} \in \mathbb{R}^{t \times d}$ is the corresponding term of (3) or (4), such as
$$A^{(t)} = [g^{(1)}, \dots, g^{(t)}]^\top, \qquad \text{or} \qquad A^{(t)} = [\sqrt{\mu_1 + \eta_1}\, g^{(1)}, \dots, \sqrt{\mu_t + \eta_t}\, g^{(t)}]^\top.$$
The sketching algorithm approximates $(A^{(t)})^\top A^{(t)}$ by $(B^{(t)})^\top B^{(t)}$, where the sketch matrix $B^{(t)} \in \mathbb{R}^{m \times d}$ is much smaller than $A^{(t)}$ and $m \ll d$. Then we can use $(B^{(t)})^\top B^{(t)} + \alpha_0 I_d$ to replace $H^{(t)}$ in update (2) (Luo et al., 2016). By the Woodbury identity, we can reduce the computation of the update from $O(d^2)$ to $O(m^2 d)$ or $O(md)$. There are several choices of sketching techniques, such as random projection (Achlioptas, 2003; Indyk and Motwani, 1998; Kane and Nelson, 2014), frequent directions (Ghashami et al., 2016; Liberty, 2013) and Oja's algorithm (Oja, 1982; Oja and Karhunen, 1985). However, all of the above methods treat $\alpha_0$ as a given hyperparameter which is independent of the sketch matrix $B^{(t)}$. In practice, the performance of sketched online Newton methods is sensitive to the choice of the hyperparameter $\alpha_0$.
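The Woodbury step mentioned above can be sketched as follows (our illustration, assuming an $m \times d$ sketch $B$): the product $(B^\top B + \alpha_0 I_d)^{-1} g$ is obtained through an $m \times m$ system instead of a $d \times d$ one.

```python
import numpy as np

def apply_inverse_sketched(B, alpha, g):
    """Compute (B^T B + alpha * I_d)^{-1} g in O(m^2 d) time via the Woodbury
    identity, without ever forming the d x d matrix:
    (B^T B + alpha I)^{-1} = (1/alpha) * (I - B^T (alpha I_m + B B^T)^{-1} B)."""
    m = B.shape[0]
    K = alpha * np.eye(m) + B @ B.T           # small m x m matrix
    return (g - B.T @ np.linalg.solve(K, B @ g)) / alpha

# Sanity check against the explicit d x d solve on a small example.
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 50))
g = rng.standard_normal(50)
alpha = 0.3
direct = np.linalg.solve(B.T @ B + alpha * np.eye(50), g)
print(np.allclose(direct, apply_inverse_sketched(B, alpha, g)))   # True
```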


4. Robust Frequent Directions

In many machine learning applications such as online learning (Hazan and Arora, 2006; Hazan et al., 2007; Hazan, 2016; Luo et al., 2016), Gaussian process regression (Rasmussen and Williams, 2006) and kernel ridge regression (Drineas and Mahoney, 2005), we usually require an additional regularization term to make the matrix invertible and well-conditioned, while conventional sketching methods only focus on the low-rank approximation. On the other hand, the update of frequent directions is not optimal from the viewpoint of minimizing the approximation error in each iteration. Both observations motivate us to propose robust frequent directions (RFD), which incorporates the update of the sketch matrix and of the regularization term into one framework.

4.1. The Algorithm

RFD approximates $A^\top A$ by $B^\top B + \alpha I_d$ with $\alpha > 0$. We demonstrate the detailed implementation of RFD in Algorithm 2.

The main difference between RFD and conventional sketching algorithms is the additional term $\alpha I_d$. We can directly use Algorithm 2 to approximate $A^\top A$ with $\alpha^{(m-1)} = \alpha_0 > 0$ if the target matrix is $A^\top A + \alpha_0 I_d$. Compared with the standard FD, RFD only needs to maintain one extra variable $\alpha^{(t)}$ by scalar operations in each iteration; hence the cost of RFD is almost the same as that of FD. Because the value of $\alpha^{(t)}$ typically increases from the $(m+1)$-th round in practice, the resulting $B^\top B + \alpha I_d$ is positive definite even if the initial $\alpha^{(0)}$ is zero. Also, we can further accelerate the algorithm by doubling the space.

Algorithm 2 Robust Frequent Directions
1: Input: $A = [a^{(1)}, \dots, a^{(T)}]^\top \in \mathbb{R}^{T \times d}$, $B^{(m-1)} = [a^{(1)}, \dots, a^{(m-1)}]^\top$, $\alpha^{(m-1)} = 0$
2: for $t = m, \dots, T$ do
3:   $\widehat{B}^{(t-1)} = \begin{bmatrix} B^{(t-1)} \\ (a^{(t)})^\top \end{bmatrix}$
4:   Compute the SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
5:   $B^{(t)} = \sqrt{\big(\Sigma^{(t-1)}_{m-1}\big)^2 - \big(\sigma^{(t-1)}_m\big)^2 I_{m-1}} \cdot \big(V^{(t-1)}_{m-1}\big)^\top$
6:   $\alpha^{(t)} = \alpha^{(t-1)} + \big(\sigma^{(t-1)}_m\big)^2 / 2$
7: end for
8: Output: $B = B^{(T)}$ and $\alpha = \alpha^{(T)}$.

4.2. Theoretical Analysis

Before presenting the theoretical results of RFD, we review FD from the viewpoint of low-rank approximation, which provides a motivation for the design of our algorithm. At the $t$-th round of FD (Algorithm 1), we have the matrix $B^{(t-1)}$, which is used to approximate $(A^{(t-1)})^\top A^{(t-1)}$ by $(B^{(t-1)})^\top B^{(t-1)}$, and we aim to construct a new approximation which includes the new data $a^{(t)}$, that is,
$$(B^{(t)})^\top B^{(t)} \approx (B^{(t-1)})^\top B^{(t-1)} + a^{(t)} (a^{(t)})^\top = (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)}. \qquad (5)$$
The straightforward way to find $B^{(t)}$ is to minimize the approximation error of (5) in the spectral norm under a low-rank constraint:
$$B'^{(t)} = \mathop{\mathrm{argmin}}_{\mathrm{rank}(C) = m-1} \big\|(\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} - C^\top C\big\|_2. \qquad (6)$$
By the SVD of $\widehat{B}^{(t-1)}$, we have the solution $B'^{(t)} = \Sigma^{(t-1)}_{m-1} \big(V^{(t-1)}_{m-1}\big)^\top$. From this viewpoint, the update of FD
$$B^{(t)} = \sqrt{\big(\Sigma^{(t-1)}_{m-1}\big)^2 - \big(\sigma^{(t-1)}_m\big)^2 I_{m-1}} \cdot \big(V^{(t-1)}_{m-1}\big)^\top \qquad (7)$$
looks imperfect, because it is not an optimal low-rank approximation. However, the shrinkage operation in (7) is necessary. If we took a greedy strategy (Brand, 2002; Hall et al., 1998; Levey and Lindenbaum, 2000; Ross et al., 2008) which directly replaces $B^{(t)}$ with $B'^{(t)}$ in FD, it would perform worse in some specific cases¹ and also has no valid global error bound like (1).

Hence, the question is: can we devise a method which enjoys the optimality in each step and maintains a tighter global error bound at the same time? Fortunately, RFD is just such an algorithm, holding both properties. We now explain the update rule of RFD formally and provide the approximation error bound.

1. We provide an example in Appendix F.
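As a concrete companion to Algorithm 2, here is a minimal NumPy sketch of RFD (our illustration, not the authors' code); since FD and RFD produce the same sketch matrix $B$ and differ only in the scalar $\alpha$, the last two lines also compare their spectral-norm errors.

```python
import numpy as np

def robust_frequent_directions(A, m, alpha0=0.0):
    """Robust frequent directions (Algorithm 2): returns (B, alpha) so that
    B^T B + alpha * I_d approximates A^T A (plus alpha0 * I_d if alpha0 > 0)."""
    T, d = A.shape
    B = A[:m - 1].copy()                        # B^(m-1)
    alpha = alpha0                              # alpha^(m-1)
    for t in range(m - 1, T):
        B_hat = np.vstack([B, A[t]])
        _, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
        sigma_m_sq = s[m - 1] ** 2
        B = np.sqrt(np.maximum(s[:m - 1] ** 2 - sigma_m_sq, 0.0))[:, None] * Vt[:m - 1]
        alpha += sigma_m_sq / 2.0               # line 6 of Algorithm 2
    return B, alpha

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30))
B, alpha = robust_frequent_directions(A, m=10)
M = A.T @ A
print(np.linalg.norm(M - (B.T @ B + alpha * np.eye(30)), 2))   # RFD error
print(np.linalg.norm(M - B.T @ B, 2))                          # FD error (same B, alpha = 0)
```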


We first give the following theorem, which plays an important role in our analysis.

Theorem 1 Given a positive semi-definite matrix $M \in \mathbb{R}^{d \times d}$ and a positive integer $k < d$, let $M = U \Sigma U^\top$ be the SVD of $M$. Let $U_k$ denote the matrix of the first $k$ columns of $U$ and $\sigma_k$ be the $k$-th singular value of $M$. Then the pair $(\widehat{C}, \widehat{\delta})$, defined as $\widehat{C} = U_k (\Sigma_k - \xi I_k)^{1/2} Q$ and $\widehat{\delta} = (\sigma_{k+1} + \sigma_d)/2$, where $\xi \in [\sigma_d, \sigma_{k+1}]$ and $Q$ is an arbitrary $k \times k$ orthonormal matrix, is the global minimizer of
$$\min_{C \in \mathbb{R}^{d \times k},\, \delta \in \mathbb{R}} \|M - (C C^\top + \delta I_d)\|_2. \qquad (8)$$
Additionally, we have $\|M - (\widehat{C} \widehat{C}^\top + \widehat{\delta} I_d)\|_2 \le \|M - U_k \Sigma_k U_k^\top\|_2$, and the equality holds if and only if $\mathrm{rank}(M) \le k$.

Theorem 1 provides the optimal solution in closed form for matrix approximation with a regularization term. In the case of $\mathrm{rank}(M) > k$, the approximation $\widehat{C}\widehat{C}^\top + \widehat{\delta} I_d$ is full rank and has strictly lower spectral norm error than the rank-$k$ truncated SVD. Note that Zhang (2014) has established the corresponding optimality result for the Frobenius norm.²

Recall that in the streaming case, our goal at the $t$-th round is to approximate the concatenation of the historical approximation and the current data. The following theorem shows that the update of RFD is optimal with respect to the spectral norm in each step.

Theorem 2 Based on the updates in Algorithm 2, we have
$$(B^{(t)}, \alpha^{(t)}) = \mathop{\mathrm{argmin}}_{B \in \mathbb{R}^{(m-1) \times d},\, \alpha \in \mathbb{R}} \big\|(\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} + \alpha^{(t-1)} I_d - (B^\top B + \alpha I_d)\big\|_2. \qquad (9)$$

Theorem 2 explains RFD from an optimization viewpoint. It shows that each step of RFD is optimal for the current information. By this theorem, the update of the standard FD corresponds to $(B, \alpha) = (B^{(t)}, 0)$, which is not the optimal solution of (9). Intuitively, the regularization term of RFD compensates each direction for the over-reduction caused by the shrinkage operation of FD. Theorem 2 also implies that RFD is an online extension of the approximation in Theorem 1. We can prove Theorem 2 by using Theorem 1 with $M = (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} + \alpha^{(t-1)} I_d$. We defer the details to Appendix C.

RFD also enjoys a tighter approximation error than FD, as the following theorem shows.

Theorem 3 For any $k < m$, we have
$$\big\|A^\top A - (B^\top B + \alpha I_d)\big\|_2 \le \frac{1}{2(m-k)} \|A - [A]_k\|_F^2, \qquad (10)$$
where $[A]_k$ is the best rank-$k$ approximation to $A$ in both the Frobenius and spectral norms.

The right-hand side of inequality (10) is half of that in (1), which means RFD reduces the approximation error significantly with only one extra scalar.

Real applications usually consider the matrix with a regularization term. Hence we also consider approximating the matrix $M = A^\top A + \alpha_0 I_d$, where $\alpha_0 > 0$ and the rows of $A$ are available in sequential order.

2. We also give a concise proof of the result of Zhang (2014) in Appendix B.
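A small numerical illustration of Theorem 1 (our sketch; we take $Q = I_k$ and $\xi = \widehat{\delta}$, which lies in the allowed interval $[\sigma_d, \sigma_{k+1}]$): the regularized approximation attains spectral error $(\sigma_{k+1} - \sigma_d)/2$, strictly below the error $\sigma_{k+1}$ of the rank-$k$ truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 30, 5
X = rng.standard_normal((d, d))
M = X @ X.T                                    # a positive semi-definite matrix

lam, U = np.linalg.eigh(M)
lam, U = lam[::-1], U[:, ::-1]                 # eigenvalues sorted in decreasing order
delta = (lam[k] + lam[-1]) / 2                 # delta_hat = (sigma_{k+1} + sigma_d) / 2
xi = delta                                     # any xi in [sigma_d, sigma_{k+1}] works
C = U[:, :k] * np.sqrt(lam[:k] - xi)           # C_hat with Q = I_k

err_reg = np.linalg.norm(M - (C @ C.T + delta * np.eye(d)), 2)
err_svd = np.linalg.norm(M - U[:, :k] @ np.diag(lam[:k]) @ U[:, :k].T, 2)
print(err_reg, (lam[k] - lam[-1]) / 2)         # these two agree
print(err_svd, lam[k])                         # these two agree, and exceed err_reg
```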


Suppose that the standard FD approximates $A^\top A$ by $B^\top B$; then it estimates $M$ as $M_{\mathrm{FD}} = B^\top B + \alpha_0 I_d$. Meanwhile, RFD generates the approximation $M_{\mathrm{RFD}} = B^\top B + \alpha I_d$ by setting $\alpha^{(m-1)} = \alpha_0$. Theorem 4 shows that the condition number of $M_{\mathrm{RFD}}$ is better than those of $M_{\mathrm{FD}}$ and $M$. In general, the equality in Theorem 4 usually cannot be achieved for $t > m$ unless $(a^{(t)})^\top$ lies exactly in the row space of $B^{(t-1)}$ or the first $t$ rows of $A$ have a perfect low-rank structure. Hence RFD is more likely to generate a well-conditioned approximation than the others.

Theorem 4 With the notation of Algorithms 1 and 2, let $M = A^\top A + \alpha_0 I_d$, $M_{\mathrm{FD}} = B^\top B + \alpha_0 I_d$, $M_{\mathrm{RFD}} = B^\top B + \alpha I_d$ and $\alpha^{(m-1)} = \alpha_0$, where $\alpha_0 > 0$ is a fixed scalar. Then we have $\kappa(M_{\mathrm{RFD}}) \le \kappa(M_{\mathrm{FD}})$ and $\kappa(M_{\mathrm{RFD}}) \le \kappa(M)$.

5. The Online Newton Step by RFD

We now present the sketched online Newton step by robust frequent directions (RFD-SON). The procedure is shown in Algorithm 3, which is similar to the sketched online Newton (SON) algorithms (Luo et al., 2016) but uses the new sketching method RFD. The matrix $H^{(t)}$ in Line 10 is not constructed explicitly in practice; it is written out only for the ease of analysis. The updates of $u^{(t)}$ and $w^{(t)}$ can be performed in $O(md)$ time and space by the Woodbury identity; we demonstrate the details in the appendix. When $d$ is large, RFD-SON is much more efficient than the standard online Newton step with the full Hessian, which requires $O(d^2)$ in both time and space.

Note that we do not require the hyperparameter $\alpha_0$ to be strictly positive in RFD-SON. In practice, RFD-SON always achieves good performance by setting $\alpha_0 = 0$, which leads to a hyperparameter-free algorithm, while the existing SON algorithm needs to select $\alpha_0$ carefully. We consider the general case $\alpha_0 \ge 0$ in this section for the ease of analysis.

Theorem 5 Let $\mu = \min_{1 \le t \le T}\{\mu_t\}$ and $\mathcal{K} = \bigcap_{t=1}^T \mathcal{K}_t$. Then under Assumptions 1 and 2, for any $w \in \mathcal{K}$, Algorithm 3 has the following regret for $\alpha_0 > 0$:
$$R_T(w) \le \frac{\alpha_0}{2}\|w\|^2 + 2(CL)^2 \sum_{t=1}^T \eta_t + \frac{m}{2(\mu + \eta_T)} \ln\!\left(\frac{\mathrm{tr}\big((B^{(T)})^\top B^{(T)}\big)}{m \alpha_0} + \frac{\alpha^{(T)}}{\alpha_0}\right) + \Omega_{\mathrm{RFD}}, \qquad (11)$$
where
$$\Omega_{\mathrm{RFD}} = \frac{d-m}{2(\mu + \eta_T)} \ln\frac{\alpha^{(T)}}{\alpha_0} + \frac{m}{4(\mu + \eta_T)} \sum_{t=1}^T \frac{\big(\sigma^{(t)}_m\big)^2}{\alpha^{(t)}} + C^2 \sum_{t=1}^T \big(\sigma^{(t-1)}_m\big)^2.$$

We present the regret bound of RFD-SON for positive $\alpha_0$ in Theorem 5. The term $\Omega_{\mathrm{RFD}}$ in (11) is the main gap between RFD-SON and the standard online Newton step without sketching. $\Omega_{\mathrm{RFD}}$ is dominated by its last term, which can be bounded as in (1). If we apply the standard FD to the sketched online Newton step (Luo et al., 2016) (FD-SON), the regret bound is similar to (11) but the gap becomes
$$\Omega_{\mathrm{FD}} = \frac{m\, \Omega_k}{2(m-k)(\mu + \eta_T)\, \alpha_0},$$


where $\Omega_k$ plays a role similar to the term $\sum_{t=1}^T (\sigma^{(t-1)}_m)^2$ in RFD-SON; its detailed definition can be found in Luo et al. (2016). This result is heavily dependent on the hyperparameter $\alpha_0$. If we increase the value of $\alpha_0$, the gap $\Omega_{\mathrm{FD}}$ can be reduced but the term $\frac{\alpha_0}{2}\|w\|^2$ in the bound will increase, and vice versa. In other words, we need to trade off $\frac{\alpha_0}{2}\|w\|^2$ and $\Omega_{\mathrm{FD}}$ by tuning $\alpha_0$ carefully. For RFD-SON, Theorem 5 implies that we can set $\alpha_0$ to be sufficiently small to reduce $\frac{\alpha_0}{2}\|w\|^2$, and this has only a limited effect on the term $\Omega_{\mathrm{RFD}}$. The reason is that the first term of $\Omega_{\mathrm{RFD}}$ contains $\frac{1}{\alpha_0}$ inside the logarithm and the second term contains $\alpha^{(t)} = \alpha_0 + \frac{1}{2}\sum_{i=1}^{t-1} (\sigma^{(i)}_m)^2$ in the denominator. For large $t$, $\alpha^{(t)}$ depends mainly on $\sum_{i=1}^{t-1} (\sigma^{(i)}_m)^2$ rather than on $\alpha_0$. Hence the regret bound of RFD-SON is much less sensitive to the hyperparameter $\alpha_0$ than that of FD-SON. We have $\sum_{t=1}^T (\sigma^{(t-1)}_m)^2 \le \|A - [A]_k\|_F^2/(m-k)$ for $k < m$.

Consider RFD-SON with $\alpha_0 = 0$. Typically, the parameter $\alpha^{(t)}$ is zero for only the first few iterations and then increases to be strictly positive. Hence the learning algorithm can be divided into two phases based on whether $\alpha^{(t)}$ is zero. Suppose that $T'$ satisfies
$$\alpha^{(t)} \begin{cases} = 0, & t < T', \\ > 0, & t \ge T'. \end{cases} \qquad (12)$$
Then the regret can be written as
$$R_T(w) = R_{1:T'}(w) + R_{(T'+1):T}(w),$$

Algorithm 3 RFD for Online Newton Step
1: Input: $\alpha^{(0)} = \alpha_0 \ge 0$, $m \ll d$, …
⋮
7:   Compute the SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
8:   $B^{(t)} = \sqrt{\big(\Sigma^{(t-1)}_{m-1}\big)^2 - \big(\sigma^{(t-1)}_m\big)^2 I_{m-1}} \cdot \big(V^{(t-1)}_{m-1}\big)^\top$
9:   $\alpha^{(t)} = \alpha^{(t-1)} + \big(\sigma^{(t-1)}_m\big)^2 / 2$
10:  $H^{(t)} = (B^{(t)})^\top B^{(t)} + \alpha^{(t)} I_d$
11:  $u^{(t+1)} = w^{(t)} - (H^{(t)})^\dagger g^{(t)}$
12:  $w^{(t+1)} = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}_{t+1}} \|w - u^{(t+1)}\|_{H^{(t)}}$
13: end for
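To make Algorithm 3 concrete, here is a minimal unconstrained NumPy sketch of one possible instantiation (ours, not the authors' reference code): we take $\alpha_0 = 0$, sketch the unweighted gradients as in (3), and drop the projection in line 12 (i.e. $\mathcal{K}_t = \mathbb{R}^d$); whenever $\alpha^{(t)} > 0$, the Newton direction is applied through the Woodbury identity.

```python
import numpy as np

def rfd_son_unconstrained(grad_fn, d, T, m):
    """Sketched online Newton step with RFD (cf. Algorithm 3), alpha_0 = 0,
    unweighted gradient sketch, no projection."""
    w, alpha = np.zeros(d), 0.0
    B = np.zeros((m - 1, d))                       # empty sketch B^(0)
    for t in range(T):
        g = grad_fn(w, t)                          # g^(t) = grad f_t(w^(t))
        B_hat = np.vstack([B, g])                  # append the new gradient
        _, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
        sigma_m_sq = s[m - 1] ** 2
        B = np.sqrt(np.maximum(s[:m - 1] ** 2 - sigma_m_sq, 0.0))[:, None] * Vt[:m - 1]
        alpha += sigma_m_sq / 2.0
        # Newton direction (H^(t))^dagger g with H^(t) = B^T B + alpha * I_d.
        if alpha > 0:                              # Woodbury solve, O(m^2 d) per round
            K = alpha * np.eye(m - 1) + B @ B.T
            step = (g - B.T @ np.linalg.solve(K, B @ g)) / alpha
        else:                                      # early rounds: pseudoinverse of B^T B
            step = np.linalg.pinv(B.T @ B) @ g
        w = w - step
    return w

rng = np.random.default_rng(0)
X, y = rng.standard_normal((300, 20)), rng.standard_normal(300)
grad = lambda w, t: 2.0 * (X[t] @ w - y[t]) * X[t]
print(rfd_son_unconstrained(grad, d=20, T=300, m=8)[:3])
```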
