to denote the vector representation of V, the Hessian of f w.r.t. v will be a block-diagonal matrix: $K_v \triangleq \frac{\partial^2 f}{\partial v\, \partial v^T} = \mathrm{diag}[K_1, K_2, \ldots, K_m]$, where $K_j = \sum_{i=1}^{n} U_{i*}^T U_{i*} + \alpha I$. It is easy to verify that $K_j \succ 0$ and $K_v \succ 0$. Hence, f is convex w.r.t. V. □

Learning Strategy

We adopt an alternating projection method to learn the parameters U and V. More specifically, each time we fix one parameter and then update the other one. This procedure will be repeated for several iterations until some termination condition is satisfied. We will prove that the learning algorithm is convergent.
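To make the alternating scheme concrete, the following Python/NumPy skeleton sketches the outer loop. It is only an illustration under stated assumptions, not a released implementation: the names (learn_rrmf, update_U, update_V, objective, max_iter, tol) are hypothetical, and update_U and update_V stand for the column-wise update of U and the row-wise update of V derived in the following paragraphs.

```python
import numpy as np

def learn_rrmf(X, L, D, alpha, beta, update_U, update_V, objective,
               max_iter=50, tol=1e-4, seed=0):
    """Alternating optimization skeleton: fix V and update U, then fix U and
    update V, repeating until the decrease of the objective falls below tol
    or max_iter is reached.  update_U, update_V and objective are placeholders
    for the routines described in the 'Learning U' and 'Learning V' paragraphs."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((n, D))   # latent factors of the n entities
    V = 0.01 * rng.standard_normal((m, D))   # latent factors of the m terms

    prev = np.inf
    for _ in range(max_iter):
        U = update_U(X, U, V, L, alpha, beta)   # column-by-column update of U
        V = update_V(X, U, alpha)               # closed-form row-wise update of V
        obj = objective(X, U, V, L, alpha, beta)
        if prev - obj < tol:   # objective is non-increasing and bounded below by 0
            break
        prev = obj
    return U, V
```

Stopping on the decrease of the objective mirrors the convergence argument given in Section 2.4.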
Learning U

According to Theorem 1, one straightforward way to learn U is to set the gradient of f w.r.t. U to 0 and solve the corresponding linear system. However, this is computationally demanding if n is large, because we would have to invert a Hessian matrix of size as large as nD × nD. In this paper, we adopt an alternative strategy to perform optimization on U, which is to optimize one column $U_{*d}$ at a time with the other columns fixed. Because f is convex w.r.t. U, this alternative strategy is guaranteed to converge to the optimal solution.

Because f is convex w.r.t. U, f is also convex w.r.t. $U_{*d}$ with all other variables fixed. Hence, computing the gradient of f w.r.t. $U_{*d}$ and setting it to 0, we can optimize $U_{*d}$ by solving the following linear system:

$$F^{(d)} U_{*d} = e^{(d)}, \qquad (4)$$

where $F^{(d)} = E^{(d)} + \alpha I + \beta L$, $E^{(d)} = \mathrm{diag}\!\left(\frac{\partial^2 g}{\partial U_{1d}\partial U_{1d}}, \frac{\partial^2 g}{\partial U_{2d}\partial U_{2d}}, \ldots, \frac{\partial^2 g}{\partial U_{nd}\partial U_{nd}}\right)$ with $\frac{\partial^2 g}{\partial U_{id}\partial U_{id}} = \sum_{j=1}^{m} V_{jd}^2$, and $e^{(d)} = (e_1^{(d)}, e_2^{(d)}, \ldots, e_n^{(d)})^T$ with $e_i^{(d)} = \sum_{j=1}^{m} V_{jd}\,(X_{ij} - U_{i*}V_{j*}^T + U_{id}V_{jd})$.

One direct way to solve the linear system in (4) is to set $U_{*d} = [F^{(d)}]^{-1} e^{(d)}$. However, the computation cost is O(n³), which is prohibitive for general text classification applications because n is always very large. Here, we propose to use the steepest descent method [Shewchuk, 1994] to iteratively update $U_{*d}$:

$$r(t) = e^{(d)} - F^{(d)} U_{*d}(t), \qquad \delta(t) = \frac{r(t)^T r(t)}{r(t)^T F^{(d)} r(t)}, \qquad U_{*d}(t+1) = U_{*d}(t) + \delta(t)\, r(t).$$

Steepest descent guarantees that the algorithm converges to the global minimum, with the objective function value decreasing in each iteration. Suppose the number of nonzero elements in L is M. The computation cost in each iteration is then O(n + M). If the number of iterations is K, the time complexity will be O(K(n + M)). From our experiments, good performance can be achieved with a small value of K, such as K = 10 in the experiments that follow. Furthermore, M is typically a constant multiple of n. Hence, the overall complexity is roughly O(n), which is dramatically less than O(n³).
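The steepest-descent update for a single column $U_{*d}$ can be sketched as follows. This is an illustration of the equations above rather than the original implementation; build_system forms $F^{(d)}$ and $e^{(d)}$ densely for clarity (in practice $E^{(d)}$ is diagonal and L is sparse, which yields the O(n + M) per-iteration cost), and the iteration count num_iters plays the role of K in the complexity analysis.

```python
import numpy as np

def build_system(X, U, V, L, d, alpha, beta):
    """Form F^(d) = E^(d) + alpha*I + beta*L and the right-hand side e^(d)
    for column d (dense here for clarity; L would be sparse in practice)."""
    n = X.shape[0]
    e2g = np.sum(V[:, d] ** 2)      # d^2 g / dU_id^2 = sum_j V_jd^2 (same for all i)
    F_d = np.diag(np.full(n, e2g + alpha)) + beta * L
    # e_i^(d) = sum_j V_jd * (X_ij - U_i* V_j*^T + U_id V_jd)
    R = X - U @ V.T + np.outer(U[:, d], V[:, d])
    e_d = R @ V[:, d]
    return F_d, e_d

def steepest_descent_column(F_d, e_d, u_d, num_iters=10):
    """Refine one column U_{*d} by steepest descent on the quadratic whose
    stationarity condition is F_d @ u = e_d; num_iters is the K in the text."""
    u = u_d.copy()
    for _ in range(num_iters):
        r = e_d - F_d @ u            # residual r(t), the negative gradient
        Fr = F_d @ r
        denom = r @ Fr
        if denom <= 1e-12:           # numerically at the minimum already
            break
        delta = (r @ r) / denom      # exact line-search step delta(t)
        u = u + delta * r            # U_{*d}(t+1) = U_{*d}(t) + delta(t) r(t)
    return u
```

Looping steepest_descent_column over the D columns gives one pass of the update_U placeholder used in the skeleton above.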
Learning V

One very nice property of V is that the Hessian $K_v$ is block-diagonal, where each nonzero block corresponds to a row of V. Since the inverse of a block-diagonal matrix is obtained by inverting each block separately, the update of the whole matrix V naturally decomposes into the update of each row $V_{j*}$. Unlike the learning of U, all the Hessian matrices for different rows of V are equal, i.e., $K_j = K_k = K = \sum_{i=1}^{n} U_{i*}^T U_{i*} + \alpha I$. Furthermore, K is a small matrix of size D × D, where D is typically a small number (less than 50 in our experiments). Hence, we can directly update $V_{j*}$ as follows:

$$V_{j*} = \left(\sum_{i=1}^{n} X_{ij} U_{i*}\right) K^{-1}.$$
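Because K is shared by all rows of V, the row-wise updates can be stacked into a single matrix expression, $V = X^T U K^{-1}$ with $K = U^T U + \alpha I$. A minimal NumPy sketch is given below; the function name update_V matches the placeholder in the skeleton above and is not from the original code.

```python
import numpy as np

def update_V(X, U, alpha):
    """Closed-form update of all rows of V at once.
    K = U^T U + alpha*I is the shared D x D Hessian block, so
    V_{j*} = (sum_i X_ij U_{i*}) K^{-1} for every j, i.e. V = X^T U K^{-1}."""
    D = U.shape[1]
    K = U.T @ U + alpha * np.eye(D)   # small D x D matrix (D < 50)
    # Solve V K = X^T U for V instead of forming K^{-1} explicitly;
    # K is symmetric positive definite, so K.T == K.
    return np.linalg.solve(K.T, (X.T @ U).T).T
```

Since K is symmetric positive definite, a Cholesky solve could also be used here.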
2.4 Convergence and Complexity Analysis

Theorem 3 The learning algorithm will converge.

Proof: (Sketch) In each iteration, the learning algorithm ensures that the objective function value in (3) always decreases. Furthermore, the objective function is bounded below by 0. Hence, the learning algorithm will converge. Because f is not jointly convex w.r.t. U and V, the solution is a local optimum. □

We have seen that in each iteration, the time required for updating U is O(n), and it is easy to see that the time complexity for updating V is also O(n). In our experiments, the algorithm typically converges in fewer than 50 iterations. In fact, we find that 5 iterations are sufficient to achieve good performance. Hence, the overall time complexity of the learning algorithm is O(n).

3 Experiments

3.1 Data Sets and Evaluation Scheme

We use the same data sets, WebKB [Craven et al., 1998] and Cora [McCallum et al., 2000], and the same bag-of-words representation with the same original link structure as those in [Zhu et al., 2007] to evaluate our method. The WebKB data set contains about 6,000 web pages collected from the web sites of the computer science departments of four universities (Cornell, Texas, Washington, and Wisconsin). Each web page is labeled with one of seven categories: student, professor, course, project, staff, department, and "other". It should be noted that to train our RRMF method we adopt a simple preprocessing strategy for the link structure in this data set: if two web pages are co-linked by another common web page, we add a link between these two pages. After preprocessing, all the directed links are converted into undirected links. The characteristics of the WebKB data set are briefly summarized in Table 1.

Table 1: Characteristics of the WebKB data set.

             #classes  #entities  #terms
Cornell          7        827     4,134
Texas            7        814     4,029
Washington       7      1,166     4,165
Wisconsin        6      1,210     4,189

The Cora data set contains the abstracts and references of about 34,000 research papers from the computer science community. For fair comparison, we adopt the same subset of the data as that in [Zhu et al., 2007] to test our method. The task is to classify each paper into one of the subfields of data structure (DS), hardware and architecture (HA), machine learning