Jie Tang and Jing Zhang

The probability p(l_j | w_d) can be defined as:

p(l_j | w) = σ( Σ_{k=1}^{T} U_{jk} f(h_k) + e_j ),  f(h_k) = σ( Σ_{i=1}^{V} M_{ik} f(w_i) + Σ_{j=1}^{L} U_{jk} f(l_j) + a_k )   (2)

where σ(·) is a sigmoid function, defined as σ(x) = 1/(1 + exp(−x)); e are bias terms for citation relationships; f(h_k) is the feature function for hidden variable h_k; f(l_j) and f(w_i) are feature functions for citation relationship l_j and word w_i, respectively; and a are bias terms for hidden variables. For simplicity, we define f(w_i) as the count of word w_i in document d, and we define a binary-valued feature function for citation relationship l. For example, for document d, f(l_j) = 1 denotes that document d has a citation relationship with another paper d_j.

Now, the task is to learn the model parameters Θ = (M, U, a, b, e) given a training set D. Maximum-likelihood (ML) learning of the parameters can be done by gradient ascent with respect to the model parameters (b are bias terms for words). The exact gradient, for any parameter θ ∈ Θ, can be written as follows:

∂ log p(l|w) / ∂θ = E_{P_0}[l|w] − E_{P_M}[l|w]   (3)

where E_{P_0}[·] denotes an expectation with respect to the data distribution and E_{P_M}[·] is an expectation with respect to the distribution defined by the model. Computation of the expectation E_{P_M} is intractable. In practice, we use a stochastic approximation of this gradient, called the contrastive divergence gradient [4]. The algorithm cycles through the training data and updates the model parameters according to Algorithm 1, where the probabilities p(h_k|w, l), p(w_i|h), and p(l_j|h) are defined as:

p(h_k|w, l) = σ( Σ_{i=1}^{V} M_{ik} f(w_i) + Σ_{j=1}^{L} U_{jk} f(l_j) + a_k )   (4)

p(w_i|h) = σ( Σ_{k=1}^{T} M_{ik} f(h_k) + b_i )   (5)

p(l_j|h) = σ( Σ_{k=1}^{T} U_{jk} f(h_k) + e_j )   (6)

where b are bias terms for words and f(l_j) is the feature function for citation relationship.

Algorithm 1. Parameter learning via contrastive divergence
Input: training data D = {(w_d, l_d)}, topic number T, and learning rate λ
1. repeat
   (a) for each document d:
      i. sample each topic h_k according to (4);
      ii. sample each word w_i according to (5);
      iii. sample each citation relationship l_j according to (6);
   (b) end for
   (c) update each model parameter θ ∈ Θ by θ = θ + λ (∂ log p(l|w) / ∂θ)
2. until all model parameters Θ converge
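To make the update in Algorithm 1 concrete, the per-document step can be sketched as a one-step contrastive divergence (CD-1) update in NumPy. This is only an illustrative sketch under our own assumptions: the function name `cd1_step`, the matrix shapes (M is V×T, U is L×T), and the outer-product gradient estimates are ours, not the authors' implementation; the conditionals follow Eqs. (4)–(6).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(w, l, M, U, a, b, e, lam=0.01):
    """One CD-1 parameter update for a single document (illustrative sketch).

    w : length-V vector of word features f(w_i) (word counts)
    l : length-L binary vector of citation features f(l_j)
    M : V x T word-topic weights;  U : L x T citation-topic weights
    a, b, e : bias vectors for topics, words, and citations
    """
    # Positive phase: sample topics h_k from p(h_k | w, l), Eq. (4)
    ph0 = sigmoid(w @ M + l @ U + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: reconstruct words and citations, Eqs. (5) and (6)
    pw1 = sigmoid(h0 @ M.T + b)
    w1 = (rng.random(pw1.shape) < pw1).astype(float)
    pl1 = sigmoid(h0 @ U.T + e)
    l1 = (rng.random(pl1.shape) < pl1).astype(float)

    # Re-infer hidden units from the reconstruction
    ph1 = sigmoid(w1 @ M + l1 @ U + a)

    # CD-1 approximation of Eq. (3): data expectation minus
    # reconstruction (model) expectation, applied with learning rate lam
    M += lam * (np.outer(w, ph0) - np.outer(w1, ph1))
    U += lam * (np.outer(l, ph0) - np.outer(l1, ph1))
    a += lam * (ph0 - ph1)
    b += lam * (w - w1)
    e += lam * (l - l1)
    return M, U, a, b, e
```

The outer loop of Algorithm 1 would call `cd1_step` for every document in D and repeat until the parameters stop changing; in practice one also caps the number of passes over the data.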