It is easy to prove the following relationship between the Hamming distance $\mathrm{dist}_H(\cdot, \cdot)$ and the inner product of two binary codes:

$$\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j) = \frac{1}{2}\left(Q - \mathbf{b}_i^T \mathbf{b}_j\right) = \frac{1}{2}\left(Q - 2\Theta_{ij}\right).$$

We can see that the smaller $\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j)$ is, the larger $p(s_{ij} = 1 \mid B)$ will be. Maximizing the likelihood of $S$ in (1) will therefore make the Hamming distance between two similar points as small as possible, and that between two dissimilar points as large as possible. Hence this model is reasonable and matches the goal of preserving similarity.

With some prior $p(B)$, the posteriori of $B$ can be computed as follows:

$$p(B \mid S) \propto p(S \mid B)\,p(B).$$

We can use maximum a posteriori estimation to learn the optimal $B$. However, directly optimizing over $B$ is an NP-hard problem [34]. Following most existing hashing methods, we compute the optimal $B$ in two stages. In the first stage, we relax $B$ to be a real-valued matrix $U \in \mathbb{R}^{N \times Q}$. The $i$th row of $U$ is called the latent factor for the $i$th data point. We learn an optimal $U$ under the same probabilistic framework as for $B$. Then, in the second stage, we apply a rounding technique to the real-valued $U$ to get the binary codes $B$.
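The identity between the Hamming distance and the inner product can be checked numerically. The sketch below uses the names $Q$, $\mathbf{b}_i$, $\mathbf{b}_j$ from the text; the random codes themselves are purely illustrative.

```python
import numpy as np

# Verify dist_H(b_i, b_j) = (Q - b_i^T b_j) / 2 for codes in {-1, +1}^Q.
rng = np.random.default_rng(0)
Q = 8
b_i = rng.choice([-1, 1], size=Q)
b_j = rng.choice([-1, 1], size=Q)

hamming = int(np.sum(b_i != b_j))        # positions where the two codes differ
via_inner_product = (Q - b_i @ b_j) / 2  # the inner-product form of the distance

assert hamming == via_inner_product
```

The identity holds because matching bits contribute $+1$ and mismatching bits $-1$ to the inner product, so $\mathbf{b}_i^T \mathbf{b}_j = Q - 2\,\mathrm{dist}_H(\mathbf{b}_i, \mathbf{b}_j)$.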
More specifically, we replace all occurrences of $B$ in the previous equations with $U$. $\Theta_{ij}$ is then re-defined as:

$$\Theta_{ij} = \frac{1}{2} U_{i*} U_{j*}^T.$$

Similarly, $p(S \mid B)$, $p(B)$, and $p(B \mid S)$ are replaced with $p(S \mid U)$, $p(U)$, and $p(U \mid S)$, respectively. We place a normal prior on $U$:

$$p(U) = \prod_{d=1}^{Q} \mathcal{N}(U_{*d} \mid 0, \beta I),$$

where $\mathcal{N}(\cdot)$ denotes the normal distribution, $I$ is an identity matrix, and $\beta$ is a hyper-parameter. The log posteriori of $U$ can then be derived as:

$$L = \log p(U \mid S) = \sum_{s_{ij} \in S} \left( s_{ij}\Theta_{ij} - \log(1 + e^{\Theta_{ij}}) \right) - \frac{1}{2\beta} \|U\|_F^2 + c, \tag{2}$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $c$ is a constant term independent of $U$. The next step is to learn the optimal $U$ that maximizes $L$ in (2).

2.3 Learning

Since directly optimizing the whole $U$ can be very time-consuming, we optimize one row of $U$ at a time with the other rows fixed. We adopt the surrogate algorithm [15] to optimize each row $U_{i*}$. The surrogate learning algorithm can be viewed as a generalization of the expectation-maximization (EM) algorithm: it constructs a lower bound of the objective function, and then updates the parameters to maximize that lower bound. Just like the EM algorithm, different lower bounds and optimization procedures need to be derived for different models [15]. In the following, we derive the details of the surrogate learning algorithm for our model.

The gradient vector and the Hessian matrix of the objective function $L$ defined in (2) with respect to $U_{i*}$ can be derived as:

$$\frac{\partial L}{\partial U_{i*}^T} = \frac{1}{2} \sum_{j: s_{ij} \in S} (s_{ij} - a_{ij}) U_{j*}^T + \frac{1}{2} \sum_{j: s_{ji} \in S} (s_{ji} - a_{ji}) U_{j*}^T - \frac{1}{\beta} U_{i*}^T,$$

$$\frac{\partial^2 L}{\partial U_{i*}^T \partial U_{i*}} = -\frac{1}{4} \sum_{j: s_{ij} \in S} a_{ij}(1 - a_{ij}) U_{j*}^T U_{j*} - \frac{1}{4} \sum_{j: s_{ji} \in S} a_{ji}(1 - a_{ji}) U_{j*}^T U_{j*} - \frac{1}{\beta} I.$$

If we define $H_i$ as:

$$H_i = -\frac{1}{16} \sum_{j: s_{ij} \in S} U_{j*}^T U_{j*} - \frac{1}{16} \sum_{j: s_{ji} \in S} U_{j*}^T U_{j*} - \frac{1}{\beta} I, \tag{3}$$

we can prove that

$$\frac{\partial^2 L}{\partial U_{i*}^T \partial U_{i*}} \succeq H_i,$$

where $A \succeq B$ means $A - B$ is a positive semi-definite matrix.
Then we can construct the lower bound of $L(U_{i*})$, denoted by $\widetilde{L}(U_{i*})$, as:

$$\widetilde{L}(U_{i*}) = L(U_{i*}(t)) + (U_{i*} - U_{i*}(t)) \frac{\partial L}{\partial U_{i*}^T}(t) + \frac{1}{2} (U_{i*} - U_{i*}(t)) H_i(t) (U_{i*} - U_{i*}(t))^T.$$

The values of $U$ and other parameters that depend on $U$ change through the updating process. Here we use the notation $x(t)$ to denote the value of a parameter $x$ at iteration $t$. We update $U_{i*}$ to be the value that maximizes $\widetilde{L}(U_{i*})$. It is easy to see that $\widetilde{L}(U_{i*})$ is a quadratic form in the variable $U_{i*}$, and since $H_i$ is negative definite it is concave. Hence, we can find the optimum of $U_{i*}$ by setting the gradient of $\widetilde{L}(U_{i*})$ with respect to $U_{i*}$ to $0$. As a result, we end up with the following update rule for $U_{i*}$:

$$U_{i*}(t+1) = U_{i*}(t) - \left[ \frac{\partial L}{\partial U_{i*}^T}(t) \right]^T H_i(t)^{-1}. \tag{4}$$

We can then update the other rows of $U$ iteratively using the above rule.

The convergence of the iterative updating process is controlled by a threshold value $\varepsilon$ and the maximum allowed number of iterations $T$, both of which are hyper-parameters. During each iteration, we update $U$ by updating each of its rows sequentially. The initial value of $U$ can be obtained through PCA on the feature space $X$. The pseudocode of the updating process is shown in Algorithm 1.

2.3.1 Rounding

After the optimal $U$ is learned, we can obtain the final binary codes $B$ using some rounding technique. In this paper, to keep our method simple, the real values of $U$ are quantized into the binary codes $B$ by taking their signs, that is:

$$B_{ij} = \begin{cases} 1, & U_{ij} > 0 \\ -1, & \text{otherwise.} \end{cases}$$
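The row-wise update (4) and the sign-based rounding of Section 2.3.1 can be sketched as below. This is an illustrative skeleton, not the paper's Algorithm 1: `grad_and_H` is a hypothetical callback returning the gradient of $L$ with respect to $U_{i*}$ and the bound matrix $H_i$ of (3), and `eps` and `T` play the roles of the convergence threshold $\varepsilon$ and the iteration cap $T$ described in the text.

```python
import numpy as np

def update_rows(U, grad_and_H, eps=1e-4, T=50):
    """Sequentially apply update rule (4) to each row of U until the
    largest per-row change falls below eps or T iterations elapse."""
    U = U.copy()
    for t in range(T):
        max_move = 0.0
        for i in range(U.shape[0]):       # update each row in turn
            g, H = grad_and_H(U, i)
            step = np.linalg.solve(H, g)  # H_i^{-1} g without an explicit inverse
            U[i] = U[i] - step            # update rule (4)
            max_move = max(max_move, np.abs(step).max())
        if max_move < eps:                # converged: updates are negligible
            break
    return U

def round_codes(U):
    # B_ij = 1 if U_ij > 0, else -1 (sign-based rounding of Sec. 2.3.1)
    return np.where(U > 0, 1, -1)
```

Solving the linear system $H_i\,x = g$ is preferred over forming $H_i^{-1}$ explicitly, since $H_i$ is a small $Q \times Q$ matrix and the solve is both cheaper and numerically more stable.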