of the input essay; ordinal classifier that predicts a unimodal label distribution on the pre-defined scores for each input essay.

3.3.1 Essay Encoder. We employ a neural network $f_\phi(\cdot)$ to extract features of an input $x_i$, where $f_\phi(x_i; \phi)$ refers to the essay embedding and $\phi$ indicates the parameters of the network. This module is not limited to a specific architecture and can be any of various existing AES encoders. To demonstrate the universality of our framework and provide fairer comparisons in the experiments, we adopt the encoders used in recent work (e.g., CNN-LSTM-Att [9], HA-LSTM [5], BERT [5]).

3.3.2 Unimodal Ordinal Classifier. Unlike previous neural-network-based AES models, which predict the score of the input essay by using a regression layer (i.e., a one-unit layer), we view essay scoring as an ordinal classification problem and adopt an ordinal classifier [3] for prediction.

To capture the ordinal relationship among classes, a unimodal probability distribution (i.e., a distribution that has a peak at class $k$ and decreases as the class moves away from $k$) is usually used to restrict the shape of the predicted label distributions. According to previous studies [3, 22], some special exponential functions, as well as the probability mass function (PMF) of either the Poisson or the binomial distribution, can be used to enforce a discrete unimodal probability distribution.

In our framework, we choose an extension of the binomial distribution, the Conway–Maxwell binomial (CMB) distribution [16], as the base distribution, and employ the PMF of the CMB to generate the predicted unimodal probability distribution of essay $x_i$:

$$P(y_i = k) = \frac{1}{S(p, \nu)} \binom{K-1}{k-1}^{\nu} p^{k-1} (1-p)^{K-k}, \tag{8}$$

where

$$S(p, \nu) = \sum_{k=1}^{K} \binom{K-1}{k-1}^{\nu} p^{k-1} (1-p)^{K-k}. \tag{9}$$

Here $k \in \mathcal{Y} = \{1, 2, \ldots, K\}$, $0 \le p \le 1$, and $-\infty \le \nu \le \infty$. The parameter $\nu$ can be used to control the variance of the distribution; the case $\nu = 1$ is the usual binomial distribution.
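To make the shape of this base distribution concrete, here is a minimal NumPy sketch of Eqs. (8)–(9); the values K = 8, p = 0.6, and ν = 1.5 are illustrative choices, not taken from the paper.

```python
# Minimal sketch of the CMB PMF in Eqs. (8)-(9); K, p, and nu are illustrative values.
import numpy as np
from scipy.special import comb

def cmb_pmf(K: int, p: float, nu: float) -> np.ndarray:
    """Return P(y = k) for k = 1..K under the Conway-Maxwell binomial distribution."""
    k = np.arange(1, K + 1)
    unnorm = comb(K - 1, k - 1) ** nu * p ** (k - 1) * (1 - p) ** (K - k)
    return unnorm / unnorm.sum()  # normalization by S(p, nu) in Eq. (9)

dist = cmb_pmf(K=8, p=0.6, nu=1.5)
print(dist.round(3))           # unimodal: rises to a single peak, then falls
print(int(dist.argmax()) + 1)  # mode of the predicted score distribution
```

Varying ν while holding p fixed changes how concentrated the distribution is around its mode, consistent with the paper's note that ν controls the variance.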
More specifically, we now describe the neural network architecture of the employed ordinal classifier based on the PMF of the CMB. As shown in Figure 1, the essay encoder is followed by a linear layer which transforms the essay embedding into a number $\nu \in \mathbb{R}$ and a probability $p \in [0, 1]$ (by using a sigmoid activation function). The linear layer is then followed by a 'copy expansion' layer which expands the probability $p$ into $K$ probabilities corresponding to the $K$ distinct scores, that is, $p_{k=1} = p_{k=2} = \cdots = p_{k=K}$. The following layer then applies the 'Log CMB PMF' transformation on these probabilities with different $k$:

$$\mathrm{LCP}(k; \nu, p) = \nu \log \binom{K-1}{k-1} + (k-1) \log p + (K-k) \log(1-p), \tag{10}$$

where the log operation is used to address numeric stability issues. Finally, a softmax layer is applied on the logits $\mathrm{LCP}(k; \nu, p)$ to produce a unimodal probability distribution $\hat{Y}_i$ for essay $x_i$:

$$\hat{Y}_{ik} = \frac{e^{\mathrm{LCP}(k; \nu, p)}}{\sum_{k'=1}^{K} e^{\mathrm{LCP}(k'; \nu, p)}}, \tag{11}$$

where $\hat{Y}_{ik}$ denotes the $k$-th element of $\hat{Y}_i$. Based on $\hat{Y}_i$, the final predicted label $\hat{y}_i$ of essay $x_i$ can be obtained by:

$$\hat{y}_i = \arg\max_{1 \le k \le K} \hat{Y}_{ik}. \tag{12}$$
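As one possible realization of this head, the following PyTorch sketch maps an essay embedding to $(\nu, p)$ and then applies Eqs. (10)–(11); the class name, layer shapes, and epsilon guard are hypothetical, not the authors' released code.

```python
import torch
import torch.nn as nn

class CMBOrdinalHead(nn.Module):
    """Sketch: essay embedding -> (nu, p) -> Log CMB PMF logits (Eq. 10) -> softmax (Eq. 11)."""
    def __init__(self, emb_dim: int, K: int):
        super().__init__()
        self.K = K
        self.linear = nn.Linear(emb_dim, 2)  # one unit for nu, one for p
        k = torch.arange(1, K + 1, dtype=torch.float32)
        # Precompute log C(K-1, k-1) for k = 1..K via log-gamma.
        self.register_buffer("log_binom",
                             torch.lgamma(torch.tensor(float(K)))
                             - torch.lgamma(k) - torch.lgamma(K - k + 1))
        self.register_buffer("k", k)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, emb_dim)
        out = self.linear(h)
        nu = out[:, 0:1]                 # nu in R
        p = torch.sigmoid(out[:, 1:2])   # p in [0, 1]; broadcasting plays the 'copy expansion' role
        eps = 1e-12                      # guard against log(0)
        lcp = (nu * self.log_binom
               + (self.k - 1) * torch.log(p + eps)
               + (self.K - self.k) * torch.log(1 - p + eps))  # Eq. (10)
        return torch.softmax(lcp, dim=-1)  # Eq. (11): unimodal distribution over K scores

# Prediction (Eq. 12): scores are 1-indexed, so add 1 to the argmax:
# y_hat = head(embeddings).argmax(dim=-1) + 1
```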
3.4 Ordinal Distillation

We introduce the Ordinal Distillation illustrated in Figure 1, which distills the pseudo-label knowledge of the Teacher Model into the Student Model. It consists of three main steps: label selection, which selects high-confidence pseudo-labels for later distillation; unimodal smoothing, which enforces the label distribution of each pseudo-label to be a unimodal probability distribution; and unimodal distillation, which minimizes the KL divergence between the predicted label distribution of the Student Model and the unimodal smoothed label distribution of the Teacher Model.

3.4.1 Label Selection. Considering that only one-shot labeled data is available for label propagation, the pseudo labels generated by the Teacher Model may be noisy. Therefore, we propose a label selection strategy to select a subset of pseudo labels with high confidence.

Specifically, for each distinct score $k \in \mathcal{Y}$, we first collect all corresponding pseudo labels, that is, $C_k = \{y'_i \mid y'_i = k, x_i \in X_u\}$, and then rank the pseudo labels in $C_k$ according to their confidence. We measure the confidence of a pseudo label $y'_i$ by the negative Shannon entropy of its corresponding label distribution (Eq. 13), so that a peaked distribution tends to get a high confidence:

$$\mathrm{Confidence}(y'_i) = -H(Y'_i) = \sum_{j=1}^{K} Y'_{ij} \log_2 Y'_{ij}. \tag{13}$$

After that, we select the top $m_k$ pseudo labels with the highest confidence from $C_k$, where

$$m_k = \min\left(|C_k|, \max(a, |C_k| \times \gamma)\right), \tag{14}$$

and the threshold ratio $\gamma$ and the threshold number $a$ are set to ensure that a sufficient number of pseudo labels are selected while avoiding a serious class imbalance problem.
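The selection step can be sketched in NumPy as follows, assuming the Teacher Model's label distributions are given as an N × K array; all names (Y_teacher, a, gamma) are illustrative.

```python
import numpy as np

def select_pseudo_labels(Y_teacher: np.ndarray, a: int, gamma: float) -> np.ndarray:
    """Select high-confidence pseudo labels per score class (Eqs. 13-14).

    Y_teacher: (N, K) label distributions predicted by the Teacher Model.
    Returns indices of the selected unlabeled essays.
    """
    K = Y_teacher.shape[1]
    pseudo = Y_teacher.argmax(axis=1) + 1  # pseudo label y'_i, 1-indexed
    # Eq. (13): confidence = negative Shannon entropy of the label distribution.
    conf = np.sum(Y_teacher * np.log2(Y_teacher + 1e-12), axis=1)
    selected = []
    for k in range(1, K + 1):
        C_k = np.where(pseudo == k)[0]                      # essays pseudo-labeled k
        m_k = int(min(len(C_k), max(a, len(C_k) * gamma)))  # Eq. (14)
        top = C_k[np.argsort(-conf[C_k])[:m_k]]             # m_k most confident
        selected.extend(top.tolist())
    return np.array(selected)
```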
3.4.2 Unimodal Smoothing. Previous studies on knowledge distillation [13, 49] have shown that a soft or smoothed probability distribution from the teacher model is more suitable for knowledge distillation than a one-hot probability distribution. Considering that essay scoring is an ordinal classification problem and an essay is more likely to be mispredicted as a score close to the ground-truth score, we enforce the distribution of pseudo labels produced by the teacher model to be a unimodal smoothed probability distribution.

As mentioned before, some special exponential functions [22] can be used to enforce a discrete unimodal probability distribution. Therefore, we employ an exponential function to perform the unimodal smoothing on both one-shot labels and pseudo labels:

$$q'(y_i = k \mid x_i) = \begin{cases} \dfrac{\exp(-|k - y_i| / \tau)}{\sum_{j=1}^{K} \exp(-|j - y_i| / \tau)}, & x_i \in X_o \\[1ex] \dfrac{\exp(-|k - y'_i| / \tau)}{\sum_{j=1}^{K} \exp(-|j - y'_i| / \tau)}, & x_i \in X_u \end{cases} \tag{15}$$
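For illustration, here is a minimal NumPy sketch of the exponential smoothing in Eq. (15); the temperature value τ = 1.0 below is an arbitrary choice, not one reported by the paper.

```python
import numpy as np

def unimodal_smooth(label: int, K: int, tau: float = 1.0) -> np.ndarray:
    """Eq. (15): turn a hard (one-shot or pseudo) label into a unimodal
    distribution that decays exponentially with distance from the label."""
    k = np.arange(1, K + 1)
    weights = np.exp(-np.abs(k - label) / tau)
    return weights / weights.sum()

print(unimodal_smooth(label=5, K=8, tau=1.0).round(3))  # peaks at k = 5, decays both ways
```

The resulting smoothed teacher distribution is what the Student Model is trained to match under the KL-divergence objective of the unimodal distillation step.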