Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring. WWW '21, April 19-23, 2021, Ljubljana, Slovenia.

Algorithm 1: The Training Flow of TGOD
Input: The whole set of essays X, one-shot labeled data D_o.
Output: An optimized student model.
Run the Teacher Model:
    Construct multiple graphs G* = {G_1, G_2, ..., G_B} on X.
    for each G_b ∈ G* do
        Apply the label propagation algorithm on G_b as in Eq. 5.
    end for
    Generate pseudo labels by label guessing as in Eq. 6 and 7.
Train the Student Model by Ordinal Distillation:
    Select the pseudo labels with high confidence by Eq. 13 and 14.
    Smooth the selected labels as in Eq. 15.
    Split the selected essays into a training set D_t and a validation set D_v.
    for iter = 1, ..., MaxIter do
        Optimize the student model on D_t by minimizing Eq. 16.
        Validate the student model on D_v.
    end for
    return the student model with the best performance on D_v.

where k ∈ Y and τ is a parameter used to control the variance of the distribution.

3.4.3 Unimodal Distillation. Since the one-shot labeled data D_o is not sufficient to train a neural network, we use the pseudo labels produced by the teacher model as a supplement to train the student model. Specifically, we train the student model by matching the output label distribution of the student model q̂(x_i) = Ŷ_i and the unimodal smoothed pseudo label of the teacher model q′(x_i) via a KL-divergence loss:

    L_OD = Σ_{x_i ∈ X_s} D_KL( q̂(x_i) || q′(x_i) ),    (16)

where X_s denotes the set of essays from either the one-shot data or the selected essays after label selection.

3.5 Training Flow of TGOD
In summary, there are two steps in TGOD to train the Student Model under the one-shot setting: first, generating pseudo labels for the unlabeled essays by running the Teacher Model, and then training the Student Model by Ordinal Distillation. The whole training flow of TGOD is illustrated in Figure 1 and Alg. 1.
In particular, considering that model selection is difficult to implement under the one-shot supervised setting, we design a model selection strategy based on pseudo labels, which validates the model on a subset of pseudo labels.

4 EXPERIMENTS
In this section, we first introduce the dataset and evaluation metric. Then we describe the experimental settings, the implementation details, and the performance comparison. Finally, we conduct an ablation study and model analysis to investigate the effectiveness of our proposed approach.

4.1 Dataset and Evaluation Metric
We conduct experiments on the public ASAP (Automated Student Assessment Prize) dataset [1], a widely used benchmark for the task of automated essay scoring. ASAP contains eight sets of essays corresponding to eight different prompts, with a total of 12,978 scored essays. These eight essay sets vary in essay number, genre, and score range, the details of which are listed in Table 1.

Table 1: Statistics of the ASAP datasets. For the Genre column, ARG denotes argumentative essays, RES denotes response essays, and NAR denotes narrative essays. The last column lists the score ranges.

Prompt  #Essay  Genre  Avg Len  Range
1       1,783   ARG    350      2-12
2       1,800   ARG    350      1-6
3       1,726   RES    150      0-3
4       1,772   RES    150      0-3
5       1,805   RES    150      0-4
6       1,800   RES    150      0-4
7       1,569   NAR    250      0-30
8         723   NAR    650      0-60

To evaluate the performance of AES methods, we employ the quadratic weighted kappa (QWK), the official metric of the ASAP dataset. For each set of essays with possible scores Y = {1, 2, ..., K}, the QWK measures the agreement between the automatically predicted scores (Rater A) and the resolved human scores (Rater B) as follows:

    κ = 1 − ( Σ_{i,j} w_{i,j} O_{i,j} ) / ( Σ_{i,j} w_{i,j} E_{i,j} ),    (17)

where w_{i,j} = (i − j)² / (K − 1)² is calculated from the difference between the raters' scores, O is a K-by-K histogram matrix whose entry O_{i,j} is the number of essays that received a score i from Rater A and a score j from Rater B, and E is the normalized outer product of the two raters' score histogram vectors.

4.2 Experimental Settings
For the 'one-shot' setting, we conduct experiments by randomly sampling the one-shot labeled data to train the model, and we test the model on the remaining unlabeled essays. To reduce randomness, in each case we repeat the sampling of one-shot labeled data 20 times and report the average results. For our proposed framework, we perform model selection based on the pseudo validation set.
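As a concrete reference, Eq. 17 can be implemented directly from the definitions above. This is a minimal sketch, assuming the K possible scores have first been mapped to integer indices 0, ..., K−1 (the paper writes Y = {1, 2, ..., K}; the index shift does not change κ because w depends only on score differences):

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, k):
    """QWK (Eq. 17) between two integer score vectors with scores in {0, ..., k-1}."""
    # O: observed K-by-K histogram matrix; O[i, j] counts essays scored i by A and j by B.
    o = np.zeros((k, k))
    for a, b in zip(rater_a, rater_b):
        o[a, b] += 1
    # E: normalized outer product of the two raters' score histograms,
    # scaled so that E and O have the same total count.
    hist_a = np.bincount(rater_a, minlength=k)
    hist_b = np.bincount(rater_b, minlength=k)
    e = np.outer(hist_a, hist_b).astype(float)
    e *= o.sum() / e.sum()
    # w[i, j] = (i - j)^2 / (k - 1)^2, the quadratic disagreement weight.
    i, j = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    w = (i - j) ** 2 / (k - 1) ** 2
    return 1.0 - (w * o).sum() / (w * e).sum()
```

Perfect agreement yields κ = 1, chance-level agreement yields κ ≈ 0, and systematic disagreement yields negative values.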
For the other baseline methods, since the one-shot labeled data is used for training and no extra labeled data is available as a validation set for model selection, we report their best performance on the test set as an upper-bound performance for comparison.
For the 'one-shot + history prompt' setting, we combine the one-shot labeled data with the labeled data of a history prompt with a similar score range (e.g., P1 → P2, P2 → P1, P3 → P4, P4 → P3, and

[1] https://www.kaggle.com/c/asap-aes/data
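For reference, the ordinal-distillation objective of Eq. 16 in Section 3.4.3 can be sketched as follows. Note that Eq. 15 (the unimodal smoothing itself) is not reproduced in this excerpt, so `unimodal_smooth` below uses one plausible form, a softmax over the negative absolute distance to the pseudo label, chosen only to be consistent with the description that τ controls the variance of the distribution; the paper's exact smoothing may differ:

```python
import numpy as np

def unimodal_smooth(y, k, tau=1.0):
    """A hypothetical unimodal smoothing of a pseudo label y in {0, ..., k-1}:
    softmax over -|score - y| / tau, so mass peaks at y and tau controls the
    spread. Stands in for Eq. 15, which is not shown in this excerpt."""
    scores = -np.abs(np.arange(k) - y) / tau
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()

def ordinal_distillation_loss(student_probs, targets):
    """Eq. 16: sum over selected essays of D_KL(student || smoothed teacher target).
    Both inputs are arrays of shape (num_essays, k) with strictly positive rows."""
    s = np.asarray(student_probs)
    t = np.asarray(targets)
    return float(np.sum(s * (np.log(s) - np.log(t))))
```

Because the smoothed target assigns nonzero probability to every score, the KL term stays finite even when the student distribution spreads mass across neighboring scores, which is the point of distilling against a unimodal target rather than a one-hot pseudo label.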