Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring Zhiwei Jiang Meng Liu' Yafeng Yin jzw@nju.edu.cn mf1933061@smail.nju.edu.cn yafeng@nju.edu.cn State Key Laboratory for Novel State Key Laboratory for Novel State Key Laboratory for Novel Software Technology,Nanjing Software Technology,Nanjing Software Technology,Nanjing University University University Nanjing,China Nanjing,China Nanjing,China Hua Yu Zifeng Cheng Qing Gu huayu.yh@smail.nju.edu.cn chengzf@smail.nju.edu.cn guq@nju.edu.cn State Key Laboratory for Novel State Key Laboratory for Novel State Key Laboratory for Novel Software Technology,Nanjing Software Technology,Nanjing Software Technology,Nanjing University University University Nanjing,China Nanjing,China Nanjing,China ABSTRACT KEYWORDS One-shot automated essay scoring (AES)aims to assign scores Essay Scoring,One-Shot,Graph Propagation,Ordinal Distillation to a set of essays written specific to a certain prompt,with only ACM Reference Format: one manually scored essay per distinct score.Compared to the Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu previous-studied prompt-specific AES which usually requires a 2021.Learning from Graph Propagation via Ordinal Distillation for One- large number of manually scored essays for model training (e.g.. Shot Automated Essay Scoring.In Proceedings of the Web Conference 2021 about 600 manually scored essays out of totally 1000 essays),one- (WWW '21),April 19-23,2021,Ljubljana,Slovenia.ACM,New York,NY. shot AES can greatly reduce the workload of manual scoring.In this USA,10 pages.https:/doi.org/10.1145/3442381.3450017 paper,we propose a Transductive Graph-based Ordinal Distillation (TGOD)framework to tackle the task of one-shot AES.Specifically, 1 INTRODUCTION we design a transductive graph-based model as a teacher model to generate pseudo labels of unlabeled essays based on the one-shot Automated Essay Scoring(AES)aims to summarize the quality of labeled essays.Then,we distill the knowledge in the teacher model a student essay with a score or grade based on the factors such into a neural student model by learning from the high confidence as grammaticality,organization,and coherence.It is commercially pseudo labels.Different from the general knowledge distillation, valuable to be able to automate the scoring of millions of essays.In we propose an ordinal-aware unimodal distillation which makes a fact,AES has been developed and deployed in large-scale standard- ized tests such as TOEFL,GMAT,and GRE [2].Besides evaluating unimodal distribution constraint on the output of student model, the quality of essays,as an evaluation technique of text quality,AES to tolerate the minor errors existed in pseudo labels.Experimental results on the public dataset ASAP show that TGOD can improve can also be used conveniently to evaluate the quality of various the performance of existing neural AES models under the one-shot Web texts(e.g.,news,responses,and posts). Research on automated essay scoring has spanned the last 50 AES setting and achieve an acceptable average OWK of 0.69. years [25],and still continues to draw a lot of attention in the natu- CCS CONCEPTS ral language processing community [17].Traditional AES methods mainly rely on various handcrafted-features and score essays based Computing methodologies->Natural language processing: on regression methods [2,19,26,32,48].Recently,with the de- Information systems-Clustering and classification. velopment of deep learning technology,many models based on LSTM and CNN have been proposed [7,8,10,39,41].These models "Both authors contributed equally to this research. can automatically learn the features of essays and achieve better Corresponding author. performance than traditional methods. However,to train an effective neural AES model,it often needs a large number of manually scored essays for model training(e.g., This paper is published under the Creative Commons Attribution 40 International about 600 manually scored essays out of totally 1000 essays in a (CC-BY 4.0)license.Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. test),which is labor intensive.This limits its application in some WWW '21,April 19-23,2021,Ljubljana,Slovenia real-world scenarios.To this end,some recent work considers using 2021 IW3C2 (International World Wide Web Conference Committee),published the scored essays under other prompts (ie.,topic of writing essay) under Creative Commons CC-BY 4.0 License. ACM ISBN978-1-4503-8312-7/21/04. to alleviate the burden of manual scoring under target prompt.But https:/doi.org/10.1145/3442381.3450017 due to the difference among prompts such as genre,score range, 2347
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring Zhiwei Jiang∗† jzw@nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China Meng Liu∗ mf1933061@smail.nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China Yafeng Yin yafeng@nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China Hua Yu huayu.yh@smail.nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China Zifeng Cheng chengzf@smail.nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China Qing Gu guq@nju.edu.cn State Key Laboratory for Novel Software Technology, Nanjing University Nanjing, China ABSTRACT One-shot automated essay scoring (AES) aims to assign scores to a set of essays written specific to a certain prompt, with only one manually scored essay per distinct score. Compared to the previous-studied prompt-specific AES which usually requires a large number of manually scored essays for model training (e.g., about 600 manually scored essays out of totally 1000 essays), oneshot AES can greatly reduce the workload of manual scoring. In this paper, we propose a Transductive Graph-based Ordinal Distillation (TGOD) framework to tackle the task of one-shot AES. Specifically, we design a transductive graph-based model as a teacher model to generate pseudo labels of unlabeled essays based on the one-shot labeled essays. Then, we distill the knowledge in the teacher model into a neural student model by learning from the high confidence pseudo labels. Different from the general knowledge distillation, we propose an ordinal-aware unimodal distillation which makes a unimodal distribution constraint on the output of student model, to tolerate the minor errors existed in pseudo labels. Experimental results on the public dataset ASAP show that TGOD can improve the performance of existing neural AES models under the one-shot AES setting and achieve an acceptable average QWK of 0.69. CCS CONCEPTS • Computing methodologies → Natural language processing; • Information systems → Clustering and classification. ∗Both authors contributed equally to this research. †Corresponding author. This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution. WWW ’21, April 19–23, 2021, Ljubljana, Slovenia © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3450017 KEYWORDS Essay Scoring, One-Shot, Graph Propagation, Ordinal Distillation ACM Reference Format: Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu. 2021. Learning from Graph Propagation via Ordinal Distillation for OneShot Automated Essay Scoring. In Proceedings of the Web Conference 2021 (WWW ’21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3442381.3450017 1 INTRODUCTION Automated Essay Scoring (AES) aims to summarize the quality of a student essay with a score or grade based on the factors such as grammaticality, organization, and coherence. It is commercially valuable to be able to automate the scoring of millions of essays. In fact, AES has been developed and deployed in large-scale standardized tests such as TOEFL, GMAT, and GRE [2]. Besides evaluating the quality of essays, as an evaluation technique of text quality, AES can also be used conveniently to evaluate the quality of various Web texts (e.g., news, responses, and posts). Research on automated essay scoring has spanned the last 50 years [25], and still continues to draw a lot of attention in the natural language processing community [17]. Traditional AES methods mainly rely on various handcrafted-features and score essays based on regression methods [2, 19, 26, 32, 48]. Recently, with the development of deep learning technology, many models based on LSTM and CNN have been proposed [7, 8, 10, 39, 41]. These models can automatically learn the features of essays and achieve better performance than traditional methods. However, to train an effective neural AES model, it often needs a large number of manually scored essays for model training (e.g., about 600 manually scored essays out of totally 1000 essays in a test), which is labor intensive. This limits its application in some real-world scenarios. To this end, some recent work considers using the scored essays under other prompts (i.e., topic of writing essay) to alleviate the burden of manual scoring under target prompt. But due to the difference among prompts such as genre, score range, 2347
WWW '21,April 19-23,2021,Ljubljana,Slovenia Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu topic,and difficulty,these cross-prompt methods often perform of essays written to a certain prompt,y={1,2.....K)denote a set worse than the prompt-specific methods [9].Tackling the domain of pre-defined scores (labels)at ordinal scale,and (x,y)denote an adaptation among prompts is a challenging problem and there are essay and its ground-truth score(label)respectively.For one-shot some recent studies focusing on this line of work [5,14]. AES,we assume that we are given a set of one-shot labeled data In this paper,we consider another way without using data from D。={ci,班=i}where the set。={,yh)eDo}is other prompts.Given a set of essays towards a target prompt,we a subset of (i.e.,No E x),and the essay x e Xo with y i is consider if we can score all essays only based on a few manually the one-shot essay for the distinct score (label)i e y.Apart from scored essays.Extremely,we consider the one-shot scenario,that the one-shot labeled essays o,the rest essays in X constitute the is,only one manually scored essay per distinct score is given.In unlabeled essay set Xu ={xi,and thus Xu UXo =X.The goal practical writing tests,scoring staff usually evaluates the essays of one-shot AES is to learn a function F to predict the scores(labels) by first designing a criteria specific to the current test and then of the unlabelled essays x e Xu,based on the one-shot labeled data applying the criteria for essays scoring.To alleviate the burden Do and essay set by of scoring staff,we expect to firstly let the scoring staff express the criteria by one-shot manual scoring,and then use a specially- =F(x;Dox) (1) designed AES model to scoring the rest essays based on the one-shot data Typical AES approaches based on supervised learning would One-shot AES is a challenging task,since the one-shot labeled remove X and replace Do with a statistic "=0"(Do)in Eq.1. data is insufficient to train an effective neural AES model.To solve since they can usually learn a sufficient statistic 0*for prediction this problem,our intuition is whether we can augment the one- po(ylx)only based on labeled data Do.However,the one-shot shot labeled data with some pseudo labeled data,and then perform setting is never the case,since only few labeled data is given in Do, model training on the augmented labeled data.Obviously,there which is insufficient to train a statistic with good generalization. are two challenges:one is how to acquire the pseudo labeled data. We therefore exploit both the one-shot labeled data Do and the and the other is how to alleviate the disturbance brought by error unlabeled essays XuEX to learn the prediction function F,and pseudo labels during model training. thus adopt the more general form of F in Eq.1. To this end,we propose a Transductive Graph-based Ordinal Distillation (TGOD)framework for one-shot automated essay scor- ing,which is designed based on a teacher-student mechanism(i.e.. 3 THE TGOD FRAMEWORK knowledge distillation)[13].Specifically,we employ a transduc- In this section,we introduce the proposed TGOD framework,fol- tive graph-based model [52,53]as the teacher model to generate lowed by its technical details. pseudo labels,and then train the neural AES model(student model) by combining the pseudo labels and one-shot labels.Considering 3.1 An Overview of TGOD that there may exist many error labels among the pseudo labels, we select the pseudo labels with high confidence to improve the TGOD is designed based on the teacher-student mechanism.It quality of pseudo labels.Besides,considering that the score is at can enable a supervised neural student model to benefit from a semi-supervised teacher model under the one-shot essay scoring ordinal scale and an essay is easily to be assigned a score near its setting.While the one-shot labeled data is insufficient to train the ground-truth score (e.g.,3 is easily to be predicted as 2 or 4),we supervised neural student model,the student model can be trained proposed an ordinal-aware unimodal distillation strategy to tolerate by distilling the knowledge of the semi-supervised teacher model some pseudo labels with minor errors. on the unlabeled essays.Through a specially-designed ordinal dis The major contributions of this paper are summarized as follows: tillation strategy,the supervised neural student model can even For the one-shot automated essay scoring,we propose a outperform the semi-supervised teacher model. distillation framework based on graph propagation,which Specifically,as shown in Figure 1,TGOD contains three main alleviates the requirement of supervised neural AES model components:the Teacher Model which exploits the manifold struc- on labeled data by utilizing unsupervised data. ture among labeled and unlabeled essays based on graphs and We propose the label selection and the ordinal-aware uni- generates pseudo labels of unlabeled essays for distillation;the Stu- modal distillation strategies to alleviate the effect of error dent Model which tackles the essay scoring problem as an ordinal pseudo labels on the final AES model. classification problem and makes a unimodal distribution predic- The TGOD framework has no limitation on the architecture tion for essays;the Ordinal Distillation which distills the unimodal of student model,thus can be applied to many existing neu- smoothed Teacher Model's outputs into the Student Model.In the ral AES models.Experimental results on the public dataset following.we introduce these components of TGOD with technical demonstrate that our framework can effectively improve the details. performance of several classical neural AES models under the one-shot AES setting. 3.2 Graph-Based Label Propagation(Teacher) We introduce the Teacher Model illustrated in Figure 1,which is a 2 PROBLEM DEFINITION graph-based label propagation model and consists of three com- We first introduce some notation and formalize the one-shot auto- ponents:multiple graph construction that models the relationship mated essay scoring (AES)problem.Let=denote a set among essays from multiple aspects;label propagation that spreads 2348
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu topic, and difficulty, these cross-prompt methods often perform worse than the prompt-specific methods [9]. Tackling the domain adaptation among prompts is a challenging problem and there are some recent studies focusing on this line of work [5, 14]. In this paper, we consider another way without using data from other prompts. Given a set of essays towards a target prompt, we consider if we can score all essays only based on a few manually scored essays. Extremely, we consider the one-shot scenario, that is, only one manually scored essay per distinct score is given. In practical writing tests, scoring staff usually evaluates the essays by first designing a criteria specific to the current test and then applying the criteria for essays scoring. To alleviate the burden of scoring staff, we expect to firstly let the scoring staff express the criteria by one-shot manual scoring, and then use a speciallydesigned AES model to scoring the rest essays based on the one-shot data. One-shot AES is a challenging task, since the one-shot labeled data is insufficient to train an effective neural AES model. To solve this problem, our intuition is whether we can augment the oneshot labeled data with some pseudo labeled data, and then perform model training on the augmented labeled data. Obviously, there are two challenges: one is how to acquire the pseudo labeled data, and the other is how to alleviate the disturbance brought by error pseudo labels during model training. To this end, we propose a Transductive Graph-based Ordinal Distillation (TGOD) framework for one-shot automated essay scoring, which is designed based on a teacher-student mechanism (i.e., knowledge distillation) [13]. Specifically, we employ a transductive graph-based model [52, 53] as the teacher model to generate pseudo labels, and then train the neural AES model (student model) by combining the pseudo labels and one-shot labels. Considering that there may exist many error labels among the pseudo labels, we select the pseudo labels with high confidence to improve the quality of pseudo labels. Besides, considering that the score is at ordinal scale and an essay is easily to be assigned a score near its ground-truth score (e.g., 3 is easily to be predicted as 2 or 4), we proposed an ordinal-aware unimodal distillation strategy to tolerate some pseudo labels with minor errors. The major contributions of this paper are summarized as follows: • For the one-shot automated essay scoring, we propose a distillation framework based on graph propagation, which alleviates the requirement of supervised neural AES model on labeled data by utilizing unsupervised data. • We propose the label selection and the ordinal-aware unimodal distillation strategies to alleviate the effect of error pseudo labels on the final AES model. • The TGOD framework has no limitation on the architecture of student model, thus can be applied to many existing neural AES models. Experimental results on the public dataset demonstrate that our framework can effectively improve the performance of several classical neural AES models under the one-shot AES setting. 2 PROBLEM DEFINITION We first introduce some notation and formalize the one-shot automated essay scoring (AES) problem. Let X = {xi } N i=1 denote a set of essays written to a certain prompt, Y = {1, 2, ...,K} denote a set of pre-defined scores (labels) at ordinal scale, and (x,y) denote an essay and its ground-truth score (label) respectively. For one-shot AES, we assume that we are given a set of one-shot labeled data Do = {(xi ,yi = i)}K i=1 , where the set Xo = {xi |(xi ,yi) ∈ Do } is a subset of X (i.e., Xo ∈ X), and the essay x ∈ Xo with y = i is the one-shot essay for the distinct score (label) i ∈ Y. Apart from the one-shot labeled essays Xo , the rest essays in X constitute the unlabeled essay set Xu = {xi } Nu i=1 , and thus Xu ∪ Xo = X. The goal of one-shot AES is to learn a function F to predict the scores (labels) of the unlabelled essays x ∈ Xu , based on the one-shot labeled data Do and essay set X, by yˆ = F (x; Do, X). (1) Typical AES approaches based on supervised learning would remove X and replace Do with a statistic θ ∗ = θ ∗ (Do ) in Eq. 1, since they can usually learn a sufficient statistic θ ∗ for prediction pθ ∗ (y|x) only based on labeled data Do . However, the one-shot setting is never the case, since only few labeled data is given in Do , which is insufficient to train a statistic θ ∗ with good generalization. We therefore exploit both the one-shot labeled data Do and the unlabeled essays Xu ∈ X to learn the prediction function F , and thus adopt the more general form of F in Eq. 1. 3 THE TGOD FRAMEWORK In this section, we introduce the proposed TGOD framework, followed by its technical details. 3.1 An Overview of TGOD TGOD is designed based on the teacher-student mechanism. It can enable a supervised neural student model to benefit from a semi-supervised teacher model under the one-shot essay scoring setting. While the one-shot labeled data is insufficient to train the supervised neural student model, the student model can be trained by distilling the knowledge of the semi-supervised teacher model on the unlabeled essays. Through a specially-designed ordinal distillation strategy, the supervised neural student model can even outperform the semi-supervised teacher model. Specifically, as shown in Figure 1, TGOD contains three main components: the Teacher Model which exploits the manifold structure among labeled and unlabeled essays based on graphs and generates pseudo labels of unlabeled essays for distillation; the Student Model which tackles the essay scoring problem as an ordinal classification problem and makes a unimodal distribution prediction for essays; the Ordinal Distillation which distills the unimodal smoothed Teacher Model’s outputs into the Student Model. In the following, we introduce these components of TGOD with technical details. 3.2 Graph-Based Label Propagation (Teacher) We introduce the Teacher Model illustrated in Figure 1, which is a graph-based label propagation model and consists of three components: multiple graph construction that models the relationship among essays from multiple aspects; label propagation that spreads 2348
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW'21,April 19-23,2021,Ljubljana,Slovenia -+k-1h Student Model m-P Essay One-Shot Labeled Essays Ordinal Distillation ■△◆鱼 Leg CMB PMF Unimodal Ordinal Classifier edicted Label ooOO oOOO oOO 0000 D Unlabeled Multiple Graph Gi Label Guessing Graphs Essays Pseudo-Label Graph Gz Teacher Model Figure 1:Architecture of the Transductive Graph-Based Ordinal Distillation(TGOD)framework. labels from the one-shot essays to the unlabeled essays;label guess- one-shot essays No and labeled as yi j,otherwise Yii =0.Starting ing that generates the pseudo labels of unlabeled essays from results from Y,label propagation iteratively determines the unknown labels of multiple graph propagation. of essays in Au according to the graph structure using the following formulation: 3.2.1 Multiple Graphs Construction.To construct a graph on F+1=aSF+(1-a)Y, 4) the essay set A,we need first to extract the feature embedding of each essay xiE X.Specifically,we employ an embedding layer where Ft eFdenotes the predicted labels at the timestamp t,S followed by a mean pooling layer as the essay encoder fe()to denotes the normalized weight,and a e(0,1)controls the amount extract the feature embedding fe(xi)of essay xi. of propagated information.It is well known that the sequence(F) Based on the feature embedding of essays,we then construct has a closed-form solution as follows: a neighborhood graph G=(V,E,W)for the essay set X,where F=(I-aS)-1Y, (5) V=X denotes the node set,E denotes the edge set,and W denotes where I is the identity matrix [52] the adjacent matrix.To construct an appropriate graph,we employ the Gaussian kernel function [53]to calculate the adjacent matrix 3.23 Label Guessing.For each unlabeled essay in Xu,we pro- W: duce a"guess"for its label based on the predictions of label propa- gation on multiple graphs.This guess is later used as pseudo label Wij exp d (fe(xi).fe(xj)) 2o2 (2) of unlabeled essay for knowledge distillation where d(,)is a distance measure (e.g.,Euclidean distance)and o To do so,we first compute the average of the label distributions is a length scale parameter. predicted by label propagation on all the B graphs by To construct a k-nearest neighbor graph,we only keep the k- B Y'= 1 max values in each row of W,and then apply the normalized graph B」 (6) Laplacians [6]on W: S=D-WD-i. (3) where ydenotes the averaged label distribution matrixand denotes the final label distribution matrix generated by applying where D is a diagonal matrix with its (i,i)-value to be the sum of label propagation on graph G. the i-th row of W. Then,for each unlabeled essay xiXu,its pseudo labelyis While using different pre-trained word embeddings as the em- obtained as follows: bedding layer may result in different k-nearest neighbor graphs, () we can construct B graphs by using B types of pre-trained word yi=arg max Yij. 1≤j≤K embeddings (e.g,Word2Vec [20],GloVe [28],ELMo [31],BERT where Y denotes the j-th element of the i-th row vector of Y. [43]). 3.2.2 Label Propagation.We now describe how to get predic- 3.3 Ordinal-Aware Neural Network (Student) tions for the unlabeled essays set Xu using label propagation [23]. We introduce the Student Model illustrated in Figure 1,which is Let F denote the set of N x K sized matrix with nonnegative an ordinal-aware neural network model and consists of two main entries.We define a label matrix Ye F with Yij=1 if xi is from the components:essay encoder that extracts the feature embedding 2349
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Multiple Graphs Construction Essay Encoder Essay Embedding Predicted Label 𝒀𝑼 Ordinal Distillation Pseudo-Label Label Guessing 𝒀𝑼 ′ Unimodal Ordinal Classifier Graph 𝑮𝟏 Label Distribution 𝑭𝑮𝟏 ∗ Label Propagation Graph 𝑮𝟐 Label Distribution 𝑭𝑮𝟐 ∗ Copy Expansion Softmax Sigmoid 𝒑 𝒑𝒌=𝟏 𝒑𝒌=𝟐 𝒑𝒌=𝑲 𝝊 log 𝐾 − 1 𝑘 − 1 + 𝒌 − 𝟏 log 𝒑 + (𝑲 − 𝒌) log(𝟏 − 𝒑) 𝝊 Log CMB PMF … Label Propagation Teacher Model Student Model One-Shot Labeled Essays Unlabeled Essays Figure 1: Architecture of the Transductive Graph-Based Ordinal Distillation (TGOD) framework. labels from the one-shot essays to the unlabeled essays; label guessing that generates the pseudo labels of unlabeled essays from results of multiple graph propagation. 3.2.1 Multiple Graphs Construction. To construct a graph on the essay set X, we need first to extract the feature embedding of each essay xi ∈ X. Specifically, we employ an embedding layer followed by a mean pooling layer as the essay encoder fe (·) to extract the feature embedding fe (xi) of essay xi . Based on the feature embedding of essays, we then construct a neighborhood graph G = (V, E,W ) for the essay set X, where V = X denotes the node set, E denotes the edge set, and W denotes the adjacent matrix. To construct an appropriate graph, we employ the Gaussian kernel function [53] to calculate the adjacent matrix W : Wij = exp − d fe (xi), fe (xj) 2σ 2 , (2) where d(·, ·) is a distance measure (e.g., Euclidean distance) and σ is a length scale parameter. To construct a k-nearest neighbor graph, we only keep the kmax values in each row ofW , and then apply the normalized graph Laplacians [6] on W : S = D − 1 2W D− 1 2 , (3) where D is a diagonal matrix with its (i,i)-value to be the sum of the i-th row of W . While using different pre-trained word embeddings as the embedding layer may result in different k-nearest neighbor graphs, we can construct B graphs by using B types of pre-trained word embeddings (e.g., Word2Vec [20], GloVe [28], ELMo [31], BERT [43]). 3.2.2 Label Propagation. We now describe how to get predictions for the unlabeled essays set Xu using label propagation [23]. Let F denote the set of N × K sized matrix with nonnegative entries. We define a label matrix Y ∈ F with Yij = 1 if xi is from the one-shot essays Xo and labeled asyi = j, otherwise Yij = 0. Starting fromY, label propagation iteratively determines the unknown labels of essays in Xu according to the graph structure using the following formulation: F t+1 = αSFt + (1 − α)Y, (4) where F t ∈ F denotes the predicted labels at the timestamp t, S denotes the normalized weight, and α ∈ (0, 1) controls the amount of propagated information. It is well known that the sequence {F t } has a closed-form solution as follows: F ∗ = (I − αS) −1Y, (5) where I is the identity matrix [52]. 3.2.3 Label Guessing. For each unlabeled essay in Xu , we produce a "guess" for its label based on the predictions of label propagation on multiple graphs. This guess is later used as pseudo label of unlabeled essay for knowledge distillation. To do so, we first compute the average of the label distributions predicted by label propagation on all the B graphs by Y ′ = 1 B Õ B b=1 F ∗ Gb , (6) where Y ′ denotes the averaged label distribution matrix, and F ∗ Gb denotes the final label distribution matrix generated by applying label propagation on graph Gb . Then, for each unlabeled essay xi ∈ Xu , its pseudo label y ′ i is obtained as follows: y ′ i = arg max 1≤j ≤K Y ′ ij , (7) where Y ′ ij denotes the j-th element of the i-th row vector of Y ′ . 3.3 Ordinal-Aware Neural Network (Student) We introduce the Student Model illustrated in Figure 1, which is an ordinal-aware neural network model and consists of two main components: essay encoder that extracts the feature embedding 2349
WWW '21,April 19-23,2021,Ljubljana,Slovenia Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu of the input essay:ordinal classifier that predicts a unimodal label where Yik denotes the k-th element of Yi.Based on Yi,the final distribution on the pre-defined scores for each input essay. predicted label i of essay xi can be obtained by: 3.3.1 Essay Encoder.We employ a neural network fo()to ex- i=arg max Yik. (12) tract features of an input xi,where fo(xi;)refers to the essay 1≤k≤K embedding and o indicates the parameters of the network.This module is not limited to a specific architecture and can be var- 3.4 Ordinal Distillation ious existing AES encoders.To demonstrate the universality of We introduce the Ordinal Distillation illustrated in Figure 1,which our framework and provide more fair comparisons in the experi- distills the pseudo-label knowledge of Teacher Model into the Stu- ments,we adopt the encoders adopted in some recent work(e.g., dent Model,and consists of three main steps:label selection that CNN-LSTM-Att [9].HA-LSTM [5],BERT [5]). selects high confidence pseudo-labels for later distillation;unimodal 3.3.2 Unimodal Ordinal Classifier.Unlike previous neural net- smoothing that enforces the label distribution of pseudo-label to be a unimodal probability distribution;unimodal distillation that min- work based AES models which predict the score of the input essay by using a regression layer(ie.a one-unit layer),we view the essay imizes the KL divergence between the predicted label distribution of Student Model and the unimodal smoothed label distribution of scoring as an ordinal classification problem and adopt an ordinal Teacher Model. classifier [3]for prediction To capture the ordinal relationship among classes,the unimodal 3.4.1 Label Selection.Considering that only one-shot labeled probability distribution(ie,the distribution has a peak at class k data is available for label propagation,the pseudo labels generated while decreasing its value when the class goes away from k)is usu- by Teacher Model may be noisy.Therefore,we propose a label ally used to restrict the shape of the predicted label distributions. selection strategy to select a subset of pseudo labels with high According to previous studies [3,22],some special exponential confidence. functions and the probability mass function(PMF)of both Pois- Specifically,for each distinct score k e y,we first collect all son distribution and binomial distribution can be used to enforce corresponding pseudo labels,that is,Ck=yly=k,xiE Xu). discrete unimodal probability distribution. and then rank these pseudo labels Ck according to their confidence. In our framework,we choose an extension of the binomial dis- We measure the confidence of a pseudo label y by calculating tribution,Conway-Maxwell binomial distribution(CMB)[16].as the negative Shannon entropy of its corresponding label distribu- the base distribution,and employ the PMF of the CMB to generate tion(Eq.13),so that a peaked distribution may tend to get a high the predicted unimodal probability distribution of essay xi: confidence. P(yi=k)= 1/K-1 S(p,v)k-1 p-1(1-p)K-k (8) Confidence(y)=-H(Y:) ∑写logz (13) where S(p.v)= (9) After that,we select top mk pseudo labels with high confidence from Ck by Here k∈y={1,2,..,K,0≤p≤1,and-oo≤v≤co.The mk min (Ckl,max(a,Cklx y)), (14) parameter v can be used to control the variance of the distribution. The case v=1 is the usual binomial distribution. where the threshold ratio y and the threshold number a are set to To be more specifically,we now describe the neural network ensure a sufficient number of pseudo labels are selected in the end architecture of the employed ordinal classifier based on the PMF of and avoid serious class imbalance problem. the CMB.As shown in Figure 1,the essay encoder is followed by a linear layer which transforms the essay embedding into a number 3.4.2 Unimodal Smoothing.Previous studies on knowledge dis- vER and a probability p e[0,1](by using sigmoid activation func- tillation [13,49]have shown that a soft or smoothed probability tion).The linear layer is then followed by a 'copy expansion'layer distribution from teacher model is more suitable for knowledge which expands the probability p into K probabilities corresponding distillation than one-hot probability distribution.Considering that to K distinct scores,that is,Pk=1 =Pk=2=...=Pk=K.The follow- essay scoring is an ordinal classification problem and an essay is ing layer then applies the 'Log CMB PMF'transformation on these more likely to be mispredicted as a score close to the ground-truth probabilities with different k: score,we enforce the distribution of pseudo labels produced by teacher model to be a unimodal smoothed probability distribution K-1 LCP(k:v,p)=vlog As mentioned before,some special exponential functions [22] k-1 +(k-1)logp (10) can be used to enforce discrete unimodal probability distribution. +(K-k)log(1-p), Therefore,we employ an exponential function to perform the uni- where the log operation is used to address numeric stability is- modal smoothing on both one-shot labels and pseudo labels: sues.Finally,a softmax layer is applied on the logit,LCP(k:v.p),to cp(-以 produce a unimodal probability distribution Yi for essay xi: ∑A:ep(U- xi∈Xo eLCP(k:v.p) g'(yi=kxi)= (15) exp(- =1eLCP(k:v.p) (11) 1p(-T xiEXu 2350
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu of the input essay; ordinal classifier that predicts a unimodal label distribution on the pre-defined scores for each input essay. 3.3.1 Essay Encoder. We employ a neural network fφ (·) to extract features of an input xi , where fφ (xi ;φ) refers to the essay embedding and φ indicates the parameters of the network. This module is not limited to a specific architecture and can be various existing AES encoders. To demonstrate the universality of our framework and provide more fair comparisons in the experiments, we adopt the encoders adopted in some recent work (e.g., CNN-LSTM-Att [9], HA-LSTM [5], BERT [5]). 3.3.2 Unimodal Ordinal Classifier. Unlike previous neural network based AES models which predict the score of the input essay by using a regression layer (i.e. a one-unit layer), we view the essay scoring as an ordinal classification problem and adopt an ordinal classifier [3] for prediction. To capture the ordinal relationship among classes, the unimodal probability distribution (i.e., the distribution has a peak at class k while decreasing its value when the class goes away from k) is usually used to restrict the shape of the predicted label distributions. According to previous studies [3, 22], some special exponential functions and the probability mass function (PMF) of both Poisson distribution and binomial distribution can be used to enforce discrete unimodal probability distribution. In our framework, we choose an extension of the binomial distribution, Conway–Maxwell binomial distribution (CMB) [16], as the base distribution, and employ the PMF of the CMB to generate the predicted unimodal probability distribution of essay xi : P(yi = k) = 1 S(p,υ) K − 1 k − 1 υ p k−1 (1 − p) K−k , (8) where S(p,υ) = Õ K k=1 K − 1 k − 1 υ p k−1 (1 − p) K−k . (9) Here k ∈ Y = {1, 2, . . . ,K}, 0 ≤ p ≤ 1, and −∞ ≤ υ ≤ ∞. The parameter υ can be used to control the variance of the distribution. The case υ = 1 is the usual binomial distribution. To be more specifically, we now describe the neural network architecture of the employed ordinal classifier based on the PMF of the CMB. As shown in Figure 1, the essay encoder is followed by a linear layer which transforms the essay embedding into a number υ ∈ R and a probability p ∈ [0, 1] (by using sigmoid activation function). The linear layer is then followed by a ‘copy expansion’ layer which expands the probability p into K probabilities corresponding to K distinct scores, that is, pk=1 = pk=2 = · · · = pk=K . The following layer then applies the ‘Log CMB PMF’ transformation on these probabilities with different k: LCP(k;υ,p) = υ log K − 1 k − 1 + (k − 1) logp + (K − k) log (1 − p), (10) where the log operation is used to address numeric stability issues. Finally, a softmax layer is applied on the logit, LCP(k;υ,p), to produce a unimodal probability distribution Yˆ i for essay xi : Yˆ ik = e LCP(k;υ,p) ÍK k=1 e LCP(k;υ,p) , (11) where Yˆ ik denotes the k-th element of Yˆ i . Based on Yˆ i , the final predicted label yˆi of essay xi can be obtained by: yˆi = arg max 1≤k ≤K Yˆ ik . (12) 3.4 Ordinal Distillation We introduce the Ordinal Distillation illustrated in Figure 1, which distills the pseudo-label knowledge of Teacher Model into the Student Model, and consists of three main steps: label selection that selects high confidence pseudo-labels for later distillation; unimodal smoothing that enforces the label distribution of pseudo-label to be a unimodal probability distribution; unimodal distillation that minimizes the KL divergence between the predicted label distribution of Student Model and the unimodal smoothed label distribution of Teacher Model. 3.4.1 Label Selection. Considering that only one-shot labeled data is available for label propagation, the pseudo labels generated by Teacher Model may be noisy. Therefore, we propose a label selection strategy to select a subset of pseudo labels with high confidence. Specifically, for each distinct score k ∈ Y, we first collect all corresponding pseudo labels, that is, Ck = {y ′ i |y ′ i = k, xi ∈ Xu }, and then rank these pseudo labels Ck according to their confidence. We measure the confidence of a pseudo label y ′ i by calculating the negative Shannon entropy of its corresponding label distribution (Eq. 13), so that a peaked distribution may tend to get a high confidence. Confidence(y ′ i ) = −H(Y ′ i ) = Õ K j=1 Y ′ ij log2 Y ′ ij (13) After that, we select top mk pseudo labels with high confidence from Ck by mk = min (|Ck |, max(a, |Ck | ×γ )), (14) where the threshold ratio γ and the threshold number a are set to ensure a sufficient number of pseudo labels are selected in the end and avoid serious class imbalance problem. 3.4.2 Unimodal Smoothing. Previous studies on knowledge distillation [13, 49] have shown that a soft or smoothed probability distribution from teacher model is more suitable for knowledge distillation than one-hot probability distribution. Considering that essay scoring is an ordinal classification problem and an essay is more likely to be mispredicted as a score close to the ground-truth score, we enforce the distribution of pseudo labels produced by teacher model to be a unimodal smoothed probability distribution. As mentioned before, some special exponential functions [22] can be used to enforce discrete unimodal probability distribution. Therefore, we employ an exponential function to perform the unimodal smoothing on both one-shot labels and pseudo labels: q ′ (yi = k|xi) = exp( −|k−yi | τ ) ÍK j=1 exp( −|j−yi | τ ) xi ∈ Xo exp( −|k−y ′ i | τ ) ÍK j=1 exp( −|j−y ′ i | τ ) xi ∈ Xu , (15) 2350
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW'21,April 19-23,2021,Ljubljana,Slovenia Algorithm 1 The Training Flow of TGOD Table 1:Statistics of the ASAP datasets.For column Genre, Input:The whole set of essays X,one-shot labeled data Do. ARG denotes argumentative essays,RES denotes response Output:An optimized student model. essays,and NAR denotes narrative essays.The last column Run the Teacher Model: lists the score ranges. Construct multiple graphs G*=(G1,G2,....GB}on X. for each Gh eG do Prompt #Essay Genre Avg Len Range Apply label propagation algorithm on Gh as Eq.5. 1 1,783 ARG 350 2-12 end for 2 1,800 ARG 350 1-6 Generate pseudo labels by label guessing as Eq.6 and 7. 3 1,726 RES 150 0-3 Train the Student Model by Ordinal Distillation: 4 1,772 RES 150 0-3 Select the pseudo labels with high confidence by Eq.13 and 14. 5 1,805 RES 150 0-4 Smooth the selected labels as Eg.15. 6 1,800 RES 150 0-4 Split selected essays into training set Dt and validation set Du 1,569 NAR 250 0-30 for all iter=1,...,MaxIter do 8 723 NAR 650 0-60 Optimize the student model on Dt by minimizing Eq.16. Validate the student model on D,, end for 4.1 Dataset and Evaluation Metric return The student model with best performance on D We conduct experiments on a public dataset ASAP(Automated Student Assessment Prize),which is a widely-used benchmark dataset for the task of automated essay scoring.In ASAP,there are where k ey and r is a parameter used to control the variance of eight sets of essays corresponding to eight different prompts,and a total of 12,978 scored essays.These eight essay sets vary in essay the distribution. number,genre,and score range,the details of which are listed in 3.4.3 Unimodal Distillation.Since the one-shot labeled data Table 1. Do is not sufficient to train a neural network,we use the pseudo To evaluate the performance of AES methods,we employ the labels produced by teacher model as a supplement to train the quadratic weighted kappa(OWK)as the evaluation metric,which student model. is the official metric of ASAP dataset.For each set of essays with Specifically,we train the student model by matching the output possible scores y={1,2.....K),the QWK can be calculated to label distribution of student model g(xi)=Y;and the unimodal measure the agreement between the automated predicted scores smoothed pseudo label of teacher model q'(xi)via a KL-divergence (Rater A)and the resolved human scores(Rater B)as follows: loss: K=1- ∑i,jwi,j0i,j (17) COD= DKL(gx)llg'(x》, (16) ∑ijwi,jEij where wi.j= (is calculated based on the difference between (i-2 where Xs denotes the set of essays from either one-shot data or the raters'scores,O is a K-by-K histogram matrix,Oi.j is the number selected essays after label selection. of essays that received a score i by Rater A and a score j by Rater B and E is calculated as the normalized outer product between each 3.5 Training Flow of TGOD rater's histogram vector of scores. In summary,there are two steps in TGOD to train the Student Model 4.2 Experimental Settings under the one-shot setting,i.e.,first generating pseudo labels of unlabeled essays by running the Teacher Model,and then training For the setting of'one-shot',we conduct experiments by randomly the Student Model by Ordinal Distillation.The whole training flow sampling the one-shot labeled data to train the model and test the of TGOD is illustrated in Figure 1 and Alg.1. model on the rest unlabeled essays.To reduce randomness,under In particular,considering that model selection is difficult to im- each case,we repeat the sampling of one-shot labeled data 20 times, plement under the one-shot supervised setting,we design a model and the average results are reported.For our proposed framework selection strategy based on pseudo labels,which validates the model we perform model selection based on the pseudo validation set. on a subset of pseudo labels. For other baseline methods.since one-shot labeled data is used for training and no extra labeled data can be used as a validation set to perform model selection,we report their best performance on test 4 EXPERIMENTS set as their upper bound performance for comparison. In this section,we first introduce the dataset and evaluation metric. For the setting of'one-shot+history prompt',we combine the one- Then we illustrate the experimental settings,the implementation shot labeled data and the labeled data in a history prompt of the details,and the performance comparison.Finally,we conduct ab- similar score range(e.g,P1→P2,P2→P1,P3→P4,P4→P3,and lation study and model analysis to investigate the effectiveness of our proposed approach. 1https://www.kaggle.com/c/asap-aes/data 2351
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Algorithm 1 The Training Flow of TGOD Input: The whole set of essays X, one-shot labeled data Do . Output: An optimized student model. Run the Teacher Model: Construct multiple graphs G ∗ = {G1,G2, . . . ,GB } on X. for each Gb ∈ G ∗ do Apply label propagation algorithm on Gb as Eq. 5. end for Generate pseudo labels by label guessing as Eq. 6 and 7. Train the Student Model by Ordinal Distillation: Select the pseudo labels with high confidence by Eq. 13 and 14. Smooth the selected labels as Eq. 15. Split selected essays into training set Dt and validation set Dv . for all iter=1,. . . ,MaxIter do Optimize the student model on Dt by minimizing Eq. 16. Validate the student model on Dv end for return The student model with best performance on Dv where k ∈ Y and τ is a parameter used to control the variance of the distribution. 3.4.3 Unimodal Distillation. Since the one-shot labeled data Do is not sufficient to train a neural network, we use the pseudo labels produced by teacher model as a supplement to train the student model. Specifically, we train the student model by matching the output label distribution of student model qˆ(xi) = Yˆ i and the unimodal smoothed pseudo label of teacher model q ′ (xi) via a KL-divergence loss: LO D = Õ xi ∈Xs DK L qˆ(xi)||q ′ (xi) , (16) where Xs denotes the set of essays from either one-shot data or the selected essays after label selection. 3.5 Training Flow of TGOD In summary, there are two steps in TGOD to train the Student Model under the one-shot setting, i.e., first generating pseudo labels of unlabeled essays by running the Teacher Model, and then training the Student Model by Ordinal Distillation. The whole training flow of TGOD is illustrated in Figure 1 and Alg. 1. In particular, considering that model selection is difficult to implement under the one-shot supervised setting, we design a model selection strategy based on pseudo labels, which validates the model on a subset of pseudo labels. 4 EXPERIMENTS In this section, we first introduce the dataset and evaluation metric. Then we illustrate the experimental settings, the implementation details, and the performance comparison. Finally, we conduct ablation study and model analysis to investigate the effectiveness of our proposed approach. Table 1: Statistics of the ASAP datasets. For column Genre, ARG denotes argumentative essays, RES denotes response essays, and NAR denotes narrative essays. The last column lists the score ranges. Prompt #Essay Genre Avg Len Range 1 1,783 ARG 350 2-12 2 1,800 ARG 350 1-6 3 1,726 RES 150 0-3 4 1,772 RES 150 0-3 5 1,805 RES 150 0-4 6 1,800 RES 150 0-4 7 1,569 NAR 250 0-30 8 723 NAR 650 0-60 4.1 Dataset and Evaluation Metric We conduct experiments on a public dataset ASAP (Automated Student Assessment Prize1 ), which is a widely-used benchmark dataset for the task of automated essay scoring. In ASAP, there are eight sets of essays corresponding to eight different prompts, and a total of 12,978 scored essays. These eight essay sets vary in essay number, genre, and score range, the details of which are listed in Table 1. To evaluate the performance of AES methods, we employ the quadratic weighted kappa (QWK) as the evaluation metric, which is the official metric of ASAP dataset. For each set of essays with possible scores Y = {1, 2, . . . ,K}, the QWK can be calculated to measure the agreement between the automated predicted scores (Rater A) and the resolved human scores (Rater B) as follows: κ = 1 − Í i,j wi,jOi,j Í i,j wi,jEi,j , (17) where wi,j = (i−j) 2 (K−1) 2 is calculated based on the difference between raters’ scores, O is a K-by-K histogram matrix, Oi,j is the number of essays that received a score i by Rater A and a score j by Rater B, and E is calculated as the normalized outer product between each rater’s histogram vector of scores. 4.2 Experimental Settings For the setting of ‘one-shot’, we conduct experiments by randomly sampling the one-shot labeled data to train the model and test the model on the rest unlabeled essays. To reduce randomness, under each case, we repeat the sampling of one-shot labeled data 20 times, and the average results are reported. For our proposed framework, we perform model selection based on the pseudo validation set. For other baseline methods, since one-shot labeled data is used for training and no extra labeled data can be used as a validation set to perform model selection, we report their best performance on test set as their upper bound performance for comparison. For the setting of ‘one-shot+history prompt’, we combine the oneshot labeled data and the labeled data in a history prompt of the similar score range (e.g., P1 → P2, P2 → P1, P3 → P4, P4 → P3, and 1https://www.kaggle.com/c/asap-aes/data 2351
WWW '21,April 19-23,2021,Ljubljana,Slovenia Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu Table 2:The performance (QWK)of all comparison methods on ASAP dataset.The best measures are in bold.denotes that the data is referenced from previous studies and the setting is'one history prompt+10 essays from target prompt'.T()'refers to teacher model and'S()'refers to student model. Setting Method P1 P2 P3 P4 P5 P6 P7 Avg. T(4 Graphs) 0.6670.5250.6480.6930.7340.5700.6190.4470.613 OCLF+Distill(unimodal) 0.784 0.626 0.6520.689 0.7770.651 0.723 0.619 0.690 S(CNN-LSTM-Att) TGOD REG +Distill(score) 0.7720.6170.6490.6940.7730.6060.7210.6080.680 (Ours) 0CLF+Distill(unimoda0.7920.5930.6610.6890.759 0.6740.7380.635 0.693 S(HA-LSTM) REG Distill(score) 0.7800.565 0.6740.6780.741 0.6670.700 0.581 0.673 0CLF+Distill(unimodal)0.7720.5810.6900.7250.7760.691 S(BERT) 0.7660.5050.688 REG Distill(score) 0.7520.571 0.6650.644 0.773 0.6680.691 0.577 0.668 BLRR REG 0.7310.5530.5780.6440.6230.5810.5830.5740.608 OCLF 0.626 0.4430.352 0.5260.6430.475 0.1700.145 0.422 CNN-LSTM-Att REG 0.5450.477 0.2020.5690.6710.4930.580 0.641 0.522 One-Shot AES Model OCLE 0.576 05070617.0553 0.6350585 0.6200222 0.539 HA-LSTM REG 0.616 0.515 0.338 0.5310.746 0.649 0.555 0.480 0.554 OCLF 0.6950.5350.629 0.6210.7480.6600.706 0447 0.630 BERT REG 0.704 0.562 0.648 0.631 0.775 0.647 0.687 0.568 0.653 Word2Vec-MoT 0.703 0.525 0654 0.657 0.627 0571 0540 0429 0588 Label GloVe-MoT 0.675 0.552 0.642 0.668 0.686 0.346 0.588 0.385 0.593 Propagation ELMo-MoT 0658 0.382 0577 0.635 Semi- 0.583 0.640 0.443 0.422 0.543 Supervised BERT-MoT 0.668 0.467 0.603 0.641 0.753 0.545 0.615 0.471 0.595 Model Word2Vec-MoT 0.167 0423 0.479 0.507 0619 0474 0215 0188 0384 GloVe-MoT 0.152 0.435 0.530 0.547 0.488 0.131 0.135 0.350 TSVM ELMo-MoT 0.189 0327 0.480 0573 0.5410.412 0.224 0.109 0357 BERT-MoT 0.201 0.193 0.523 0.561 0.611 0.450 0.175 0.202 0.365 Reference Data 0552 0.691 0669 0603 CNN-LSTM-Att Re-Implement 0.592 0.553 0.666 0.680 0.6900.656 0.6400.565 0.630 Reference Data 0.570 0.681 0.704 0605 AES HA-LSTM One-Shot Re-Implement 0.633 0.545 0.6850.6830.7290.6290.2810.436 0.578 Model Reference Data 0.552 0.705 0.725 0.600 History Prompt BERT Re-Implement 0.6610.669 0.6510.6980.7090.5990.7250.5740.661 Few-Shot PROTO NET Meta-training 0.6930.599 0.6760.7140.7350.6120.5450.4150.624 Model TPN Meta-training 0.648 0.479 0.6630.6810.704 0.575 0.514 0.402 0.583 so on)to train the baseline AES model.For the few-shot models set to 0.001.For BERT,the 'uncased BERT-base model'is adopted, we use the data of history prompt as their meta training data. the Adam optimizer is adopted,and the learning rate is set to 0.001 4.3 Implementation Details 4.4 Comparison Methods In our TGOD framework,for the teacher model,we adopt four As described in Section 3.our framework employs a graph-based la- types of word embeddings(i.e.,Word2Vec,GloVe,ELMo,and BERT) bel propagation method as the teacher model and an ordinal-aware to construct four graphs for label guessing.The dimension of word neural network as the student model,which are transductive semi- embedding is 200.We fix the word embedding during training.The supervised model and supervised AES model respectively.Thus. k for constructing k-nearest neighbor graph is set 20.For label under the one-shot setting,we compare our models with existing selection,y is set to 0.25 and a is set to 50.For label smoothing. supervised AES models and semi-supervised models.Considering r is set to 30.For the student model,we adopt three neural AES that some previous studies on AES have often focused on combin- models (i.e.,CNN-LSTM-Att,HA-LSTM,and BERT)as the student ing few data in target prompt and the data in history prompts to model.We test the cases of using either the ordinal classifier(OCLF, perform essay scoring,we view them as a different one-shot like adopted by our framework)and the regression layer(REG,used setting.named one-shot plus history prompt.In this setting,we by the baseline AES models).While using the regression layer,the consider the existing AES models and classical few-shot models. smoothed label distribution is replaced by the score of pseudo labels. We implement four existing AES models: For the training of regression based AES models,the ground- BLRR [32]is based on hand-crafted features,and uses corre- truth scores of essays are rescaled into [0,1]for regression.To evalu- lated linear regression for prediction. ate the results,the predicted scores are rescaled to the original score CNN-LSTM-Att [9]is a neural AES model based on hierar- range of the corresponding prompts.For the hyper-parameters of chical architecture and attention mechanism. CNN-LSTM-Att and HA-LSTM,the hidden size is set to 100,dropout .HA-LSTM [5]is a neural AES model based on hierarchical is set to 0.5,the Adam optimizer is adopted,and the learning rate is architecture and self-attention mechanism. 2352
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu Table 2: The performance (QWK) of all comparison methods on ASAP dataset. The best measures are in bold. † denotes that the data is referenced from previous studies and the setting is ‘one history prompt + 10 essays from target prompt’. ‘T(·)’ refers to teacher model and ‘S(·)’ refers to student model. Setting Method P1 P2 P3 P4 P5 P6 P7 P8 Avg. One-Shot TGOD (Ours) T(4 Graphs) 0.667 0.525 0.648 0.693 0.734 0.570 0.619 0.447 0.613 S(CNN-LSTM-Att) OCLF + Distill(unimodal) 0.784 0.626 0.652 0.689 0.777 0.651 0.723 0.619 0.690 REG + Distill(score) 0.772 0.617 0.649 0.694 0.773 0.606 0.721 0.608 0.680 S(HA-LSTM) OCLF + Distill(unimodal) 0.792 0.593 0.661 0.689 0.759 0.674 0.738 0.635 0.693 REG + Distill(score) 0.780 0.565 0.674 0.678 0.741 0.667 0.700 0.581 0.673 S(BERT) OCLF + Distill(unimodal) 0.772 0.581 0.690 0.725 0.776 0.691 0.766 0.505 0.688 REG + Distill(score) 0.752 0.571 0.665 0.644 0.773 0.668 0.691 0.577 0.668 AES Model BLRR REG 0.731 0.553 0.578 0.644 0.623 0.581 0.583 0.574 0.608 CNN-LSTM-Att OCLF 0.626 0.443 0.352 0.526 0.643 0.475 0.170 0.145 0.422 REG 0.545 0.477 0.202 0.569 0.671 0.493 0.580 0.641 0.522 HA-LSTM OCLF 0.576 0.507 0.617 0.553 0.635 0.585 0.620 0.222 0.539 REG 0.616 0.515 0.338 0.531 0.746 0.649 0.555 0.480 0.554 BERT OCLF 0.695 0.535 0.629 0.621 0.748 0.660 0.706 0.447 0.630 REG 0.704 0.562 0.648 0.631 0.775 0.647 0.687 0.568 0.653 SemiSupervised Model Label Propagation Word2Vec-MoT 0.703 0.525 0.654 0.657 0.627 0.571 0.540 0.429 0.588 GloVe-MoT 0.675 0.552 0.642 0.668 0.686 0.546 0.588 0.385 0.593 ELMo-MoT 0.658 0.382 0.577 0.635 0.583 0.640 0.443 0.422 0.543 BERT-MoT 0.668 0.467 0.603 0.641 0.753 0.545 0.615 0.471 0.595 TSVM Word2Vec-MoT 0.167 0.423 0.479 0.507 0.619 0.474 0.215 0.188 0.384 GloVe-MoT 0.152 0.435 0.386 0.530 0.547 0.488 0.131 0.135 0.350 ELMo-MoT 0.189 0.327 0.480 0.573 0.541 0.412 0.224 0.109 0.357 BERT-MoT 0.201 0.193 0.523 0.561 0.611 0.450 0.175 0.202 0.365 One-Shot + History Prompt AES Model CNN-LSTM-Att Reference Data † − 0.552 − 0.691 − 0.669 − 0.603 − Re-Implement 0.592 0.553 0.666 0.680 0.690 0.656 0.640 0.565 0.630 HA-LSTM Reference Data † − 0.570 − 0.681 − 0.704 − 0.605 − Re-Implement 0.633 0.545 0.685 0.683 0.729 0.629 0.281 0.436 0.578 BERT Reference Data † − 0.552 − 0.705 − 0.725 − 0.600 − Re-Implement 0.661 0.669 0.651 0.698 0.709 0.599 0.725 0.574 0.661 Few-Shot Model PROTO NET Meta-training 0.693 0.599 0.676 0.714 0.735 0.612 0.545 0.415 0.624 TPN Meta-training 0.648 0.479 0.663 0.681 0.704 0.575 0.514 0.402 0.583 so on) to train the baseline AES model. For the few-shot models, we use the data of history prompt as their meta training data. 4.3 Implementation Details In our TGOD framework, for the teacher model, we adopt four types of word embeddings (i.e., Word2Vec, GloVe, ELMo, and BERT) to construct four graphs for label guessing. The dimension of word embedding is 200. We fix the word embedding during training. The k for constructing k-nearest neighbor graph is set 20. For label selection, γ is set to 0.25 and a is set to 50. For label smoothing, τ is set to 30. For the student model, we adopt three neural AES models (i.e., CNN-LSTM-Att, HA-LSTM, and BERT) as the student model. We test the cases of using either the ordinal classifier (OCLF, adopted by our framework) and the regression layer (REG, used by the baseline AES models). While using the regression layer, the smoothed label distribution is replaced by the score of pseudo labels. For the training of regression based AES models, the groundtruth scores of essays are rescaled into [0, 1] for regression. To evaluate the results, the predicted scores are rescaled to the original score range of the corresponding prompts. For the hyper-parameters of CNN-LSTM-Att and HA-LSTM, the hidden size is set to 100, dropout is set to 0.5, the Adam optimizer is adopted, and the learning rate is set to 0.001. For BERT, the ‘uncased BERT-base model’ is adopted, the Adam optimizer is adopted, and the learning rate is set to 0.001. 4.4 Comparison Methods As described in Section 3, our framework employs a graph-based label propagation method as the teacher model and an ordinal-aware neural network as the student model, which are transductive semisupervised model and supervised AES model respectively. Thus, under the one-shot setting, we compare our models with existing supervised AES models and semi-supervised models. Considering that some previous studies on AES have often focused on combining few data in target prompt and the data in history prompts to perform essay scoring, we view them as a different one-shot like setting, named one-shot plus history prompt. In this setting, we consider the existing AES models and classical few-shot models. We implement four existing AES models: • BLRR [32] is based on hand-crafted features, and uses correlated linear regression for prediction. • CNN-LSTM-Att [9] is a neural AES model based on hierarchical architecture and attention mechanism. • HA-LSTM [5] is a neural AES model based on hierarchical architecture and self-attention mechanism. 2352
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW'21,April 19-23,2021,Ljubljana,Slovenia Table 3:Ablation study of TGOD.The setting-US&OC'means that both unimodal smoothing and ordinal classifier are ablated from framework and general classification layer is adopted for prediction. Model Setting P2 P3 P4 P5 P6 P7 P8 Avg. TGOD(CNN-LSTM-Att) 0.784 0.626 0.652 0.689 0.777 0.651 0.723 0.619 0.690 label selection(LS) 0.729 0.514 0.665 0.664 0.743 0.668 0.697 0.574 0.657 -unimodal smoothing (US) 0.754 0.532 0.614 0.675 0.743 0.635 0.633 0.351 0.617 ordinal classifier(OC) 0.736 0.576 0.541 0.656 0.731 0.670 0.705 0.604 0.652 US&OC(=only label selection) 0.696 0.500 0.630 0.658 0.637 0.458 0.556 0.439 0.572 -LS&OC (only unimodal smoothing) 0.689 0.522 0.665 0.652 0.750 0.622 0.596 0.553 0.631 -LS&US (only ordinal classifier) 0.700 0.467 0.654 0.657 0.746 0.547 0.548 0.492 0.601 all (LS US OC) 0.680 0.472 0.661 0.658 0.743 0.556 0.528 0.421 0.590 BERT [5]is the widely-used pre-training model,which has be because that ordinal classification is more robust to the weak been used as an encoder for the task of AES. labels. We implement two classical semi-supervised models: By observing the semi-supervised models,we can find that by .Label Propagation [52]is a graph-based classification method just using word embedding to get the feature of essays,Label Prop- under transductive setting. agation can achieve a better performance than the supervised AES TSVM [15]is a margin-based classification method under models.By comparing Label Propagation with single graph to the transductive setting. teacher model in TGOD with 4 graphs,we can find that an ensemble We implement two classical few-shot models: of these graphs can produce a better teacher for TGOD than using Protopical Network [37]is a few-shot model based on metric only one graph. learning and adopts the episodic training procedure. By observing the models under the setting 'One-Shot History TPN [24]is a transductive few-shot model based on label Prompt',we can find that even with more labeled data from other propagation and adopts the episodic training procedure. prompt,these models do not outperform our TGOD. For our TGOD framework,we implement a baseline that replaces the ordinal-aware unimodal distillation with linear regression. 4.6 Ablation Study We explore the effects of the components designed specific to the one-shot setting,by removing each of them from TGOD individually. 4.5 Performance Comparison These components include:label selection(LS),unimodal smoothing (US),and ordinal classifier (OC).We remove them from TGOD in As shown in Table 2,the best performance is mostly achieved by our three ways:remove one of them,remove two of them,and remove TGOD framework with using different essays encoders(i.e.,CNN all of them. LSTM-Att.HA-LSTM.and BERT).By observing TGOD.we can find As shown in Table 3,after removing one of them from TGOD, the performance of the teacher model(ie.,graph-based label propa- gation)with 4 graphs is an average QWK of 0.613,based on which, the performance decreases a lot.This indicates that all of the three components are important to TGOD.After removing another one the student models can greatly outperform the teacher model.This of them from TGOD,the performance continues to decrease.After indicates that the design of learning from graph propagation is removing all of them from TGOD,the performance decreases to a effective for one-shot essay scoring. OWK of 0.590,which is even worse than the teacher model in TGOD. By observing the AES models under'One-Shot'setting,we can This indicates that distilling the pseudo labels to a classification find that among the four AES models,BERT performs best,which model without label processing can not prevent the model from can achieve a OWK of 0.630(by REG,ie.,regression)and 0.653 (by being disturbed by the noises or errors in pseudo labels.In addition, OCLF,i.e.,ordinal classification).Besides,the hand-crafted features the performance of'-US&OC'(which means only label selection is based method BLRR performs better than the CNN-LSTM-Att and used)is even worse than the performance of'-all'.This indicates HA-LSTM,but worse than BERT.This may be because that BLRR that label selection should be used along with other two components does not need to train an essay encoder,and BERT has a pre-trained (i.e.,US and OC),otherwise,it would fail to benefit the general encoder.By comparing these three neural AES models to our TGODs classification model(not ordinal aware),and even have a negative with the corresponding essay encoder,we can find that TGOD can impact on the final performance. greatly improve their performance under the one-shot setting.To be more detailed,for each neural AES model,we can find that the 4.7 Model Analysis performance of using OCLF(ie.,0.422,0.539,and 0.630 for three neural AES models)is worse than the performance of using REG In this part,we analyze the effects of the one-shot labeled data and (i.e.,0.522,0.554,and 0.653 for three neural AES models)when the graph construction on the performance of TGOD. directly trained on one-shot data,but under our TGOD framework, 4.7.1 Effect of one-shot data selection.For one-shot labeled the performance of using OCLF (ie..0.690,0.693,and 0.688 for three data.we first study the impacts of data selection on the performance neural AES models)is better than the performance of using REG of TGOD,that is,whether our TGOD framework is sensitive to the (i.e.,0.680,0.673,and 0.668 for three neural AES models).This may selection of one-shot essays.To this end,we repeat the sampling of 2353
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Table 3: Ablation study of TGOD. The setting ‘− US&OC’ means that both unimodal smoothing and ordinal classifier are ablated from framework and general classification layer is adopted for prediction. Model Setting P1 P2 P3 P4 P5 P6 P7 P8 Avg. TGOD(CNN-LSTM-Att) 0.784 0.626 0.652 0.689 0.777 0.651 0.723 0.619 0.690 − label selection (LS) 0.729 0.514 0.665 0.664 0.743 0.668 0.697 0.574 0.657 − unimodal smoothing (US) 0.754 0.532 0.614 0.675 0.743 0.635 0.633 0.351 0.617 − ordinal classifier (OC) 0.736 0.576 0.541 0.656 0.731 0.670 0.705 0.604 0.652 − US&OC ( = only label selection) 0.696 0.500 0.630 0.658 0.637 0.458 0.556 0.439 0.572 − LS&OC ( = only unimodal smoothing) 0.689 0.522 0.665 0.652 0.750 0.622 0.596 0.553 0.631 − LS&US ( = only ordinal classifier) 0.700 0.467 0.654 0.657 0.746 0.547 0.548 0.492 0.601 − all (LS & US & OC) 0.680 0.472 0.661 0.658 0.743 0.556 0.528 0.421 0.590 • BERT [5] is the widely-used pre-training model, which has been used as an encoder for the task of AES. We implement two classical semi-supervised models: • Label Propagation [52] is a graph-based classification method under transductive setting. • TSVM [15] is a margin-based classification method under transductive setting. We implement two classical few-shot models: • Protopical Network [37] is a few-shot model based on metric learning and adopts the episodic training procedure. • TPN [24] is a transductive few-shot model based on label propagation and adopts the episodic training procedure. For our TGOD framework, we implement a baseline that replaces the ordinal-aware unimodal distillation with linear regression. 4.5 Performance Comparison As shown in Table 2, the best performance is mostly achieved by our TGOD framework with using different essays encoders (i.e., CNNLSTM-Att, HA-LSTM, and BERT). By observing TGOD, we can find the performance of the teacher model (i.e., graph-based label propagation) with 4 graphs is an average QWK of 0.613, based on which, the student models can greatly outperform the teacher model. This indicates that the design of learning from graph propagation is effective for one-shot essay scoring. By observing the AES models under ‘One-Shot’ setting, we can find that among the four AES models, BERT performs best, which can achieve a QWK of 0.630 (by REG, i.e., regression) and 0.653 (by OCLF, i.e., ordinal classification). Besides, the hand-crafted features based method BLRR performs better than the CNN-LSTM-Att and HA-LSTM, but worse than BERT. This may be because that BLRR does not need to train an essay encoder, and BERT has a pre-trained encoder. By comparing these three neural AES models to our TGODs with the corresponding essay encoder, we can find that TGOD can greatly improve their performance under the one-shot setting. To be more detailed, for each neural AES model, we can find that the performance of using OCLF (i.e., 0.422, 0.539, and 0.630 for three neural AES models) is worse than the performance of using REG (i.e., 0.522, 0.554, and 0.653 for three neural AES models) when directly trained on one-shot data, but under our TGOD framework, the performance of using OCLF (i.e., 0.690, 0.693, and 0.688 for three neural AES models) is better than the performance of using REG (i.e., 0.680, 0.673, and 0.668 for three neural AES models). This may be because that ordinal classification is more robust to the weak labels. By observing the semi-supervised models, we can find that by just using word embedding to get the feature of essays, Label Propagation can achieve a better performance than the supervised AES models. By comparing Label Propagation with single graph to the teacher model in TGOD with 4 graphs, we can find that an ensemble of these graphs can produce a better teacher for TGOD than using only one graph. By observing the models under the setting ‘One-Shot + History Prompt’, we can find that even with more labeled data from other prompt, these models do not outperform our TGOD. 4.6 Ablation Study We explore the effects of the components designed specific to the one-shot setting, by removing each of them from TGOD individually. These components include: label selection (LS), unimodal smoothing (US), and ordinal classifier (OC). We remove them from TGOD in three ways: remove one of them, remove two of them, and remove all of them. As shown in Table 3, after removing one of them from TGOD, the performance decreases a lot. This indicates that all of the three components are important to TGOD. After removing another one of them from TGOD, the performance continues to decrease. After removing all of them from TGOD, the performance decreases to a QWK of 0.590, which is even worse than the teacher model in TGOD. This indicates that distilling the pseudo labels to a classification model without label processing can not prevent the model from being disturbed by the noises or errors in pseudo labels. In addition, the performance of ‘− US&OC’ (which means only label selection is used) is even worse than the performance of ‘− all’. This indicates that label selection should be used along with other two components (i.e., US and OC), otherwise, it would fail to benefit the general classification model (not ordinal aware), and even have a negative impact on the final performance. 4.7 Model Analysis In this part, we analyze the effects of the one-shot labeled data and the graph construction on the performance of TGOD. 4.7.1 Effect of one-shot data selection. For one-shot labeled data, we first study the impacts of data selection on the performance of TGOD, that is, whether our TGOD framework is sensitive to the selection of one-shot essays. To this end, we repeat the sampling of 2353
WWW '21,April 19-23,2021,Ljubljana,Slovenia Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu 08 Prompts Number of Shot Prompts Ratio of Essays for Graph Constuction (a) (b) (c) (d) Figure 2:Effects of the one-shot labeled data and the graph construction on the performance of TGOD. one-shot labeled data 20 times,and record the corresponding per- indicates that combining multiple graphs for label guessing is an formance of TGOD(CNN-LSTM-Att is adopted as the essay encoder) effective way to provide pseudo labels with stable quality and thus for each sampling.For comparison,we record the performance of improves the performance of the Student Model. teacher model (Teacher),regression based student model (Student- REG),and ordinal classification based model (Student-OCLF). 4.74 Effect of the graph size.We then study the impacts of As shown in Figure 2(a),the red boxes often have a large vari- varying the number of essays for graph construction on the per- ance,which means that the performance of teacher is sensitive formance of TGOD,that is,whether our TGOD framework needs a to the selection of the one-shot labeled data.The blue and green large number of unlabeled data for graph construction and pseudo boxes have an obviously smaller variance than the corresponding label generation.To this end,we vary the ratio of essays used for red boxes.This indicates that after the process of label selection graph construction from 0.1 to 0.9 step by 0.2. and distillation,the student model is no longer as sensitive to the As shown in Figure 2(d),we can find that all the lines show a selection of one-shot labeled data as teacher model.By comparing trend that goes up first and then keeps stable after the ratio about the blue boxes and green boxes,we can find that Student-OCLF is 0.3.This indicates that 30%unlabeled essays is enough to run the more robust to the selection of one-shot data than Student-REG. teacher model and generate pseudo labels for our TGOD framework. 4.7.2 Effect of using more labeled data.We then study the 5 RELATED WORK impacts of using more labeled data on the performance of TGOD. In this section,we introduce briefly the following three research that is,whether our TGOD framework can be further improved by topics relevant to our work. providing more labeled data.To this end,we sample the labeled data by one-shot,three-shots,five-shots,and ten-shots,and record the 5.1 Automated Essay Scoring corresponding performance of TGOD(CNN-LSTM-Att is adopted Early research on AES mainly focused on the construction of au- as the essay encoder)for each setting. As shown in Figure 2(b),by observing the line of Avg.(with tomated composition scoring systems [11,26],which mainly com- black color),the overall performance of TGOD shows an upward bined surface features with regression models for essay scoring. trend,and the performance on the ten-shot labeled data has an Since this century,feature engineering has been used to design improvement of about 0.03 (on OWK)compared to that on the abundant linguistic features for essay scoring [29,30,38].More one-shot labeled data.By observing other eight lines,we can find recently,many neural-based methods have been proposed to learn that the overall performance of P3 shows a flat trend and P5 shows the features automatically [1,8,9,34,39,41].Among these methods. a slight downward trend.This may be because that when more prompt-specific methods are effective but the process of manual scoring is labor intensive.Generic methods [48]and cross-prompt labeled samples are added,the performance bottleneck may be the quality of graphs in teacher model,and thus the teacher model is neural based methods [5,9,14]are thus proposed to alleviate the not benefit from using more labeled data. burden of manual scoring. Most of the previous work tackled AES as a regression problem 4.7.3 Effect of combining multiple graphs.For the graph con- and used the Mean Square Error(MSE)as loss function for model struction in teacher model,we first study the impacts of adopting training [7,32,39].But they often used Quadratic Weighted Kappa multiple word embeddings for graph construction on the perfor- (OWK)[17,44]as their metric,which is a metric for ordinal classi- mance of TGOD,that is,whether our TGOD framework is benefit fication problem.This inconsistency may be because that it is more from combining multiple graphs for label guessing.To this end,we complicated to tackle it as an ordinal classification problem [44,47]. record the performance of the teacher model(graph propagation) and regression model can usually achieve good performance. and the student model(CNN-LSTM-Att)when using each of the four types of word embeddings and using them together. 5.2 Knowledge Distillation As shown in Figure 2(c).by observing the black line,we can find Knowledge distillation is originally proposed to transfer the knowl- that its end point(Teacher Model)is higher than that of the other edge of a complicated model to a simpler model by training the lines at most cases,regardless of the position of starting point.This simpler model with the soft targets provided by the complicated 2354
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu 0.4 0.5 0.6 0.7 0.8 QWK Prompts Teacher Student-REG Student-OCLF P1 P2 P3 P4 P5 P6 P7 P8 Avg. P1 P2 P3 P4 P5 P6 P7 P8 Avg. Prompts 0.25 0.35 0.45 0.55 0.65 0.75 0.85 QWK GloVe BERT Word2Vec ELMo ALL Tearcher Student (a) (b) (c) (d) Figure 2: Effects of the one-shot labeled data and the graph construction on the performance of TGOD. one-shot labeled data 20 times, and record the corresponding performance of TGOD (CNN-LSTM-Att is adopted as the essay encoder) for each sampling. For comparison, we record the performance of teacher model (Teacher), regression based student model (StudentREG), and ordinal classification based model (Student-OCLF). As shown in Figure 2(a), the red boxes often have a large variance, which means that the performance of teacher is sensitive to the selection of the one-shot labeled data. The blue and green boxes have an obviously smaller variance than the corresponding red boxes. This indicates that after the process of label selection and distillation, the student model is no longer as sensitive to the selection of one-shot labeled data as teacher model. By comparing the blue boxes and green boxes, we can find that Student-OCLF is more robust to the selection of one-shot data than Student-REG. 4.7.2 Effect of using more labeled data. We then study the impacts of using more labeled data on the performance of TGOD, that is, whether our TGOD framework can be further improved by providing more labeled data. To this end, we sample the labeled data by one-shot, three-shots, five-shots, and ten-shots, and record the corresponding performance of TGOD (CNN-LSTM-Att is adopted as the essay encoder) for each setting. As shown in Figure 2(b), by observing the line of Avg. (with black color), the overall performance of TGOD shows an upward trend, and the performance on the ten-shot labeled data has an improvement of about 0.03 (on QWK) compared to that on the one-shot labeled data. By observing other eight lines, we can find that the overall performance of P3 shows a flat trend and P5 shows a slight downward trend. This may be because that when more labeled samples are added, the performance bottleneck may be the quality of graphs in teacher model, and thus the teacher model is not benefit from using more labeled data. 4.7.3 Effect of combining multiple graphs. For the graph construction in teacher model, we first study the impacts of adopting multiple word embeddings for graph construction on the performance of TGOD, that is, whether our TGOD framework is benefit from combining multiple graphs for label guessing. To this end, we record the performance of the teacher model (graph propagation) and the student model (CNN-LSTM-Att) when using each of the four types of word embeddings and using them together. As shown in Figure 2(c), by observing the black line, we can find that its end point (Teacher Model) is higher than that of the other lines at most cases, regardless of the position of starting point. This indicates that combining multiple graphs for label guessing is an effective way to provide pseudo labels with stable quality and thus improves the performance of the Student Model. 4.7.4 Effect of the graph size. We then study the impacts of varying the number of essays for graph construction on the performance of TGOD, that is, whether our TGOD framework needs a large number of unlabeled data for graph construction and pseudo label generation. To this end, we vary the ratio of essays used for graph construction from 0.1 to 0.9 step by 0.2. As shown in Figure 2(d), we can find that all the lines show a trend that goes up first and then keeps stable after the ratio about 0.3. This indicates that 30% unlabeled essays is enough to run the teacher model and generate pseudo labels for our TGOD framework. 5 RELATED WORK In this section, we introduce briefly the following three research topics relevant to our work. 5.1 Automated Essay Scoring Early research on AES mainly focused on the construction of automated composition scoring systems [11, 26], which mainly combined surface features with regression models for essay scoring. Since this century, feature engineering has been used to design abundant linguistic features for essay scoring [29, 30, 38]. More recently, many neural-based methods have been proposed to learn the features automatically [1, 8, 9, 34, 39, 41]. Among these methods, prompt-specific methods are effective but the process of manual scoring is labor intensive. Generic methods [48] and cross-prompt neural based methods [5, 9, 14] are thus proposed to alleviate the burden of manual scoring. Most of the previous work tackled AES as a regression problem and used the Mean Square Error (MSE) as loss function for model training [7, 32, 39]. But they often used Quadratic Weighted Kappa (QWK) [17, 44] as their metric, which is a metric for ordinal classification problem. This inconsistency may be because that it is more complicated to tackle it as an ordinal classification problem [44, 47], and regression model can usually achieve good performance. 5.2 Knowledge Distillation Knowledge distillation is originally proposed to transfer the knowledge of a complicated model to a simpler model by training the simpler model with the soft targets provided by the complicated 2354
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW'21,April 19-23,2021,Ljubljana,Slovenia model [13].Since then,it has been widely adopted in a variety of Conference on Machine Learning,ICML 2017,Sydney,NSW.Australia,6-11 Au- learning tasks [18,35,46].Recently,several approaches [27,36,50] inResearch,Vol.70)Doina Precu and have been proposed to improve performance of knowledge distilla- [4]David Berthelot,Nicholas Carlini,Ian J.Goodfellow,Nicolas Papernot,Avital tion.They address how to extract information better from teacher Oliver,and Colin Raffel 2019.MixMatch:A Holistic Approach to Semi-Supervised networks and deliver it to students using the activations of inter- Learning.In Advances in Neural Information Processing Systems 32:Annual Confer- ence on Neural Information Processing Systems 2019,NeurIPS 2019,8-14 December mediate layers [36],attention maps [50],or relational information 2019,Vancouver,BC;Canada.5050-5060. between training examples [27].Besides,instead of transferring [5]Yue Cao,Hanqi Jin,Xiaojun Wan,and Zhiwei Yu.2020.Domain-Adaptive information from teacher to student,Zhang et al.51 proposed a Neural Automated Essay Scoring.In SIGIR '20:The 43rd International ACM SIGIR conference on research and development in Information Retrieval. mutual learning strategy.Our work differs from existing approaches [6]Fan Chung.1997.Spectral graph theory.Published for the Conference Board of in that we enforce the student model to learn a unimodal distribu- the mathematical sciences by the American Mathematical Society.. tion but not the output distribution of teacher model [7]Madalina Cozma,Andrei M.Butnaru,and Radu Tudor Ionescu.2018.Automated essay scoring with string kernels and v word embeddings.In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics 503-509. 5.3 Semi-Supervised Learning [8]Fei Dong and Yue Zhang.2016.Automatic Features for Essay Scoring-An Empirical Study.In Proceedings of the 2016 Conference on Empirical Methods in Semi-Supervised Learning(SSL)aims to label unlabeled data using Natural Language Processing.1072-1077. [9]Fei Dong.Yue Zhang,and Jie Yang.2017.Attention-based Recurrent Convolu knowledge learned from a small amount of labeled data combined tional Neural Network for Automatic Essay Scoring.In Proceedings of the 21st with a large amount of unlabeled data.SSL have two settings:trans- Conference on Con putational Natural Langu Learning (CoNLL 2017).153-162. ductive inference and inductive inference.The setting of transduc- [10]Youmna Farag.Helen Yannakoudakis,and Ted Briscoe.2018.Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input.In Pro tive inference was first introduced by [42],which aims to infer the ceedings of the 2018 Conference of the North American Chapter of the Association label of unlabeled data directly from the labeled data.The classical for Computational Linguistics:Human Language Technologies,NAACL-HLT 2018. 263-271 methods include the Transductive Support Vector Machine(TSVM) [11]Peter W.Foltz,Darrell Laham,and Thomas K Landauer.1999.Automated Essay [15]and the graph-based label propagation [12,52,53].Recently, Scoring:Applications to Educational Technology.In Proceedings of EdMedia+ the neural version of graph-based label propagation has been de- Innovate Learning 1999,Betty Collis and Ron Oliver(Eds.).Association for the Advancement of Computing in Education (AACE).Seattle,WA USA,939-944. veloped [24].The setting of inductive inference aims to train an [12] Yanwei Fu,Timothy M.Hospedales,Tao Xiang.and Shaogang Gong.2015.Trans inductive model based on both labeled and unlabeled data.It has ductive Multi-View Zero-Shot Learning.IEEE Trans.Pattern Anal.Mach.IntelL a great development in recent years and many effective methods 37,11(2015).2332-2345. [13]Geoffrey E.Hinton,Oriol Vinyals,and Jeffrey Dean.2015.Distilling the Knowl have been proposed,such as Pseudo-Label [21],I Model [33],Mean edge in a Neural Network.CoRR abs/1503.02531 (2015). Teacher [40],MixMatch [4],UDA [45]. [14]Cancan Jin,Ben He,Kai Hui,and Le Sun.2018.TDNN:A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring.In Proceedings of the 56th Annual Meeti ing ofthe Association for Computational Linguistics-1097 6 CONCLUSION [15]T.JOACHIMS.1999.Transductive Inference for Text Classification using Support Vector Machines.In Proc of International Conference on Machine Learning In this paper,we aim to perform essay scoring under one-shot [16]Joseph B.Kadane.2014.Sums of Possibly Associated Bernoulli Variables:The setting.To this end,we propose the TGOD framework to train a Conway-Maxwell-Binomial Distribution.Bayesian Analysis 11.2 (2014). [17]Zixuan Ke and Vincent Ng.2019.Automated Essay Scoring:A Survey of the State student neural AES model through a way of distilling the knowl- of the Art.In Proceedings of the Twenty-Eighth International Joint Conference on edge of a semi-supervised teacher model.In order to alleviate the Artificial Intelligence.6300-6308. negative effect of error pseudo labels on the student neural AES [18]Adhiguna Kuncoro,Chris Dyer,Laura Rimell,Stephen Clark,and Phil Blunsom 2019.Scalable Syntax-Aware Language Models Using Knowledge Distillation.In model,we introduce the label selection and ordinal distillation Proceedings of the 57th Conference of the Association for Computational Linguistics. strategies.Experimental results demonstrate the effectiveness of ACL 2019,Florence,Italy,July 28-August 2.2019.Volume 1:Long Papers.3472- 3484. the proposed TGOD framework for one-shot essay scoring.In the [19]Darrell Laham and Peter Foltz.2003.Automated scoring and annotation of essays future,we will try to improve the performance of teacher model with the Intelligent Essay Assessor. and student model by co-training or self-supervised learning. [20]Quoc V.Le and Tomas Mikolov.2014.Distributed Representations of Sentences and Documents.In Proceedings of the 31th International Conference on Machine Leaming,ICML 2014,Beijing.Chi a,21-263une2014.1188-1196. ACKNOWLEDGMENTS [21]Dong-Hyun Lee.2013.Pseudo-label:The simple and efficient semi-supervised learning method for deep neural networks.In Workshop on challenges in repre This work is supported by National Natural Science Foundation of sentation learning,ICML,Vol.3. [22]Xiaofeng Liu,Fangfang Fan,Lingsheng Kong.Zhihui Diao,Wanging Xie,Jun China under Grant Nos.61906085,61802169,61972192,41972111: Lu,and Jane You.2020.Unimodal regularized neuron stick-breaking for ordinal JiangSu Natural Science Foundation under Grant No.BK20180325: classification.Neurocomputing 388(2020),34-44. the Second Tibetan Plateau Scientific Expedition and Research [23]Yanbin Liu,Juho Lee,Minseop Park.Saehoon Kim,Eunho Yang.Sung Ju Hwang. and Yi Yang.2019.Learning to Propagate Labels:Transductive Propagation Program under Grant No.2019QZKK0204.This work is partially Network for Few-Shot Learning.In 7th International Conference on Learning supported by Collaborative Innovation Center of Novel Software Representations,ICLR 2019,New Orleans,LA,USA,May 6-9,2019.OpenReview.net. Technology and Industrialization. [24]Yanbin Liu,Juho Lee,Minseop Park,Saehoon Kim,Eunho Yang.Sung Ju Hwang. and Yi Yang.2019.Learning to Propagate Labels:Transductive Propagation Network for Few-Shot Learning.In 7th International Conference on Learning REFERENCES Representations,ICLR 2019,New Orleans,LA,USA,May 6-9,2019.OpenReview.net. [25]Ellis B Page.1966.The imminence of grading essays by computer.The Phi [1]Dimitrios Alikaniotis,Helen Yannakoudakis,and Marek Rei.2016.Automatic Delta Kappan47,5(1966),238-243. Text Scoring Using Neural Networks.In Proceedings of the 54th Annual Meeting [26]Ellis Batten Page.1994. Computer Grading of Student Prose,Using Modern of the Association for Computational Linguistics.715-725. Concepts and Software.Journal of Experimental Education 62,2(1994),127-142. [2]Yigal Attali and Jill Burstein.2006.Automated essay scoring with e-rater:v.2.0. [27]Wonpyo Park,Dongju Kim,Yan Lu,and Minsu Cho.2019.Relational Knowledge Journal of Technology Learning Assessment 4,2 (2006),i-21. Distillation.In IEEE Conference on Computer Vision and Pattern Recognition,CVPR [3]Christopher Beckham and Christopher J.Pal.2017.Unimodal Probability Distri- 2019,Long Beach,CAUS4,jume16-20,2019.3967-3976. butions for Deep Ordinal Classification.In Proceedings of the 34th International 2355
Learning from Graph Propagation via Ordinal Distillation for One-Shot Automated Essay Scoring WWW ’21, April 19–23, 2021, Ljubljana, Slovenia model [13]. Since then, it has been widely adopted in a variety of learning tasks [18, 35, 46]. Recently, several approaches [27, 36, 50] have been proposed to improve performance of knowledge distillation. They address how to extract information better from teacher networks and deliver it to students using the activations of intermediate layers [36], attention maps [50], or relational information between training examples [27]. Besides, instead of transferring information from teacher to student, Zhang et al. [51] proposed a mutual learning strategy. Our work differs from existing approaches in that we enforce the student model to learn a unimodal distribution but not the output distribution of teacher model. 5.3 Semi-Supervised Learning Semi-Supervised Learning (SSL) aims to label unlabeled data using knowledge learned from a small amount of labeled data combined with a large amount of unlabeled data. SSL have two settings: transductive inference and inductive inference. The setting of transductive inference was first introduced by [42], which aims to infer the label of unlabeled data directly from the labeled data. The classical methods include the Transductive Support Vector Machine (TSVM) [15] and the graph-based label propagation [12, 52, 53]. Recently, the neural version of graph-based label propagation has been developed [24]. The setting of inductive inference aims to train an inductive model based on both labeled and unlabeled data. It has a great development in recent years and many effective methods have been proposed, such as Pseudo-Label [21], Γ Model [33], Mean Teacher [40], MixMatch [4], UDA [45]. 6 CONCLUSION In this paper, we aim to perform essay scoring under one-shot setting. To this end, we propose the TGOD framework to train a student neural AES model through a way of distilling the knowledge of a semi-supervised teacher model. In order to alleviate the negative effect of error pseudo labels on the student neural AES model, we introduce the label selection and ordinal distillation strategies. Experimental results demonstrate the effectiveness of the proposed TGOD framework for one-shot essay scoring. In the future, we will try to improve the performance of teacher model and student model by co-training or self-supervised learning. ACKNOWLEDGMENTS This work is supported by National Natural Science Foundation of China under Grant Nos. 61906085, 61802169, 61972192, 41972111; JiangSu Natural Science Foundation under Grant No. BK20180325; the Second Tibetan Plateau Scientific Expedition and Research Program under Grant No. 2019QZKK0204. This work is partially supported by Collaborative Innovation Center of Novel Software Technology and Industrialization. REFERENCES [1] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic Text Scoring Using Neural Networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 715–725. [2] Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater®; v.2.0. Journal of Technology Learning & Assessment 4, 2 (2006), i–21. [3] Christopher Beckham and Christopher J. Pal. 2017. Unimodal Probability Distributions for Deep Ordinal Classification. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 411–419. [4] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada. 5050–5060. [5] Yue Cao, Hanqi Jin, Xiaojun Wan, and Zhiwei Yu. 2020. Domain-Adaptive Neural Automated Essay Scoring. In SIGIR ’20: The 43rd International ACM SIGIR conference on research and development in Information Retrieval. [6] Fan Chung. 1997. Spectral graph theory. Published for the Conference Board of the mathematical sciences by the American Mathematical Society,. [7] Madalina Cozma, Andrei M. Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 503–509. [8] Fei Dong and Yue Zhang. 2016. Automatic Features for Essay Scoring - An Empirical Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1072–1077. [9] Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 153–162. [10] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018. 263–271. [11] Peter W. Foltz, Darrell Laham, and Thomas K Landauer. 1999. Automated Essay Scoring: Applications to Educational Technology. In Proceedings of EdMedia + Innovate Learning 1999, Betty Collis and Ron Oliver (Eds.). Association for the Advancement of Computing in Education (AACE), Seattle, WA USA, 939–944. [12] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. 2015. Transductive Multi-View Zero-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 37, 11 (2015), 2332–2345. [13] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531 (2015). [14] Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 1088–1097. [15] T. JOACHIMS. 1999. Transductive Inference for Text Classification using Support Vector Machines. In Proc of International Conference on Machine Learning. [16] Joseph B. Kadane. 2014. Sums of Possibly Associated Bernoulli Variables: The Conway-Maxwell-Binomial Distribution. Bayesian Analysis 11, 2 (2014). [17] Zixuan Ke and Vincent Ng. 2019. Automated Essay Scoring: A Survey of the State of the Art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 6300–6308. [18] Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, and Phil Blunsom. 2019. Scalable Syntax-Aware Language Models Using Knowledge Distillation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers. 3472– 3484. [19] Darrell Laham and Peter Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor. [20] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. 1188–1196. [21] Dong-Hyun Lee. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. [22] Xiaofeng Liu, Fangfang Fan, Lingsheng Kong, Zhihui Diao, Wanqing Xie, Jun Lu, and Jane You. 2020. Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing 388 (2020), 34–44. [23] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. [24] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net. [25] Ellis B Page. 1966. The imminence of... grading essays by computer. The Phi Delta Kappan 47, 5 (1966), 238–243. [26] Ellis Batten Page. 1994. Computer Grading of Student Prose, Using Modern Concepts and Software. Journal of Experimental Education 62, 2 (1994), 127–142. [27] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. 2019. Relational Knowledge Distillation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. 3967–3976. 2355
WWW '21,April 19-23,2021,Ljubljana,Slovenia Zhiwei Jiang,Meng Liu,Yafeng Yin,Hua Yu,Zifeng Cheng,and Qing Gu [28]Jeffrey Pennington,Richard Socher,and Christopher D.Manning.2014.GloVe: [48]Attali Yigal,Bridgeman Brent,and Trapani Catherine.2010.Performance of a Global Vectors for Word Representation.In Proceedings of the 2014 Conference on Generic Approach in Automated Essay Scoring.Journal of Technology Learning Empirical Methods in Natural Language Processing (EMNLP).1532-1543 &Assessment10,3(2010),17. [29]Isaac Persing.Alan Davis,and Vincent Ng.2010.Modeling organization in [49]Li Yuan,Francis E.H.Tay,Guilin Li,Tao Wang,and Jiashi Feng.2020.Revisiting student essays.In Proceedings of the 2010 Conference on Empirical Methods in Knowledge Distillation via Label Smoothing Regularization.In 2020 IEEE/CVF Natural Language Processing.229-239. Conference on Computer Vision and Pattern Recognition,CVPR 2020,Seattle,WA, [30]Isaac Persing and Vincent Ng.2014.Modeling prompt adherence in student essays U54,7ne13-19,2020.IEEE.3902-3910. In Proceedings of the 52nd Annual Meeting of the Association for Computational [50]Sergey Zagoruyko and Nikos Komodakis.2017.Paying More Attention to Atten- Linguistics.1534-1543. tion:Improving the Performance of Convolutional Neural Networks via Attention [31]Matthew E.Peters,Mark Neumann,MohitI lyyer,Matt Gardn Kenton Lee and Luke Zettlemover 2018 Deep Contertualized w Transfer In 5th International Conference on Learning Representations,ICLR 2017. Toulon,France,April 24-26,2017,Conference Track Proceedings. resentations.In Proceedings of the 2018 Conference of the North American Chapter [51]Ying Zhang.Tao Xiang,Timothy M.Hospedales,and Huchuan Lu.2018.Deep of the Association for Computational Linguistics:Human Language Technologies, Mutual Learning.In 2018 IEEE Conference on Computer Vision and Pattern Recog- NAACL-HLT 2018.New Orleans,Louisiana,USA,June 1-6,2018,Volume 1 (Long nition,CVPR 2018.Salt Lake City.UT.USA.June 18-22.2018.4320-4328. Papers12227-2237 [52]Dengyong Zhou,Olivier Bousquet,Thomas Na vin Lal,Jason Weston,and Bern [32]Peter Phandi,Kian Ming Adam Chai,and Hwee Tou Ng.2015.Flexible Domain hard Sch Olkopf.2003.Learning with Local and Global Consistency.Advances in Adaptation for Automated Essay Scoring Using Correlated Linear Regression neural information processing systems 16.3(2003). In Proceedings of the 2015 Conference on Empirical Methods in Natural Language [53]X.Zhu.2002.Learning from Labeled and Unlabeled Data with Label Propagation. Processing.431-439. Tech Report (2002). [33]Antti Rasmus,Mathias Berglund,Mikko Honkala,Harri Valpola,and Tapan Raiko.2015.Semi-supervised Learning with Ladder Networks.In Advances in Neural Information Processi Systems 28:Annual Conference on Neural Infor mation Processing Systems 2015,December 7-12,2015,Montreal,Quebec,Canada. 3546-3554. [34]Pedro Uria Rodriguez,Amir Jafari,and Christopher M.Ormerod.2019.Lan- guage models and Automated Essay Scoring. .CoRR abs/1909.09482(2019). rXi1909.09482htp:axiv.org/ahs/1909.09482 [35]Haggai Roitman,Guy Feigenblat,Doron Cohen,Odellia Boni,and David Konop- nicki.2020.Unsupervised Dual-Cascade Learning with Pseudo-Feedback Dis- tillation for Query-Focused Extractive Summarization.In WWW '20:The Wel Conference 2020,Taipei,Taiwan,April 20-24,2020.2577-2584. [36]Adriana Romero,Nicolas Ballas,Samira Ebrahimi Kahou,Antoine Chassang. Carlo Gatta,and Yoshua Bengio.2015.FitNets:Hints for Thin Deep Nets.In 3rd International Conference on Learning Representations,ICLR 2015,San Diego,CA USA,May 7-9,2015,Conference Track Proceedings [37]Jake Snell,Kevin Swersky.and Richard S.Zemel.2017.Prototypical Networks for Few-shot Learning.In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017,4-9 December 2017,Long Beach,CA,USA,Isabelle Guyon,Ulrike von Luxburg.Samy Bengic Hanna M.Wallach,Rob Fergus,S.V.N.Vishwanathan,and Roman Garnett (Eds.). 4077-4087 [38]Swapna Somasundaran,Jill Burstein,and Martin Chodorow.2014.Lexical chain- ing for measuring discourse coherence quality in test-taker essays.In Proceedings of the 25th International conference on computational linguistics.950-961 [39]Kaveh Taghipour and Hwee Tou Ng.2016.A Neural Approach to Automated Essay Scoring.In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.1882-1891. [40]Antti Tarvainen and Harri Valpola.2017.Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results.In Advances in neural information processing systems.1195-1204. [41]Yi Tay,Minh C.Phan,Luu Anh Tuan,and Siu Ch ng Hui.2018.SkipFlow Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring In Proceedings of the 32nd Conference on Artificial Intelligence(AAAl-18).5948- 5955. [42]Vladimir Vapnik.1999.An overview of statistical learning theory.IEEE Trans. Neural Networks 10,5(1999).988-999. [43]Ashish Vaswani,Noam Shazeer,Niki Parmar,Jakob Uszkoreit,Llion Jones Aidan N.Gomez,Lukasz Kaiser, and Illia Polosukhin.2017.Attention is All you Need.In Advances in Neural Information Processing Systems 30:Annual Con- ference on Neural Information Processing Systems 2017,4-9 December 2017,Long Beach,CA,USA.5998-6008. [44]Yucheng Wang.Zhongyu Wei,Yaqian Zhou,and Xuanjing Huang.2018.Auto- matic Essay Scoring Incorporating Rating Scher ent Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.791-797. [45]Qizhe Xie,Zihang Dai,Eduard Hovy,Minh-Thang Luong.and Quoc V Le 2019.Unsupervised data augmentation for consistency training.arXiv preprint rXiw1904.12848(2019). [46]Ruochen Xu and Yiming Yang.2017.Cross-lingual Distillation for Text Classifica- tion.In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics,ACL 2017,Vancouver,Canada,July 30-August 4,Volume 1:Long Papers. 1415-1425. [47]Helen Yannakoudakis and Ted Briscoe.2012.Modeling coherence in ESOL learner texts.In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP,BEA@NAACL-HLT 2012 June 7,2012,Mon treal Canada Joel R.Tetreault,Jill Burstein,and Claudia Leacock(Eds.).The Association for Computer Linguistics,33-43. 2356
WWW ’21, April 19–23, 2021, Ljubljana, Slovenia Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu [28] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. [29] Isaac Persing, Alan Davis, and Vincent Ng. 2010. Modeling organization in student essays. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. 229–239. [30] Isaac Persing and Vincent Ng. 2014. Modeling prompt adherence in student essays. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 1534–1543. [31] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers). 2227–2237. [32] Peter Phandi, Kian Ming Adam Chai, and Hwee Tou Ng. 2015. Flexible Domain Adaptation for Automated Essay Scoring Using Correlated Linear Regression. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 431–439. [33] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised Learning with Ladder Networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada. 3546–3554. [34] Pedro Uria Rodriguez, Amir Jafari, and Christopher M. Ormerod. 2019. Language models and Automated Essay Scoring. CoRR abs/1909.09482 (2019). arXiv:1909.09482 http://arxiv.org/abs/1909.09482 [35] Haggai Roitman, Guy Feigenblat, Doron Cohen, Odellia Boni, and David Konopnicki. 2020. Unsupervised Dual-Cascade Learning with Pseudo-Feedback Distillation for Query-Focused Extractive Summarization. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020. 2577–2584. [36] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2015. FitNets: Hints for Thin Deep Nets. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. [37] Jake Snell, Kevin Swersky, and Richard S. Zemel. 2017. Prototypical Networks for Few-shot Learning. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 4077–4087. [38] Swapna Somasundaran, Jill Burstein, and Martin Chodorow. 2014. Lexical chaining for measuring discourse coherence quality in test-taker essays. In Proceedings of the 25th International conference on computational linguistics. 950–961. [39] Kaveh Taghipour and Hwee Tou Ng. 2016. A Neural Approach to Automated Essay Scoring. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1882–1891. [40] Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems. 1195–1204. [41] Yi Tay, Minh C. Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. SkipFlow: Incorporating Neural Coherence Features for End-to-End Automatic Text Scoring. In Proceedings of the 32nd Conference on Artificial Intelligence(AAAI-18). 5948– 5955. [42] Vladimir Vapnik. 1999. An overview of statistical learning theory. IEEE Trans. Neural Networks 10, 5 (1999), 988–999. [43] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA. 5998–6008. [44] Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018. Automatic Essay Scoring Incorporating Rating Schema via Reinforcement Learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 791–797. [45] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848 (2019). [46] Ruochen Xu and Yiming Yang. 2017. Cross-lingual Distillation for Text Classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. 1415–1425. [47] Helen Yannakoudakis and Ted Briscoe. 2012. Modeling coherence in ESOL learner texts. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, BEA@NAACL-HLT 2012, June 7, 2012, Montréal, Canada, Joel R. Tetreault, Jill Burstein, and Claudia Leacock (Eds.). The Association for Computer Linguistics, 33–43. [48] Attali Yigal, Bridgeman Brent, and Trapani Catherine. 2010. Performance of a Generic Approach in Automated Essay Scoring. Journal of Technology Learning & Assessment 10, 3 (2010), 17. [49] Li Yuan, Francis E. H. Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2020. Revisiting Knowledge Distillation via Label Smoothing Regularization. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. IEEE, 3902–3910. [50] Sergey Zagoruyko and Nikos Komodakis. 2017. Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. [51] Ying Zhang, Tao Xiang, Timothy M. Hospedales, and Huchuan Lu. 2018. Deep Mutual Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. 4320–4328. [52] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Sch Olkopf. 2003. Learning with Local and Global Consistency. Advances in neural information processing systems 16, 3 (2003). [53] X. Zhu. 2002. Learning from Labeled and Unlabeled Data with Label Propagation. Tech Report (2002). 2356