WWW '21, April 19–23, 2021, Ljubljana, Slovenia
Zhiwei Jiang, Meng Liu, Yafeng Yin, Hua Yu, Zifeng Cheng, and Qing Gu

topic, and difficulty, these cross-prompt methods often perform worse than the prompt-specific methods [9]. Tackling the domain adaptation among prompts is a challenging problem, and there are some recent studies focusing on this line of work [5, 14].

In this paper, we consider another way that uses no data from other prompts. Given a set of essays written to a target prompt, we consider whether we can score all essays based only on a few manually scored essays. In the extreme, we consider the one-shot scenario, that is, only one manually scored essay per distinct score is given. In practical writing tests, scoring staff usually evaluate the essays by first designing criteria specific to the current test and then applying the criteria to score the essays.
To alleviate the burden on scoring staff, we expect to first let the scoring staff express the criteria through one-shot manual scoring, and then use a specially designed AES model to score the remaining essays based on the one-shot data.

One-shot AES is a challenging task, since the one-shot labeled data is insufficient to train an effective neural AES model. To solve this problem, our intuition is to augment the one-shot labeled data with some pseudo-labeled data, and then perform model training on the augmented labeled data. There are two obvious challenges: one is how to acquire the pseudo-labeled data, and the other is how to alleviate the disturbance brought by erroneous pseudo labels during model training.

To this end, we propose a Transductive Graph-based Ordinal Distillation (TGOD) framework for one-shot automated essay scoring, which is designed based on a teacher-student mechanism (i.e., knowledge distillation) [13]. Specifically, we employ a transductive graph-based model [52, 53] as the teacher model to generate pseudo labels, and then train the neural AES model (the student model) by combining the pseudo labels and the one-shot labels. Considering that there may be many erroneous labels among the pseudo labels, we select the pseudo labels with high confidence to improve their quality. Besides, considering that the scores are on an ordinal scale and an essay is likely to be assigned a score near its ground-truth score (e.g., 3 is easily predicted as 2 or 4), we propose an ordinal-aware unimodal distillation strategy to tolerate pseudo labels with minor errors.

The major contributions of this paper are summarized as follows:

• For one-shot automated essay scoring, we propose a distillation framework based on graph propagation, which alleviates the dependence of supervised neural AES models on labeled data by utilizing unlabeled data.
• We propose the label selection and the ordinal-aware unimodal distillation strategies to alleviate the effect of erroneous pseudo labels on the final AES model.

• The TGOD framework places no limitation on the architecture of the student model, and thus can be applied to many existing neural AES models. Experimental results on a public dataset demonstrate that our framework can effectively improve the performance of several classical neural AES models under the one-shot AES setting.

2 PROBLEM DEFINITION

We first introduce some notation and formalize the one-shot automated essay scoring (AES) problem. Let X = {x_i}_{i=1}^{N} denote a set of essays written to a certain prompt, Y = {1, 2, ..., K} denote a set of pre-defined scores (labels) on an ordinal scale, and (x, y) denote an essay and its ground-truth score (label), respectively. For one-shot AES, we assume that we are given a set of one-shot labeled data D_o = {(x_i, y_i = i)}_{i=1}^{K}, where the set X_o = {x_i | (x_i, y_i) ∈ D_o} is a subset of X (i.e., X_o ⊆ X), and the essay x ∈ X_o with y = i is the one-shot essay for the distinct score (label) i ∈ Y. Apart from the one-shot labeled essays X_o, the remaining essays in X constitute the unlabeled essay set X_u = {x_i}_{i=1}^{N_u}, and thus X_u ∪ X_o = X. The goal of one-shot AES is to learn a function F to predict the scores (labels) of the unlabeled essays x ∈ X_u, based on the one-shot labeled data D_o and the essay set X, by

    ŷ = F(x; D_o, X).    (1)

Typical AES approaches based on supervised learning would remove X and replace D_o with a statistic θ* = θ*(D_o) in Eq. 1, since they can usually learn a sufficient statistic θ* for the prediction p_θ*(y|x) based only on the labeled data D_o. However, this is not the case in the one-shot setting, since only a few labeled examples are given in D_o, which is insufficient to train a statistic θ* with good generalization.
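As a minimal illustration of this data setup (the essay texts, labels, and the helper name `one_shot_split` below are hypothetical placeholders, not part of the paper's implementation), the partition of X into the one-shot labeled set D_o and the unlabeled set X_u can be sketched as:

```python
def one_shot_split(essays, labels, num_scores):
    """Split essays into one-shot labeled data D_o and unlabeled set X_u.

    D_o holds exactly one (essay, score) pair per distinct score in
    Y = {1, ..., K}; every remaining essay goes to X_u with its label
    hidden, so that X_o ∪ X_u covers all of X.
    """
    D_o, X_u = [], []
    seen = set()
    for x, y in zip(essays, labels):
        if y not in seen and 1 <= y <= num_scores:
            seen.add(y)         # first essay observed for this score
            D_o.append((x, y))  # one-shot labeled example
        else:
            X_u.append(x)       # unlabeled essay (ground truth withheld)
    return D_o, X_u

essays = ["essay_a", "essay_b", "essay_c", "essay_d", "essay_e"]
labels = [1, 2, 2, 3, 1]  # ground-truth scores, K = 3
D_o, X_u = one_shot_split(essays, labels, num_scores=3)
# D_o -> [("essay_a", 1), ("essay_b", 2), ("essay_d", 3)]
# X_u -> ["essay_c", "essay_e"]
```

In this sketch, F would receive D_o together with the full essay set, matching the transductive form of Eq. 1 rather than the purely supervised form that discards X.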
We therefore exploit both the one-shot labeled data D_o and the unlabeled essays X_u ⊆ X to learn the prediction function F, and thus adopt the more general form of F in Eq. 1.

3 THE TGOD FRAMEWORK

In this section, we introduce the proposed TGOD framework, followed by its technical details.

3.1 An Overview of TGOD

TGOD is designed based on the teacher-student mechanism. It enables a supervised neural student model to benefit from a semi-supervised teacher model under the one-shot essay scoring setting. While the one-shot labeled data is insufficient to train the supervised neural student model, the student model can be trained by distilling the knowledge of the semi-supervised teacher model on the unlabeled essays. Through a specially designed ordinal distillation strategy, the supervised neural student model can even outperform the semi-supervised teacher model.

Specifically, as shown in Figure 1, TGOD contains three main components: the Teacher Model, which exploits the manifold structure among labeled and unlabeled essays based on graphs and generates pseudo labels of unlabeled essays for distillation; the Student Model, which tackles the essay scoring problem as an ordinal classification problem and makes a unimodal distribution prediction for essays; and the Ordinal Distillation, which distills the unimodally smoothed outputs of the Teacher Model into the Student Model. In the following, we introduce these components of TGOD in technical detail.

3.2 Graph-Based Label Propagation (Teacher)

We introduce the Teacher Model illustrated in Figure 1, which is a graph-based label propagation model and consists of three components: multiple graph construction that models the relationship among essays from multiple aspects; label propagation that spreads