model [13]. Since then, it has been widely adopted in a variety of learning tasks [18, 35, 46]. Recently, several approaches [27, 36, 50] have been proposed to improve the performance of knowledge distillation. They address how to better extract information from teacher networks and deliver it to students, using the activations of intermediate layers [36], attention maps [50], or relational information between training examples [27]. In addition, instead of transferring information from teacher to student, Zhang et al. [51] proposed a mutual learning strategy. Our work differs from existing approaches in that we enforce the student model to learn a unimodal distribution rather than the output distribution of the teacher model.
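For illustration, one standard way to build such a unimodal target from a hard pseudo-label, in the spirit of Beckham and Pal [3], is to softmax negative distances to the target score. The sketch below is an assumed construction for exposition only; the function name and the temperature tau are illustrative, not the exact form used in TGOD:

```python
import numpy as np

def unimodal_target(k: int, num_classes: int, tau: float = 1.0) -> np.ndarray:
    """Turn a hard pseudo-label k into a unimodal distribution over scores.

    Probability mass decays monotonically with distance from k, so the
    distribution has a single peak at k; neighboring scores keep some
    mass while distant scores get almost none.
    """
    scores = np.arange(num_classes)
    logits = -np.abs(scores - k) / tau      # distance-based logits
    exp = np.exp(logits - logits.max())     # numerically stable softmax
    return exp / exp.sum()

# Example: pseudo-label 3 on a 6-level score scale.
print(unimodal_target(3, 6, tau=0.5).round(3))
```

A target of this shape penalizes the student less for predictions close to the pseudo-label than for distant ones, which matches the ordinal nature of essay scores.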
5.3 Semi-Supervised Learning

Semi-Supervised Learning (SSL) aims to label unlabeled data using knowledge learned from a small amount of labeled data combined with a large amount of unlabeled data. SSL has two settings: transductive inference and inductive inference. Transductive inference, first introduced by [42], aims to infer the labels of unlabeled data directly from the labeled data. Classical methods include the Transductive Support Vector Machine (TSVM) [15] and graph-based label propagation [12, 52, 53]. Recently, a neural version of graph-based label propagation has been developed [24]. Inductive inference, in contrast, aims to train an inductive model on both labeled and unlabeled data. It has developed rapidly in recent years, and many effective methods have been proposed, such as Pseudo-Label [21], the Γ Model [33], Mean Teacher [40], MixMatch [4], and UDA [45].
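For reference, classic graph-based label propagation has the closed form F = (I - αS)^{-1} Y, where S = D^{-1/2} W D^{-1/2} is the symmetrically normalized affinity matrix and Y stacks one-hot labels for labeled nodes and zero rows for unlabeled ones. A minimal NumPy sketch under these standard definitions follows; the names are illustrative, and this is not code from any of the cited systems:

```python
import numpy as np

def propagate_labels(W: np.ndarray, Y: np.ndarray, alpha: float = 0.99) -> np.ndarray:
    """Closed-form graph label propagation.

    W: (n, n) symmetric non-negative affinity matrix with zero diagonal.
    Y: (n, c) one-hot rows for labeled nodes, zero rows for unlabeled.
    Returns F = (I - alpha * S)^{-1} Y; row-wise argmax gives predictions.
    """
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # D^{-1/2} W D^{-1/2}
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)
```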
6 CONCLUSION

In this paper, we aim to perform essay scoring under the one-shot setting.
To this end, we propose the TGOD framework, which trains a student neural AES model by distilling the knowledge of a semi-supervised teacher model. To alleviate the negative effect of erroneous pseudo labels on the student model, we introduce the label selection and ordinal distillation strategies. Experimental results demonstrate the effectiveness of the proposed TGOD framework for one-shot essay scoring. In the future, we will try to improve the performance of the teacher and student models through co-training or self-supervised learning.
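The label selection strategy is only named at this level of detail; as a rough sketch, a generic confidence filter over the teacher's output distributions could look like the following. The threshold and the function name are assumptions, and TGOD's actual selection rule may differ:

```python
import numpy as np

def select_confident(probs: np.ndarray, threshold: float = 0.9):
    """Keep pseudo-labeled essays whose peak teacher probability is high.

    probs: (n, c) teacher output distributions over score levels.
    Returns indices of retained essays and their hard pseudo-labels.
    """
    confidence = probs.max(axis=1)
    keep = np.flatnonzero(confidence >= threshold)
    return keep, probs.argmax(axis=1)[keep]
```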
ACKNOWLEDGMENTS

This work is supported by National Natural Science Foundation of China under Grant Nos. 61906085, 61802169, 61972192, 41972111; JiangSu Natural Science Foundation under Grant No. BK20180325; and the Second Tibetan Plateau Scientific Expedition and Research Program under Grant No. 2019QZKK0204. This work is partially supported by the Collaborative Innovation Center of Novel Software Technology and Industrialization.

REFERENCES

[1] Dimitrios Alikaniotis, Helen Yannakoudakis, and Marek Rei. 2016. Automatic Text Scoring Using Neural Networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 715–725.
[2] Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater® v.2.0. Journal of Technology, Learning, and Assessment 4, 2 (2006), i–21.
[3] Christopher Beckham and Christopher J. Pal. 2017. Unimodal Probability Distributions for Deep Ordinal Classification. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 411–419.
[4] David Berthelot, Nicholas Carlini, Ian J. Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada. 5050–5060.
[5] Yue Cao, Hanqi Jin, Xiaojun Wan, and Zhiwei Yu. 2020. Domain-Adaptive Neural Automated Essay Scoring. In SIGIR '20: The 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval.
[6] Fan Chung. 1997. Spectral Graph Theory. Published for the Conference Board of the Mathematical Sciences by the American Mathematical Society.
[7] Madalina Cozma, Andrei M. Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 503–509.
[8] Fei Dong and Yue Zhang. 2016. Automatic Features for Essay Scoring - An Empirical Study. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 1072–1077.
[9] Fei Dong, Yue Zhang, and Jie Yang. 2017. Attention-based Recurrent Convolutional Neural Network for Automatic Essay Scoring. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017). 153–162.
[10] Youmna Farag, Helen Yannakoudakis, and Ted Briscoe. 2018. Neural Automated Essay Scoring and Coherence Modeling for Adversarially Crafted Input. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018. 263–271.
[11] Peter W. Foltz, Darrell Laham, and Thomas K. Landauer. 1999. Automated Essay Scoring: Applications to Educational Technology. In Proceedings of EdMedia + Innovate Learning 1999, Betty Collis and Ron Oliver (Eds.). Association for the Advancement of Computing in Education (AACE), Seattle, WA, USA, 939–944.
[12] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. 2015. Transductive Multi-View Zero-Shot Learning. IEEE Trans. Pattern Anal. Mach. Intell. 37, 11 (2015), 2332–2345.
[13] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531 (2015).
[14] Cancan Jin, Ben He, Kai Hui, and Le Sun. 2018. TDNN: A Two-stage Deep Neural Network for Prompt-independent Automated Essay Scoring. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 1088–1097.
[15] Thorsten Joachims. 1999. Transductive Inference for Text Classification using Support Vector Machines. In Proceedings of the International Conference on Machine Learning.
[16] Joseph B. Kadane. 2014. Sums of Possibly Associated Bernoulli Variables: The Conway-Maxwell-Binomial Distribution. Bayesian Analysis 11, 2 (2014).
[17] Zixuan Ke and Vincent Ng. 2019. Automated Essay Scoring: A Survey of the State of the Art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence. 6300–6308.
[18] Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, and Phil Blunsom. 2019. Scalable Syntax-Aware Language Models Using Knowledge Distillation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28-August 2, 2019, Volume 1: Long Papers. 3472–3484.
[19] Darrell Laham and Peter Foltz. 2003. Automated scoring and annotation of essays with the Intelligent Essay Assessor.
[20] Quoc V. Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014. 1188–1196.
[21] Dong-Hyun Lee. 2013. Pseudo-Label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
[22] Xiaofeng Liu, Fangfang Fan, Lingsheng Kong, Zhihui Diao, Wanqing Xie, Jun Lu, and Jane You. 2020. Unimodal regularized neuron stick-breaking for ordinal classification. Neurocomputing 388 (2020), 34–44.
[23] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
[24] Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, Eunho Yang, Sung Ju Hwang, and Yi Yang. 2019. Learning to Propagate Labels: Transductive Propagation Network for Few-Shot Learning. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.
[25] Ellis B. Page. 1966. The imminence of... grading essays by computer. The Phi Delta Kappan 47, 5 (1966), 238–243.
[26] Ellis Batten Page. 1994. Computer Grading of Student Prose, Using Modern Concepts and Software. Journal of Experimental Education 62, 2 (1994), 127–142.
[27] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. 2019. Relational Knowledge Distillation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. 3967–3976.