Enriching Word Embeddings with Domain Knowledge for Readability Assessment

Zhiwei Jiang, Qing Gu*, Yafeng Yin, Daoxu Chen
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
jiangzhiwei@outlook.com, {guq,yafeng,cdx}@nju.edu.cn

Abstract

In this paper, we present a method which learns word embeddings for readability assessment. Existing word embedding models typically focus on the syntactic or semantic relations of words while ignoring reading difficulty, so they may not be suitable for readability assessment. Hence, we provide the knowledge-enriched word embedding (KEWE), which encodes knowledge on reading difficulty into the representation of words. Specifically, we extract knowledge on word-level difficulty from three perspectives to construct a knowledge graph, and develop two word embedding models that incorporate the difficulty context derived from the knowledge graph to define the loss functions. Experiments apply KEWE to readability assessment on both English and Chinese datasets, and the results demonstrate both the effectiveness and the potential of KEWE.

1 Introduction

Readability assessment is a classic problem in natural language processing which has attracted many researchers' attention in recent years (Todirascu et al., 2016; Schumacher et al., 2016; Cha et al., 2017). The objective is to evaluate the readability of texts by levels or scores. The majority of recent readability assessment methods are based on the framework of supervised learning (Schwarm and Ostendorf, 2005) and build classifiers from hand-crafted features extracted from the texts. The performance of these methods depends on designing effective features to build high-quality classifiers.
Designing hand-crafted features is essential but labor-intensive. It is desirable to learn representative features from the texts automatically. For document-level readability assessment, an effective feature learning method is to construct the representation of documents by combining the representations of the words they contain (Kim, 2014). For word representation, a useful technique is to learn it as a dense, low-dimensional vector, known as a word embedding. Existing word embedding models (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014) can be used for readability assessment, but their effectiveness is compromised by the fact that these models typically focus on the syntactic or semantic relations of words while ignoring reading difficulty. As a result, words with similar functions or topics, such as "man" and "gentleman", are mapped to close vectors even though their reading difficulties differ. This calls for incorporating knowledge on reading difficulty when training the word embedding.

In this paper, we provide the knowledge-enriched word embedding (KEWE) for readability assessment, which encodes knowledge on reading difficulty into the representation of words. Specifically, we define word-level difficulty from three perspectives, and use the extracted knowledge to construct a knowledge graph. After that, we derive the difficulty context of words from the knowledge graph, and develop two word embedding models that incorporate the difficulty context to define the loss functions.

We apply KEWE to document-level readability assessment under the supervised framework. The experiments are conducted on four datasets in either English or Chinese. The results demonstrate that

*Corresponding Author
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
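The pipeline described above (difficulty knowledge graph, then difficulty contexts, then an embedding loss) can be illustrated with a toy sketch. Everything below is an illustrative assumption rather than the paper's implementation: the graph, its edges, the random-walk procedure for deriving contexts, and all parameter values are hypothetical, chosen only to show how graph-derived contexts differ from plain text co-occurrence.

```python
import random

# Toy difficulty knowledge graph (assumed, not from the paper):
# words are linked when they are judged to share a similar
# reading-difficulty level.
difficulty_graph = {
    "man": ["boy", "dog"],
    "boy": ["man", "dog"],
    "dog": ["man", "boy"],
    "gentleman": ["acquaintance"],
    "acquaintance": ["gentleman"],
}

def random_walks(graph, num_walks=10, walk_len=5, seed=0):
    """Generate truncated random walks over the knowledge graph."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in graph:
            walk = [start]
            while len(walk) < walk_len:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

def difficulty_contexts(walks, window=2):
    """Collect (target, context) pairs from the walks, skip-gram style.
    Such pairs would feed the loss function of an embedding model."""
    pairs = []
    for walk in walks:
        for i, target in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            pairs.extend((target, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

walks = random_walks(difficulty_graph)
pairs = difficulty_contexts(walks)
# "man" and "gentleman" sit in disconnected components of this graph,
# so they never appear as each other's difficulty context, unlike in
# an ordinary text corpus where they share topical contexts.
print(any(p == ("man", "gentleman") for p in pairs))  # False
```

Under this sketch, the embedding model would pull words toward their graph-derived difficulty contexts rather than (or in addition to) their textual contexts, which is the intuition behind separating "man" from "gentleman" in the learned space.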
Proceedings of the 27th International Conference on Computational Linguistics, pages 366-378, Santa Fe, New Mexico, USA, August 20-26, 2018.