(McLaughlin,1969)and the FK formula (Kincaid,Fish- Graph-Based Label Propagation burne,Rogers,Chissom,1975).A key observation in these studies is that the vocabulary used in a document Graph-based label propagation is applied on a graph to usually determines its readability (Pitler Nenkova. propagate class labels from labeled nodes to unlabeled 2008).A general way of using the vocabulary is the statis- ones (Subramanya,Petrov,Pereira,2010).It has been tical language model (Collins-Thompson Callan,2004; successfully applied in various applications,such as dictio- Kidwell,Lebanon,Collins-Thompson,2009).More nary construction (Kim,Verma,Yeh,2013),word seg- recently,researchers have explored complex linguistic fea- mentation and tagging (Zeng,Wong,Chao,Trancoso, 2013),and sentiment classification (Ponomareva Thel- tures combined with classification models to obtain robust and effective readability prediction methods (Denning, wall,2012).Typically,a graph-based label propagation Pera,Ng,2016;Feng,Jansche,Huenerfauth,Elhadad, method consists of two main steps:graph construction and 2010;Schwarm Ostendorf,2005).While most studies label propagation.During the first step,some forms of are conducted for English,there are studies for other lan- edge weight estimation and edge pruning are required to guages,such as French(Francois Fairon,2012),German build an efficient graph (Jebara,Wang,Chang,2009; (Hancke,Vajjala,Meurers,2012),and Bangla (Sinha, Ponomareva Thelwall,2012).In addition,nodes in the Dasgupta,&Basu,2014).In addition,researchers have graph can be heterogeneous;for example,both instance used the representation learning techniques for readability nodes and class nodes can coexist in a birelational graph assessment (Cha,Gwon,Kung,2017;Tseng,Hung, (Jiang,2011),and nodes (words)of English can be linked Sung,Chen,2016). to nodes (words)of the target language (Chinese)in a bilingual graph (Gao,Wei,Li,Liu,Zhou,2015).During the second step,propagation algorithms are required to propagate the label distributions to all the nodes (Kim The Bag-of-Words Model et al.,2013;Subramanya et al.,2010)so that the classes of The Bow model has been widely used for document unlabeled nodes can be predicted. classification owing to its simplicity and general applica- bility.It constructs a feature space that contains all the distinct words of a language (or text corpus).Tradition- The Proposed Method ally,it assumes that words are independent,while In this section,we first present the overview of the pro- recently,capturing the word coupling relationship has posed method (GRAW+).Then we describe two main attracted much attention (Cao,2015).Billhardt,Borrajo, parts of the method in detail:feature representation and and Maojo(2002)studied the coupling relationship based readability classification on the co-occurrence of words in the same documents. Kalogeratos and Likas(2012)generalized the relationship An Overview of the Method by taking into account the distance among words in the sentences of each document.Cheng,Miao,Wang,and The framework of GRAW+is depicted in Figure 2. Cao (2013)estimated the co-occurrence of words by min- GRAW+takes an auxiliary text corpus and a target docu- ing the transitive properties.Inspired by these studies,this ment set as inputs.The auxiliary text corpus contains unla- article modifies the Bow model for readability assess- beled sentences used to construct the word coupling ment,and provides the coupled BoW model incorporating matrix.The target document set contains both labeled and the co-occurrence of words in different levels of reading unlabeled documents on readability.The objective of difficulty. GRAW+is to predict the reading levels of the unlabeled Datasets Feature Representation Classification The Coupled Bag-of-Words Model raph Construction Auxiliary Text Corpus Constructing word coupling matrix Graph Merging Constructing Generating coupled bag-of-words model bag-of-words model Label Propagation Target Document Set Extracting linguistic features Label for unlabeled documents FIG.2.The framework of GRAW+.[Color figure can be viewed at wileyonlinelibrary.com] JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY-May 2019 435 D0:10.1002/asi(McLaughlin, 1969) and the FK formula (Kincaid, Fishburne, Rogers, & Chissom, 1975). A key observation in these studies is that the vocabulary used in a document usually determines its readability (Pitler & Nenkova, 2008). A general way of using the vocabulary is the statistical language model (Collins-Thompson & Callan, 2004; Kidwell, Lebanon, & Collins-Thompson, 2009). More recently, researchers have explored complex linguistic features combined with classification models to obtain robust and effective readability prediction methods (Denning, Pera, & Ng, 2016; Feng, Jansche, Huenerfauth, & Elhadad, 2010; Schwarm & Ostendorf, 2005). While most studies are conducted for English, there are studies for other languages, such as French (François & Fairon, 2012), German (Hancke, Vajjala, & Meurers, 2012), and Bangla (Sinha, Dasgupta, & Basu, 2014). In addition, researchers have used the representation learning techniques for readability assessment (Cha, Gwon, & Kung, 2017; Tseng, Hung, Sung, & Chen, 2016). The Bag-of-Words Model The BoW model has been widely used for document classification owing to its simplicity and general applicability. It constructs a feature space that contains all the distinct words of a language (or text corpus). Traditionally, it assumes that words are independent, while recently, capturing the word coupling relationship has attracted much attention (Cao, 2015). Billhardt, Borrajo, and Maojo (2002) studied the coupling relationship based on the co-occurrence of words in the same documents. Kalogeratos and Likas (2012) generalized the relationship by taking into account the distance among words in the sentences of each document. Cheng, Miao, Wang, and Cao (2013) estimated the co-occurrence of words by mining the transitive properties. Inspired by these studies, this article modifies the BoW model for readability assessment, and provides the coupled BoW model incorporating the co-occurrence of words in different levels of reading difficulty. Graph-Based Label Propagation Graph-based label propagation is applied on a graph to propagate class labels from labeled nodes to unlabeled ones (Subramanya, Petrov, & Pereira, 2010). It has been successfully applied in various applications, such as dictionary construction (Kim, Verma, & Yeh, 2013), word segmentation and tagging (Zeng, Wong, Chao, & Trancoso, 2013), and sentiment classification (Ponomareva & Thelwall, 2012). Typically, a graph-based label propagation method consists of two main steps: graph construction and label propagation. During the first step, some forms of edge weight estimation and edge pruning are required to build an efficient graph (Jebara, Wang, & Chang, 2009; Ponomareva & Thelwall, 2012). In addition, nodes in the graph can be heterogeneous; for example, both instance nodes and class nodes can coexist in a birelational graph (Jiang, 2011), and nodes (words) of English can be linked to nodes (words) of the target language (Chinese) in a bilingual graph (Gao, Wei, Li, Liu, & Zhou, 2015). During the second step, propagation algorithms are required to propagate the label distributions to all the nodes (Kim et al., 2013; Subramanya et al., 2010) so that the classes of unlabeled nodes can be predicted. The Proposed Method In this section, we first present the overview of the proposed method (GRAW+). Then we describe two main parts of the method in detail: feature representation and readability classification. An Overview of the Method The framework of GRAW+ is depicted in Figure 2. GRAW+ takes an auxiliary text corpus and a target document set as inputs. The auxiliary text corpus contains unlabeled sentences used to construct the word coupling matrix. The target document set contains both labeled and unlabeled documents on readability. The objective of GRAW+ is to predict the reading levels of the unlabeled FIG. 2. The framework of GRAW+. [Color figure can be viewed at wileyonlinelibrary.com] JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2019 DOI: 10.1002/asi 435