GRAW+: A Two-View Graph Propagation Method With Word Coupling for Readability Assessment

Zhiwei Jiang, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: jiangzhiwei@outlook.com
Qing Gu, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: guq@nju.edu.cn
Yafeng Yin, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: yafeng@nju.edu.cn
Jianxiang Wang, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: wjxnju@outlook.com
Daoxu Chen, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China. E-mail: cdx@nju.edu.cn

Existing methods for readability assessment usually construct inductive classification models to assess the readability of individual text documents based on extracted features, and these models have been demonstrated to be effective. However, they rarely make use of the interrelationship among documents on readability, which can help increase the accuracy of readability assessment. In this article, we adopt a graph-based classification method to model and utilize the relationship among documents using the coupled bag-of-words model. We propose a word coupling method to build the coupled bag-of-words model by estimating the correlation between words on reading difficulty. In addition, we propose a two-view graph propagation method to make use of both the coupled bag-of-words model and the linguistic features. Our method employs a graph merging operation to combine graphs built according to different views, and improves label propagation by incorporating the ordinal relation among reading levels. Experiments were conducted on both English and Chinese data sets, and the results demonstrate both the effectiveness and the potential of the method.

Received March 26, 2017; revised June 20, 2018; accepted July 14, 2018. © 2019 ASIS&T. Published online February 18, 2019 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.24123. JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, 70(5):433–447, 2019.

Introduction

Readability assessment evaluates the reading difficulty of text documents, which is normally represented as discrete reading levels. Automatic readability assessment is a challenging task that has attracted researchers' attention since the beginning of the last century (Collins-Thompson, 2014). Traditionally, it can be used by educationists to choose appropriate reading materials for students of different education or grade levels. In modern times, it can be used by web search engines to perform personalized searches based on web users' educational backgrounds.

Existing methods for readability assessment usually concentrate on feature engineering and then apply inductive classification models to utilize the features. In the early stages, researchers proposed readability formulas to measure the readability of texts (Zakaluk & Samuels, 1988). These formulas are usually obtained by linear regression on several easy-to-compute text features relevant to reading difficulty. Recently, by employing machine-learning techniques, classification-based methods have been proposed and demonstrated to be more effective than readability formulas (Benjamin, 2012; Collins-Thompson, 2014). These methods
combine the rich representation of texts with sophisticated prediction models.

Most current methods assess the readability of text documents individually, and ignore the interrelationship among documents on readability, which can be useful in assessing the readability of documents based on the labeled ones. For example, two documents can be of the same reading level if they consist of words that have similar reading difficulty. Hence, we propose a graph propagation method for readability assessment, which can model and utilize the interrelationship among text documents.

To measure the relationship among documents, we use the bag-of-words (BoW) model, which is commonly used for text classification and clustering (Huang, 2008; Sebastiani, 2002). However, to measure the relationship on readability, the basic BoW model requires improvements, since it ignores the fact that different words may have similar reading difficulties. Figure 1 illustrates the improved use of the BoW model for readability assessment using a simple example. In Figure 1, the left matrix is built from the basic BoW model for three documents (that is, D1, D2, and D3) consisting of four tokens (that is, school, law, syllabus, and decree). Among the three documents, D1 and D2 are two relatively difficult documents, each containing two easy words (school or law) and two difficult words (syllabus or decree), while D3 is an easy document that contains two easy words (school). Calculating the cosine similarities based on the basic BoW model (the bottom left subfigure) shows that D1 is more similar to D3 than to D2, which is inconsistent with their similarities on reading difficulty.

FIG. 1. A motivation example of the word coupling method. The left matrix is a basic BoW matrix. The central matrix is a word coupling matrix. The right matrix is the coupled BoW matrix.

To overcome the shortcoming of the basic BoW model, we designed a word coupling method. As shown in Figure 1, the word coupling method first measures the similarities among words on reading difficulty (the word coupling matrix). Then the method makes words of similar difficulty (for example, school and law) share their occurrence frequencies with each other (by matrix multiplication), which leads to the coupled BoW model (the coupled BoW matrix). In this way, documents will be similar on readability if their words have similar distributions on reading difficulty.

To build the coupled BoW model, the key point is to construct the word coupling matrix. For this purpose, we first estimate the occurrence distributions of words in sentences of different reading difficulties, and then compute their similarities on reading difficulty based on these distributions.

Besides the coupled BoW model, linguistic features can also be adopted by our method. On the one hand, we use the linguistic features as a complement to the coupled BoW model to construct graphs from multiple views. On the other hand, the linguistic features are used to reinforce the label propagation algorithm by providing prior knowledge.

In this article, we propose a two-view graph propagation method with word coupling for readability assessment. Our contributions are as follows (a preliminary version of this work appeared in Jiang, Sun, Gu, Bai, & Chen, 2015). (i) We apply the graph-based method for readability assessment, which can make use of the interrelationship among documents to estimate their readability. (ii) We propose the coupled BoW model, which can be used to measure the similarity of documents on reading difficulty. (iii) We propose a two-view graph building strategy to make use of both the coupled BoW model and the linguistic features.
(iv) We propose a reinforced label propagation algorithm, which can make use of the ordinal relation among reading levels. Extensive experiments were carried out on data sets of both English and Chinese. Compared with the state-of-the-art methods, the results demonstrate both the effectiveness and the potential of our method.

Background and Related Work

Readability Assessment

Research on automatic readability assessment has spanned the last 70 years (Benjamin, 2012). Early research mainly focused on the design of readability formulas (Zakaluk & Samuels, 1988). Many well-known readability formulas have been developed, such as the SMOG formula
(McLaughlin, 1969) and the FK formula (Kincaid, Fishburne, Rogers, & Chissom, 1975). A key observation in these studies is that the vocabulary used in a document usually determines its readability (Pitler & Nenkova, 2008). A general way of using the vocabulary is the statistical language model (Collins-Thompson & Callan, 2004; Kidwell, Lebanon, & Collins-Thompson, 2009). More recently, researchers have explored complex linguistic features combined with classification models to obtain robust and effective readability prediction methods (Denning, Pera, & Ng, 2016; Feng, Jansche, Huenerfauth, & Elhadad, 2010; Schwarm & Ostendorf, 2005). While most studies are conducted for English, there are studies for other languages, such as French (François & Fairon, 2012), German (Hancke, Vajjala, & Meurers, 2012), and Bangla (Sinha, Dasgupta, & Basu, 2014). In addition, researchers have used representation learning techniques for readability assessment (Cha, Gwon, & Kung, 2017; Tseng, Hung, Sung, & Chen, 2016).

The Bag-of-Words Model

The BoW model has been widely used for document classification owing to its simplicity and general applicability. It constructs a feature space that contains all the distinct words of a language (or text corpus). Traditionally, it assumes that words are independent, while recently, capturing the word coupling relationship has attracted much attention (Cao, 2015). Billhardt, Borrajo, and Maojo (2002) studied the coupling relationship based on the co-occurrence of words in the same documents. Kalogeratos and Likas (2012) generalized the relationship by taking into account the distance among words in the sentences of each document. Cheng, Miao, Wang, and Cao (2013) estimated the co-occurrence of words by mining the transitive properties. Inspired by these studies, this article modifies the BoW model for readability assessment, and provides the coupled BoW model, which incorporates the co-occurrence of words at different levels of reading difficulty.

Graph-Based Label Propagation

Graph-based label propagation is applied on a graph to propagate class labels from labeled nodes to unlabeled ones (Subramanya, Petrov, & Pereira, 2010). It has been successfully applied in various applications, such as dictionary construction (Kim, Verma, & Yeh, 2013), word segmentation and tagging (Zeng, Wong, Chao, & Trancoso, 2013), and sentiment classification (Ponomareva & Thelwall, 2012). Typically, a graph-based label propagation method consists of two main steps: graph construction and label propagation. During the first step, some form of edge weight estimation and edge pruning is required to build an efficient graph (Jebara, Wang, & Chang, 2009; Ponomareva & Thelwall, 2012). In addition, nodes in the graph can be heterogeneous; for example, both instance nodes and class nodes can coexist in a birelational graph (Jiang, 2011), and nodes (words) of English can be linked to nodes (words) of the target language (Chinese) in a bilingual graph (Gao, Wei, Li, Liu, & Zhou, 2015). During the second step, propagation algorithms are required to propagate the label distributions to all the nodes (Kim et al., 2013; Subramanya et al., 2010) so that the classes of unlabeled nodes can be predicted.

The Proposed Method

In this section, we first present an overview of the proposed method (GRAW+). Then we describe the two main parts of the method in detail: feature representation and readability classification.

An Overview of the Method

The framework of GRAW+ is depicted in Figure 2.

FIG. 2. The framework of GRAW+.
GRAW+ takes an auxiliary text corpus and a target document set as inputs. The auxiliary text corpus contains unlabeled sentences used to construct the word coupling matrix. The target document set contains both labeled and unlabeled documents on readability. The objective of GRAW+ is to predict the reading levels of the unlabeled
documents in the target set based on the labeled ones. GRAW+ includes two main stages: feature representation and readability classification.

During the first stage, the documents in the data set are mapped into feature vectors from two distinct views: the cBoW (coupled BoW) view and the linguistic view. From the cBoW view, the coupled BoW model is required, and from the linguistic view, suitable linguistic features can be borrowed from previous studies. By representing documents from these two views, both word-level difficulty distributions and document-level readability features can be measured, which provides extensive information for readability assessment.

During the second stage, a two-view graph propagation method is proposed for readability classification, which consists of three main steps: graph construction, graph merging, and label propagation. In the first step, both labeled and unlabeled documents are used to build graphs, where each document is represented by a node, and their similarities on reading difficulty are represented by edge weights. In the second step, the intra-view merging operation is used to merge the homogeneous graphs within the same view, and the inter-view merging operation is used to merge the heterogeneous graphs from different views. In the third step, a label propagation algorithm is designed on the merged graph.

The Coupled Bag-of-Words Model

To build the cBoW model, we first construct the word coupling matrix. Then we transform the basic BoW model into the coupled BoW model using the word coupling matrix.

Constructing the word coupling matrix. As stated before, the word coupling matrix is constructed to represent the similarities of word pairs on reading difficulty. For this purpose, we assume the simple fact that easy words tend to appear in easy sentences, while difficult words tend to appear in difficult sentences. Hence, we can estimate the reading difficulty of a word by its distribution of occurrence probabilities in sentences from different difficulty levels. The difficulty distributions can then be used to estimate the similarities among words on reading difficulty by computing the distribution divergence. Since sentences with labeled difficulty levels are hard to acquire, we use unlabeled sentences instead, and label the sentences by estimating their difficulty levels with heuristic functions.

Figure 3 presents the three steps required to construct the word coupling matrix: per-sentence reading difficulty estimation, per-word difficulty distribution estimation, and word coupling matrix construction. In the first step, each sentence in the text corpus is assigned a weak label, which is a reading score computed by a heuristic function. In the second step, based on the weak labels, the difficulty distribution of each word is estimated according to its distribution of occurrences in these sentences. In the final step, the similarities among words are calculated using the distribution divergence.

FIG. 3. The process of representing documents using the coupled bag-of-words model.

Step 1: Per-sentence reading difficulty estimation. The precise estimation of sentence-level readability is a hard problem and has recently attracted the attention of many researchers (Pilán, Volodina, & Johansson, 2014; Schumacher, Eskenazi, Frishkoff, & Collins-Thompson, 2016; Vajjala & Meurers, 2014). For efficiency, we use heuristic functions to make a rough estimation.
Specifically, we consider the linguistic features designed for readability assessment that have been demonstrated to be effective in previous studies (Feng et al., 2010; Schumacher et al., 2016), and choose the most used linguistic features that can be computed at the sentence level to build the heuristic functions. In total, eight heuristic functions h ∈ {len, ans, anc, lv, atr, ntr, pth, anp}, corresponding to eight distinct features from three aspects, are used to compute the reading score of a sentence, as shown in Table 1.

TABLE 1. Three aspects of estimating the reading difficulty of sentences using heuristic functions.

Aspect      Function   Description
Surface     len(s)     the length of the sentence s.
            ans(s)     the average number of syllables (or strokes for Chinese) per word (or character for Chinese) in s.
            anc(s)     the average number of characters per word in s.
Lexical     lv(s)      the number of distinct types of POS (part of speech) in s.
            atr(s)     the ratio of adjectives in s.
            ntr(s)     the ratio of nouns in s.
Syntactic   pth(s)     the height of the syntax parse tree of s.
            anp(s)     the average number of (noun, verb, and preposition) phrases in s.

Let S denote the set of all the sentences ready for constructing the word coupling matrix. Given a sentence s ∈ S, its reading difficulty can be quantified as a reading score r(s) = h(s) by using one of the eight functions. The more difficult s is, the greater r(s) will be.

Considering that the reading score r(s) may be continuous, we discretize r(s) into several difficulty levels. Let η denote the predetermined number of difficulty levels, and let r^h_max and r^h_min, respectively, denote the maximum and minimum
reading score of all the sentences in S, where h refers to one of the eight functions. To determine the difficulty level l^h(s) (l^h(s) ∈ [1, η]) of a sentence s, the range [r^h_min, r^h_max] is evenly divided into η intervals. l^h(s) will be i if the reading score r(s) resides in the i-th interval. For each of the three aspects, we compute one l^*(s) for a sentence s by combining the heuristic functions using the following equations:

l^{sur}(s) = \max(l^{len}(s), l^{ans}(s), l^{anc}(s))
l^{lex}(s) = \max(l^{lv}(s), l^{atr}(s), l^{ntr}(s))    (1)
l^{syn}(s) = \max(l^{pth}(s), l^{anp}(s))

Step 2: Per-word difficulty distribution estimation. The difficulty distribution of each word is computed based on the sentence-level reading difficulty. Since each sentence contains many words and each word may appear in many sentences, we estimate the difficulty distributions of words according to their distributions of occurrences in sentences.

Let V denote the set of all the words appearing in S, and let p_t denote the difficulty distribution of a word (term) t ∈ V. p_t is a vector containing η (that is, the number of difficulty levels) values, the i-th of which can be calculated by Equation 2:

p_t(i) = \frac{1}{n_t} \sum_{s \in S} \delta(t \in s) \, \delta(l(s) = i)    (2)

where n_t refers to the number of sentences containing t. The indicator function δ(x) returns 1 if x is true and 0 otherwise.

Step 3: Word coupling matrix construction. Given the set of words V, a word coupling matrix is defined as C ∈ R^{|V|×|V|}, the elements of which reflect the correlation between pairs of words (that is, terms). The correlation between each pair of words can be computed according to the similarity of their difficulty distributions.

Given two words (terms) t_1 and t_2, whose difficulty distributions are p_{t_1} and p_{t_2}, respectively, we use a symmetric version of the Kullback–Leibler divergence (Kullback & Leibler, 1951) to measure their distribution difference, which averages the values of the divergence computed in both directions:

c_{KL}(t_1, t_2) = \frac{1}{2} \left( KL(p_{t_1} \| p_{t_2}) + KL(p_{t_2} \| p_{t_1}) \right)    (3)

where KL(p \| q) = \sum_i p(i) \log \frac{p(i)}{q(i)} and i is the element index. After that, the logistic function is applied to get the normalized distribution similarity, that is:

sim(t_1, t_2) = \frac{2}{1 + e^{c_{KL}(t_1, t_2)}}    (4)

Given a word t_i, only the λ other words with the highest correlation (similarity) are selected to build the neighbor set of t_i, denoted as N(t_i). If a word t_j is not selected (that is, t_j ∉ N(t_i)), the corresponding sim(t_i, t_j) will be assigned 0. After that, the word coupling matrix (C^*), with sim(t_i, t_j) as its elements, is normalized along the rows so that the sum of each row is 1. Based on the three different l^*(s), we construct three distinct word coupling matrices C^{sur}, C^{lex}, and C^{syn}.

Since a large vocabulary will make the construction of the word coupling matrix time-consuming, we provide a strategy to filter out less informative words based on their distributions on reading difficulty. The filtering measure is the entropy of the words, which can be calculated by Equation 5. By sorting the words in ascending order of entropy, the last α ∈ [0, 1] proportion will be filtered out.

Ent(t) = H(p_t) = -\sum_{i=1}^{\eta} p_t(i) \log p_t(i)    (5)
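As a concrete illustration of Steps 2 and 3, the following is a minimal sketch in Python with NumPy. It assumes each sentence is a list of tokens and already carries a difficulty level in {1, …, η} from Step 1; the function and variable names are illustrative, not from the authors' implementation.

```python
import numpy as np

def word_coupling_matrix(sentences, levels, eta, lam, smooth=1e-12):
    """Sketch of Steps 2-3: estimate per-word difficulty distributions
    from weakly labeled sentences, then build the row-normalized
    word coupling matrix C* (Equations 2-4)."""
    vocab = sorted({t for s in sentences for t in s})
    index = {t: i for i, t in enumerate(vocab)}
    # Equation 2: p_t(i) = (1/n_t) * sum_s delta(t in s) * delta(l(s) = i)
    p = np.zeros((len(vocab), eta))
    for s, lvl in zip(sentences, levels):
        for t in set(s):
            p[index[t], lvl - 1] += 1.0
    p /= p.sum(axis=1, keepdims=True)
    p += smooth                       # avoid log(0) in the KL divergence
    p /= p.sum(axis=1, keepdims=True)
    # Equation 3: symmetric KL divergence between all word pairs
    log_p = np.log(p)
    kl = (p * log_p).sum(axis=1)[:, None] - p @ log_p.T  # KL(p_i || p_j)
    c_kl = 0.5 * (kl + kl.T)
    # Equation 4: logistic normalization into a similarity in (0, 1]
    sim = 2.0 / (1.0 + np.exp(c_kl))
    np.fill_diagonal(sim, 0.0)        # only "other" words are candidates
    # keep the lambda nearest neighbors of each word, then normalize rows
    C = np.zeros_like(sim)
    for i in range(len(vocab)):
        nn = np.argsort(sim[i])[-lam:]
        C[i, nn] = sim[i, nn]
    C /= C.sum(axis=1, keepdims=True)
    return vocab, C
```

The entropy filter of Equation 5 would be applied to the rows of p before the pairwise step, dropping the α proportion of words with the highest entropy.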
Generating the coupled bag-of-words model. In the basic BoW model, words are treated as being independent of each other, and the corresponding BoW matrix is sparse and ignores the similarity among words on reading difficulty. For readability assessment, the coupled BoW model can be implemented by multiplying the word coupling matrix with the basic BoW matrix; the resulting coupled BoW matrix is dense and focuses on similarities in reading difficulty.
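The coupled representation itself is a single matrix product. Below is a minimal sketch in Python/NumPy using the TF-IDF weighting defined in Equations 6 and 7 of the next subsection; the function names and the assumption that every word occurs in at least one document are illustrative, not from the authors' implementation.

```python
import numpy as np

def tf_idf(counts):
    """Equation 6 (next subsection): logarithmically scaled term
    frequency times smoothed inverse document frequency.
    counts: |V| x |D| matrix of raw frequencies f(t, d); every word
    is assumed to occur in at least one document."""
    tf = np.zeros(counts.shape)
    mask = counts > 0
    tf[mask] = 1.0 + np.log(counts[mask])
    df = mask.sum(axis=1, keepdims=True)     # |{d | t in d}|
    idf = np.log(1.0 + counts.shape[1] / df)
    return tf * idf

def coupled_bow(C_star, M_bow):
    """Equation 7 (next subsection): M* = C* . M^bow. Words of similar
    difficulty share their weighted frequencies, so M* is dense."""
    return C_star @ M_bow
```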
One of the popular schemes of the BoW model is TF-IDF (term frequency–inverse document frequency). Given the set of documents D and the set of words V, the TF-IDF matrix is defined as M^{bow} ∈ R^{|V|×|D|}, which can be calculated based on the logarithmically scaled term (that is, word) frequency (Salton & Buckley, 1988) as follows:

M^{bow}_{t,d} = tf_{t,d} \cdot idf_{t,d} = (1 + \log f(t,d)) \cdot \log\left(1 + \frac{|D|}{|\{d \mid t \in d\}|}\right)    (6)

where f(t, d) is the number of times that a term (word) t ∈ V occurs in a document d ∈ D.

By adopting the TF-IDF matrix M^{bow} from the basic BoW model, the coupled TF-IDF matrix M^* can be generated by the following equation:

M^* = C^* \cdot M^{bow}    (7)

Technically, three coupled TF-IDF matrices M^{sur}, M^{lex}, and M^{syn} can be built according to the three word coupling matrices C^* developed in the previous section.

The Linguistic Features

The readability of documents is influenced by many factors, such as vocabulary, composition, syntax, semantics, and so on. Since vocabulary factors have been incorporated in the cBoW model, we integrate the other three factors into the linguistic view as a complement to the cBoW view. From the linguistic view, we build M^l ∈ R^{n_l×|D|} (n_l refers to the number of features) for the documents in D. Based on recent work on document-level readability assessment (Feng et al., 2010; Jiang, Sun, Gu, & Chen, 2014; Vajjala & Meurers, 2012), we select three groups of linguistic features: surface features, lexical features, and syntactic features, which are described as follows. Since our proposed method aims for language independence, we select mostly language-independent features and add some popular language-dependent features adapted from English to Chinese.

Surface Features. Surface features are the kind of features that can be directly acquired by counting the grammatical units in a document, such as the average number of characters per word, syllables per word, and words per sentence (Vajjala & Meurers, 2012). We adopt them in our method and add extra features used in Jiang et al. (2014). The extra features include the average number of polysyllabic words (for example, words with more than three syllables) per sentence, the average number of difficult words (for example, words with more than ten characters) per sentence, the ratio of distinct words (that is, without repetition), and the ratio of unique words.

Lexical Features. Lexical features are relevant to lexical types (for example, part of speech), which can be acquired by lexical analysis. Vajjala and Meurers (2012) employed lexical richness measures for readability assessment, which include 15 SLA (second language acquisition) measures and two extra designed measures. Jiang et al. (2014) developed counterparts of all the measures for Chinese documents. We adopt the two sets of lexical features for both English and Chinese. In addition, we also add the five features proposed by Feng et al. (2010) to compensate for the missing lexical types.

Syntactic Features. Syntactic features are computed based on the syntactic structures of sentences, which may require parse tree analysis. Following Vajjala and Meurers (2012), we adopt features computed on units of three levels: sentence, clause, and T-unit. In addition, we add the features designed in Jiang et al. (2014), which count the relative ratios of different types of parse tree nodes and phrases (that is, subtrees).
Examples include the average number of noun phrases per sentence, the average number of parse tree nodes per word, and the ratio of extra-high parse trees.

Two-View Graph Propagation

Based on the cBoW and linguistic views, we propose a two-view graph propagation method for readability classification. While general graph-based label propagation (Zhu & Ghahramani, 2002) contains two steps (that is, graph construction and label propagation), our method adds an extra graph merging step to make use of multiple graphs. In addition, since grade levels are on an ordinal scale, we further propose a reinforced label propagation algorithm.

Graph Construction. Given a feature representation X ∈ {M^{sur}, M^{lex}, M^{syn}, M^l}, we can build a directed graph G^* to represent the interrelationship on readability among the documents, where the node set D contains all the documents. Given the similarity function, we link node d_i ∈ D to d_j ∈ D with an edge of weight G^*_{i,j}, defined as:

G^*_{i,j} = \begin{cases} sim(d_i, d_j) & \text{if } d_j \in K(d_i) \\ 0 & \text{otherwise} \end{cases}    (8)

where K(d_i) is the set of k-nearest neighbors of d_i with top-k similarities. The similarity function sim(d_i, d_j) can be defined by the Euclidean distance as follows:

sim(d_i, d_j) = \frac{1}{\sqrt{\sum_{v=1}^{n} (X_{v,i} - X_{v,j})^2} + \epsilon}    (9)

where ϵ is a small constant to avoid zero denominators.
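Equations 8 and 9 amount to a directed k-nearest-neighbor graph over document feature vectors. The following is a minimal sketch, assuming X is an n_features × n_docs matrix as in the text; the names and the dense O(n²) implementation are illustrative.

```python
import numpy as np

def build_graph(X, k, eps=1e-8):
    """Equations 8-9: directed k-nearest-neighbor graph over documents.
    X: n_features x n_docs feature matrix (one column per document).
    Returns an n_docs x n_docs weight matrix G, where row i holds the
    similarities from d_i to its k most similar documents."""
    n = X.shape[1]
    diff = X[:, :, None] - X[:, None, :]      # pairwise feature differences
    dist = np.sqrt((diff ** 2).sum(axis=0))   # Euclidean distances
    sim = 1.0 / (dist + eps)                  # Equation 9
    np.fill_diagonal(sim, 0.0)                # no self-loops
    G = np.zeros((n, n))
    for i in range(n):
        nn = np.argsort(sim[i])[-k:]          # top-k neighbors K(d_i)
        G[i, nn] = sim[i, nn]                 # Equation 8
    return G
```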
Graph Merging. By constructing graphs from the two views, we get four graphs: three graphs correspond to the three coupled TF-IDF matrices (denoted G^{sur}, G^{lex}, and G^{syn}, respectively), and one graph corresponds to the linguistic features (denoted G^l). The former three graphs are homogeneous, since they share the same view, while the last graph is different. For the three homogeneous graphs, we provide an intra-view homogeneous graph merging strategy to merge them into one (denoted G^c). To combine the graphs from both views, we provide an inter-view heterogeneous graph merging strategy to merge graph G^c and graph G^l into the final graph G^{cl}.

Intra-view homogeneous graph merging. Given the three graphs (that is, G^{sur}, G^{lex}, and G^{syn}), where each node has k neighbors, we merge them into the graph G^c, where each node still has k neighbors. The basic idea is to keep the common edges while removing edges that carry redundant information, as shown in Figure 4. Given a node v, we first reserve its neighbors that are common to all three graphs. Second, from the candidate nodes, which are neighbors of v in at least one graph, we select one by one the node that shares the least number of common neighbors (that is, nodes already selected in K^c(v)). The objective is to keep the number of triangles in G^c to a minimum. The edge weights of G^c are averaged over the corresponding edges appearing in the three graphs.

FIG. 4. Illustration of the intra-view homogeneous graph merging strategy.

Inter-view heterogeneous graph merging. Considering that the edges of either G^c or G^l describe the relationship of documents on readability from a certain perspective, we reserve all the edges (that is, some nodes will have more than k neighbors) and use the factor β to balance their weights in the final graph G^{cl}. The following equation defines the strategy:

G^{cl}_{i,j} = \beta G^c_{i,j} + (1 - \beta) G^l_{i,j}    (10)

The parameter β ∈ [0, 1] is used to specify the relative importance of the two graphs. Clearly, the case of β = 0 or 1 reduces the graph to a single view.

Label propagation. Given a graph G constructed in the previous sections, its nodes are divided into two sets: the labeled set D_l = {d_1, d_2, …, d_x} and the unlabeled set D_u = {d_{x+1}, d_{x+2}, …, d_n}. The goal of label propagation is to propagate class labels from the labeled nodes (that is, documents with known reading levels) to the entire graph. We use a simplified label propagation algorithm presented in Subramanya et al. (2010), which has been proved effective (Kim et al., 2013). The algorithm iteratively updates the label distribution on a document node using the following equation:

q_d^{(i)}(y) = \frac{1}{\kappa_d} \left( q_d^0(y) \, \delta(d \in D_l) + \sum_{v \in K(d)} G_{d,v} \, q_v^{(i-1)}(y) \right)    (11)

On the left side of Equation 11, q_d^{(i)}(y) is the updated probability of reading level y on a node d at the i-th iteration. On the right side, κ_d is the normalizing constant that makes the level probabilities sum to 1, and q_d^0(y) is the initial probability of y on d if d is initially labeled (1 if d is labeled as level y and 0 otherwise). δ(x) is the indicator function. K(d) denotes the set of neighbors of d. The iteration stops when the changes in q_d^{(i)}(y) for all the nodes and reading levels are small enough (for example, less than e^{-3}), or i exceeds a predefined number (for example, greater than 30). For each unlabeled document d_i, its predicted reading level is y if the element q_{d_i}(y) is greatest in the latest label distribution q_{d_i}.
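The update in Equation 11 can be sketched as follows, assuming G is the merged weight matrix G^{cl} and labels encodes the known levels (0 for unlabeled documents); the stopping constants follow the text, and everything else is illustrative.

```python
import numpy as np

def propagate(G, labels, n_levels, tol=np.exp(-3), max_iter=30):
    """Equation 11: iterative label propagation on the merged graph.
    G: n x n edge-weight matrix (G[d, v] > 0 iff v is a neighbor of d).
    labels: length-n array with levels in {1..n_levels} for labeled
    documents and 0 for unlabeled ones."""
    n = len(labels)
    q0 = np.zeros((n, n_levels))
    for d, y in enumerate(labels):
        if y > 0:
            q0[d, y - 1] = 1.0            # q^0_d(y) * delta(d in D_l)
    q = q0.copy()
    for _ in range(max_iter):
        q_new = q0 + G @ q                # seed term + weighted neighbors
        sums = q_new.sum(axis=1, keepdims=True)
        q_new = np.divide(q_new, sums, out=np.zeros_like(q_new),
                          where=sums > 0)     # normalize by kappa_d
        if np.abs(q_new - q).max() < tol:     # stop on small changes
            q = q_new
            break
        q = q_new
    return q.argmax(axis=1) + 1           # predicted level per document
```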
FIG. 4. Illustration of the intra-view homogeneous graph merging strategy.

Reinforced label propagation. The above label propagation algorithm (Subramanya et al., 2010) treats the class labels as being in nominal scale, and ignores the ordinal relation among reading levels. We develop a reinforced label propagation algorithm to utilize this ordinal relationship. Preclassification is required using the linguistic features, which provides extra information to amplify the edge weights.

Let documents have m reading levels in ordinal scale. Given the x labeled documents with reading levels $\{y_1, \ldots, y_x\}$, a preclassifier can be trained and prelabels can then be predicted for the unlabeled documents, denoted as $\{\hat{y}_{x+1}, \ldots, \hat{y}_n\}$. Thus, we have the a priori labels of all the documents $\{z_1, \ldots, z_x, z_{x+1}, \ldots, z_n\}$, where $z_i = y_i$ ($i \in \{1, \ldots, x\}$) for labeled documents and $z_j = \hat{y}_j$ ($j \in \{x+1, \ldots, n\}$) for unlabeled documents.

Based on the a priori labels, we amplify the weights of the edges that connect two nodes with similar a priori labels. The reinforced label propagation iteratively updates the label distribution on a document node using the following enhanced equation:

$$q^{(i)}_d(y) = \frac{1}{\kappa_d}\Big(q^0_d(y)\,\delta(d \in D_l) + \sum_{v \in K(d)} G'_{d,v}\, q^{(i-1)}_v(y)\Big), \qquad G'_{d,v} = (m - |z_d - z_v|)\, G_{d,v} \qquad (12)$$
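The update in Equations 11 and 12 is straightforward to express with a dense weight matrix. Below is a minimal sketch, assuming W holds the non-negative edge weights of the merged graph (zero for non-neighbors) and z holds the a priori labels (true labels plus the preclassifier's prelabels); the function name, the dense-matrix representation, and the exact tolerance value are illustrative assumptions. Dropping the amplification (using W in place of W_amp) recovers the plain propagation of Equation 11.

```python
import numpy as np

def reinforced_label_propagation(W, z, labeled_idx, y_labeled, m,
                                 max_iter=30, tol=1e-3):
    """Propagate reading-level distributions over the graph (Eqs. 11-12).

    W           : (n, n) edge-weight matrix, 0 for non-neighbors
    z           : (n,) a priori labels (labels + preclassifier prelabels)
    labeled_idx : indices of the labeled documents
    y_labeled   : their reading levels, encoded as 0..m-1
    """
    n = W.shape[0]
    # Equation 12: amplify edges whose end nodes have close a priori labels
    W_amp = (m - np.abs(z[:, None] - z[None, :])) * W
    # q0: one-hot initial distributions for labeled nodes, zero otherwise
    q0 = np.zeros((n, m))
    q0[labeled_idx, y_labeled] = 1.0
    q = q0.copy()
    for _ in range(max_iter):
        # delta(d in D_l) * q0(y) plus the weighted sum over neighbors
        q_new = q0 + W_amp @ q
        # kappa_d: normalize each row to a probability distribution
        q_new /= q_new.sum(axis=1, keepdims=True) + 1e-12
        if np.abs(q_new - q).max() < tol:  # stop when changes are small
            q = q_new
            break
        q = q_new
    return q.argmax(axis=1)  # predicted reading level for every document
```

Because $|z_d - z_v|$ ranges over 0..m−1, the factor $(m - |z_d - z_v|)$ is always positive, so amplification reweights edges rather than disconnecting them.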
where $G'$ is a weighted graph whose edge weights are modified from $G_{d,v}$ according to the difference in the current reading levels of their end nodes. For each unlabeled document $d_i$, its final reading level is y if $q_{d_i}(y)$ is greatest in the final label distribution $q_{d_i}$.

Experiments

In this section, we conduct experiments on data sets of both English and Chinese to investigate the following four research questions:

RQ1: Does the proposed method (that is, GRAW+) outperform the state-of-the-art methods for readability assessment?
RQ2: What are the effects of the word coupling matrix on the performance of the coupled bag-of-words model?
RQ3: How effective is the two-view graph propagation method, including the graph merging strategies and the reinforced label propagation algorithm?
RQ4: Does introducing an external text corpus improve the quality of the word coupling matrix?

Corpus and Performance Measures

To evaluate our proposed method, we used two data sets. The first is CPT (Chinese primary textbook) (Jiang et al., 2014), which contains Chinese documents of six reading levels corresponding to six grades. The second is ENCT (English New Concept textbook) (Jiang et al., 2015), which contains English documents of four reading levels. Both data sets (shown in Table 2) are built from well-known textbooks in which documents are organized into grades by credible educationists. For the documents in CPT, we use the ICTCLAS tool (Zhang, 2013; Zhang, Yu, Xiong, & Liu, 2003) to do the word segmentation.

We conducted experiments on both CPT and ENCT using hold-out validation, which randomly divides a data set into labeled (training) and unlabeled (test) sets by stratified sampling. The labeling proportion is varied to investigate the performance of a method under different circumstances. To reduce variability, in each case, 100 rounds of hold-out validation are performed, and the validation results are averaged over all the rounds (a sketch of this protocol is given after Table 3). To tune the hyperparameters, we randomly choose one partition from the training set as the development set. We chose precision (P), recall (R), and F1-measure (F1) as the performance measures.

Comparisons to the State-of-the-Art Methods

To address RQ1, we implement the following readability assessment methods and compare our method GRAW+ with them:

• SMOG (McLaughlin, 1969) and FK (Kincaid et al., 1975) are two widely used readability formulas. We reserve their features and refit the coefficients on both data sets to fit the reading (grade) levels.
• SUM (Collins-Thompson & Callan, 2004) is a word-based method, which trains one unigram model for each reading level, and applies model smoothing among the reading levels.
• V&M (Vajjala & Meurers, 2012) is one of the current best readability assessment methods for English, which adopts three groups of features for classification. As the majority of the features are designed specifically for English, we run V&M on ENCT only.
• Jiang (Jiang et al., 2014) is a readability assessment method for Chinese. It adopts five groups of features and designs an ordinal multiclass classification with voting. We run Jiang on CPT only.
• SG-NN is a word embedding-based readability assessment method proposed by Tseng et al. (2016). In SG-NN, the representation of a document is generated by adding up the word embeddings of all words in the document. The word embedding model used is Skip-Gram. The classification model used is a regularized neural network with one hidden layer.
• SG-KM-SVM is a word embedding-based readability assessment method proposed by Cha et al. (2017). In SG-KM-SVM, the representation of a document is generated by applying average pooling to the word embeddings and cluster memberships of all words in the document. The word embedding model used is Skip-Gram. The cluster membership is generated by K-means. An SVM (Support Vector Machine) is used to predict the reading level of a document.
• SVM (Support Vector Machine) and LR (Logistic Regression) are two classification models that have been widely used for readability assessment in previous studies (Feng et al., 2010; Jiang et al., 2014).
• TSVM (Transductive SVM) (Joachims, 1998) is a classical transductive method, which has not previously been applied to readability assessment. Since GRAW+ is also a transductive method, we run TSVM here as a baseline.
• OLR (Ordinal LR) (McCullagh, 1980) is a variant of LR that can predict in ordinal scale. As the reinforced label propagation in GRAW+ also exploits the ordinal relation among reading levels, we run OLR here as another baseline.
• Bi-LP is a graph propagation method that applies label propagation on a complex graph (Gao et al., 2015; Jiang, 2011). Bi-LP builds two separate subgraphs from the cBoW view and the linguistic view, and connects them using a bipartite subgraph. The label propagation algorithm is performed on the integrated graph and leads to two distributions for each document. A simple average is used to determine the final class label of each document.

TABLE 2. Statistics of the English and Chinese data sets.

Data set   Language   #Grade   #Doc   #Sent    #Word
CPT        Chinese    6        637    16,145   234,372
ENCT       English    4        279    4,671    62,921
• GRAW (Jiang et al., 2015) is the previous version of our method. We verify whether GRAW+ improves on the performance of GRAW with the newly added facilities.

For the four baseline classification models (that is, LR, SVM, TSVM, and OLR), documents are represented as feature vectors. To make a fair comparison, we build the feature vector by concatenating two sets of features: the linguistic features used in this article (denoted Linguistic) and the TF-IDF features (denoted TF-IDF), since GRAW+ incorporates both.

For GRAW+, we apply the reinforced label propagation on each of the three graphs $G^c$, $G^l$, and $G^{cl}$, denoted as $\text{GRAW}^+_c$, $\text{GRAW}^+_l$, and $\text{GRAW}^+_{cl}$, respectively. Unless otherwise specified, we fixed η to 3, α to 0, and β to 0.5; λ and k are tuned on the development set. Sentences from the whole data set are used as the auxiliary text corpus. LR is used to build the preclassifier for the reinforced label propagation.

Table 3 gives the average performance of each method on both data sets, where the proportion of the labeled (training) set is 0.7. Specifically, the precision, recall, and F1-measure of all the methods are calculated by averaging the results over all reading (grade) levels on either the English or the Chinese data set. The values marked in bold in each column refer to the maximum (best) measures gained by the methods.

From Table 3, the readability formulas (SMOG and FK) perform poorly in both the precision and recall measures, so that their F1-measures are generally the poorest. Nevertheless, SMOG and FK still achieve acceptable performance on the English data set ENCT. The unigram model (SUM) performs a little better than the readability formulas. On ENCT, it has relatively good performance, while on the Chinese data set CPT its performance is not satisfactory. Both V&M on ENCT and Jiang on CPT perform well, which means both the linguistic features developed and the classifiers trained are useful. The two word embedding-based methods (SG-NN and SG-KM-SVM) achieve better performance measures than SUM on both data sets. On ENCT, SG-NN performs better than V&M. By adopting features from our proposed two views, the two commonly used models (SVM and LR) perform a little better than V&M and Jiang, which demonstrates the usefulness of the two views. The transductive method (TSVM) slightly outperforms SVM and LR, which suggests that the unlabeled documents can provide valuable information for readability classification. In addition, after adding the TF-IDF features into the feature set, the performance of both SVM and LR improves, but the performance of TSVM becomes worse. This may be due to the heterogeneous spaces of the two views, which should not be roughly combined through concatenation.
TABLE 3. The average precision, recall, and F1-measures (%) of the 11 methods for readability assessment on either data set (the labeling proportion is 0.7).

                                       CPT                          ENCT
Methods                     Precision  Recall  F1-meas.  Precision  Recall  F1-meas.
SMOG                        28.48      25.38   21.12     55.00      41.41   41.20
FK                          34.48      24.68   18.73     60.78      45.65   46.16
SUM                         37.39      33.35   33.78     71.18      71.45   67.44
V&M                         —          —       —         86.98      85.08   85.63
Jiang                       48.07      47.53   47.30     —          —       —
SG-NN                       45.58      45.87   44.96     88.93      88.17   88.24
SG-KM-SVM                   40.96      40.77   39.98     80.87      79.92   80.33
SVM    Linguistic           48.43      48.01   47.83     87.95      87.08   86.85
       TF-IDF               43.10      44.64   42.47     92.26      90.03   90.66
       Linguistic+TF-IDF    51.04      51.26   50.74     88.02      86.00   85.96
LR     Linguistic           46.53      46.43   46.16     88.05      87.16   87.28
       TF-IDF               33.10      34.00   30.65     88.26      86.18   86.69
       Linguistic+TF-IDF    46.64      46.78   46.22     90.80      89.17   89.67
TSVM   Linguistic           53.32      51.28   48.95     90.57      87.72   88.31
       TF-IDF               37.55      38.92   30.93     28.54      46.28   35.30
       Linguistic+TF-IDF    44.35      41.12   33.47     27.83      45.14   34.43
OLR    Linguistic           51.02      47.93   47.98     51.53      54.40   49.93
       TF-IDF               34.25      30.38   27.44     55.01      55.35   53.18
       Linguistic+TF-IDF    51.90      50.51   49.27     62.62      62.33   60.27
Bi-LP  LP                   46.52      46.50   44.92     87.53      83.19   83.63
       Reinforced LP        47.41      46.87   45.90     88.71      84.51   85.04
GRAW   Gc                   50.41      50.60   49.67     90.27      88.16   88.67
       Gf                   32.99      38.98   32.95     86.16      78.80   78.55
       Gcf                  51.43      53.26   50.86     92.39      90.68   91.15
GRAW+  Gc                   53.70      53.64   53.19     91.86      90.55   90.88
       Gl                   40.99      43.99   39.97     87.34      80.90   81.00
       Gcl                  54.21      55.16   54.07     93.33      92.02   92.38
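For concreteness, the following is a minimal sketch of the hold-out protocol behind the averages in Table 3, as it applies to the inductive baselines (for the transductive methods, the unlabeled test documents are additionally visible during training). It assumes documents are already encoded as a feature matrix X with levels y; the helper name and the use of scikit-learn are illustrative assumptions, and the macro averaging mirrors the per-level averaging described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedShuffleSplit

def holdout_evaluate(model, X, y, train_size=0.7, rounds=100, seed=0):
    """Average macro P/R/F1 over repeated stratified hold-out splits."""
    splitter = StratifiedShuffleSplit(n_splits=rounds,
                                      train_size=train_size,
                                      random_state=seed)
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        y_pred = model.predict(X[test_idx])
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], y_pred, average='macro', zero_division=0)
        scores.append((p, r, f1))
    return np.mean(scores, axis=0)  # averaged precision, recall, F1

# e.g., the LR baseline on a precomputed feature matrix:
# p, r, f1 = holdout_evaluate(LogisticRegression(max_iter=1000), X, y)
```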
matrices,we vary the values and compute the average From Figure 5,neither SMOG nor FK benefits from the Fl-measures on the two data sets,as shown in enlarged labeled set.This suggests that the performance of Figure 6b.The graph was built using M",and the other the readability formulas can hardly be improved by accu- two graphs present similar trends.From Figure 6b,a small mulating training data.The other methods achieve better n(for example,2 or 3)is good on the Chinese data set performance on larger labeled sets,and outperform the CPT.However,on the English data set ENCT,n=2 leads two readability formulas even if the labeling proportion to the poorest performance.It shows that the increasing of is small.Both Jiang on CPT and V&M on ENCT perform n causes vibrated performance,and the trend is further better than SUM.GRAW outperforms the baseline complicated when involving A.Above all,n=3 gives a methods over all the labeling proportions on both data sets preferable option on both data sets.For most matrices and performs well even when the labeling proportion is exhibit similar trends that rise first and then keep stable on small.Again,as the enhanced version of GRAW,the per- both data sets,while some may drop when A is too great. formance of GRAW+is consistently improved over the This suggests that making a relatively large number of labeling proportions. neighbors for each word (that is,A=2,800 on CPT and 0.6 0.8 0.4 0.7 0.3 ·SMOG SMOG 0.2 SUM SUM lan▣ V&M 04 GRAW GRAW GRAW GRAW+ 0.1 0.3 0.1 0.2 0.30.4 0.5 0.6 0.7 0.8 0.9 01 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Proportion of the labeled set Proportion of the labeled set FIG.5.The average F1-measures of the seven methods on both data sets (Jiang is running on CPT only,while V&M is running on ENCT only)with the labeling proportion varied from 0.1 to 0.9 step by 0.1.[Color figure can be viewed at wileyonlinelibrary.com] 442 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY-May 2019 D0l:10.1002/asi
By employing the ordinal relation among the reading levels, OLR achieves good performance on CPT. On ENCT, however, its performance is poor, which suggests the instability of OLR. On both data sets, both GRAW and GRAW+ outperform Bi-LP with either LP or reinforced LP, which demonstrates that our graph merging strategy is more effective than the graph integration strategy in Bi-LP. By comparing Bi-LP with $G^c$ and $G^l$ from GRAW+, which roughly correspond to the two subgraphs in Bi-LP, it can be observed that the performance of Bi-LP is better than $G^l$ but worse than $G^c$. By simply linking the two subgraphs through the bipartite graph, Bi-LP may not take full advantage of the two subgraphs. In general, $\text{GRAW}_{cf}$ (applying the general label propagation on $G^{cf}$) performs better than all the above baselines, which demonstrates the effectiveness of our method. Finally, by applying the reinforced label propagation algorithm, $\text{GRAW}^+_{cl}$ performs the best in all three measures on both data sets.

We studied the effect of the labeling proportion on the performance of these methods on both data sets. The F1-measure averaged over the reading levels was used, since it is a good representative of the three measures according to Table 3. Figure 5 depicts the performance trends of the five baseline methods and the two versions of our method (that is, GRAW and GRAW+) with the labeling proportion varied from 0.1 to 0.9 in steps of 0.1.

From Figure 5, neither SMOG nor FK benefits from an enlarged labeled set. This suggests that the performance of the readability formulas can hardly be improved by accumulating training data. The other methods achieve better performance on larger labeled sets, and outperform the two readability formulas even when the labeling proportion is small. Both Jiang on CPT and V&M on ENCT perform better than SUM. GRAW outperforms the baseline methods over all the labeling proportions on both data sets and performs well even when the labeling proportion is small. Again, as the enhanced version of GRAW, the performance of GRAW+ is consistently improved over all labeling proportions.

Effects of the Word Coupling Matrix

For RQ2, we first compare the coupled BoW model to the basic BoW model for graph construction. Three graphs are built by using each of the three coupled BoW matrices (that is, $M^{sur}$, $M^{lex}$, and $M^{syn}$) generated from the three word coupling matrices (that is, $C^{sur}$, $C^{lex}$, and $C^{syn}$), respectively, and one graph is built by using the TF-IDF (that is, basic BoW) matrix for comparison. The general label propagation is applied on each graph to measure the performance of readability classification. The labeling proportion is varied from 0.1 to 0.9 on both the English and Chinese data sets. Figure 6a depicts the averaged F1-measures resulting from the four graphs. From Figure 6a, the three coupled BoW matrices greatly outperform the TF-IDF matrix, especially on the Chinese data set CPT. This demonstrates that the word coupling matrices are effective in improving the performance of the basic BoW model for readability assessment.

Second, we investigate which parts of GRAW+ (the parameters, the word filtering, and the size of the auxiliary sentence set) affect the performance of the word coupling matrices.

The effect of the parameters η and λ. To investigate the effects of η and λ on the performance of the word coupling matrices, we vary their values and compute the average F1-measures on the two data sets, as shown in Figure 6b.
The graph was built using $M^{syn}$; the other two graphs present similar trends. From Figure 6b, a small η (for example, 2 or 3) works well on the Chinese data set CPT. However, on the English data set ENCT, η = 2 leads to the poorest performance. This shows that increasing η causes fluctuating performance, and the trend is further complicated when λ is involved. Above all, η = 3 is a preferable option on both data sets. For λ, most matrices exhibit similar trends that rise first and then remain stable on both data sets, while some may drop when λ is too large. This suggests that keeping a relatively large number of neighbors for each word (that is, λ = 2,800 on CPT and

FIG. 5. The average F1-measures of the seven methods on both data sets (Jiang is run on CPT only, while V&M is run on ENCT only) with the labeling proportion varied from 0.1 to 0.9 in steps of 0.1.