combine the rich representation of texts with sophisticated prediction models.

Most current methods assess the readability of text documents individually and ignore the interrelationship among documents on readability, which can be useful in assessing the readability of documents based on the labeled ones. For example, two documents can be of the same reading level if they consist of words that have similar reading difficulty.
Hence, we propose a graph propagation method for readability assessment, which can model and utilize the interrelationship among text documents.

To measure the relationship among documents, we use the bag-of-words (BoW) model, which is commonly used for text classification and clustering (Huang, 2008; Sebastiani, 2002). However, to measure the relationship on readability, the basic BoW model requires improvement, since it ignores the fact that different words may have similar reading difficulties. Figure 1 illustrates the improved use of the BoW model for readability assessment with a simple example. In Figure 1, the left matrix is built from the basic BoW model for three documents (D1, D2, and D3) consisting of four tokens (school, law, syllabus, and decree). Among the three documents, D1 and D2 are relatively difficult documents, each containing two easy words (school or law) and two difficult words (syllabus or decree), while D3 is an easy document that contains two easy words (school). The cosine similarities computed from the basic BoW model (the bottom left subfigure) show that D1 is more similar to D3 than to D2, which is inconsistent with their similarities in reading difficulty.

To overcome this shortcoming of the basic BoW model, we designed a word coupling method. As shown in Figure 1, the method first measures the similarities among words in reading difficulty (the word coupling matrix). It then makes words of similar difficulty (for example, school and law) share their occurrence frequencies with each other (by matrix multiplication), which leads to the coupled BoW model (the coupled BoW matrix). In this way, documents will be similar in readability if their words have similar distributions of reading difficulty.

To build the coupled BoW model, the key point is to construct the word coupling matrix.
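As a rough numerical illustration of the Figure 1 example, the following Python sketch computes cosine similarities before and after word coupling. The term counts and the 0.5 coupling weights are assumptions chosen to be consistent with the description above, and the names `BOW`, `COUPLING`, `cosine`, and `couple` are hypothetical helpers, not part of the authors' implementation.

```python
import math

# Token order and counts are illustrative assumptions chosen to match the
# Figure 1 description; they are not taken from the authors' data.
TOKENS = ["school", "law", "syllabus", "decree"]
BOW = {
    "D1": [2, 0, 2, 0],  # two easy words and two difficult words
    "D2": [0, 2, 0, 2],  # two easy words and two difficult words
    "D3": [2, 0, 0, 0],  # two easy words only
}

# Hand-set coupling matrix (assumption): words of similar difficulty
# share their weight equally with each other.
COUPLING = [
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def couple(row, coupling):
    """Multiply a BoW row vector by the coupling matrix."""
    return [
        sum(row[i] * coupling[i][j] for i in range(len(row)))
        for j in range(len(coupling[0]))
    ]


# Coupled BoW vectors: each word's count is shared with its coupled words.
COUPLED = {doc: couple(row, COUPLING) for doc, row in BOW.items()}

for a, b in [("D1", "D2"), ("D1", "D3"), ("D2", "D3")]:
    basic = cosine(BOW[a], BOW[b])
    coupled = cosine(COUPLED[a], COUPLED[b])
    print(f"sim({a},{b}): basic={basic:.3f} coupled={coupled:.3f}")
```

Under these assumed values, D1 and D2 share no token in the basic BoW vectors (similarity 0) but become identical after coupling (similarity 1), while D3 stays at about 0.707 from both. The coupling matrix here is hand-set for illustration only; in the method itself it has to be constructed from data.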
For this purpose, we first estimate the occurrence distributions of words in sentences of different reading difficulties, and then compute the words' similarities in reading difficulty based on these distributions.

Besides the coupled BoW model, linguistic features can also be adopted by our method. On the one hand, we use the linguistic features as a complement to the coupled BoW model to construct graphs from multiple views. On the other, the linguistic features are used to reinforce the label propagation algorithm by providing prior knowledge.

In this article, we propose a two-view graph propagation method with word coupling for readability assessment. Our contributions are as follows (a preliminary version of this work appeared in Jiang, Sun, Gu, Bai, & Chen, 2015). (i) We apply a graph-based method to readability assessment, which can make use of the interrelationship among documents to estimate their readability. (ii) We propose the coupled BoW model, which can be used to measure the similarity of documents in reading difficulty. (iii) We propose a two-view graph building strategy to make use of both the coupled BoW model and the linguistic features. (iv) We propose a reinforced label propagation algorithm, which can make use of the ordinal relation among reading levels. Extensive experiments were carried out on English and Chinese data sets. Compared with the state-of-the-art methods, the results demonstrate both the effectiveness and the potential of our method.

Background and Related Work

Readability Assessment

Research on automatic readability assessment has spanned the last 70 years (Benjamin, 2012). Early research mainly focused on the design of readability formulas (Zakaluk & Samuels, 1988). Many well-known readability formulas have been developed, such as the SMOG formula

FIG. 1. A motivation example of the word coupling method. The left matrix is a basic BoW matrix. The central matrix is a word coupling matrix. The right matrix is the coupled BoW matrix. [Color figure can be viewed at wileyonlinelibrary.com]

434 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—May 2019 DOI: 10.1002/asi