438 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY, May 2019. DOI: 10.1002/asi

One of the popular schemes of the BoW model is TF-IDF (Term Frequency and Inverse Document Frequency). Given the set of documents $D$ and the set of words $V$, the TF-IDF matrix is defined as $M^{bow} \in \mathbb{R}^{|V| \times |D|}$, which can be calculated based on the logarithmically scaled term (that is, word) frequency (Salton & Buckley, 1988) as follows:

$$M^{bow}_{t,d} = tf_{t,d} \cdot idf_{t,d} = \bigl(1 + \log f(t,d)\bigr) \cdot \log\left(1 + \frac{|D|}{|\{d \mid t \in d\}|}\right) \tag{6}$$

where $f(t,d)$ is the number of times that a term (word) $t \in V$ occurs in a document $d \in D$.

By adopting the TF-IDF matrix $M^{bow}$ from the basic BoW model, the coupled TF-IDF matrix $M^{*}$ can be generated by the following equation:

$$M^{*} = C^{*} \cdot M^{bow} \tag{7}$$

Technically, three coupled TF-IDF matrices $M^{sur}$, $M^{lex}$, and $M^{syn}$ can be built according to the three word coupling matrices $C^{*}$ developed in the previous section.
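As a minimal sketch of Equations (6) and (7), the following Python code builds a TF-IDF matrix from tokenized documents and multiplies it by a word coupling matrix. Several details are assumptions made for illustration and are not confirmed by the text: the function names `tfidf_matrix` and `couple`, the use of the natural logarithm, the weight 0 for a term that does not occur in a document (Equation (6) is undefined for $f(t,d) = 0$), and the reading of Equation (7) as an ordinary matrix product with a $|V| \times |V|$ coupling matrix.

```python
import math


def tfidf_matrix(docs, vocab):
    """Return M_bow as a |V| x |D| list of lists, following Eq. (6).

    docs  : list of documents, each a list of word tokens
    vocab : list of words, one matrix row per word
    """
    n_docs = len(docs)
    matrix = []
    for term in vocab:
        # Document frequency: number of documents that contain the term.
        df = sum(1 for doc in docs if term in doc)
        row = []
        for doc in docs:
            f = doc.count(term)  # raw count f(t, d)
            if f == 0:
                # Assumption: absent terms get weight 0, since log(0) is undefined.
                row.append(0.0)
            else:
                # Natural logarithm is an assumption; the base is not stated.
                row.append((1 + math.log(f)) * math.log(1 + n_docs / df))
        matrix.append(row)
    return matrix


def couple(coupling, m_bow):
    """Eq. (7), read as a plain matrix product C* . M_bow (an assumption).

    coupling : |V| x |V| word coupling matrix
    m_bow    : |V| x |D| TF-IDF matrix
    """
    n_docs = len(m_bow[0])
    return [
        [sum(c * m_bow[k][j] for k, c in enumerate(row)) for j in range(n_docs)]
        for row in coupling
    ]
```

Under this reading, passing an identity matrix as `coupling` returns the plain TF-IDF matrix unchanged, while off-diagonal entries let a word contribute weight to the rows of the words it is coupled with.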
The Linguistic Features

The readability of documents is influenced by many factors, such as vocabulary, composition, syntax, and semantics. Since vocabulary factors have been incorporated in the cBoW model, we integrate the other three factors into the linguistic view as a complement to the cBoW view. From the linguistic view, we build $M^{l} \in \mathbb{R}^{n_l \times |D|}$ ($n_l$ refers to the number of features) for the documents in $D$. Based on the recent work on document-level readability assessment (Feng et al., 2010; Jiang, Sun, Gu, & Chen, 2014; Vajjala & Meurers, 2012), we select three groups of linguistic features: surface features, lexical features, and syntactic features, which are described as follows. Since our proposed method aims for language independence, we select mostly language-independent features and add some popular language-dependent features adapted from English to Chinese.

Surface Features. Surface features are features that can be acquired directly by counting the grammatical units in a document, such as the average number of characters per word, syllables per word, and words per sentence (Vajjala & Meurers, 2012). We adopt them in our method and add extra features used in Jiang et al. (2014). The extra features include the average number of polysyllabic words (for example, words with more than 3 syllables) per sentence, the average number of difficult words (for example, words with more than 10 characters) per sentence, the ratio of distinct words (that is, without repetition), and the ratio of unique words.

Lexical Features. Lexical features relate to lexical types (for example, part of speech), which can be acquired by lexical analysis. Vajjala and Meurers (2012) employed lexical richness measures for readability assessment, which include 15 SLA (Second Language Acquisition) measures and two additional measures that they designed. Jiang et al. (2014) developed counterparts of all the measures for Chinese documents.
We adopt the two sets of lexical features for both English and Chinese. In addition, we add the five features proposed by Feng et al. (2010) to compensate for the missing lexical types.

Syntactic Features. Syntactic features are computed from the syntactic structures of sentences, which may require parse tree analysis. Following Vajjala and Meurers (2012), we adopt features computed on units of three levels: sentence, clause, and T-unit. In addition, we add the features designed in Jiang et al. (2014), which count the relative ratios of different types of parse tree nodes and phrases (that is, subtrees). Examples include the average number of noun phrases per sentence, the average number of parse tree nodes per word, and the ratio of extra-high trees.

Two-View Graph Propagation

Based on the cBoW and linguistic views, we propose a two-view graph propagation method for readability classification. While general graph-based label propagation (Zhu & Ghahramani, 2002) contains two steps (that is, graph construction and label propagation), our method adds an extra graph merging step to make use of multiple graphs. In addition, since grade levels are on an ordinal scale, we further propose a reinforced label propagation algorithm.

Graph Construction. Given a feature representation $X \in \{M^{sur}, M^{lex}, M^{syn}, M^{l}\}$, we can build a directed graph $G^{*}$ to represent the readability relationships among the documents, where the node set $D$ contains all the documents. Given the similarity function, we link node $d_i \in D$ to $d_j \in D$ with an edge of weight $G^{*}_{i,j}$, defined as:

$$G^{*}_{i,j} = \begin{cases} sim(d_i, d_j) & \text{if } d_j \in K(d_i) \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $K(d_i)$ is the set of $k$-nearest neighbors of $d_i$ with top-$k$ similarities.
The similarity function $sim(d_i, d_j)$ can be defined by the Euclidean distance as follows:

$$sim(d_i, d_j) = \frac{1}{\sqrt{\sum_{v=1}^{n} \left(X_{v,i} - X_{v,j}\right)^{2}} + \epsilon} \tag{9}$$

where $\epsilon$ is a small constant to avoid zero denominators.
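The graph construction in Equations (8) and (9) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `similarity` and `build_knn_graph`, the default value of `eps`, the storage of $X$ as a list of feature rows with one column per document, the exclusion of a document from its own neighbor set, and the tie-breaking by document index are all choices made here.

```python
import math


def similarity(x, i, j, eps=1e-6):
    """Eq. (9): inverse Euclidean distance between document columns i and j.

    x is a features x documents matrix (list of rows); eps is a hypothetical
    default for the small constant that avoids a zero denominator.
    """
    dist = math.sqrt(sum((row[i] - row[j]) ** 2 for row in x))
    return 1.0 / (dist + eps)


def build_knn_graph(x, k, eps=1e-6):
    """Eq. (8): directed graph whose row i keeps only the top-k similarities."""
    n_docs = len(x[0])
    graph = [[0.0] * n_docs for _ in range(n_docs)]
    for i in range(n_docs):
        # Assumption: a document is not counted among its own neighbors.
        sims = [(similarity(x, i, j, eps), j) for j in range(n_docs) if j != i]
        # Stable sort by similarity, so ties go to the lower document index.
        sims.sort(key=lambda pair: pair[0], reverse=True)
        for s, j in sims[:k]:
            graph[i][j] = s  # edge d_i -> d_j with weight sim(d_i, d_j)
    return graph
```

Because the neighbor relation is not symmetric, `graph[i][j]` can be nonzero while `graph[j][i]` is zero, which is why the resulting graph is directed.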