TABLE 1. Three aspects of estimating reading difficulty of sentences using heuristic functions.

    Aspect     Function   Description
    Surface    len(s)     the length of the sentence s.
               ans(s)     the average number of syllables (or strokes for Chinese) per word (or character for Chinese) in s.
               anc(s)     the average number of characters per word in s.
    Lexical    lv(s)      the number of distinct types of POS (part of speech) in s.
               atr(s)     the ratio of adjectives in s.
               ntr(s)     the ratio of nouns in s.
    Syntactic  pth(s)     the height of the syntax parse tree of s.
               anp(s)     the average number of (noun, verb, and preposition) phrases in s.

reading score of all the sentences in S, where h refers to one of the eight functions. To determine the difficulty level l^h(s) (l^h(s) ∈ [1, η]) of a sentence s, the range [r^h_min, r^h_max] is evenly divided into η intervals. l^h(s) will be i if the reading score r^h(s) resides in the i-th interval. For each of the three aspects, we compute one l^*(s) for a sentence s by combining the heuristic functions using the following equations:

    l^sur(s) = max(l^len(s), l^ans(s), l^anc(s))
    l^lex(s) = max(l^lv(s), l^atr(s), l^ntr(s))        (1)
    l^syn(s) = max(l^pth(s), l^anp(s))

Step 2: Per-word difficulty distribution estimation.

The difficulty distribution of each word is computed based on the sentence-level reading difficulty. Since each sentence contains many words and each word may appear in many sentences, we estimate the difficulty distributions of words according to their distributions of occurrences in sentences. Let V denote the set of all the words appearing in S, and let p_t denote the difficulty distribution of a word (term) t ∈ V. p_t is a vector containing η (that is, the number of difficulty levels) values, the i-th entry of which can be calculated by Equation 2:

    p_t(i) = (1/n_t) Σ_{s ∈ S} δ(t ∈ s) δ(l(s) = i)        (2)

where n_t refers to the number of sentences containing t. The indicator function δ(x) returns 1 if x is true and 0 otherwise.
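As a concrete illustration of Steps 1 and 2, the sketch below bins raw heuristic scores into η levels, combines the functions of one aspect via Equation 1, and estimates the per-word distributions of Equation 2. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are ours, the raw scores (e.g., sentence lengths for len(s)) are assumed to be precomputed, and sentences are assumed to be pre-tokenized lists of words.

```python
import numpy as np

def difficulty_levels(scores, eta):
    """Map raw scores r^h(s) to levels in [1, eta] by evenly dividing
    [r^h_min, r^h_max] into eta intervals (Step 1)."""
    scores = np.asarray(scores, dtype=float)
    # eta - 1 interior edges split the range into eta equal intervals;
    # digitize returns 0..eta-1, shifted here to 1..eta.
    edges = np.linspace(scores.min(), scores.max(), eta + 1)[1:-1]
    return np.digitize(scores, edges) + 1

def aspect_levels(per_function_scores, eta):
    """Equation 1: combine one aspect's heuristic functions by taking
    the per-sentence maximum of their difficulty levels."""
    levels = np.stack([difficulty_levels(s, eta) for s in per_function_scores])
    return levels.max(axis=0)

def word_distributions(sentences, levels, eta):
    """Equation 2: for each word t, the fraction of the sentences
    containing t that fall at each difficulty level."""
    p = {}
    for words, lvl in zip(sentences, levels):
        for t in set(words):                   # delta(t in s)
            p.setdefault(t, np.zeros(eta))[lvl - 1] += 1
    for t in p:
        p[t] /= p[t].sum()                     # divide by n_t
    return p
```

For the surface aspect, for instance, one would pass the three precomputed score lists for len(s), ans(s), and anc(s) to aspect_levels, then feed the resulting levels and the tokenized sentences to word_distributions.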
Step 3: Word coupling matrix construction.

Given the set of words V, a word coupling matrix is defined as C ∈ R^{|V|×|V|}, whose elements reflect the correlation between pairs of words (that is, terms). The correlation between each pair of words can be computed according to the similarity of their difficulty distributions. Given two words (terms) t_1 and t_2, whose difficulty distributions are p_{t_1} and p_{t_2}, respectively, we use a symmetric version of the Kullback–Leibler divergence (Kullback & Leibler, 1951) to measure their distribution difference, which averages the values of the divergence computed from both directions:

    cKL(t_1, t_2) = (1/2) (KL(p_{t_1} ‖ p_{t_2}) + KL(p_{t_2} ‖ p_{t_1}))        (3)

where KL(p ‖ q) = Σ_i p(i) log(p(i)/q(i)) and i is the element index. After that, the logistic function is applied to get the normalized distribution similarity, that is:

    sim(t_1, t_2) = 2 / (1 + e^{cKL(t_1, t_2)})        (4)

Given a word t_i, only the λ other words with the highest correlation (similarity) are selected to build the neighbor set of t_i, denoted N(t_i). If a word t_j is not selected (that is, t_j ∉ N(t_i)), the corresponding sim(t_i, t_j) is assigned 0. After that, the word coupling matrix (C*), with sim(t_i, t_j) as its elements, is normalized along the rows so that the sum of each row is 1. Based on the three different l^*(s), we construct three distinct word coupling matrices: C^sur, C^lex, and C^syn.
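The construction is compact enough to sketch directly. The code below is a dense-matrix illustration of Equations 3 and 4 with the λ-neighbor selection and row normalization, again a sketch rather than the authors' code: dists is assumed to be the list of p_t vectors from Step 2 in a fixed vocabulary order, and the eps smoothing of zero probabilities is our assumption, since the paper does not state how zeros in p_t are handled.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Equation 3: average of the two KL divergence directions.
    eps guards against log(0); the paper's smoothing is unspecified."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def coupling_matrix(dists, lam):
    """Build one row-normalized coupling matrix C* from per-word
    difficulty distributions (Equations 3-4, lambda neighbors)."""
    n = len(dists)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                # Equation 4: squash cKL >= 0 into a (0, 1] similarity.
                sim[i, j] = 2.0 / (1.0 + np.exp(symmetric_kl(dists[i], dists[j])))
    for i in range(n):
        # Keep only the lam most similar neighbors of word t_i;
        # all other entries of row i are set to 0.
        keep = np.argsort(sim[i])[-lam:]
        mask = np.zeros(n, dtype=bool)
        mask[keep] = True
        sim[i, ~mask] = 0.0
    row_sums = sim.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0        # guard against empty rows
    return sim / row_sums
```

Calling coupling_matrix once per aspect, with word distributions derived from l^sur(s), l^lex(s), and l^syn(s), yields C^sur, C^lex, and C^syn, respectively.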
Since a large vocabulary makes the construction of the word coupling matrix time-consuming, we provide a strategy to filter out less informative words based on their distributions over reading difficulty. The filtering measure is the entropy of a word's difficulty distribution, calculated by Equation 5. After sorting the words in ascending order of entropy, the last α ∈ [0, 1] proportion is filtered out:

    Ent(t) = H(p_t) = − Σ_{i=1}^{η} p_t(i) log p_t(i)        (5)

Generating the Coupled Bag-Of-Words Model. In the basic BoW model, words are treated as being independent of each other, and the corresponding BoW matrix is sparse and ignores the similarity among words on reading difficulty. For readability assessment, the coupled BoW model can be implemented by multiplying the word coupling matrix by the basic BoW matrix; the resulting coupled BoW matrix is dense and focuses on similarities in reading difficulty.
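The remaining two pieces, the entropy filter of Equation 5 and the final matrix product, can be sketched as follows. The shapes here are assumptions: the basic BoW matrix is taken to be |V| × |D| (words by documents) so that the coupled representation is C*B; if it is stored documents by words instead, the equivalent product is B(C*)^T.

```python
import numpy as np

def entropy_filter(dists, alpha):
    """Equation 5: entropy of each word's difficulty distribution.
    Words are sorted by ascending entropy and the last alpha proportion
    (the highest-entropy, least informative words) is dropped."""
    ent = np.array([-(p * np.log(p + 1e-12)).sum() for p in dists])
    n_keep = int(round((1.0 - alpha) * len(dists)))
    return np.sort(np.argsort(ent)[:n_keep])   # indices of retained words

def coupled_bow(coupling, bow):
    """Multiply the row-normalized coupling matrix C* (|V| x |V|) with
    the basic BoW matrix (assumed |V| x |D|): each output entry mixes a
    word's counts with those of its difficulty-level neighbors."""
    return coupling @ bow
```

Applying entropy_filter before building C* restricts both the vocabulary and the coupling matrix to the retained indices, which is precisely the time saving the filtering step is meant to provide.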