usage frequency by counting the word frequency lists from the text corpus. The other way is estimating the probability distribution of words over the sentence-level difficulties, which is motivated by Jiang et al. (2015). Usage difficulty is defined on both. By discretizing the range of word frequency into b intervals of equal size, the usage frequency level of a word w is i if its frequency resides in the i-th interval. By estimating the probability distribution vector $P_w$ from the sentence-level difficulties, we can define $K^U_w \in \mathbb{R}^{1+|P_w|}$, where $K^U_w = [i, P_w]$.

Structure difficulty. When building readability formulas, researchers have found that the structure of a word can imply its difficulty (Flesch, 1948; Gunning, 1952; McLaughlin, 1969). For example, words with more syllables are usually more difficult than words with fewer syllables. We call the difficulty reflected by the structure of words the structure difficulty. Formally, given a word w, its structure difficulty can be described by a distribution $K^S_w$ over the word structures.

Words in different languages may have their own special structural characteristics. For example, in English, the structural characteristics of words relate to syllables, characters, affixes, and subwords.
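As a minimal sketch of the usage-difficulty vector $K^U_w = [i, P_w]$ defined above (the function name, frequency range, and example distribution are hypothetical, not taken from this paper):

```python
import numpy as np

def usage_knowledge_vector(freq, freq_range, b, p_w):
    """Build K^U_w = [i, P_w]: the usage frequency level i concatenated with
    the word's distribution P_w over sentence-level difficulties."""
    lo, hi = freq_range
    # Discretize the frequency range into b equal-size intervals; the level i
    # is the index of the interval the word's frequency falls into (clamped
    # so that freq == hi lands in the last interval).
    width = (hi - lo) / b
    i = min(int((freq - lo) // width), b - 1)
    return np.concatenate(([i], p_w))  # K^U_w has dimension 1 + |P_w|

# Hypothetical example: a word seen 120 times, corpus frequencies in [0, 1000],
# b = 10 intervals, and a 5-level sentence-difficulty distribution.
k_u = usage_knowledge_vector(120, (0, 1000), 10, [0.5, 0.3, 0.1, 0.1, 0.0])
```

The structure-difficulty vector $K^S_w$ can be built the same way, discretizing the syllable (or stroke) count and the character count instead of the frequency.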
Whereas in Chinese, the structural characteristics of words relate to strokes and radicals of Chinese characters. Here we use the number of syllables (strokes for Chinese) and the number of characters in a word w to describe its structure difficulty. By discretizing the range of each number into intervals, $K^S_w$ is obtained by counting the interval in which w resides, respectively.

3.1.2 Knowledge Graph Construction

After extracting the domain knowledge on word-level difficulty, we then quantitatively represent the knowledge by a graph. We define the knowledge graph as an undirected graph G = (V, E), where V is the set of vertices, each of which represents a word, and E is the set of edges, each of which represents the relation (i.e., similarity) between two words on difficulty. Each edge e ∈ E is a vertex pair (w_i, w_j) and is associated with a weight z_ij, which indicates the strength of the relation. If no edge exists between w_i and w_j, the weight z_ij = 0. We define two edge types in the graph: Sim-edge and Dissim-edge. The former indicates that its end words have similar difficulty and is associated with a positive weight. The latter indicates that its end words have significantly different difficulty and is associated with a negative weight. We derive the edges from the similarities computed between pairs of the words' knowledge vectors. Formally, given the extracted knowledge vector $K_w = [K^A_w, K^U_w, K^S_w]$ of a word w, E can be constructed using the similarity between pairs of words (w_i, w_j) as follows:

$$z_{ij} = \begin{cases} \mathrm{sim}(K_{w_i}, K_{w_j}) & w_j \in N_p(w_i) \\ -\mathrm{sim}(K_{w_i}, K_{w_j}) & w_j \in N_n(w_i) \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where sim(·) is a similarity function (e.g., cosine similarity), $N_p(w_i)$ refers to the set of k most similar (i.e., greatest similarity) neighbors of w_i, and $N_n(w_i)$ refers to the set of k most dissimilar (i.e., least similarity) neighbors of w_i.
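Equation (1) can be sketched as follows, assuming each word's knowledge vector is a real-valued array and taking sim(·) to be cosine similarity; the helper name and the toy vectors below are hypothetical:

```python
import numpy as np

def build_edges(K, k):
    """Return {(i, j): z_ij} following Eq. (1): +cosine similarity for the
    k most similar neighbors N_p (Sim-edge), -cosine similarity for the k
    most dissimilar neighbors N_n (Dissim-edge)."""
    X = np.asarray(K, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T                      # pairwise cosine similarities
    edges = {}
    for i in range(len(S)):
        order = np.argsort(S[i])     # neighbor indices, ascending similarity
        order = order[order != i]    # exclude the word itself
        for j in order[-k:]:         # N_p(w_i): Sim-edge, positive weight
            edges[(i, int(j))] = float(S[i, j])
        for j in order[:k]:          # N_n(w_i): Dissim-edge, negative weight
            edges[(i, int(j))] = float(-S[i, j])
    return edges

edges = build_edges([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], k=1)
```

Word pairs outside $N_p(w_i)$ and $N_n(w_i)$ are simply absent from the dictionary, which corresponds to the z_ij = 0 case.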
3.1.3 Knowledge Graph-based Word Embedding

After constructing the knowledge graph, which models the relationship among words on difficulty, we can derive the difficulty context from the graph and train the word embedding focused on reading difficulty. For the graph-based difficulty context, given a word w, we define its difficulty context as the set of other words that have relevance to w on difficulty. Specifically, we define two types of difficulty context, positive context and negative context, corresponding to the two types of edges in the knowledge graph (i.e., Sim-edge and Dissim-edge).

Unlike the context defined on texts, which can be sampled by sliding windows over consecutive words, the context defined on a graph requires special sampling strategies. Different sampling strategies may define the context differently. For the difficulty context, we design two relatively intuitive strategies, the random walk strategy and the immediate neighbors strategy, for the sampling of either positive or negative context.

From the type Sim-edge, we sample the positive target-context pairs where the target word and the context words are similar on difficulty. Since the similarity is generally transitive, we adopt the random