…stop words and around terms that occur both in the new text and in the problem description.

One important decision to be taken is to determine the length of a text pattern. Text patterns should not be too small, so that they contain all terms representing a new idea. Further, text patterns should not be too large, so that only terms related to the new idea occur in them.

For example, if we set the length of the text patterns to l, then a text pattern contains the selected term, l terms from its left context, and l terms from its right context. The cardinality of the set of stop word filtered and stemmed terms from this pattern is normally smaller than 2l + 1, because some terms are stop words, some terms occur twice, and some terms have the same stem.

In this paper, we do not use a constant length l for all patterns but a variable length of text patterns based on a dynamic adaptation of their context. This is realized by using a term weighting scheme based on the difference between stop words and non-stop words, because the importance of a stop word in a text pattern is not as high as the importance of a non-stop word. If an author formulates an idea very briefly by joining catchwords together, then he normally does not use many stop words, and the text pattern length can be small. If an author formulates an idea in a flowery style, that is, his writing is not expressed in a clear and simple way, then he normally uses more stop words, and the text pattern length has to be larger. In the idea mining application, the value of the text pattern length l and the percentages for the importance of stop words (u) and of non-stop words (v) can be provided by the user.

To compute the variable length of a text pattern, we first define the term weighting scheme.

Definition 1. Let (a text) T = [w_1, ..., w_n] be a list of terms (words) w_i in order of appearance, let n ∈ ℕ be the number of terms in T, and let i ∈ {1, ..., n}. Let Σ = {w̃_1, ..., w̃_m} be a set of domain-specific stop terms (Thorleuchter, Van den Poel, & Prinzie, 2010b), and let m ∈ ℕ be the number of terms in Σ. Let the percentage u ∈ ℕ be a term weighting coefficient for stop words, and let the percentage v ∈ ℕ be a term weighting coefficient for non-stop words. Then, we define f_g(w_i) ∈ ℕ as the term weighting scheme:

$$f_g(w_i) = \begin{cases} u, & w_i \in \Sigma \\ v, & w_i \notin \Sigma \end{cases} \qquad \forall i \in \{1, \ldots, n\}. \tag{1}$$

We give an example for this. The text pattern 'components for frequency conversion of infrared lasers' is built around the word 'conversion'. It contains the word conversion itself, three terms from its left context (components for frequency), and three terms from its right context (of infrared lasers). Here, we use a constant length l = 3 and a term weighting scheme with u = v = 100%, which means the importance of a stop word is equal to the importance of a non-stop word. The next text pattern is an example of a variable length: 'In a 1st phase, known but so far not available materials and technologies such as layer systems and crystals'. This text pattern is built around the word 'technologies'. Here, we use a constant length l = 3 and a term weighting scheme with u = 10% and v = 100%. As a result, this text pattern contains six terms from the right context and eleven terms from the left context of the term 'technologies'. In this example, the non-stop words are phase, materials, technologies, layer, systems, and crystals.
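As an illustration, the weighting scheme of Eq. (1) can be sketched in a few lines of Python. The stop-term set and the function name below are illustrative placeholders, not from the paper; the paper uses a domain-specific stop-term list (Thorleuchter et al., 2010b) and user-provided percentages:

```python
# Sketch of the term weighting scheme f_g of Eq. (1).
# STOP_TERMS stands in for the domain-specific stop-term list;
# u and v are the user-provided percentages for stop words and
# non-stop words.
STOP_TERMS = {"for", "of", "in", "a", "and", "such", "as"}

def term_weight(term: str, u: int = 10, v: int = 100) -> int:
    """Return u if the term is a stop word, v otherwise (Eq. (1))."""
    return u if term.lower() in STOP_TERMS else v
```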
We compute the number of terms from the left and right context as described below.

Definition 2. Let l ∈ ℕ be a constant length of text patterns. Let l_i^left be the number of terms from the left context of a text pattern that is built around the term w_i, and let l_i^right be the number of terms from the right context of a text pattern that is built around the term w_i. Then, we define l_i^left ∈ ℕ and l_i^right ∈ ℕ as:

$$l_i^{\mathrm{right}} = \min_j \left( \left( \sum_{k=1}^{j} f_g(w_{i+k}) \geq l \right) \vee (i + j = n) \right) \qquad \forall i \in \{1, \ldots, n\}, \tag{2}$$

$$l_i^{\mathrm{left}} = \min_j \left( \left( \sum_{k=1}^{j} f_g(w_{i-k}) \geq l \right) \vee (i - j = 1) \right) \qquad \forall i \in \{1, \ldots, n\}. \tag{3}$$

After computing l_i^left and l_i^right, we can build a text pattern T_i around the term w_i from the text T = [w_1, ..., w_n]:

$$T_i = \left[ w_{i - l_i^{\mathrm{left}}}, \ldots, w_i, \ldots, w_{i + l_i^{\mathrm{right}}} \right]. \tag{4}$$

For each text pattern from the new text, we create a term vector in the vector space model. The size of the vector is defined by the number of different stemmed and stop word filtered terms in the new text. For text pattern encoding, we use binary term vectors; that means a vector element is set to one if the corresponding unstemmed term is used in the text pattern and to zero if it is not. We also build text patterns from the problem description and create term vectors as described above.
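The sketch below, continuing the `term_weight` function above, shows one possible reading of Eqs. (2)–(4) and of the binary encoding. Since f_g returns percentages, we assume the threshold l is scaled by v so that a non-stop word counts as one full context position; this scaling and all function names are our assumptions, not the authors' implementation:

```python
from typing import List

def pattern_bounds(terms: List[str], i: int, l: int = 3,
                   u: int = 10, v: int = 100):
    """Compute l_i^left and l_i^right (Eqs. (2) and (3)): extend each
    context until the summed term weights reach the threshold or the
    text boundary is hit."""
    threshold = l * v  # assumed scaling: one non-stop word = one position
    right = acc = 0
    for k in range(i + 1, len(terms)):          # right context, Eq. (2)
        acc += term_weight(terms[k], u, v)
        right += 1
        if acc >= threshold:
            break
    left = acc = 0
    for k in range(i - 1, -1, -1):              # left context, Eq. (3)
        acc += term_weight(terms[k], u, v)
        left += 1
        if acc >= threshold:
            break
    return left, right

def build_pattern(terms: List[str], i: int, l: int = 3,
                  u: int = 10, v: int = 100) -> List[str]:
    """Build the text pattern T_i around terms[i] (Eq. (4))."""
    left, right = pattern_bounds(terms, i, l, u, v)
    return terms[i - left : i + right + 1]

def encode_binary(pattern: List[str], vocabulary: List[str]) -> List[int]:
    """Binary term vector: element is 1 if the vocabulary term occurs in
    the pattern, 0 otherwise (stemming and stop word filtering of the
    vocabulary are omitted here for brevity)."""
    present = set(pattern)
    return [1 if t in present else 0 for t in vocabulary]
```

With u = v = 100% and l = 3, this reproduces the constant-length pattern built around 'conversion' from the example above.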
To identify new and useful ideas, we create a specific idea mining measure, which is described in Section 4. By comparing a vector from the new text to one from the problem description, we can compute a result value between 0% and 100% using this measure. The greater the result value, the higher the probability that the vector from the new text represents a new and useful idea concerning a vector from the problem description.

We use this measure for comparing vectors from the new text to their most similar vectors from the problem description, but not to all vectors. This is because result values from comparing a vector to its most similar vectors predominate over result values from comparing it to its further vectors. For example, if a vector from the new text is similar to one from the problem description, then the idea is not new to the user, regardless of whether result values from comparing this vector to further vectors from the problem description are greater than zero. Therefore, we can be sure that a vector represents a new and useful idea only if it gets a high result value from the idea mining measure concerning one of its most similar vectors. Further, computing the idea mining measure is time consuming. Therefore, it is necessary to limit the number of comparisons with the idea mining measure when implementing an idea mining application.

We choose a two-step classification approach. In the first step, we compare each vector from the new text to all vectors from the problem description by using the well-known Euclidean distance measure. Fortunately, computing the Euclidean distance measure is not time consuming, so it is suited for implementation in an idea mining application. In detail, for each vector from the new text, we identify all vectors from the problem description where the Euclidean distance result value is the lowest; that means we identify the most similar vectors. In the second step, we compare each vector from the new text to its most similar vectors using the idea mining measure.

Each vector from the new text that is compared to several similar vectors gets the highest result value from the idea mining measure as its result value. To identify a new and useful idea, we use the alpha-cut method. An alpha-cut of the idea mining measure result value is the set of all vectors from the new text such that the appertaining result value is greater than or equal to alpha (α̃). In the idea mining application, the user can provide the value of α̃.
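A compact sketch of this two-step selection follows, assuming NumPy arrays of binary term vectors. The idea mining measure itself is only defined in Section 4, so it is passed in here as a placeholder callable; the function name is illustrative:

```python
import numpy as np

def select_idea_vectors(new_vecs: np.ndarray, prob_vecs: np.ndarray,
                        idea_measure, alpha: float):
    """Two-step classification followed by an alpha-cut.

    Step 1: Euclidean distance from each new-text vector to all
    problem-description vectors; the vectors at the minimum distance
    are the most similar ones (there may be several).
    Step 2: the idea mining measure (placeholder callable) is applied
    only to those, and the highest value becomes the vector's result.
    The alpha-cut keeps vectors whose result value is >= alpha."""
    selected = []
    for nv in new_vecs:
        dists = np.linalg.norm(prob_vecs - nv, axis=1)           # step 1
        most_similar = prob_vecs[dists == dists.min()]
        result = max(idea_measure(nv, pv) for pv in most_similar)  # step 2
        if result >= alpha:                                      # alpha-cut
            selected.append((nv, result))
    return selected
```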
4. Idea mining measure

With the idea mining measure, we compare a vector that represents a text pattern from the new text to its most similar vectors…